AI Safety Fundamentals: Alignment

Jul 19 2024 83 ep. 25 mins 31

Listen to resources from the AI Safety Fundamentals: Alignment course!

https://aisafetyfundamentals.com/alignment

Technology Society & Culture

Constitutional AI Harmlessness from AI Feedback

Jul 19 2024 61 mins

This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators and human evaluators. Everything in this paper is relevant to this week's learning objectives, and we recommend you read it in its entirety. It summarises limitations with conventional RLHF, explains the constitutional AI approach, shows how

Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 19 2024 32 mins

Illustrating Reinforcement Learning from Human Feedback (RLHF)

Jul 19 2024 22 mins

This more technical article explains the motivations for a system like RLHF, and adds concrete details on how the RLHF approach is applied to neural networks. While reading, consider which parts of the technical implementation correspond to the 'values coach' and 'coherence coach' from the previous video. A podcast by BlueDot Impact. Learn more on the AI Safety Fundamen

Eliciting Latent Knowledge

Jun 17 2024 60 mins

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us. But some action sequences

Deep Double Descent

Jun 17 2024 8 mins

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view fu

Chinchilla’s Wild Implications

Jun 17 2024 24 mins

This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion. In particular: Data, not size, is the currently active constraint on language modeling performance. Current returns

Intro to Brain-Like-AGI Safety

Jun 17 2024 62 mins

(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5) Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely? I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line

Gradient Hacking: Definitions and Examples

Jun 17 2024 9 mins

Gradient hacking is a hypothesized phenomenon where: A model has knowledge about possible training trajectories which isn’t being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren’t taken into account by local optimization algorithms). The model uses that knowledge to influence its medium-term traini

An Investigation of Model-Free Planning

Jun 17 2024 8 mins

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods

Discovering Latent Knowledge in Language Models Without Supervision

Jun 17 2024 37 mins

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the i