There are still important differences between our current empirical setup and the fundamental problem of aligning superhuman models. For example, it may be easier for future models to imitate weak human errors than for current strong models to imitate the errors of current weak models, which could make generalization more difficult in the future.
However, we believe our setup captures some key difficulties of aligning future superhuman models, allowing us to begin making empirical progress on this problem today. There are many promising directions for future work, including addressing disanalogies in our setup, developing better scalable methods, and advancing our scientific understanding of when and how we should expect good weak-to-strong generalization.
We believe this is an exciting opportunity for the ML research community to advance alignment. To encourage further research in this area:
- We are releasing open-source code to make it easier to start weak-to-strong generalization experiments today (a toy sketch of such an experiment follows this list).
- We are launching a $10 million grant program for graduate students, academics, and other researchers to work on superhuman AI alignment broadly. We are especially excited to support research related to weak-to-strong generalization.
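As a rough illustration of the shape of a weak-to-strong experiment (not the released code), the sketch below uses a small scikit-learn classifier as a stand-in "weak supervisor" and a higher-capacity model as the "strong student" on a synthetic task; the models, dataset, and split sizes are all illustrative assumptions rather than the language-model setup studied in the paper.

```python
# Toy weak-to-strong generalization experiment (illustrative sketch only).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary task standing in for a real benchmark (assumption).
X, y = make_classification(n_samples=6000, n_features=40, n_informative=10, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1) Train the weak supervisor on ground-truth labels.
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)

# 2) The weak supervisor labels new data for the strong student, errors included.
weak_labels = weak.predict(X_train)

# 3) Train the strong student on weak labels only, plus a ceiling model on ground truth.
strong_from_weak = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# 4) Compare against held-out ground truth: does the strong student outperform its supervisor?
weak_acc = weak.score(X_test, y_test)
w2s_acc = strong_from_weak.score(X_test, y_test)
ceiling_acc = strong_ceiling.score(X_test, y_test)

# Performance gap recovered: fraction of the weak-to-ceiling gap closed by weak supervision.
pgr = (w2s_acc - weak_acc) / max(ceiling_acc - weak_acc, 1e-12)
print(f"weak {weak_acc:.3f} | weak-to-strong {w2s_acc:.3f} | ceiling {ceiling_acc:.3f} | PGR {pgr:.2f}")
```

The interesting question is how much of the gap between the weak supervisor and the strong ceiling the weakly supervised student recovers; in the real setting the weak and strong models are smaller and larger pretrained language models rather than classical classifiers.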
Figuring out how to align future superhuman AI systems to be safe has never been more important, and it is now easier than ever to make empirical progress on this problem. We're excited to see what advances researchers discover.