There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to find a number of new alignment issues that we haven’t yet seen in current systems. Some of these issues we anticipate now, and some of them will be entirely new.
We believe that finding a solution that is scalable indefinitely is probably very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make alignment research progress faster and better than humans.
As we make progress on this, our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work in concert with humans to ensure that their own successors are more aligned with humans.
We believe that evaluating alignment research is substantially easier than producing it, especially when evaluation assistance is provided. Therefore, human researchers will increasingly focus their effort on reviewing alignment research performed by AI systems rather than generating this research themselves. Our goal is to train models to be so aligned that we can offload almost all of the cognitive work required for alignment research.
Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans at alignment research. We expect these systems to be easier to align than general-purpose systems or systems much smarter than humans.
Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research, they don’t need unrestricted access to the internet. Yet many alignment research tasks can be phrased as natural language or coding tasks.
Future versions of WebGPT, InstructGPT, and Codex can provide a foundation as alignment research assistants, but they aren’t sufficiently capable yet. While we don’t know when our models will be capable enough to contribute meaningfully to alignment research, we think it’s important to start ahead of time. Once we train a model that could be useful, we plan to make it accessible to the external alignment research community.