This white paper focuses on (1) our method for converting visual data of all types into a unified representation that enables large-scale training of generative models and (2) a qualitative evaluation of Sora's capabilities and limitations. Model and implementation details are not included in this white paper.
Much previous work has studied generative modeling of video data using a variety of methods, including recurrent networks,(^1)(^2) generative adversarial networks,(^4)(^6) autoregressive transformers,(^8) and diffusion models.(^10)(^12)