Unless your living under a rock, in the world of AI OpenAI has released their first text-to-video model and it is impressive.
Sora is an AI model developed by OpenAI that can create realistic and imaginative scenes from text instructions. It is a text-to-video model capable of generating videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. Sora is designed to understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring real-world interaction. The model can generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.
Examples
Sora can generate multiple videos side-by-side at the same time.
Sora can combine videos
Sora can follow up and edit videos.
The Architecture
According to Sora’s technical report, Sora’s architecture involves turning visual data into patches, compressing videos into a lower-dimensional latent space, training a network that reduces the dimensionality of visual data, and extracting a sequence of spacetime patches which act as transformer tokens. Sora is a diffusion model that scales effectively as a video model and can generate videos with variable durations, resolutions, and aspect ratios. It can also be prompted with other inputs, such as pre-existing images or video, enabling a wide range of image and video editing tasks. Additionally, Sora exhibits emerging simulation capabilities, such as 3D consistency, long-range coherence, object permanence, interacting with the world, and simulating digital worlds. However, it also has limitations in accurately modeling the physics of basic interactions and other failure modes.
Sora is a comprehensive diffusion transformer model that processes text or images and generates video pixel output. By analyzing vast volumes of video data using gradient descent, Sora acquires an internal understanding of physical dynamics, essentially forming a trainable simulation or “world model.” While Sora doesn’t directly integrate Unreal Engine 5 (UE5) into its processing loop, it can incorporate text and video pairs created with UE5 into its training data as synthetic examples.
Limitations
Sora’s emergent physics understanding is still fragile and imperfect.
Despite extensive research and testing, OpenAI acknowledges that it cannot predict all the beneficial ways people will use the technology, nor all the ways people will abuse it. The model is based on a diffusion architecture and uses a transformer architecture similar to GPT models. It builds on past research in DALL·E and GPT models, using the recaptioning technique from DALL·E 3 to follow the user’s text instructions in the generated video more faithfully. Sora serves as a foundation for models that can understand and simulate the real world, a capability believed to be an important milestone for achieving AGI.