SORA Is A World Engine
The OpenAI Paper and a Deep Dive on Vision Models

Welcome, curious enthusiasts, knowledge seekers, and hardcore researchers.
The world of AI is moving fast. Last week Google dropped Gemini 1.5 Pro, and, not to be outdone on the generative AI front, OpenAI upped the ante with SORA, its text-to-video model that generates studio-grade scenes from prompts. The SORA webpage boasts stunning scenes with multiple characters, objects in motion, and complex interactions. It wasn't long ago that AI was a point of contention for the Screen Actors Guild, and at the rate AI is advancing, the movie industry is in for further strife.
But what really has AI proponents excited isn't the shiny new toy; it's that SORA doesn't just generate output. It acts as a world engine that can simulate physical interactions in the real world. For example, in a generated video of a basketball bouncing on the ground, how does the AI know the ball should bounce rather than fall through the floor? These are things we humans take for granted, but they are tremendously difficult for AI.
Today we look at the SORA technical paper and some of the associated research that makes text-to-vision models possible.
The System

Let’s start.

SORA Technical Paper
OpenAI Research Team
Key Topics: Generative Models, Multimodal Models, Digital World Simulation, SORA, Patch-based Training
Link: here | AI Score: 🚀 🚀 🚀 | Interest Score: 🧲 🧲 🧲 | Reading Time: ⏰
Result: The paper explores the capabilities of Sora, a diffusion transformer model, in scaling video generation. Sora is trained to predict the original "clean" patches from noisy input patches and conditioning information such as text prompts. The model demonstrates effective scaling properties for video generation, producing higher-quality video samples as training compute increases. Unlike past approaches, Sora trains on data at its native size, allowing for sampling flexibility across resolutions and aspect ratios. By training on videos at their native aspect ratios, Sora improves framing and composition compared to models trained on square-cropped videos. The paper highlights Sora's ability to generate videos with dynamic camera motion, maintain long-range coherence and object permanence, and simulate interactions in digital worlds. While Sora shows promise as a video simulator, it has limitations in accurately modeling certain physical interactions. The study emphasizes the potential of scaling video models for simulating the physical and digital world effectively.
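The patch-based training objective in that summary is easy to sketch. Below is a minimal, hypothetical PyTorch illustration of the two ideas: slicing a video into spacetime patches, and training a denoiser to recover clean patches from noisy ones given text conditioning. The patch sizes, tensor shapes, and the model interface are my assumptions; OpenAI has not released Sora's architecture or code.

```python
import torch
import torch.nn.functional as F

def video_to_spacetime_patches(video, pt=2, ph=16, pw=16):
    """Split a (B, T, C, H, W) video into flattened spacetime patches."""
    B, T, C, H, W = video.shape
    x = video.reshape(B, T // pt, pt, C, H // ph, ph, W // pw, pw)
    x = x.permute(0, 1, 4, 6, 2, 3, 5, 7)           # group the patch dims together
    return x.reshape(B, -1, pt * C * ph * pw)       # (B, num_patches, patch_dim)

def diffusion_training_step(model, clean_patches, text_embedding, sigmas):
    """One step: corrupt the patches, then ask the model to predict them back.

    sigmas is a (num_levels,) tensor of noise levels; the model signature is
    a hypothetical denoiser interface, not Sora's actual API.
    """
    t = torch.randint(0, len(sigmas), (clean_patches.shape[0],))
    sigma = sigmas[t].view(-1, 1, 1)                # per-example noise level
    noisy = clean_patches + sigma * torch.randn_like(clean_patches)
    pred = model(noisy, t, cond=text_embedding)     # predict the clean patches
    return F.mse_loss(pred, clean_patches)
```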
Recurrent Environment Simulators
Silvia Chiappa, Sébastien Racanière, Daan Wierstra, Shakir Mohamed
Key Topics: Recurrent Neural Networks, Environment Simulation, Agent Planning, Computational Efficiency, Temporal Coherence
Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: ⏰
Result: This paper explores the use of recurrent neural networks in simulating environments to help agents plan and act efficiently. The authors introduce models that can make accurate, temporally and spatially coherent predictions for long time periods into the future. They address the issue of computational inefficiency by developing a model that does not need to generate high-dimensional images at each time-step. The study shows that this approach can improve exploration and adapt to diverse environments, including Atari games, a 3D car racing environment, and complex 3D mazes. The paper emphasizes the importance of environment simulation for agent-based systems and highlights the potential applications of these models in various domains.
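The computational-efficiency idea in that summary, avoiding a full image decode at every time-step, can be sketched in a few lines. The module below is illustrative only, assuming a GRU transition and a simple MLP decoder rather than anything from the authors' actual setup.

```python
import torch
import torch.nn as nn

class LatentSimulator(nn.Module):
    """Roll the state forward in latent space; decode pixels only on demand."""
    def __init__(self, state_dim=256, action_dim=8, frame_dim=64 * 64 * 3):
        super().__init__()
        self.core = nn.GRUCell(action_dim, state_dim)   # action-conditioned transition
        self.decoder = nn.Sequential(                   # costly pixel decoder
            nn.Linear(state_dim, 512), nn.ReLU(), nn.Linear(512, frame_dim))

    def rollout(self, state, actions, decode_every=10):
        """actions: (T, B, action_dim); frames are decoded only sparsely."""
        frames = []
        for i, a in enumerate(actions):
            state = self.core(a, state)                 # cheap latent transition
            if (i + 1) % decode_every == 0:             # skip most image decodes
                frames.append(self.decoder(state))
        return state, frames
```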
World Models
David Ha, Jürgen Schmidhuber
Key Topics: Reinforcement Learning, Predictive World Models, Representation Learning, Evolution Strategies, Task Efficiency
Link: here | AI Score: 🚀 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: ⏰ ⏰
Result: The paper discusses the concept of world models in the context of reinforcement learning. It highlights the importance of training a predictive world model to extract useful representations of space and time. By using these features as inputs to a controller, a compact and minimal controller can be trained to perform tasks such as learning to drive in a car racing environment. The paper also mentions the use of evolution strategies to train the controller, which simplifies the training process. Overall, the paper emphasizes the effectiveness of world models in training agents to solve complex tasks efficiently.
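The "compact and minimal controller" is concrete enough to write down. In the paper's car-racing setup it is a single linear map from the VAE latent z (32 dims) and the RNN hidden state h (256 dims) to 3 actions, trained with an evolution strategy. The sketch below pairs that controller with a bare-bones ES loop; the paper itself uses CMA-ES, so treat this optimizer as a stand-in.

```python
import numpy as np

Z_DIM, H_DIM, A_DIM = 32, 256, 3            # dims from the car-racing experiment

def controller(params, z, h):
    """action = tanh(W @ [z; h] + b), with params flattened for the optimizer."""
    n_w = A_DIM * (Z_DIM + H_DIM)
    W = params[:n_w].reshape(A_DIM, Z_DIM + H_DIM)
    b = params[n_w:]
    return np.tanh(W @ np.concatenate([z, h]) + b)

def evolve(fitness_fn, pop_size=64, sigma=0.1, lr=0.1, steps=100):
    """Gradient-free search over controller params (the paper uses CMA-ES)."""
    n = A_DIM * (Z_DIM + H_DIM) + A_DIM     # only 867 parameters in total
    mean = np.zeros(n)
    for _ in range(steps):
        noise = np.random.randn(pop_size, n)
        scores = np.array([fitness_fn(mean + sigma * e) for e in noise])
        ranks = (scores - scores.mean()) / (scores.std() + 1e-8)
        mean += lr * (ranks @ noise) / (pop_size * sigma)   # ES update
    return mean
```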
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas
Key Topics: Video Generation, VQ-VAE, Transformers, ViZDoom Dataset, Autoregressive Models
Link: here | AI Score: 🚀 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: ⏰
Result: The paper introduces VideoGPT, a model that combines VQ-VAE and Transformers to generate high-quality videos. The training data is collected by training policies in ViZDoom environments, resulting in a dataset split into train, validation, and test sets. VideoGPT captures complex 3D camera movements and environment interactions, producing visually consistent action-conditioned samples. The model outperforms several baseline models in generating diverse backgrounds and scenarios. The architecture choice of likelihood-based autoregressive models is explained, emphasizing their success in modeling video data. The paper highlights the importance of deep learning advancements in enabling such video generation capabilities.
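VideoGPT's two-stage recipe is worth seeing in code: a VQ-VAE first compresses the video into a grid of discrete codes, then a GPT-style transformer models those codes autoregressively. The quantizer below is the standard VQ-VAE nearest-neighbour lookup with a straight-through gradient; the encoder, decoder, auxiliary losses, and transformer are omitted, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                            # z: (B, N, code_dim)
        w = self.codebook.weight                     # (num_codes, code_dim)
        d = (z.pow(2).sum(-1, keepdim=True)          # squared distance to each code
             - 2 * z @ w.t()
             + w.pow(2).sum(-1))
        idx = d.argmin(dim=-1)                       # nearest codebook entry
        zq = self.codebook(idx)
        zq = z + (zq - z).detach()                   # straight-through estimator
        return zq, idx

# Stage 2, conceptually: flatten `idx` into a token sequence and train a causal
# transformer with next-token prediction, optionally conditioned on actions.
```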
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Key Topics: Transformers, Image Recognition, Computer Vision, Self-Attention Mechanisms, Self-Supervised Pre-Training
Link: here | AI Score: 🚀 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: ⏰ ⏰
Result: The paper explores the use of Transformers for image recognition at scale. It compares the performance of large-scale pre-trained Transformers with state-of-the-art CNNs on medium-resolution images. The study shows that Transformers can be competitive or even better than CNNs for image classification tasks. The paper also discusses the combination of CNNs with self-attention mechanisms for various computer vision tasks. Additionally, the authors highlight the importance of self-supervised pre-training methods and suggest that further scaling of Transformers could lead to improved performance.
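The paper's titular idea, treating an image as a sequence of 16x16 "words", comes down to a patch-embedding layer. A common implementation trick, shown below with ViT-Base dimensions, is a Conv2d whose kernel and stride both equal the patch size; the initialization details are simplified here.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn an image into a token sequence for a standard Transformer encoder."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2               # 14 * 14 = 196
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # [class] token
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                            # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, 768) patch tokens
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos # ready for the encoder
```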
ViViT: A Video Vision Transformer
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid
Key Topics: Video Classification, Transformer Architectures, Regularization Techniques, Ablation Studies, Video Datasets
Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: ⏰ ⏰
Result: The paper introduces pure-transformer architectures for video classification, presenting efficient variants to handle long sequences of tokens in videos. By leveraging regularization techniques and pretrained image models, the models achieve state-of-the-art results on various video classification benchmarks. The study also conducts thorough ablation studies to analyze design choices and performance across datasets like Kinetics 400 and 600, Epic Kitchens, Something-Something v2, and Moments in Time. Future work includes removing dependence on image-pretrained models and expanding to more complex tasks.
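One of ViViT's building blocks is tubelet embedding, the video analogue of ViT's patch embedding: a 3D convolution cuts the clip into spatio-temporal tubes so each token spans several frames. A minimal sketch, using one of the tubelet sizes the paper ablates; the downstream transformer is omitted.

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Embed a video clip as one token per 2x16x16 spatio-temporal tube."""
    def __init__(self, dim=768, t=2, h=16, w=16):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=(t, h, w), stride=(t, h, w))

    def forward(self, video):                     # video: (B, 3, T, H, W)
        x = self.proj(video)                      # (B, dim, T/t, H/h, W/w)
        return x.flatten(2).transpose(1, 2)       # (B, num_tubelets, dim)
```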
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby
Key Topics: Vision Transformer, NaViT Model, Learning Rate Scheduling, Compute-Matched Comparisons, Responsible AI
Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: ⏰
Result: The paper explores the training details of NaViT, a Vision Transformer model, using ViT-B/32, ViT-B/16, and ViT-L/16 configurations. The training process involves a reciprocal square-root learning rate schedule with warmup and cooldown phases. Higher weight decay is applied to the head compared to the body during upstream training. NaViT and ViT models are evaluated with different compute budgets, allowing for "compute-matched" comparisons. The study emphasizes responsible AI use and highlights the importance of considering potential risks and biases in deploying AI tools.
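The reciprocal square-root schedule mentioned above is a standard trick, and a plausible form is sketched below. The warmup, cooldown, and step counts here are placeholder values, not the constants from NaViT's appendix.

```python
import math

def rsqrt_schedule(step, base_lr=1e-3, warmup=10_000,
                   cooldown=50_000, total=200_000):
    """Linear warmup, 1/sqrt(step) decay, then linear cooldown to zero."""
    if step < warmup:
        return base_lr * step / warmup            # warmup from zero
    lr = base_lr * math.sqrt(warmup / step)       # reciprocal square-root decay
    if step > total - cooldown:
        lr *= (total - step) / cooldown           # cooldown to zero
    return max(lr, 0.0)
```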
-That’s all for now-
AI Trivia
AlphaFold, developed by DeepMind, made headlines in 2020 by solving a 50-year-old grand challenge in biology: accurately predicting the 3D structure of proteins from their amino acid sequences. This breakthrough marked a significant advance in understanding protein folding, which is essential for drug discovery, disease research, and bioengineering.
