Major Announcement! OpenAI Unveils New Model sCM: Image Generation Speed Increased by 50x, Real-Time Video Generation No Longer a Dream

OpenAI has just announced a significant technological breakthrough: a new model called sCM (simplified continuous-time consistency model). sCM is set to usher in a new era of real-time, high-quality, cross-domain generative AI for video, images, 3D models, and audio.

While diffusion models have been thriving in the generative AI field, their slow sampling speed has been a major drawback. Generating a single image often requires dozens or even hundreds of steps, making the process painfully inefficient. Although distillation techniques such as direct distillation, adversarial distillation, progressive distillation, and variational score distillation (VSD) can accelerate sampling, they each have their limitations, such as high computational costs, complex training, and reduced sample quality. Now, OpenAI has introduced the new sCM model, which requires only two sampling steps, achieving a 50x speed increase while matching or even surpassing the performance of diffusion models.

As an extension and improvement of its previous consistency model research, sCM simplifies the theoretical framework, enabling stable training on large datasets while maintaining sample quality comparable to leading diffusion models. The generation process is completed in just two sampling steps. OpenAI has also released the related research paper, co-authored by two Chinese researchers who both graduated from Tsinghua University.

Paper: https://arxiv.org/pdf/2410.11081

What is sCM?

sCM is not entirely different from diffusion models; in fact, it is an improved version based on diffusion models. More precisely, sCM is a consistency model that borrows principles from diffusion models and enhances them to generate high-quality samples with fewer sampling steps.

The core of sCM is learning a function ( f_\theta(x_t, t) ) that maps a noisy image ( x_t ) at any point on the probability flow ODE (PF-ODE) trajectory directly to an estimate of the clean image at the trajectory's origin. A single evaluation therefore already yields a clean estimate; in two-step sampling, sCM first maps pure noise to a clean estimate, then re-injects a smaller amount of noise and applies the mapping once more to refine the result.

Key Points of sCM vs. Diffusion Models:

  • Improved Based on Diffusion Models: sCM relies on the PF-ODE of diffusion models to define its training objectives and sampling paths; it is not an entirely independent model.
  • Consistency as the Training Objective: sCM learns a function whose outputs agree across points on the same PF-ODE trajectory, so a single evaluation yields a clean estimate, unlike the many iterative denoising steps of diffusion models.
  • Faster Sampling Speed: sCM requires only a few sampling steps (e.g., two), making it much faster than diffusion models.
  • Not Limited to One Step: each additional sampling step re-injects some noise into the current estimate and denoises again, trading a little extra compute for higher sample quality.
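The two-step sampling procedure described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's actual API: the function name, the intermediate time `t_mid`, and the `sigma_d` value are all illustrative assumptions.

```python
import numpy as np

def two_step_sample(f, shape, sigma_d=0.5, t_mid=1.1, rng=None):
    """Two-step consistency sampling (sketch, times in [0, pi/2]).

    `f(x, t)` stands in for a trained consistency model that maps any
    noisy point x_t on the PF-ODE trajectory to a clean estimate x_0.
    """
    rng = rng or np.random.default_rng(0)
    t_max = np.pi / 2
    # Step 1: map pure noise (t = t_max) straight to a clean estimate.
    z = sigma_d * rng.standard_normal(shape)
    x0 = f(z, t_max)
    # Re-noise the estimate to an intermediate time t_mid < t_max ...
    x_mid = np.cos(t_mid) * x0 + np.sin(t_mid) * sigma_d * rng.standard_normal(shape)
    # Step 2: ... and denoise once more for a refined sample.
    return f(x_mid, t_mid)
```

With one step the loop ends after the first call to `f`; each extra step simply repeats the re-noise/denoise pair at a smaller time.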

sCM: Two Steps, Speed Soars!

Building on its previous consistency-model research and incorporating the strengths of EDM and flow-matching models, OpenAI proposed TrigFlow, a unified framework. This framework simplifies the theoretical formulas, stabilizes the training process, and integrates the diffusion process, diffusion model parameterization, the PF-ODE, the diffusion training objective, and the CM parameterization into simpler expressions, laying a solid foundation for subsequent theoretical analysis and improvements.
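Per the paper, the heart of TrigFlow is a trigonometric interpolation between data and noise over times in [0, π/2], plus a model parameterization that satisfies the boundary condition ( f_\theta(x_0, 0) = x_0 ). The sketch below assumes that formulation; `SIGMA_D` and all names are illustrative:

```python
import numpy as np

SIGMA_D = 0.5  # data standard deviation (the paper's sigma_d); value is illustrative

def trigflow_noise(x0, z, t):
    """TrigFlow forward process: x_t = cos(t)*x0 + sin(t)*z, t in [0, pi/2].

    At t = 0 this recovers the data x0; at t = pi/2 it is pure noise z,
    where z is drawn from N(0, sigma_d^2 I).
    """
    return np.cos(t) * x0 + np.sin(t) * z

def consistency_param(F, x_t, t):
    """Consistency-model parameterization built on TrigFlow:

    f_theta(x_t, t) = cos(t)*x_t - sin(t)*sigma_d*F_theta(x_t/sigma_d, t),

    where F is the raw network. At t = 0 the sin term vanishes, so the
    boundary condition f_theta(x_0, 0) = x_0 holds by construction.
    """
    return np.cos(t) * x_t - np.sin(t) * SIGMA_D * F(x_t / SIGMA_D, t)
```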

Based on TrigFlow, OpenAI developed the sCM model and trained a 1.5-billion-parameter version on ImageNet at 512x512 resolution, a first in the field and currently the largest continuous-time consistency model.

The most impressive feature of sCM is that it can generate images of comparable quality to diffusion models in just two sampling steps, achieving a 50x speed increase. For example, the largest 1.5 billion parameter model can generate an image in just 0.11 seconds on a single A100 GPU without any optimization. With further system optimization, the speed could be even faster, opening the door to real-time generation! 🚀

Sampling Time Measured on a Single A100 GPU, Batch Size = 1

How Powerful is sCM?

OpenAI evaluated sCM's performance using FID (Fréchet Inception Distance) scores (lower is better) and effective sampling computational cost (total computational cost required to generate each sample). The results showed that sCM's two-step sampling quality is comparable to the best previous methods, but with less than 10% of the computational cost.
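For reference, FID fits a Gaussian to the Inception features of the real and generated image sets and measures the Fréchet distance between the two Gaussians. A minimal sketch of that standard formula, assuming SciPy is available (the feature statistics would come from an Inception network in practice):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians (mu, Sigma) fitted
    to Inception features:

        ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 @ S2)^{1/2})

    Lower is better; identical distributions give 0.
    """
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can introduce tiny imaginary parts
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))
```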

On ImageNet 512x512, sCM's FID score is even better than some diffusion models that require 63 steps! On CIFAR-10, it achieved an FID of 2.06, on ImageNet 64x64, it reached 1.48, and on ImageNet 512x512, it scored 1.88, with a difference of less than 10% from the best diffusion models.

Core Improvements in sCM:

In addition to the TrigFlow framework, sCM introduced several key improvements to address the instability issues in training continuous-time consistency models:

  • Improved Time Conditioning Strategy (Identity Time Transformation): Using ( C_{noise}(t) = t ) instead of ( C_{noise}(t) = \log(\sigma_d \tan(t)) ) avoids numerical instability as ( t ) approaches ( \pi/2 ).
  • Positional Time Embeddings: Replacing Fourier embeddings with positional embeddings to avoid instability.
  • Adaptive Double Normalization: Solves instability issues caused by AdaGN layers in CM training while retaining their expressive power.
  • Adaptive Weighting: Automatically adjusts the weights of the training objective based on data distribution and network structure, avoiding the need for manual parameter tuning.
  • Tangent Normalization/Clipping: Controls gradient variance to further improve training stability.
  • JVP Rearrangement and JVP Calculation with Flash Attention: Enhances numerical precision and efficiency in training large-scale models.
  • Progressive Annealing: Stabilizes the training process and makes it easier to scale to large-scale models.
  • Diffusion Fine-Tuning and Tangent Warm-Up: Further accelerates convergence and improves stability by fine-tuning from a pre-trained diffusion model and gradually warming up the second term of the tangent used in training.
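To illustrate the tangent normalization/clipping bullet above: both tricks bound the magnitude of the tangent (the JVP term that enters the continuous-time training objective) to control gradient variance. The constants below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def normalize_tangent(g, c=0.1):
    """Tangent normalization (sketch): rescale g <- g / (||g|| + c) so the
    tangent's magnitude stays bounded, stabilizing CM training."""
    return g / (np.linalg.norm(g) + c)

def clip_tangent(g, limit=1.0):
    """Tangent clipping alternative (sketch): elementwise clip of the
    tangent to [-limit, limit]."""
    return np.clip(g, -limit, limit)
```

Either variant keeps a rare, very large tangent from dominating a gradient update, which is the stated goal of controlling gradient variance.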

How sCM Works:

The core idea of the sCM model is consistency, aiming to keep the model's output consistent across adjacent timesteps. By learning the single-step solution of PF-ODE, sCM can directly transform noise into a clear image in one step!

The path in the figure above illustrates this difference: The blue line represents the progressive sampling process of diffusion models, while the red curve represents the more direct and faster sampling process of consistency models. Using techniques such as consistency training or consistency distillation, consistency models can be trained to generate high-quality samples with significantly fewer steps, making them highly attractive for practical applications requiring fast sample generation.

The sCM model learns by distilling knowledge from pre-trained diffusion models. A key finding is that as the model size increases, the improvement of the sCM model is proportional to the improvement of the "teacher" diffusion model. Specifically, the relative difference in sample quality (measured by the ratio of FID scores) remains consistent across several orders of magnitude in model size, leading to a decrease in the absolute difference in sample quality as the scale increases.

Additionally, increasing the sampling steps of sCM can further narrow the quality gap. Notably, two-step samples from sCM can already match the samples from the "teacher" diffusion model (with a relative difference in FID scores of less than 10%), while the "teacher" model requires hundreds of steps to generate samples.

Comparison with VSD:

Compared to variational score distillation (VSD), sCM generates more diverse samples and is less prone to mode collapse at high guidance scales, resulting in better FID scores.

Limitations of sCM:

The best sCM models still require pre-trained diffusion models for initialization and distillation, so they remain slightly behind their "teacher" models in image quality. FID is also an imperfect metric: a close FID score does not always mean the actual image quality is close, and vice versa. The quality of sCM should therefore be evaluated in the context of specific application scenarios.

One More Thing

OpenAI clearly stated:

"We believe these advancements will unlock new possibilities for real-time, high-quality generative AI across a wide range of domains."

ChatGPT will turn two years old on November 30, and although Sora has not yet been launched and its development lead has left, the release of sCM indicates that OpenAI is still working on major developments. Sam Altman has also hinted that something should be released for ChatGPT's second birthday, possibly the real-time high-quality video generation powerhouse Sora?