| Photorealistic Video Generation with Diffusion Models | | 36±2 | | | | W.A.L.T-XL (class-conditional) | 2023-12-11 |
| Video-GPT via Next Clip Diffusion | ✓ Link | 53 | | | | Video-GPT | 2025-05-18 |
| LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior | ✓ Link | 57 | | | | LARP | 2024-10-28 |
| Long-Context Autoregressive Video Modeling with Next-Frame Prediction | ✓ Link | 57 | | | | FAR | 2025-03-25 |
| Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ✓ Link | 58±3 | | | | MAGVIT-v2 | 2023-10-09 |
| Hierarchical Patch Diffusion Models for High-Resolution Video Generation | | 66.32 | | 87.68 | | HPDM-L | 2024-06-12 |
| MAGVIT: Masked Generative Video Transformer | ✓ Link | 76±2 | | 89.27±0.15 | | MAGVIT (-L-CG, 128x128, class-conditional) | 2022-12-10 |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | ✓ Link | 81.25 | | 82.55 | | Make-A-Video (Finetuning, 256x256, class-conditional) | 2022-09-29 |
| ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer | ✓ Link | 90 | | | | ACDiT | 2024-12-10 |
| Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ✓ Link | 109 | | | | MAGVIT-v2 (AR) | 2023-10-09 |
| REGIS: Refining Generated Videos via Iterative Stylistic Redesigning | ✓ Link | 141 | | | | REGIS-Fuse (Finetuning, 128x128, text-conditional) | 2023-11-03 |
| MAGVIT: Masked Generative Video Transformer | ✓ Link | 159±2 | | 83.55±0.14 | | MAGVIT (-B-CG, 128x128, class-conditional) | 2022-12-10 |
| LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models | ✓ Link | 164.45 | | | | Latte + LeanVAE | 2025-03-18 |
| VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | ✓ Link | 173 | | 80.03 | | VideoFusion (128x128, class-conditional) | 2023-03-15 |
| OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | ✓ Link | 191 | | | | OmniTokenizer-AR | 2024-06-13 |
| VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | ✓ Link | 220 | | 72.22 | | VideoFusion (128x128, unconditional) | 2023-03-15 |
| Make Pixels Dance: High-Dynamic Video Generation | | 242.82 | | 42.10 | | PixelDance (256x256, text-conditional) | 2023-11-18 |
| Photorealistic Video Generation with Diffusion Models | | 258.1 | | 35.1 | | W.A.L.T 3B (text-conditional) | 2023-12-11 |
| MAGVIT: Masked Generative Video Transformer | ✓ Link | 265 | | | | MAGVIT (AR) | 2022-12-10 |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 280.57 | | 44.26 | | Video-LaVIT | 2024-02-05 |
| VIDM: Video Implicit Diffusion Models | ✓ Link | 294.7 | 1531.9 | | | VIDM (256x256, unconditional) | 2022-12-01 |
| CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | ✓ Link | 305 | | 51.11 | | CogVideo (128x128, class-conditional) | 2022-05-29 |
| Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | | 310 | | 60.01 | | PYoCo (Zero-shot, 64x64, unconditional) | 2023-05-17 |
| Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | ✓ Link | 328 | | 73.7 | | MMVG (128x128, class-conditional) | 2022-11-23 |
| Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ✓ Link | 332 | | 79.28 | | TATS (128x128, class-conditional) | 2022-04-07 |
| Lumiere: A Space-Time Diffusion Model for Video Generation | ✓ Link | 332.49 | | 37.54 | | Lumiere (Zero-shot, 1024x1024, text-conditional) | 2024-01-23 |
| Grid Diffusion Models for Text-to-Video Generation | | 340.0 | | 62.88 | | GridDiff (Zero-shot) | 2024-03-30 |
| MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing | ✓ Link | 346.84 | | 48.01 | | VideoAssembler (Zero-shot, 256x256, class-conditional) | 2023-11-29 |
| VideoPoet: A Large Language Model for Zero-Shot Video Generation | | 355 | | 38.44 | | VideoPoet (text-conditional) | 2023-12-21 |
| Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | | 355.19 | | 47.76 | | PYoCo (Zero-shot, 64x64, text-conditional) | 2023-05-17 |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | ✓ Link | 367.23 | | 33 | | Make-A-Video (Zero-shot, 256x256, class-conditional) | 2022-09-29 |
| Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 372 | | | 27 | LVDM (256x256, unconditional) | 2022-11-23 |
| Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | ✓ Link | 395 | | 58.3 | | MMVG (128x128, unconditional) | 2022-11-23 |
| Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ✓ Link | 420 | | 57.63 | | TATS (128x128, unconditional) | 2022-04-07 |
| Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers | ✓ Link | 438 | 968 | 65.93 | | MeBT (128x128, unconditional) | 2023-03-20 |
| Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | ✓ Link | 465 | | 59.68 | 39.6 | DIGAN (128x128, class-conditional) | 2022-02-21 |
| LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models | ✓ Link | 526.30 | | | | LAVIE (320x512, text-conditional) | 2023-09-26 |
| Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | ✓ Link | 550.61 | | 33.45 | | Video LDM (320x512, text-conditional) | 2023-04-18 |
| Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 552 | | | 42 | LVDM (256x256, unconditional) | 2022-11-23 |
| Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | ✓ Link | 577 | | 32.70 | | DIGAN (128x128, unconditional) | 2022-02-21 |
| Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ✓ Link | 635 | | | 55 | TATS (256x256) | 2022-04-07 |
| MagicVideo: Efficient Video Generation With Latent Diffusion Models | | 699 | | | | MagicVideo (256x256, text-conditional) | 2022-11-20 |
| A Good Image Generator Is What You Need for High-Resolution Video Synthesis | ✓ Link | 700 | | 33.95 | | MoCoGAN-HD (256x256, unconditional) | 2021-04-30 |
| MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | ✓ Link | 1143 | | | | MCVD (64x64) | 2022-05-19 |
| Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 1209 | | | | TGAN-v2 (128x128) | 2022-11-23 |
| Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 1396 | | | 116 | VDM | 2022-11-23 |
| Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 2460 | | | 148 | MCVD | 2022-11-23 |
| FIFO-Diffusion: Generating Infinite Videos from Text without Training | ✓ Link | | 596.64 | 74.44 | | FIFO-Diffusion | 2024-05-19 |