Photorealistic Video Generation with Diffusion Models | | 36±2 | | | | W.A.L.T-XL (class-conditional) | 2023-12-11 |
Video-GPT via Next Clip Diffusion | ✓ Link | 53 | | | | Video-GPT | 2025-05-18 |
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior | ✓ Link | 57 | | | | LARP | 2024-10-28 |
Long-Context Autoregressive Video Modeling with Next-Frame Prediction | ✓ Link | 57 | | | | FAR | 2025-03-25 |
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ✓ Link | 58±3 | | | | MAGVIT-v2 | 2023-10-09 |
Hierarchical Patch Diffusion Models for High-Resolution Video Generation | | 66.32 | | 87.68 | | HPDM-L | 2024-06-12 |
MAGVIT: Masked Generative Video Transformer | ✓ Link | 76±2 | | 89.27±0.15 | | MAGVIT (-L-CG, 128x128, class-conditional) | 2022-12-10 |
Make-A-Video: Text-to-Video Generation without Text-Video Data | ✓ Link | 81.25 | | 82.55 | | Make-A-Video (Finetuning, 256x256, class-conditional) | 2022-09-29 |
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer | ✓ Link | 90 | | | | ACDiT | 2024-12-10 |
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ✓ Link | 109 | | | | MAGVIT-v2 (AR) | 2023-10-09 |
REGIS: Refining Generated Videos via Iterative Stylistic Redesigning | ✓ Link | 141 | | | | REGIS-Fuse (Finetuning, 128x128, text-conditional) | 2023-11-03 |
MAGVIT: Masked Generative Video Transformer | ✓ Link | 159±2 | | 83.55±0.14 | | MAGVIT (-B-CG, 128x128, class-conditional) | 2022-12-10 |
LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models | ✓ Link | 164.45 | | | | Latte + LeanVAE | 2025-03-18 |
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | ✓ Link | 173 | | 80.03 | | VideoFusion (128x128, class-conditional) | 2023-03-15 |
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | ✓ Link | 191 | | | | OmniTokenizer-AR | 2024-06-13 |
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | ✓ Link | 220 | | 72.22 | | VideoFusion (128x128, unconditional) | 2023-03-15 |
Make Pixels Dance: High-Dynamic Video Generation | | 242.82 | | 42.10 | | PixelDance (256x256, text-conditional) | 2023-11-18 |
Photorealistic Video Generation with Diffusion Models | | 258.1 | | 35.1 | | W.A.L.T 3B (text-conditional) | 2023-12-11 |
MAGVIT: Masked Generative Video Transformer | ✓ Link | 265 | | | | MAGVIT (AR) | 2022-12-10 |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 280.57 | | 44.26 | | Video-LaVIT | 2024-02-05 |
VIDM: Video Implicit Diffusion Models | ✓ Link | 294.7 | 1531.9 | | | VIDM (256x256, unconditional) | 2022-12-01 |
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | ✓ Link | 305 | | 51.11 | | CogVideo (128x128, class-conditional) | 2022-05-29 |
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | | 310 | | 60.01 | | PYoCo (Zero-shot, 64x64, unconditional) | 2023-05-17 |
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | ✓ Link | 328 | | 73.7 | | MMVG (128x128, class-conditional) | 2022-11-23 |
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ✓ Link | 332 | | 79.28 | | TATS (128x128, class-conditional) | 2022-04-07 |
Lumiere: A Space-Time Diffusion Model for Video Generation | ✓ Link | 332.49 | | 37.54 | | Lumiere (Zero-shot, 1024x1024, text-conditional) | 2024-01-23 |
Grid Diffusion Models for Text-to-Video Generation | | 340.0 | | 62.88 | | GridDiff (Zero-shot) | 2024-03-30 |
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing | ✓ Link | 346.84 | | 48.01 | | VideoAssembler (Zero-shot, 256x256, class-conditional) | 2023-11-29 |
VideoPoet: A Large Language Model for Zero-Shot Video Generation | | 355 | | 38.44 | | VideoPoet (text-conditional) | 2023-12-21 |
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | | 355.19 | | 47.76 | | PYoCo (Zero-shot, 64x64, text-conditional) | 2023-05-17 |
Make-A-Video: Text-to-Video Generation without Text-Video Data | ✓ Link | 367.23 | | 33 | | Make-A-Video (Zero-shot, 256x256, class-conditional) | 2022-09-29 |
Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 372 | | | 27 | LVDM (256x256, unconditional) | 2022-11-23 |
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | ✓ Link | 395 | | 58.3 | | MMVG (128x128, unconditional) | 2022-11-23 |
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ✓ Link | 420 | | 57.63 | | TATS (128x128, unconditional) | 2022-04-07 |
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers | ✓ Link | 438 | 968 | 65.93 | | MeBT (128x128, unconditional) | 2023-03-20 |
Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | ✓ Link | 465 | | 59.68 | 39.6 | DIGAN (128x128, class-conditional) | 2022-02-21 |
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models | ✓ Link | 526.30 | | | | LAVIE (320x512, text-conditional) | 2023-09-26 |
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | ✓ Link | 550.61 | | 33.45 | | Video LDM (320x512, text-conditional) | 2023-04-18 |
Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 552 | | | 42 | LVDM (256x256, unconditional) | 2022-11-23 |
Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | ✓ Link | 577 | | 32.70 | | DIGAN (128x128, unconditional) | 2022-02-21 |
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ✓ Link | 635 | | | 55 | TATS (256x256) | 2022-04-07 |
MagicVideo: Efficient Video Generation With Latent Diffusion Models | | 699 | | | | MagicVideo (256x256, text-conditional) | 2022-11-20 |
A Good Image Generator Is What You Need for High-Resolution Video Synthesis | ✓ Link | 700 | | 33.95 | | MoCoGAN-HD (256x256, unconditional) | 2021-04-30 |
MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | ✓ Link | 1143 | | | | MCVD (64x64) | 2022-05-19 |
Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 1209 | | | | TGAN-v2 (128x128) | 2022-11-23 |
Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 1396 | | | 116 | VDM | 2022-11-23 |
Latent Video Diffusion Models for High-Fidelity Long Video Generation | ✓ Link | 2460 | | | 148 | MCVD | 2022-11-23 |
FIFO-Diffusion: Generating Infinite Videos from Text without Training | ✓ Link | | 596.64 | 74.44 | | FIFO-Diffusion | 2024-05-19 |