Paper | Code | FID ↓ | IS ↑ | FID-1 ↓ | FID-2 ↓ | FID-4 ↓ | FID-8 ↓ | SOA-C ↑ | Zero-shot FID ↓ | Model | Date
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Data Extrapolation for Text-to-image Generation on Small Datasets | ✓ Link | 5.00 | | | | | | | | RAT-Diffusion | 2024-10-02 |
Re-Imagen: Retrieval-Augmented Text-to-Image Generator | | 5.25 | | | | | | | | Re-Imagen (Finetuned) | 2022-09-29 |
All are Worth Words: A ViT Backbone for Diffusion Models | ✓ Link | 5.48 | | | | | | | | U-ViT-S/2-Deep | 2022-09-25 |
GLIGEN: Open-Set Grounded Text-to-Image Generation | ✓ Link | 5.61 | | | | | | | | GLIGEN (fine-tuned, Detection + Caption data) | 2023-01-17 |
GLIGEN: Open-Set Grounded Text-to-Image Generation | ✓ Link | 5.82 | | | | | | | | GLIGEN (fine-tuned, Detection data only) | 2023-01-17 |
All are Worth Words: A ViT Backbone for Diffusion Models | ✓ Link | 5.95 | | | | | | | | U-ViT-S/2 | 2022-09-25 |
Improving Diffusion-Based Image Synthesis with Context Prediction | | 6.21 | | | | | | | 6.21 | ConPreDiff | 2024-01-04 |
Truncated Diffusion Probabilistic Models and Diffusion-based Adversarial Auto-Encoders | ✓ Link | 6.29 | | | | | | | | TLDM | 2022-02-19 |
GLIGEN: Open-Set Grounded Text-to-Image Generation | ✓ Link | 6.38 | | | | | | | | GLIGEN (fine-tuned, Grounding data) | 2023-01-17 |
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths | ✓ Link | 6.61 | | | | | | | | RAPHAEL (zero-shot) | 2023-05-29 |
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts | ✓ Link | 6.75 | | | | | | | | ERNIE-ViLG 2.0 (zero-shot) | 2022-10-27 |
Re-Imagen: Retrieval-Augmented Text-to-Image Generator | | 6.88 | | | | | | | | Re-Imagen | 2022-09-29 |
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers | ✓ Link | 6.95 | | | | | | | | eDiff-I (zero-shot) | 2022-11-02 |
Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation | | 7.21 | 31.46 | | | | | | | Swinv2-Imagen | 2022-10-18 |
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | ✓ Link | 7.27 | | | | | | | | Imagen (zero-shot) | 2022-05-23 |
Scaling up GANs for Text-to-Image Synthesis | ✓ Link | 7.28 | | | | | | | | GigaGAN (Zero-shot, 64x64) | 2023-03-09 |
StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis | ✓ Link | 7.3 | | | | | | | | StyleGAN-T (Zero-shot, 64x64) | 2023-01-23 |
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors | ✓ Link | 7.55 | | | | | | | | Make-a-Scene (filtered) | 2022-03-24 |
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion | ✓ Link | 8.03 | | | | | | | | Kandinsky | 2023-10-05 |
LAFITE: Towards Language-Free Training for Text-to-Image Generation | ✓ Link | 8.12 | 32.34 | | | | | 61.09 | | Lafite | 2021-11-27 |
Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation | ✓ Link | 8.15 | | | | | | | 8.15 | SiD-LSG (Data-free distillation, zero-shot FID) | 2024-06-03 |
Simple diffusion: End-to-end diffusion for high resolution images | ✓ Link | 8.3 | | | | | | | | simple diffusion (U-ViT) | 2023-01-26 |
Scaling up GANs for Text-to-Image Synthesis | ✓ Link | 9.09 | | | | | | | | GigaGAN (Zero-shot, 256x256) | 2023-03-09 |
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | ✓ Link | 9.3 | 30.5 | | | | | | | XMC-GAN (256 x 256) | 2021-11-24 |
Cross-Modal Contrastive Learning for Text-to-Image Generation | ✓ Link | 9.33 | | | | | | | | XMC-GAN | 2021-01-12 |
Hierarchical Text-Conditional Image Generation with CLIP Latents | ✓ Link | 10.39 | | | | | | | | DALL-E 2 | 2022-04-13 |
Shifted Diffusion for Text-to-image Generation | ✓ Link | 10.6 | | | | | | | | Corgi-Semi | 2022-11-24 |
Shifted Diffusion for Text-to-image Generation | ✓ Link | 10.88 | | | | | | | | Corgi | 2022-11-24 |
TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation | ✓ Link | 10.9 | | | | | | | | TR0N (StyleGAN-XL, LAION2BCLIP, BLIP-2, zero-shot) | 2023-04-26 |
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors | ✓ Link | 11.84 | | | | | | | | Make-a-Scene (unfiltered) | 2022-03-24 |
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models | ✓ Link | 12.24 | | | | | | | | GLIDE (zero-shot) | 2021-12-20 |
KNN-Diffusion: Image Generation via Large-Scale Retrieval | | 12.5 | | | | | | | | KNN-Diffusion | 2022-04-06 |
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis | ✓ Link | 12.54 | | | | | | | | GALIP (CC12m) | 2023-01-30 |
High-Resolution Image Synthesis with Latent Diffusion Models | ✓ Link | 12.63 | | | | | | | | Latent Diffusion (LDM-KL-8-G) | 2021-12-20 |
Retrieval-Augmented Multimodal Language Modeling | | 12.63 | | | | | | | | Stable Diffusion | 2022-11-22 |
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | ✓ Link | 12.9 | 27.2 | | | | | | | NÜWA (256 x 256) | 2021-11-24 |
Vector Quantized Diffusion Model for Text-to-Image Synthesis | ✓ Link | 13.86 | | | | | | | | VQ-Diffusion-F | 2021-11-29 |
StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis | ✓ Link | 13.9 | | | | | | | | StyleGAN-T (Zero-shot, 256x256) | 2023-01-23 |
Recurrent Affine Transformation for Text-to-image Synthesis | ✓ Link | 14.6 | | | | | | | | RAT-GAN | 2022-04-22 |
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation | ✓ Link | 14.7 | | | | | | | | ERNIE-ViLG | 2021-12-31 |
Retrieval-Augmented Multimodal Language Modeling | | 15.7 | | | | | | | | RA-CM3 (2.7B) | 2022-11-22 |
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers | ✓ Link | 17.7 | | | | | | | | CogView2 (6B, Finetuned) | 2022-04-28 |
Vector Quantized Diffusion Model for Text-to-Image Synthesis | ✓ Link | 19.75 | | | | | | | | VQ-Diffusion-B | 2021-11-29 |
Improving Text-to-Image Synthesis Using Contrastive Learning | ✓ Link | 20.79 | 33.34 | | | | | | | DM-GAN+CL | 2021-07-06 |
FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization | ✓ Link | 21.16 | 34.26 | | | | | | | FuseDream (k=5, 256 x 256) | 2021-12-02 |
FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization | ✓ Link | 21.89 | 34.67 | | | | | | | FuseDream (k=10, 256) | 2021-12-02 |
Improving Text-to-Image Synthesis Using Contrastive Learning | ✓ Link | 23.93 | 25.70 | | | | | | | AttnGAN+CL | 2021-07-06 |
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers | ✓ Link | 24.0 | | | | | | | | CogView2 (6B) | 2022-04-28 |
Semantic Object Accuracy for Generative Text-to-Image Synthesis | ✓ Link | 24.70 | 27.88 | | | | | 35.85 | | OP-GAN | 2019-10-29 |
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | ✓ Link | 26.0 | 32.2 | | | | | | | DM-GAN (256 x 256) | 2021-11-24 |
LAFITE: Towards Language-Free Training for Text-to-Image Generation | ✓ Link | 26.94 | 26.02 | 22.97 | 18.70 | 15.72 | 14.79 | | | Lafite (zero-shot) | 2021-11-27 |
CogView: Mastering Text-to-Image Generation via Transformers | ✓ Link | 27.1 | 18.2 | 19.4 | 13.9 | 19.4 | 23.6 | | | CogView | 2021-05-26 |
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | ✓ Link | 27.1 | 18.2 | | | | | | | CogView (256 x 256) | 2021-11-24 |
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | ✓ Link | 27.5 | 17.9 | | | | | | | DALL-E (256 x 256) | 2021-11-24 |
Retrieval-Augmented Multimodal Language Modeling | | 28 | | | | | | | | DALL-E (12B) | 2022-11-22 |
VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks | ✓ Link | 29.26 | 28.18 | | | | | | | AttnGAN + VICTR | 2020-10-07 |
Retrieval-Augmented Multimodal Language Modeling | | 29.5 | | | | | | | | Vanilla CM3 | 2022-11-22 |
VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks | ✓ Link | 32.37 | 32.37 | | | | | | | DM-GAN + VICTR | 2020-10-07 |
DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis | ✓ Link | 32.64 | 30.49 | | | | | 33.44 | | DM-GAN | 2019-04-02 |
Generating Multiple Objects at Spatially Distinct Locations | ✓ Link | 33.35 | 24.76 | | | | | 25.46 | | AttnGAN + OP | 2019-01-03 |
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | ✓ Link | 35.2 | 23.3 | | | | | | | AttnGAN (256 x 256) | 2021-11-24 |
L-Verse: Bidirectional Generation Between Image and Text | ✓ Link | 37.2 | | 31.6 | 25.7 | 21.4 | 21.1 | | | L-Verse-CC | 2021-11-22 |
L-Verse: Bidirectional Generation Between Image and Text | ✓ Link | 45.8 | | 41.9 | 35.5 | 30.2 | 29.83 | | | L-Verse | 2021-11-22 |
Generating Multiple Objects at Spatially Distinct Locations | ✓ Link | 55.30 | 12.12 | | | | | | | StackGAN + OP | 2019-01-03 |
StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks | ✓ Link | 74.05 | 8.45 | | | | | | | StackGAN-v1 | 2017-10-19 |
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion | ✓ Link | | 18.7 | | | | | | | DF-GAN (256 x 256) | 2021-11-24 |
VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks | ✓ Link | | 10.38 | | | | | | | StackGAN + VICTR | 2020-10-07 |
ChatPainter: Improving Text to Image Generation using Dialogue | | | 9.74 | | | | | | | ChatPainter | 2018-02-22 |
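
The table is ranked by FID (Fréchet Inception Distance), where lower is better; IS (Inception Score) and SOA-C (Semantic Object Accuracy, class average) are higher-is-better. The FID-k columns follow the CogView evaluation protocol, in which images are blurred with a Gaussian filter of radius k before scoring.

For reference, the sketch below shows the standard FID computation, assuming Inception-v3 pool3 activations for the real and generated images are already available as NumPy arrays. The helper names (`frechet_distance`, `fid_from_features`) are illustrative, not taken from any of the papers above; preprocessing and feature-extraction details vary between implementations, which is one reason published scores are not always directly comparable.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    """Frechet distance between two Gaussians (mu, sigma) fitted to feature sets."""
    diff = mu1 - mu2
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if not np.isfinite(covmean).all():
        # Numerical fallback: regularize the diagonals and retry.
        offset = np.eye(sigma1.shape[0]) * eps
        covmean = linalg.sqrtm((sigma1 + offset) @ (sigma2 + offset))
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * np.trace(covmean))

def fid_from_features(feats_real, feats_gen):
    """feats_*: (N, D) arrays of Inception-v3 pool3 activations (D = 2048)."""
    mu_r, sigma_r = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu_g, sigma_g = feats_gen.mean(axis=0), np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

On MS-COCO, published numbers typically compare 30k generated images against validation-set statistics, but resizing and quantization details differ across implementations (e.g. the original TTUR code vs. clean-fid), so cross-paper comparisons should be read with some tolerance.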