Flow-GRPO: Training Flow Matching Models via Online RL | ✓ Link | 0.95 | | | | | | | SD3.5-Medium+Flow-GRPO | 2025-05-08 |
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | ✓ Link | 0.84 | 0.98 | 0.93 | 0.71 | 0.90 | 0.81 | 0.74 | UniWorld-V1 (Rewrite) | 2025-06-03 |
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | ✓ Link | 0.83 | 0.99 | 0.94 | 0.71 | 0.90 | 0.71 | 0.71 | MindOmni | 2025-05-19 |
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | ✓ Link | 0.80 | 0.99 | 0.93 | 0.70 | 0.89 | 0.79 | 0.49 | UniWorld-V1 | 2025-06-03 |
SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer | ✓ Link | 0.80 | | | | | | | SANA-1.5 4.8B (+ Inference Scaling) | 2025-01-30 |
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | ✓ Link | 0.80 | | | | | | | Janus-Pro-7B | 2025-01-29 |
Transfer between Modalities with MetaQueries | | 0.80 | | | | | | | MetaQuery-XL (Rewrite) | 2025-04-08 |
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | ✓ Link | 0.77 | | | | | | | Show-o PARM It. DPO PARM | 2025-01-23 |
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | ✓ Link | 0.75 | | | | | | | Show-o Ft. ORM It. DPO Ft. ORM | 2025-01-23 |
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | ✓ Link | 0.73 | | | | | | | Janus-Pro-1B | 2025-01-29 |
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework | ✓ Link | 0.73 | | | | | | | Lumina-Image 2.0 | 2025-03-27 |
SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer | ✓ Link | 0.72 | 0.99 | 0.85 | | | | | SANA-1.5 4.8B | 2025-01-30 |
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens | ✓ Link | 0.69 | 0.96 | 0.83 | 0.51 | 0.80 | 0.63 | 0.39 | Fluid (10.5B) | 2024-10-17 |
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ✓ Link | 0.68 | | | | | | | Show-o | 2024-08-22 |
Emu3: Next-Token Prediction is All You Need | ✓ Link | 0.66 | | | | | | | Emu3 | 2024-09-27 |
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training | | 0.66 | | | | | | | SnapGen | 2024-12-12 |
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | ✓ Link | 0.63 | | | | | | | JanusFlow | 2024-11-12 |
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | ✓ Link | 0.53 | | | | | | | PixArt-Σ | 2024-03-07 |
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers | | 0.51 | | | | | | | DiffMoE-E16-T2I-Flow (w SFT) | 2025-03-18 |
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models | ✓ Link | 0 | | | | | | | PIXART-δ | 2024-01-10 |