ETTA: Elucidating the Design Space of Text-to-Audio Models | ✓ Link | 61.79 | 2.03 | 10.10 | 1.13 | 14.29 | 0.60 | 0.43 | ETTA-FT-AC-100k | 2024-12-26 |
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization | ✓ Link | 75.1 | | | 1.15 | 12.2 | 0.488 | | TangoFlux | 2024-12-30 |
Stable Audio Open | ✓ Link | 78.24 | | | 2.14 | | 0.35 | 0.34 | Stable Audio Open | 2024-07-19 |
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization | ✓ Link | 79.7 | | | 1.23 | 10.7 | 0.438 | | TangoFlux-base | 2024-12-30 |
ETTA: Elucidating the Design Space of Text-to-Audio Models | ✓ Link | 80.13 | 2.51 | 13.12 | 1.22 | 14.36 | 0.54 | 0.43 | ETTA | 2024-12-26 |
Fast Timing-Conditioned Latent Audio Diffusion | ✓ Link | 103.66 | | | 2.89 | | 0.41 | | Stable Audio | 2024-02-07 |
Long-form music generation with latent diffusion | ✓ Link | 110.62 | | | 2.70 | | | | Stable Audio 2.0 | 2024-04-16 |
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | ✓ Link | 158.04 | 2.02 | 26.18 | 1.68 | 8.55 | 0.53 | 0.37 | AudioLDM2-large | 2023-08-10 |
AudioGen: Textually Guided Audio Generation | ✓ Link | 185.53 | 3.13 | | 1.42 | | | | AudioGen | 2022-09-30 |
Audiobox: Unified Audio Generation with Natural Language Prompts | | | 0.77 | 8.30 | | 12.70 | 0.71 | | Audiobox Sound | 2023-12-25 |
Taming Data and Transformers for Audio Generation | ✓ Link | | 1.21 | 16.51 | | | | 0.668 | GenAu-Large | 2024-06-27 |
Retrieval-Augmented Text-to-Audio Generation | | | 1.37 | | | | | | Re-AudioLDM-L | 2023-09-14 |
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | ✓ Link | | 1.42 | | | | 0.243 | | AudioLDM 2-AC-Large | 2023-08-10 |
Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model | ✓ Link | | 1.59 | 24.52 | | | | | TANGO | 2023-04-24 |
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation | ✓ Link | | 1.63 | 21.99 | | | | | Auffusion | 2024-01-02 |
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation | ✓ Link | | 1.76 | 23.08 | | | | | Auffusion-Full | 2024-01-02 |
Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | ✓ Link | | 1.80 | 11.75 | | | | | Make-An-Audio 2 | 2023-05-29 |
Any-to-Any Generation via Composable Diffusion | ✓ Link | | 1.80 | 22.90 | | | | | CoDi | 2023-05-19 |
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | ✓ Link | | 1.96 | 23.31 | | | | | AudioLDM-L-Full | 2023-01-29 |
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation | ✓ Link | | 2.18 | 20.44 | | | | | Consistency TTA (Single-step generation) | 2023-09-19 |
Improving Text-To-Audio Models with Synthetic Captions | ✓ Link | | 2.54 | 17.19 | | 11.04 | 0.527 | | Tango-AF&AC-FT-AC | 2024-06-18 |
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | ✓ Link | | 2.66 | 18.32 | | | | | Make-An-Audio | 2023-01-30 |
Diffsound: Discrete Diffusion Model for Text-to-sound Generation | ✓ Link | | 7.75 | 47.68 | | | | | Diffsound | 2022-07-20 |