Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | | 0.519 | 0.845 | 0.194 | 0.301 | 0.266 | | | | | | | MQ-Cap | 2024-10-14 |
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | ✓ Link | 0.518 | 0.841 | 0.194 | | 0.268 | | 0.668 | 0.515 | | | | SLAM-AAC | 2024-10-12 |
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | ✓ Link | 0.517 | 0.849 | 0.185 | 0.297 | 0.262 | 0.510 | | | | | | LAVCap | 2025-01-16 |
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | ✓ Link | 0.510 | 0.823 | 0.197 | | 0.269 | | 0.665 | | | | | EnCLAP++-large | 2024-09-02 |
Taming Data and Transformers for Audio Generation | ✓ Link | 0.507 | 0.832 | 0.182 | | 0.253 | 0.518 | | | | 0.518 | | AutoCap | 2024-06-27 |
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | ✓ Link | 0.505 | 0.816 | 0.193 | | 0.267 | | 0.664 | | | | 0.664 | LOAE | 2024-06-19 |
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | ✓ Link | 0.501 | 0.815 | 0.188 | | 0.257 | | 0.661 | | | | | EnCLAP++-base | 2024-09-02 |
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | ✓ Link | 0.4954 | 0.8029 | 0.1879 | | 0.2554 | | | | | | | EnCLAP-large | 2024-01-31 |
(no paper listed) | | 0.4951 | 0.8061 | 0.1841 | | 0.2527 | | 0.6431 | 0.4945 | 40 | | | CNext-trans | |
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | ✓ Link | 0.4829 | 0.7795 | 0.1863 | | 0.2473 | | | | | | | EnCLAP-base | 2024-01-31 |
(no paper listed) | | 0.475 | 0.769 | 0.181 | | | | | | | | | AL-MixGen + Multi-TTA | |
Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer | | 0.472 | 0.764 | 0.180 | 0.285 | 0.242 | 0.504 | | | | | | Rethink-ACT (AST + TF + MIL) | 2023-08-20 |
Exploring Train and Test-Time Augmentations for Audio-Language Learning | | 0.466 | 0.755 | 0.177 | | | | | | | | | AL-MixGen | 2022-10-31 |
Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags | ✓ Link | 0.465 | 0.753 | 0.176 | | | | | | | | | BART + YAMNet + PANNs | 2021-11-15 |
Audio Captioning Transformer | ✓ Link | 0.426 | 0.693 | 0.159 | | | | | | | | | CNN+Transformer | 2021-07-21 |
AudioCaps: Generating Captions for Audios in The Wild | | 0.369 | 0.593 | 0.144 | | | | | | | | | TopDown-AlignedAtt (1NN) | 2019-06-01 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | | 0.781 | | 0.295 | 0.247 | 0.509 | | | | | | VAST | 2023-05-29 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | | 0.741 | | 0.270 | 0.231 | 0.494 | | | | | | VALOR | 2023-04-17 |