OpenCodePapers

Audio captioning on AudioCaps

Audio captioning
Leaderboard
Metric columns, in order: SPIDEr, CIDEr, SPICE, BLEU-4, METEOR, ROUGE-L, FENSE, SPIDEr-FL, #params (M), ROUGE, Sentence-BERT. Each row lists only the values reported for that entry, in that column order.

| Paper | Code | Metric values | Model name | Release date |
| Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | | 0.519, 0.845, 0.194, 0.301, 0.266 | MQ-Cap | 2024-10-14 |
| SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | ✓ | 0.518, 0.841, 0.194, 0.268, 0.668, 0.515 | SLAM-AAC | 2024-10-12 |
| LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | ✓ | 0.517, 0.849, 0.185, 0.297, 0.262, 0.510 | LAVCap | 2025-01-16 |
| EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | ✓ | 0.510, 0.823, 0.197, 0.269, 0.665 | EnCLAP++-large | 2024-09-02 |
| Taming Data and Transformers for Audio Generation | ✓ | 0.507, 0.832, 0.182, 0.253, 0.518, 0.518 | AutoCap | 2024-06-27 |
| Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | ✓ | 0.505, 0.816, 0.193, 0.267, 0.664, 0.664 | LOAE | 2024-06-19 |
| EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | ✓ | 0.501, 0.815, 0.188, 0.257, 0.661 | EnCLAP++-base | 2024-09-02 |
| EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | ✓ | 0.4954, 0.8029, 0.1879, 0.2554 | EnCLAP-large | 2024-01-31 |
| | | 0.4951, 0.8061, 0.1841, 0.2527, 0.6431, 0.4945, 40 | CNext-trans | |
| EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | ✓ | 0.4829, 0.7795, 0.1863, 0.2473 | EnCLAP-base | 2024-01-31 |
| | | 0.475, 0.769, 0.181 | AL-MixGen + Multi-TTA | |
| Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer | | 0.472, 0.764, 0.180, 0.285, 0.242, 0.504 | Rethink-ACT (AST + TF + MIL) | 2023-08-20 |
| Exploring Train and Test-Time Augmentations for Audio-Language Learning | | 0.466, 0.755, 0.177 | AL-MixGen | 2022-10-31 |
| Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags | ✓ | 0.465, 0.753, 0.176 | BART + YAMNet + PANNs | 2021-11-15 |
| Audio Captioning Transformer | ✓ | 0.426, 0.693, 0.159 | CNN+Transformer | 2021-07-21 |
| AudioCaps: Generating Captions for Audios in The Wild | | 0.369, 0.593, 0.144 | TopDown-AlignedAtt (1NN) | 2019-06-01 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | 0.781, 0.295, 0.247, 0.509 | VAST | 2023-05-29 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ | 0.741, 0.270, 0.231, 0.494 | VALOR | 2023-04-17 |