OpenCodePapers

Audio Classification on VGGSound

Leaderboard
| Paper | Code | Top 1 Accuracy | Top 5 Accuracy | Mean AP | AUC | d-prime | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 69.8 | | | | | Mirasol3B | 2023-11-09 |
| CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | | 68.3 | | | | | CA2ST(B/16) | 2025-03-30 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ | 68.2 | | | | | ONE-PEACE (Audio-Visual) | 2023-05-18 |
| CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | | 68.2 | | | | | CAVA(B/16) | 2025-03-30 |
| | | 67.1 | | | | | MAViL | |
| EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | ✓ | 67.1 | | | | | EquiAV | 2024-03-14 |
| Multiscale Multimodal Transformer for Multimodal Action Recognition | | 66.2 | 85.7 | | | | MMT (Audio-Visual) | 2022-09-22 |
| Contrastive Audio-Visual Masked Autoencoder | ✓ | 65.9 | | | | | CAV-MAE (Audio-Visual) | 2022-10-02 |
| UAVM: Towards Unifying Audio and Visual Models | ✓ | 65.8 | | | | | UAVM (Audio + Video) | 2022-07-29 |
| Audiovisual Masked Autoencoders | ✓ | 65.0 | | | | | Audiovisual Masked Autoencoder (Audiovisual, Single) | 2022-12-09 |
| AVT: Audio-Video Transformer for Multimodal Action Recognition | | 63.9 | 85.0 | | | | AVT (Audio-Visual) | 2022-09-22 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ | 59.6 | | | | | ONE-PEACE (Audio-Only) | 2023-05-18 |
| Contrastive Audio-Visual Masked Autoencoder | ✓ | 59.5 | | | | | CAV-MAE (Audio-Only) | 2022-10-02 |
| Audiovisual Masked Autoencoders | ✓ | 57.2 | | | | | Audiovisual Masked Autoencoder (Audio-only, Single) | 2022-12-09 |
| Multiscale Audio Spectrogram Transformer for Efficient Audio Classification | | 57.0 | 81.3 | | | | MAST (Audio Only) | 2023-03-19 |
| UAVM: Towards Unifying Audio and Visual Models | ✓ | 56.5 | | | | | UAVM (Audio Only) | 2022-07-29 |
| Multiscale Multimodal Transformer for Multimodal Action Recognition | | 56.1 | 77.9 | | | | MMT (Video) | 2022-09-22 |
| Play It Back: Iterative Attention for Audio Recognition | ✓ | 53.7 | 79.2 | 56.1 | 97.8 | 2.846 | PlayItBackX3 | 2022-10-20 |
| AVT: Audio-Video Transformer for Multimodal Action Recognition | | 53.2 | 74.8 | | | | AVT (V) | 2022-09-22 |
| Attention Bottlenecks for Multimodal Fusion | ✓ | 52.3 | 78.1 | | | | MBT (A) | 2021-06-30 |
| Attention Bottlenecks for Multimodal Fusion | ✓ | 51.2 | 72.6 | | | | MBT (V) | 2021-06-30 |
| UAVM: Towards Unifying Audio and Visual Models | ✓ | 49.9 | | | | | UAVM (Video Only) | 2022-07-29 |
| Attention Bottlenecks for Multimodal Fusion | ✓ | | 85.6 | | | | MBT (AV) | 2021-06-30 |