OpenCodePapers

Audio Classification on AudioSet

Leaderboard
| Paper | Code | Test mAP | AUC | d-prime | Model Name | Release Date |
|---|---|---|---|---|---|---|
| OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 0.558 | | | OmniVec2 | 2024-01-01 |
| OmniVec: Learning robust representations with cross modal sharing | | 0.548 | | | OmniVec | 2023-11-07 |
| EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | ✓ | 0.546 | | | EquiAV | 2024-03-14 |
| | | 0.533 | | | MAViL (Audio-Visual, single) | |
| Audiovisual Masked Autoencoders | ✓ | 0.518 | | | Audiovisual Masked Autoencoder (Audiovisual, Single) | 2022-12-09 |
| Contrastive Audio-Visual Masked Autoencoder | ✓ | 0.512 | | | CAV-MAE (Audio-Visual) | 2022-10-02 |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | ✓ | 0.506 | | | BEATs (Audio-only, Ensemble) | 2022-12-18 |
| UAVM: Towards Unifying Audio and Visual Models | ✓ | 0.504 | | | UAVM (Audio + Video) | 2022-07-29 |
| SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes | ✓ | 0.502 | | | SSLAM (Audio-Only, Single) | 2025-06-13 |
| Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation | ✓ | 0.498 | | | mn40_as (Ensemble) | 2022-11-09 |
| Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks | ✓ | 0.497 | | | ATST-C2F (Single) | 2023-06-07 |
| Attention Bottlenecks for Multimodal Fusion | ✓ | 0.496 | | | MBT (AS-500K training + Video) | 2021-06-30 |
| Efficient Training of Audio Transformers with Patchout | ✓ | 0.496 | | | PaSST (Ensemble) | 2021-10-11 |
| Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models | ✓ | 0.490 | | | DyMN-L (Audio-Only, Single) | 2023-10-24 |
| M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | ✓ | 0.490 | | | M2D2 | 2025-03-28 |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | ✓ | 0.487 | | | HTS-AT (Ensemble) | 2022-02-02 |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | ✓ | 0.486 | | | BEATs (Audio-only, Single) | 2022-12-18 |
| EAT: Self-Supervised Pre-Training with Efficient Audio Transformer | ✓ | 0.486 | | | EAT | 2024-01-07 |
| DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification | ✓ | 0.486 | | | DTF-AT (Single) | 2024-03-24 |
| AST: Audio Spectrogram Transformer | ✓ | 0.485 | | | AST (Ensemble) | 2021-04-05 |
| M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation | ✓ | 0.485 | | | M2D-CLAP/0.7 | 2024-06-04 |
| Masked Modeling Duo: Towards a Universal Audio Pre-training Framework | ✓ | 0.485 | | | M2D-AS/0.7 | 2024-04-09 |
| | | 0.484 | | | MAViL (Audio-only, single) | |
| Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation | ✓ | 0.483 | | | mn40_as (Single) | 2022-11-09 |
| MAX-AST: Combining Convolution, Local and Global Self-Attentions for Audio Event Classification | ✓ | 0.481 | | | MAX-AST (Single) | 2024-04-14 |
| Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks | ✓ | 0.480 | | | ATST-Frame | 2023-06-07 |
| Masked Modeling Duo: Towards a Universal Audio Pre-training Framework | ✓ | 0.479 | | | M2D/0.7 | 2024-04-09 |
| Play It Back: Iterative Attention for Audio Recognition | ✓ | 0.477 | | | PlayItBackX3 | 2022-10-20 |
| DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners | ✓ | 0.476 | | | DASS-Medium (Audio-only, single) | 2024-07-04 |
| PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation | ✓ | 0.474 | 0.981 | 2.936 | PSLA (Ensemble) | 2021-02-02 |
| DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners | ✓ | 0.472 | | | DASS-Small (Audio-only, single) | 2024-07-04 |
| Efficient Training of Audio Transformers with Patchout | ✓ | 0.471 | | | PaSST-S (Single) | 2021-10-11 |
| | | 0.471 | | | MaskSpec (AS-2M) | |
| Contrastive Audio-Visual Masked Autoencoder | ✓ | 0.466 | | | CAV-MAE (Audio-Only) | 2022-10-02 |
| Audiovisual Masked Autoencoders | ✓ | 0.466 | | | Audiovisual Masked Autoencoder (Audio-only, Single) | 2022-12-09 |
| Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data | | 0.462 | 0.975 | | AudioVisual Fusion Net | 2020-05-29 |
| AST: Audio Spectrogram Transformer | ✓ | 0.459 | | | AST (Single) | 2021-04-05 |
| ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition | | 0.450 | 0.976 | 2.804 | ERANN-1-6 | 2021-06-03 |
| Perceiver: General Perception with Iterative Attention | ✓ | 0.449 | | | Perceiver | 2021-03-04 |
| PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation | ✓ | 0.443 | 0.975 | 2.778 | PSLA (Single) | 2021-02-02 |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | ✓ | 0.431 | 0.973 | 2.732 | PANNs-CNN14 (Single) | 2020-08-23 |
| End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network | ✓ | 0.426 | | | EAT-M | 2022-04-25 |
| Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks | | 0.411 | | | Conformer (AS-2M) | 2021-10-14 |
| End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network | ✓ | 0.405 | | | EAT-S | 2022-04-25 |
| A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | | 0.398 | 0.972 | | WEANet-SUSTAIN | 2020-06-30 |
| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ | 0.394 | 0.971 | 2.895 | VATT-Base | 2021-04-22 |
| Multi-Format Contrastive Learning of Audio Representations | | 0.376 | | | Multi-Format Contrastive | 2021-03-11 |
| Self-Supervised MultiModal Versatile Networks | ✓ | 0.309 | | | MMV | 2020-06-29 |
| Contrastive Audio-Visual Masked Autoencoder | ✓ | 0.262 | | | CAV-MAE (Visual-Only) | 2022-10-02 |
| Look, Listen and Learn | ✓ | 0.249 | | | L3 | 2017-05-23 |
| Unsupervised Learning of Semantic Audio Representations | | 0.244 | | | Triplet | 2017-11-06 |
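
The leaderboard does not say how its AUC and d-prime columns relate. A minimal sketch, assuming the convention commonly used in AudioSet work, where d-prime is derived from AUC as d' = sqrt(2) * Phi^-1(AUC), with Phi^-1 the inverse CDF of the standard normal distribution. The function name `d_prime_from_auc` is hypothetical, and individual papers may compute the metric differently (for example per class before averaging), so reported values need not match this conversion exactly.

```python
import math
from statistics import NormalDist


def d_prime_from_auc(auc: float) -> float:
    """Convert an AUC in (0, 1) to d-prime, assuming d' = sqrt(2) * Phi^-1(AUC).

    This is an assumed convention, not something stated by the leaderboard.
    """
    if not 0.0 < auc < 1.0:
        raise ValueError("AUC must lie strictly between 0 and 1")
    # inv_cdf is the inverse CDF (quantile function) of the standard normal.
    return math.sqrt(2.0) * NormalDist().inv_cdf(auc)


if __name__ == "__main__":
    # AUC values taken from the rows that report both AUC and d-prime.
    for model, auc in [("PSLA (Ensemble)", 0.981), ("PANNs-CNN14 (Single)", 0.973)]:
        print(f"{model}: AUC={auc:.3f} -> d-prime ~ {d_prime_from_auc(auc):.3f}")
```

Because the AUC column is rounded to three decimals, a value converted this way can differ from the listed d-prime in the second or third decimal place.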