OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 99.1 | Multiple | 99.1 | OmniVec2 | 2024-01-01 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 98.6 | Multiple | 98.6 | InternVideo2 | 2024-03-22 |
M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | ✓ Link | 98.5 | AudioSet, WavCaps | 98.5 | M2D2 AS+ | 2025-03-28 |
OmniVec: Learning robust representations with cross modal sharing | | 98.4 | Multiple | 98.4 | OmniVec | 2023-11-07 |
BEATs: Audio Pre-Training with Acoustic Tokenizers | ✓ Link | 98.1 | AudioSet | 98.1 | BEATs | 2022-12-18 |
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation | ✓ Link | 97.45 | AudioSet | 97.45 | mn40_as | 2022-11-09 |
Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models | ✓ Link | 97.4 | AudioSet | 97.4 | DyMN-L | 2023-10-24 |
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation | ✓ Link | 97.4 | AudioSet | 97.4 | M2D-CLAP/0.7 | 2024-06-04 |
Masked Modeling Duo: Towards a Universal Audio Pre-training Framework | ✓ Link | 97.2 | AudioSet | 97.2 | M2D-AS/0.7 | 2024-04-09 |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | ✓ Link | 97.0 | AudioSet | 97.0 | HTS-AT | 2022-02-02 |
End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network | ✓ Link | 96.3 | AudioSet | 96.3 | EAT-M | 2022-04-25 |
LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging | | 96.2 | | | LHGNN | 2025-01-07 |
ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition | | 96.1 | AudioSet | 96.1 | ERANN-2-5 | 2021-06-03 |
Masked Modeling Duo: Towards a Universal Audio Pre-training Framework | ✓ Link | 96.0 | | 96.0 | M2D/0.7 | 2024-04-09 |
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer | ✓ Link | 96.0 | AudioSet | 96.0 | EAT | 2024-01-07 |
AST: Audio Spectrogram Transformer | ✓ Link | 95.7 | AudioSet, ImageNet | 95.7 | Audio Spectrogram Transformer | 2021-04-05 |
End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network | ✓ Link | 95.25 | AudioSet | 95.25 | EAT-S | 2022-04-25 |
Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning | ✓ Link | 93.5 | AudioSet | 93.5 | MATPAC (SSL model, linear eval) | 2025-02-17 |
End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network | ✓ Link | 92.15 | | 92.15 | EAT-S (scratch) | 2022-04-25 |
Learning Rate Curriculum | ✓ Link | 91.58 | | 91.58 | SepTr + LeRaC | 2022-05-18 |
SepTr: Separable Transformer for Audio Spectrogram Processing | ✓ Link | 91.13 | - | | SepTr | 2022-03-17 |
Multi-Format Contrastive Learning of Audio Representations | | 90.5 | | | Multi-Format Contrastive | 2021-03-11 |
 | | 89.5 | EfficientNet | 89.5 | Multi-Channel Audio Feature with CNN | |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 89.2 | | | AVID | 2020-04-27 |
Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | ✓ Link | 87.1 | | 87.1 | ACDNet | 2021-03-05 |
Self-Supervised Learning by Cross-Modal Audio-Video Clustering | ✓ Link | 85.4 | IG-Random | | XDC | 2019-11-28 |
Self-Supervised Learning by Cross-Modal Audio-Video Clustering | ✓ Link | 84.8 | AudioSet | | XDC | 2019-11-28 |
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization | | 82.3 | | | AVTS | 2018-06-30 |
Look, Listen and Learn | ✓ Link | 79.3 | | | L3 | 2017-05-23 |