Paper | Code | mAP | AUC | d-prime | Model | Date
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 0.558 | | | OmniVec2 | 2024-01-01 |
OmniVec: Learning robust representations with cross modal sharing | | 0.548 | | | OmniVec | 2023-11-07 |
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | ✓ Link | 0.546 | | | EquiAV | 2024-03-14 |
 | | 0.533 | | | MAViL (Audio-Visual, single) | |
Audiovisual Masked Autoencoders | ✓ Link | 0.518 | | | Audiovisual Masked Autoencoder (Audiovisual, Single) | 2022-12-09 |
Contrastive Audio-Visual Masked Autoencoder | ✓ Link | 0.512 | | | CAV-MAE (Audio-Visual) | 2022-10-02 |
BEATs: Audio Pre-Training with Acoustic Tokenizers | ✓ Link | 0.506 | | | BEATs (Audio-only, Ensemble) | 2022-12-18 |
UAVM: Towards Unifying Audio and Visual Models | ✓ Link | 0.504 | | | UAVM (Audio + Video) | 2022-07-29 |
SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes | ✓ Link | 0.502 | | | SSLAM (Audio-Only, Single) | 2025-06-13 |
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation | ✓ Link | 0.498 | | | mn40_as (Ensemble) | 2022-11-09 |
Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks | ✓ Link | 0.497 | | | ATST-C2F (Single) | 2023-06-07 |
Attention Bottlenecks for Multimodal Fusion | ✓ Link | 0.496 | | | MBT (AS-500K training + Video) | 2021-06-30 |
Efficient Training of Audio Transformers with Patchout | ✓ Link | 0.496 | | | PaSST (Ensemble) | 2021-10-11 |
Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models | ✓ Link | 0.490 | | | DyMN-L (Audio-Only, Single) | 2023-10-24 |
M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | ✓ Link | 0.490 | | | M2D2 | 2025-03-28 |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | ✓ Link | 0.487 | | | HTS-AT (Ensemble) | 2022-02-02 |
BEATs: Audio Pre-Training with Acoustic Tokenizers | ✓ Link | 0.486 | | | BEATs (Audio-only, Single) | 2022-12-18 |
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer | ✓ Link | 0.486 | | | EAT | 2024-01-07 |
DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification | ✓ Link | 0.486 | | | DTF-AT (Single) | 2024-03-24 |
AST: Audio Spectrogram Transformer | ✓ Link | 0.485 | | | AST (Ensemble) | 2021-04-05 |
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation | ✓ Link | 0.485 | | | M2D-CLAP/0.7 | 2024-06-04 |
Masked Modeling Duo: Towards a Universal Audio Pre-training Framework | ✓ Link | 0.485 | | | M2D-AS/0.7 | 2024-04-09 |
 | | 0.484 | | | MAViL (Audio-only, single) | |
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation | ✓ Link | 0.483 | | | mn40_as (Single) | 2022-11-09 |
MAX-AST: Combining Convolution, Local and Global Self-Attentions for Audio Event Classification | ✓ Link | 0.481 | | | MAX-AST (Single) | 2024-04-14 |
Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks | ✓ Link | 0.480 | | | ATST-Frame | 2023-06-07 |
Masked Modeling Duo: Towards a Universal Audio Pre-training Framework | ✓ Link | 0.479 | | | M2D/0.7 | 2024-04-09 |
Play It Back: Iterative Attention for Audio Recognition | ✓ Link | 0.477 | | | PlayItBackX3 | 2022-10-20 |
DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners | ✓ Link | 0.476 | | | DASS-Medium (Audio-only, single) | 2024-07-04 |
PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation | ✓ Link | 0.474 | 0.981 | 2.936 | PSLA (Ensemble) | 2021-02-02 |
DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners | ✓ Link | 0.472 | | | DASS-Small (Audio-only, single) | 2024-07-04 |
Efficient Training of Audio Transformers with Patchout | ✓ Link | 0.471 | | | PaSST-S (Single) | 2021-10-11 |
 | | 0.471 | | | MaskSpec (AS-2M) | |
Contrastive Audio-Visual Masked Autoencoder | ✓ Link | 0.466 | | | CAV-MAE (Audio-Only) | 2022-10-02 |
Audiovisual Masked Autoencoders | ✓ Link | 0.466 | | | Audiovisual Masked Autoencoder (Audio-only, Single) | 2022-12-09 |
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data | | 0.462 | 0.975 | | AudioVisual Fusion Net | 2020-05-29 |
AST: Audio Spectrogram Transformer | ✓ Link | 0.459 | | | AST (Single) | 2021-04-05 |
ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition | | 0.450 | 0.976 | 2.804 | ERANN-1-6 | 2021-06-03 |
Perceiver: General Perception with Iterative Attention | ✓ Link | 0.449 | | | Perceiver | 2021-03-04 |
PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation | ✓ Link | 0.443 | 0.975 | 2.778 | PSLA (Single) | 2021-02-02 |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | ✓ Link | 0.431 | 0.973 | 2.732 | PANNs-CNN14 (Single) | 2020-08-23 |
End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network | ✓ Link | 0.426 | | | EAT-M | 2022-04-25 |
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks | | 0.411 | | | Conformer (AS-2M) | 2021-10-14 |
End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network | ✓ Link | 0.405 | | | EAT-S | 2022-04-25 |
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | | 0.398 | 0.972 | | WEANet-SUSTAIN | 2020-06-30 |
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ Link | 0.394 | 0.971 | 2.895 | VATT-Base | 2021-04-22 |
Multi-Format Contrastive Learning of Audio Representations | | 0.376 | | | Multi-Format Contrastive | 2021-03-11 |
Self-Supervised MultiModal Versatile Networks | ✓ Link | 0.309 | | | MMV | 2020-06-29 |
Contrastive Audio-Visual Masked Autoencoder | ✓ Link | 0.262 | | | CAV-MAE (Visual-Only) | 2022-10-02 |
Look, Listen and Learn | ✓ Link | 0.249 | | | L3 | 2017-05-23 |
Unsupervised Learning of Semantic Audio Representations | | 0.244 | | | Triplet | 2017-11-06 |