Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 69.8 | | | | | Mirasol3B | 2023-11-09 |
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | | 68.3 | | | | | CA2ST(B/16) | 2025-03-30 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 68.2 | | | | | ONE-PEACE (Audio-Visual) | 2023-05-18 |
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | | 68.2 | | | | | CAVA(B/16) | 2025-03-30 |
 | | 67.1 | | | | | MAViL | |
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | ✓ Link | 67.1 | | | | | EquiAV | 2024-03-14 |
Multiscale Multimodal Transformer for Multimodal Action Recognition | | 66.2 | 85.7 | | | | MMT (Audio-Visual) | 2022-09-22 |
Contrastive Audio-Visual Masked Autoencoder | ✓ Link | 65.9 | | | | | CAV-MAE (Audio-Visual) | 2022-10-02 |
UAVM: Towards Unifying Audio and Visual Models | ✓ Link | 65.8 | | | | | UAVM (Audio + Video) | 2022-07-29 |
Audiovisual Masked Autoencoders | ✓ Link | 65.0 | | | | | Audiovisual Masked Autoencoder (Audiovisual, Single) | 2022-12-09 |
AVT: Audio-Video Transformer for Multimodal Action Recognition | | 63.9 | 85.0 | | | | AVT (Audio-Visual) | 2022-09-22 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 59.6 | | | | | ONE-PEACE (Audio-Only) | 2023-05-18 |
Contrastive Audio-Visual Masked Autoencoder | ✓ Link | 59.5 | | | | | CAV-MAE (Audio-Only) | 2022-10-02 |
Audiovisual Masked Autoencoders | ✓ Link | 57.2 | | | | | Audiovisual Masked Autoencoder (Audio-only, Single) | 2022-12-09 |
Multiscale Audio Spectrogram Transformer for Efficient Audio Classification | | 57.0 | 81.3 | | | | MAST (Audio Only) | 2023-03-19 |
UAVM: Towards Unifying Audio and Visual Models | ✓ Link | 56.5 | | | | | UAVM (Audio Only) | 2022-07-29 |
Multiscale Multimodal Transformer for Multimodal Action Recognition | | 56.1 | 77.9 | | | | MMT (Video) | 2022-09-22 |
Play It Back: Iterative Attention for Audio Recognition | ✓ Link | 53.7 | 79.2 | 56.1 | 97.8 | 2.846 | PlayItBackX3 | 2022-10-20 |
AVT: Audio-Video Transformer for Multimodal Action Recognition | | 53.2 | 74.8 | | | | AVT (V) | 2022-09-22 |
Attention Bottlenecks for Multimodal Fusion | ✓ Link | 52.3 | 78.1 | | | | MBT (A) | 2021-06-30 |
Attention Bottlenecks for Multimodal Fusion | ✓ Link | 51.2 | 72.6 | | | | MBT (V) | 2021-06-30 |
UAVM: Towards Unifying Audio and Visual Models | ✓ Link | 49.9 | | | | | UAVM (Video Only) | 2022-07-29 |
Attention Bottlenecks for Multimodal Fusion | ✓ Link | | 85.6 | | | | MBT (AV) | 2021-06-30 |