| Paper | Code | Accuracy | Model | Release Date |
|---|---|---|---|---|
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 80.7 | VAST | 2023-05-29 |
| | | 79.6 | CoQo(Internvideo2) | |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 78.9 | VALOR | 2023-04-17 |
| CAD -- Contextual Multi-modal Alignment for Dynamic AVQA | | 78.26 | CAD | 2023-10-25 |
| Vision Transformers are Parameter-Efficient Audio-Visual Learners | ✓ Link | 77.08 | LAVISH | 2022-12-15 |
| Learning to Answer Questions in Dynamic Audio-Visual Scenarios | ✓ Link | 71.52 | ST-AVQA | 2022-03-26 |