Paper | Code | Accuracy | Model | Release Date |
---|---|---|---|---|
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 80.7 | VAST | 2023-05-29 |
N/A | | 79.6 | CoQo(Internvideo2) | |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 78.9 | VALOR | 2023-04-17 |
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA | | 78.26 | CAD | 2023-10-25 |
Vision Transformers are Parameter-Efficient Audio-Visual Learners | ✓ Link | 77.08 | LAVISH | 2022-12-15 |
Learning to Answer Questions in Dynamic Audio-Visual Scenarios | ✓ Link | 71.52 | ST-AVQA | 2022-03-26 |