Paper | Code | Accuracy (%) | Model Name | Release Date |
---|---|---|---|---|
Large Language Models are Temporal and Causal Reasoners for Video Question Answering | ✓ Link | 82.2 | LLaMA-VQA | 2023-10-24 |
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 82 | FrozenBiLM | 2022-06-16 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 79.0 | VindLU | 2022-12-09 |
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | | 76.96 | iPerceive (Chadha et al., 2020) | 2020-11-16 |
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | ✓ Link | 74.24 | Hero w/ pre-training | 2020-05-01 |
TVQA+: Spatio-Temporal Grounding for Video Question Answering | ✓ Link | 70.50 | STAGE (Lei et al., 2019) | 2019-04-25 |