Paper | Code | Average Accuracy (%) | Model Name | Release Date |
---|---|---|---|---|
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | ✓ Link | 55.08 | GF (sup) - Faster RCNN | 2024-01-03 |
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | ✓ Link | 54.39 | MIST - CLIP | 2022-12-19 |
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | ✓ Link | 53.33 | GF (uns) - S3D | 2024-01-03 |
SViTT: Temporal Learning of Sparse Video-Text Transformers | ✓ Link | 52.7 | SViTT | 2023-04-18 |
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | ✓ Link | 50.96 | MIST - AIO | 2022-12-19 |
Learning Situation Hyper-Graphs for Video Question Answering | ✓ Link | 49.2 | SHG-VQA (trained from scratch) | 2023-04-18 |
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | ✓ Link | 48.59 | AIO - ViT | 2024-01-03 |
MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering |  | 44.36 | MMTF | 2023-10-06 |