Paper | Code | Accuracy (Top-1) | Model Name | Release Date |
---|---|---|---|---|
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | ✓ Link | 71.4 | Oryx (34B) | 2024-09-19 |
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | ✓ Link | 68.51 | BIMBA-LLaVA-Qwen2-7B | 2025-03-12 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 63.4 | InternVideo2 (8B) | 2024-03-22 |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 57.5 | VideoLLaMA2 (72B) | 2024-06-11 |
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | ✓ Link | 50.2 | TraveLER | 2024-04-01 |
Perception Test: A Diagnostic Benchmark for Multimodal Video Models | ✓ Link | 46 | Flamingo | 2023-05-23 |