BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | ✓ Link | 71.14 | BIMBA-LLaVA-Qwen2-7B | 2025-03-12 |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ Link | 69.5 | LinVT-Qwen2-VL(7B) | 2024-12-06 |
Qwen2.5-Omni Technical Report | ✓ Link | 68.6 | Qwen2.5-Omni | 2025-03-26 |
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | ✓ Link | 67.6 | LongVU (7B) | 2024-10-22 |
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | ✓ Link | 66.7 | Video-RAG (Based on LLaVA-Video) | 2024-11-20 |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 63.9 | VideoLLaMA2 (72B) | 2024-06-11 |
Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 61.7 | Tarsier (34B) | 2024-06-30 |
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | ✓ Link | 61.1 | LVNet | 2024-06-13 |
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | ✓ Link | 61.1 | VideoTree (GPT-4) | 2024-05-29 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 60.2 | InternVideo2-6B | 2024-03-22 |
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ Link | 60.0 | VideoChat-T (7B) | 2024-10-25 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 56.7 | VideoChat2_phi3 | 2023-11-28 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 55.8 | VideoChat2_HD_mistral | 2023-11-28 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 54.4 | VideoChat2_mistral | 2023-11-28 |
Vamos: Versatile Action Models for Video Understanding | ✓ Link | 53.6 | Vamos (GPT-4o) | 2023-11-22 |
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | ✓ Link | 53.3 | TraveLER | 2024-04-01 |
A Simple LLM Framework for Long-Range Video Question-Answering | ✓ Link | 50.3 | LLoVi (GPT-3.5) | 2023-12-28 |
Video ReCap: Recursive Captioning of Hour-Long Videos | ✓ Link | 50.23 | Video ReCap | 2024-02-20 |
Vamos: Versatile Action Models for Video Understanding | ✓ Link | 48.3 | Vamos (GPT-4) | 2023-11-22 |
Language Repository for Long Video Understanding | ✓ Link | 41.2 | LangRepo (12B) | 2024-03-21 |
Understanding Long Videos with Multimodal Language Models | ✓ Link | 37.6 | MVU (13B) | 2024-03-25 |
Vamos: Versatile Action Models for Video Understanding | ✓ Link | 36.7 | Vamos (13B) | 2023-11-22 |
A Simple LLM Framework for Long-Range Video Question-Answering | ✓ Link | 33.5 | LLoVi (7B) | 2023-12-28 |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | ✓ Link | 33.0 | TimeChat (7B) | 2023-12-04 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 32.1 | InternVideo | 2022-12-06 |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | ✓ Link | 31.1 | mPLUG-Owl (7B) | 2023-04-27 |
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 26.9 | FrozenBiLM | 2022-06-16 |
Self-Chained Image-Language Model for Video Localization and Question Answering | ✓ Link | 22.7 | SeViLA (4B) | 2023-05-11 |
N/A | | 20.0 | Random | |