OpenCodePapers
Video Question Answering on MVBench
Leaderboard
| Paper | Code | Avg. | Model Name | Release Date |
| --- | --- | --- | --- | --- |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ | 69.3 | LinVT-Qwen2-VL (7B) | 2024-12-06 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ | 67.6 | Tarsier (34B) | 2024-06-30 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 67.2 | InternVideo2 | 2024-03-22 |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | ✓ | 66.9 | LongVU (7B) | 2024-10-22 |
| Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | ✓ | 64.7 | Oryx (34B) | 2024-09-19 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 62.0 | VideoLLaMA2 (72B) | 2024-06-11 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ | 59.9 | VideoChat-T (7B) | 2024-10-25 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ | 59.5 | mPLUG-Owl3 (7B) | 2024-08-09 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ | 59.2 | PPLLaVA (7B) | 2024-11-04 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ | 58.7 | VideoGPT+ | 2024-06-13 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ | 58.1 | PLLaVA | 2024-04-25 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ | 54.9 | ST-LLM | 2024-03-30 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 51.9 | VideoChat2 | 2023-11-28 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | ✓ | 47.55 | HawkEye | 2024-03-15 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | ✓ | 39.7 | SPHINX-Plus | 2024-02-08 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | ✓ | 38.5 | TimeChat | 2023-12-04 |
| Visual Instruction Tuning | ✓ | 36.0 | LLaVA | 2023-04-17 |
| VideoChat: Chat-Centric Video Understanding | ✓ | 35.5 | VideoChat | 2023-05-10 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | ✓ | 34.1 | VideoLLaMA | 2023-06-05 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ | 32.7 | Video-ChatGPT | 2023-06-08 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | ✓ | 32.5 | InstructBLIP | 2023-05-11 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ | 18.8 | MiniGPT4 | 2023-04-20 |