video-question-answering-on-mvbench

Video Question Answering

Leaderboard

Paper	Code	Avg.	Model Name	Release Date
LinVT: Empower Your Image-level Large Language Model to Understand Videos	✓ Link	69.3	LinVT-Qwen2-VL (7B)	2024-12-06
Tarsier: Recipes for Training and Evaluating Large Video Description Models	✓ Link	67.6	Tarsier (34B)	2024-06-30
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	67.2	InternVideo2	2024-03-22
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	✓ Link	66.9	LongVU (7B)	2024-10-22
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution	✓ Link	64.7	Oryx (34B)	2024-09-19
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	✓ Link	62.0	VideoLLaMA2 (72B)	2024-06-11
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	✓ Link	59.9	VideoChat-T (7B)	2024-10-25
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	✓ Link	59.5	mPLUG-Owl3 (7B)	2024-08-09
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	✓ Link	59.2	PPLLaVA (7B)	2024-11-04
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	✓ Link	58.7	VideoGPT+	2024-06-13
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	✓ Link	58.1	PLLaVA	2024-04-25
ST-LLM: Large Language Models Are Effective Temporal Learners	✓ Link	54.9	ST-LLM	2024-03-30
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	✓ Link	51.9	VideoChat2	2023-11-28
HawkEye: Training Video-Text LLMs for Grounding Text in Videos	✓ Link	47.55	HawkEye	2024-03-15
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models	✓ Link	39.7	SPHINX-Plus	2024-02-08
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	✓ Link	38.5	TimeChat	2023-12-04
Visual Instruction Tuning	✓ Link	36.0	LLaVA	2023-04-17
VideoChat: Chat-Centric Video Understanding	✓ Link	35.5	VideoChat	2023-05-10
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	✓ Link	34.1	VideoLLaMA	2023-06-05
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	✓ Link	32.7	Video-ChatGPT	2023-06-08
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	✓ Link	32.5	InstructBLIP	2023-05-11
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	✓ Link	18.8	MiniGPT4	2023-04-20

OpenCodePapers