| Paper | Code | Score | Model | Date |
|---|---|---|---|---|
| Seed1.5-VL Technical Report | | 63.6 | Seed1.5-VL thinking | 2025-05-11 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 63.5 | PLM-8B | 2025-04-17 |
| Seed1.5-VL Technical Report | | 61.5 | Seed1.5-VL | 2025-05-11 |
| V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | ✓ Link | 60.6 | V-JEPA 2 ViT-g 8B | 2025-06-11 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 58.9 | PLM-3B | 2025-04-17 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | | 56.5 | RRPO | 2025-04-16 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 55.5 | Tarsier-34B | 2024-06-30 |
| Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | ✓ Link | 54.7 | Tarsier2-7B | 2025-01-14 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 52.7 | Qwen2-VL-72B | 2024-09-18 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ Link | 51.6 | IXC-2.5 7B | 2024-07-03 |
| Aria: An Open Multimodal Native Mixture-of-Experts Model | ✓ Link | 51.0 | Aria | 2024-10-08 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 50.4 | PLM-1B | 2025-04-17 |
| Video Instruction Tuning With Synthetic Data | | 50.0 | LLaVA-Video 72B | 2024-10-03 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 48.4 | VideoLLaMA2 72B | 2024-06-11 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 47.6 | Gemini 1.5 Pro | 2024-03-08 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 46.9 | Tarsier-7B | 2024-06-30 |
| Video Instruction Tuning With Synthetic Data | | 45.6 | LLaVA-Video 7B | 2024-10-03 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 43.8 | Qwen2-VL-7B | 2024-09-18 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 42.9 | VideoLLaMA2 7B | 2024-06-11 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 42.3 | PLLaVA-34B | 2024-04-25 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ Link | 42.2 | mPLUG-Owl3 | 2024-08-09 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 42.1 | VideoLLaMA2.1 | 2024-06-11 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ Link | 41.7 | VideoGPT+ | 2024-06-13 |
| GPT-4o System Card | | 39.9 | GPT-4o (8 frames) | 2024-10-25 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 36.4 | PLLaVA-13B | 2024-04-25 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ Link | 35.7 | ST-LLM | 2024-03-30 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 35.0 | VideoChat2 | 2023-11-28 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 34.9 | PLLaVA-7B | 2024-04-25 |