OpenCodePapers

Video Question Answering on TVBench
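For programmatic comparison, the leaderboard rows below can be treated as structured records. A minimal sketch (a few entries transcribed from the table; the `Entry` type and `best_with_code` helper are illustrative, not part of OpenCodePapers):

```python
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    accuracy: float  # average accuracy (%) on TVBench
    has_code: bool   # whether a code link is listed

# A few rows transcribed from the leaderboard below.
ENTRIES = [
    Entry("Seed1.5-VL thinking", 63.6, False),
    Entry("PLM-8B", 63.5, True),
    Entry("V-JEPA 2 ViT-g 8B", 60.6, True),
    Entry("VideoChat2", 35.0, True),
]

def best_with_code(entries):
    """Return the top-scoring entry that has released code."""
    open_entries = [e for e in entries if e.has_code]
    return max(open_entries, key=lambda e: e.accuracy)

print(best_with_code(ENTRIES).model)  # PLM-8B
```

Under this filter the best open-code model (PLM-8B, 63.5) trails the overall leader (Seed1.5-VL thinking, 63.6) by only 0.1 points.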
Leaderboard
| Paper | Code | Average Accuracy | Model Name | Release Date |
|---|---|---|---|---|
| Seed1.5-VL Technical Report | – | 63.6 | Seed1.5-VL thinking | 2025-05-11 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ | 63.5 | PLM-8B | 2025-04-17 |
| Seed1.5-VL Technical Report | – | 61.5 | Seed1.5-VL | 2025-05-11 |
| V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | ✓ | 60.6 | V-JEPA 2 ViT-g 8B | 2025-06-11 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ | 58.9 | PLM-3B | 2025-04-17 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | – | 56.5 | RRPO | 2025-04-16 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ | 55.5 | Tarsier-34B | 2024-06-30 |
| Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | ✓ | 54.7 | Tarsier2-7B | 2025-01-14 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 52.7 | Qwen2-VL-72B | 2024-09-18 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ | 51.6 | IXC-2.5 7B | 2024-07-03 |
| Aria: An Open Multimodal Native Mixture-of-Experts Model | ✓ | 51.0 | Aria | 2024-10-08 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ | 50.4 | PLM-1B | 2025-04-17 |
| Video Instruction Tuning With Synthetic Data | – | 50.0 | LLaVA-Video 72B | 2024-10-03 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 48.4 | VideoLLaMA2 72B | 2024-06-11 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 47.6 | Gemini 1.5 Pro | 2024-03-08 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ | 46.9 | Tarsier-7B | 2024-06-30 |
| Video Instruction Tuning With Synthetic Data | – | 45.6 | LLaVA-Video 7B | 2024-10-03 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 43.8 | Qwen2-VL-7B | 2024-09-18 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 42.9 | VideoLLaMA2 7B | 2024-06-11 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ | 42.3 | PLLaVA-34B | 2024-04-25 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ | 42.2 | mPLUG-Owl3 | 2024-08-09 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 42.1 | VideoLLaMA2.1 | 2024-06-11 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ | 41.7 | VideoGPT+ | 2024-06-13 |
| GPT-4o System Card | – | 39.9 | GPT-4o (8 frames) | 2024-10-25 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ | 36.4 | PLLaVA-13B | 2024-04-25 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ | 35.7 | ST-LLM | 2024-03-30 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 35.0 | VideoChat2 | 2023-11-28 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ | 34.9 | PLLaVA-7B | 2024-04-25 |