OpenCodePapers

Video Question Answering on MVBench

Video Question Answering
Leaderboard
| Paper | Code | Avg. | Model Name | Release Date |
|---|---|---|---|---|
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ | 69.3 | LinVT-Qwen2-VL (7B) | 2024-12-06 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ | 67.6 | Tarsier (34B) | 2024-06-30 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 67.2 | InternVideo2 | 2024-03-22 |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | ✓ | 66.9 | LongVU (7B) | 2024-10-22 |
| Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | ✓ | 64.7 | Oryx (34B) | 2024-09-19 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 62.0 | VideoLLaMA2 (72B) | 2024-06-11 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ | 59.9 | VideoChat-T (7B) | 2024-10-25 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ | 59.5 | mPLUG-Owl3 (7B) | 2024-08-09 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ | 59.2 | PPLLaVA (7B) | 2024-11-04 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ | 58.7 | VideoGPT+ | 2024-06-13 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ | 58.1 | PLLaVA | 2024-04-25 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ | 54.9 | ST-LLM | 2024-03-30 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 51.9 | VideoChat2 | 2023-11-28 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | ✓ | 47.55 | HawkEye | 2024-03-15 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | ✓ | 39.7 | SPHINX-Plus | 2024-02-08 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | ✓ | 38.5 | TimeChat | 2023-12-04 |
| Visual Instruction Tuning | ✓ | 36.0 | LLaVa | 2023-04-17 |
| VideoChat: Chat-Centric Video Understanding | ✓ | 35.5 | VideoChat | 2023-05-10 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | ✓ | 34.1 | VideoLLaMA | 2023-06-05 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ | 32.7 | Video-ChatGPT | 2023-06-08 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | ✓ | 32.5 | InstructBLIP | 2023-05-11 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ | 18.8 | MiniGPT4 | 2023-04-20 |