OpenCodePapers

zero-shot-video-question-answer-on-video-mme-1

Video Question Answering / Zero-Shot Video Question Answer
Leaderboard
| Paper | Code | Accuracy (%) | Model Name | Release Date |
|---|---|---|---|---|
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 81.3 | Gemini 1.5 Pro | 2024-03-08 |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | ✓ | 77.4 | Video-RAG (Based on LLaVA-Video) | 2024-11-20 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 77.2 | GPT-4o | 2024-06-14 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 75.0 | Gemini 1.5 Flash | 2024-03-08 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 68.9 | GPT-4o mini | 2024-06-14 |
| BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | ✓ | 64.67 | BIMBA-LLaVA-Qwen2-7B | 2025-03-12 |
| VILA: On Pre-training for Visual Language Models | ✓ | 64.1 | VILA-1.5 (34B) | 2023-12-12 |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | ✓ | 63.7 | MiniCPM-V 2.6 (8B) | 2024-08-03 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 63.1 | VideoLLaMA2 (72B) | 2024-06-11 |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | ✓ | 60.6 | LongVU (7B) | 2024-10-22 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ | 55.8 | VideoChat-T (7B) | 2024-10-25 |