OpenCodePapers
Zero-Shot Video Question Answer on Video-MME
Leaderboard
| Paper | Code | Accuracy (%) | Model name | Release date |
|---|---|---|---|---|
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 81.3 | Gemini 1.5 Pro | 2024-03-08 |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | ✓ | 77.4 | Video-RAG (Based on LLaVA-Video) | 2024-11-20 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | - | 77.2 | GPT-4o | 2024-06-14 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 75.0 | Gemini 1.5 Flash | 2024-03-08 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | - | 68.9 | GPT-4o mini | 2024-06-14 |
| BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | ✓ | 64.67 | BIMBA-LLaVA-Qwen2-7B | 2025-03-12 |
| VILA: On Pre-training for Visual Language Models | ✓ | 64.1 | VILA-1.5 (34B) | 2023-12-12 |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | ✓ | 63.7 | MiniCPM-V 2.6 (8B) | 2024-08-03 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 63.1 | VideoLLaMA2 (72B) | 2024-06-11 |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | ✓ | 60.6 | LongVU (7B) | 2024-10-22 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ | 55.8 | VideoChat-T (7B) | 2024-10-25 |