OpenCodePapers
Video Question Answering
Zero-Shot Video Question Answer
Leaderboard
| Paper | Code | Accuracy (%) | Model Name | Release Date |
|---|---|---|---|---|
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | ✓ | 77.4 | Video-RAG (based on LLaVA-Video) | 2024-11-20 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 71.9 | Gemini 1.5 Pro | 2024-03-08 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 70.3 | GPT-4o | 2024-06-14 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 66.3 | Gemini 1.5 Flash | 2024-03-08 |
| | | 64.8 | LLaVA-OneVision (72B) | |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 62.3 | GPT-4o mini | 2024-06-14 |
| VILA: On Pre-training for Visual Language Models | ✓ | 61.4 | VILA-1.5 (34B) | 2023-12-12 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 60.9 | VideoLLaMA2 (72B) | 2024-06-11 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ | 46.3 | VideoChat-T (7B) | 2024-10-25 |