OpenCodePapers

zero-shot-video-question-answer-on-video-mme

Video Question Answering / Zero-Shot Video Question Answer
Leaderboard
| Paper | Code | Accuracy (%) | Model name | Release date |
|---|---|---|---|---|
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | ✓ | 77.4 | Video-RAG (based on LLaVA-Video) | 2024-11-20 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 71.9 | Gemini 1.5 Pro | 2024-03-08 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 70.3 | GPT-4o | 2024-06-14 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 66.3 | Gemini 1.5 Flash | 2024-03-08 |
| | | 64.8 | LLaVA-OneVision (72B) | |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 62.3 | GPT-4o mini | 2024-06-14 |
| VILA: On Pre-training for Visual Language Models | ✓ | 61.4 | VILA-1.5 (34B) | 2023-12-12 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 60.9 | VideoLLaMA2 (72B) | 2024-06-11 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ | 46.3 | VideoChat-T (7B) | 2024-10-25 |