OpenCodePapers

Zero-Shot Video Question Answer on Video-MME

Video Question Answering / Zero-Shot Video Question Answer

Leaderboard

Paper	Code	Accuracy (%)	ModelName	ReleaseDate
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension	✓ Link	77.4	Video-RAG (based on LLaVA-Video)	2024-11-20
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	✓ Link	71.9	Gemini 1.5 Pro	2024-03-08
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding		70.3	GPT-4o	2024-06-14
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	✓ Link	66.3	Gemini 1.5 Flash	2024-03-08
N/A		64.8	LLaVA-OneVision (72B)
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding		62.3	GPT-4o mini	2024-06-14
VILA: On Pre-training for Visual Language Models	✓ Link	61.4	VILA-1.5 (34B)	2023-12-12
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	✓ Link	60.9	VideoLLaMA2 (72B)	2024-06-11
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	✓ Link	46.3	VideoChat-T (7B)	2024-10-25