OpenCodePapers

Zero-Shot Video Question Answer on Video-MME

Video Question Answering / Zero-Shot Video Question Answer

Leaderboard

Paper	Code	Accuracy (%)	Model Name	Release Date
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	✓ Link	81.3	Gemini 1.5 Pro	2024-03-08
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension	✓ Link	77.4	Video-RAG (Based on LLaVA-Video)	2024-11-20
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding		77.2	GPT-4o	2024-06-14
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	✓ Link	75.0	Gemini 1.5 Flash	2024-03-08
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding		68.9	GPT-4o mini	2024-06-14
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering	✓ Link	64.67	BIMBA-LLaVA-Qwen2-7B	2025-03-12
VILA: On Pre-training for Visual Language Models	✓ Link	64.1	VILA-1.5 (34B)	2023-12-12
MiniCPM-V: A GPT-4V Level MLLM on Your Phone	✓ Link	63.7	MiniCPM-V 2.6 (8B)	2024-08-03
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	✓ Link	63.1	VideoLLaMA2 (72B)	2024-06-11
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	✓ Link	60.6	LongVU (7B)	2024-10-22
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	✓ Link	55.8	VideoChat-T (7B)	2024-10-25