zero-shot-video-question-answer-on-egoschema-1

Video Question Answering > Zero-Shot Video Question Answer

Leaderboard

Paper	Code	Accuracy (%)	Model Name	Release Date
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering	✓ Link	71.14	BIMBA-LLaVA-Qwen2-7B	2025-03-12
LinVT: Empower Your Image-level Large Language Model to Understand Videos	✓ Link	69.5	LinVT-Qwen2-VL(7B)	2024-12-06
Qwen2.5-Omni Technical Report	✓ Link	68.6	Qwen2.5-Omni	2025-03-26
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	✓ Link	67.6	LongVU (7B)	2024-10-22
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension	✓ Link	66.7	Video-RAG (Based on LLaVA-Video)	2024-11-20
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	✓ Link	63.9	VideoLLaMA2 (72B)	2024-06-11
Tarsier: Recipes for Training and Evaluating Large Video Description Models	✓ Link	61.7	Tarsier (34B)	2024-06-30
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos	✓ Link	61.1	VideoTree (GPT4)	2024-05-29
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA	✓ Link	61.1	LVNet	2024-06-13
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	60.2	InternVideo2-6B	2024-03-22
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	✓ Link	60.0	VideoChat-T (7B)	2024-10-25
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	✓ Link	56.7	VideoChat2_phi3	2023-11-28
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	✓ Link	55.8	VideoChat2_HD_mistral	2023-11-28
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	✓ Link	54.4	VideoChat2_mistral	2023-11-28
Vamos: Versatile Action Models for Video Understanding	✓ Link	53.6	Vamos (GPT-4o)	2023-11-22
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering	✓ Link	53.3	TraveLER	2024-04-01
A Simple LLM Framework for Long-Range Video Question-Answering	✓ Link	50.3	LLoVi (GPT-3.5)	2023-12-28
Video ReCap: Recursive Captioning of Hour-Long Videos	✓ Link	50.23	Video ReCap	2024-02-20
Vamos: Versatile Action Models for Video Understanding	✓ Link	48.3	Vamos (GPT-4)	2023-11-22
Language Repository for Long Video Understanding	✓ Link	41.2	LangRepo (12B)	2024-03-21
Understanding Long Videos with Multimodal Language Models	✓ Link	37.6	MVU (13B)	2024-03-25
Vamos: Versatile Action Models for Video Understanding	✓ Link	36.7	Vamos (13B)	2023-11-22
A Simple LLM Framework for Long-Range Video Question-Answering	✓ Link	33.5	LLoVi (7B)	2023-12-28
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	✓ Link	33.0	TimeChat (7B)	2023-12-04
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	✓ Link	32.1	InternVideo	2022-12-06
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	✓ Link	31.1	mPLUG-Owl (7B)	2023-04-27
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	✓ Link	26.9	FrozenBiLM	2022-06-16
Self-Chained Image-Language Model for Video Localization and Question Answering	✓ Link	22.7	SeViLA (4B)	2023-05-11
-	-	20.0	Random	-
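
The rows above are tab-separated, so they can be read with a few lines of Python. The following is a minimal sketch, not part of the original page: it assumes the column order paper, code, accuracy, model name, release date, uses only a handful of rows copied from the table, and the names `parse_rows`, `running_best`, and `RANDOM_BASELINE` are hypothetical helpers chosen for illustration. It lists the entries that set a new best accuracy when ordered by release date, and measures the margin over the Random row.

```python
from datetime import date

# A few rows copied from the leaderboard
# (assumed column order: paper, code, accuracy, model name, release date).
SAMPLE = "\n".join([
    "BIMBA: Selective-Scan Compression for Long-Range Video Question Answering\t✓ Link\t71.14\tBIMBA-LLaVA-Qwen2-7B\t2025-03-12",
    "LinVT: Empower Your Image-level Large Language Model to Understand Videos\t✓ Link\t69.5\tLinVT-Qwen2-VL(7B)\t2024-12-06",
    "A Simple LLM Framework for Long-Range Video Question-Answering\t✓ Link\t50.3\tLLoVi (GPT-3.5)\t2023-12-28",
    "InternVideo: General Video Foundation Models via Generative and Discriminative Learning\t✓ Link\t32.1\tInternVideo\t2022-12-06",
    "mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality\t✓ Link\t31.1\tmPLUG-Owl (7B)\t2023-04-27",
    "Zero-Shot Video Question Answering via Frozen Bidirectional Language Models\t✓ Link\t26.9\tFrozenBiLM\t2022-06-16",
])

# Accuracy of the "Random" row in the leaderboard.
RANDOM_BASELINE = 20.0


def parse_rows(text):
    """Turn tab-separated leaderboard lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        paper, _code, accuracy, model, released = line.split("\t")
        rows.append({
            "paper": paper,
            "model": model,
            "accuracy": float(accuracy),
            "released": date.fromisoformat(released),
        })
    return rows


def running_best(rows, floor=RANDOM_BASELINE):
    """Return the rows that set a new best accuracy, in release-date order."""
    best, top = [], floor
    for row in sorted(rows, key=lambda r: r["released"]):
        if row["accuracy"] > top:
            top = row["accuracy"]
            best.append(row)
    return best


if __name__ == "__main__":
    for row in running_best(parse_rows(SAMPLE)):
        margin = row["accuracy"] - RANDOM_BASELINE
        print(f'{row["released"]}  {row["model"]}: {row["accuracy"]} (+{margin:.2f} over random)')
```

Pasting the full table into `SAMPLE` in place of the six sample rows would apply the same logic to every entry; the Random row has no release date and would need to be skipped or handled separately.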
