OpenCodePapers

Zero-Shot Video Question Answer on EgoSchema

Leaderboard

Paper	Code	Accuracy (%)	Inference Speed (s)	Model Name	Release Date
Tarsier: Recipes for Training and Evaluating Large Video Description Models	✓ Link	68.6		Tarsier (34B)	2024-06-30
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	✓ Link	68.4		VideoChat-T (7B)	2024-10-25
Language Repository for Long Video Understanding	✓ Link	66.2		LangRepo (12B)	2024-03-21
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos	✓ Link	66.2		VideoTree (GPT4)	2024-05-29
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA	✓ Link	66.0		LVNet	2024-06-13
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	✓ Link	65.6		VideoChat2_HD_mistral	2023-11-28
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	✓ Link	63.6		VideoChat2_mistral	2023-11-28
Understanding Long Videos with Multimodal Language Models	✓ Link	60.3	2.42	MVU (13B)	2024-03-25
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	✓ Link	57.8		TS-LLaVA-34B	2024-11-17
A Simple LLM Framework for Long-Range Video Question-Answering	✓ Link	57.6		LLoVi (GPT-3.5)	2023-12-28
A Simple LLM Framework for Long-Range Video Question-Answering	✓ Link	50.8		LLoVi (7B)	2023-12-28
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	✓ Link	47.2		SlowFast-LLaVA-34B	2024-07-22
Self-Chained Image-Language Model for Video Localization and Question Answering	✓ Link	25.7		SeViLA (4B)	2023-05-11
N/A		20.0		Random
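
The Random row matches uniform guessing on a five-option multiple-choice question, which is the format EgoSchema uses. As a minimal sketch of how accuracy figures like those above are typically computed, assuming predictions and ground-truth answers are both given as option indices (the function names here are hypothetical and not taken from any listed paper's code):

```python
def accuracy(predictions, answers):
    """Return the percentage of questions whose predicted option matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def chance_accuracy(num_options):
    """Expected accuracy (%) of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 100.0 / num_options


if __name__ == "__main__":
    # Toy data for illustration only: 4 questions, 3 answered correctly.
    print(accuracy([0, 1, 2, 3], [0, 1, 2, 4]))
    # Five answer options per question, assumed to match EgoSchema's format.
    print(chance_accuracy(5))
```

Comparing a model's score against `chance_accuracy(5)` gives a rough sense of how far above guessing it sits.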