Paper | Code | Score | Model | Date |
--- | --- | --- | --- | --- |
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | ✓ Link | 79.6 | VideoMultiAgent (GPT-4o) | 2025-04-25 |
Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 79.2 | Tarsier (34B) | 2024-06-30 |
Agentic Keyframe Search for Video Question Answering | ✓ Link | 78.1 | AKEYS | 2025-03-20 |
ENTER: Event Based Interpretable Reasoning for VideoQA | | 75.1 | ENTER | 2025-01-24 |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | ✓ Link | 73.6 | TS-LLaVA-34B | 2024-11-17 |
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | ✓ Link | 73.5 | VideoTree (GPT-4) | 2024-05-29 |
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | ✓ Link | 72.9 | LVNet (GPT-4o) | 2024-06-13 |
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | ✓ Link | 71.3 | VideoAgent (GPT-4) | 2024-03-15 |
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 70.9 | IG-VLM (LLaVA v1.6) | 2024-03-27 |
VidCtx: Context-aware Video Question Answering with Image Models | ✓ Link | 70.7 | VidCtx (7B) | 2024-12-23 |
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | | 69.2 | MoReVQA (PaLM-2) | 2024-04-09 |
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 68.6 | IG-VLM (GPT-4) | 2024-03-27 |
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | ✓ Link | 68.2 | TraveLER (GPT-4) | 2024-04-01 |
A Simple LLM Framework for Long-Range Video Question-Answering | ✓ Link | 67.7 | LLoVi (GPT-4) | 2023-12-28 |
Long Context Transfer from Language to Vision | ✓ Link | 67.1 | LongVA (32 frames) | 2024-06-24 |
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | ✓ Link | 66.3 | Q-ViD | 2024-02-16 |
Zero-Shot Video Question Answering with Procedural Programs | | 64.6 | ProViQ | 2023-12-01 |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | ✓ Link | 64.2 | SlowFast-LLaVA-34B | 2024-07-22 |
Self-Chained Image-Language Model for Video Localization and Question Answering | ✓ Link | 63.6 | SeViLA (4B) | 2023-05-11 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 61.7 | VideoChat2 | 2023-11-28 |
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | | 61.0 | DeepStack-L (7B) | 2024-06-06 |
Language Repository for Long Video Understanding | ✓ Link | 60.9 | LangRepo (12B) | 2024-03-21 |
ViperGPT: Visual Inference via Python Execution for Reasoning | ✓ Link | 60.0 | ViperGPT (GPT-3.5) | 2023-03-14 |
Understanding Long Videos with Multimodal Language Models | ✓ Link | 55.2 | MVU (13B) | 2024-03-25 |
A Simple LLM Framework for Long-Range Video Question-Answering | ✓ Link | 54.3 | LLoVi (7B) | 2023-12-28 |
Verbs in Action: Improving verb understanding in video-language models | ✓ Link | 51.5 | VFC | 2023-04-13 |
Mistral 7B | ✓ Link | 51.1 | Mistral (7B) | 2023-10-10 |