OpenCodePapers

Zero-Shot Video Question Answer on NExT-QA

Video Question Answering / Zero-Shot Video Question Answer
Leaderboard
Paper | Code | Accuracy | Model Name | Release Date
--- | --- | --- | --- | ---
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | ✓ | 79.6 | VideoMultiAgent (GPT-4o) | 2025-04-25
Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ | 79.2 | Tarsier (34B) | 2024-06-30
Agentic Keyframe Search for Video Question Answering | ✓ | 78.1 | AKEYS | 2025-03-20
ENTER: Event Based Interpretable Reasoning for VideoQA | - | 75.1 | ENTER | 2025-01-24
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | ✓ | 73.6 | TS-LLaVA-34B | 2024-11-17
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | ✓ | 73.5 | VideoTree (GPT4) | 2024-05-29
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | ✓ | 72.9 | LVNet (GPT-4o) | 2024-06-13
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | ✓ | 71.3 | VideoAgent (GPT-4) | 2024-03-15
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ | 70.9 | IG-VLM (LLaVA v1.6) | 2024-03-27
VidCtx: Context-aware Video Question Answering with Image Models | ✓ | 70.7 | VidCtx (7B) | 2024-12-23
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | - | 69.2 | MoReVQA (PaLM-2) | 2024-04-09
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ | 68.6 | IG-VLM (GPT-4) | 2024-03-27
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | ✓ | 68.2 | TraveLER (GPT-4) | 2024-04-01
A Simple LLM Framework for Long-Range Video Question-Answering | ✓ | 67.7 | LLoVi (GPT-4) | 2023-12-28
Long Context Transfer from Language to Vision | ✓ | 67.1 | LongVA (32 frames) | 2024-06-24
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | ✓ | 66.3 | Q-ViD | 2024-02-16
Zero-Shot Video Question Answering with Procedural Programs | - | 64.6 | ProViQ | 2023-12-01
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | ✓ | 64.2 | SlowFast-LLaVA-34B | 2024-07-22
Self-Chained Image-Language Model for Video Localization and Question Answering | ✓ | 63.6 | Sevila (4B) | 2023-05-11
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 61.7 | VideoChat2 | 2023-11-28
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | - | 61.0 | DeepStack-L (7B) | 2024-06-06
Language Repository for Long Video Understanding | ✓ | 60.9 | LangRepo (12B) | 2024-03-21
ViperGPT: Visual Inference via Python Execution for Reasoning | ✓ | 60.0 | ViperGPT (GPT-3.5) | 2023-03-14
Understanding Long Videos with Multimodal Language Models | ✓ | 55.2 | MVU (13B) | 2024-03-25
A Simple LLM Framework for Long-Range Video Question-Answering | ✓ | 54.3 | LLoVi (7B) | 2023-12-28
Verbs in Action: Improving verb understanding in video-language models | ✓ | 51.5 | VFC | 2023-04-13
Mistral 7B | ✓ | 51.1 | Mistral (7B) | 2023-10-10