Paper | Code | Score | Model | Date |
--- | --- | --- | --- | --- |
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | ✓ Link | 79.6 | VideoMultiAgent (GPT-4o) | 2025-04-25 |
Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 79.2 | Tarsier (34B) | 2024-06-30 |
Agentic Keyframe Search for Video Question Answering | ✓ Link | 78.1 | AKEYS | 2025-03-20 |
ENTER: Event Based Interpretable Reasoning for VideoQA | | 75.1 | ENTER | 2025-01-24 |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | ✓ Link | 73.6 | TS-LLaVA-34B | 2024-11-17 |
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | ✓ Link | 73.5 | VideoTree (GPT-4) | 2024-05-29 |
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | ✓ Link | 72.9 | LVNet (GPT-4o) | 2024-06-13 |
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | ✓ Link | 71.3 | VideoAgent (GPT-4) | 2024-03-15 |
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 70.9 | IG-VLM (LLaVA v1.6) | 2024-03-27 |
VidCtx: Context-aware Video Question Answering with Image Models | ✓ Link | 70.7 | VidCtx (7B) | 2024-12-23 |
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | | 69.2 | MoReVQA (PaLM-2) | 2024-04-09 |
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 68.6 | IG-VLM (GPT-4) | 2024-03-27 |
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | ✓ Link | 68.2 | TraveLER (GPT-4) | 2024-04-01 |
A Simple LLM Framework for Long-Range Video Question-Answering | ✓ Link | 67.7 | LLoVi (GPT-4) | 2023-12-28 |
Long Context Transfer from Language to Vision | ✓ Link | 67.1 | LongVA (32 frames) | 2024-06-24 |
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | ✓ Link | 66.3 | Q-ViD | 2024-02-16 |
Zero-Shot Video Question Answering with Procedural Programs | | 64.6 | ProViQ | 2023-12-01 |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | ✓ Link | 64.2 | SlowFast-LLaVA-34B | 2024-07-22 |
Self-Chained Image-Language Model for Video Localization and Question Answering | ✓ Link | 63.6 | SeViLA (4B) | 2023-05-11 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 61.7 | VideoChat2 | 2023-11-28 |
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | | 61.0 | DeepStack-L (7B) | 2024-06-06 |
Language Repository for Long Video Understanding | ✓ Link | 60.9 | LangRepo (12B) | 2024-03-21 |
ViperGPT: Visual Inference via Python Execution for Reasoning | ✓ Link | 60.0 | ViperGPT (GPT-3.5) | 2023-03-14 |
Understanding Long Videos with Multimodal Language Models | ✓ Link | 55.2 | MVU (13B) | 2024-03-25 |
A Simple LLM Framework for Long-Range Video Question-Answering | ✓ Link | 54.3 | LLoVi (7B) | 2023-12-28 |
Verbs in Action: Improving verb understanding in video-language models | ✓ Link | 51.5 | VFC | 2023-04-13 |
Mistral 7B | ✓ Link | 51.1 | Mistral (7B) | 2023-10-10 |