Paper | Code | Accuracy | Model | Date |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ Link | 85.5 | LinVT-Qwen2-VL (7B) | 2024-12-06 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 85.5 | InternVL-2.5(8B) | 2024-12-06 |
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | ✓ Link | 84.5 | VideoLLaMA3(7B) | 2025-01-22 |
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 84.1 | PLM-8B | 2025-04-17 |
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | ✓ Link | 83.73 | BIMBA-LLaVA-Qwen2-7B | 2025-03-12 |
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 83.4 | PLM-3B | 2025-04-17 |
Video Instruction Tuning With Synthetic Data | | 83.2 | LLaVA-Video | 2024-10-03 |
NVILA: Efficient Frontier Visual Language Models | ✓ Link | 82.2 | NVILA(8B) | 2024-12-05 |
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | ✓ Link | 81.8 | Oryx-1.5(7B) | 2024-09-19 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 81.2 | Qwen2-VL(7B) | 2024-09-18 |
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | ✓ Link | 80.7 | LongVILA(7B) | 2024-08-19 |
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 80.3 | PLM-1B | 2025-04-17 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 80.2 | LLaVA-OV(72B) | 2024-08-06 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 79.5 | VideoChat2_HD_mistral | 2023-11-28 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 79.4 | LLaVA-OV(7B) | 2024-08-06 |
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ Link | 79.1 | LLaVA-NeXT-Interleave(14B) | 2024-07-10 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 78.6 | VideoChat2_mistral | 2023-11-28 |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ Link | 78.6 | mPLUG-Owl3(8B) | 2024-08-09 |
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ Link | 78.2 | LLaVA-NeXT-Interleave(7B) | 2024-07-10 |
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ Link | 77.9 | LLaVA-NeXT-Interleave(DPO) | 2024-07-10 |
Vamos: Versatile Action Models for Video Understanding | ✓ Link | 77.3 | Vamos | 2023-11-22 |
ViLA: Efficient Video-Language Alignment for Video Question Answering | ✓ Link | 75.6 | ViLA (3B) | 2023-12-13 |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 75.6 | VideoLLaMA2.1(7B) | 2024-06-11 |
Large Language Models are Temporal and Causal Reasoners for Video Question Answering | ✓ Link | 75.5 | LLaMA-VQA (33B) | 2023-10-24 |
ViLA: Efficient Video-Language Alignment for Video Question Answering | ✓ Link | 74.4 | ViLA (3B, 4 frames) | 2023-12-13 |
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | ✓ Link | 73.9 | CREMA | 2024-02-08 |
Self-Chained Image-Language Model for Video Localization and Question Answering | ✓ Link | 73.8 | SeViLA | 2023-05-11 |
Text-Conditioned Resampler For Long Form Video Understanding | | 73.5 | TCR | 2023-12-19 |
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | ✓ Link | 72.1 | LSTP | 2024-02-25 |
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 72.0 | Mirasol3B | 2023-11-09 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 68.6 | VideoChat2 | 2023-11-28 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 63.2 | RTQ | 2023-12-01 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 63.1 | HiTeA | 2022-12-30 |
Contrastive Video Question Answering via Video Graph Transformer | ✓ Link | 60.7 | CoVGT(PT) | 2023-02-27 |
Semi-Parametric Video-Grounded Text Generation | | 60.6 | SeViT | 2023-01-27 |
ViperGPT: Visual Inference via Python Execution for Reasoning | ✓ Link | 60.0 | ViperGPT(0-shot) | 2023-03-14 |
Contrastive Video Question Answering via Video Graph Transformer | ✓ Link | 60.0 | CoVGT | 2023-02-27 |
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | ✓ Link | 58.83 | GF | 2024-01-03 |
Verbs in Action: Improving verb understanding in video-language models | ✓ Link | 58.6 | VFC | 2023-04-13 |
ATM: Action Temporality Modeling for Video Question Answering | | 58.3 | ATM | 2023-09-05 |
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | ✓ Link | 57.2 | MIST | 2022-12-19 |
Video Graph Transformer for Video Question Answering | ✓ Link | 56.9 | VGT(PT) | 2022-07-12 |
Paxion: Patching Action Knowledge in Video-Language Foundation Models | ✓ Link | 56.9 | PAXION | 2023-05-18 |
Video Graph Transformer for Video Question Answering | ✓ Link | 55.0 | VGT | 2022-07-12 |
Revisiting the "Video" in Video-Language Understanding | ✓ Link | 54.3 | ATP | 2022-06-03 |
(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering | | 53.4 | P3D-G | 2022-02-18 |
Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | ✓ Link | 51.4 | HQGA | 2021-12-12 |