OpenCodePapers

Video Question Answering on NExT-QA

Video Question Answering
Results over time

(Chart: accuracy of each leaderboard entry plotted against its model's release date.)
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
|---|---|---|---|---|
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ | 85.5 | LinVT-Qwen2-VL (7B) | 2024-12-06 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ | 85.5 | InternVL-2.5(8B) | 2024-12-06 |
| VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | ✓ | 84.5 | VideoLLaMA3(7B) | 2025-01-22 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ | 84.1 | PLM-8B | 2025-04-17 |
| BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | ✓ | 83.73 | BIMBA-LLaVA-Qwen2-7B | 2025-03-12 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ | 83.4 | PLM-3B | 2025-04-17 |
| Video Instruction Tuning With Synthetic Data | | 83.2 | LLaVA-Video | 2024-10-03 |
| NVILA: Efficient Frontier Visual Language Models | ✓ | 82.2 | NVILA(8B) | 2024-12-05 |
| Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | ✓ | 81.8 | Oryx-1.5(7B) | 2024-09-19 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 81.2 | Qwen2-VL(7B) | 2024-09-18 |
| LongVILA: Scaling Long-Context Visual Language Models for Long Videos | ✓ | 80.7 | LongVILA(7B) | 2024-08-19 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ | 80.3 | PLM-1B | 2025-04-17 |
| LLaVA-OneVision: Easy Visual Task Transfer | ✓ | 80.2 | LLaVA-OV(72B) | 2024-08-06 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 79.5 | VideoChat2_HD_mistral | 2023-11-28 |
| LLaVA-OneVision: Easy Visual Task Transfer | ✓ | 79.4 | LLaVA-OV(7B) | 2024-08-06 |
| LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ | 79.1 | LLaVA-NeXT-Interleave(14B) | 2024-07-10 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 78.6 | VideoChat2_mistral | 2023-11-28 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ | 78.6 | mPLUG-Owl3(8B) | 2024-08-09 |
| LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ | 78.2 | LLaVA-NeXT-Interleave(7B) | 2024-07-10 |
| LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ | 77.9 | LLaVA-NeXT-Interleave(DPO) | 2024-07-10 |
| Vamos: Versatile Action Models for Video Understanding | ✓ | 77.3 | Vamos | 2023-11-22 |
| ViLA: Efficient Video-Language Alignment for Video Question Answering | ✓ | 75.6 | ViLA (3B) | 2023-12-13 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 75.6 | VideoLLaMA2.1(7B) | 2024-06-11 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | ✓ | 75.5 | LLaMA-VQA (33B) | 2023-10-24 |
| ViLA: Efficient Video-Language Alignment for Video Question Answering | ✓ | 74.4 | ViLA (3B, 4 frames) | 2023-12-13 |
| CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | ✓ | 73.9 | CREMA | 2024-02-08 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | ✓ | 73.8 | SeViLA | 2023-05-11 |
| Text-Conditioned Resampler For Long Form Video Understanding | | 73.5 | TCR | 2023-12-19 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | ✓ | 72.1 | LSTP | 2024-02-25 |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 72.0 | Mirasol3B | 2023-11-09 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 68.6 | VideoChat2 | 2023-11-28 |
| RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ | 63.2 | RTQ | 2023-12-01 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 63.1 | HiTeA | 2022-12-30 |
| Contrastive Video Question Answering via Video Graph Transformer | ✓ | 60.7 | CoVGT(PT) | 2023-02-27 |
| Semi-Parametric Video-Grounded Text Generation | | 60.6 | SeViT | 2023-01-27 |
| ViperGPT: Visual Inference via Python Execution for Reasoning | ✓ | 60.0 | ViperGPT(0-shot) | 2023-03-14 |
| Contrastive Video Question Answering via Video Graph Transformer | ✓ | 60.0 | CoVGT | 2023-02-27 |
| Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | ✓ | 58.83 | GF | 2024-01-03 |
| Verbs in Action: Improving verb understanding in video-language models | ✓ | 58.6 | VFC | 2023-04-13 |
| ATM: Action Temporality Modeling for Video Question Answering | | 58.3 | ATM | 2023-09-05 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | ✓ | 57.2 | MIST | 2022-12-19 |
| Video Graph Transformer for Video Question Answering | ✓ | 56.9 | VGT(PT) | 2022-07-12 |
| Paxion: Patching Action Knowledge in Video-Language Foundation Models | ✓ | 56.9 | PAXION | 2023-05-18 |
| Video Graph Transformer for Video Question Answering | ✓ | 55.0 | VGT | 2022-07-12 |
| Revisiting the "Video" in Video-Language Understanding | ✓ | 54.3 | ATP | 2022-06-03 |
| (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering | | 53.4 | P3D-G | 2022-02-18 |
| Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | ✓ | 51.4 | HQGA | 2021-12-12 |
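For readers who want to rebuild the results-over-time view offline, below is a minimal sketch, assuming Python with pandas and matplotlib, that plots accuracy against release date for a handful of rows copied from the leaderboard above. The row subset and all variable names are illustrative and not part of the original page.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few (model, release date, accuracy) rows copied from the leaderboard above.
rows = [
    ("HQGA", "2021-12-12", 51.4),
    ("VGT", "2022-07-12", 55.0),
    ("SeViLA", "2023-05-11", 73.8),
    ("VideoChat2_HD_mistral", "2023-11-28", 79.5),
    ("LLaVA-OV(72B)", "2024-08-06", 80.2),
    ("LinVT-Qwen2-VL (7B)", "2024-12-06", 85.5),
    ("VideoLLaMA3(7B)", "2025-01-22", 84.5),
]
df = pd.DataFrame(rows, columns=["model", "release_date", "accuracy"])
df["release_date"] = pd.to_datetime(df["release_date"])
df = df.sort_values("release_date")

# Plot accuracy over time and label each point with its model name.
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["release_date"], df["accuracy"], marker="o")
for _, row in df.iterrows():
    ax.annotate(row["model"], (row["release_date"], row["accuracy"]),
                textcoords="offset points", xytext=(4, 4), fontsize=7)
ax.set_xlabel("Release date")
ax.set_ylabel("Accuracy (%)")
ax.set_title("NExT-QA video question answering: accuracy over time")
fig.tight_layout()
plt.show()
```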