OpenCodePapers

Zero-Shot Video Question Answer on EgoSchema

Leaderboard
| Paper | Code | Accuracy (%) | Model Name | Release Date |
|---|---|---|---|---|
| BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | ✓ | 71.14 | BIMBA-LLaVA-Qwen2-7B | 2025-03-12 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ | 69.5 | LinVT-Qwen2-VL(7B) | 2024-12-06 |
| Qwen2.5-Omni Technical Report | ✓ | 68.6 | Qwen2.5-Omni | 2025-03-26 |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | ✓ | 67.6 | LongVU (7B) | 2024-10-22 |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | ✓ | 66.7 | Video-RAG (Based on LLaVA-Video) | 2024-11-20 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 63.9 | VideoLLaMA2 (72B) | 2024-06-11 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ | 61.7 | Tarsier (34B) | 2024-06-30 |
| Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | ✓ | 61.1 | LVNet | 2024-06-13 |
| VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | ✓ | 61.1 | VideoTree (GPT4) | 2024-05-29 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 60.2 | InternVideo2-6B | 2024-03-22 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ | 60.0 | VideoChat-T (7B) | 2024-10-25 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 56.7 | VideoChat2_phi3 | 2023-11-28 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 55.8 | VideoChat2_HD_mistral | 2023-11-28 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 54.4 | VideoChat2_mistral | 2023-11-28 |
| Vamos: Versatile Action Models for Video Understanding | ✓ | 53.6 | Vamos (GPT-4o) | 2023-11-22 |
| TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | ✓ | 53.3 | TraveLER | 2024-04-01 |
| A Simple LLM Framework for Long-Range Video Question-Answering | ✓ | 50.3 | LLoVi (GPT-3.5) | 2023-12-28 |
| Video ReCap: Recursive Captioning of Hour-Long Videos | ✓ | 50.23 | Video ReCap | 2024-02-20 |
| Vamos: Versatile Action Models for Video Understanding | ✓ | 48.3 | Vamos (GPT-4) | 2023-11-22 |
| Language Repository for Long Video Understanding | ✓ | 41.2 | LangRepo (12B) | 2024-03-21 |
| Understanding Long Videos with Multimodal Language Models | ✓ | 37.6 | MVU (13B) | 2024-03-25 |
| Vamos: Versatile Action Models for Video Understanding | ✓ | 36.7 | Vamos (13B) | 2023-11-22 |
| A Simple LLM Framework for Long-Range Video Question-Answering | ✓ | 33.5 | LLoVi (7B) | 2023-12-28 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | ✓ | 33.0 | TimeChat (7B) | 2023-12-04 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 32.1 | InternVideo | 2022-12-06 |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | ✓ | 31.1 | mPLUG-Owl (7B) | 2023-04-27 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ | 26.9 | FrozenBiLM | 2022-06-16 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | ✓ | 22.7 | SeViLA (4B) | 2023-05-11 |
| - | - | 20.0 | Random | - |
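
The 20.0 figure for the Random entry matches what uniform guessing would score on multiple-choice questions with five answer options, which is assumed here to be the EgoSchema question format. The sketch below is illustrative only and is not taken from any listed paper; the function names `chance_accuracy` and `accuracy` are hypothetical.

```python
from fractions import Fraction
from typing import Sequence


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy (percent) of uniform random guessing.

    Assumes exactly one correct answer among `num_options` choices.
    """
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return float(Fraction(100, num_options))


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Percentage of questions where the predicted option equals the answer."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Assumption: five options per question, giving the chance-level baseline.
baseline = chance_accuracy(5)
```

Under that assumption, any score in the table can be read against `baseline` to see how far a model sits above chance.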