OpenCodePapers

Video Question Answering on ActivityNet-QA

Video Question Answering
Leaderboard
| Paper | Code | Accuracy | Confidence score | Model Name | Release Date |
|---|---|---|---|---|---|
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 61.2 | | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 2022-10-20 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 58.4 | | GPT-2 + CLIP-32 (Zero-Shot) | 2022-10-20 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 56.1 | | VideoCoCa | 2022-12-09 |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 51.13 | | Mirasol3B | 2023-11-09 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | 50.4 | | VAST | 2023-05-29 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ | 49.9 | | COSA | 2023-06-15 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ | 49.8 | | MA-LMM | 2024-04-08 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ | 49.1 | 3.3 | VideoChat2 | 2023-11-28 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ | 48.6 | | VALOR | 2023-04-17 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ | 47.9 | | UMT-L (ViT-L/16) | 2023-03-28 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ | 47.5 | 3.3 | LLaMA-VID-13B (2 Token) | 2023-11-28 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ | 47.4 | 3.3 | LLaMA-VID-7B (2 Token) | 2023-11-28 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ | 46.4 | 3.3 | Chat-UniVi-13B | 2023-11-14 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ | 46.1 | 3.6 | BT-Adapter (zero-shot) | 2023-09-27 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | ✓ | 45.7 | 3.1 | MovieChat | 2023-07-31 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ | 45.3 | 3.3 | Video-LLaVA | 2023-11-16 |
| TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ | 45 | | TESTA (ViT-B/16) | 2023-10-29 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 44.8 | | FrozenBiLM+ | 2023-08-18 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ | 44.7 | | VindLU | 2022-12-09 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ | 44.1 | | Singularity-temporal | 2022-06-07 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ | 43.2 | | FrozenBiLM | 2022-06-16 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ | 43.1 | | Singularity | 2022-06-07 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | ✓ | 41.4 | | Text + Text (no Multimodal Pretext Training) | 2022-06-05 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 40.0 | | All-in-one+ | 2023-08-18 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 39.7 | | VIOLET+ | 2023-08-18 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | ✓ | 38.9 | | Just Ask (fine-tune) | 2020-12-01 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | ✓ | 38.2 | | LocVLM-Vid-B+ | 2024-04-11 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | ✓ | 37.4 | | LocVLM-Vid-B | 2024-04-11 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ | 35.2 | 2.7 | Video-ChatGPT | 2023-06-08 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ | 34.2 | 2.7 | LLaMA Adapter V2 | 2023-04-28 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ | 31.8 | | E-SA | 2019-06-06 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ | 27.1 | | E-MN | 2019-06-06 |
| VideoChat: Chat-Centric Video Understanding | ✓ | 26.5 | 2.2 | Video Chat | 2023-05-10 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ | 25.9 | | FrozenBiLM (0-shot) | 2022-06-16 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ | 25.1 | | E-VQA | 2019-06-06 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | ✓ | 12.2 | | Just Ask (0-shot) | 2020-12-01 |