| Paper | Code | Accuracy | Score | Model | Date |
|---|---|---|---|---|---|
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 61.2 | | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 2022-10-20 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 58.4 | | GPT-2 + CLIP-32 (Zero-Shot) | 2022-10-20 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 56.1 | | VideoCoCa | 2022-12-09 |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 51.13 | | Mirasol3B | 2023-11-09 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 50.4 | | VAST | 2023-05-29 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 49.9 | | COSA | 2023-06-15 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ Link | 49.8 | | MA-LMM | 2024-04-08 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 49.1 | 3.3 | VideoChat2 | 2023-11-28 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 48.6 | | VALOR | 2023-04-17 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 47.9 | | UMT-L (ViT-L/16) | 2023-03-28 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 47.5 | 3.3 | LLaMA-VID-13B (2 Token) | 2023-11-28 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 47.4 | 3.3 | LLaMA-VID-7B (2 Token) | 2023-11-28 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ Link | 46.4 | 3.3 | Chat-UniVi-13B | 2023-11-14 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 46.1 | 3.6 | BT-Adapter (zero-shot) | 2023-09-27 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | ✓ Link | 45.7 | 3.1 | MovieChat | 2023-07-31 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ Link | 45.3 | 3.3 | Video-LLaVA | 2023-11-16 |
| TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ Link | 45.0 | | TESTA (ViT-B/16) | 2023-10-29 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 44.8 | | FrozenBiLM+ | 2023-08-18 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 44.7 | | VindLU | 2022-12-09 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 44.1 | | Singularity-temporal | 2022-06-07 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 43.2 | | FrozenBiLM | 2022-06-16 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 43.1 | | Singularity | 2022-06-07 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | ✓ Link | 41.4 | | Text + Text (no Multimodal Pretext Training) | 2022-06-05 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 40.0 | | All-in-one+ | 2023-08-18 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 39.7 | | VIOLET+ | 2023-08-18 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | ✓ Link | 38.9 | | Just Ask (fine-tune) | 2020-12-01 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | ✓ Link | 38.2 | | LocVLM-Vid-B+ | 2024-04-11 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | ✓ Link | 37.4 | | LocVLM-Vid-B | 2024-04-11 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ Link | 35.2 | 2.7 | Video-ChatGPT | 2023-06-08 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 34.2 | 2.7 | LLaMA Adapter V2 | 2023-04-28 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ Link | 31.8 | | E-SA | 2019-06-06 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ Link | 27.1 | | E-MN | 2019-06-06 |
| VideoChat: Chat-Centric Video Understanding | ✓ Link | 26.5 | 2.2 | VideoChat | 2023-05-10 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 25.9 | | FrozenBiLM (0-shot) | 2022-06-16 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ Link | 25.1 | | E-VQA | 2019-06-06 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | ✓ Link | 12.2 | | Just Ask (0-shot) | 2020-12-01 |