OpenCodePapers

Visual Question Answering (VQA) on MSRVTT-QA
Leaderboard
| Paper | Code | Accuracy | Model | Release Date |
|---|---|---|---|---|
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 0.496 | VLAB | 2023-05-22 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ | 0.495 | MaMMUT | 2023-03-29 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 0.480 | mPLUG-2 | 2023-02-01 |
| MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 0.478 | MuLTI | 2023-03-10 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 0.474 | Flamingo | 2022-04-29 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 0.471 | InternVideo | 2022-12-06 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ | 0.471 | UMT-L (ViT-L/16) | 2023-03-28 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.470 | FrozenBiLM+ | 2023-08-18 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ | 0.470 | vid-TLDR (UMT-L) | 2024-03-20 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 0.463 | VideoCoCa | 2022-12-09 |
| Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ | 0.462 | HBI | 2023-03-25 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0.459 | HiTeA | 2022-12-30 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ | 0.458 | EMCL-Net | 2022-11-21 |
| Video Question Answering with Iterative Video-Text Co-Tokenization | | 0.457 | Co-Tokenization | 2022-08-01 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 0.455 | X2-VLM (large) | 2022-11-22 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 0.450 | X2-VLM (base) | 2022-11-22 |
| All in One: Exploring Unified Video-Language Pre-training | ✓ | 0.443 | All-in-one-B | 2022-03-14 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 0.441 | OmniVL | 2022-09-15 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ | 0.441 | Clover | 2022-07-16 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ | 0.440 | AIO+MIF | 2023-07-09 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ | 0.438 | AIO+MDF | 2023-07-09 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ | 0.423 | GIT+MDF | 2023-07-09 |
| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ | 0.421 | ALPRO | 2021-12-17 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | ✓ | 0.420 | LRCE | 2023-06-30 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.418 | JustAsk+ | 2023-08-18 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.395 | All-in-one+ | 2023-08-18 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | ✓ | 0.374 | CLIPBERT | 2021-02-11 |
| Hierarchical Conditional Relation Networks for Video Question Answering | ✓ | 0.356 | HCRN | 2020-02-25 |
| DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | ✓ | 0.355 | DualVGR | 2021-07-10 |
| Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | ✓ | 0.330 | HMEMA | 2019-04-08 |
| Motion-Appearance Co-Memory Networks for Video Question Answering | | 0.320 | Co-Mem | 2018-03-29 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 0.310 | Flamingo (32-shot) | 2022-04-29 |
| TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | ✓ | 0.309 | ST-VQA | 2017-04-14 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 0.174 | Flamingo (0-shot) | 2022-04-29 |
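The original page also plotted state-of-the-art accuracy over time. That trend can be recomputed from the leaderboard itself: sort entries by release date and keep each model that improves on the best accuracy seen so far. A minimal sketch, with a handful of rows hardcoded from the table above (the helper name `running_best` is ours, not part of any leaderboard API):

```python
# Sketch: recompute the state-of-the-art-over-time trend from leaderboard rows.
# Rows are (release date, accuracy, model), copied from the table above.
from datetime import date

rows = [
    (date(2017, 4, 14), 0.309, "ST-VQA"),
    (date(2019, 4, 8), 0.330, "HMEMA"),
    (date(2021, 2, 11), 0.374, "CLIPBERT"),
    (date(2022, 4, 29), 0.474, "Flamingo"),
    (date(2023, 5, 22), 0.496, "VLAB"),
]

def running_best(rows):
    """Yield entries that improve on the best accuracy seen so far, in date order."""
    best = float("-inf")
    for day, acc, model in sorted(rows):
        if acc > best:
            best = acc
            yield day, acc, model

for day, acc, model in running_best(rows):
    print(f"{day}  {acc:.3f}  {model}")
```

With the full table loaded, the same function reproduces the curve the interactive chart displayed: a new point appears only when a model beats the previous best.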