Paper | Code | Accuracy | Model | Date
--- | --- | --- | --- | ---
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 0.61 | VLAB | 2023-05-22
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ Link | 0.606 | MA-LMM | 2024-04-08 |
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ Link | 0.602 | MaMMUT | 2023-03-29
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 0.60 | VALOR | 2023-04-17 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 0.60 | VAST | 2023-05-29 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 0.60 | COSA | 2023-06-15 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 0.581 | mPLUG-2 | 2023-02-01 |
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 0.569 | VideoCoCa | 2022-12-09 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 0.568 | GIT | 2022-05-27 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.558 | FrozenBiLM+ | 2023-08-18 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0.556 | HiTeA | 2022-12-30 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 0.555 | InternVideo | 2022-12-06 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 0.552 | UMT-L (ViT-L/16) | 2023-03-28 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 0.549 | vid-TLDR (UMT-L) | 2024-03-20 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 0.547 | VIOLETv2 | 2022-09-04 |
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 0.547 | MuLTI | 2023-03-10 |
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 0.546 | X2-VLM (large) | 2022-11-22
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 0.528 | X2-VLM (base) | 2022-11-22
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 0.524 | Clover | 2022-07-16 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 0.517 | VIOLET + MELTR | 2023-03-23 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 0.510 | OmniVL | 2022-09-15
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.495 | VIOLET+ | 2023-08-18 |
Video Question Answering with Iterative Video-Text Co-Tokenization | | 0.486 | Co-Tokenization | 2022-08-01
All in One: Exploring Unified Video-Language Pre-training | ✓ Link | 0.483 | All-in-one-B | 2022-03-14 |
Lightweight Recurrent Cross-modal Encoder for Video Question Answering | ✓ Link | 0.478 | LRCE | 2023-06-30 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.477 | JustAsk+ | 2023-08-18 |
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ Link | 0.469 | GIT+MDF | 2023-07-09
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ Link | 0.467 | AIO+MIF | 2023-07-09
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 0.459 | ALPRO | 2021-12-17 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.438 | All-in-one+ | 2023-08-18 |
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | ✓ Link | 0.390 | DualVGR | 2021-07-10 |
Hierarchical Conditional Relation Networks for Video Question Answering | ✓ Link | 0.361 | HCRN | 2020-02-25 |
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ Link | 0.351 | SSML | 2020-03-06 |
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | ✓ Link | 0.337 | HMEMA | 2019-04-08 |
Motion-Appearance Co-Memory Networks for Video Question Answering | | 0.317 | Co-Mem | 2018-03-29 |
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | ✓ Link | 0.313 | ST-VQA | 2017-04-14 |
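
The table can also be consumed programmatically. Below is a minimal sketch (my own, not from any listed paper) that parses rows in the Paper | Code | Accuracy | Model | Date layout above into records and orders them by accuracy; the function name `parse_leaderboard` and the dict keys are illustrative assumptions.

```python
def parse_leaderboard(text: str) -> list[dict]:
    """Parse the pipe-separated leaderboard above into sorted records."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        if len(cells) != 5:
            continue  # not a 5-column table row
        paper, code, acc, model, date = cells
        try:
            accuracy = float(acc)  # e.g. "0.606"
        except ValueError:
            continue  # skip the header row and the "---" separator
        rows.append({
            "paper": paper,
            "has_code": "✓" in code,  # checkmark marks a released implementation
            "accuracy": accuracy,
            "model": model,
            "date": date,
        })
    # Best-performing entries first, matching the table's ordering.
    return sorted(rows, key=lambda r: r["accuracy"], reverse=True)
```

For example, `parse_leaderboard(table_text)[0]["model"]` returns `VLAB`, the current top entry.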