| Paper | Code | Overall | Yes/No | Number | Other | Model | Date |
|---|---|---|---|---|---|---|---|
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 84.03 | | | | BEiT-3 | 2022-08-22 |
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | ✓ Link | 83.62 | 94.83 | 69.82 | 77.02 | mPLUG-Huge | 2022-05-24 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 82.52 | 94.85 | 72.24 | 74.15 | ONE-PEACE | 2023-05-18 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 81.8 | | | | X2-VLM (large) | 2022-11-22 |
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ Link | 81.30 | 94.68 | 67.26 | 72.87 | VLMo | 2021-11-03 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 80.34 | | | | SimVLM | 2021-08-24 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 80.2 | | | | X2-VLM (base) | 2022-11-22 |
VAST | | 80.19 | | | | VAST | |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 78.62 | | | | VALOR | 2023-04-17 |
Prompt Tuning for Generative Multimodal Pretrained Models | ✓ Link | 78.53 | | | | Prompt Tuning | 2022-08-04 |
Prismer: A Vision-Language Model with Multi-Task Experts | ✓ Link | 78.49 | 93.09 | 61.39 | 69.70 | Prismer | 2023-03-04 |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 77.45 | 92.38 | 62.55 | 67.87 | MSR + MS Cog. Svcs., X10 models | 2021-01-02 |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 76.63 | 92.04 | 61.5 | 66.68 | MSR + MS Cog. Svcs. | 2021-01-02 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 76.04 | | | | ALBEF (14M) | 2021-07-16 |
Bilinear Graph Networks for Visual Question Answering | | 75.92 | 90.89 | 61.13 | 66.28 | BGN, ensemble | 2019-07-23 |
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph | | 74.93 | 90.83 | 56.79 | 65.24 | ERNIE-ViL-single model | 2020-06-30 |
In Defense of Grid Features for Visual Question Answering | ✓ Link | 74.16 | 89.18 | 58.01 | 64.77 | Single, w/o VLP | 2020-01-10 |
Deep Multimodal Neural Architecture Search | ✓ Link | 73.86 | 89.46 | 58.62 | 63.78 | Single, w/o VLP | 2020-04-25 |
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 73.4 | | | | UNITER (Large) | 2019-09-25 |
In Defense of Grid Features for Visual Question Answering | ✓ Link | 72.71 | | | | X-101 grid features + MCAN | 2020-01-10 |
LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ Link | 72.5 | | | | LXMERT | 2019-08-20 |
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | ✓ Link | 72.2 | | | | VL-BERT (large) | 2019-08-22 |
Visual Commonsense R-CNN | ✓ Link | 71.49 | | | | MCAN+VC | 2020-02-27 |
VisualBERT: A Simple and Performant Baseline for Vision and Language | ✓ Link | 71 | | | | VisualBERT | 2019-08-09 |
Deep Modular Co-Attention Networks for Visual Question Answering | ✓ Link | 70.9 | | | | MCAN (ed-6) | 2019-06-25 |
Unified Vision-Language Pre-Training for Image Captioning and VQA | ✓ Link | 70.7 | | | | Unified VLP | 2019-09-24 |
Bilinear Attention Networks | ✓ Link | 70.4 | | | | BAN+Glove+Counter | 2018-05-21 |
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | ✓ Link | 70.34 | | | | Up-Down | 2017-07-25 |
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | ✓ Link | 70.3 | | | | Image features from bottom-up attention (adaptive K, ensemble) | 2017-08-09 |
Generating Question Relevant Captions to Aid Visual Question Answering | | 69.7 | | | | Caption VQA | 2019-06-03 |
MUREL: Multimodal Relational Reasoning for Visual Question Answering | ✓ Link | 68.4 | | | | MuRel | 2019-02-25 |
Learning to Count Objects in Natural Images for Visual Question Answering | ✓ Link | 68.4 | | | | DMN | 2018-02-15 |
BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection | ✓ Link | 67.9 | | | | BLOCK | 2019-01-31 |
MUTAN: Multimodal Tucker Fusion for Visual Question Answering | ✓ Link | 67.4 | | | | MUTAN | 2017-05-18 |
Sparse and Continuous Attention Mechanisms | ✓ Link | 66.27 | | | | 2D continuous softmax | 2020-06-12 |
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ Link | 62.27 | | | | MCB | 2016-12-02 |
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ Link | 44.26 | | | | Language-only | 2016-12-02 |
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ Link | 25.98 | | | | Prior | 2016-12-02 |
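The accuracies above appear to be computed with the standard VQA evaluation metric, which scores a predicted answer against the ten human-annotated answers per question: an answer counts as fully correct if at least three annotators gave it, with partial credit otherwise, averaged over leave-one-annotator-out subsets. A minimal sketch of that metric (the function name `vqa_accuracy` is illustrative, not from any listed codebase):

```python
def vqa_accuracy(prediction: str, annotations: list[str]) -> float:
    """Standard VQA soft-accuracy for one question.

    `annotations` is the list of (typically 10) human answers.
    For each leave-one-out subset of the annotations, the score is
    min(#matching answers / 3, 1); the final accuracy averages these.
    Answers are assumed to be already normalized (lowercased, etc.).
    """
    scores = []
    for i in range(len(annotations)):
        # Drop annotator i, count matches among the remaining answers.
        others = annotations[:i] + annotations[i + 1:]
        matches = sum(1 for a in others if a == prediction)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)
```

For example, if 3 of 10 annotators answered "cat" and the model predicts "cat", the score is 0.9 rather than 1.0, since removing one of the three "cat" annotators leaves only two matches (2/3 credit) in those subsets.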