OpenCodePapers

Visual Question Answering on VQA v2 test-std

Visual Question Answering (VQA)
Leaderboard
| Paper | Code | overall | yes/no | number | other | Model Name | Release Date |
|---|---|---|---|---|---|---|---|
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ | 84.03 | | | | BEiT-3 | 2022-08-22 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | ✓ | 83.62 | 94.83 | 69.82 | 77.02 | mPLUG-Huge | 2022-05-24 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ | 82.52 | 94.85 | 72.24 | 74.15 | ONE-PEACE | 2023-05-18 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 81.8 | | | | X2-VLM (large) | 2022-11-22 |
| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ | 81.30 | 94.68 | 67.26 | 72.87 | VLMo | 2021-11-03 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ | 80.34 | | | | SimVLM | 2021-08-24 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 80.2 | | | | X2-VLM (base) | 2022-11-22 |
| | | 80.19 | | | | VAST | |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ | 78.62 | | | | VALOR | 2023-04-17 |
| Prompt Tuning for Generative Multimodal Pretrained Models | ✓ | 78.53 | | | | Prompt Tuning | 2022-08-04 |
| Prismer: A Vision-Language Model with Multi-Task Experts | ✓ | 78.49 | 93.09 | 61.39 | 69.70 | Prismer | 2023-03-04 |
| VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ | 77.45 | 92.38 | 62.55 | 67.87 | MSR + MS Cog. Svcs., X10 models | 2021-01-02 |
| VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ | 76.63 | 92.04 | 61.5 | 66.68 | MSR + MS Cog. Svcs. | 2021-01-02 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ | 76.04 | | | | ALBEF (14M) | 2021-07-16 |
| Bilinear Graph Networks for Visual Question Answering | | 75.92 | 90.89 | 61.13 | 66.28 | BGN, ensemble | 2019-07-23 |
| ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph | | 74.93 | 90.83 | 56.79 | 65.24 | ERNIE-ViL-single model | 2020-06-30 |
| In Defense of Grid Features for Visual Question Answering | ✓ | 74.16 | 89.18 | 58.01 | 64.77 | Single, w/o VLP | 2020-01-10 |
| Deep Multimodal Neural Architecture Search | ✓ | 73.86 | 89.46 | 58.62 | 63.78 | Single, w/o VLP | 2020-04-25 |
| UNITER: UNiversal Image-TExt Representation Learning | ✓ | 73.4 | | | | UNITER (Large) | 2019-09-25 |
| In Defense of Grid Features for Visual Question Answering | ✓ | 72.71 | | | | X-101 grid features + MCAN | 2020-01-10 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ | 72.5 | | | | LXMERT | 2019-08-20 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | ✓ | 72.2 | | | | VL-BERTLARGE | 2019-08-22 |
| Visual Commonsense R-CNN | ✓ | 71.49 | | | | MCAN+VC | 2020-02-27 |
| VisualBERT: A Simple and Performant Baseline for Vision and Language | ✓ | 71 | | | | VisualBERT | 2019-08-09 |
| Deep Modular Co-Attention Networks for Visual Question Answering | ✓ | 70.9 | | | | MCANed-6 | 2019-06-25 |
| Unified Vision-Language Pre-Training for Image Captioning and VQA | ✓ | 70.7 | | | | Unified VLP | 2019-09-24 |
| Bilinear Attention Networks | ✓ | 70.4 | | | | BAN+Glove+Counter | 2018-05-21 |
| Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | ✓ | 70.34 | | | | Up-Down | 2017-07-25 |
| Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | ✓ | 70.3 | | | | Image features from bottom-up attention (adaptive K, ensemble) | 2017-08-09 |
| Generating Question Relevant Captions to Aid Visual Question Answering | | 69.7 | | | | Caption VQA | 2019-06-03 |
| MUREL: Multimodal Relational Reasoning for Visual Question Answering | ✓ | 68.4 | | | | MuRel | 2019-02-25 |
| Learning to Count Objects in Natural Images for Visual Question Answering | ✓ | 68.4 | | | | DMN | 2018-02-15 |
| BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection | ✓ | 67.9 | | | | BLOCK | 2019-01-31 |
| MUTAN: Multimodal Tucker Fusion for Visual Question Answering | ✓ | 67.4 | | | | MUTAN | 2017-05-18 |
| Sparse and Continuous Attention Mechanisms | ✓ | 66.27 | | | | 2D continuous softmax | 2020-06-12 |
| Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ | 62.27 | | | | MCB [11, 12] | 2016-12-02 |
| Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ | 44.26 | | | | Language-only | 2016-12-02 |
| Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ | 25.98 | | | | Prior | 2016-12-02 |