OpenCodePapers

Visual Question Answering on VQA v2 test-dev

Visual Question Answering (VQA)
Results over time
Leaderboard
| Paper | Code | Accuracy | Model | Release Date |
|---|---|---|---|---|
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ | 84.3 | PaLI | 2022-09-14 |
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ | 84.19 | BEiT-3 | 2022-08-22 |
| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ | 82.78 | VLMo | 2021-11-03 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ | 82.6 | ONE-PEACE | 2023-05-18 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | ✓ | 82.43 | mPLUG (Huge) | 2022-05-24 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | ✓ | 82.2 | CuMo-7B | 2024-05-09 |
| X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 81.9 | X²-VLM (large) | 2022-11-22 |
| Achieving Human Parity on Visual Question Answering | | 81.26 | MMU | 2021-11-17 |
| Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 81.2 | Lyrics | 2023-12-08 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ | 81.2 | InternVL-C | 2023-12-21 |
| X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 80.4 | X²-VLM (base) | 2022-11-22 |
| Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ | 80.4 | XFM (base) | 2023-01-12 |
| | | 80.23 | VAST | |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ | 80.03 | SimVLM | 2021-08-24 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ | 78.46 | VALOR | 2023-04-17 |
| Prismer: A Vision-Language Model with Multi-Task Experts | ✓ | 78.43 | Prismer | 2023-03-04 |
| Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ | 78.22 | X-VLM (base) | 2021-11-16 |
| Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 77.9 | VK-OOD | 2023-09-21 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ | 75.84 | ALBEF (14M) | 2021-07-16 |
| Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | ✓ | 73.82 | Oscar | 2020-04-13 |
| UNITER: UNiversal Image-TExt Representation Learning | ✓ | 73.24 | UNITER (Large) | 2019-09-25 |
| In Defense of Grid Features for Visual Question Answering | ✓ | 72.59 | X-101 grid features + MCAN | 2020-01-10 |
| Coarse-to-Fine Reasoning for Visual Question Answering | ✓ | 72.5 | CFR | 2021-10-06 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | ✓ | 71.79 | VL-BERT (Large) | 2019-08-22 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ | 71.26 | ViLT-B/32 | 2021-02-05 |
| Visual Commonsense R-CNN | ✓ | 71.21 | MCAN+VC | 2020-02-27 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | ✓ | 71.16 | VL-BERT (Base) | 2019-08-22 |
| VisualBERT: A Simple and Performant Baseline for Vision and Language | ✓ | 70.8 | VisualBERT | 2019-08-09 |
| Deep Modular Co-Attention Networks for Visual Question Answering | ✓ | 70.63 | MCAN (ed-6) | 2019-06-25 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | ✓ | 70.55 | ViLBERT | 2019-08-06 |
| Bilinear Attention Networks | ✓ | 70.04 | BAN+Glove+Counter | 2018-05-21 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ | 69.9 | LXMERT (Pre-train + scratch) | 2019-08-20 |
| Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | ✓ | 69.87 | Image features from bottom-up attention (adaptive K, ensemble) | 2017-08-09 |
| Towards VQA Models That Can Read | ✓ | 69.21 | Pythia v0.3 + LoRRA | 2019-04-18 |
| Learning to Count Objects in Natural Images for Visual Question Answering | ✓ | 68.09 | DMN | 2018-02-15 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | ✓ | 68.07 | LaKo | 2022-07-26 |
| MUREL: Multimodal Relational Reasoning for Visual Question Answering | ✓ | 68.03 | MuRel | 2019-02-25 |
| BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection | ✓ | 67.58 | BLOCK | 2019-01-31 |
| MUTAN: Multimodal Tucker Fusion for Visual Question Answering | ✓ | 67.42 | MUTAN | 2017-05-18 |
| Compact Trilinear Interaction for Visual Question Answering | ✓ | 67.4 | BAN2-CTI | 2019-09-26 |
| Sparse and Continuous Attention Mechanisms | ✓ | 65.96 | 2D continuous softmax | 2020-06-12 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 65 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 2023-01-30 |
| Learning to Reason: End-to-End Module Networks for Visual Question Answering | ✓ | 64.9 | N2NMN (ResNet-152, policy search) | 2017-04-18 |
| Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | ✓ | 64.8 | PNP-VQA | 2022-10-17 |
| Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | ✓ | 64.7 | MCB | 2016-06-06 |
| RUBi: Reducing Unimodal Biases in Visual Question Answering | ✓ | 63.18 | RUBi | 2019-06-24 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 63 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 62.3 | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 2023-01-30 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 56.3 | Flamingo 80B | 2022-04-29 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 52.6 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 52.3 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 2023-01-30 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 51.8 | Flamingo 9B | 2022-04-29 |
| | | 51.0 | KOSMOS-1 1.6B (zero-shot) | |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 49.7 | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 2023-01-30 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 49.2 | Flamingo 3B | 2022-04-29 |
| Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | | 44.5 | VLKD | 2021-11-16 |
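
The Accuracy column follows the standard VQA consensus metric: each question has ten human answers, and a prediction scores min(matches / 3, 1), so an answer given by at least three annotators counts as fully correct. A minimal per-question sketch in Python, assuming answers are already normalized (the official evaluator additionally lowercases, strips articles and punctuation, and averages over leave-one-out annotator subsets):

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Per-question VQA v2 accuracy: full credit once at least 3 of the
    10 human annotators gave the predicted answer, partial credit below.
    Assumes pre-normalized answer strings; the official evaluator also
    normalizes text and averages over annotator subsets."""
    matches = sum(answer == predicted for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Two of ten annotators said "blue" -> partial credit of 2/3.
print(vqa_accuracy("blue", ["blue", "blue"] + ["navy"] * 8))  # ~0.667
```

Benchmark accuracy is this per-question score averaged over the whole test-dev split.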