| Paper | Code | Overall | Yes/No | Number | Other | Model | Date |
|---|---|---|---|---|---|---|---|
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 84.03 | | | | BEiT-3 | 2022-08-22 |
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | ✓ Link | 83.62 | 94.83 | 69.82 | 77.02 | mPLUG-Huge | 2022-05-24 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 82.52 | 94.85 | 72.24 | 74.15 | ONE-PEACE | 2023-05-18 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 81.8 | | | | X2-VLM (large) | 2022-11-22 |
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ Link | 81.30 | 94.68 | 67.26 | 72.87 | VLMo | 2021-11-03 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 80.34 | | | | SimVLM | 2021-08-24 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 80.2 | | | | X2-VLM (base) | 2022-11-22 |
VAST | | 80.19 | | | | VAST | |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 78.62 | | | | VALOR | 2023-04-17 |
Prompt Tuning for Generative Multimodal Pretrained Models | ✓ Link | 78.53 | | | | Prompt Tuning | 2022-08-04 |
Prismer: A Vision-Language Model with Multi-Task Experts | ✓ Link | 78.49 | 93.09 | 61.39 | 69.70 | Prismer | 2023-03-04 |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 77.45 | 92.38 | 62.55 | 67.87 | MSR + MS Cog. Svcs., X10 models | 2021-01-02 |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 76.63 | 92.04 | 61.5 | 66.68 | MSR + MS Cog. Svcs. | 2021-01-02 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 76.04 | | | | ALBEF (14M) | 2021-07-16 |
Bilinear Graph Networks for Visual Question Answering | | 75.92 | 90.89 | 61.13 | 66.28 | BGN, ensemble | 2019-07-23 |
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph | | 74.93 | 90.83 | 56.79 | 65.24 | ERNIE-ViL-single model | 2020-06-30 |
In Defense of Grid Features for Visual Question Answering | ✓ Link | 74.16 | 89.18 | 58.01 | 64.77 | Single, w/o VLP | 2020-01-10 |
Deep Multimodal Neural Architecture Search | ✓ Link | 73.86 | 89.46 | 58.62 | 63.78 | Single, w/o VLP | 2020-04-25 |
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 73.4 | | | | UNITER (Large) | 2019-09-25 |
In Defense of Grid Features for Visual Question Answering | ✓ Link | 72.71 | | | | X-101 grid features + MCAN | 2020-01-10 |
LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ Link | 72.5 | | | | LXMERT | 2019-08-20 |
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | ✓ Link | 72.2 | | | | VL-BERT (large) | 2019-08-22 |
Visual Commonsense R-CNN | ✓ Link | 71.49 | | | | MCAN+VC | 2020-02-27 |
VisualBERT: A Simple and Performant Baseline for Vision and Language | ✓ Link | 71 | | | | VisualBERT | 2019-08-09 |
Deep Modular Co-Attention Networks for Visual Question Answering | ✓ Link | 70.9 | | | | MCAN (ed-6) | 2019-06-25 |
Unified Vision-Language Pre-Training for Image Captioning and VQA | ✓ Link | 70.7 | | | | Unified VLP | 2019-09-24 |
Bilinear Attention Networks | ✓ Link | 70.4 | | | | BAN+Glove+Counter | 2018-05-21 |
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | ✓ Link | 70.34 | | | | Up-Down | 2017-07-25 |
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | ✓ Link | 70.3 | | | | Image features from bottom-up attention (adaptive K, ensemble) | 2017-08-09 |
Generating Question Relevant Captions to Aid Visual Question Answering | | 69.7 | | | | Caption VQA | 2019-06-03 |
MUREL: Multimodal Relational Reasoning for Visual Question Answering | ✓ Link | 68.4 | | | | MuRel | 2019-02-25 |
Learning to Count Objects in Natural Images for Visual Question Answering | ✓ Link | 68.4 | | | | DMN | 2018-02-15 |
BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection | ✓ Link | 67.9 | | | | BLOCK | 2019-01-31 |
MUTAN: Multimodal Tucker Fusion for Visual Question Answering | ✓ Link | 67.4 | | | | MUTAN | 2017-05-18 |
Sparse and Continuous Attention Mechanisms | ✓ Link | 66.27 | | | | 2D continuous softmax | 2020-06-12 |
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ Link | 62.27 | | | | MCB | 2016-12-02 |
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ Link | 44.26 | | | | Language-only | 2016-12-02 |
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | ✓ Link | 25.98 | | | | Prior | 2016-12-02 |
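The accuracies above appear to be computed with the standard VQA evaluation metric, which scores a predicted answer against the ten human-annotated answers per question: an answer counts as fully correct if at least three annotators gave it, with partial credit otherwise, averaged over leave-one-annotator-out subsets. A minimal sketch of that metric (the function name `vqa_accuracy` is illustrative, not from any listed codebase):

```python
def vqa_accuracy(prediction: str, annotations: list[str]) -> float:
    """Standard VQA soft-accuracy for one question.

    `annotations` is the list of (typically 10) human answers.
    For each leave-one-out subset of the annotations, the score is
    min(#matching answers / 3, 1); the final accuracy averages these.
    Answers are assumed to be already normalized (lowercased, etc.).
    """
    scores = []
    for i in range(len(annotations)):
        # Drop annotator i, count matches among the remaining answers.
        others = annotations[:i] + annotations[i + 1:]
        matches = sum(1 for a in others if a == prediction)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)
```

For example, if 3 of 10 annotators answered "cat" and the model predicts "cat", the score is 0.9 rather than 1.0, since removing one of the three "cat" annotators leaves only two matches (2/3 credit) in those subsets.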