Paper | Code | Accuracy | Model | Date |
--- | --- | --- | --- | --- |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 84.3 | PaLI | 2022-09-14 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 84.19 | BEiT-3 | 2022-08-22 |
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ Link | 82.78 | VLMo | 2021-11-03 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 82.6 | ONE-PEACE | 2023-05-18 |
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | ✓ Link | 82.43 | mPLUG (Huge) | 2022-05-24 |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | ✓ Link | 82.2 | CuMo-7B | 2024-05-09 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 81.9 | X2-VLM (large) | 2022-11-22 |
Achieving Human Parity on Visual Question Answering | | 81.26 | MMU | 2021-11-17 |
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 81.2 | Lyrics | 2023-12-08 |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 81.2 | InternVL-C | 2023-12-21 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 80.4 | X2-VLM (base) | 2022-11-22 |
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ Link | 80.4 | XFM (base) | 2023-01-12 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | | 80.23 | VAST | 2023-05-29 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 80.03 | SimVLM | 2021-08-24 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 78.46 | VALOR | 2023-04-17 |
Prismer: A Vision-Language Model with Multi-Task Experts | ✓ Link | 78.43 | Prismer | 2023-03-04 |
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 78.22 | X-VLM (base) | 2021-11-16 |
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ Link | 77.9 | VK-OOD | 2023-09-21 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 75.84 | ALBEF (14M) | 2021-07-16 |
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | ✓ Link | 73.82 | Oscar | 2020-04-13 |
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 73.24 | UNITER (Large) | 2019-09-25 |
In Defense of Grid Features for Visual Question Answering | ✓ Link | 72.59 | X-101 grid features + MCAN | 2020-01-10 |
Coarse-to-Fine Reasoning for Visual Question Answering | ✓ Link | 72.5 | CFR | 2021-10-06 |
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | ✓ Link | 71.79 | VL-BERT (Large) | 2019-08-22 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 71.26 | ViLT-B/32 | 2021-02-05 |
Visual Commonsense R-CNN | ✓ Link | 71.21 | MCAN+VC | 2020-02-27 |
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | ✓ Link | 71.16 | VL-BERT (Base) | 2019-08-22 |
VisualBERT: A Simple and Performant Baseline for Vision and Language | ✓ Link | 70.8 | VisualBERT | 2019-08-09 |
Deep Modular Co-Attention Networks for Visual Question Answering | ✓ Link | 70.63 | MCAN (ed-6) | 2019-06-25 |
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | ✓ Link | 70.55 | ViLBERT | 2019-08-06 |
Bilinear Attention Networks | ✓ Link | 70.04 | BAN+GloVe+Counter | 2018-05-21 |
LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ Link | 69.9 | LXMERT (Pre-train + scratch) | 2019-08-20 |
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | ✓ Link | 69.87 | Image features from bottom-up attention (adaptive K, ensemble) | 2017-08-09 |
Towards VQA Models That Can Read | ✓ Link | 69.21 | Pythia v0.3 + LoRRA | 2019-04-18 |
Learning to Count Objects in Natural Images for Visual Question Answering | ✓ Link | 68.09 | DMN | 2018-02-15 |
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | ✓ Link | 68.07 | LaKo | 2022-07-26 |
MUREL: Multimodal Relational Reasoning for Visual Question Answering | ✓ Link | 68.03 | MuRel | 2019-02-25 |
BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection | ✓ Link | 67.58 | BLOCK | 2019-01-31 |
MUTAN: Multimodal Tucker Fusion for Visual Question Answering | ✓ Link | 67.42 | MUTAN | 2017-05-18 |
Compact Trilinear Interaction for Visual Question Answering | ✓ Link | 67.4 | BAN2-CTI | 2019-09-26 |
Sparse and Continuous Attention Mechanisms | ✓ Link | 65.96 | 2D continuous softmax | 2020-06-12 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 65 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 2023-01-30 |
Learning to Reason: End-to-End Module Networks for Visual Question Answering | ✓ Link | 64.9 | N2NMN (ResNet-152, policy search) | 2017-04-18 |
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | ✓ Link | 64.8 | PNP-VQA | 2022-10-17 |
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | ✓ Link | 64.7 | MCB | 2016-06-06 |
RUBi: Reducing Unimodal Biases in Visual Question Answering | ✓ Link | 63.18 | RUBi | 2019-06-24 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 63 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 2023-01-30 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 62.3 | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 2023-01-30 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 56.3 | Flamingo 80B | 2022-04-29 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 52.6 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 2023-01-30 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 52.3 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 2023-01-30 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 51.8 | Flamingo 9B | 2022-04-29 |
Language Is Not All You Need: Aligning Perception with Language Models | | 51.0 | KOSMOS-1 1.6B (zero-shot) | 2023-02-27 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 49.7 | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 2023-01-30 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 49.2 | Flamingo 3B | 2022-04-29 |
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | | 44.5 | VLKD | 2021-11-16 |
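A note on the Accuracy column: these figures appear to be reported under the standard VQA accuracy metric (Antol et al., 2015), which gives a prediction full credit when at least 3 of the 10 human annotators gave the same answer, averaged over leave-one-annotator-out subsets. Below is a minimal sketch of that metric, assuming the scores above follow this convention; the function name and the example answers are illustrative, and the official evaluator additionally normalizes answer strings (lowercasing, stripping punctuation and articles), which is omitted here.

```python
from typing import List

def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy (Antol et al., 2015).

    For each annotator, leave their answer out, count how many of the
    remaining answers match the prediction, score min(matches / 3, 1),
    then average the scores across all leave-one-out subsets.
    """
    scores = []
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(ans == prediction for ans in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Illustrative example: 4 of 10 annotators answered "2", so predicting
# "2" scores 1.0; "3" appears once, so predicting it scores only 0.3.
answers = ["2", "2", "2", "2", "two", "two", "3", "2 cats", "two", "2.0"]
print(vqa_accuracy("2", answers))  # 1.0
print(vqa_accuracy("3", answers))  # 0.3
```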