OpenCodePapers
Visual Reasoning on NLVR2 (test)
Visual Reasoning
Dataset Link
Results over time
Leaderboard
Paper | Code | Accuracy (%) | Model | Release Date
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 92.58 | BEiT-3 | 2022-08-22
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 89.4 | X2-VLM (large) | 2022-11-22
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ Link | 88.4 | XFM (base) | 2023-01-12
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 87.0 | CoCa | 2022-05-04
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 87.0 | X2-VLM (base) | 2022-11-22
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ Link | 86.86 | VLMo | 2021-11-03
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 85.15 | SimVLM | 2021-08-24
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 84.76 | X-VLM (base) | 2021-11-16
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ✓ Link | 83.09 | BLIP-129M | 2022-01-28
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 82.55 | ALBEF (14M) | 2021-07-16
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 79.5 | UNITER (Large) | 2019-09-25
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | ✓ Link | 77.32 | SOHO | 2021-04-07
LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ Link | 76.2 | LXMERT | 2019-08-20
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 76.13 | ViLT-B/32 | 2021-02-05