OpenCodePapers

Visual Reasoning on NLVR2 (dev)

NLVR2 (Natural Language for Visual Reasoning for Real) pairs a natural-language statement with two photographs; a model must decide whether the statement is true of the image pair. The leaderboard below reports accuracy on the dev split.
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
|---|---|---|---|---|
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ | 91.51 | BEiT-3 | 2022-08-22 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 88.7 | X2-VLM (large) | 2022-11-22 |
| Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ | 87.6 | XFM (base) | 2023-01-12 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 86.2 | X2-VLM (base) | 2022-11-22 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ | 86.1 | CoCa | 2022-05-04 |
| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ | 85.64 | VLMo | 2021-11-03 |
| Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 84.6 | VK-OOD | 2023-09-21 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ | 84.53 | SimVLM | 2021-08-24 |
| Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ | 84.41 | X-VLM (base) | 2021-11-16 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 83.9 | VK-OOD | 2023-02-11 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ | 83.14 | ALBEF (14M) | 2021-07-16 |
| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | ✓ | 76.37 | SOHO | 2021-04-07 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ | 75.7 | ViLT-B/32 | 2021-02-05 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ | 74.9 | LXMERT (Pre-train + scratch) | 2019-08-20 |
| VisualBERT: A Simple and Performant Baseline for Vision and Language | ✓ | 66.7 | VisualBERT | 2019-08-09 |
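For reference, the Accuracy column is the percentage of NLVR2 dev examples for which the model's true/false prediction matches the gold label. Below is a minimal sketch of that computation, assuming gold labels and predictions are stored as JSON-lines files; the `identifier`, `label`, and `prediction` field names are illustrative, not the official NLVR2 release format.

```python
import json

def nlvr2_dev_accuracy(gold_path: str, pred_path: str) -> float:
    """Accuracy = 100 * (correct true/false predictions) / (total dev examples).

    Hypothetical file layout, one JSON object per line:
      gold: {"identifier": "...", "label": "True" | "False", ...}
      pred: {"identifier": "...", "prediction": "True" | "False"}
    """
    with open(gold_path) as f:
        gold = {ex["identifier"]: ex["label"] for ex in map(json.loads, f)}
    with open(pred_path) as f:
        preds = {ex["identifier"]: ex["prediction"] for ex in map(json.loads, f)}

    # Count an example as correct only if a prediction exists and matches.
    correct = sum(1 for ident, label in gold.items() if preds.get(ident) == label)
    return 100.0 * correct / len(gold)

# Example: BEiT-3's 91.51 means 91.51% of dev statements were classified correctly.
# print(f"{nlvr2_dev_accuracy('dev.json', 'predictions.json'):.2f}")
```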