OpenCodePapers

Visual Reasoning on NLVR2 (test)

Task: Visual Reasoning
Dataset: NLVR2 (link)
Results over time (chart of accuracy vs. model release date)
Leaderboard
Paper | Code | Accuracy (%) | Model | Release Date
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 92.58 | BEiT-3 | 2022-08-22
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 89.4 | X2-VLM (large) | 2022-11-22
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ Link | 88.4 | XFM (base) | 2023-01-12
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 87.0 | CoCa | 2022-05-04
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 87.0 | X2-VLM (base) | 2022-11-22
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ Link | 86.86 | VLMo | 2021-11-03
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 85.15 | SimVLM | 2021-08-24
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 84.76 | X-VLM (base) | 2021-11-16
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ✓ Link | 83.09 | BLIP-129M | 2022-01-28
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 82.55 | ALBEF (14M) | 2021-07-16
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 79.5 | UNITER (Large) | 2019-09-25
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | ✓ Link | 77.32 | SOHO | 2021-04-07
LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ Link | 76.2 | LXMERT | 2019-08-20
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 76.13 | ViLT-B/32 | 2021-02-05
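As a rough illustration of the "results over time" view, the sketch below re-plots the leaderboard rows above (model, NLVR2 test accuracy, release date). The data values come straight from the table; the use of matplotlib and all plotting details are assumptions for illustration, not part of the original page.

```python
# Minimal sketch: plot the leaderboard rows above as accuracy vs. release date.
# Assumes matplotlib is installed; values are copied from the table.
from datetime import date
import matplotlib.pyplot as plt

entries = [
    ("BEiT-3",         92.58, date(2022, 8, 22)),
    ("X2-VLM (large)", 89.40, date(2022, 11, 22)),
    ("XFM (base)",     88.40, date(2023, 1, 12)),
    ("CoCa",           87.00, date(2022, 5, 4)),
    ("X2-VLM (base)",  87.00, date(2022, 11, 22)),
    ("VLMo",           86.86, date(2021, 11, 3)),
    ("SimVLM",         85.15, date(2021, 8, 24)),
    ("X-VLM (base)",   84.76, date(2021, 11, 16)),
    ("BLIP-129M",      83.09, date(2022, 1, 28)),
    ("ALBEF (14M)",    82.55, date(2021, 7, 16)),
    ("UNITER (Large)", 79.50, date(2019, 9, 25)),
    ("SOHO",           77.32, date(2021, 4, 7)),
    ("LXMERT",         76.20, date(2019, 8, 20)),
    ("ViLT-B/32",      76.13, date(2021, 2, 5)),
]

# Sort chronologically so the line traces the progression of results.
entries.sort(key=lambda e: e[2])
dates = [e[2] for e in entries]
accs = [e[1] for e in entries]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(dates, accs, marker="o")
for name, acc, d in entries:
    # Label each point with its model name.
    ax.annotate(name, (d, acc), fontsize=7, xytext=(3, 3), textcoords="offset points")
ax.set_xlabel("Release date")
ax.set_ylabel("Accuracy on NLVR2 test (%)")
ax.set_title("Visual Reasoning on NLVR2 test: results over time")
fig.tight_layout()
plt.show()
```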