OpenCodePapers

Visual Reasoning on NLVR2 (dev)

NLVR2 (Natural Language for Visual Reasoning for Real) pairs a natural-language statement with two photographs; a model must decide whether the statement is true of the image pair. The leaderboard below reports accuracy on the dev split.
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
|---|---|---|---|---|
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ | 91.51 | BEiT-3 | 2022-08-22 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 88.7 | X2-VLM (large) | 2022-11-22 |
| Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ | 87.6 | XFM (base) | 2023-01-12 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 86.2 | X2-VLM (base) | 2022-11-22 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ | 86.1 | CoCa | 2022-05-04 |
| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ | 85.64 | VLMo | 2021-11-03 |
| Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 84.6 | VK-OOD | 2023-09-21 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ | 84.53 | SimVLM | 2021-08-24 |
| Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ | 84.41 | X-VLM (base) | 2021-11-16 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 83.9 | VK-OOD | 2023-02-11 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ | 83.14 | ALBEF (14M) | 2021-07-16 |
| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | ✓ | 76.37 | SOHO | 2021-04-07 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ | 75.7 | ViLT-B/32 | 2021-02-05 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ | 74.9 | LXMERT (Pre-train + scratch) | 2019-08-20 |
| VisualBERT: A Simple and Performant Baseline for Vision and Language | ✓ | 66.7 | VisualBERT | 2019-08-09 |
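For reference, the Accuracy column is the percentage of NLVR2 dev examples for which the model's true/false prediction matches the gold label. Below is a minimal sketch of that computation, assuming gold labels and predictions are stored as JSON-lines files; the `identifier`, `label`, and `prediction` field names are illustrative, not the official NLVR2 release format.

```python
import json

def nlvr2_dev_accuracy(gold_path: str, pred_path: str) -> float:
    """Accuracy = 100 * (correct true/false predictions) / (total dev examples).

    Hypothetical file layout, one JSON object per line:
      gold: {"identifier": "...", "label": "True" | "False", ...}
      pred: {"identifier": "...", "prediction": "True" | "False"}
    """
    with open(gold_path) as f:
        gold = {ex["identifier"]: ex["label"] for ex in map(json.loads, f)}
    with open(pred_path) as f:
        preds = {ex["identifier"]: ex["prediction"] for ex in map(json.loads, f)}

    # Count an example as correct only if a prediction exists and matches.
    correct = sum(1 for ident, label in gold.items() if preds.get(ident) == label)
    return 100.0 * correct / len(gold)

# Example: BEiT-3's 91.51 means 91.51% of dev statements were classified correctly.
# print(f"{nlvr2_dev_accuracy('dev.json', 'predictions.json'):.2f}")
```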