Visual Question Answering on VQA v2 test-dev
Task: Visual Question Answering
Results over time: Accuracy plotted against model release date (chart).
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
| --- | --- | --- | --- | --- |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 82.30 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | 2023-01-30 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ | 82.3 | CoCa | 2022-05-04 |
| OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ✓ | 82.0 | OFA | 2022-02-07 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 81.74 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 81.66 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | 2023-01-30 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 81.11 | mPLUG-2 | 2023-02-01 |
| Florence: A New Foundation Model for Computer Vision | ✓ | 80.16 | Florence | 2021-11-22 |
| — | — | 77.69 | Aurora (ours, r=64) | — |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 76.8 | VK-OOD | 2023-02-11 |
| LXMERT Model Compression for Visual Question Answering | ✓ | 70.72 | LXMERT (low-magnitude pruning) | 2023-10-23 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | ✓ | 56.2 | LocVLM-L | 2024-04-11 |
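
The Accuracy column follows the standard VQA consensus metric on the test-dev split: each question has ten human answers, and a predicted answer counts as fully correct once at least three annotators gave it. The snippet below is a minimal sketch of that scoring rule; it omits the answer normalization (lowercasing, article and punctuation stripping) and the averaging over leave-one-annotator-out subsets that the official evaluation script also performs.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus VQA accuracy: min(#annotators who gave this answer / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Ten annotator answers for one question; "2" was given four times.
answers = ["2", "2", "2", "2", "two", "two", "3", "2 cats", "two", "2 cats"]
print(vqa_accuracy("2", answers))    # 1.0   (>= 3 matches)
print(vqa_accuracy("3", answers))    # 0.33  (1 match)
print(vqa_accuracy("dog", answers))  # 0.0
```

A model's leaderboard score is the mean of this per-question value over the test-dev set, expressed as a percentage.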
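
For the top-ranked BLIP-2 entries, the released code lives in Salesforce's LAVIS repository. The sketch below is a hedged usage example, not the exact evaluation setup behind the scores above: it assumes the `salesforce-lavis` package, a hypothetical local `example.jpg`, and the pretrained `blip2_opt` OPT-2.7B checkpoint rather than the fine-tuned VQA weights reported in the table.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # pip install salesforce-lavis

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumption: this loads the pretrained BLIP-2 OPT-2.7B checkpoint, not the
# fine-tuned VQA model whose accuracy is listed in the leaderboard.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Open-ended VQA is phrased as text generation conditioned on a question prompt.
answer = model.generate({"image": image, "prompt": "Question: how many cats are there? Answer:"})
print(answer)
```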