Visual Question Answering on VQA v2 test-dev
Task: Visual Question Answering
Results over time: Accuracy plotted against model release date (chart).
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
| --- | --- | --- | --- | --- |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 82.30 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | 2023-01-30 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ | 82.3 | CoCa | 2022-05-04 |
| OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ✓ | 82.0 | OFA | 2022-02-07 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 81.74 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 81.66 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | 2023-01-30 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 81.11 | mPLUG-2 | 2023-02-01 |
| Florence: A New Foundation Model for Computer Vision | ✓ | 80.16 | Florence | 2021-11-22 |
| — | — | 77.69 | Aurora (ours, r=64) | — |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 76.8 | VK-OOD | 2023-02-11 |
| LXMERT Model Compression for Visual Question Answering | ✓ | 70.72 | LXMERT (low-magnitude pruning) | 2023-10-23 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | ✓ | 56.2 | LocVLM-L | 2024-04-11 |
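
The Accuracy column follows the standard VQA consensus metric on the test-dev split: each question has ten human answers, and a predicted answer counts as fully correct once at least three annotators gave it. The snippet below is a minimal sketch of that scoring rule; it omits the answer normalization (lowercasing, article and punctuation stripping) and the averaging over leave-one-annotator-out subsets that the official evaluation script also performs.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus VQA accuracy: min(#annotators who gave this answer / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Ten annotator answers for one question; "2" was given four times.
answers = ["2", "2", "2", "2", "two", "two", "3", "2 cats", "two", "2 cats"]
print(vqa_accuracy("2", answers))    # 1.0   (>= 3 matches)
print(vqa_accuracy("3", answers))    # 0.33  (1 match)
print(vqa_accuracy("dog", answers))  # 0.0
```

A model's leaderboard score is the mean of this per-question value over the test-dev set, expressed as a percentage.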
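
For the top-ranked BLIP-2 entries, the released code lives in Salesforce's LAVIS repository. The sketch below is a hedged usage example, not the exact evaluation setup behind the scores above: it assumes the `salesforce-lavis` package, a hypothetical local `example.jpg`, and the pretrained `blip2_opt` OPT-2.7B checkpoint rather than the fine-tuned VQA weights reported in the table.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # pip install salesforce-lavis

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumption: this loads the pretrained BLIP-2 OPT-2.7B checkpoint, not the
# fine-tuned VQA model whose accuracy is listed in the leaderboard.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Open-ended VQA is phrased as text generation conditioned on a question prompt.
answer = model.generate({"image": image, "prompt": "Question: how many cats are there? Answer:"})
print(answer)
```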