visual-question-answering-vqa-on

Visual Question Answering (VQA)

[Figure: Results over time]

Leaderboard

Paper	Code	ANLS	Model name	Release date
Gemini: A Family of Highly Capable Multimodal Models	✓ Link	80.3	Gemini Ultra (pixel only)	2023-12-19
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts		66.2	SMoLA-PaLI-X Specialist	2023-12-01
ScreenAI: A Vision-Language Model for UI and Infographics Understanding	✓ Link	65.90	ScreenAI 5B (4.62 B params, w/ OCR)	2024-02-07
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts		65.6	SMoLA-PaLI-X Generalist	2023-12-01
Unifying Vision, Text, and Layout for Universal Document Processing	✓ Link	63.0	UDOP (aux)	2022-12-05
PaLI-3 Vision Language Models: Smaller, Faster, Stronger	✓ Link	62.4	PaLI-3 (w/ OCR)	2023-10-13
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer	✓ Link	61.20	TILT-Large	2021-02-18
PaLI-3 Vision Language Models: Smaller, Faster, Stronger	✓ Link	57.8	PaLI-3	2023-10-13
LAPDoc: Layout-Aware Prompting for Documents		54.9	ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat)	2024-02-15
PaLI-X: On Scaling up a Multilingual Vision and Language Model	✓ Link	54.8	PaLI-X (Single-task FT w/ OCR)	2023-05-29
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering	✓ Link	54.51	Claude + LATIN-Prompt	2023-06-01
PaLI-X: On Scaling up a Multilingual Vision and Language Model	✓ Link	50.7	PaLI-X (Multi-task FT)	2023-05-29
PaLI-X: On Scaling up a Multilingual Vision and Language Model	✓ Link	49.2	PaLI-X (Single-task FT)	2023-05-29
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering	✓ Link	48.98	GPT-3.5 + LATIN-Prompt	2023-06-01
DocFormerv2: Local Features for Document Understanding	✓ Link	48.8	DocFormerv2-large	2023-06-02
Unifying Vision, Text, and Layout for Universal Document Processing	✓ Link	47.4	UDOP	2022-12-05
DUBLIN -- Document Understanding By Language-Image Network		42.6	DUBLIN (variable resolution)	2023-05-23
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding	✓ Link	40	Pix2Struct-large	2022-10-07
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding	✓ Link	38.2	Pix2Struct-base	2022-10-07
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering	✓ Link	37.2	MatCha	2022-12-19
DUBLIN -- Document Understanding By Language-Image Network		36.82	DUBLIN	2023-05-23
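
The ANLS column is commonly read as Average Normalized Levenshtein Similarity. This page does not define it, so the following is only a minimal sketch of the usual formulation: each prediction is scored against its closest accepted answer, scores whose normalized edit distance reaches a threshold are set to zero, and the per-question scores are averaged. The function names, the lowercasing and whitespace stripping, and the 0.5 threshold are assumptions for illustration, not details taken from this leaderboard. The result lies in [0, 1]; the values in the table appear to be on a 0 to 100 scale.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity.

    Assumed conventions (not confirmed by the leaderboard): strings are
    stripped and lowercased, and threshold defaults to 0.5.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            # Normalized distance in [0, 1]; two empty strings count as equal.
            nl = levenshtein(p, g) / longest if longest else 0.0
            if nl < threshold:
                best = max(best, 1.0 - nl)
        total += best
    return total / len(predictions)
```

For example, `anls(["pariz"], [["Paris"]])` scores one edit over five characters, and multiplying the returned value by 100 gives a number comparable in scale to the table entries.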

OpenCodePapers
