OpenCodePapers

Visual Question Answering on DocVQA test

Visual Question Answering (VQA)
Dataset Link
Results over time
Leaderboard
| Paper | Code | ANLS | Accuracy | Model | Release date |
|---|---|---|---|---|---|
| DocVQA: A Dataset for VQA on Document Images | ✓ | 0.9436 | | Human | 2020-07-01 |
| Multi-label Cluster Discrimination for Visual Representation Learning | ✓ | 0.916 | | MLCD-Embodied-7B | 2024-07-24 |
| Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 0.908 | | SMoLA-PaLI-X Specialist | 2023-12-01 |
| Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 0.906 | | SMoLA-PaLI-X Generalist | 2023-12-01 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 0.9024 | | Qwen-VL-Plus | 2023-08-24 |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | ✓ | 0.8988 | | ScreenAI 5B (4.62 B params, w/ OCR) | 2024-02-07 |
| PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ | 0.886 | | PaLI-3 (w/ OCR) | 2023-10-13 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | ✓ | 0.8841 | | ERNIE-Layout large (ensemble) | 2022-10-12 |
| Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ | 0.884 | | GPT-4 | 2023-06-01 |
| DocFormerv2: Local Features for Document Understanding | ✓ | 0.8784 | | DocFormerv2-large | 2023-06-02 |
| Unifying Vision, Text, and Layout for Universal Document Processing | ✓ | 0.878 | | UDOP (aux) | 2022-12-05 |
| PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ | 0.876 | | PaLI-3 | 2023-10-13 |
| Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | ✓ | 0.8705 | | TILT-Large | 2021-02-18 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ | 0.868 | | PaLI-X (Single-task FT w/ OCR) | 2023-05-29 |
| LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | ✓ | 0.8672 | | LayoutLMv2-LARGE | 2020-12-29 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | ✓ | 0.8486 | | ERNIE-Layout large | 2022-10-12 |
| Unifying Vision, Text, and Layout for Universal Document Processing | ✓ | 0.847 | | UDOP | 2022-12-05 |
| Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | ✓ | 0.8392 | | TILT-Base | 2021-02-18 |
| Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ | 0.8336 | | Claude + LATIN-Prompt | 2023-06-01 |
| Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ | 0.8255 | | GPT-3.5 + LATIN-Prompt | 2023-06-01 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ | 0.809 | | PaLI-X (Multi-task FT) | 2023-05-29 |
| DUBLIN -- Document Understanding By Language-Image Network | | 0.803 | | DUBLIN (variable resolution) | 2023-05-23 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ | 0.80 | | PaLI-X (Single-task FT) | 2023-05-29 |
| DUBLIN -- Document Understanding By Language-Image Network | | 0.782 | | DUBLIN | 2023-05-23 |
| LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | ✓ | 0.7808 | | LayoutLMv2-BASE | 2020-12-29 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ | 0.766 | | Pix2Struct-large | 2022-10-07 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | ✓ | 0.742 | | MatCha | 2022-12-19 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ | 0.721 | | Pix2Struct-base | 2022-10-07 |
| OCR-free Document Understanding Transformer | ✓ | 0.675 | | Donut | 2021-11-30 |
| DocVQA: A Dataset for VQA on Document Images | ✓ | 0.665 | 55.77 | BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline | 2020-07-01 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 0.651 | | Qwen-VL | 2023-08-24 |
| End-to-end Document Recognition and Understanding with Dessurt | ✓ | 0.632 | | Dessurt | 2022-03-30 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 0.626 | | Qwen-VL-Chat | 2023-08-24 |
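The ANLS column above is the Average Normalized Levenshtein Similarity used by the DocVQA evaluation: each prediction is scored against every ground-truth answer by 1 − edit_distance/max_length, the best match is kept, and scores below a threshold of 0.5 are zeroed before averaging. A minimal sketch of that computation (the function names `levenshtein` and `anls` are illustrative, not part of any official toolkit):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard Wagner-Fischer dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def anls(predictions: list[str], gold_answers: list[list[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity with threshold tau (0.5 in DocVQA)."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            m = max(len(p), len(g))
            sim = 1.0 - levenshtein(p, g) / m if m > 0 else 1.0
            best = max(best, sim)
        total += best if best >= tau else 0.0  # near-misses count, far misses score 0
    return total / len(predictions)
```

For example, `anls(["reports"], [["report"]])` scores 6/7 ≈ 0.857 because a single-character slip is tolerated, while an unrelated answer falls below the 0.5 threshold and contributes 0.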