Paper | Code | ANLS | Accuracy | Model | Date |
--- | --- | --- | --- | --- | --- |
DocVQA: A Dataset for VQA on Document Images | ✓ Link | 0.9436 | | Human | 2020-07-01 |
Multi-label Cluster Discrimination for Visual Representation Learning | ✓ Link | 0.916 | | MLCD-Embodied-7B | 2024-07-24 |
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 0.908 | | SMoLA-PaLI-X Specialist | 2023-12-01 |
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 0.906 | | SMoLA-PaLI-X Generalist | 2023-12-01 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 0.9024 | | Qwen-VL-Plus | 2023-08-24 |
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | ✓ Link | 0.8988 | | ScreenAI 5B (4.62B params, w/ OCR) | 2024-02-07 |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ Link | 0.886 | | PaLI-3 (w/ OCR) | 2023-10-13 |
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | ✓ Link | 0.8841 | | ERNIE-Layout large (ensemble) | 2022-10-12 |
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ Link | 0.884 | | GPT-4 | 2023-06-01 |
DocFormerv2: Local Features for Document Understanding | ✓ Link | 0.8784 | | DocFormerv2-large | 2023-06-02 |
Unifying Vision, Text, and Layout for Universal Document Processing | ✓ Link | 0.878 | | UDOP (aux) | 2022-12-05 |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ Link | 0.876 | | PaLI-3 | 2023-10-13 |
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | ✓ Link | 0.8705 | | TILT-Large | 2021-02-18 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 0.868 | | PaLI-X (Single-task FT w/ OCR) | 2023-05-29 |
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | ✓ Link | 0.8672 | | LayoutLMv2-Large | 2020-12-29 |
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | ✓ Link | 0.8486 | | ERNIE-Layout large | 2022-10-12 |
Unifying Vision, Text, and Layout for Universal Document Processing | ✓ Link | 0.847 | | UDOP | 2022-12-05 |
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | ✓ Link | 0.8392 | | TILT-Base | 2021-02-18 |
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ Link | 0.8336 | | Claude + LATIN-Prompt | 2023-06-01 |
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ Link | 0.8255 | | GPT-3.5 + LATIN-Prompt | 2023-06-01 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 0.809 | | PaLI-X (Multi-task FT) | 2023-05-29 |
DUBLIN -- Document Understanding By Language-Image Network | | 0.803 | | DUBLIN (variable resolution) | 2023-05-23 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 0.80 | | PaLI-X (Single-task FT) | 2023-05-29 |
DUBLIN -- Document Understanding By Language-Image Network | | 0.782 | | DUBLIN | 2023-05-23 |
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | ✓ Link | 0.7808 | | LayoutLMv2-Base | 2020-12-29 |
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ Link | 0.766 | | Pix2Struct-large | 2022-10-07 |
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | ✓ Link | 0.742 | | MatCha | 2022-12-19 |
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ Link | 0.721 | | Pix2Struct-base | 2022-10-07 |
OCR-free Document Understanding Transformer | ✓ Link | 0.675 | | Donut | 2021-11-30 |
DocVQA: A Dataset for VQA on Document Images | ✓ Link | 0.665 | 55.77 | BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline | 2020-07-01 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 0.651 | | Qwen-VL | 2023-08-24 |
End-to-end Document Recognition and Understanding with Dessurt | ✓ Link | 0.632 | | Dessurt | 2022-03-30 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 0.626 | | Qwen-VL-Chat | 2023-08-24 |
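The scores above are ANLS (Average Normalized Levenshtein Similarity), the standard DocVQA metric: each prediction is compared against every accepted ground-truth answer via normalized edit distance, similarities below a threshold (0.5 in the DocVQA challenge) are zeroed, and the best per-question score is averaged. A minimal sketch, with hypothetical function names:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """Average Normalized Levenshtein Similarity over all questions.

    predictions  : list of predicted answer strings, one per question
    gold_answers : list of lists of accepted ground-truth strings
    tau          : similarity threshold below which a score is zeroed
    """
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            denom = max(len(p), len(g))
            nls = 1.0 if denom == 0 else 1.0 - levenshtein(p, g) / denom
            best = max(best, nls)  # keep the closest ground-truth match
        total += best if best >= tau else 0.0
    return total / len(predictions)
```

An exact match scores 1.0, a one-character slip in a six-character answer scores just under 0.86, and anything less than half-similar to every accepted answer scores 0, so ANLS rewards near-miss OCR readings while still penalizing wrong answers outright.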