Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 80.3 | Gemini Ultra (pixel only) | 2023-12-19 |
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 66.2 | SMoLA-PaLI-X Specialist | 2023-12-01 |
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | ✓ Link | 65.90 | ScreenAI 5B (4.62 B params, w/ OCR) | 2024-02-07 |
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 65.6 | SMoLA-PaLI-X Generalist | 2023-12-01 |
Unifying Vision, Text, and Layout for Universal Document Processing | ✓ Link | 63.0 | UDOP (aux) | 2022-12-05 |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ Link | 62.4 | PaLI-3 (w/ OCR) | 2023-10-13 |
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | ✓ Link | 61.20 | TILT-Large | 2021-02-18 |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ Link | 57.8 | PaLI-3 | 2023-10-13 |
LAPDoc: Layout-Aware Prompting for Documents | | 54.9 | ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 2024-02-15 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 54.8 | PaLI-X (Single-task FT w/ OCR) | 2023-05-29 |
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ Link | 54.51 | Claude + LATIN-Prompt | 2023-06-01 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 50.7 | PaLI-X (Multi-task FT) | 2023-05-29 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 49.2 | PaLI-X (Single-task FT) | 2023-05-29 |
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ Link | 48.98 | GPT-3.5 + LATIN-Prompt | 2023-06-01 |
DocFormerv2: Local Features for Document Understanding | ✓ Link | 48.8 | DocFormerv2-large | 2023-06-02 |
Unifying Vision, Text, and Layout for Universal Document Processing | ✓ Link | 47.4 | UDOP | 2022-12-05 |
DUBLIN -- Document Understanding By Language-Image Network | | 42.6 | DUBLIN (variable resolution) | 2023-05-23 |
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ Link | 40.0 | Pix2Struct-large | 2022-10-07 |
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ Link | 38.2 | Pix2Struct-base | 2022-10-07 |
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | ✓ Link | 37.2 | MatCha | 2022-12-19 |
DUBLIN -- Document Understanding By Language-Image Network | | 36.82 | DUBLIN | 2023-05-23 |