OpenCodePapers

Visual Question Answering (VQA)
Leaderboard
| Paper | Code | ANLS | Model Name | Release Date |
|---|---|---|---|---|
| Gemini: A Family of Highly Capable Multimodal Models | ✓ | 80.3 | Gemini Ultra (pixel only) | 2023-12-19 |
| Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 66.2 | SMoLA-PaLI-X Specialist | 2023-12-01 |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | ✓ | 65.90 | ScreenAI 5B (4.62 B params, w/ OCR) | 2024-02-07 |
| Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 65.6 | SMoLA-PaLI-X Generalist | 2023-12-01 |
| Unifying Vision, Text, and Layout for Universal Document Processing | ✓ | 63.0 | UDOP (aux) | 2022-12-05 |
| PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ | 62.4 | PaLI-3 (w/ OCR) | 2023-10-13 |
| Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | ✓ | 61.20 | TILT-Large | 2021-02-18 |
| PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ | 57.8 | PaLI-3 | 2023-10-13 |
| LAPDoc: Layout-Aware Prompting for Documents | | 54.9 | ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 2024-02-15 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ | 54.8 | PaLI-X (Single-task FT w/ OCR) | 2023-05-29 |
| Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ | 54.51 | Claude + LATIN-Prompt | 2023-06-01 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ | 50.7 | PaLI-X (Multi-task FT) | 2023-05-29 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ | 49.2 | PaLI-X (Single-task FT) | 2023-05-29 |
| Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | ✓ | 48.98 | GPT-3.5 + LATIN-Prompt | 2023-06-01 |
| DocFormerv2: Local Features for Document Understanding | ✓ | 48.8 | DocFormerv2-large | 2023-06-02 |
| Unifying Vision, Text, and Layout for Universal Document Processing | ✓ | 47.4 | UDOP | 2022-12-05 |
| DUBLIN -- Document Understanding By Language-Image Network | | 42.6 | DUBLIN (variable resolution) | 2023-05-23 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ | 40 | Pix2Struct-large | 2022-10-07 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ | 38.2 | Pix2Struct-base | 2022-10-07 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | ✓ | 37.2 | MatCha | 2022-12-19 |
| DUBLIN -- Document Understanding By Language-Image Network | | 36.82 | DUBLIN | 2023-05-23 |