Paper | Code | Score | Model | Date |
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs | | 81.3 | ChartPaLI-5B + PaLM 2-S | 2024-03-19 |
Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 80.8 | Gemini Ultra | 2023-12-19 |
DePlot: One-shot visual language reasoning by plot-to-table translation | ✓ Link | 79.3 | DePlot+FlanPaLM+Codex (PoT Self-Consistency) | 2022-12-20 |
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs | | 77.3 | ChartPaLI-5B | 2024-03-19 |
DePlot: One-shot visual language reasoning by plot-to-table translation | ✓ Link | 76.7 | DePlot+Codex (PoT Self-Consistency) | 2022-12-20 |
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | ✓ Link | 76.7 | ScreenAI 5B (4.62 B params, w/ OCR) | 2024-02-07 |
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 74.6 | SMoLA-PaLI-X Specialist Model | 2023-12-01 |
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 73.8 | SMoLA-PaLI-X Generalist Model | 2023-12-01 |
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA | | 72.64 | MatCha4096 + LaMenDa | 2024-01-01 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 72.3 | PaLI-X (Single-task FT w/ OCR) | 2023-05-29 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 70.9 | PaLI-X (Single-task FT) | 2023-05-29 |
PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ Link | 70.6 | PaLI-X (Multi-task FT) | 2023-05-29 |
DePlot: One-shot visual language reasoning by plot-to-table translation | ✓ Link | 70.5 | DePlot+FlanPaLM (Self-Consistency) | 2022-12-20 |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ Link | 70.0 | PaLI-3 | 2023-10-13 |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger | ✓ Link | 69.5 | PaLI-3 (w/ OCR) | 2023-10-13 |
DePlot: One-shot visual language reasoning by plot-to-table translation | ✓ Link | 67.3 | DePlot+FlanPaLM (CoT) | 2022-12-20 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 66.3 | Qwen-VL-Chat | 2023-08-24 |
UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning | ✓ Link | 66.24 | UniChart | 2023-05-24 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 65.7 | Qwen-VL | 2023-08-24 |
StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding | ✓ Link | 65.3 | StructChart+GPT3.5 (STR ChartQA+SimChart9K) | 2023-09-20 |
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | ✓ Link | 64.2 | MatCha | 2022-12-19 |
StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding | ✓ Link | 60.7 | StructChart+GPT3.5 (STR) | 2023-09-20 |
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ Link | 58.6 | Pix2Struct-large | 2022-10-07 |
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ✓ Link | 56.0 | Pix2Struct-base | 2022-10-07 |
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | ✓ Link | 45.5 | VisionTapas-OCR | 2022-03-19 |
DePlot: One-shot visual language reasoning by plot-to-table translation | ✓ Link | 42.3 | DePlot+GPT3 (Self-Consistency) | 2022-12-20 |
DePlot: One-shot visual language reasoning by plot-to-table translation | ✓ Link | 36.9 | DePlot+GPT3 (CoT) | 2022-12-20 |