| Paper | Code | Result | Model size | Model | Paper date |
|---|---|---|---|---|---|
| | | 77.1±0.1 | | gemini-2.0-flash-exp | |
| GPT-4 Technical Report | ✓ Link | 72.1±0.2 | | GPT-4o (gpt-4o-2024-11-20) | 2023-03-15 |
| Claude 3.5 Sonnet Model Card Addendum | | 71.8±0.2 | | Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 2024-06-24 |
| GPT-4 Technical Report | ✓ Link | 71.0±0.2 | | GPT-4o (gpt-4o-2024-05-13) | 2023-03-15 |
| | | 68.4±0.3 | 76B | InternVL2-Llama3-76B | |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 66.9±0.2 | | Gemini 1.5 Pro | 2024-03-08 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 66.9±0.3 | 72B | Qwen2-VL-72B (qwen-vl-max-0809) | 2024-09-18 |
| GPT-4 Technical Report | ✓ Link | 66.8±0.3 | | gpt-4o-mini-2024-07-18 | 2023-03-15 |
| GPT-4 Technical Report | ✓ Link | 66.3±0.2 | | GPT-4 Turbo (gpt-4-0125-preview) | 2023-03-15 |
| | | 63.8±0.2 | 40B | InternVL2-40B | |
| Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 57.2±0.2 | | Gemini Pro Vision | 2023-12-19 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 55.8±0.2 | | Qwen-VL-Max | 2023-08-24 |
| | | 55.8±0.2 | | Claude 3 Opus (claude-3-opus-20240229) | |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ Link | 51.5±0.2 | | InternVL-Chat-V1-5 | 2024-04-25 |
| | | 50.9±0.1 | 34B | LLaVA-NeXT-34B | |
| | | 45.5±0.1 | | InternVL-Chat-V1-2 | |
| CogVLM: Visual Expert for Pretrained Language Models | ✓ Link | 45.1±0.2 | | CogVLM-Chat | 2023-11-06 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | ✓ Link | 42.5±0.3 | | IXC2-VL-7B | 2024-01-29 |
| Generative Multimodal Models are In-Context Learners | ✓ Link | 38.0±0.1 | | Emu2-Chat | 2023-12-20 |
| CogAgent: A Visual Language Model for GUI Agents | ✓ Link | 34.7±0.2 | | CogAgent-Chat | 2023-12-14 |
| Improved Baselines with Visual Instruction Tuning | ✓ Link | 33.2±0.1 | 13B | LLaVA-v1.5-13B | 2023-10-05 |
| Improved Baselines with Visual Instruction Tuning | ✓ Link | 28.3±0.2 | 7B | LLaVA-v1.5-7B | 2023-10-05 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ Link | 23.2±0.1 | 9B | Otter-9B | 2023-06-08 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ Link | 17.6±0.2 | 9B | OpenFlamingo-9B | 2023-08-02 |