GPT-4 Technical Report | ✓ Link | 60.7 | 59.9 | GPT-4V-turbo-detail:high (Visual Prompt) | 2023-03-15 |
GPT-4 Technical Report | ✓ Link | 52.8 | 51.4 | GPT-4V-turbo-detail:low (Visual Prompt) | 2023-03-15 |
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | ✓ Link | 50.5 | 49.0 | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 2024-12-04 |
Making Large Language Models Better Data Creators | ✓ Link | 48.3 | 48.2 | ViP-LLaVA-13B (Visual Prompt) | 2023-10-31 |
Improved Baselines with Visual Instruction Tuning | ✓ Link | 47.1 | | LLaVA-1.5-13B (Coordinates) | 2023-10-05 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 45.3 | | Qwen-VL-Chat (Coordinates) | 2023-08-24 |
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | ✓ Link | 45.1 | 48.2 | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 2024-12-04 |
Improved Baselines with Visual Instruction Tuning | ✓ Link | 41.8 | 42.9 | LLaVA-1.5-13B (Visual Prompt) | 2023-10-05 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 39.2 | 41.7 | Qwen-VL-Chat (Visual Prompt) | 2023-08-24 |
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | ✓ Link | 35.8 | 35.2 | InstructBLIP-13B (Visual Prompt) | 2023-05-11 |
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | ✓ Link | 35.1 | | GPT4RoI-7B (ROI) | 2023-07-07 |
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | ✓ Link | 33.7 | | Shikra-7B (Coordinates) | 2023-06-27 |
Kosmos-2: Grounding Multimodal Large Language Models to the World | ✓ Link | 26.9 | | Kosmos-2 (Discrete Token) | 2023-06-26 |