OpenCodePapers

Visual Question Answering on ViP-Bench

Task: Visual Question Answering
Dataset: ViP-Bench
Results over time (chart of GPT-4 scores by model release date)
Leaderboard
| Paper | Code | GPT-4 score (bbox) | GPT-4 score (human) | Model Name | Release Date |
|---|---|---|---|---|---|
| GPT-4 Technical Report | ✓ | 60.7 | 59.9 | GPT-4V-turbo-detail:high (Visual Prompt) | 2023-03-15 |
| GPT-4 Technical Report | ✓ | 52.8 | 51.4 | GPT-4V-turbo-detail:low (Visual Prompt) | 2023-03-15 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | ✓ | 50.5 | 49.0 | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 2024-12-04 |
| Making Large Language Models Better Data Creators | ✓ | 48.3 | 48.2 | ViP-LLaVA-13B (Visual Prompt) | 2023-10-31 |
| Improved Baselines with Visual Instruction Tuning | ✓ | 47.1 | | LLaVA-1.5-13B (Coordinates) | 2023-10-05 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 45.3 | | Qwen-VL-Chat (Coordinates) | 2023-08-24 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | ✓ | 45.1 | 48.2 | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 2024-12-04 |
| Improved Baselines with Visual Instruction Tuning | ✓ | 41.8 | 42.9 | LLaVA-1.5-13B (Visual Prompt) | 2023-10-05 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 39.2 | 41.7 | Qwen-VL-Chat (Visual Prompt) | 2023-08-24 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | ✓ | 35.8 | 35.2 | InstructBLIP-13B (Visual Prompt) | 2023-05-11 |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | ✓ | 35.1 | | GPT4RoI-7B (RoI) | 2023-07-07 |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | ✓ | 33.7 | | Shikra-7B (Coordinates) | 2023-06-27 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | ✓ | 26.9 | | Kosmos-2 (Discrete Token) | 2023-06-26 |