OpenCodePapers

Visual Question Answering on MM-Vet v2
Results over time
Leaderboard
| Paper | Code | GPT-4 score | Params | Model Name | Release Date |
|---|---|---|---|---|---|
| | | 77.1±0.1 | | gemini-2.0-flash-exp | |
| GPT-4 Technical Report | ✓ | 72.1±0.2 | | GPT-4o (gpt-4o-2024-11-20) | 2023-03-15 |
| Claude 3.5 Sonnet Model Card Addendum | | 71.8±0.2 | | Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 2024-06-24 |
| GPT-4 Technical Report | ✓ | 71.0±0.2 | | GPT-4o (gpt-4o-2024-05-13) | 2023-03-15 |
| | | 68.4±0.3 | 76B | InternVL2-Llama3-76B | |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 66.9±0.2 | | Gemini 1.5 Pro | 2024-03-08 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 66.9±0.3 | 72B | Qwen2-VL-72B (qwen-vl-max-0809) | 2024-09-18 |
| GPT-4 Technical Report | ✓ | 66.8±0.3 | | gpt-4o-mini-2024-07-18 | 2023-03-15 |
| GPT-4 Technical Report | ✓ | 66.3±0.2 | | GPT-4 Turbo (gpt-4-0125-preview) | 2023-03-15 |
| | | 63.8±0.2 | 40B | InternVL2-40B | |
| Gemini: A Family of Highly Capable Multimodal Models | ✓ | 57.2±0.2 | | Gemini Pro Vision | 2023-12-19 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 55.8±0.2 | | Qwen-VL-Max | 2023-08-24 |
| | | 55.8±0.2 | | Claude 3 Opus (claude-3-opus-20240229) | |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ | 51.5±0.2 | | InternVL-Chat-V1-5 | 2024-04-25 |
| | | 50.9±0.1 | 34B | LLaVA-NeXT-34B | |
| | | 45.5±0.1 | | InternVL-Chat-V1-2 | |
| CogVLM: Visual Expert for Pretrained Language Models | ✓ | 45.1±0.2 | | CogVLM-Chat | 2023-11-06 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | ✓ | 42.5±0.3 | | IXC2-VL-7B | 2024-01-29 |
| Generative Multimodal Models are In-Context Learners | ✓ | 38.0±0.1 | | Emu2-Chat | 2023-12-20 |
| CogAgent: A Visual Language Model for GUI Agents | ✓ | 34.7±0.2 | | CogAgent-Chat | 2023-12-14 |
| Improved Baselines with Visual Instruction Tuning | ✓ | 33.2±0.1 | 13B | LLaVA-v1.5-13B | 2023-10-05 |
| Improved Baselines with Visual Instruction Tuning | ✓ | 28.3±0.2 | 7B | LLaVA-v1.5-7B | 2023-10-05 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ | 23.2±0.1 | 9B | Otter-9B | 2023-06-08 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ | 17.6±0.2 | 9B | OpenFlamingo-9B | 2023-08-02 |
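
The "Results over time" chart above is interactive on the original page and is not reproduced here. The snippet below is a minimal sketch, assuming pandas and matplotlib, of how a similar score-versus-date view could be rebuilt from the dated rows of the leaderboard table; the hard-coded rows are a subset copied from that table, not an external data source, and the plotting code is not the site's own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Subset of dated entries copied from the leaderboard table above;
# undated rows (e.g. gemini-2.0-flash-exp) are omitted.
rows = [
    ("GPT-4o (gpt-4o-2024-11-20)", 72.1, "2023-03-15"),
    ("Claude 3.5 Sonnet (claude-3-5-sonnet-20240620)", 71.8, "2024-06-24"),
    ("Gemini 1.5 Pro", 66.9, "2024-03-08"),
    ("Qwen2-VL-72B (qwen-vl-max-0809)", 66.9, "2024-09-18"),
    ("LLaVA-v1.5-13B", 33.2, "2023-10-05"),
    ("OpenFlamingo-9B", 17.6, "2023-08-02"),
]

df = pd.DataFrame(rows, columns=["model", "gpt4_score", "release_date"])
df["release_date"] = pd.to_datetime(df["release_date"])
df = df.sort_values("release_date")

# One point per model: GPT-4 score against the listed release date.
fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(df["release_date"], df["gpt4_score"])
for _, row in df.iterrows():
    ax.annotate(row["model"], (row["release_date"], row["gpt4_score"]),
                fontsize=7, rotation=30)
ax.set_xlabel("Release date")
ax.set_ylabel("GPT-4 score")
ax.set_title("MM-Vet v2: results over time")
fig.autofmt_xdate()
plt.tight_layout()
plt.show()
```

To plot the full leaderboard, extend `rows` with the remaining dated entries from the table.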