| Paper | Code | Result | Model size | Model | Paper date |
|---|---|---|---|---|---|
| | | 77.1±0.1 | | gemini-2.0-flash-exp | |
| GPT-4 Technical Report | ✓ Link | 72.1±0.2 | | GPT-4o (gpt-4o-2024-11-20) | 2023-03-15 |
| Claude 3.5 Sonnet Model Card Addendum | | 71.8±0.2 | | Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 2024-06-24 |
| GPT-4 Technical Report | ✓ Link | 71.0±0.2 | | GPT-4o (gpt-4o-2024-05-13) | 2023-03-15 |
| | | 68.4±0.3 | 76B | InternVL2-Llama3-76B | |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 66.9±0.2 | | Gemini 1.5 Pro | 2024-03-08 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 66.9±0.3 | 72B | Qwen2-VL-72B (qwen-vl-max-0809) | 2024-09-18 |
| GPT-4 Technical Report | ✓ Link | 66.8±0.3 | | gpt-4o-mini-2024-07-18 | 2023-03-15 |
| GPT-4 Technical Report | ✓ Link | 66.3±0.2 | | GPT-4 Turbo (gpt-4-0125-preview) | 2023-03-15 |
| | | 63.8±0.2 | 40B | InternVL2-40B | |
| Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 57.2±0.2 | | Gemini Pro Vision | 2023-12-19 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 55.8±0.2 | | Qwen-VL-Max | 2023-08-24 |
| | | 55.8±0.2 | | Claude 3 Opus (claude-3-opus-20240229) | |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ Link | 51.5±0.2 | | InternVL-Chat-V1-5 | 2024-04-25 |
| | | 50.9±0.1 | 34B | LLaVA-NeXT-34B | |
| | | 45.5±0.1 | | InternVL-Chat-V1-2 | |
| CogVLM: Visual Expert for Pretrained Language Models | ✓ Link | 45.1±0.2 | | CogVLM-Chat | 2023-11-06 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | ✓ Link | 42.5±0.3 | | IXC2-VL-7B | 2024-01-29 |
| Generative Multimodal Models are In-Context Learners | ✓ Link | 38.0±0.1 | | Emu2-Chat | 2023-12-20 |
| CogAgent: A Visual Language Model for GUI Agents | ✓ Link | 34.7±0.2 | | CogAgent-Chat | 2023-12-14 |
| Improved Baselines with Visual Instruction Tuning | ✓ Link | 33.2±0.1 | 13B | LLaVA-v1.5-13B | 2023-10-05 |
| Improved Baselines with Visual Instruction Tuning | ✓ Link | 28.3±0.2 | 7B | LLaVA-v1.5-7B | 2023-10-05 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ Link | 23.2±0.1 | 9B | Otter-9B | 2023-06-08 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ Link | 17.6±0.2 | 9B | OpenFlamingo-9B | 2023-08-02 |