Paper | Code | Score | Params | Model | Date |
--- | --- | --- | --- | --- | --- |
| | | 81.2±0.4 | | gemini-2.0-flash-exp | |
| | | 78.1±0.2 | | gemini-exp-1206 | |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 76.9±0.1 | | Gemini 1.5 Pro (gemini-1.5-pro-002) | 2024-03-08 |
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | | 74.24 | | MMCTAgent (GPT-4 + GPT-4V) | 2024-05-28 |
Claude 3.5 Sonnet Model Card Addendum | | 74.2±0.2 | | Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 2024-06-24 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 74.0 | | Qwen2-VL-72B | 2024-09-18 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 72.3 | 78B | InternVL2.5-78B | 2024-12-06 |
Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | | 72.2 | | GPT-4o + text rationale + IoT | 2024-05-22 |
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ Link | 71.4 | 74B | Lyra-Pro | 2024-12-12 |
CogVLM2: Visual Language Models for Image and Video Understanding | ✓ Link | 71.1 | | GLM-4V-Plus | 2024-08-29 |
Phantom of Latent for Large Language and Vision Models | ✓ Link | 70.8 | | Phantom-7B | 2024-09-23 |
GPT-4 Technical Report | ✓ Link | 69.3±0.1 | | GPT-4o (gpt-4o-2024-05-13) | 2023-03-15 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 68.8 | 38B | InternVL2.5-38B | 2024-12-06 |
GPT-4 Technical Report | ✓ Link | 68.6±0.1 | | gpt-4o-mini-2024-07-18 | 2023-03-15 |
GPT-4 Technical Report | ✓ Link | 67.7±0.3 | | GPT-4V | 2023-03-15 |
GPT-4 Technical Report | ✓ Link | 67.6±0.1 | | GPT-4V-Turbo-detail:high | 2023-03-15 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 66.6±0.5 | | Qwen-VL-Max | 2023-08-24 |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 65.8±0.1 | | Gemini 1.5 Pro (gemini-1.5-pro) | 2024-03-08 |
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ Link | 65.60 | 26B | InternVL2-26B (SGP, token ratio 64%) | 2024-12-04 |
Baichuan-Omni Technical Report | ✓ Link | 65.4 | | Baichuan-Omni (7B) | 2024-10-11 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 65.0 | 26B | InternVL2.5-26B | 2024-12-06 |
Gamified crowd-sourcing of high-quality data for visual fine-tuning | | 64.954 | | Qwen2-VL-7B (finetuned on GAP-VQA train) | 2024-10-05 |
| | | 64.4 | | InternVL2-Llama3-76B | |
Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 64.3±0.4 | | Gemini 1.0 Pro Vision (gemini-pro-vision) | 2023-12-19 |
CogVLM: Visual Expert for Pretrained Language Models | ✓ Link | 63.9 | | GLM4 Vision | 2023-11-06 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 63.7 | | LLaVA-OneVision-72B | 2024-08-06 |
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ Link | 63.5 | 9B | Lyra-Base | 2024-12-12 |
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ Link | 63.20 | 26B | InternVL2-26B (SGP, token ratio 35%) | 2024-12-04 |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ Link | 62.8 | 26B | InternVL 1.5 | 2024-04-25 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 62.8 | 8B | InternVL2.5-8B | 2024-12-06 |
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | ✓ Link | 62.3 | | MAmmoTH-VL-8B | 2024-12-06 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 62.0 | | Qwen2-VL-7B | 2024-09-18 |
| | | 61.8 | | InternVL2-40B | |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 61.1±0.2 | | Qwen-VL-Plus | 2023-08-24 |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ Link | 60.8 | | Mini-Gemini-HD-BS | 2024-03-27 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 60.8 | 2B | InternVL2.5-2B | 2024-12-06 |
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | ✓ Link | 60.6 | | MAmmoTH-VL-8B (SI) | 2024-12-06 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 60.6 | 4B | InternVL2.5-4B | 2024-12-06 |
GPT-4 Technical Report | ✓ Link | 60.2±0.3 | | GPT-4V-Turbo-detail:low | 2023-03-15 |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ Link | 59.3 | | Mini-Gemini-HD | 2024-03-27 |
| | | 58.1±0.1 | | Claude 3 Opus (claude-3-opus-20240229) | |
CogVLM2: Visual Language Models for Image and Video Understanding | ✓ Link | 58.0 | | GLM-4V-9B | 2024-08-29 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 57.5 | | LLaVA-OneVision-7B | 2024-08-06 |
| | | 57.4 | 34B | LLaVA-NeXT-34B | |
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | ✓ Link | 57.3 | 7B | Meteor | 2024-05-24 |
CROME: Cross-Modal Adapters for Efficient Multimodal LLM | | 55.1 | | CROME (Vicuna-13B) | 2024-08-13 |
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | ✓ Link | 54.9 | | IXC2-4KHD | 2024-04-09 |
| | | 54.7 | 15B | Weitu-VL-1.0 | |
TroL: Traversal of Layers for Large Language and Vision Models | ✓ Link | 54.7 | 7B | TroL-7B | 2024-06-18 |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ Link | 53.0 | | Mini-Gemini | 2024-03-27 |
CogVLM: Visual Expert for Pretrained Language Models | ✓ Link | 52.8 | 17B | CogVLM (Vicuna-7B) | 2023-11-06 |
CogAgent: A Visual Language Model for GUI Agents | ✓ Link | 52.8 | 18B | CogAgent | 2023-12-14 |
Gamified crowd-sourcing of high-quality data for visual fine-tuning | | 52.43 | | Qwen2-VL-2B (finetuned on GAP-VQA train) | 2024-10-05 |
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ Link | 52.10 | 26B | InternVL2-26B (SGP, token ratio 9%) | 2024-12-04 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 52.0 | | MM1.5-30B | 2024-09-30 |
Gamified crowd-sourcing of high-quality data for visual fine-tuning | | 51.789 | | MiniCPM-Llama3-V-2.5-8B (finetuned on GAP-VQA train) | 2024-10-05 |
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ Link | 51.7 | | IXC-2.5-7B | 2024-07-03 |
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | ✓ Link | 51.2 | | InternLM-XComposer2 | 2024-01-29 |
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ Link | 51.2 | 3B | Lyra-Mini | 2024-12-12 |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | ✓ Link | 51.0 | 7B | CuMo-7B | 2024-05-09 |
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ Link | 50.9 | | TACO (Qwen2-7B / SigLIP) | 2024-12-07 |
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 50.7 | | Qwen-VL-Chat (+ SFT (GPT-4V in VLFeedback)) | 2024-10-12 |
POINTS: Improving Your Vision-language Model with Affordable Strategies | | 50.0 | | POINTS-9B | 2024-09-07 |
VILA²: VILA Augmented VILA | | 50.0 | | VILA²-8B | 2024-07-24 |
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | ✓ Link | 50.0 | | Janus-Pro-7B | 2025-01-29 |
Silkie: Preference Distillation for Large Visual Language Models | | 49.9 | 7B | Silkie | 2023-12-17 |
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 49.9 | | Silkie (Qwen-VL-Chat + DPO w/ VLFeedback) | 2024-10-12 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 49.5 | | Qwen2-VL-2B | 2024-09-18 |
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | ✓ Link | 49.0 | 3.2B | FlashSloth-HD | 2024-12-05 |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ Link | 48.9 | 40B | InternVL 1.2 | 2024-04-25 |
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs | | 48.8 | | SEA-PRIME (Vicuna-13B) | 2024-08-21 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 48.8 | 1B | InternVL2.5-1B | 2024-12-06 |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | 48.7 | | MM1-30B-Chat | 2024-03-14 |
Towards Semantic Equivalence of Tokenization in Multimodal LLM | | 48.7 | 13B | SETOKIM (13B) | 2024-06-07 |
Generative Multimodal Models are In-Context Learners | ✓ Link | 48.5 | 37B | Emu2-Chat | 2023-12-20 |
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning | ✓ Link | 48.5 | | MG-LLaVA (34B) | 2024-06-25 |
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | ✓ Link | 47.9 | | SPHINX-Plus | 2024-02-08 |
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | ✓ Link | 45.9 | 7B | ConvLLaVA | 2024-05-24 |
VILA: On Pre-training for Visual Language Models | ✓ Link | 45.7 | | VILA-13B | 2023-12-12 |
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ Link | 45.7 | | TACO (LLaMA3-8B / SigLIP) | 2024-12-07 |
| | | 45.3 | | HPT 1.5 Edge | |
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ Link | 45.2 | | TACO (LLaMA3-8B / CLIP) | 2024-12-07 |
Enhancing Large Vision Language Models with Self-Training on Image Comprehension | ✓ Link | 45.0 | 7B | LLaVA-v1.6 (7B, w/ STIC) | 2024-05-30 |
H2OVL-Mississippi Vision Language Models Technical Report | | 44.7 | | H2OVL-Mississippi-2B | 2024-10-17 |
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | ✓ Link | 44.7 | 7B | PIIP-LLaVA (Vicuna-7B, ConvNeXt-L, CLIP-L) | 2025-01-14 |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ✓ Link | 44.6±0.2 | | MM-ReAct-GPT-4 | 2023-03-20 |
Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ Link | 44.6 | 4B | Imp-4B | 2024-05-20 |
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 44.2 | | LLaVA-Next-Mistral-7b (+ DPO w/ VLFeedback) | 2024-10-12 |
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | | 44.1 | | MGM-7B+RP | 2024-08-08 |
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 44.1 | | LLaVA-Next-Vicuna-7b (+ DPO w/ VLFeedback) | 2024-10-12 |
Multi-modal Auto-regressive Modeling via Visual Words | ✓ Link | 44.0 | | VW-LMM | 2024-03-12 |
MoAI: Mixture of All Intelligence for Large Language and Vision Models | ✓ Link | 43.7 | 7B | MoAI | 2024-03-12 |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | 43.7 | | MM1-3B-Chat | 2024-03-14 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 43.7 | | MM1.5-3B-MoE | 2024-09-30 |
Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ Link | 43.3 | 3B | Imp-3B | 2024-05-20 |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | ✓ Link | 43.1 | 13B | ShareGPT4V-13B | 2023-11-21 |
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | ✓ Link | 42.9 | | Mini-Gemini (+MoCa) | 2024-10-09 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 42.2 | | MM1.5-7B | 2024-09-30 |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | 42.1 | | MM1-7B-Chat | 2024-03-14 |
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | ✓ Link | 41.9 | 3.2B | FlashSloth | 2024-12-05 |
DeepSeek-VL: Towards Real-World Vision-Language Understanding | ✓ Link | 41.5 | | DeepSeek-VL | 2024-03-08 |
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | 41.4 | 13B | LLaVA1.5-13B-BPO | 2024-03-13 |
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | ✓ Link | 41.3 | | ASMv2 | 2024-02-29 |
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression | | 41.3 | | FocusLLaVA | 2024-11-21 |
Self-Supervised Visual Preference Alignment | ✓ Link | 41.0 | | SeVa-13B | 2024-04-16 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 41.0 | | MM1.5-3B | 2024-09-30 |
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | ✓ Link | 40.4 | | LLaVA-1.5-7B (VG-S) | 2024-12-09 |
CoLLaVO: Crayon Large Language and Vision mOdel | ✓ Link | 40.3 | 7B | CoLLaVO | 2024-02-17 |
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | ✓ Link | 40.2 | | SPHINX-2k | 2023-11-13 |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | ✓ Link | 40.2 | 13B | LLaVA-1.5 (LVIS-Instruct4V) | 2023-11-13 |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ Link | 40.1 | | mPLUG-Owl3 | 2024-08-09 |
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | | 40.1 | 2B | Mono-InternVL-2B | 2024-10-10 |
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance | | 39.90 | 13B | LLaVA1.5-13B-MDA | 2024-11-21 |
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | ✓ Link | 39.8 | | LLaVA-VT (Vicuna-13B) | 2024-03-27 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 39.8 | | MM1.5-1B-MoE | 2024-09-30 |
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | ✓ Link | 39.8 | | Janus-Pro-1B | 2025-01-29 |
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | ✓ Link | 39.7 | | SQ-LLaVA∗ | 2024-03-17 |
OmniFusion Technical Report | | 39.40 | | OmniFusion (grid split + ruDocVQA) | 2024-04-09 |
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | | 39.3 | | DeepStack-L-HD (Vicuna-13B) | 2024-06-06 |
From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information | | 38.9 | | LAF-13B | 2024-01-31 |
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | | 38.9 | | InfiMM-HD | 2024-03-03 |
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs | ✓ Link | 38.8 | | InternLM-XC2 + MMDU-45k | 2024-06-17 |
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | ✓ Link | 38.5 | | LLaVA-1.5-7B (DC-S) | 2024-12-09 |
MouSi: Poly-Visual-Expert Vision-Language Models | ✓ Link | 38.4 | 7.9B | LayoutLMv3+ConvNeXt+CLIP | 2024-01-30 |
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | ✓ Link | 38.0 | 13B | VOLCANO 13B | 2023-11-13 |
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | ✓ Link | 37.9 | | LLaVA-1.5+MMInstruct (Vicuna-13B) | 2024-07-22 |
Calibrated Self-Rewarding Vision Language Models | ✓ Link | 37.8 | 13B | LLaVA-1.5-13B (+CSR) | 2024-05-23 |
What If We Recaption Billions of Web Images with LLaMA-3? | | 37.8 | | LLaVA-1.5-LLaMA3-8B | 2024-06-12 |
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | ✓ Link | 37.8 | | LLaVA-1.5 + DenseFusion-1M (Vicuna-7B) | 2024-07-11 |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | ✓ Link | 37.6 | 7B | ShareGPT4V-7B | 2023-11-21 |
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models | ✓ Link | 37.6 | | LLaVA-1.5+CoS | 2024-03-19 |
COCO is "ALL'' You Need for Visual Instruction Fine-tuning | | 37.5 | | LLaVA-COCO-13B | 2024-01-17 |
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | ✓ Link | 37.5 | | LLaVA-S^2 + DenseFusion-1M (Vicuna-7B) | 2024-07-11 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 37.4 | | MM1.5-1B | 2024-09-30 |
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | ✓ Link | 37.3 | | Dynamic-LLaVA-13B | 2024-12-01 |
Self-Supervised Visual Preference Alignment | ✓ Link | 37.2 | | SeVa-7B | 2024-04-16 |
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | ✓ Link | 37.2 | | SoM-LLaVA-1.5-T | 2024-04-25 |
Emu3: Next-Token Prediction is All You Need | ✓ Link | 37.2 | | Emu3 | 2024-09-27 |
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | ✓ Link | 37.1 | | LLaVA-Instruct (Vicuna-1.5-13B) | 2024-06-28 |
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | | 37.0 | | ILLUME | 2024-12-09 |
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | 36.8 | 7B | LLaVA1.5-7B-BPO | 2024-03-13 |
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding | ✓ Link | 36.6 | | LLaVA-1.5-13B (+ MMFuser) | 2024-10-15 |
CaMML: Context-Aware Multimodal Learner for Large Models | ✓ Link | 36.4 | | CaMML-13B | 2024-01-06 |
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models | ✓ Link | 36.4 | | LLaVA-65B (Data Mixing) | 2023-09-18 |
Improved Baselines with Visual Instruction Tuning | ✓ Link | 36.3±0.2 | 13B | LLaVA-1.5-13B | 2023-10-05 |
Emu: Generative Pretraining in Multimodality | ✓ Link | 36.3±0.3 | 14B | Emu-14B | 2023-07-11 |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | ✓ Link | 36.3±0.1 | 7B | mPLUG-Owl2 | 2023-11-07 |
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ✓ Link | 36.2 | | Vary-base | 2023-12-11 |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | ✓ Link | 36.1 | | StableLLaVA | 2023-08-20 |
DreamLLM: Synergistic Multimodal Comprehension and Creation | ✓ Link | 35.9 | 7B | DreamLLM-7B | 2023-09-20 |
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | ✓ Link | 35.9 | | MoE-LLaVA-2.7B×4-Top2 | 2024-01-29 |
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | ✓ Link | 35.9 | | SoM-LLaVA-1.5 | 2024-04-25 |
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | ✓ Link | 35.9 | | Dragonfly (Llama3-8B) | 2024-06-03 |
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models | ✓ Link | 35.7 | | Ferret-v2-13B | 2024-04-11 |
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | | 35.6 | | AlignGPT (Vicuna-13B) | 2024-05-23 |
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | ✓ Link | 35.5 | | LLaVA-HR-X | 2024-03-05 |
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | ✓ Link | 35.5 | | SQ-LLaVA | 2024-03-17 |
LOVA3: Learning to Visual Question Answering, Asking and Assessment | ✓ Link | 35.2 | 7B | LOVA³ | 2024-05-23 |
Mixture-of-Subspaces in Low-Rank Adaptation | ✓ Link | 35.2 | | LLaVA-InternLM2-7B-ViT + MoSLoRA | 2024-06-16 |
Mixture-of-Subspaces in Low-Rank Adaptation | ✓ Link | 35.2 | | InternLM2+ViT (QMoSLoRA) | 2024-06-16 |
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance | | 35.20 | 7B | LLaVA1.5-7B-MDA | 2024-11-21 |
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge | | 35.1 | | Mipha-3B+ | 2024-07-05 |
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | ✓ Link | 35.0±0.0 | 13B | LLaVA-Plus-13B (All Tools, V1.3, 336px) | 2023-11-09 |
Merlin:Empowering Multimodal LLMs with Foresight Minds | | 34.9 | | Merlin | 2023-11-30 |
Improving Multi-modal Large Language Model through Boosting Vision Capabilities | | 34.8 | | Arcana | 2024-10-17 |
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | ✓ Link | 34.5 | | INF-LLaVA | 2024-07-23 |
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | ✓ Link | 34.4 | | LLaVA-1.5+MMInstruct (Vicuna-7B) | 2024-07-22 |
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | ✓ Link | 34.3 | | Janus | 2024-10-17 |
TokenPacker: Efficient Visual Projector for Multimodal LLM | ✓ Link | 34.1 | | LLaVA-TokenPacker (Vicuna-13B) | 2024-07-02 |
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models | | 34.0 | | γ-MoD-LLaVA-HR | 2024-10-17 |
Calibrated Self-Rewarding Vision Language Models | ✓ Link | 33.9 | 7B | LLaVA-1.5-7B (CSR) | 2024-05-23 |
Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models | ✓ Link | 33.6 | | DynMOE-LLaVA | 2024-05-23 |
Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ Link | 33.5 | 2B | Imp-2B | 2024-05-20 |
InfMLLM: A Unified Framework for Visual-Language Tasks | ✓ Link | 33.4 | | InfMLLM-7B-Chat | 2023-11-12 |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 33.2 | 7B | Video-LaVIT | 2024-02-05 |
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | ✓ Link | 32.9 | | LLaVA-Instruct (Vicuna-1.5-7B) | 2024-06-28 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 32.9 | | VisionZip (Retain 128 Tokens, fine-tuning) | 2024-12-05 |
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | ✓ Link | 32.8 | | Uni-MoE | 2024-05-18 |
VL-Mamba: Exploring State Space Models for Multimodal Learning | | 32.6 | | VL-Mamba (Mamba LLM-2.8B) | 2024-03-20 |
Enhancing Large Vision Language Models with Self-Training on Image Comprehension | ✓ Link | 32.6 | 7B | LLaVA-v1.5 (7B, w/ STIC) | 2024-05-30 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 32.6 | | VisionZip (Retain 192 Tokens, fine-tuning) | 2024-12-05 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 32.6 | | VisionZip (Retain 128 Tokens) | 2024-12-05 |
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | ✓ Link | 32.2 | | LLaVA-v1.5 (+MoCa) | 2024-10-09 |
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | ✓ Link | 32.2 | | Dynamic-LLaVA-7B | 2024-12-01 |
Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models | ✓ Link | 32.1 | 3B | Mipha-3B | 2024-03-10 |
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | ✓ Link | 32.0 | 7B | VOLCANO 7B | 2023-11-13 |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ Link | 32.0 | | Video-LLaVA | 2023-11-16 |
TinyLLaVA: A Framework of Small-scale Large Multimodal Models | ✓ Link | 32.0 | 3.1B | TinyLLaVA-share-Sig-Ph | 2024-02-22 |
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | ✓ Link | 31.8 | | LLaVA-VT (Vicuna-7B) | 2024-03-27 |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ✓ Link | 31.7±0.1 | 7B | LRV-Instruction-7B | 2023-06-26 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 31.7 | | VisionZip (Retain 192 Tokens) | 2024-12-05 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 31.7 | | VisionZip (Retain 64 Tokens) | 2024-12-05 |
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | ✓ Link | 31.6 | 7B | LLaVA-1.5-7B (+ SIMA) | 2024-05-24 |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 31.4±0.1 | 7B | LLaMA-Adapter v2-7B | 2023-04-28 |
Explore the Limits of Omni-modal Pretraining at Scale | ✓ Link | 31.4 | | MiCo-Chat-7B | 2024-06-13 |
TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition | ✓ Link | 31.2 | | LLaVA-1.5-7B + TeamLoRA | 2024-08-19 |
Improved Baselines with Visual Instruction Tuning | ✓ Link | 31.1±0.2 | 7B | LLaVA-1.5-7B | 2023-10-05 |
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | | 31.0 | | RoboCodeX-13B | 2024-02-25 |
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | ✓ Link | 31.0 | | HyperLLaVA | 2024-03-20 |
Visual Agents as Fast and Slow Thinkers | ✓ Link | 31.0 | | FAST (Vicuna-7B) | 2024-08-16 |
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | ✓ Link | 30.9 | | JanusFlow | 2024-11-12 |
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | | 30.8 | | AlignGPT (Vicuna-7B) | 2024-05-23 |
Efficient Large Multi-modal Models via Visual Context Compression | ✓ Link | 30.7 | | LLaVolta | 2024-06-28 |
Aligned Vector Quantization for Edge-Cloud Collaborative Vision-Language Models | | 30.7 | | LLaVA-AlignedVQ | 2024-11-08 |
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | ✓ Link | 30.4 | 7.3B | LLaVA-1.5-HACL | 2023-12-12 |
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model | | 30.4 | 7.2B | MaVEn | 2024-08-22 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 30.2 | | VisionZip (Retain 64 Tokens, fine-tuning) | 2024-12-05 |
H2OVL-Mississippi Vision Language Models Technical Report | | 30.0 | | H2OVL-Mississippi-0.8B | 2024-10-17 |
RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation | | 29.7 | | RoboMamba | 2024-06-06 |
TokenPacker: Efficient Visual Projector for Multimodal LLM | ✓ Link | 29.6 | | LLaVA-TokenPacker (Vicuna-7B) | 2024-07-02 |
OneLLM: One Framework to Align All Modalities with Language | ✓ Link | 29.1 | 7B | OneLLM-7B | 2023-12-06 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 29.1 | | LLaVA-OneVision-0.5B | 2024-08-06 |
Small Language Model Meets with Reinforced Vision Vocabulary | | 29.0 | 1.8B | Vary-toy | 2024-01-23 |
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model | ✓ Link | 28.9 | | LLaVA-Phi | 2024-01-04 |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ✓ Link | 27.9±0.1 | | MM-ReAct-GPT-3.5 | 2023-03-20 |
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | | 27.80 | | MMAR-7B | 2024-10-14 |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | ✓ Link | 27.7 | 7B | SEAL (7B) | 2023-12-21 |
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | ✓ Link | 27.5±0.3 | 7B | LLaVA-Plus-7B (All Tools) | 2023-11-09 |
OtterHD: A High-Resolution Multi-modality Model | ✓ Link | 26.3 | 8B | OtterHD-8B | 2023-11-07 |
Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models | | 25.6 | | TGA-7B | 2024-10-16 |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ Link | 24.8±0.2 | 9B | OpenFlamingo-9B (MPT-7B) | 2023-08-02 |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ Link | 24.7±0.3 | 9B | Otter-9B (MPT-7B) | 2023-06-08 |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ Link | 24.6±0.2 | 9B | Otter-9B (LLaMA) | 2023-06-08 |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ Link | 24.4±0.4 | 14B | MiniGPT-4-14B | 2023-04-20 |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ Link | 23.5 | | LinVT | 2024-12-06 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 22.4±0.2 | 12B | BLIP-2-12B | 2023-01-30 |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ Link | 22.1±0.1 | 8B | MiniGPT-4-8B | 2023-04-20 |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ Link | 21.8±0.1 | 9B | OpenFlamingo-9B (LLaMA-7B) | 2023-08-02 |
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model | ✓ Link | 21.8 | | Xmodel-VLM (Xmodel-LM 1.1B) | 2024-05-15 |
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild | ✓ Link | 19.4 | | TextBind | 2023-09-14 |
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | | 18.49 | | MMAR-0.5B | 2024-10-14 |