Paper | Code | Score | Params | Model | Date |
--- | --- | --- | --- | --- | --- |
| | | 81.2±0.4 | | gemini-2.0-flash-exp | |
| | | 78.1±0.2 | | gemini-exp-1206 | |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 76.9±0.1 | | Gemini 1.5 Pro (gemini-1.5-pro-002) | 2024-03-08 |
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | | 74.24 | | MMCTAgent (GPT-4 + GPT-4V) | 2024-05-28 |
Claude 3.5 Sonnet Model Card Addendum | | 74.2±0.2 | | Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 2024-06-24 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 74.0 | | Qwen2-VL-72B | 2024-09-18 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 72.3 | 78B | InternVL2.5-78B | 2024-12-06 |
Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | | 72.2 | | GPT-4o + text rationale + IoT | 2024-05-22 |
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ Link | 71.4 | 74B | Lyra-Pro | 2024-12-12 |
CogVLM2: Visual Language Models for Image and Video Understanding | ✓ Link | 71.1 | | GLM-4V-Plus | 2024-08-29 |
Phantom of Latent for Large Language and Vision Models | ✓ Link | 70.8 | | Phantom-7B | 2024-09-23 |
GPT-4 Technical Report | ✓ Link | 69.3±0.1 | | GPT-4o (gpt-4o-2024-05-13) | 2023-03-15 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 68.8 | 38B | InternVL2.5-38B | 2024-12-06 |
GPT-4 Technical Report | ✓ Link | 68.6±0.1 | | gpt-4o-mini-2024-07-18 | 2023-03-15 |
GPT-4 Technical Report | ✓ Link | 67.7±0.3 | | GPT-4V | 2023-03-15 |
GPT-4 Technical Report | ✓ Link | 67.6±0.1 | | GPT-4V-Turbo-detail:high | 2023-03-15 |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 66.6±0.5 | | Qwen-VL-Max | 2023-08-24 |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 65.8±0.1 | | Gemini 1.5 Pro (gemini-1.5-pro) | 2024-03-08 |
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ Link | 65.60 | 26B | InternVL2-26B (SGP, token ratio 64%) | 2024-12-04 |
Baichuan-Omni Technical Report | ✓ Link | 65.4 | | Baichuan-Omni (7B) | 2024-10-11 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 65.0 | 26B | InternVL2.5-26B | 2024-12-06 |
Gamified crowd-sourcing of high-quality data for visual fine-tuning | | 64.954 | | Qwen2-VL-7B (finetuned on GAP-VQA train) | 2024-10-05 |
| | | 64.4 | | InternVL2-Llama3-76B | |
Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 64.3±0.4 | | Gemini 1.0 Pro Vision (gemini-pro-vision) | 2023-12-19 |
CogVLM: Visual Expert for Pretrained Language Models | ✓ Link | 63.9 | | GLM4 Vision | 2023-11-06 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 63.7 | | LLaVA-OneVision-72B | 2024-08-06 |
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ Link | 63.5 | 9B | Lyra-Base | 2024-12-12 |
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ Link | 63.20 | 26B | InternVL2-26B (SGP, token ratio 35%) | 2024-12-04 |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ Link | 62.8 | 26B | InternVL 1.5 | 2024-04-25 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 62.8 | 8B | InternVL2.5-8B | 2024-12-06 |
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | ✓ Link | 62.3 | | MAmmoTH-VL-8B | 2024-12-06 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 62.0 | | Qwen2-VL-7B | 2024-09-18 |
| | | 61.8 | | InternVL2-40B | |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 61.1±0.2 | | Qwen-VL-Plus | 2023-08-24 |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ Link | 60.8 | | Mini-Gemini-HD-BS | 2024-03-27 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 60.8 | 2B | InternVL2.5-2B | 2024-12-06 |
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | ✓ Link | 60.6 | | MAmmoTH-VL-8B (SI) | 2024-12-06 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 60.6 | 4B | InternVL2.5-4B | 2024-12-06 |
GPT-4 Technical Report | ✓ Link | 60.2±0.3 | | GPT-4V-Turbo-detail:low | 2023-03-15 |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ Link | 59.3 | | Mini-Gemini-HD | 2024-03-27 |
| | | 58.1±0.1 | | Claude 3 Opus (claude-3-opus-20240229) | |
CogVLM2: Visual Language Models for Image and Video Understanding | ✓ Link | 58.0 | | GLM-4V-9B | 2024-08-29 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 57.5 | | LLaVA-OneVision-7B | 2024-08-06 |
| | | 57.4 | 34B | LLaVA-NeXT-34B | |
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | ✓ Link | 57.3 | 7B | Meteor | 2024-05-24 |
CROME: Cross-Modal Adapters for Efficient Multimodal LLM | | 55.1 | | CROME (Vicuna-13B) | 2024-08-13 |
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | ✓ Link | 54.9 | | IXC2-4KHD | 2024-04-09 |
| | | 54.7 | 15B | Weitu-VL-1.0 | |
TroL: Traversal of Layers for Large Language and Vision Models | ✓ Link | 54.7 | 7B | TroL-7B | 2024-06-18 |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ Link | 53.0 | | Mini-Gemini | 2024-03-27 |
CogVLM: Visual Expert for Pretrained Language Models | ✓ Link | 52.8 | 17B | CogVLM (Vicuna-7B) | 2023-11-06 |
CogAgent: A Visual Language Model for GUI Agents | ✓ Link | 52.8 | 18B | CogAgent | 2023-12-14 |
Gamified crowd-sourcing of high-quality data for visual fine-tuning | | 52.43 | | Qwen2-VL-2B (finetuned on GAP-VQA train) | 2024-10-05 |
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ Link | 52.10 | 26B | InternVL2-26B (SGP, token ratio 9%) | 2024-12-04 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 52.0 | | MM1.5-30B | 2024-09-30 |
Gamified crowd-sourcing of high-quality data for visual fine-tuning | | 51.789 | | MiniCPM-Llama3-V-2.5-8B (finetuned on GAP-VQA train) | 2024-10-05 |
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ Link | 51.7 | | IXC-2.5-7B | 2024-07-03 |
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | ✓ Link | 51.2 | | InternLM-XComposer2 | 2024-01-29 |
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ Link | 51.2 | 3B | Lyra-Mini | 2024-12-12 |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | ✓ Link | 51.0 | 7B | CuMo-7B | 2024-05-09 |
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ Link | 50.9 | | TACO (Qwen2-7B / SigLIP) | 2024-12-07 |
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 50.7 | | Qwen-VL-Chat (+ SFT (GPT-4V in VLFeedback)) | 2024-10-12 |
POINTS: Improving Your Vision-language Model with Affordable Strategies | | 50.0 | | POINTS-9B | 2024-09-07 |
VILA²: VILA Augmented VILA | | 50.0 | | VILA²-8B | 2024-07-24 |
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | ✓ Link | 50.0 | | Janus-Pro-7B | 2025-01-29 |
Silkie: Preference Distillation for Large Visual Language Models | | 49.9 | 7B | Silkie | 2023-12-17 |
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 49.9 | | Silkie (Qwen-VL-Chat + DPO w/ VLFeedback) | 2024-10-12 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 49.5 | | Qwen2-VL-2B | 2024-09-18 |
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | ✓ Link | 49.0 | 3.2B | FlashSloth-HD | 2024-12-05 |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ Link | 48.9 | 40B | InternVL 1.2 | 2024-04-25 |
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs | | 48.8 | | SEA-PRIME (Vicuna-13B) | 2024-08-21 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 48.8 | 1B | InternVL2.5-1B | 2024-12-06 |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | 48.7 | | MM1-30B-Chat | 2024-03-14 |
Towards Semantic Equivalence of Tokenization in Multimodal LLM | | 48.7 | 13B | SETOKIM (13B) | 2024-06-07 |
Generative Multimodal Models are In-Context Learners | ✓ Link | 48.5 | 37B | Emu2-Chat | 2023-12-20 |
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning | ✓ Link | 48.5 | | MG-LLaVA (34B) | 2024-06-25 |
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | ✓ Link | 47.9 | | SPHINX-Plus | 2024-02-08 |
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | ✓ Link | 45.9 | 7B | ConvLLaVA | 2024-05-24 |
VILA: On Pre-training for Visual Language Models | ✓ Link | 45.7 | | VILA-13B | 2023-12-12 |
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ Link | 45.7 | | TACO (LLaMA3-8B / SigLIP) | 2024-12-07 |
| | | 45.3 | | HPT 1.5 Edge | |
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ Link | 45.2 | | TACO (LLaMA3-8B / CLIP) | 2024-12-07 |
Enhancing Large Vision Language Models with Self-Training on Image Comprehension | ✓ Link | 45.0 | 7B | LLaVA-v1.6 (7B, w/ STIC) | 2024-05-30 |
H2OVL-Mississippi Vision Language Models Technical Report | | 44.7 | | H2OVL-Mississippi-2B | 2024-10-17 |
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | ✓ Link | 44.7 | 7B | PIIP-LLaVA (Vicuna-7B, ConvNeXt-L, CLIP-L) | 2025-01-14 |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ✓ Link | 44.6±0.2 | | MM-ReAct-GPT-4 | 2023-03-20 |
Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ Link | 44.6 | 4B | Imp-4B | 2024-05-20 |
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 44.2 | | LLaVA-Next-Mistral-7b (+ DPO w/ VLFeedback) | 2024-10-12 |
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | | 44.1 | | MGM-7B+RP | 2024-08-08 |
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 44.1 | | LLaVA-Next-Vicuna-7b (+ DPO w/ VLFeedback) | 2024-10-12 |
Multi-modal Auto-regressive Modeling via Visual Words | ✓ Link | 44.0 | | VW-LMM | 2024-03-12 |
MoAI: Mixture of All Intelligence for Large Language and Vision Models | ✓ Link | 43.7 | 7B | MoAI | 2024-03-12 |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | 43.7 | | MM1-3B-Chat | 2024-03-14 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 43.7 | | MM1.5-3B-MoE | 2024-09-30 |
Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ Link | 43.3 | 3B | Imp-3B | 2024-05-20 |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | ✓ Link | 43.1 | 13B | ShareGPT4V-13B | 2023-11-21 |
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | ✓ Link | 42.9 | | Mini-Gemini (+MoCa) | 2024-10-09 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 42.2 | | MM1.5-7B | 2024-09-30 |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | | 42.1 | | MM1-7B-Chat | 2024-03-14 |
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | ✓ Link | 41.9 | 3.2B | FlashSloth | 2024-12-05 |
DeepSeek-VL: Towards Real-World Vision-Language Understanding | ✓ Link | 41.5 | | DeepSeek-VL | 2024-03-08 |
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | 41.4 | 13B | LLaVA1.5-13B-BPO | 2024-03-13 |
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | ✓ Link | 41.3 | | ASMv2 | 2024-02-29 |
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression | | 41.3 | | FocusLLaVA | 2024-11-21 |
Self-Supervised Visual Preference Alignment | ✓ Link | 41.0 | | SeVa-13B | 2024-04-16 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 41.0 | | MM1.5-3B | 2024-09-30 |
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | ✓ Link | 40.4 | | LLaVA-1.5-7B (VG-S) | 2024-12-09 |
CoLLaVO: Crayon Large Language and Vision mOdel | ✓ Link | 40.3 | 7B | CoLLaVO | 2024-02-17 |
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | ✓ Link | 40.2 | | SPHINX-2k | 2023-11-13 |
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | ✓ Link | 40.2 | 13B | LLaVA-1.5 (LVIS-Instruct4V) | 2023-11-13 |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ Link | 40.1 | | mPLUG-Owl3 | 2024-08-09 |
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | | 40.1 | 2B | Mono-InternVL-2B | 2024-10-10 |
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance | | 39.90 | 13B | LLaVA1.5-13B-MDA | 2024-11-21 |
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | ✓ Link | 39.8 | | LLaVA-VT (Vicuna-13B) | 2024-03-27 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 39.8 | | MM1.5-1B-MoE | 2024-09-30 |
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | ✓ Link | 39.8 | | Janus-Pro-1B | 2025-01-29 |
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | ✓ Link | 39.7 | | SQ-LLaVA∗ | 2024-03-17 |
OmniFusion Technical Report | | 39.40 | | OmniFusion (grid split + ruDocVQA) | 2024-04-09 |
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | | 39.3 | | DeepStack-L-HD (Vicuna-13B) | 2024-06-06 |
From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information | | 38.9 | | LAF-13B | 2024-01-31 |
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | | 38.9 | | InfiMM-HD | 2024-03-03 |
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs | ✓ Link | 38.8 | | InternLM-XC2 + MMDU-45k | 2024-06-17 |
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | ✓ Link | 38.5 | | LLaVA-1.5-7B (DC-S) | 2024-12-09 |
MouSi: Poly-Visual-Expert Vision-Language Models | ✓ Link | 38.4 | 7.9B | LayoutLMv3+ConvNeXt+CLIP | 2024-01-30 |
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | ✓ Link | 38.0 | 13B | VOLCANO 13B | 2023-11-13 |
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | ✓ Link | 37.9 | | LLaVA-1.5+MMInstruct (Vicuna-13B) | 2024-07-22 |
Calibrated Self-Rewarding Vision Language Models | ✓ Link | 37.8 | 13B | LLaVA-1.5-13B (+CSR) | 2024-05-23 |
What If We Recaption Billions of Web Images with LLaMA-3? | | 37.8 | | LLaVA-1.5-LLaMA3-8B | 2024-06-12 |
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | ✓ Link | 37.8 | | LLaVA-1.5 + DenseFusion-1M (Vicuna-7B) | 2024-07-11 |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | ✓ Link | 37.6 | 7B | ShareGPT4V-7B | 2023-11-21 |
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models | ✓ Link | 37.6 | | LLaVA-1.5+CoS | 2024-03-19 |
COCO is "ALL'' You Need for Visual Instruction Fine-tuning | | 37.5 | | LLaVA-COCO-13B | 2024-01-17 |
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | ✓ Link | 37.5 | | LLaVA-S^2 + DenseFusion-1M (Vicuna-7B) | 2024-07-11 |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 37.4 | | MM1.5-1B | 2024-09-30 |
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | ✓ Link | 37.3 | | Dynamic-LLaVA-13B | 2024-12-01 |
Self-Supervised Visual Preference Alignment | ✓ Link | 37.2 | | SeVa-7B | 2024-04-16 |
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | ✓ Link | 37.2 | | SoM-LLaVA-1.5-T | 2024-04-25 |
Emu3: Next-Token Prediction is All You Need | ✓ Link | 37.2 | | Emu3 | 2024-09-27 |
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | ✓ Link | 37.1 | | LLaVA-Instruct (Vicuna-1.5-13B) | 2024-06-28 |
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | | 37.0 | | ILLUME | 2024-12-09 |
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | | 36.8 | 7B | LLaVA1.5-7B-BPO | 2024-03-13 |
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding | ✓ Link | 36.6 | | LLaVA-1.5-13B (+ MMFuser) | 2024-10-15 |
CaMML: Context-Aware Multimodal Learner for Large Models | ✓ Link | 36.4 | | CaMML-13B | 2024-01-06 |
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models | ✓ Link | 36.4 | | LLaVA-65B (Data Mixing) | 2023-09-18 |
Improved Baselines with Visual Instruction Tuning | ✓ Link | 36.3±0.2 | 13B | LLaVA-1.5-13B | 2023-10-05 |
Emu: Generative Pretraining in Multimodality | ✓ Link | 36.3±0.3 | 14B | Emu-14B | 2023-07-11 |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | ✓ Link | 36.3±0.1 | 7B | mPLUG-Owl2 | 2023-11-07 |
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ✓ Link | 36.2 | | Vary-base | 2023-12-11 |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | ✓ Link | 36.1 | | StableLLaVA | 2023-08-20 |
DreamLLM: Synergistic Multimodal Comprehension and Creation | ✓ Link | 35.9 | 7B | DreamLLM-7B | 2023-09-20 |
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | ✓ Link | 35.9 | | MoE-LLaVA-2.7B×4-Top2 | 2024-01-29 |
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | ✓ Link | 35.9 | | SoM-LLaVA-1.5 | 2024-04-25 |
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | ✓ Link | 35.9 | | Dragonfly (Llama3-8B) | 2024-06-03 |
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models | ✓ Link | 35.7 | | Ferret-v2-13B | 2024-04-11 |
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | | 35.6 | | AlignGPT (Vicuna-13B) | 2024-05-23 |
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | ✓ Link | 35.5 | | LLaVA-HR-X | 2024-03-05 |
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | ✓ Link | 35.5 | | SQ-LLaVA | 2024-03-17 |
LOVA3: Learning to Visual Question Answering, Asking and Assessment | ✓ Link | 35.2 | 7B | LOVA³ | 2024-05-23 |
Mixture-of-Subspaces in Low-Rank Adaptation | ✓ Link | 35.2 | | LLaVA-InternLM2-7B-ViT + MoSLoRA | 2024-06-16 |
Mixture-of-Subspaces in Low-Rank Adaptation | ✓ Link | 35.2 | | InternLM2+ViT (QMoSLoRA) | 2024-06-16 |
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance | | 35.20 | 7B | LLaVA1.5-7B-MDA | 2024-11-21 |
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge | | 35.1 | | Mipha-3B+ | 2024-07-05 |
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | ✓ Link | 35.0±0.0 | 13B | LLaVA-Plus-13B (All Tools, V1.3, 336px) | 2023-11-09 |
Merlin:Empowering Multimodal LLMs with Foresight Minds | | 34.9 | | Merlin | 2023-11-30 |
Improving Multi-modal Large Language Model through Boosting Vision Capabilities | | 34.8 | | Arcana | 2024-10-17 |
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | ✓ Link | 34.5 | | INF-LLaVA | 2024-07-23 |
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | ✓ Link | 34.4 | | LLaVA-1.5+MMInstruct (Vicuna-7B) | 2024-07-22 |
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | ✓ Link | 34.3 | | Janus | 2024-10-17 |
TokenPacker: Efficient Visual Projector for Multimodal LLM | ✓ Link | 34.1 | | LLaVA-TokenPacker (Vicuna-13B) | 2024-07-02 |
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models | | 34.0 | | γ-MoD-LLaVA-HR | 2024-10-17 |
Calibrated Self-Rewarding Vision Language Models | ✓ Link | 33.9 | 7B | LLaVA-1.5-7B (CSR) | 2024-05-23 |
Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models | ✓ Link | 33.6 | | DynMOE-LLaVA | 2024-05-23 |
Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ Link | 33.5 | 2B | Imp-2B | 2024-05-20 |
InfMLLM: A Unified Framework for Visual-Language Tasks | ✓ Link | 33.4 | | InfMLLM-7B-Chat | 2023-11-12 |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 33.2 | 7B | Video-LaVIT | 2024-02-05 |
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | ✓ Link | 32.9 | | LLaVA-Instruct (Vicuna-1.5-7B) | 2024-06-28 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 32.9 | | VisionZip (Retain 128 Tokens, fine-tuning) | 2024-12-05 |
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | ✓ Link | 32.8 | | Uni-MoE | 2024-05-18 |
VL-Mamba: Exploring State Space Models for Multimodal Learning | | 32.6 | | VL-Mamba (Mamba LLM-2.8B) | 2024-03-20 |
Enhancing Large Vision Language Models with Self-Training on Image Comprehension | ✓ Link | 32.6 | 7B | LLaVA-v1.5 (7B, w/ STIC) | 2024-05-30 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 32.6 | | VisionZip (Retain 192 Tokens, fine-tuning) | 2024-12-05 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 32.6 | | VisionZip (Retain 128 Tokens) | 2024-12-05 |
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | ✓ Link | 32.2 | | LLaVA-v1.5 (+MoCa) | 2024-10-09 |
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | ✓ Link | 32.2 | | Dynamic-LLaVA-7B | 2024-12-01 |
Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models | ✓ Link | 32.1 | 3B | Mipha-3B | 2024-03-10 |
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | ✓ Link | 32.0 | 7B | VOLCANO 7B | 2023-11-13 |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ Link | 32.0 | | Video-LLaVA | 2023-11-16 |
TinyLLaVA: A Framework of Small-scale Large Multimodal Models | ✓ Link | 32.0 | 3.1B | TinyLLaVA-share-Sig-Ph | 2024-02-22 |
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | ✓ Link | 31.8 | | LLaVA-VT (Vicuna-7B) | 2024-03-27 |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ✓ Link | 31.7±0.1 | 7B | LRV-Instruction-7B | 2023-06-26 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 31.7 | | VisionZip (Retain 192 Tokens) | 2024-12-05 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 31.7 | | VisionZip (Retain 64 Tokens) | 2024-12-05 |
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | ✓ Link | 31.6 | 7B | LLaVA-1.5-7B (+ SIMA) | 2024-05-24 |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 31.4±0.1 | 7B | LLaMA-Adapter v2-7B | 2023-04-28 |
Explore the Limits of Omni-modal Pretraining at Scale | ✓ Link | 31.4 | | MiCo-Chat-7B | 2024-06-13 |
TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition | ✓ Link | 31.2 | | LLaVA-1.5-7B + TeamLoRA | 2024-08-19 |
Improved Baselines with Visual Instruction Tuning | ✓ Link | 31.1±0.2 | 7B | LLaVA-1.5-7B | 2023-10-05 |
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | | 31.0 | | RoboCodeX-13B | 2024-02-25 |
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | ✓ Link | 31.0 | | HyperLLaVA | 2024-03-20 |
Visual Agents as Fast and Slow Thinkers | ✓ Link | 31.0 | | FAST (Vicuna-7B) | 2024-08-16 |
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | ✓ Link | 30.9 | | JanusFlow | 2024-11-12 |
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | | 30.8 | | AlignGPT (Vicuna-7B) | 2024-05-23 |
Efficient Large Multi-modal Models via Visual Context Compression | ✓ Link | 30.7 | | LLaVolta | 2024-06-28 |
Aligned Vector Quantization for Edge-Cloud Collaborative Vision-Language Models | | 30.7 | | LLaVA-AlignedVQ | 2024-11-08 |
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | ✓ Link | 30.4 | 7.3B | LLaVA-1.5-HACL | 2023-12-12 |
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model | | 30.4 | 7.2B | MaVEn | 2024-08-22 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ Link | 30.2 | | VisionZip (Retain 64 Tokens, fine-tuning) | 2024-12-05 |
H2OVL-Mississippi Vision Language Models Technical Report | | 30.0 | | H2OVL-Mississippi-0.8B | 2024-10-17 |
RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation | | 29.7 | | RoboMamba | 2024-06-06 |
TokenPacker: Efficient Visual Projector for Multimodal LLM | ✓ Link | 29.6 | | LLaVA-TokenPacker (Vicuna-7B) | 2024-07-02 |
OneLLM: One Framework to Align All Modalities with Language | ✓ Link | 29.1 | 7B | OneLLM-7B | 2023-12-06 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 29.1 | | LLaVA-OneVision-0.5B | 2024-08-06 |
Small Language Model Meets with Reinforced Vision Vocabulary | | 29.0 | 1.8B | Vary-toy | 2024-01-23 |
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model | ✓ Link | 28.9 | | LLaVA-Phi | 2024-01-04 |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ✓ Link | 27.9±0.1 | | MM-ReAct-GPT-3.5 | 2023-03-20 |
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | | 27.80 | | MMAR-7B | 2024-10-14 |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | ✓ Link | 27.7 | 7B | SEAL (7B) | 2023-12-21 |
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | ✓ Link | 27.5±0.3 | 7B | LLaVA-Plus-7B (All Tools) | 2023-11-09 |
OtterHD: A High-Resolution Multi-modality Model | ✓ Link | 26.3 | 8B | OtterHD-8B | 2023-11-07 |
Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models | | 25.6 | | TGA-7B | 2024-10-16 |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ Link | 24.8±0.2 | 9B | OpenFlamingo-9B (MPT-7B) | 2023-08-02 |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ Link | 24.7±0.3 | 9B | Otter-9B (MPT-7B) | 2023-06-08 |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ Link | 24.6±0.2 | 9B | Otter-9B (LLaMA) | 2023-06-08 |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ Link | 24.4±0.4 | 14B | MiniGPT-4-14B | 2023-04-20 |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ Link | 23.5 | | LinVT | 2024-12-06 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 22.4±0.2 | 12B | BLIP-2-12B | 2023-01-30 |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ Link | 22.1±0.1 | 8B | MiniGPT-4-8B | 2023-04-20 |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ Link | 21.8±0.1 | 9B | OpenFlamingo-9B (LLaMA-7B) | 2023-08-02 |
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model | ✓ Link | 21.8 | | Xmodel-VLM (Xmodel-LM 1.1B) | 2024-05-15 |
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild | ✓ Link | 19.4 | | TextBind | 2023-09-14 |
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | | 18.49 | | MMAR-0.5B | 2024-10-14 |