OpenCodePapers

Visual Question Answering on MM-Vet

Visual Question Answering
Dataset Link
Results over time
Leaderboard
A ✓ in the Code column marks entries with a public code link; blank cells were not reported on the source page.

| Paper | Code | GPT-4 score | Params | Model Name | Release Date |
|---|---|---|---|---|---|
|  |  | 81.2±0.4 |  | gemini-2.0-flash-exp |  |
|  |  | 78.1±0.2 |  | gemini-exp-1206 |  |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 76.9±0.1 |  | Gemini 1.5 Pro (gemini-1.5-pro-002) | 2024-03-08 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning |  | 74.24 |  | MMCTAgent (GPT-4 + GPT-4V) | 2024-05-28 |
| Claude 3.5 Sonnet Model Card Addendum |  | 74.2±0.2 |  | Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 2024-06-24 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 74.0 |  | Qwen2-VL-72B | 2024-09-18 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ | 72.3 | 78B | InternVL2.5-78B | 2024-12-06 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models |  | 72.2 |  | GPT-4o +text rationale +IoT | 2024-05-22 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ | 71.4 | 74B | Lyra-Pro | 2024-12-12 |
| CogVLM2: Visual Language Models for Image and Video Understanding | ✓ | 71.1 |  | GLM-4V-Plus | 2024-08-29 |
| Phantom of Latent for Large Language and Vision Models | ✓ | 70.8 |  | Phantom-7B | 2024-09-23 |
| GPT-4 Technical Report | ✓ | 69.3±0.1 |  | GPT-4o (gpt-4o-2024-05-13) | 2023-03-15 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ | 68.8 | 38B | InternVL2.5-38B | 2024-12-06 |
| GPT-4 Technical Report | ✓ | 68.6±0.1 |  | gpt-4o-mini-2024-07-18 | 2023-03-15 |
| GPT-4 Technical Report | ✓ | 67.7±0.3 |  | GPT-4V | 2023-03-15 |
| GPT-4 Technical Report | ✓ | 67.6±0.1 |  | GPT-4V-Turbo-detail:high | 2023-03-15 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 66.6±0.5 |  | Qwen-VL-Max | 2023-08-24 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 65.8±0.1 |  | Gemini 1.5 Pro (gemini-1.5-pro) | 2024-03-08 |
| A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ | 65.60 | 26B | InternVL2-26B (SGP, token ratio 64%) | 2024-12-04 |
| Baichuan-Omni Technical Report | ✓ | 65.4 |  | Baichuan-Omni (7B) | 2024-10-11 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ | 65.0 | 26B | InternVL2.5-26B | 2024-12-06 |
| Gamified crowd-sourcing of high-quality data for visual fine-tuning |  | 64.954 |  | Qwen2-VL-7B (finetuned on GAP-VQA train) | 2024-10-05 |
|  |  | 64.4 |  | InternVL2-Llama3-76B |  |
| Gemini: A Family of Highly Capable Multimodal Models | ✓ | 64.3±0.4 |  | Gemini 1.0 Pro Vision (gemini-pro-vision) | 2023-12-19 |
| CogVLM: Visual Expert for Pretrained Language Models | ✓ | 63.9 |  | GLM4 Vision | 2023-11-06 |
| LLaVA-OneVision: Easy Visual Task Transfer | ✓ | 63.7 |  | LLaVA-OneVision-72B | 2024-08-06 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ | 63.5 | 9B | Lyra-Base | 2024-12-12 |
| A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ | 63.20 | 26B | InternVL2-26B (SGP, token ratio 35%) | 2024-12-04 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ | 62.8 | 26B | InternVL 1.5 | 2024-04-25 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ | 62.8 | 8B | InternVL2.5-8B | 2024-12-06 |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | ✓ | 62.3 |  | MAmmoTH-VL-8B | 2024-12-06 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 62.0 |  | Qwen2-VL-7B | 2024-09-18 |
|  |  | 61.8 |  | InternVL2-40B |  |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 61.1±0.2 |  | Qwen-VL-Plus | 2023-08-24 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ | 60.8 |  | Mini-Gemini-HD-BS | 2024-03-27 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ | 60.8 | 2B | InternVL2.5-2B | 2024-12-06 |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | ✓ | 60.6 |  | MAmmoTH-VL-8B (SI) | 2024-12-06 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ | 60.6 | 4B | InternVL2.5-4B | 2024-12-06 |
| GPT-4 Technical Report | ✓ | 60.2±0.3 |  | GPT-4V-Turbo-detail:low | 2023-03-15 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ | 59.3 |  | Mini-Gemini-HD | 2024-03-27 |
|  |  | 58.1±0.1 |  | Claude 3 Opus (claude-3-opus-20240229) |  |
| CogVLM2: Visual Language Models for Image and Video Understanding | ✓ | 58.0 |  | GLM-4V-9B | 2024-08-29 |
| LLaVA-OneVision: Easy Visual Task Transfer | ✓ | 57.5 |  | LLaVA-OneVision-7B | 2024-08-06 |
|  |  | 57.4 | 34B | LLaVA-NeXT-34B |  |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | ✓ | 57.3 | 7B | Meteor | 2024-05-24 |
| CROME: Cross-Modal Adapters for Efficient Multimodal LLM |  | 55.1 |  | CROME (Vicuna-13B) | 2024-08-13 |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | ✓ | 54.9 |  | IXC2-4KHD | 2024-04-09 |
|  |  | 54.7 | 15B | Weitu-VL-1.0 |  |
| TroL: Traversal of Layers for Large Language and Vision Models | ✓ | 54.7 | 7B | TroL-7B | 2024-06-18 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | ✓ | 53.0 |  | Mini-Gemini | 2024-03-27 |
| CogVLM: Visual Expert for Pretrained Language Models | ✓ | 52.8 | 17B | CogVLM (Vicuna-7B) | 2023-11-06 |
| CogAgent: A Visual Language Model for GUI Agents | ✓ | 52.8 | 18B | CogAgent | 2023-12-14 |
| Gamified crowd-sourcing of high-quality data for visual fine-tuning |  | 52.43 |  | Qwen2-VL-2B (finetuned on GAP-VQA train) | 2024-10-05 |
| A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | ✓ | 52.10 | 26B | InternVL2-26B (SGP, token ratio 9%) | 2024-12-04 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning |  | 52.0 |  | MM1.5-30B | 2024-09-30 |
| Gamified crowd-sourcing of high-quality data for visual fine-tuning |  | 51.789 |  | MiniCPM-Llama3-V-2.5-8B (finetuned on GAP-VQA train) | 2024-10-05 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ | 51.7 |  | IXC-2.5-7B | 2024-07-03 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | ✓ | 51.2 |  | InternLM-XComposer2 | 2024-01-29 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | ✓ | 51.2 | 3B | Lyra-Mini | 2024-12-12 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | ✓ | 51.0 | 7B | CuMo-7B | 2024-05-09 |
| TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ | 50.9 |  | TACO (Qwen2-7B / SigLIP) | 2024-12-07 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment |  | 50.7 |  | Qwen-VL-Chat (+ SFT (GPT-4V in VLFeedback)) | 2024-10-12 |
| POINTS: Improving Your Vision-language Model with Affordable Strategies |  | 50.0 |  | POINTS-9B | 2024-09-07 |
| VILA$^2$: VILA Augmented VILA |  | 50.0 |  | VILA^2-8B | 2024-07-24 |
| Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | ✓ | 50.0 |  | Janus-Pro-7B | 2025-01-29 |
| Silkie: Preference Distillation for Large Visual Language Models |  | 49.9 | 7B | Silkie | 2023-12-17 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment |  | 49.9 |  | Silkie (Qwen-VL-Chat + DPO w/ VLFeedback) | 2024-10-12 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 49.5 |  | Qwen2-VL-2B | 2024-09-18 |
| FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | ✓ | 49.0 | 3.2B | FlashSloth-HD | 2024-12-05 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | ✓ | 48.9 | 40B | InternVL 1.2 | 2024-04-25 |
| SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs |  | 48.8 |  | SEA-PRIME (Vicuna-13B) | 2024-08-21 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ | 48.8 | 1B | InternVL2.5-1B | 2024-12-06 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training |  | 48.7 |  | MM1-30B-Chat | 2024-03-14 |
| Towards Semantic Equivalence of Tokenization in Multimodal LLM |  | 48.7 | 13B | SETOKIM (13B) | 2024-06-07 |
| Generative Multimodal Models are In-Context Learners | ✓ | 48.5 | 37B | Emu2-Chat | 2023-12-20 |
| MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning | ✓ | 48.5 |  | MG-LLaVA (34B) | 2024-06-25 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | ✓ | 47.9 |  | SPHINX-Plus | 2024-02-08 |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | ✓ | 45.9 | 7B | ConvLLaVA | 2024-05-24 |
| VILA: On Pre-training for Visual Language Models | ✓ | 45.7 |  | VILA-13B | 2023-12-12 |
| TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ | 45.7 |  | TACO (LLaMA3-8B / SigLIP) | 2024-12-07 |
|  |  | 45.3 |  | HPT 1.5 Edge |  |
| TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | ✓ | 45.2 |  | TACO (LLaMA3-8B / CLIP) | 2024-12-07 |
| Enhancing Large Vision Language Models with Self-Training on Image Comprehension | ✓ | 45.0 | 7B | LLaVA-v1.6 (7B, w/ STIC) | 2024-05-30 |
| H2OVL-Mississippi Vision Language Models Technical Report |  | 44.7 |  | H2OVL-Mississippi-2B | 2024-10-17 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | ✓ | 44.7 | 7B | PIIP-LLaVA (Vicuna-7B, ConvNeXt-L, CLIP-L) | 2025-01-14 |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ✓ | 44.6±0.2 |  | MM-ReAct-GPT-4 | 2023-03-20 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ | 44.6 | 4B | Imp-4B | 2024-05-20 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment |  | 44.2 |  | LLaVA-Next-Mistral-7b (+ DPO w/ VLFeedback) | 2024-10-12 |
| Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models |  | 44.1 |  | MGM-7B+RP | 2024-08-08 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment |  | 44.1 |  | LLaVA-Next-Vicuna-7b (+ DPO w/ VLFeedback) | 2024-10-12 |
| Multi-modal Auto-regressive Modeling via Visual Words | ✓ | 44.0 |  | VW-LMM | 2024-03-12 |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | ✓ | 43.7 | 7B | MoAI | 2024-03-12 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training |  | 43.7 |  | MM1-3B-Chat | 2024-03-14 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning |  | 43.7 |  | MM1.5-3B-MoE | 2024-09-30 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ | 43.3 | 3B | Imp-3B | 2024-05-20 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | ✓ | 43.1 | 13B | ShareGPT4V-13B | 2023-11-21 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | ✓ | 42.9 |  | Mini-Gemini (+MoCa) | 2024-10-09 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning |  | 42.2 |  | MM1.5-7B | 2024-09-30 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training |  | 42.1 |  | MM1-7B-Chat | 2024-03-14 |
| FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | ✓ | 41.9 | 3.2B | FlashSloth | 2024-12-05 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | ✓ | 41.5 |  | DeepSeek-VL | 2024-03-08 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization |  | 41.4 | 13B | LLaVA1.5-13B-BPO | 2024-03-13 |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | ✓ | 41.3 |  | ASMv2 | 2024-02-29 |
| FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression |  | 41.3 |  | FocusLLaVA | 2024-11-21 |
| Self-Supervised Visual Preference Alignment | ✓ | 41.0 |  | SeVa-13B | 2024-04-16 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning |  | 41.0 |  | MM1.5-3B | 2024-09-30 |
| ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | ✓ | 40.4 |  | LLaVA-1.5-7B (VG-S) | 2024-12-09 |
| CoLLaVO: Crayon Large Language and Vision mOdel | ✓ | 40.3 | 7B | CoLLaVO | 2024-02-17 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | ✓ | 40.2 |  | SPHINX-2k | 2023-11-13 |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | ✓ | 40.2 | 13B | LLaVA-1.5 (LVIS-Instruct4V) | 2023-11-13 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ | 40.1 |  | mPLUG-Owl3 | 2024-08-09 |
| Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training |  | 40.1 | 2B | Mono-InternVL-2B | 2024-10-10 |
| Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance |  | 39.90 | 13B | LLaVA1.5-13B-MDA | 2024-11-21 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | ✓ | 39.8 |  | LLaVA-VT (Vicuna-13B) | 2024-03-27 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning |  | 39.8 |  | MM1.5-1B-MoE | 2024-09-30 |
| Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | ✓ | 39.8 |  | Janus-Pro-1B | 2025-01-29 |
| SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | ✓ | 39.7 |  | SQ-LLaVA∗ | 2024-03-17 |
| OmniFusion Technical Report |  | 39.40 |  | OmniFusion (grid split + ruDocVQA) | 2024-04-09 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs |  | 39.3 |  | DeepStack-L-HD (Vicuna-13B) | 2024-06-06 |
| From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information |  | 38.9 |  | LAF-13B | 2024-01-31 |
| InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding |  | 38.9 |  | InfiMM-HD | 2024-03-03 |
| MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs | ✓ | 38.8 |  | InternLM-XC2 + MMDU-45k | 2024-06-17 |
| ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | ✓ | 38.5 |  | LLaVA-1.5-7B (DC-S) | 2024-12-09 |
| MouSi: Poly-Visual-Expert Vision-Language Models | ✓ | 38.4 | 7.9B | LayoutLMv3+ConvNeXt+CLIP | 2024-01-30 |
| Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | ✓ | 38.0 | 13B | VOLCANO 13B | 2023-11-13 |
| MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | ✓ | 37.9 |  | LLaVA-1.5+MMInstruct (Vicuna-13B) | 2024-07-22 |
| Calibrated Self-Rewarding Vision Language Models | ✓ | 37.8 | 13B | LLaVA-1.5-13B (+CSR) | 2024-05-23 |
| What If We Recaption Billions of Web Images with LLaMA-3? |  | 37.8 |  | LLaVA-1.5-LLaMA3-8B | 2024-06-12 |
| DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | ✓ | 37.8 |  | LLaVA-1.5 + DenseFusion-1M (Vicuna-7B) | 2024-07-11 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | ✓ | 37.6 | 7B | ShareGPT4V-7B | 2023-11-21 |
| Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models | ✓ | 37.6 |  | LLaVA-1.5+CoS | 2024-03-19 |
| COCO is "ALL" You Need for Visual Instruction Fine-tuning |  | 37.5 |  | LLaVA-COCO-13B | 2024-01-17 |
| DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | ✓ | 37.5 |  | LLaVA-S^2 + DenseFusion-1M (Vicuna-7B) | 2024-07-11 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning |  | 37.4 |  | MM1.5-1B | 2024-09-30 |
| Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | ✓ | 37.3 |  | Dynamic-LLaVA-13B | 2024-12-01 |
| Self-Supervised Visual Preference Alignment | ✓ | 37.2 |  | SeVa-7B | 2024-04-16 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | ✓ | 37.2 |  | SoM-LLaVA-1.5-T | 2024-04-25 |
| Emu3: Next-Token Prediction is All You Need | ✓ | 37.2 |  | Emu3 | 2024-09-27 |
| MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | ✓ | 37.1 |  | LLaVA-Instruct (Vicuna-1.5-13B) | 2024-06-28 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance |  | 37.0 |  | ILLUME | 2024-12-09 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization |  | 36.8 | 7B | LLaVA1.5-7B-BPO | 2024-03-13 |
| MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding | ✓ | 36.6 |  | LLaVA-1.5-13B (+ MMFuser) | 2024-10-15 |
| CaMML: Context-Aware Multimodal Learner for Large Models | ✓ | 36.4 |  | CaMML-13B | 2024-01-06 |
| An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models | ✓ | 36.4 |  | LLaVA-65B (Data Mixing) | 2023-09-18 |
| Improved Baselines with Visual Instruction Tuning | ✓ | 36.3±0.2 | 13B | LLaVA-1.5-13B | 2023-10-05 |
| Emu: Generative Pretraining in Multimodality | ✓ | 36.3±0.3 | 14B | Emu-14B | 2023-07-11 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | ✓ | 36.3±0.1 | 7B | mPLUG-Owl2 | 2023-11-07 |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ✓ | 36.2 |  | Vary-base | 2023-12-11 |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | ✓ | 36.1 |  | StableLLaVA | 2023-08-20 |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ✓ | 35.9 | 7B | DreamLLM-7B | 2023-09-20 |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | ✓ | 35.9 |  | MoE-LLaVA-2.7B×4-Top2 | 2024-01-29 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | ✓ | 35.9 |  | SoM-LLaVA-1.5 | 2024-04-25 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | ✓ | 35.9 |  | Dragonfly (Llama3-8B) | 2024-06-03 |
| Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models | ✓ | 35.7 |  | Ferret-v2-13B | 2024-04-11 |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability |  | 35.6 |  | AlignGPT (Vicuna-13B) | 2024-05-23 |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | ✓ | 35.5 |  | LLaVA-HR-X | 2024-03-05 |
| SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | ✓ | 35.5 |  | SQ-LLaVA | 2024-03-17 |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | ✓ | 35.2 | 7B | LOVA$^3$ | 2024-05-23 |
| Mixture-of-Subspaces in Low-Rank Adaptation | ✓ | 35.2 |  | LLaVA-InternLM2-7B-ViT + MoSLoRA | 2024-06-16 |
| Mixture-of-Subspaces in Low-Rank Adaptation | ✓ | 35.2 |  | InternLM2+ViT (QMoSLoRA) | 2024-06-16 |
| Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance |  | 35.20 | 7B | LLaVA1.5-7B-MDA | 2024-11-21 |
| Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge |  | 35.1 |  | Mipha-3B+ | 2024-07-05 |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | ✓ | 35.0±0.0 | 13B | LLaVA-Plus-13B (All Tools, V1.3, 336px) | 2023-11-09 |
| Merlin: Empowering Multimodal LLMs with Foresight Minds |  | 34.9 |  | Merlin | 2023-11-30 |
| Improving Multi-modal Large Language Model through Boosting Vision Capabilities |  | 34.8 |  | Arcana | 2024-10-17 |
| INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | ✓ | 34.5 |  | INF-LLaVA | 2024-07-23 |
| MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | ✓ | 34.4 |  | LLaVA-1.5+MMInstruct (Vicuna-7B) | 2024-07-22 |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | ✓ | 34.3 |  | Janus | 2024-10-17 |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | ✓ | 34.1 |  | LLaVA-TokenPacker (Vicuna-13B) | 2024-07-02 |
| $γ-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models |  | 34.0 |  | γ-MoD-LLaVA-HR | 2024-10-17 |
| Calibrated Self-Rewarding Vision Language Models | ✓ | 33.9 | 7B | LLaVA-1.5-7B (CSR) | 2024-05-23 |
| Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models | ✓ | 33.6 |  | DynMOE-LLaVA | 2024-05-23 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | ✓ | 33.5 | 2B | Imp-2B | 2024-05-20 |
| InfMLLM: A Unified Framework for Visual-Language Tasks | ✓ | 33.4 |  | InfMLLM-7B-Chat | 2023-11-12 |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ | 33.2 | 7B | Video-LaVIT | 2024-02-05 |
| MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | ✓ | 32.9 |  | LLaVA-Instruct (Vicuna-1.5-7B) | 2024-06-28 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ | 32.9 |  | VisionZip (Retain 128 Tokens, fine-tuning) | 2024-12-05 |
| Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | ✓ | 32.8 |  | Uni-MoE | 2024-05-18 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning |  | 32.6 |  | VL-Mamba (Mamba LLM-2.8B) | 2024-03-20 |
| Enhancing Large Vision Language Models with Self-Training on Image Comprehension | ✓ | 32.6 | 7B | LLaVA-v1.5 (7B, w/ STIC) | 2024-05-30 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ | 32.6 |  | VisionZip (Retain 192 Tokens, fine-tuning) | 2024-12-05 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ | 32.6 |  | VisionZip (Retain 128 Tokens) | 2024-12-05 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | ✓ | 32.2 |  | LLaVA-v1.5 (+MoCa) | 2024-10-09 |
| Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | ✓ | 32.2 |  | Dynamic-LLaVA-7B | 2024-12-01 |
| Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models | ✓ | 32.1 | 3B | Mipha-3B | 2024-03-10 |
| Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | ✓ | 32.0 | 7B | VOLCANO 7B | 2023-11-13 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ | 32.0 |  | Video-LLaVA | 2023-11-16 |
| TinyLLaVA: A Framework of Small-scale Large Multimodal Models | ✓ | 32.0 | 3.1B | TinyLLaVA-share-Sig-Ph | 2024-02-22 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | ✓ | 31.8 |  | LLaVA-VT (Vicuna-7B) | 2024-03-27 |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ✓ | 31.7±0.1 | 7B | LRV-Instruction-7B | 2023-06-26 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ | 31.7 |  | VisionZip (Retain 192 Tokens) | 2024-12-05 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ | 31.7 |  | VisionZip (Retain 64 Tokens) | 2024-12-05 |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | ✓ | 31.6 | 7B | LLaVA-1.5-7B (+ SIMA) | 2024-05-24 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ | 31.4±0.1 | 7B | LLaMA-Adapter v2-7B | 2023-04-28 |
| Explore the Limits of Omni-modal Pretraining at Scale | ✓ | 31.4 |  | MiCo-Chat-7B | 2024-06-13 |
| TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition | ✓ | 31.2 |  | LLaVA-1.5-7B + TeamLoRA | 2024-08-19 |
| Improved Baselines with Visual Instruction Tuning | ✓ | 31.1±0.2 | 7B | LLaVA-1.5-7B | 2023-10-05 |
| RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis |  | 31.0 |  | RoboCodeX-13B | 2024-02-25 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | ✓ | 31.0 |  | HyperLLaVA | 2024-03-20 |
| Visual Agents as Fast and Slow Thinkers | ✓ | 31.0 |  | FAST (Vicuna-7B) | 2024-08-16 |
| JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | ✓ | 30.9 |  | JanusFlow | 2024-11-12 |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability |  | 30.8 |  | AlignGPT (Vicuna-7B) | 2024-05-23 |
| Efficient Large Multi-modal Models via Visual Context Compression | ✓ | 30.7 |  | LLaVolta | 2024-06-28 |
| Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models |  | 30.7 |  | LLaVA-AlignedVQ | 2024-11-08 |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | ✓ | 30.4 | 7.3B | LLaVA-1.5-HACL | 2023-12-12 |
| MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model |  | 30.4 | 7.2B | MaVEn | 2024-08-22 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | ✓ | 30.2 |  | VisionZip (Retain 64 Tokens, fine-tuning) | 2024-12-05 |
| H2OVL-Mississippi Vision Language Models Technical Report |  | 30.0 |  | H2OVL-Mississippi-0.8B | 2024-10-17 |
| RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation |  | 29.7 |  | RoboMamba | 2024-06-06 |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | ✓ | 29.6 |  | LLaVA-TokenPacker (Vicuna-7B) | 2024-07-02 |
| OneLLM: One Framework to Align All Modalities with Language | ✓ | 29.1 | 7B | OneLLM-7B | 2023-12-06 |
| LLaVA-OneVision: Easy Visual Task Transfer | ✓ | 29.1 |  | LLaVA-OneVision-0.5B | 2024-08-06 |
| Small Language Model Meets with Reinforced Vision Vocabulary |  | 29.0 | 1.8B | Vary-toy | 2024-01-23 |
| LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model | ✓ | 28.9 |  | LLaVA-Phi | 2024-01-04 |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ✓ | 27.9±0.1 |  | MM-ReAct-GPT-3.5 | 2023-03-20 |
| MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling |  | 27.80 |  | MMAR-7B | 2024-10-14 |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | ✓ | 27.7 | 7B | SEAL (7B) | 2023-12-21 |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | ✓ | 27.5±0.3 | 7B | LLaVA-Plus-7B (All Tools) | 2023-11-09 |
| OtterHD: A High-Resolution Multi-modality Model | ✓ | 26.3 | 8B | OtterHD-8B | 2023-11-07 |
| Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models |  | 25.6 |  | TGA-7B | 2024-10-16 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ | 24.8±0.2 | 9B | OpenFlamingo-9B (MPT-7B) | 2023-08-02 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ | 24.7±0.3 | 9B | Otter-9B (MPT-7B) | 2023-06-08 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | ✓ | 24.6±0.2 | 9B | Otter-9B (LLaMA) | 2023-06-08 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ | 24.4±0.4 | 14B | MiniGPT-4-14B | 2023-04-20 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ | 23.5 |  | LinVT | 2024-12-06 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 22.4±0.2 | 12B | BLIP-2-12B | 2023-01-30 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ | 22.1±0.1 | 8B | MiniGPT-4-8B | 2023-04-20 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ | 21.8±0.1 | 9B | OpenFlamingo-9B (LLaMA-7B) | 2023-08-02 |
| Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model | ✓ | 21.8 |  | Xmodel-VLM (Xmodel-LM 1.1B) | 2024-05-15 |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild | ✓ | 19.4 |  | TextBind | 2023-09-14 |
| MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling |  | 18.49 |  | MMAR-0.5B | 2024-10-14 |
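
The original page renders the "Results over time" view as an interactive chart. As a rough offline substitute, the sketch below plots the mean GPT-4 score against release date; it assumes the table above has been exported to a CSV named mm_vet_leaderboard.csv with the same column names, which are assumptions for illustration and not part of the leaderboard source.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of the leaderboard above; the file name and the
# column names ("GPT-4 score", "Release Date") are assumptions.
df = pd.read_csv("mm_vet_leaderboard.csv")

# Scores such as "67.7±0.3" carry an error term; keep only the mean for plotting.
df["score"] = df["GPT-4 score"].astype(str).str.split("±").str[0].astype(float)
df["date"] = pd.to_datetime(df["Release Date"], errors="coerce")

# Entries without a release date (e.g. the unattributed API models) are dropped.
dated = df.dropna(subset=["date", "score"]).sort_values("date")

plt.figure(figsize=(8, 4))
plt.scatter(dated["date"], dated["score"], s=12)
plt.xlabel("Release date")
plt.ylabel("GPT-4 score")
plt.title("MM-Vet: results over time")
plt.tight_layout()
plt.show()
```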