N/A | | 59.2 | 51 | 35 | GPT-4o (CoT) | |
N/A | | 54 | 38.2 | 24.6 | GPT-4o | |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 50.4 | 32.6 | 17.4 | Qwen2-VL-72B | 2024-09-18 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 48.4 | 35.2 | 21.8 | LLaVA-OneVision-Qwen2-72B | 2024-08-06 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 41.6 | 29.4 | 14.6 | LLaVA-OneVision-Qwen2-7B | 2024-08-06 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 40.2 | 32.4 | 15.2 | Qwen2-VL-7B | 2024-09-18 |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 37 | 27.6 | 12.4 | Gemini-1.5-Pro (CoT) | 2024-03-08 |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 36.2 | 21.8 | 8.4 | VideoLLaMA2-72B | 2024-06-11 |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 35.8 | 22.6 | 10.2 | Gemini-1.5-Pro | 2024-03-08 |
N/A | | 32.8 | 28.8 | 10.6 | Claude 3.5 Sonnet | |
MiniCPM-V: A GPT-4V Level MLLM on Your Phone | ✓ Link | 32.6 | 29.2 | 11.2 | MiniCPM-2.6 | 2024-08-03 |
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ Link | 30.8 | 28.4 | 9 | InternLM-XC-2.5 (CoT) | 2024-07-03 |
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ Link | 28.8 | 27.8 | 9.6 | InternLM-XC-2.5 | 2024-07-03 |
N/A | | 25.8 | 22.2 | 5.2 | LLaVA-NeXT-Video-34B (CoT) | |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ Link | 24.8 | 25.8 | 6.6 | Video-LLaVA-7B | 2023-11-16 |
N/A | | 24 | 22.4 | 6.2 | Phi-3.5-Vision | |
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ Link | 23.8 | 25.6 | 6.8 | MA-LMM-Vicuna-7B | 2024-04-08 |
N/A | | 23 | 21.2 | 3.8 | LLaVA-NeXT-Video-34B | |
N/A | | 21.8 | 26.2 | 6.8 | LLaVA-NeXT-Video-7B (CoT) | |
N/A | | 21.8 | 25.6 | 6.2 | LLaVA-NeXT-Video-7B | |
VTimeLLM: Empower LLM to Grasp Video Moments | ✓ Link | 19.4 | 27 | 5.2 | VTimeLLM | 2023-11-30 |
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ Link | 17 | 2.8 | 1.2 | VideoCLIP | 2021-09-28 |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 10.6 | 5 | 1.2 | LanguageBind | 2023-10-03 |
ImageBind: One Embedding Space To Bind Them All | ✓ Link | 9.4 | 3.4 | 0.6 | ImageBind | 2023-05-09 |