| Paper | Code | Score | Model | Date |
|---|---|---|---|---|
| Seed1.5-VL Technical Report | | 63.6 | Seed1.5-VL thinking | 2025-05-11 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 63.5 | PLM-8B | 2025-04-17 |
| Seed1.5-VL Technical Report | | 61.5 | Seed1.5-VL | 2025-05-11 |
| V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | ✓ Link | 60.6 | V-JEPA 2 ViT-g 8B | 2025-06-11 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 58.9 | PLM-3B | 2025-04-17 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | | 56.5 | RRPO | 2025-04-16 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 55.5 | Tarsier-34B | 2024-06-30 |
| Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | ✓ Link | 54.7 | Tarsier2-7B | 2025-01-14 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 52.7 | Qwen2-VL-72B | 2024-09-18 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ Link | 51.6 | IXC-2.5 7B | 2024-07-03 |
| Aria: An Open Multimodal Native Mixture-of-Experts Model | ✓ Link | 51.0 | Aria | 2024-10-08 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 50.4 | PLM-1B | 2025-04-17 |
| Video Instruction Tuning With Synthetic Data | | 50.0 | LLaVA-Video 72B | 2024-10-03 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 48.4 | VideoLLaMA2 72B | 2024-06-11 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 47.6 | Gemini 1.5 Pro | 2024-03-08 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 46.9 | Tarsier-7B | 2024-06-30 |
| Video Instruction Tuning With Synthetic Data | | 45.6 | LLaVA-Video 7B | 2024-10-03 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 43.8 | Qwen2-VL-7B | 2024-09-18 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 42.9 | VideoLLaMA2 7B | 2024-06-11 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 42.3 | PLLaVA-34B | 2024-04-25 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ Link | 42.2 | mPLUG-Owl3 | 2024-08-09 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 42.1 | VideoLLaMA2.1 | 2024-06-11 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ Link | 41.7 | VideoGPT+ | 2024-06-13 |
| GPT-4o System Card | | 39.9 | GPT-4o (8 frames) | 2024-10-25 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 36.4 | PLLaVA-13B | 2024-04-25 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ Link | 35.7 | ST-LLM | 2024-03-30 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 35.0 | VideoChat2 | 2023-11-28 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 34.9 | PLLaVA-7B | 2024-04-25 |