Paper | Code | GC-mat | GC-trk | OC-cpr | OC-cnt | OC-grp | PC-cpr | PC-cnt | PC-grp | PC-VID | Average Score on VLM2-bench (9 subtasks) | Model Name | Release Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o System Card | | 37.45 | 39.27 | 74.17 | 80.62 | 57.50 | 50.00 | 90.50 | 47.00 | 66.75 | 60.36 | GPT-4o | 2024-10-25 |
Qwen2.5-VL Technical Report | ✓ Link | 35.91 | 43.38 | 71.39 | 41.72 | 47.50 | 80.00 | 57.98 | 69.00 | 46.50 | 54.82 | Qwen2.5-VL-7B | 2025-02-19 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 30.50 | 30.59 | 43.33 | 51.48 | 52.50 | 59.50 | 59.70 | 61.00 | 21.75 | 45.59 | InternVL2.5-26B | 2024-12-06 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 27.80 | 19.18 | 68.06 | 45.99 | 35.00 | 61.50 | 58.59 | 49.00 | 16.25 | 42.37 | Qwen2-VL-7B | 2024-09-18 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 21.24 | 26.03 | 53.33 | 55.23 | 46.50 | 51.50 | 60.00 | 52.00 | 5.25 | 41.23 | InternVL2.5-8B | 2024-12-06 |
Video Instruction Tuning With Synthetic Data | | 18.53 | 12.79 | 54.72 | 62.47 | 28.50 | 62.00 | 66.91 | 25.00 | 59.00 | 43.32 | LLaVA-Video-7B | 2024-10-03 |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ Link | 17.37 | 18.26 | 49.17 | 62.97 | 31.00 | 63.50 | 58.86 | 26.00 | 13.50 | 37.85 | mPLUG-Owl3-7B | 2024-08-09 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 16.60 | 13.70 | 47.22 | 56.17 | 27.50 | 62.00 | 46.67 | 37.00 | 47.25 | 39.35 | LLaVA-OneVision-7B | 2024-08-06 |
Long Context Transfer from Language to Vision | ✓ Link | 14.29 | 19.18 | 26.67 | 42.53 | 18.50 | 21.50 | 38.90 | 18.00 | 3.75 | 22.59 | LongVA-7B | 2024-06-24 |
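
The average column appears to be the unweighted mean of the nine subtask scores, rounded to two decimals. The sketch below checks that reading for three rows copied from the table. The formula is an assumption inferred from the numbers, not something the table states, and the names `scores`, `average_score`, and `averages` are illustrative only.

```python
# Subtask scores copied from the table, in column order:
# GC-mat, GC-trk, OC-cpr, OC-cnt, OC-grp, PC-cpr, PC-cnt, PC-grp, PC-VID
scores = {
    "GPT-4o": [37.45, 39.27, 74.17, 80.62, 57.50, 50.00, 90.50, 47.00, 66.75],
    "Qwen2.5-VL-7B": [35.91, 43.38, 71.39, 41.72, 47.50, 80.00, 57.98, 69.00, 46.50],
    "LongVA-7B": [14.29, 19.18, 26.67, 42.53, 18.50, 21.50, 38.90, 18.00, 3.75],
}


def average_score(subtasks):
    """Unweighted mean of the subtask scores, rounded to two decimals.

    Assumption: every subtask carries equal weight.
    """
    return round(sum(subtasks) / len(subtasks), 2)


# Map each model name to its recomputed average.
averages = {model: average_score(values) for model, values in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

Under this assumption the recomputed values agree with the listed averages for these rows (60.36, 54.82, and 22.59).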