Paper | Code | Accuracy (%) | Model Name | Release Date |
---|---|---|---|---|
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | ✓ Link | 92.0 | Florence-2-large-ft | 2023-11-10 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 86.05 | mPLUG-2 | 2023-02-01 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 81.8 | X2-VLM (large) | 2022-11-22 |
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ Link | 79.8 | XFM (base) | 2023-01-12 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 78.4 | X2-VLM (base) | 2022-11-22 |
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 76.91 | X-VLM (base) | 2021-11-16 |