Paper | Code | Accuracy (%) | IoU | Model Name | Release Date |
---|---|---|---|---|---|
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | ✓ Link | 95.3 | | Florence-2-large-ft | 2023-11-10 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 92.8 | | mPLUG-2 | 2023-02-01 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 92.1 | | X2-VLM (large) | 2022-11-22 |
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ Link | 90.4 | | XFM (base) | 2023-01-12 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 90.3 | | X2-VLM (base) | 2022-11-22 |
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 89.00 | | X-VLM (base) | 2021-11-16 |
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | ✓ Link | 61.1 | | HYDRA | 2024-03-19 |