Paper | Code | Accuracy (Private) | Accuracy (Public) | Top 5 Accuracy | Model Name | Release Date |
---|---|---|---|---|---|---|
Scaling Vision Transformers to 22 Billion Parameters | ✓ Link | 87.6 | | | LiT-22B | 2023-02-10 |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 84.9 | | | LiT ViT-e | 2022-09-14 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 82.7 | | | CoCa | 2022-05-04 |
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters | ✓ Link | 82.2 | | | EVA-CLIP-18B | 2024-02-06 |
LiT: Zero-Shot Transfer with Locked-image text Tuning | ✓ Link | 81.1 | 54.5 | | LiT-tuning | 2021-11-15 |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 80.6 | | | InternVL-C | 2023-12-21 |
EVA-CLIP: Improved Training Techniques for CLIP at Scale | ✓ Link | 79.6 | | | EVA-CLIP-E/14+ | 2023-03-27 |
Learning Transferable Visual Models From Natural Language Supervision | ✓ Link | 72.3 | - | | CLIP | 2021-02-26 |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 42.62 | 58.35 | | PaLI | 2022-09-14 |
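
Most of the models in this table (CLIP, LiT, CoCa, EVA-CLIP, InternVL-C) are contrastive image-text models, and their accuracy numbers come from zero-shot transfer: each class name is turned into a text prompt, and an image is assigned the class whose prompt embedding is most similar to its image embedding. Below is a minimal sketch of that protocol using the Hugging Face `transformers` CLIP API; the checkpoint name, label set, and image path are illustrative placeholders, not the exact setup used by any entry above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint, class names, and image path are illustrative assumptions.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["dog", "cat", "hammer"]                # hypothetical label set
prompts = [f"a photo of a {c}" for c in class_names]  # CLIP-style prompt template
image = Image.open("example.jpg")                     # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds image-text similarity scores scaled by the learned
# temperature; the highest-scoring prompt gives the predicted class.
probs = out.logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax().item()])
```

Dataset-level accuracy is then just the fraction of images whose highest-scoring prompt matches the ground-truth label; the papers listed here differ mainly in model scale, training data, and prompt ensembling, not in this basic evaluation recipe.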