Paper | Code | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | InternVL-G | 2023-12-21 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 94.9 | 99.9 | 100.0 | 81.5 | 95.6 | 97.8 | BEiT-3 | 2022-08-22 |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | InternVL-C | 2023-12-21 |
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 92.9 | 99.4 | 99.9 | 80.3 | 95.3 | 97.6 | COSMOS ViT-B/16 | 2024-12-02 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 92.5 | 99.5 | 99.9 | 80.4 | 95.7 | 97.7 | CoCa | 2022-05-04 |
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | ✓ Link | 92.1 | 99.4 | 99.7 | 80.7 | 96.1 | 97.7 | RO-ViT | 2023-05-11 |
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | ✓ Link | 91.2 | 99.2 | 99.6 | 92.2 | 99.5 | 99.7 | M2-Encoder | 2024-01-29 |
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ Link | 91.2 | 99.1 | 99.8 | 77.4 | 93.8 | 96.4 | ERNIE-ViL 2.0 | 2022-09-30 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 90.9 | 99.1 | - | 76.7 | 93.6 | - | Florence | 2021-11-22 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 90.5 | 98.8 | 99.7 | 76.8 | 93.7 | 96.7 | ALBEF | 2021-07-16 |
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 89.9 | 98.8 | 99.3 | 76.1 | 92.8 | 96.2 | COSMOS ViT-B/32 | 2024-12-02 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 89.3 | 98.8 | 99.7 | 79.5 | 95.3 | 97.9 | Flamingo | 2022-04-29 |
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ Link | 89.0 | 99.2 | 99.8 | 77.2 | 94.3 | 98.2 | VK-OOD | 2023-09-21 |
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | ALIGN | 2021-02-11 |
Learning Transferable Visual Models From Natural Language Supervision | ✓ Link | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | CLIP | 2021-02-26 |
Position-guided Text Prompt for Vision-Language Pre-training | ✓ Link | 87.1 | 98.4 | 99.3 | 73.1 | 91.0 | 94.8 | PTP-BLIP (14M) | 2022-12-19 |
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | ✓ Link | 86.0 | 98.0 | 99.1 | 72.5 | 91.6 | 95.4 | AltCLIP | 2022-11-12 |
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 80.7 | 95.7 | 98.0 | 66.2 | 88.4 | 92.9 | UNITER | 2019-09-25 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 73.2 | 93.6 | 96.5 | 55.0 | 82.5 | 89.8 | ViLT-B/32 | 2021-02-05 |
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | - | 70.7 | 90.2 | 94.0 | 54.3 | 79.6 | 87.5 | ImageBERT | 2020-01-22 |
Reproducible scaling laws for contrastive language-image learning | ✓ Link | - | 99.3 | - | - | 94.1 | - | OpenCLIP ViT-H/14 | 2022-12-14 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | - | - | - | 90.4 | - | - | VAST | 2023-05-29 |
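For readers unfamiliar with the metric, the sketch below shows how Recall@K numbers like those above can be computed from a dual-encoder similarity matrix. It is a minimal illustration, not any paper's actual evaluation code: it assumes one ground-truth gallery item per query, whereas benchmarks such as Flickr30k pair each image with five captions and count an image-to-text hit if any of the five is retrieved. The function name and toy data are hypothetical.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for retrieval from a similarity matrix.

    sim[i, j] = similarity between query i and gallery item j.
    Simplifying assumption: the correct match for query i is
    gallery item i (one ground-truth pair per query).
    """
    n = sim.shape[0]
    # Gallery indices sorted by decreasing similarity for each query.
    order = np.argsort(-sim, axis=1)
    # Rank of the ground-truth item for each query (0 = retrieved first).
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1)
    return {k: float(np.mean(ranks < k)) * 100 for k in ks}

# Toy usage with random unit-normalized embeddings standing in for the
# outputs of any dual encoder (e.g. a CLIP-style image/text model).
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = img + 0.1 * rng.normal(size=img.shape)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

print("image-to-text:", recall_at_k(img @ txt.T))  # query = image, gallery = captions
print("text-to-image:", recall_at_k(txt @ img.T))  # query = caption, gallery = images
```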