Paper | Code | Image→Text R@1 | R@5 | R@10 | Text→Image R@1 | R@5 | R@10 | Model | Date |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 74.9 | 91.3 | 95.2 | 58.6 | 81.3 | 88.0 | InternVL-G | 2023-12-21 |
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | ✓ Link | 72.8 | 92.3 | 96.3 | 56.5 | 81.6 | 88.8 | M2-Encoder | 2024-01-29 |
Vision-Language Pre-Training with Triple Contrastive Learning | ✓ Link | 71.4 | 90.8 | 95.4 | 53.5 | 79.0 | 87.1 | TCL | 2022-02-21 |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 70.6 | 89.0 | 93.5 | 54.1 | 77.3 | 84.6 | InternVL-C | 2023-12-21 |
Position-guided Text Prompt for Vision-Language Pre-training | ✓ Link | 69.7 | 90.0 | 94.7 | 49.5 | 75.9 | 84.2 | PTP-BLIP | 2022-12-19 |
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | ✓ Link | 68.9 | 87.8 | 92.2 | 51.8 | 75.0 | 83.0 | RO-ViT | 2023-05-11 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 68.7 | 89.5 | 94.7 | 50.1 | 76.4 | 84.5 | ALBEF | 2021-07-16 |
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 68.0 | 87.8 | 92.5 | 52.5 | 77.2 | 84.9 | COSMOS ViT-B/16 | 2024-12-02 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 66.3 | 86.2 | 91.8 | 51.2 | 74.2 | 82.0 | CoCa | 2022-05-04 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 65.9 | 87.3 | 92.9 | 48.0 | 73.3 | 82.1 | Flamingo | 2022-04-29 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 64.7 | 85.9 | | 47.2 | 71.4 | | Florence | 2021-11-22 |
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 64.3 | 86.5 | 92.0 | 48.4 | 74.2 | 82.6 | COSMOS ViT-B/32 | 2024-12-02 |
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ Link | 63.1 | 85.7 | 91.4 | 46.0 | 71.4 | 80.4 | ERNIE-ViL 2.0 | 2022-09-30 |
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 58.6 | 83.0 | 89.7 | 45.6 | 69.8 | 78.6 | ALIGN | 2021-02-11 |
Learning Transferable Visual Models From Natural Language Supervision | ✓ Link | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 | CLIP | 2021-02-26 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 56.5 | 82.6 | 89.6 | 40.4 | 70.0 | 81.1 | ViLT-B/32 | 2021-02-05 |
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | | 44.0 | 71.2 | 80.4 | 32.3 | 59.0 | 70.2 | ImageBERT | 2020-01-22 |
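The Recall@K columns above can be computed from a query-candidate similarity matrix. A minimal sketch, assuming (as is standard for these leaderboards) that each query's ground-truth match sits at the same index in the candidate set; the function name and toy matrix are illustrative, not from any listed paper:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K for retrieval: fraction of queries whose ground-truth
    candidate (assumed to share the query's index) ranks in the top K."""
    # Sort candidates for each query row by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    gt = np.arange(sim.shape[0])[:, None]
    # A query is a hit if its ground-truth index appears in the top-k ranks.
    hits = (ranks[:, :k] == gt).any(axis=1)
    return hits.mean()

# Toy 3x3 similarity matrix: query i's true match is candidate i.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
print(recall_at_k(sim, 1))  # only query 0 ranks its match first -> 1/3
```

Leaderboard numbers report this metric (scaled to percent) in both directions, once with images as queries over captions and once with captions as queries over images.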