Paper | Code | Mean average precision (mAP) | Model name | Release date |
---|---|---|---|---|
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 28.0 | X-VLM | 2021-11-16 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 25.5 | BLIP-2 (pretrained) | 2023-01-30 |
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ✓ Link | 24.3 | BLIP | 2022-01-28 |
Open-vocabulary Attribute Detection | ✓ Link | 21.4 | OVAD-Baseline-Box | 2022-11-23 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 21.0 | ALBEF | 2021-07-16 |
Reproducible scaling laws for contrastive language-image learning | ✓ Link | 17.0 | OpenCLIP ViT-B32 | 2022-12-14 |
Learning Transferable Visual Models From Natural Language Supervision | ✓ Link | 16.6 | CLIP ViT-B16 | 2021-02-26 |