OpenCodePapers

zero-shot-cross-modal-retrieval-on-flickr30k

Zero-Shot Cross-Modal Retrieval on Flickr30k
Tasks: Image Retrieval with Multi-Modal Query, Zero-Shot Cross-Modal Retrieval
Leaderboard
| Paper | Code | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Model | Release Date |
|---|---|---|---|---|---|---|---|---|---|
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | InternVL-G | 2023-12-21 |
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ | 94.9 | 99.9 | 100.0 | 81.5 | 95.6 | 97.8 | BEiT-3 | 2022-08-22 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | InternVL-C | 2023-12-21 |
| COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ | 92.9 | 99.4 | 99.9 | 80.3 | 95.3 | 97.6 | COSMOS ViT-B/16 | 2024-12-02 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ | 92.5 | 99.5 | 99.9 | 80.4 | 95.7 | 97.7 | CoCa | 2022-05-04 |
| Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | ✓ | 92.1 | 99.4 | 99.7 | 80.7 | 96.1 | 97.7 | RO-ViT | 2023-05-11 |
| M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | ✓ | 91.2 | 99.2 | 99.6 | 92.2 | 99.5 | 99.7 | M2-Encoder | 2024-01-29 |
| ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ | 91.2 | 99.1 | 99.8 | 77.4 | 93.8 | 96.4 | ERNIE-ViL 2.0 | 2022-09-30 |
| Florence: A New Foundation Model for Computer Vision | ✓ | 90.9 | 99.1 | - | 76.7 | 93.6 | - | Florence | 2021-11-22 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ | 90.5 | 98.8 | 99.7 | 76.8 | 93.7 | 96.7 | ALBEF | 2021-07-16 |
| COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ | 89.9 | 98.8 | 99.3 | 76.1 | 92.8 | 96.2 | COSMOS ViT-B/32 | 2024-12-02 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 89.3 | 98.8 | 99.7 | 79.5 | 95.3 | 97.9 | Flamingo | 2022-04-29 |
| Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 89.0 | 99.2 | 99.8 | 77.2 | 94.3 | 98.2 | VK-OOD | 2023-09-21 |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | ALIGN | 2021-02-11 |
| Learning Transferable Visual Models From Natural Language Supervision | ✓ | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | CLIP | 2021-02-26 |
| Position-guided Text Prompt for Vision-Language Pre-training | ✓ | 87.1 | 98.4 | 99.3 | 73.1 | 91.0 | 94.8 | PTP-BLIP (14M) | 2022-12-19 |
| AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | ✓ | 86.0 | 98.0 | 99.1 | 72.5 | 91.6 | 95.4 | AltCLIP | 2022-11-12 |
| UNITER: UNiversal Image-TExt Representation Learning | ✓ | 80.7 | 95.7 | 98.0 | 66.2 | 88.4 | 92.9 | UNITER | 2019-09-25 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ | 73.2 | 93.6 | 96.5 | 55.0 | 82.5 | 89.8 | ViLT-B/32 | 2021-02-05 |
| ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | | 70.7 | 90.2 | 94.0 | 54.3 | 79.6 | 87.5 | ImageBERT | 2020-01-22 |
| Reproducible scaling laws for contrastive language-image learning | ✓ | - | 99.3 | - | - | 94.1 | - | OpenCLIP ViT-H/14 | 2022-12-14 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | - | - | - | 90.4 | - | - | VAST | 2023-05-29 |
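All metrics above are Recall@K: the fraction of queries for which a correct match appears among the top K retrieved candidates. As a minimal sketch (not any specific paper's evaluation code), the following computes Recall@K from a query-by-candidate similarity matrix, assuming exactly one correct candidate per query at the matching index; benchmarks like Flickr30k, which pair each image with five captions, instead count a hit if any of the ground-truth captions lands in the top K.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K for a (n_queries x n_candidates) similarity matrix,
    assuming the correct candidate for query i is candidate i."""
    # Indices of the top-k candidates per query, highest similarity first.
    topk = np.argsort(-sim, axis=1)[:, :k]
    # A query is a hit if its ground-truth index appears among the top k.
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries, 3 candidates; the diagonal is ground truth.
sim = np.array([[0.9, 0.1, 0.2],   # correct candidate ranked 1st
                [0.3, 0.2, 0.8],   # correct candidate ranked 3rd
                [0.1, 0.7, 0.4]])  # correct candidate ranked 2nd

print(recall_at_k(sim, 1))  # 1 of 3 hits
print(recall_at_k(sim, 5))  # all hits once k covers every candidate
```

For image-to-text retrieval, `sim[i, j]` would be the similarity between image i and caption j (e.g. a cosine similarity of embeddings); transposing the matrix gives the text-to-image direction.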