OpenCodePapers

Zero-Shot Cross-Modal Retrieval on COCO 2014

Tasks: Image Retrieval with Multi-Modal Query; Zero-Shot Cross-Modal Retrieval
Results over time (interactive chart omitted)
Leaderboard
| Paper | Code | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Model | Release Date |
|---|---|---|---|---|---|---|---|---|---|
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 74.9 | 91.3 | 95.2 | 58.6 | 81.3 | 88.0 | InternVL-G | 2023-12-21 |
| M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | ✓ Link | 72.8 | 92.3 | 96.3 | 56.5 | 81.6 | 88.8 | M2-Encoder | 2024-01-29 |
| Vision-Language Pre-Training with Triple Contrastive Learning | ✓ Link | 71.4 | 90.8 | 95.4 | 53.5 | 79.0 | 87.1 | TCL | 2022-02-21 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 70.6 | 89.0 | 93.5 | 54.1 | 77.3 | 84.6 | InternVL-C | 2023-12-21 |
| Position-guided Text Prompt for Vision-Language Pre-training | ✓ Link | 69.7 | 90.0 | 94.7 | 49.5 | 75.9 | 84.2 | PTP-BLIP | 2022-12-19 |
| Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | ✓ Link | 68.9 | 87.8 | 92.2 | 51.8 | 75.0 | 83.0 | RO-ViT | 2023-05-11 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 68.7 | 89.5 | 94.7 | 50.1 | 76.4 | 84.5 | ALBEF | 2021-07-16 |
| COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 68.0 | 87.8 | 92.5 | 52.5 | 77.2 | 84.9 | COSMOS ViT-B/16 | 2024-12-02 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 66.3 | 86.2 | 91.8 | 51.2 | 74.2 | 82.0 | CoCa | 2022-05-04 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 65.9 | 87.3 | 92.9 | 48.0 | 73.3 | 82.1 | Flamingo | 2022-04-29 |
| Florence: A New Foundation Model for Computer Vision | ✓ Link | 64.7 | 85.9 | — | 47.2 | 71.4 | — | Florence | 2021-11-22 |
| COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 64.3 | 86.5 | 92.0 | 48.4 | 74.2 | 82.6 | COSMOS ViT-B/32 | 2024-12-02 |
| ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ Link | 63.1 | 85.7 | 91.4 | 46.0 | 71.4 | 80.4 | ERNIE-ViL 2.0 | 2022-09-30 |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 58.6 | 83.0 | 89.7 | 45.6 | 69.8 | 78.6 | ALIGN | 2021-02-11 |
| Learning Transferable Visual Models From Natural Language Supervision | ✓ Link | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 | CLIP | 2021-02-26 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 56.5 | 82.6 | 89.6 | 40.4 | 70.0 | 81.1 | ViLT-B/32 | 2021-02-05 |
| ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | | 44.0 | 71.2 | 80.4 | 32.3 | 59.0 | 70.2 | ImageBERT | 2020-01-22 |
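The R@1/R@5/R@10 columns above are recall-at-k metrics: the percentage of queries for which a correct match ranks among the top k retrieved candidates. As a minimal sketch (not the evaluation code behind this leaderboard), assuming a precomputed query-by-candidate similarity matrix and a ground-truth map from each query to its correct candidate indices (on COCO, each image has 5 reference captions):

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Recall@k for cross-modal retrieval.

    sim : (num_queries, num_candidates) similarity scores,
          e.g. cosine similarity between image and text embeddings.
    gt  : dict mapping query index -> set of correct candidate indices
          (for COCO image-to-text, the indices of that image's 5 captions).
    A query counts as a hit at k if any correct candidate appears
    among its k highest-scoring candidates.
    """
    ranked = np.argsort(-sim, axis=1)  # candidate indices, best first
    results = {}
    for k in ks:
        hits = [bool(gt[i] & set(ranked[i, :k].tolist()))
                for i in range(sim.shape[0])]
        results[k] = 100.0 * float(np.mean(hits))
    return results

# Toy example: query 0's correct candidate is ranked 2nd,
# query 1's is ranked 1st -> R@1 = 50.0, R@2 = 100.0.
scores = np.array([[0.1, 0.9],
                   [0.2, 0.8]])
truth = {0: {0}, 1: {1}}
print(recall_at_k(scores, truth, ks=(1, 2)))
```

Text-to-image recall is computed the same way with the transposed similarity matrix and captions as queries; papers differ slightly in tie-breaking and caption pooling, so reported numbers may not be exactly reproducible from this sketch.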