Paper | Code | Image→Text R@1 | R@5 | R@10 | Text→Image R@1 | R@5 | R@10 | Model | Date |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 74.9 | 91.3 | 95.2 | 58.6 | 81.3 | 88.0 | InternVL-G | 2023-12-21 |
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | ✓ Link | 72.8 | 92.3 | 96.3 | 56.5 | 81.6 | 88.8 | M2-Encoder | 2024-01-29 |
Vision-Language Pre-Training with Triple Contrastive Learning | ✓ Link | 71.4 | 90.8 | 95.4 | 53.5 | 79.0 | 87.1 | TCL | 2022-02-21 |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 70.6 | 89.0 | 93.5 | 54.1 | 77.3 | 84.6 | InternVL-C | 2023-12-21 |
Position-guided Text Prompt for Vision-Language Pre-training | ✓ Link | 69.7 | 90.0 | 94.7 | 49.5 | 75.9 | 84.2 | PTP-BLIP | 2022-12-19 |
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | ✓ Link | 68.9 | 87.8 | 92.2 | 51.8 | 75.0 | 83.0 | RO-ViT | 2023-05-11 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 68.7 | 89.5 | 94.7 | 50.1 | 76.4 | 84.5 | ALBEF | 2021-07-16 |
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 68.0 | 87.8 | 92.5 | 52.5 | 77.2 | 84.9 | COSMOS ViT-B/16 | 2024-12-02 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 66.3 | 86.2 | 91.8 | 51.2 | 74.2 | 82.0 | CoCa | 2022-05-04 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 65.9 | 87.3 | 92.9 | 48.0 | 73.3 | 82.1 | Flamingo | 2022-04-29 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 64.7 | 85.9 | | 47.2 | 71.4 | | Florence | 2021-11-22 |
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 64.3 | 86.5 | 92.0 | 48.4 | 74.2 | 82.6 | COSMOS ViT-B/32 | 2024-12-02 |
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ Link | 63.1 | 85.7 | 91.4 | 46.0 | 71.4 | 80.4 | ERNIE-ViL 2.0 | 2022-09-30 |
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 58.6 | 83.0 | 89.7 | 45.6 | 69.8 | 78.6 | ALIGN | 2021-02-11 |
Learning Transferable Visual Models From Natural Language Supervision | ✓ Link | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 | CLIP | 2021-02-26 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 56.5 | 82.6 | 89.6 | 40.4 | 70.0 | 81.1 | ViLT-B/32 | 2021-02-05 |
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | | 44.0 | 71.2 | 80.4 | 32.3 | 59.0 | 70.2 | ImageBERT | 2020-01-22 |
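The Recall@K columns above can be computed from a query-candidate similarity matrix. A minimal sketch, assuming (as is standard for these leaderboards) that each query's ground-truth match sits at the same index in the candidate set; the function name and toy matrix are illustrative, not from any listed paper:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K for retrieval: fraction of queries whose ground-truth
    candidate (assumed to share the query's index) ranks in the top K."""
    # Sort candidates for each query row by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    gt = np.arange(sim.shape[0])[:, None]
    # A query is a hit if its ground-truth index appears in the top-k ranks.
    hits = (ranks[:, :k] == gt).any(axis=1)
    return hits.mean()

# Toy 3x3 similarity matrix: query i's true match is candidate i.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
print(recall_at_k(sim, 1))  # only query 0 ranks its match first -> 1/3
```

Leaderboard numbers report this metric (scaled to percent) in both directions, once with images as queries over captions and once with captions as queries over images.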