Paper | Code | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | InternVL-G | 2023-12-21 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 94.9 | 99.9 | 100.0 | 81.5 | 95.6 | 97.8 | BEiT-3 | 2022-08-22 |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | InternVL-C | 2023-12-21 |
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 92.9 | 99.4 | 99.9 | 80.3 | 95.3 | 97.6 | COSMOS ViT-B/16 | 2024-12-02 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 92.5 | 99.5 | 99.9 | 80.4 | 95.7 | 97.7 | CoCa | 2022-05-04 |
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | ✓ Link | 92.1 | 99.4 | 99.7 | 80.7 | 96.1 | 97.7 | RO-ViT | 2023-05-11 |
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | ✓ Link | 91.2 | 99.2 | 99.6 | 92.2 | 99.5 | 99.7 | M2-Encoder | 2024-01-29 |
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ Link | 91.2 | 99.1 | 99.8 | 77.4 | 93.8 | 96.4 | ERNIE-ViL 2.0 | 2022-09-30 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 90.9 | 99.1 | - | 76.7 | 93.6 | - | Florence | 2021-11-22 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 90.5 | 98.8 | 99.7 | 76.8 | 93.7 | 96.7 | ALBEF | 2021-07-16 |
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | ✓ Link | 89.9 | 98.8 | 99.3 | 76.1 | 92.8 | 96.2 | COSMOS ViT-B/32 | 2024-12-02 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 89.3 | 98.8 | 99.7 | 79.5 | 95.3 | 97.9 | Flamingo | 2022-04-29 |
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ Link | 89.0 | 99.2 | 99.8 | 77.2 | 94.3 | 98.2 | VK-OOD | 2023-09-21 |
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | ALIGN | 2021-02-11 |
Learning Transferable Visual Models From Natural Language Supervision | ✓ Link | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | CLIP | 2021-02-26 |
Position-guided Text Prompt for Vision-Language Pre-training | ✓ Link | 87.1 | 98.4 | 99.3 | 73.1 | 91.0 | 94.8 | PTP-BLIP (14M) | 2022-12-19 |
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | ✓ Link | 86.0 | 98.0 | 99.1 | 72.5 | 91.6 | 95.4 | AltCLIP | 2022-11-12 |
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 80.7 | 95.7 | 98.0 | 66.2 | 88.4 | 92.9 | UNITER | 2019-09-25 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 73.2 | 93.6 | 96.5 | 55.0 | 82.5 | 89.8 | ViLT-B/32 | 2021-02-05 |
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | - | 70.7 | 90.2 | 94.0 | 54.3 | 79.6 | 87.5 | ImageBERT | 2020-01-22 |
Reproducible scaling laws for contrastive language-image learning | ✓ Link | - | 99.3 | - | - | 94.1 | - | OpenCLIP ViT-H/14 | 2022-12-14 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | - | - | - | 90.4 | - | - | VAST | 2023-05-29 |
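For readers unfamiliar with the metric, the sketch below shows how Recall@K numbers like those above can be computed from a dual-encoder similarity matrix. It is a minimal illustration, not any paper's actual evaluation code: it assumes one ground-truth gallery item per query, whereas benchmarks such as Flickr30k pair each image with five captions and count an image-to-text hit if any of the five is retrieved. The function name and toy data are hypothetical.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for retrieval from a similarity matrix.

    sim[i, j] = similarity between query i and gallery item j.
    Simplifying assumption: the correct match for query i is
    gallery item i (one ground-truth pair per query).
    """
    n = sim.shape[0]
    # Gallery indices sorted by decreasing similarity for each query.
    order = np.argsort(-sim, axis=1)
    # Rank of the ground-truth item for each query (0 = retrieved first).
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1)
    return {k: float(np.mean(ranks < k)) * 100 for k in ks}

# Toy usage with random unit-normalized embeddings standing in for the
# outputs of any dual encoder (e.g. a CLIP-style image/text model).
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = img + 0.1 * rng.normal(size=img.shape)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

print("image-to-text:", recall_at_k(img @ txt.T))  # query = image, gallery = captions
print("text-to-image:", recall_at_k(txt @ img.T))  # query = caption, gallery = images
```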