Cross-Modal Retrieval on Flickr30k

Cross-Modal Retrieval

Leaderboard

Paper	Code	Image-to-text R@1	Image-to-text R@5	Image-to-text R@10	Text-to-image R@1	Text-to-image R@5	Text-to-image R@10	Model name	Release date
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	✓ Link	98.8	100	100	91.8	98.6	99.5	X2-VLM (large)	2022-11-22
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	✓ Link	98.5	100	100	90.4	98.2	99.3	X2-VLM (base)	2022-11-22
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks	✓ Link	98.0	100.0	100.0	90.3	98.7	99.5	BEiT-3	2022-08-22
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks		97.3	99.9	100	87.9	97.8	99.1	OmniVL (14M)	2022-09-15
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training	✓ Link	97.2	100.0	100.0	93.3	99.4	99.8	ERNIE-ViL 2.0	2022-09-30
(no paper linked)		97.2	100	100	86.8	97.6	98.9	Aurora (ours, r=128)
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts	✓ Link	97.1	100.0	100.0	86.9	97.3	98.7	X-VLM (base)	2021-11-16
Dissecting Deep Metric Learning Losses for Image-Text Retrieval	✓ Link	97.0	99.6	100	86.3	97.4	99.0	VSE-Gradient	2022-10-21
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision	✓ Link	95.3	99.8	100	84.9	97.4	98.6	ALIGN	2021-02-11
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval		89.5	98.4	99.6	75.8	94.2	96.9	ViSTA	2022-03-31
Learning Relation Alignment for Calibrated Cross-modal Retrieval	✓ Link	88.3	98.4	99.4	76.86	93.3	95.72	IAIS	2021-05-28
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting	✓ Link	87.1	98.2	99.2	69.5	91.0	94.7	3SHNet	2024-04-26
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision	✓ Link	83.5	96.7	98.6	64.4	88.7	93.8	ViLT-B/32	2021-02-05
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning	✓ Link	82.5	95.5	97.7	68.4	90.8	94.4	DSMD	2024-04-16
Plug-and-Play Regulators for Image-Text Matching	✓ Link	82.3	96.0	98.4	62.6	85.8	91.1	RCAR	2023-03-23
NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings	✓ Link	79.6			60.0			NAPReg	2023-01-07
Similarity Reasoning and Filtration for Image-Text Matching	✓ Link	77.8	94.1	97.4	58.5	83.0	88.8	SGRAF	2021-01-05
Graph Structured Network for Image-Text Matching	✓ Link	76.4	94.3	97.3	57.4	82.3	89.0	GSMN	2020-04-01
(no paper linked)		75.3	93.4	97.3	54.98	81.3	88.26	Pearl
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval	✓ Link	74.1	93.0	96.6	53.9	79.4	87.2	IMRAM	2020-03-08
Stacked Cross Attention for Image-Text Matching	✓ Link	67.4	90.3	95.8	48.6	77.7	85.2	SCAN	2018-03-21
Dual-Path Convolutional Image-Text Embeddings with Instance Loss	✓ Link	55.6	81.9					Dual-Path (ResNet)	2017-11-15
Learning Semantic Concepts and Order for Image and Sentence Matching		55.5	82.0	89.3	41.1	70.5	80.1	SCO (ResNet)	2017-12-06
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives	✓ Link	52.9	80.5	87.2	39.6	70.1	79.5	VSE++ (ResNet)	2017-07-18
Deep Cross-Modal Projection Learning for Image-Text Matching	✓ Link	49.6	76.8	86.1	37.3	65.7	75.5	CMPL (ResNet)	2018-09-01
Dual-Path Convolutional Image-Text Embeddings with Instance Loss	✓ Link			89.5	39.1	69.2	80.9	Dual-Path (ResNet)	2017-11-15
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	✓ Link				91.0	98.5	99.5	VAST	2023-05-29
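
The leaderboard does not say how each paper computed its scores. As a rough illustration, the sketch below assumes the usual Recall@K definition for retrieval: the percentage of queries for which at least one ground-truth match appears among the top K ranked candidates. The function name `recall_at_k`, the `scores` matrix layout (one row per query, one column per candidate), and the toy data are all hypothetical and are not taken from any of the listed papers.

```python
from typing import List, Sequence, Set


def recall_at_k(scores: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant candidate in the top k.

    scores[i][j] is the similarity between query i and candidate j.
    relevant[i] is the set of candidate indices that are correct for query i
    (a query may have several correct candidates, e.g. multiple captions
    describing the same image).
    """
    hits = 0
    for row, gold in zip(scores, relevant):
        # Rank candidate indices from most to least similar.
        ranked: List[int] = sorted(range(len(row)),
                                   key=lambda j: row[j],
                                   reverse=True)
        # A query counts as a hit if any correct candidate is in the top k.
        if gold.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy data (made up for illustration): 3 queries, 4 candidates.
scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # correct candidate 2 is ranked last
]
relevant = [{0}, {1}, {2}]

for k in (1, 2, 4):
    print(f"R@{k} = {recall_at_k(scores, relevant, k):.1f}")
```

Under this definition, the image-to-text columns would use images as queries and captions as candidates, and the text-to-image columns would swap the two roles.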
