Paper | Code | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 68.0 | 87.7 | 92.8 | | | | VAST | 2023-05-29 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 67.7 | 87.5 | 92.5 | 84.4 | 96.5 | 98.5 | X2-VLM (large) | 2022-11-22 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 67.2 | 87.7 | 92.8 | 84.8 | 96.5 | 98.3 | BEiT-3 | 2022-08-22 |
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ Link | 67.0 | 87.2 | 92.4 | 84.2 | 96.4 | 98.4 | XFM (base) | 2023-01-12 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 66.2 | 87.1 | 92.2 | 83.5 | 96.3 | 98.5 | X2-VLM (base) | 2022-11-22 |
Position-guided Text Prompt for Vision-Language Pre-training | ✓ Link | 64.9 | 87.4 | 92.2 | 81.5 | 95.9 | 97.9 | PTP-BLIP (14M) | 2022-12-19 |
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 64.8 | 86.1 | 91.6 | 82.1 | 95.9 | 98.1 | OmniVL (14M) | 2022-09-15 |
Dissecting Deep Metric Learning Losses for Image-Text Retrieval | ✓ Link | 63.6 | 86.0 | 91.5 | 81.4 | 95.6 | 97.9 | VSE-Gradient | 2022-10-21 |
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 63.4 | 85.8 | 91.5 | 81.2 | 95.6 | 98.2 | X-VLM (base) | 2021-11-16 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 63.2 | 85.7 | | 81.8 | 95.2 | | Florence | 2021-11-22 |
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ Link | 62.9 | 84.8 | 92.8 | 80.7 | 95.1 | 96.8 | VK-OOD | 2023-09-21 |
| | | 62.8 | 84.8 | 91.0 | 80.7 | 95.3 | 97.8 | Aurora (ours, r=128) | |
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning | ✓ Link | 62.1 | 85.9 | 92.0 | 48.0 | 75.6 | 84.5 | DSMD | 2024-04-16 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 61.4 | 84.4 | 90.9 | | | | VALOR | 2023-04-17 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 60.7 | 84.3 | 90.5 | 77.6 | 94.3 | 97.2 | ALBEF | 2021-07-16 |
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 59.9 | 83.3 | 89.8 | 77.0 | 93.5 | 96.9 | ALIGN | 2021-02-11 |
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ Link | 59.5 | 83.4 | 90.1 | 77.4 | 93.6 | 97.1 | ERNIE-ViL 2.0 | 2022-09-30 |
Vision-Language Pre-Training with Triple Contrastive Learning | ✓ Link | 59.0 | 83.2 | 89.9 | 75.6 | 92.8 | 96.7 | TCL | 2022-02-21 |
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | ✓ Link | 57.5 | 82.8 | 89.8 | 73.5 | 92.2 | 96.0 | Oscar | 2020-04-13 |
An Empirical Study of Training End-to-End Vision-and-Language Transformers | ✓ Link | 57.1 | 82.7 | 90.1 | 76.2 | 93.2 | 96.8 | METER | 2021-11-03 |
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval | | 52.6 | 79.6 | 87.6 | 68.9 | 90.1 | 95.4 | ViSTA | 2022-03-31 |
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval | ✓ Link | 51.3 | 79.2 | 87.5 | 64.9 | 88.6 | 94.5 | ALADIN | 2022-07-29 |
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting | ✓ Link | 50.3 | 79.3 | 87.7 | 67.9 | 90.5 | 95.4 | 3SHNet | 2024-04-26 |
VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words | ✓ Link | 44.4 | 72.8 | 82.4 | | | | VisualSparta | 2021-01-01 |
Plug-and-Play Regulators for Image-Text Matching | ✓ Link | 44.3 | 73.2 | 83.2 | 61.3 | 86.1 | 92.6 | RCAR | 2023-03-23 |
NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | ✓ Link | 43.0 | | | 59.8 | | | NAPReg | 2023-01-07 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 42.7 | 72.9 | 83.1 | 61.5 | 86.3 | 92.7 | ViLT-B/32 | 2021-02-05 |
Similarity Reasoning and Filtration for Image-Text Matching | ✓ Link | 41.9 | 70.7 | 81.3 | 57.8 | 84.9 | 91.6 | SGRAF | 2021-01-05 |
LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives | | 41.5 | 72.1 | 82.2 | 55.6 | 82.4 | 91.0 | LILE | 2022-03-02 |
Visual Semantic Reasoning for Image-Text Matching | ✓ Link | 40.5 | 70.6 | 81.1 | 53.0 | 81.1 | 89.4 | VSRN | 2019-09-06 |
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | ✓ Link | 39.7 | 69.1 | 79.8 | 53.7 | 83.2 | 91.0 | IMRAM | 2020-03-08 |
Stacked Cross Attention for Image-Text Matching | ✓ Link | 38.6 | 69.3 | 80.4 | 50.4 | 82.2 | 90.0 | SCAN | 2018-03-21 |
Learning Semantic Concepts and Order for Image and Sentence Matching | | 33.1 | 62.9 | 75.5 | 42.8 | 72.3 | 83.0 | SCO (ResNet) | 2017-12-06 |
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval | ✓ Link | 32.4 | 63.0 | 75.0 | 45.2 | 74.3 | 84.5 | PVSE | 2019-06-11 |
Deep Visual-Semantic Alignments for Generating Image Descriptions | ✓ Link | 25.3 | 53.4 | 66.4 | 41.2 | 70.5 | 81.1 | Dual-Path (ResNet) | 2014-12-07 |
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ Link | | | | 70.7 | 89.1 | 93.7 | MaMMUT (ours) | 2023-03-29 |
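
All metric columns report Recall@K: the percentage of queries for which a correct match appears among the top K retrieved candidates (rows are ranked by text-to-image R@1). The sketch below is our own illustration of how such numbers are typically computed from a query-candidate similarity matrix; the function name `recall_at_k`, the toy scores, and the positive-index convention are assumptions for exposition, not code from any paper listed above.

```python
import numpy as np

def recall_at_k(sim, positives, ks=(1, 5, 10)):
    """Recall@K from a (num_queries x num_candidates) score matrix.

    positives[i] lists the candidate indices that are correct for
    query i; a query counts as a hit at K if ANY of them ranks in
    the top K. (On COCO, text-to-image has 1 positive image per
    caption query, while image-to-text has 5 positive captions
    per image query.)
    """
    order = np.argsort(-sim, axis=1)  # candidate indices, best first
    return {
        f"R@{k}": 100.0 * np.mean([
            bool(set(order[i, :k]) & set(positives[i]))
            for i in range(sim.shape[0])
        ])
        for k in ks
    }

# Toy example: 4 caption queries scored against 3 candidate images.
rng = np.random.default_rng(0)
sim = rng.standard_normal((4, 3))
positives = [[0], [0], [1], [2]]  # captions 0-1 describe image 0, etc.
print(recall_at_k(sim, positives))  # prints R@1/R@5/R@10 for the toy scores
```

The two retrieval directions are evaluated separately (transpose `sim` and swap the positive lists for image-to-text), which is why each row in the table carries two triples of Recall@K values.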