OpenCodePapers

Cross-Modal Retrieval on COCO 2014
Leaderboard
| Paper | Code | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Model name | Release date |
|---|---|---|---|---|---|---|---|---|---|
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | 68.0 | 87.7 | 92.8 | | | | VAST | 2023-05-29 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 67.7 | 87.5 | 92.5 | 84.4 | 96.5 | 98.5 | X2-VLM (large) | 2022-11-22 |
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ | 67.2 | 87.7 | 92.8 | 84.8 | 96.5 | 98.3 | BEiT-3 | 2022-08-22 |
| Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ | 67.0 | 87.2 | 92.4 | 84.2 | 96.4 | 98.4 | XFM (base) | 2023-01-12 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 66.2 | 87.1 | 92.2 | 83.5 | 96.3 | 98.5 | X2-VLM (base) | 2022-11-22 |
| Position-guided Text Prompt for Vision-Language Pre-training | ✓ | 64.9 | 87.4 | 92.2 | 81.5 | 95.9 | 97.9 | PTP-BLIP (14M) | 2022-12-19 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 64.8 | 86.1 | 91.6 | 82.1 | 95.9 | 98.1 | OmniVL (14M) | 2022-09-15 |
| Dissecting Deep Metric Learning Losses for Image-Text Retrieval | ✓ | 63.6 | 86.0 | 91.5 | 81.4 | 95.6 | 97.9 | VSE-Gradient | 2022-10-21 |
| Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ | 63.4 | 85.8 | 91.5 | 81.2 | 95.6 | 98.2 | X-VLM (base) | 2021-11-16 |
| Florence: A New Foundation Model for Computer Vision | ✓ | 63.2 | 85.7 | | 81.8 | 95.2 | | Florence | 2021-11-22 |
| Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 62.9 | 84.8 | 92.8 | 80.7 | 95.1 | 96.8 | VK-OOD | 2023-09-21 |
| | | 62.8 | 84.8 | 91 | 80.7 | 95.3 | 97.8 | Aurora (ours, r=128) | |
| Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning | ✓ | 62.1 | 85.9 | 92.0 | 48.0 | 75.6 | 84.5 | DSMD | 2024-04-16 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ | 61.4 | 84.4 | 90.9 | | | | VALOR | 2023-04-17 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ | 60.7 | 84.3 | 90.5 | 77.6 | 94.3 | 97.2 | ALBEF | 2021-07-16 |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ | 59.9 | 83.3 | 89.8 | 77 | 93.5 | 96.9 | ALIGN | 2021-02-11 |
| ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ | 59.5 | 83.4 | 90.1 | 77.4 | 93.6 | 97.1 | ERNIE-ViL 2.0 | 2022-09-30 |
| Vision-Language Pre-Training with Triple Contrastive Learning | ✓ | 59.0 | 83.2 | 89.9 | 75.6 | 92.8 | 96.7 | TCL | 2022-02-21 |
| Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | ✓ | 57.5 | 82.8 | 89.8 | 73.5 | 92.2 | 96.0 | Oscar | 2020-04-13 |
| An Empirical Study of Training End-to-End Vision-and-Language Transformers | ✓ | 57.08 | 82.66 | 90.07 | 76.16 | 93.16 | 96.82 | METER | 2021-11-03 |
| ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval | | 52.6 | 79.6 | 87.6 | 68.9 | 90.1 | 95.4 | ViSTA | 2022-03-31 |
| ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval | ✓ | 51.3 | 79.2 | 87.5 | 64.9 | 88.6 | 94.5 | ALADIN | 2022-07-29 |
| 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting | ✓ | 50.3 | 79.3 | 87.7 | 67.9 | 90.5 | 95.4 | 3SHNet | 2024-04-26 |
| VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words | ✓ | 44.4 | 72.8 | 82.4 | | | | VisualSparta | 2021-01-01 |
| Plug-and-Play Regulators for Image-Text Matching | ✓ | 44.3 | 73.2 | 83.2 | 61.3 | 86.1 | 92.6 | RCAR | 2023-03-23 |
| NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | ✓ | 43.0 | | | 59.8 | | | NAPReg | 2023-01-07 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ | 42.7 | 72.9 | 83.1 | 61.5 | 86.3 | 92.7 | ViLT-B/32 | 2021-02-05 |
| Similarity Reasoning and Filtration for Image-Text Matching | ✓ | 41.9 | 70.7 | 81.3 | 57.8 | 84.9 | 91.6 | SGRAF | 2021-01-05 |
| LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives | | 41.5 | 72.1 | 82.2 | 55.6 | 82.4 | 91.0 | LILE | 2022-03-02 |
| Visual Semantic Reasoning for Image-Text Matching | ✓ | 40.5 | 70.6 | 81.1 | 53.0 | 81.1 | 89.4 | VSRN | 2019-09-06 |
| IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | ✓ | 39.7 | 69.1 | 79.8 | 53.7 | 83.2 | 91.0 | IMRAM | 2020-03-08 |
| Stacked Cross Attention for Image-Text Matching | ✓ | 38.6 | 69.3 | 80.4 | 50.4 | 82.2 | 90.0 | SCAN | 2018-03-21 |
| Learning Semantic Concepts and Order for Image and Sentence Matching | | 33.1 | 62.9 | 75.5 | 42.8 | 72.3 | 83.0 | SCO (ResNet) | 2017-12-06 |
| Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval | ✓ | 32.4 | 63.0 | 75.0 | 45.2 | 74.3 | 84.5 | PVSE | 2019-06-11 |
| Deep Visual-Semantic Alignments for Generating Image Descriptions | ✓ | 25.3 | 53.4 | 66.4 | 41.2 | 70.5 | 81.1 | Dual-Path (ResNet) | 2014-12-07 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ | | | | 70.7 | 89.1 | 93.7 | MaMMUT (ours) | 2023-03-29 |
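
The R@1, R@5 and R@10 columns in the leaderboard report Recall@K: the percentage of queries for which a correct item appears among the top K retrieved candidates. The following is a minimal sketch of that computation, not the evaluation code used by any listed paper. The function name `recall_at_k`, the score matrix and the relevance sets are all hypothetical and chosen only for illustration.

```python
def recall_at_k(scores, relevant, k):
    """Percentage of queries with at least one relevant candidate in the top k.

    scores[q][c] : similarity between query q and candidate c (higher is better)
    relevant[q]  : set of candidate indices that count as correct for query q
    """
    hits = 0
    for q, row in enumerate(scores):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        # A query counts as a hit if any relevant candidate is in the top k.
        if relevant[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy data (made up): 3 text queries scored against 4 candidate images.
scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked 1st
    [0.4, 0.2, 0.8, 0.5],  # correct image 1 is ranked 4th
    [0.6, 0.1, 0.5, 0.7],  # correct image 2 is ranked 3rd
]
relevant = [{0}, {1}, {2}]

r_at_1 = recall_at_k(scores, relevant, 1)
r_at_3 = recall_at_k(scores, relevant, 3)
r_at_4 = recall_at_k(scores, relevant, 4)
print(round(r_at_1, 1), round(r_at_3, 1), round(r_at_4, 1))
```

Because the top K candidates always contain the top K-1, Recall@K cannot decrease as K grows, so for any model R@1 ≤ R@5 ≤ R@10 within the same retrieval direction. Using a set for `relevant[q]` also covers the image-to-text direction, where one image can have several correct captions.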