OpenCodePapers

Cross-Modal Retrieval on Flickr30k

Cross-Modal Retrieval
Leaderboard
| Paper | Code | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Model name | Release date |
|---|---|---|---|---|---|---|---|---|---|
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 98.8 | 100 | 100 | 91.8 | 98.6 | 99.5 | X2-VLM (large) | 2022-11-22 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 98.5 | 100 | 100 | 90.4 | 98.2 | 99.3 | X2-VLM (base) | 2022-11-22 |
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ | 98.0 | 100.0 | 100.0 | 90.3 | 98.7 | 99.5 | BEiT-3 | 2022-08-22 |
| OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | - | 97.3 | 99.9 | 100 | 87.9 | 97.8 | 99.1 | OmniVL (14M) | 2022-09-15 |
| ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ | 97.2 | 100.0 | 100.0 | 93.3 | 99.4 | 99.8 | ERNIE-ViL 2.0 | 2022-09-30 |
| - | - | 97.2 | 100 | 100 | 86.8 | 97.6 | 98.9 | Aurora (ours, r=128) | - |
| Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | X-VLM (base) | 2021-11-16 |
| Dissecting Deep Metric Learning Losses for Image-Text Retrieval | ✓ | 97.0 | 99.6 | 100 | 86.3 | 97.4 | 99.0 | VSE-Gradient | 2022-10-21 |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ | 95.3 | 99.8 | 100 | 84.9 | 97.4 | 98.6 | ALIGN | 2021-02-11 |
| ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval | - | 89.5 | 98.4 | 99.6 | 75.8 | 94.2 | 96.9 | ViSTA | 2022-03-31 |
| Learning Relation Alignment for Calibrated Cross-modal Retrieval | ✓ | 88.3 | 98.4 | 99.4 | 76.86 | 93.3 | 95.72 | IAIS | 2021-05-28 |
| 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting | ✓ | 87.1 | 98.2 | 99.2 | 69.5 | 91.0 | 94.7 | 3SHNet | 2024-04-26 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ | 83.5 | 96.7 | 98.6 | 64.4 | 88.7 | 93.8 | ViLT-B/32 | 2021-02-05 |
| Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning | ✓ | 82.5 | 95.5 | 97.7 | 68.4 | 90.8 | 94.4 | DSMD | 2024-04-16 |
| Plug-and-Play Regulators for Image-Text Matching | ✓ | 82.3 | 96.0 | 98.4 | 62.6 | 85.8 | 91.1 | RCAR | 2023-03-23 |
| NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | ✓ | 79.6 | - | - | 60.0 | - | - | NAPReg | 2023-01-07 |
| Similarity Reasoning and Filtration for Image-Text Matching | ✓ | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | SGRAF | 2021-01-05 |
| Graph Structured Network for Image-Text Matching | ✓ | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | GSMN | 2020-04-01 |
| - | - | 75.3 | 93.4 | 97.3 | 54.98 | 81.3 | 88.26 | Pearl | - |
| IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | ✓ | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | IMRAM | 2020-03-08 |
| Stacked Cross Attention for Image-Text Matching | ✓ | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | SCAN | 2018-03-21 |
| Dual-Path Convolutional Image-Text Embeddings with Instance Loss | ✓ | 55.6 | 81.9 | - | - | - | - | Dual-Path (ResNet) | 2017-11-15 |
| Learning Semantic Concepts and Order for Image and Sentence Matching | - | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 | SCO (ResNet) | 2017-12-06 |
| VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | ✓ | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 | VSE++ (ResNet) | 2017-07-18 |
| Deep Cross-Modal Projection Learning for Image-Text Matching | ✓ | 49.6 | 76.8 | 86.1 | 37.3 | 65.7 | 75.5 | CMPL (ResNet) | 2018-09-01 |
| Dual-Path Convolutional Image-Text Embeddings with Instance Loss | ✓ | - | - | 89.5 | 39.1 | 69.2 | 80.9 | Dual-Path (ResNet) | 2017-11-15 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | - | - | - | 91.0 | 98.5 | 99.5 | VAST | 2023-05-29 |
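
Note on the R@K columns: Recall@K is commonly computed as the percentage of queries for which at least one correct item appears among the top K retrieved candidates. The sketch below illustrates that definition in plain Python. The function name `recall_at_k`, the similarity-matrix layout, and the toy scores are assumptions made for illustration. They are not taken from any paper listed above, and individual evaluation protocols may differ in detail.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    similarity[q][c] is the score between query q and candidate c
    (higher means more similar); relevant[q] holds the candidate indices
    counted as correct for query q. Both layouts are illustrative assumptions.
    """
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices by descending score and keep the top k.
        top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        if gold.intersection(top_k):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 queries, 4 candidates (scores are made up).
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.5],  # correct candidate 1 is ranked 3rd
    [0.1, 0.2, 0.3, 0.7],  # correct candidate 3 is ranked 1st
]
toy_relevant = [{0}, {1}, {3}]

r1 = recall_at_k(toy_similarity, toy_relevant, k=1)
r3 = recall_at_k(toy_similarity, toy_relevant, k=3)
```

In this toy case two of the three queries have their correct candidate ranked first, and all three have it within the top three. Using a set of correct indices per query lets the same function cover the case where one image has several reference captions.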