X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 98.8 | 100 | 100 | 91.8 | 98.6 | 99.5 | X2-VLM (large) | 2022-11-22 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 98.5 | 100 | 100 | 90.4 | 98.2 | 99.3 | X2-VLM (base) | 2022-11-22 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 98.0 | 100.0 | 100.0 | 90.3 | 98.7 | 99.5 | BEiT-3 | 2022-08-22 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 97.3 | 99.9 | 100 | 87.9 | 97.8 | 99.1 | OmniVL (14M) | 2022-09-15 |
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ Link | 97.2 | 100.0 | 100.0 | 93.3 | 99.4 | 99.8 | ERNIE-ViL 2.0 | 2022-09-30 |
[]() | | 97.2 | 100 | 100 | 86.8 | 97.6 | 98.9 | Aurora (ours, r=128) | |
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | X-VLM (base) | 2021-11-16 |
Dissecting Deep Metric Learning Losses for Image-Text Retrieval | ✓ Link | 97.0 | 99.6 | 100 | 86.3 | 97.4 | 99.0 | VSE-Gradient | 2022-10-21 |
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 95.3 | 99.8 | 100 | 84.9 | 97.4 | 98.6 | ALIGN | 2021-02-11 |
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval | | 89.5 | 98.4 | 99.6 | 75.8 | 94.2 | 96.9 | ViSTA | 2022-03-31 |
Learning Relation Alignment for Calibrated Cross-modal Retrieval | ✓ Link | 88.3 | 98.4 | 99.4 | 76.86 | 93.3 | 95.72 | IAIS | 2021-05-28 |
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting | ✓ Link | 87.1 | 98.2 | 99.2 | 69.5 | 91.0 | 94.7 | 3SHNet | 2024-04-26 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 83.5 | 96.7 | 98.6 | 64.4 | 88.7 | 93.8 | ViLT-B/32 | 2021-02-05 |
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning | ✓ Link | 82.5 | 95.5 | 97.7 | 68.4 | 90.8 | 94.4 | DSMD | 2024-04-16 |
Plug-and-Play Regulators for Image-Text Matching | ✓ Link | 82.3 | 96.0 | 98.4 | 62.6 | 85.8 | 91.1 | RCAR | 2023-03-23 |
NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | ✓ Link | 79.6 | | | 60.0 | | | NAPReg | 2023-01-07 |
Similarity Reasoning and Filtration for Image-Text Matching | ✓ Link | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | SGRAF | 2021-01-05 |
Graph Structured Network for Image-Text Matching | ✓ Link | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | GSMN | 2020-04-01 |
[]() | | 75.3 | 93.4 | 97.3 | 54.98 | 81.3 | 88.26 | Pearl | |
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | ✓ Link | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | IMRAM | 2020-03-08 |
Stacked Cross Attention for Image-Text Matching | ✓ Link | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | SCAN | 2018-03-21 |
Dual-Path Convolutional Image-Text Embeddings with Instance Loss | ✓ Link | 55.6 | 81.9 | | | | | Dual-Path (ResNet) | 2017-11-15 |
Learning Semantic Concepts and Order for Image and Sentence Matching | | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 | SCO (ResNet) | 2017-12-06 |
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | ✓ Link | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 | VSE++ (ResNet) | 2017-07-18 |
Deep Cross-Modal Projection Learning for Image-Text Matching | ✓ Link | 49.6 | 76.8 | 86.1 | 37.3 | 65.7 | 75.5 | CMPL (ResNet) | 2018-09-01 |
Dual-Path Convolutional Image-Text Embeddings with Instance Loss | ✓ Link | | | 89.5 | 39.1 | 69.2 | 80.9 | Dual-Path (ResNet) | 2017-11-15 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | | | | 91.0 | 98.5 | 99.5 | VAST | 2023-05-29 |