Paper | Code | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 68.0 | 87.7 | 92.8 | | | | VAST | 2023-05-29 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 67.7 | 87.5 | 92.5 | 84.4 | 96.5 | 98.5 | X2-VLM (large) | 2022-11-22 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 67.2 | 87.7 | 92.8 | 84.8 | 96.5 | 98.3 | BEiT-3 | 2022-08-22 |
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | ✓ Link | 67.0 | 87.2 | 92.4 | 84.2 | 96.4 | 98.4 | XFM (base) | 2023-01-12 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 66.2 | 87.1 | 92.2 | 83.5 | 96.3 | 98.5 | X2-VLM (base) | 2022-11-22 |
Position-guided Text Prompt for Vision-Language Pre-training | ✓ Link | 64.9 | 87.4 | 92.2 | 81.5 | 95.9 | 97.9 | PTP-BLIP (14M) | 2022-12-19 |
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 64.8 | 86.1 | 91.6 | 82.1 | 95.9 | 98.1 | OmniVL (14M) | 2022-09-15 |
Dissecting Deep Metric Learning Losses for Image-Text Retrieval | ✓ Link | 63.6 | 86.0 | 91.5 | 81.4 | 95.6 | 97.9 | VSE-Gradient | 2022-10-21 |
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 63.4 | 85.8 | 91.5 | 81.2 | 95.6 | 98.2 | X-VLM (base) | 2021-11-16 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 63.2 | 85.7 | | 81.8 | 95.2 | | Florence | 2021-11-22 |
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ Link | 62.9 | 84.8 | 92.8 | 80.7 | 95.1 | 96.8 | VK-OOD | 2023-09-21 |
| | | 62.8 | 84.8 | 91.0 | 80.7 | 95.3 | 97.8 | Aurora (ours, r=128) | |
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning | ✓ Link | 62.1 | 85.9 | 92.0 | 48.0 | 75.6 | 84.5 | DSMD | 2024-04-16 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 61.4 | 84.4 | 90.9 | | | | VALOR | 2023-04-17 |
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | ✓ Link | 60.7 | 84.3 | 90.5 | 77.6 | 94.3 | 97.2 | ALBEF | 2021-07-16 |
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 59.9 | 83.3 | 89.8 | 77.0 | 93.5 | 96.9 | ALIGN | 2021-02-11 |
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | ✓ Link | 59.5 | 83.4 | 90.1 | 77.4 | 93.6 | 97.1 | ERNIE-ViL 2.0 | 2022-09-30 |
Vision-Language Pre-Training with Triple Contrastive Learning | ✓ Link | 59.0 | 83.2 | 89.9 | 75.6 | 92.8 | 96.7 | TCL | 2022-02-21 |
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | ✓ Link | 57.5 | 82.8 | 89.8 | 73.5 | 92.2 | 96.0 | Oscar | 2020-04-13 |
An Empirical Study of Training End-to-End Vision-and-Language Transformers | ✓ Link | 57.1 | 82.7 | 90.1 | 76.2 | 93.2 | 96.8 | METER | 2021-11-03 |
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval | | 52.6 | 79.6 | 87.6 | 68.9 | 90.1 | 95.4 | ViSTA | 2022-03-31 |
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval | ✓ Link | 51.3 | 79.2 | 87.5 | 64.9 | 88.6 | 94.5 | ALADIN | 2022-07-29 |
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting | ✓ Link | 50.3 | 79.3 | 87.7 | 67.9 | 90.5 | 95.4 | 3SHNet | 2024-04-26 |
VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words | ✓ Link | 44.4 | 72.8 | 82.4 | | | | VisualSparta | 2021-01-01 |
Plug-and-Play Regulators for Image-Text Matching | ✓ Link | 44.3 | 73.2 | 83.2 | 61.3 | 86.1 | 92.6 | RCAR | 2023-03-23 |
NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | ✓ Link | 43.0 | | | 59.8 | | | NAPReg | 2023-01-07 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 42.7 | 72.9 | 83.1 | 61.5 | 86.3 | 92.7 | ViLT-B/32 | 2021-02-05 |
Similarity Reasoning and Filtration for Image-Text Matching | ✓ Link | 41.9 | 70.7 | 81.3 | 57.8 | 84.9 | 91.6 | SGRAF | 2021-01-05 |
LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives | | 41.5 | 72.1 | 82.2 | 55.6 | 82.4 | 91.0 | LILE | 2022-03-02 |
Visual Semantic Reasoning for Image-Text Matching | ✓ Link | 40.5 | 70.6 | 81.1 | 53.0 | 81.1 | 89.4 | VSRN | 2019-09-06 |
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | ✓ Link | 39.7 | 69.1 | 79.8 | 53.7 | 83.2 | 91.0 | IMRAM | 2020-03-08 |
Stacked Cross Attention for Image-Text Matching | ✓ Link | 38.6 | 69.3 | 80.4 | 50.4 | 82.2 | 90.0 | SCAN | 2018-03-21 |
Learning Semantic Concepts and Order for Image and Sentence Matching | | 33.1 | 62.9 | 75.5 | 42.8 | 72.3 | 83.0 | SCO (ResNet) | 2017-12-06 |
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval | ✓ Link | 32.4 | 63.0 | 75.0 | 45.2 | 74.3 | 84.5 | PVSE | 2019-06-11 |
Deep Visual-Semantic Alignments for Generating Image Descriptions | ✓ Link | 25.3 | 53.4 | 66.4 | 41.2 | 70.5 | 81.1 | Dual-Path (ResNet) | 2014-12-07 |
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ Link | | | | 70.7 | 89.1 | 93.7 | MaMMUT (ours) | 2023-03-29 |
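
All metric columns report Recall@K: the percentage of queries for which a correct match appears among the top K retrieved candidates (rows are ranked by text-to-image R@1). The sketch below is our own illustration of how such numbers are typically computed from a query-candidate similarity matrix; the function name `recall_at_k`, the toy scores, and the positive-index convention are assumptions for exposition, not code from any paper listed above.

```python
import numpy as np

def recall_at_k(sim, positives, ks=(1, 5, 10)):
    """Recall@K from a (num_queries x num_candidates) score matrix.

    positives[i] lists the candidate indices that are correct for
    query i; a query counts as a hit at K if ANY of them ranks in
    the top K. (On COCO, text-to-image has 1 positive image per
    caption query, while image-to-text has 5 positive captions
    per image query.)
    """
    order = np.argsort(-sim, axis=1)  # candidate indices, best first
    return {
        f"R@{k}": 100.0 * np.mean([
            bool(set(order[i, :k]) & set(positives[i]))
            for i in range(sim.shape[0])
        ])
        for k in ks
    }

# Toy example: 4 caption queries scored against 3 candidate images.
rng = np.random.default_rng(0)
sim = rng.standard_normal((4, 3))
positives = [[0], [0], [1], [2]]  # captions 0-1 describe image 0, etc.
print(recall_at_k(sim, positives))  # prints R@1/R@5/R@10 for the toy scores
```

The two retrieval directions are evaluated separately (transpose `sim` and swap the positive lists for image-to-text), which is why each row in the table carries two triples of Recall@K values.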