Paper | Code | R@1 | R@10 | R@5 | Sum(R@1,5,10) | ModelName | ReleaseDate |
---|---|---|---|---|---|---|---|
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts | ✓ Link | 15.2 | 49.6 | 36.7 | 101.5 | PaCE | 2023-05-24 |
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | ✓ Link | 13.8 | 39.4 | 30.0 | 83.2 | VLMo | 2021-11-03 |
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | ✓ Link | 11.5 | 33.8 | 25.6 | 71.0 | ViLT | 2021-02-05 |
Stacked Cross Attention for Image-Text Matching | ✓ Link | 10.4 | 37.1 | 27.0 | 74.5 | SCAN | 2018-03-21 |
PhotoChat: A Human-Human Dialogue Dataset with Photo Sharing Behavior for Joint Image-Text Modeling | | 9.0 | 35.7 | 26.4 | 71.1 | DE++ | 2021-07-06 |
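
The R@K columns report Recall@K as a percentage, and the Sum column is the total of R@1, R@5, and R@10. The following is a minimal sketch of how such numbers are typically computed, assuming each query has exactly one ground-truth item and that its 1-based rank in the retrieved list is known. The function names and the `toy_ranks` values are hypothetical and are not taken from any of the listed papers.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose ground-truth item is ranked in the top k.

    `ranks` holds the 1-based rank of the correct item for each query
    (an assumption of this sketch: one correct item per query).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


def recall_sum(ranks: Sequence[int], ks: Sequence[int] = (1, 5, 10)) -> float:
    """Sum of Recall@K over the given cutoffs, as in the Sum(R@1,5,10) column."""
    return sum(recall_at_k(ranks, k) for k in ks)


# Hypothetical ranks for 10 queries (illustrative only, not real results).
toy_ranks = [1, 3, 7, 12, 2, 1, 25, 5, 9, 40]

r1 = recall_at_k(toy_ranks, 1)
r5 = recall_at_k(toy_ranks, 5)
r10 = recall_at_k(toy_ranks, 10)

# Recall@K cannot decrease as K grows, which is a quick sanity check for a row.
assert r1 <= r5 <= r10

# The Sum column can be checked against the individual columns, here for the
# PaCE row of the table (R@1 = 15.2, R@5 = 36.7, R@10 = 49.6, Sum = 101.5).
pace_sum = round(15.2 + 36.7 + 49.6, 1)
```

Because Recall@K is monotone in K, a row where R@5 exceeds R@10, or where the three values do not add up to the reported sum (allowing for rounding), likely contains a transcription error.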