GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 125.51 | 88.9 | 75.86 | 58.9 | 38.95 | 63.66 | 32.95 | 16.11 | GIT2, Single Model | 2022-05-27 |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 124.35 | 88.57 | 75.56 | 58.99 | 39.98 | 63.99 | 33.47 | 15.75 | PaLI | 2022-09-14 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 123.92 | 88.56 | 75.48 | 58.46 | 38.44 | 63.5 | 32.86 | 15.96 | GIT, Single Model | 2022-05-27 |
[]() | | 120.73 | 87.53 | 74.49 | 57.89 | 38.92 | 62.91 | 32.71 | 15.54 | CoCa - Google Brain | |
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 115.54 | 86.48 | 72.6 | 55.26 | 36.31 | 61.9 | 31.8 | 15.06 | Microsoft Cognitive Services team | 2020-09-28 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 110.76 | 84.36 | 69.83 | 52.42 | 33.74 | 60.46 | 30.97 | 14.61 | Single Model | 2021-08-24 |
[]() | | 109.33 | 84.47 | 69.66 | 51.95 | 33.46 | 60.34 | 31.08 | 14.79 | FudanFVL | |
[]() | | 108.04 | 83.71 | 68.56 | 50.9 | 32.72 | 59.8 | 30.79 | 14.71 | FudanWYZ | |
[]() | | 100.15 | 84.04 | 68.58 | 49.98 | 30.78 | 59.23 | 29.53 | 14.15 | IEDA-LAB | |
[]() | | 99.51 | 81.62 | 66.65 | 49.39 | 31.42 | 58.83 | 30.48 | 14.88 | firethehole | |
[]() | | 95.73 | 83.58 | 67.99 | 49.29 | 29.96 | 58.47 | 28.84 | 13.64 | MD | |
[]() | | 95.69 | 82.55 | 66.55 | 47.8 | 29.0 | 58.22 | 29.11 | 14.37 | vll@mk514 | |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 95.16 | 82.77 | 66.94 | 47.02 | 27.97 | 57.95 | 28.24 | 13.36 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
[]() | | 89.87 | 81.93 | 65.88 | 46.72 | 27.94 | 57.34 | 27.89 | 12.98 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | |
[]() | | 87.41 | 79.61 | 63.01 | 43.59 | 25.85 | 55.63 | 26.63 | 12.11 | icgp2ssi1_coco_si_0.02_5_test | |
[]() | | 85.89 | 79.67 | 62.73 | 42.87 | 24.8 | 55.37 | 26.68 | 12.24 | evertyhing | |
[]() | | 84.58 | 77.05 | 56.97 | 36.84 | 19.85 | 53.06 | 28.42 | 14.72 | Human | |
[]() | | 84.0 | 79.21 | 62.26 | 40.77 | 22.56 | 54.62 | 26.3 | 12.47 | RCAL | |
[]() | | 82.07 | 80.54 | 62.32 | 40.65 | 22.37 | 54.78 | 25.91 | 11.53 | Oscar | |
[]() | | 80.21 | 80.24 | 62.31 | 41.07 | 21.53 | 54.52 | 25.98 | 12.12 | vinvl_yuan_cbs | |
[]() | | 79.72 | 79.69 | 60.75 | 39.06 | 20.97 | 53.37 | 25.64 | 11.81 | cxy_nocaps_training | |
[]() | | 79.44 | 79.59 | 60.52 | 38.95 | 20.72 | 53.18 | 25.64 | 11.88 | Xinyi | |
[]() | | 79.14 | 79.21 | 62.06 | 42.51 | 25.06 | 55.24 | 26.87 | 12.14 | camel XE | |
[]() | | 76.34 | 77.76 | 59.0 | 38.29 | 21.0 | 53.15 | 25.59 | 11.87 | MQ-UpDown-C | |
[]() | | 74.2 | 77.68 | 58.31 | 37.04 | 19.85 | 52.64 | 24.97 | 11.45 | UpDown + ELMo + CBS | |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 67.69 | | | | | | | 11.26 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 66.82 | | | | | | | 10.92 | ClipCap (Transformer) | 2021-11-18 |
[]() | | 63.96 | 73.6 | 54.26 | 34.59 | 18.95 | 51.23 | 24.52 | 11.14 | 7_10-7_40000_predict_test.json | |
[]() | | 61.98 | 74.77 | 53.67 | 30.66 | 13.85 | 49.45 | 22.55 | 9.83 | Neural Baby Talk + CBS | |
[]() | | 58.5 | 72.91 | 53.74 | 33.49 | 18.04 | 50.53 | 23.12 | 10.28 | None | |
[]() | | 56.85 | 75.25 | 56.93 | 36.91 | 20.49 | 51.84 | 23.6 | 10.33 | nocaps_training | |
[]() | | 56.85 | 75.25 | 56.93 | 36.91 | 20.49 | 51.84 | 23.6 | 10.33 | UpDown | |
[]() | | 53.21 | 73.69 | 54.1 | 32.37 | 15.99 | 49.63 | 21.93 | 9.26 | Neural Baby Talk | |
[]() | | 51.16 | 73.73 | 53.98 | 33.1 | 17.28 | 50.0 | 22.27 | 9.7 | YX | |
[]() | | 50.34 | 73.19 | 53.56 | 32.94 | 17.49 | 49.79 | 22.43 | 9.7 | area_attention | |
[]() | | 49.62 | 74.07 | 55.53 | 35.22 | 18.79 | 50.77 | 22.41 | 9.54 | B2 | |
[]() | | 47.53 | 70.84 | 50.79 | 30.26 | 16.14 | 48.61 | 21.48 | 9.28 | coco_all_19 | |
[]() | | 46.64 | 68.86 | 48.7 | 26.85 | 12.6 | 47.13 | 20.18 | 8.37 | Yu-Wu | |
[]() | | 40.45 | 70.05 | 48.92 | 26.19 | 12.11 | 47.04 | 20.05 | 8.28 | CS395T | |