Paper | Code | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | SPICE | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 126.67 | 86.28 | 71.19 | 52.63 | 32.0 | 61.35 | 30.99 | 15.49 | PaLI | 2022-09-14 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 122.27 | 86.28 | 71.15 | 52.36 | 30.15 | 60.91 | 30.15 | 15.62 | GIT2, Single Model | 2022-05-27 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 122.04 | 85.99 | 71.28 | 52.66 | 30.04 | 60.96 | 30.45 | 15.7 | GIT, Single Model | 2022-05-27 |
- | | 121.69 | 84.75 | 70.24 | 52.13 | 31.89 | 60.57 | 30.18 | 15.13 | CoCa - Google Brain | |
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 110.14 | 81.73 | 65.48 | 45.58 | 25.78 | 57.57 | 28.17 | 13.74 | Microsoft Cognitive Services team | 2020-09-28 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 109.49 | 80.89 | 64.21 | 44.38 | 24.47 | 56.69 | 27.91 | 13.89 | Single Model | 2021-08-24 |
- | | 106.55 | 81.44 | 64.71 | 45.26 | 25.31 | 57.29 | 28.13 | 14.21 | FudanFVL | |
- | | 103.75 | 80.0 | 62.7 | 43.58 | 24.57 | 56.41 | 27.75 | 13.75 | FudanWYZ | |
- | | 91.62 | 74.84 | 53.9 | 33.51 | 16.6 | 51.5 | 26.83 | 14.21 | Human | |
- | | 88.54 | 76.65 | 60.06 | 41.58 | 22.66 | 55.08 | 27.39 | 13.87 | firethehole | |
- | | 87.51 | 79.52 | 61.01 | 40.14 | 20.64 | 55.0 | 25.55 | 12.52 | IEDA-LAB | |
- | | 87.15 | 75.71 | 56.39 | 35.94 | 17.96 | 51.75 | 24.01 | 11.43 | icgp2ssi1_coco_si_0.02_5_test | |
- | | 85.18 | 75.5 | 56.14 | 34.53 | 16.69 | 51.54 | 23.69 | 11.18 | evertyhing | |
- | | 78.91 | 76.41 | 56.87 | 35.99 | 16.92 | 52.51 | 24.5 | 12.14 | vll@mk514 | |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 78.01 | 75.78 | 56.1 | 34.02 | 15.86 | 51.99 | 23.55 | 11.48 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
- | | 77.39 | 76.81 | 57.39 | 36.13 | 17.85 | 52.54 | 23.79 | 11.59 | MD | |
- | | 75.39 | 72.47 | 52.01 | 28.26 | 11.94 | 48.81 | 22.04 | 10.68 | RCAL | |
- | | 73.75 | 74.98 | 53.26 | 28.88 | 12.42 | 50.0 | 21.73 | 9.72 | Oscar | |
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | ✓ Link | 72.6 | | | | | | | 11.1 | GRIT (zero-shot, no CBS, no VL pretraining, single model) | 2022-07-20 |
- | | 72.13 | 76.2 | 57.25 | 36.37 | 17.68 | 52.86 | 23.88 | 11.53 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | |
- | | 71.43 | 73.95 | 52.76 | 29.34 | 11.69 | 49.5 | 22.18 | 10.57 | vinvl_yuan_cbs | |
- | | 70.21 | 72.94 | 51.36 | 28.32 | 11.99 | 48.6 | 21.73 | 10.15 | UpDown-C | |
- | | 68.92 | 72.53 | 49.99 | 27.18 | 10.57 | 47.23 | 21.57 | 10.05 | Xinyi | |
- | | 68.5 | 73.07 | 50.81 | 27.58 | 10.98 | 47.53 | 21.65 | 10.01 | cxy_nocaps_training | |
- | | 66.67 | 71.57 | 48.58 | 25.77 | 9.68 | 47.13 | 20.88 | 9.74 | UpDown + ELMo + CBS | |
- | | 58.48 | 65.98 | 43.2 | 21.16 | 7.5 | 44.47 | 19.04 | 8.77 | Neural Baby Talk + CBS | |
- | | 54.56 | 71.34 | 50.32 | 29.44 | 12.99 | 48.85 | 21.55 | 9.9 | camel XE | |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 49.35 | | | | | | | 9.7 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 49.14 | | | | | | | 9.57 | ClipCap (Transformer) | 2021-11-18 |
- | | 48.73 | 64.45 | 42.8 | 21.48 | 7.92 | 44.11 | 18.31 | 8.2 | Neural Baby Talk | |
- | | 43.2 | 66.14 | 44.7 | 24.58 | 10.14 | 45.72 | 19.95 | 9.35 | 7_10-7_40000_predict_test.json | |
- | | 39.39 | 60.95 | 38.3 | 17.19 | 6.11 | 42.46 | 16.97 | 7.62 | Yu-Wu | |
- | | 36.12 | 47.08 | 22.24 | 7.41 | 1.83 | 31.57 | 17.94 | 9.39 | Check | |
- | | 30.09 | 66.54 | 44.28 | 24.23 | 10.17 | 44.84 | 18.29 | 8.08 | nocaps_training | |
- | | 30.09 | 66.54 | 44.28 | 24.23 | 10.17 | 44.84 | 18.29 | 8.08 | UpDown | |
- | | 26.55 | 64.58 | 41.56 | 21.71 | 8.72 | 43.59 | 17.43 | 7.72 | area_attention | |
- | | 26.25 | 66.44 | 42.47 | 21.15 | 8.54 | 44.23 | 17.2 | 7.52 | YX | |
- | | 25.91 | 66.32 | 44.27 | 23.82 | 9.46 | 44.37 | 17.48 | 7.61 | B2 | |
- | | 23.07 | 61.62 | 38.55 | 18.45 | 7.55 | 41.58 | 16.07 | 7.4 | coco_all_19 | |
- | | 21.3 | 63.0 | 39.71 | 19.99 | 8.2 | 43.02 | 16.19 | 7.2 | CS395T | |
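
The metric columns follow the standard COCO caption-evaluation protocol (BLEU-1 through BLEU-4, ROUGE-L, METEOR, CIDEr, and SPICE), with scores reported on a 0-100 scale. The snippet below is a minimal sketch of how such scores are typically computed with the `pycocoevalcap` toolkit; the image ids and captions are illustrative placeholders, not entries from the table above.

```python
# Minimal sketch, assuming the `pycocoevalcap` package is installed.
# All ids and captions here are made-up placeholders.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Reference captions: image id -> list of ground-truth captions
# (the official pipeline first lowercases/tokenizes with PTBTokenizer).
gts = {
    "img_0": ["a dog runs across a grassy field",
              "a brown dog running on the grass"],
    "img_1": ["a plate of food on a wooden table",
              "food served on a white plate"],
}
# System output: image id -> exactly one generated caption per image.
res = {
    "img_0": ["a dog running through a field"],
    "img_1": ["a plate of food on a table"],
}

bleu, _ = Bleu(4).compute_score(gts, res)    # list: [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr
rouge, _ = Rouge().compute_score(gts, res)   # corpus-level ROUGE-L

print("BLEU-1..4:", [round(100 * b, 2) for b in bleu])
print("CIDEr:    ", round(100 * cider, 2))
print("ROUGE-L:  ", round(100 * rouge, 2))
# METEOR and SPICE expose the same compute_score(gts, res) interface
# (pycocoevalcap.meteor.meteor.Meteor, pycocoevalcap.spice.spice.Spice)
# but require a Java runtime, so they are left out of this sketch.
```

Each scorer takes two dicts keyed by image id, the references (one or more captions per image) and the system output (exactly one caption per image), and returns a corpus-level score together with per-image scores.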