Paper | Code | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | SPICE | Model Name | Release Date |
---|---|---|---|---|---|---|---|---|---|---|---|
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 124.77 | 88.43 | 75.02 | 57.87 | 37.65 | 63.19 | 32.56 | 16.06 | GIT2 | 2022-05-27 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 123.39 | 88.1 | 74.81 | 57.68 | 37.35 | 63.12 | 32.5 | 15.94 | GIT | 2022-05-27 |
Scaling Up Vision-Language Pre-training for Image Captioning | | 114.25 | 85.62 | 71.36 | 53.62 | 34.65 | 61.2 | 31.27 | 14.85 | Microsoft Cognitive Services team | 2021-11-24 |
- | | 102.39 | 83.69 | 67.96 | 49.38 | 29.69 | 58.99 | 29.68 | 14.71 | VLAF2 | |
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 100.12 | 82.27 | 66.04 | 47.48 | 28.95 | 58.26 | 29.47 | 14.04 | Microsoft Cognitive Services team | 2020-09-28 |
- | | 85.34 | 76.64 | 56.46 | 36.37 | 19.48 | 52.83 | 28.15 | 14.67 | Human | |
- | | 85.3 | 78.77 | 61.54 | 41.85 | 23.77 | 54.59 | 25.96 | 11.84 | icp2ssi1_coco_si_0.02_5_test | |
- | | 85.02 | 79.17 | 60.29 | 39.06 | 20.81 | 53.39 | 26.54 | 12.74 | test_cbs2 | |
- | | 73.09 | 76.59 | 56.74 | 35.39 | 18.41 | 51.82 | 24.42 | 11.2 | UpDown + ELMo + CBS | |
- | | 61.48 | 73.42 | 52.12 | 29.35 | 12.88 | 48.74 | 22.06 | 9.69 | Neural Baby Talk + CBS | |
- | | 54.25 | 74.0 | 55.11 | 35.23 | 19.16 | 50.92 | 22.96 | 10.14 | UpDown | |
- | | 53.36 | 72.33 | 52.42 | 30.83 | 14.73 | 48.87 | 21.52 | 9.15 | Neural Baby Talk | |