PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 149.1 | | | | | | | | PaLI | 2022-09-14 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 124.18 | 88.86 | 75.86 | 59.94 | 41.1 | 63.82 | 33.83 | 16.36 | GIT2, Single Model | 2022-05-27 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 122.4 | 88.55 | 76.1 | 60.53 | 41.65 | 64.02 | 33.41 | 16.18 | GIT, Single Model | 2022-05-27 |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 121.09 | 88.02 | 75.21 | 59.38 | 41.16 | 64.39 | 34.22 | 15.69 | PaLI | 2022-09-14 |
[]() | | 117.9 | 87.27 | 74.29 | 58.01 | 39.24 | 63.12 | 33.01 | 15.49 | CoCa - Google Brain | |
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 112.82 | 86.33 | 72.83 | 55.94 | 37.97 | 62.48 | 32.7 | 15.22 | Microsoft Cognitive Services team | 2020-09-28 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 108.98 | 84.64 | 70.0 | 52.96 | 34.66 | 61.01 | 31.97 | 14.6 | Single Model | 2021-08-24 |
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | ✓ Link | 105.9 | | | | | | | 13.6 | GRIT (zero-shot, no VL pretraining, no CBS) | 2022-07-20 |
[]() | | 104.9 | 84.2 | 69.57 | 52.56 | 34.8 | 60.52 | 31.77 | 15.04 | FudanFVL | |
[]() | | 104.25 | 82.91 | 68.02 | 50.75 | 33.59 | 59.67 | 31.33 | 14.85 | FudanWYZ | |
[]() | | 102.64 | 84.4 | 69.8 | 51.89 | 32.86 | 60.07 | 30.43 | 14.47 | IEDA-LAB | |
[]() | | 101.69 | 83.77 | 68.7 | 51.26 | 32.76 | 59.75 | 30.51 | 14.99 | vll@mk514 | |
[]() | | 100.03 | 84.03 | 69.12 | 51.16 | 33.15 | 59.67 | 30.06 | 14.08 | MD | |
[]() | | 99.9 | 81.86 | 67.2 | 50.5 | 34.11 | 59.54 | 31.61 | 15.17 | firethehole | |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 97.99 | 83.24 | 68.04 | 49.68 | 30.62 | 58.54 | 29.51 | 13.63 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
[]() | | 96.63 | 82.9 | 68.09 | 49.73 | 31.24 | 58.62 | 29.37 | 13.61 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | |
[]() | | 88.08 | 80.5 | 64.48 | 46.46 | 29.59 | 56.84 | 28.7 | 13.04 | camel XE | |
[]() | | 87.86 | 79.58 | 63.09 | 43.92 | 26.07 | 55.88 | 27.97 | 12.6 | evertyhing | |
[]() | | 87.28 | 80.68 | 64.7 | 45.33 | 27.09 | 56.76 | 27.7 | 12.79 | RCAL | |
[]() | | 87.21 | 80.26 | 63.94 | 44.65 | 27.23 | 56.4 | 27.7 | 12.28 | icgp2ssi1_coco_si_0.02_5_test | |
[]() | | 85.81 | 81.64 | 63.79 | 43.43 | 25.15 | 55.06 | 27.25 | 12.35 | cxy_nocaps_training | |
[]() | | 85.81 | 81.64 | 63.79 | 43.43 | 25.15 | 55.06 | 27.25 | 12.35 | Test file provided by the author | |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 84.85 | | | | | | | 12.14 | ClipCap (Transformer) | 2021-11-18 |
[]() | | 84.83 | 80.7 | 63.27 | 42.86 | 25.78 | 55.91 | 27.23 | 12.06 | Oscar | |
[]() | | 84.79 | 81.61 | 63.74 | 43.22 | 24.82 | 55.03 | 27.27 | 12.3 | Xinyi | |
[]() | | 80.61 | 76.89 | 57.3 | 37.78 | 21.49 | 53.47 | 28.53 | 14.99 | Human | |
[]() | | 80.19 | 78.73 | 61.63 | 42.35 | 25.94 | 55.25 | 27.25 | 12.38 | MQ-UpDown-C | |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 79.73 | | | | | | | 12.2 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
[]() | | 76.02 | 77.65 | 59.58 | 39.86 | 22.83 | 53.98 | 26.35 | 11.8 | UpDown + ELMo + CBS | |
[]() | | 74.27 | 77.68 | 60.34 | 41.5 | 24.57 | 54.42 | 26.04 | 11.47 | UpDown | |
[]() | | 74.27 | 77.68 | 60.34 | 41.5 | 24.57 | 54.42 | 26.04 | 11.46 | nocaps_training | |
[]() | | 73.73 | 75.31 | 56.79 | 37.85 | 21.91 | 52.44 | 26.02 | 12.04 | 7_10-7_40000_predict_test.json | |
[]() | | 70.33 | 74.35 | 55.97 | 36.12 | 20.84 | 52.26 | 25.1 | 11.07 | None | |
[]() | | 69.59 | 76.48 | 58.76 | 39.28 | 21.96 | 53.22 | 25.08 | 10.94 | YX | |
[]() | | 68.98 | 77.06 | 59.97 | 40.54 | 23.8 | 53.49 | 25.06 | 10.55 | B2 | |
[]() | | 67.91 | 76.12 | 57.98 | 38.44 | 21.92 | 52.53 | 25.07 | 10.87 | area_attention | |
[]() | | 64.37 | 72.76 | 53.52 | 34.13 | 19.45 | 50.53 | 23.47 | 10.11 | coco_all_19 | |
[]() | | 62.96 | 76.49 | 56.2 | 33.73 | 15.14 | 50.84 | 23.68 | 10.13 | Neural Baby Talk + CBS | |
[]() | | 60.89 | 75.91 | 56.78 | 35.58 | 17.39 | 51.42 | 23.8 | 9.81 | Neural Baby Talk | |
[]() | | 58.93 | 72.24 | 51.88 | 29.57 | 14.54 | 49.05 | 22.04 | 8.91 | CS395T | |
[]() | | 53.34 | 72.05 | 52.89 | 31.92 | 16.71 | 49.64 | 22.04 | 9.16 | Yu-Wu | |