| Paper | Code | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | SPICE | Model | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 126.8 | | | | | | | | Lyrics | 2023-12-08 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 123.39 | 88.1 | 74.81 | 57.68 | 37.35 | 63.12 | 32.5 | 15.94 | GIT, Single Model | 2022-05-27 |
| | | 120.55 | 87.01 | 73.71 | 56.88 | 37.71 | 62.52 | 32.29 | 15.47 | CoCa - Google Brain | |
| VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 114.25 | 85.62 | 71.36 | 53.62 | 34.65 | 61.2 | 31.27 | 14.85 | Microsoft Cognitive Services team | 2020-09-28 |
| Prismer: A Vision-Language Model with Multi-Task Experts | ✓ Link | 110.84 | 84.87 | 69.99 | 52.48 | 33.66 | 60.55 | 31.13 | 14.91 | Prismer | 2023-03-04 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 110.31 | 83.78 | 68.86 | 51.06 | 32.2 | 59.86 | 30.55 | 14.49 | Single Model | 2021-08-24 |
| | | 108.29 | 83.9 | 68.77 | 50.84 | 32.17 | 59.82 | 30.64 | 14.72 | FudanFVL | |
| | | 106.81 | 82.95 | 67.45 | 49.58 | 31.38 | 59.18 | 30.32 | 14.56 | FudanWYZ | |
| | | 98.08 | 83.25 | 67.3 | 48.41 | 29.27 | 58.56 | 28.92 | 13.9 | IEDA-LAB | |
| | | 97.61 | 80.77 | 65.55 | 48.14 | 30.2 | 58.25 | 30.07 | 14.74 | firethehole | |
| | | 93.45 | 81.61 | 65.1 | 46.13 | 27.32 | 57.4 | 28.46 | 14.06 | vll@mk514 | |
| | | 93.0 | 82.43 | 66.25 | 47.18 | 28.2 | 57.57 | 28.09 | 13.35 | MD | |
| VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 92.46 | 81.59 | 65.15 | 45.04 | 26.15 | 56.96 | 27.57 | 13.07 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
| | | 87.56 | 81.03 | 64.62 | 45.26 | 26.52 | 56.7 | 27.36 | 12.81 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | |
| | | 87.34 | 79.0 | 61.95 | 42.36 | 24.62 | 55.03 | 26.29 | 12.01 | icgp2ssi1_coco_si_0.02_5_test | |
| | | 86.0 | 78.92 | 61.6 | 41.52 | 23.52 | 54.75 | 26.31 | 12.1 | evertyhing | |
| | | 85.34 | 76.64 | 56.46 | 36.37 | 19.48 | 52.83 | 28.15 | 14.67 | Human | |
| | | 82.88 | 78.19 | 60.74 | 39.11 | 21.24 | 53.85 | 25.72 | 12.2 | RCAL | |
| | | 80.93 | 79.57 | 60.83 | 38.83 | 21.02 | 54.07 | 25.33 | 11.29 | Oscar | |
| | | 79.04 | 79.32 | 60.95 | 39.5 | 20.3 | 53.8 | 25.44 | 11.9 | vinvl_yuan_cbs | |
| | | 78.48 | 78.75 | 59.36 | 37.56 | 19.72 | 52.54 | 25.13 | 11.57 | cxy_nocaps_training | |
| | | 78.23 | 78.58 | 59.05 | 37.39 | 19.43 | 52.35 | 25.12 | 11.62 | Xinyi | |
| | | 75.88 | 77.97 | 60.27 | 40.68 | 23.48 | 54.3 | 26.15 | 11.89 | camel XE | |
| | | 75.58 | 76.89 | 57.76 | 36.93 | 20.11 | 52.53 | 25.18 | 11.68 | MQ-UpDown-C | |
| | | 73.09 | 76.59 | 56.74 | 35.39 | 18.41 | 51.82 | 24.42 | 11.2 | UpDown + ELMo + CBS | |
| ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 65.83 | | | | | | | 10.86 | ClipCap (Transformer) | 2021-11-18 |
| ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 65.7 | | | | | | | 11.1 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
| | | 61.48 | 73.42 | 52.12 | 29.35 | 12.88 | 48.74 | 22.06 | 9.69 | Neural Baby Talk + CBS | |
| | | 61.48 | 72.49 | 52.88 | 33.22 | 17.75 | 50.4 | 23.89 | 10.96 | 7_10-7_40000_predict_test.json | |
| | | 55.97 | 71.69 | 52.04 | 31.7 | 16.73 | 49.64 | 22.53 | 10.1 | None | |
| | | 54.25 | 74.0 | 55.11 | 35.23 | 19.16 | 50.92 | 22.96 | 10.14 | nocaps_training | |
| | | 54.25 | 74.0 | 55.11 | 35.23 | 19.16 | 50.92 | 22.96 | 10.14 | UpDown | |
| | | 53.36 | 72.33 | 52.42 | 30.83 | 14.73 | 48.87 | 21.52 | 9.15 | Neural Baby Talk | |
| | | 49.02 | 72.78 | 52.52 | 31.74 | 16.31 | 49.38 | 21.72 | 9.54 | YX | |
| | | 48.29 | 72.02 | 51.97 | 31.62 | 16.48 | 49.03 | 21.87 | 9.56 | area_attention | |
| | | 47.69 | 73.04 | 54.08 | 33.88 | 17.69 | 49.97 | 21.85 | 9.42 | B2 | |
| | | 46.18 | 67.85 | 47.37 | 25.76 | 11.96 | 46.61 | 19.84 | 8.35 | Yu-Wu | |
| | | 45.27 | 69.44 | 48.95 | 28.64 | 15.02 | 47.6 | 20.77 | 9.13 | coco_all_19 | |
| | | 39.33 | 69.07 | 47.65 | 25.5 | 11.72 | 46.58 | 19.61 | 8.2 | CS395T | |