OpenCodePapers

Image Captioning on nocaps (near-domain)
Leaderboard
| Paper | Code | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | SPICE | Model | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ | 125.51 | 88.9 | 75.86 | 58.9 | 38.95 | 63.66 | 32.95 | 16.11 | GIT2, Single Model | 2022-05-27 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ | 124.35 | 88.57 | 75.56 | 58.99 | 39.98 | 63.99 | 33.47 | 15.75 | PaLI | 2022-09-14 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ | 123.92 | 88.56 | 75.48 | 58.46 | 38.44 | 63.5 | 32.86 | 15.96 | GIT, Single Model | 2022-05-27 |
| | | 120.73 | 87.53 | 74.49 | 57.89 | 38.92 | 62.91 | 32.71 | 15.54 | CoCa - Google Brain | |
| VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 115.54 | 86.48 | 72.6 | 55.26 | 36.31 | 61.9 | 31.8 | 15.06 | Microsoft Cognitive Services team | 2020-09-28 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ | 110.76 | 84.36 | 69.83 | 52.42 | 33.74 | 60.46 | 30.97 | 14.61 | Single Model | 2021-08-24 |
| | | 109.33 | 84.47 | 69.66 | 51.95 | 33.46 | 60.34 | 31.08 | 14.79 | FudanFVL | |
| | | 108.04 | 83.71 | 68.56 | 50.9 | 32.72 | 59.8 | 30.79 | 14.71 | FudanWYZ | |
| | | 100.15 | 84.04 | 68.58 | 49.98 | 30.78 | 59.23 | 29.53 | 14.15 | IEDA-LAB | |
| | | 99.51 | 81.62 | 66.65 | 49.39 | 31.42 | 58.83 | 30.48 | 14.88 | firethehole | |
| | | 95.73 | 83.58 | 67.99 | 49.29 | 29.96 | 58.47 | 28.84 | 13.64 | MD | |
| | | 95.69 | 82.55 | 66.55 | 47.8 | 29.0 | 58.22 | 29.11 | 14.37 | vll@mk514 | |
| VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ | 95.16 | 82.77 | 66.94 | 47.02 | 27.97 | 57.95 | 28.24 | 13.36 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
| | | 89.87 | 81.93 | 65.88 | 46.72 | 27.94 | 57.34 | 27.89 | 12.98 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | |
| | | 87.41 | 79.61 | 63.01 | 43.59 | 25.85 | 55.63 | 26.63 | 12.11 | icgp2ssi1_coco_si_0.02_5_test | |
| | | 85.89 | 79.67 | 62.73 | 42.87 | 24.8 | 55.37 | 26.68 | 12.24 | evertyhing | |
| | | 84.58 | 77.05 | 56.97 | 36.84 | 19.85 | 53.06 | 28.42 | 14.72 | Human | |
| | | 84.0 | 79.21 | 62.26 | 40.77 | 22.56 | 54.62 | 26.3 | 12.47 | RCAL | |
| | | 82.07 | 80.54 | 62.32 | 40.65 | 22.37 | 54.78 | 25.91 | 11.53 | Oscar | |
| | | 80.21 | 80.24 | 62.31 | 41.07 | 21.53 | 54.52 | 25.98 | 12.12 | vinvl_yuan_cbs | |
| | | 79.72 | 79.69 | 60.75 | 39.06 | 20.97 | 53.37 | 25.64 | 11.81 | cxy_nocaps_training | |
| | | 79.44 | 79.59 | 60.52 | 38.95 | 20.72 | 53.18 | 25.64 | 11.88 | Xinyi | |
| | | 79.14 | 79.21 | 62.06 | 42.51 | 25.06 | 55.24 | 26.87 | 12.14 | camel XE | |
| | | 76.34 | 77.76 | 59.0 | 38.29 | 21.0 | 53.15 | 25.59 | 11.87 | MQ-UpDown-C | |
| | | 74.2 | 77.68 | 58.31 | 37.04 | 19.85 | 52.64 | 24.97 | 11.45 | UpDown + ELMo + CBS | |
| ClipCap: CLIP Prefix for Image Captioning | ✓ | 67.69 | | | | | | | 11.26 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
| ClipCap: CLIP Prefix for Image Captioning | ✓ | 66.82 | | | | | | | 10.92 | ClipCap (Transformer) | 2021-11-18 |
| | | 63.96 | 73.6 | 54.26 | 34.59 | 18.95 | 51.23 | 24.52 | 11.14 | 7_10-7_40000_predict_test.json | |
| | | 61.98 | 74.77 | 53.67 | 30.66 | 13.85 | 49.45 | 22.55 | 9.83 | Neural Baby Talk + CBS | |
| | | 58.5 | 72.91 | 53.74 | 33.49 | 18.04 | 50.53 | 23.12 | 10.28 | None | |
| | | 56.85 | 75.25 | 56.93 | 36.91 | 20.49 | 51.84 | 23.6 | 10.33 | nocaps_training | |
| | | 56.85 | 75.25 | 56.93 | 36.91 | 20.49 | 51.84 | 23.6 | 10.33 | UpDown | |
| | | 53.21 | 73.69 | 54.1 | 32.37 | 15.99 | 49.63 | 21.93 | 9.26 | Neural Baby Talk | |
| | | 51.16 | 73.73 | 53.98 | 33.1 | 17.28 | 50.0 | 22.27 | 9.7 | YX | |
| | | 50.34 | 73.19 | 53.56 | 32.94 | 17.49 | 49.79 | 22.43 | 9.7 | area_attention | |
| | | 49.62 | 74.07 | 55.53 | 35.22 | 18.79 | 50.77 | 22.41 | 9.54 | B2 | |
| | | 47.53 | 70.84 | 50.79 | 30.26 | 16.14 | 48.61 | 21.48 | 9.28 | coco_all_19 | |
| | | 46.64 | 68.86 | 48.7 | 26.85 | 12.6 | 47.13 | 20.18 | 8.37 | Yu-Wu | |
| | | 40.45 | 70.05 | 48.92 | 26.19 | 12.11 | 47.04 | 20.05 | 8.28 | CS395T | |
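The metric columns follow the standard COCO-style caption evaluation, with scores scaled by 100. Below is a minimal sketch of how such scores are typically computed with the `pycocoevalcap` toolkit; the image ids and captions are made-up toy data, and METEOR/SPICE are omitted because they additionally require a Java runtime. This is not the official nocaps evaluation-server pipeline.

```python
# Minimal sketch: corpus-level caption scoring with pycocoevalcap
# (pip install pycocoevalcap). Toy data only.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map an image id to a list of tokenized caption strings:
# several reference captions per image, one candidate per image.
references = {
    "img1": ["a dog runs across the grass", "a brown dog running outside"],
    "img2": ["a plate of pasta on a table", "pasta served with a red sauce"],
}
candidates = {
    "img1": ["a dog running on the grass"],
    "img2": ["a plate of pasta"],
}

scorers = [
    ("BLEU-1..4", Bleu(4)),  # returns one score per n-gram order
    ("ROUGE-L", Rouge()),
    ("CIDEr", Cider()),
]
for name, scorer in scorers:
    # compute_score returns (corpus-level score, per-image scores).
    score, _ = scorer.compute_score(references, candidates)
    print(name, score)
```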