OpenCodePapers

Image Captioning on nocaps (out-of-domain)

Image Captioning
Leaderboard
| Paper | Code | CIDEr | B1 | B2 | B3 | B4 | ROUGE-L | METEOR | SPICE | Model | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ | 126.67 | 86.28 | 71.19 | 52.63 | 32.0 | 61.35 | 30.99 | 15.49 | PaLI | 2022-09-14 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ | 122.27 | 86.28 | 71.15 | 52.36 | 30.15 | 60.91 | 30.15 | 15.62 | GIT2, Single Model | 2022-05-27 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ | 122.04 | 85.99 | 71.28 | 52.66 | 30.04 | 60.96 | 30.45 | 15.7 | GIT, Single Model | 2022-05-27 |
| – | | 121.69 | 84.75 | 70.24 | 52.13 | 31.89 | 60.57 | 30.18 | 15.13 | CoCa - Google Brain | – |
| VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 110.14 | 81.73 | 65.48 | 45.58 | 25.78 | 57.57 | 28.17 | 13.74 | Microsoft Cognitive Services team | 2020-09-28 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ | 109.49 | 80.89 | 64.21 | 44.38 | 24.47 | 56.69 | 27.91 | 13.89 | Single Model | 2021-08-24 |
| – | | 106.55 | 81.44 | 64.71 | 45.26 | 25.31 | 57.29 | 28.13 | 14.21 | FudanFVL | – |
| – | | 103.75 | 80.0 | 62.7 | 43.58 | 24.57 | 56.41 | 27.75 | 13.75 | FudanWYZ | – |
| – | | 91.62 | 74.84 | 53.9 | 33.51 | 16.6 | 51.5 | 26.83 | 14.21 | Human | – |
| – | | 88.54 | 76.65 | 60.06 | 41.58 | 22.66 | 55.08 | 27.39 | 13.87 | firethehole | – |
| – | | 87.51 | 79.52 | 61.01 | 40.14 | 20.64 | 55.0 | 25.55 | 12.52 | IEDA-LAB | – |
| – | | 87.15 | 75.71 | 56.39 | 35.94 | 17.96 | 51.75 | 24.01 | 11.43 | icgp2ssi1_coco_si_0.02_5_test | – |
| – | | 85.18 | 75.5 | 56.14 | 34.53 | 16.69 | 51.54 | 23.69 | 11.18 | evertyhing | – |
| – | | 78.91 | 76.41 | 56.87 | 35.99 | 16.92 | 52.51 | 24.5 | 12.14 | vll@mk514 | – |
| VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ | 78.01 | 75.78 | 56.1 | 34.02 | 15.86 | 51.99 | 23.55 | 11.48 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
| – | | 77.39 | 76.81 | 57.39 | 36.13 | 17.85 | 52.54 | 23.79 | 11.59 | MD | – |
| – | | 75.39 | 72.47 | 52.01 | 28.26 | 11.94 | 48.81 | 22.04 | 10.68 | RCAL | – |
| – | | 73.75 | 74.98 | 53.26 | 28.88 | 12.42 | 50.0 | 21.73 | 9.72 | Oscar | – |
| GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | ✓ | 72.6 | – | – | – | – | – | – | 11.1 | GRIT (zero-shot, no CBS, no VL pretraining, single model) | 2022-07-20 |
| – | | 72.13 | 76.2 | 57.25 | 36.37 | 17.68 | 52.86 | 23.88 | 11.53 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | – |
| – | | 71.43 | 73.95 | 52.76 | 29.34 | 11.69 | 49.5 | 22.18 | 10.57 | vinvl_yuan_cbs | – |
| – | | 70.21 | 72.94 | 51.36 | 28.32 | 11.99 | 48.6 | 21.73 | 10.15 | UpDown-C | – |
| – | | 68.92 | 72.53 | 49.99 | 27.18 | 10.57 | 47.23 | 21.57 | 10.05 | Xinyi | – |
| – | | 68.5 | 73.07 | 50.81 | 27.58 | 10.98 | 47.53 | 21.65 | 10.01 | cxy_nocaps_training | – |
| – | | 66.67 | 71.57 | 48.58 | 25.77 | 9.68 | 47.13 | 20.88 | 9.74 | UpDown + ELMo + CBS | – |
| – | | 58.48 | 65.98 | 43.2 | 21.16 | 7.5 | 44.47 | 19.04 | 8.77 | Neural Baby Talk + CBS | – |
| – | | 54.56 | 71.34 | 50.32 | 29.44 | 12.99 | 48.85 | 21.55 | 9.9 | camel XE | – |
| ClipCap: CLIP Prefix for Image Captioning | ✓ | 49.35 | – | – | – | – | – | – | 9.7 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
| ClipCap: CLIP Prefix for Image Captioning | ✓ | 49.1 | – | – | – | – | – | – | 9.57 | ClipCap (Transformer) | 2021-11-18 |
| – | | 48.73 | 64.45 | 42.8 | 21.48 | 7.92 | 44.11 | 18.31 | 8.2 | Neural Baby Talk | – |
| – | | 43.2 | 66.14 | 44.7 | 24.58 | 10.14 | 45.72 | 19.95 | 9.35 | 7_10-7_40000_predict_test.json | – |
| – | | 39.39 | 60.95 | 38.3 | 17.19 | 6.11 | 42.46 | 16.97 | 7.62 | Yu-Wu | – |
| – | | 36.12 | 47.08 | 22.24 | 7.41 | 1.83 | 31.57 | 17.94 | 9.39 | Check | – |
| – | | 30.09 | 66.54 | 44.28 | 24.23 | 10.17 | 44.84 | 18.29 | 8.08 | nocaps_training | – |
| – | | 30.09 | 66.54 | 44.28 | 24.23 | 10.17 | 44.84 | 18.29 | 8.08 | UpDown | – |
| – | | 26.55 | 64.58 | 41.56 | 21.71 | 8.72 | 43.59 | 17.43 | 7.72 | area_attention | – |
| – | | 26.25 | 66.44 | 42.47 | 21.15 | 8.54 | 44.23 | 17.2 | 7.52 | YX | – |
| – | | 25.91 | 66.32 | 44.27 | 23.82 | 9.46 | 44.37 | 17.48 | 7.61 | B2 | – |
| – | | 23.07 | 61.62 | 38.55 | 18.45 | 7.55 | 41.58 | 16.07 | 7.4 | coco_all_19 | – |
| – | | 21.3 | 63.0 | 39.71 | 19.99 | 8.2 | 43.02 | 16.19 | 7.2 | CS395T | – |
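The B1–B4 columns above are BLEU-1 through BLEU-4, reported as percentages: clipped n-gram precision of the generated caption against the reference captions, combined with a brevity penalty. As a rough illustration of how these scores arise, here is a minimal sentence-level sketch in plain Python. This is an assumption-laden simplification, not the official scorer: the nocaps leaderboard is computed with the COCO caption evaluation toolkit, which works at corpus level with multiple references per image and its own tokenization.

```python
# Minimal sentence-level BLEU sketch (NOT the official nocaps/COCO scorer).
from collections import Counter
import math

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=4):
    """Return [BLEU-1, ..., BLEU-max_n] for one candidate caption
    against a list of reference captions (plain whitespace-split strings)."""
    cand = candidate.split()
    refs = [r.split() for r in references]

    # Modified (clipped) n-gram precision for each order n: a candidate
    # n-gram is credited at most as often as it appears in any reference.
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(clipped / total if total else 0.0)

    # Brevity penalty against the reference closest in length,
    # so very short candidates cannot win on precision alone.
    c_len = len(cand)
    r_len = min((abs(len(r) - c_len), len(r)) for r in refs)[1]
    if c_len >= r_len:
        bp = 1.0
    else:
        bp = math.exp(1.0 - r_len / c_len) if c_len else 0.0

    # BLEU-n is bp times the geometric mean of precisions 1..n.
    scores = []
    for n in range(1, max_n + 1):
        if min(precisions[:n]) > 0:
            log_mean = sum(math.log(p) for p in precisions[:n]) / n
            scores.append(bp * math.exp(log_mean))
        else:
            scores.append(0.0)
    return scores
```

Leaderboard values are these scores scaled by 100, e.g. a B1 of 86.28 corresponds to 0.8628 here; a perfect match against a reference yields 1.0 for all four orders.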