OpenCodePapers

Image Captioning on nocaps (entire)

Image Captioning
Results over time (interactive chart)
Leaderboard
| Paper | Code | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | SPICE | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 126.8 | | | | | | | | Lyrics | 2023-12-08 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ | 123.39 | 88.1 | 74.81 | 57.68 | 37.35 | 63.12 | 32.5 | 15.94 | GIT, Single Model | 2022-05-27 |
| | | 120.55 | 87.01 | 73.71 | 56.88 | 37.71 | 62.52 | 32.29 | 15.47 | CoCa - Google Brain | |
| VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 114.25 | 85.62 | 71.36 | 53.62 | 34.65 | 61.2 | 31.27 | 14.85 | Microsoft Cognitive Services team | 2020-09-28 |
| Prismer: A Vision-Language Model with Multi-Task Experts | ✓ | 110.84 | 84.87 | 69.99 | 52.48 | 33.66 | 60.55 | 31.13 | 14.91 | Prismer | 2023-03-04 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ | 110.31 | 83.78 | 68.86 | 51.06 | 32.2 | 59.86 | 30.55 | 14.49 | Single Model | 2021-08-24 |
| | | 108.29 | 83.9 | 68.77 | 50.84 | 32.17 | 59.82 | 30.64 | 14.72 | FudanFVL | |
| | | 106.81 | 82.95 | 67.45 | 49.58 | 31.38 | 59.18 | 30.32 | 14.56 | FudanWYZ | |
| | | 98.08 | 83.25 | 67.3 | 48.41 | 29.27 | 58.56 | 28.92 | 13.9 | IEDA-LAB | |
| | | 97.61 | 80.77 | 65.55 | 48.14 | 30.2 | 58.25 | 30.07 | 14.74 | firethehole | |
| | | 93.45 | 81.61 | 65.1 | 46.13 | 27.32 | 57.4 | 28.46 | 14.06 | vll@mk514 | |
| | | 93.0 | 82.43 | 66.25 | 47.18 | 28.2 | 57.57 | 28.09 | 13.35 | MD | |
| VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ | 92.46 | 81.59 | 65.15 | 45.04 | 26.15 | 56.96 | 27.57 | 13.07 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
| | | 87.56 | 81.03 | 64.62 | 45.26 | 26.52 | 56.7 | 27.36 | 12.81 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | |
| | | 87.34 | 79.0 | 61.95 | 42.36 | 24.62 | 55.03 | 26.29 | 12.01 | icgp2ssi1_coco_si_0.02_5_test | |
| | | 86.0 | 78.92 | 61.6 | 41.52 | 23.52 | 54.75 | 26.31 | 12.1 | evertyhing | |
| | | 85.34 | 76.64 | 56.46 | 36.37 | 19.48 | 52.83 | 28.15 | 14.67 | Human | |
| | | 82.88 | 78.19 | 60.74 | 39.11 | 21.24 | 53.85 | 25.72 | 12.2 | RCAL | |
| | | 80.93 | 79.57 | 60.83 | 38.83 | 21.02 | 54.07 | 25.33 | 11.29 | Oscar | |
| | | 79.04 | 79.32 | 60.95 | 39.5 | 20.3 | 53.8 | 25.44 | 11.9 | vinvl_yuan_cbs | |
| | | 78.48 | 78.75 | 59.36 | 37.56 | 19.72 | 52.54 | 25.13 | 11.57 | cxy_nocaps_training | |
| | | 78.23 | 78.58 | 59.05 | 37.39 | 19.43 | 52.35 | 25.12 | 11.62 | Xinyi | |
| | | 75.88 | 77.97 | 60.27 | 40.68 | 23.48 | 54.3 | 26.15 | 11.89 | camel XE | |
| | | 75.58 | 76.89 | 57.76 | 36.93 | 20.11 | 52.53 | 25.18 | 11.68 | MQ-UpDown-C | |
| | | 73.09 | 76.59 | 56.74 | 35.39 | 18.41 | 51.82 | 24.42 | 11.2 | UpDown + ELMo + CBS | |
| ClipCap: CLIP Prefix for Image Captioning | ✓ | 65.83 | | | | | | | 10.86 | ClipCap (Transformer) | 2021-11-18 |
| ClipCap: CLIP Prefix for Image Captioning | ✓ | 65.7 | | | | | | | 11.1 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
| | | 61.48 | 73.42 | 52.12 | 29.35 | 12.88 | 48.74 | 22.06 | 9.69 | Neural Baby Talk + CBS | |
| | | 61.48 | 72.49 | 52.88 | 33.22 | 17.75 | 50.4 | 23.89 | 10.96 | 7_10-7_40000_predict_test.json | |
| | | 55.97 | 71.69 | 52.04 | 31.7 | 16.73 | 49.64 | 22.53 | 10.1 | None | |
| | | 54.25 | 74.0 | 55.11 | 35.23 | 19.16 | 50.92 | 22.96 | 10.14 | nocaps_training | |
| | | 54.25 | 74.0 | 55.11 | 35.23 | 19.16 | 50.92 | 22.96 | 10.14 | UpDown | |
| | | 53.36 | 72.33 | 52.42 | 30.83 | 14.73 | 48.87 | 21.52 | 9.15 | Neural Baby Talk | |
| | | 49.02 | 72.78 | 52.52 | 31.74 | 16.31 | 49.38 | 21.72 | 9.54 | YX | |
| | | 48.29 | 72.02 | 51.97 | 31.62 | 16.48 | 49.03 | 21.87 | 9.56 | area_attention | |
| | | 47.69 | 73.04 | 54.08 | 33.88 | 17.69 | 49.97 | 21.85 | 9.42 | B2 | |
| | | 46.18 | 67.85 | 47.37 | 25.76 | 11.96 | 46.61 | 19.84 | 8.35 | Yu-Wu | |
| | | 45.27 | 69.44 | 48.95 | 28.64 | 15.02 | 47.6 | 20.77 | 9.13 | coco_all_19 | |
| | | 39.33 | 69.07 | 47.65 | 25.5 | 11.72 | 46.58 | 19.61 | 8.2 | CS395T | |
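The columns above are the standard COCO-style caption metrics, reported ×100. Official nocaps test scores like these come from the benchmark's evaluation server, but for a local sanity check the same metrics can be computed with the `pycocoevalcap` package. A minimal sketch, assuming that package is installed (the `gts`/`res` dictionaries and their toy captions below are hypothetical placeholders, not nocaps data):

```python
# Minimal sketch: COCO-style caption metrics (BLEU-1..4, METEOR, ROUGE-L,
# CIDEr, SPICE) via pycocoevalcap. The tokenizer, METEOR, and SPICE shell
# out to Java, so a JRE must be available. Captions here are toy examples.
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# image_id -> list of {"caption": ...}: references (gts) and predictions (res)
gts = {"0": [{"caption": "a dog runs across a sandy beach"}],
       "1": [{"caption": "two birds sit on a wire"}]}
res = {"0": [{"caption": "a dog running on the beach"}],
       "1": [{"caption": "birds perched on a power line"}]}

tokenizer = PTBTokenizer()
gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)

scorers = [(Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
           (Meteor(), "METEOR"), (Rouge(), "ROUGE-L"),
           (Cider(), "CIDEr"), (Spice(), "SPICE")]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)  # corpus-level score(s)
    if isinstance(name, list):                 # Bleu returns one score per n
        for n, s in zip(name, score):
            print(f"{n}: {100 * s:.2f}")       # scale to match the table
    else:
        print(f"{name}: {100 * score:.2f}")
```

Scores from `compute_score` fall in [0, 1], hence the ×100 scaling to match the leaderboard convention.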