| Paper | Code | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | SPICE | Model | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 126.8 | | | | | | | | Lyrics | 2023-12-08 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 123.39 | 88.1 | 74.81 | 57.68 | 37.35 | 63.12 | 32.5 | 15.94 | GIT, Single Model | 2022-05-27 |
| | | 120.55 | 87.01 | 73.71 | 56.88 | 37.71 | 62.52 | 32.29 | 15.47 | CoCa - Google Brain | |
| VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 114.25 | 85.62 | 71.36 | 53.62 | 34.65 | 61.2 | 31.27 | 14.85 | Microsoft Cognitive Services team | 2020-09-28 |
| Prismer: A Vision-Language Model with Multi-Task Experts | ✓ Link | 110.84 | 84.87 | 69.99 | 52.48 | 33.66 | 60.55 | 31.13 | 14.91 | Prismer | 2023-03-04 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 110.31 | 83.78 | 68.86 | 51.06 | 32.2 | 59.86 | 30.55 | 14.49 | Single Model | 2021-08-24 |
| | | 108.29 | 83.9 | 68.77 | 50.84 | 32.17 | 59.82 | 30.64 | 14.72 | FudanFVL | |
| | | 106.81 | 82.95 | 67.45 | 49.58 | 31.38 | 59.18 | 30.32 | 14.56 | FudanWYZ | |
| | | 98.08 | 83.25 | 67.3 | 48.41 | 29.27 | 58.56 | 28.92 | 13.9 | IEDA-LAB | |
| | | 97.61 | 80.77 | 65.55 | 48.14 | 30.2 | 58.25 | 30.07 | 14.74 | firethehole | |
| | | 93.45 | 81.61 | 65.1 | 46.13 | 27.32 | 57.4 | 28.46 | 14.06 | vll@mk514 | |
| | | 93.0 | 82.43 | 66.25 | 47.18 | 28.2 | 57.57 | 28.09 | 13.35 | MD | |
| VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 92.46 | 81.59 | 65.15 | 45.04 | 26.15 | 56.96 | 27.57 | 13.07 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
| | | 87.56 | 81.03 | 64.62 | 45.26 | 26.52 | 56.7 | 27.36 | 12.81 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | |
| | | 87.34 | 79.0 | 61.95 | 42.36 | 24.62 | 55.03 | 26.29 | 12.01 | icgp2ssi1_coco_si_0.02_5_test | |
| | | 86.0 | 78.92 | 61.6 | 41.52 | 23.52 | 54.75 | 26.31 | 12.1 | evertyhing | |
| | | 85.34 | 76.64 | 56.46 | 36.37 | 19.48 | 52.83 | 28.15 | 14.67 | Human | |
| | | 82.88 | 78.19 | 60.74 | 39.11 | 21.24 | 53.85 | 25.72 | 12.2 | RCAL | |
| | | 80.93 | 79.57 | 60.83 | 38.83 | 21.02 | 54.07 | 25.33 | 11.29 | Oscar | |
| | | 79.04 | 79.32 | 60.95 | 39.5 | 20.3 | 53.8 | 25.44 | 11.9 | vinvl_yuan_cbs | |
| | | 78.48 | 78.75 | 59.36 | 37.56 | 19.72 | 52.54 | 25.13 | 11.57 | cxy_nocaps_training | |
| | | 78.23 | 78.58 | 59.05 | 37.39 | 19.43 | 52.35 | 25.12 | 11.62 | Xinyi | |
| | | 75.88 | 77.97 | 60.27 | 40.68 | 23.48 | 54.3 | 26.15 | 11.89 | camel XE | |
| | | 75.58 | 76.89 | 57.76 | 36.93 | 20.11 | 52.53 | 25.18 | 11.68 | MQ-UpDown-C | |
| | | 73.09 | 76.59 | 56.74 | 35.39 | 18.41 | 51.82 | 24.42 | 11.2 | UpDown + ELMo + CBS | |
| ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 65.83 | | | | | | | 10.86 | ClipCap (Transformer) | 2021-11-18 |
| ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 65.7 | | | | | | | 11.1 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
| | | 61.48 | 73.42 | 52.12 | 29.35 | 12.88 | 48.74 | 22.06 | 9.69 | Neural Baby Talk + CBS | |
| | | 61.48 | 72.49 | 52.88 | 33.22 | 17.75 | 50.4 | 23.89 | 10.96 | 7_10-7_40000_predict_test.json | |
| | | 55.97 | 71.69 | 52.04 | 31.7 | 16.73 | 49.64 | 22.53 | 10.1 | None | |
| | | 54.25 | 74.0 | 55.11 | 35.23 | 19.16 | 50.92 | 22.96 | 10.14 | nocaps_training | |
| | | 54.25 | 74.0 | 55.11 | 35.23 | 19.16 | 50.92 | 22.96 | 10.14 | UpDown | |
| | | 53.36 | 72.33 | 52.42 | 30.83 | 14.73 | 48.87 | 21.52 | 9.15 | Neural Baby Talk | |
| | | 49.02 | 72.78 | 52.52 | 31.74 | 16.31 | 49.38 | 21.72 | 9.54 | YX | |
| | | 48.29 | 72.02 | 51.97 | 31.62 | 16.48 | 49.03 | 21.87 | 9.56 | area_attention | |
| | | 47.69 | 73.04 | 54.08 | 33.88 | 17.69 | 49.97 | 21.85 | 9.42 | B2 | |
| | | 46.18 | 67.85 | 47.37 | 25.76 | 11.96 | 46.61 | 19.84 | 8.35 | Yu-Wu | |
| | | 45.27 | 69.44 | 48.95 | 28.64 | 15.02 | 47.6 | 20.77 | 9.13 | coco_all_19 | |
| | | 39.33 | 69.07 | 47.65 | 25.5 | 11.72 | 46.58 | 19.61 | 8.2 | CS395T | |