Image Captioning on nocaps (near-domain)

Image Captioning


Leaderboard

Paper	Code	CIDEr	BLEU-1	BLEU-2	BLEU-3	BLEU-4	ROUGE-L	METEOR	SPICE	Model name	Release date
GIT: A Generative Image-to-text Transformer for Vision and Language	✓ Link	125.51	88.9	75.86	58.9	38.95	63.66	32.95	16.11	GIT2, Single Model	2022-05-27
PaLI: A Jointly-Scaled Multilingual Language-Image Model	✓ Link	124.35	88.57	75.56	58.99	39.98	63.99	33.47	15.75	PaLI	2022-09-14
GIT: A Generative Image-to-text Transformer for Vision and Language	✓ Link	123.92	88.56	75.48	58.46	38.44	63.5	32.86	15.96	GIT, Single Model	2022-05-27
		120.73	87.53	74.49	57.89	38.92	62.91	32.71	15.54	CoCa - Google Brain
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning		115.54	86.48	72.6	55.26	36.31	61.9	31.8	15.06	Microsoft Cognitive Services team	2020-09-28
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision	✓ Link	110.76	84.36	69.83	52.42	33.74	60.46	30.97	14.61	Single Model	2021-08-24
		109.33	84.47	69.66	51.95	33.46	60.34	31.08	14.79	FudanFVL
		108.04	83.71	68.56	50.9	32.72	59.8	30.79	14.71	FudanWYZ
		100.15	84.04	68.58	49.98	30.78	59.23	29.53	14.15	IEDA-LAB
		99.51	81.62	66.65	49.39	31.42	58.83	30.48	14.88	firethehole
		95.73	83.58	67.99	49.29	29.96	58.47	28.84	13.64	MD
		95.69	82.55	66.55	47.8	29.0	58.22	29.11	14.37	vll@mk514
VinVL: Revisiting Visual Representations in Vision-Language Models	✓ Link	95.16	82.77	66.94	47.02	27.97	57.95	28.24	13.36	VinVL (Microsoft Cognitive Services + MSR)	2021-01-02
		89.87	81.93	65.88	46.72	27.94	57.34	27.89	12.98	ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS
		87.41	79.61	63.01	43.59	25.85	55.63	26.63	12.11	icgp2ssi1_coco_si_0.02_5_test
		85.89	79.67	62.73	42.87	24.8	55.37	26.68	12.24	evertyhing
		84.58	77.05	56.97	36.84	19.85	53.06	28.42	14.72	Human
		84.0	79.21	62.26	40.77	22.56	54.62	26.3	12.47	RCAL
		82.07	80.54	62.32	40.65	22.37	54.78	25.91	11.53	Oscar
		80.21	80.24	62.31	41.07	21.53	54.52	25.98	12.12	vinvl_yuan_cbs
		79.72	79.69	60.75	39.06	20.97	53.37	25.64	11.81	cxy_nocaps_training
		79.44	79.59	60.52	38.95	20.72	53.18	25.64	11.88	Xinyi
		79.14	79.21	62.06	42.51	25.06	55.24	26.87	12.14	camel XE
		76.34	77.76	59.0	38.29	21.0	53.15	25.59	11.87	MQ-UpDown-C
		74.2	77.68	58.31	37.04	19.85	52.64	24.97	11.45	UpDown + ELMo + CBS
ClipCap: CLIP Prefix for Image Captioning	✓ Link	67.69							11.26	ClipCap (MLP + GPT2 tuning)	2021-11-18
ClipCap: CLIP Prefix for Image Captioning	✓ Link	66.82							10.92	ClipCap (Transformer)	2021-11-18
		63.96	73.6	54.26	34.59	18.95	51.23	24.52	11.14	7_10-7_40000_predict_test.json
		61.98	74.77	53.67	30.66	13.85	49.45	22.55	9.83	Neural Baby Talk + CBS
		58.5	72.91	53.74	33.49	18.04	50.53	23.12	10.28	None
		56.85	75.25	56.93	36.91	20.49	51.84	23.6	10.33	nocaps_training
		56.85	75.25	56.93	36.91	20.49	51.84	23.6	10.33	UpDown
		53.21	73.69	54.1	32.37	15.99	49.63	21.93	9.26	Neural Baby Talk
		51.16	73.73	53.98	33.1	17.28	50.0	22.27	9.7	YX
		50.34	73.19	53.56	32.94	17.49	49.79	22.43	9.7	area_attention
		49.62	74.07	55.53	35.22	18.79	50.77	22.41	9.54	B2
		47.53	70.84	50.79	30.26	16.14	48.61	21.48	9.28	coco_all_19
		46.64	68.86	48.7	26.85	12.6	47.13	20.18	8.37	Yu-Wu
		40.45	70.05	48.92	26.19	12.11	47.04	20.05	8.28	CS395T
PaLI: A Jointly-Scaled Multilingual Language-Image Model	✓ Link								15.75	PaLI	2022-09-14

