Paper | Code | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | SPICE | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 126.67 | 86.28 | 71.19 | 52.63 | 32.0 | 61.35 | 30.99 | 15.49 | PaLI | 2022-09-14 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 122.27 | 86.28 | 71.15 | 52.36 | 30.15 | 60.91 | 30.15 | 15.62 | GIT2, Single Model | 2022-05-27 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 122.04 | 85.99 | 71.28 | 52.66 | 30.04 | 60.96 | 30.45 | 15.7 | GIT, Single Model | 2022-05-27 |
- | | 121.69 | 84.75 | 70.24 | 52.13 | 31.89 | 60.57 | 30.18 | 15.13 | CoCa - Google Brain | |
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | | 110.14 | 81.73 | 65.48 | 45.58 | 25.78 | 57.57 | 28.17 | 13.74 | Microsoft Cognitive Services team | 2020-09-28 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 109.49 | 80.89 | 64.21 | 44.38 | 24.47 | 56.69 | 27.91 | 13.89 | Single Model | 2021-08-24 |
- | | 106.55 | 81.44 | 64.71 | 45.26 | 25.31 | 57.29 | 28.13 | 14.21 | FudanFVL | |
- | | 103.75 | 80.0 | 62.7 | 43.58 | 24.57 | 56.41 | 27.75 | 13.75 | FudanWYZ | |
- | | 91.62 | 74.84 | 53.9 | 33.51 | 16.6 | 51.5 | 26.83 | 14.21 | Human | |
- | | 88.54 | 76.65 | 60.06 | 41.58 | 22.66 | 55.08 | 27.39 | 13.87 | firethehole | |
- | | 87.51 | 79.52 | 61.01 | 40.14 | 20.64 | 55.0 | 25.55 | 12.52 | IEDA-LAB | |
- | | 87.15 | 75.71 | 56.39 | 35.94 | 17.96 | 51.75 | 24.01 | 11.43 | icgp2ssi1_coco_si_0.02_5_test | |
- | | 85.18 | 75.5 | 56.14 | 34.53 | 16.69 | 51.54 | 23.69 | 11.18 | evertyhing | |
- | | 78.91 | 76.41 | 56.87 | 35.99 | 16.92 | 52.51 | 24.5 | 12.14 | vll@mk514 | |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 78.01 | 75.78 | 56.1 | 34.02 | 15.86 | 51.99 | 23.55 | 11.48 | VinVL (Microsoft Cognitive Services + MSR) | 2021-01-02 |
- | | 77.39 | 76.81 | 57.39 | 36.13 | 17.85 | 52.54 | 23.79 | 11.59 | MD | |
- | | 75.39 | 72.47 | 52.01 | 28.26 | 11.94 | 48.81 | 22.04 | 10.68 | RCAL | |
- | | 73.75 | 74.98 | 53.26 | 28.88 | 12.42 | 50.0 | 21.73 | 9.72 | Oscar | |
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | ✓ Link | 72.6 | | | | | | | 11.1 | GRIT (zero-shot, no CBS, no VL pretraining, single model) | 2022-07-20 |
- | | 72.13 | 76.2 | 57.25 | 36.37 | 17.68 | 52.86 | 23.88 | 11.53 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | |
- | | 71.43 | 73.95 | 52.76 | 29.34 | 11.69 | 49.5 | 22.18 | 10.57 | vinvl_yuan_cbs | |
- | | 70.21 | 72.94 | 51.36 | 28.32 | 11.99 | 48.6 | 21.73 | 10.15 | UpDown-C | |
- | | 68.92 | 72.53 | 49.99 | 27.18 | 10.57 | 47.23 | 21.57 | 10.05 | Xinyi | |
- | | 68.5 | 73.07 | 50.81 | 27.58 | 10.98 | 47.53 | 21.65 | 10.01 | cxy_nocaps_training | |
- | | 66.67 | 71.57 | 48.58 | 25.77 | 9.68 | 47.13 | 20.88 | 9.74 | UpDown + ELMo + CBS | |
- | | 58.48 | 65.98 | 43.2 | 21.16 | 7.5 | 44.47 | 19.04 | 8.77 | Neural Baby Talk + CBS | |
- | | 54.56 | 71.34 | 50.32 | 29.44 | 12.99 | 48.85 | 21.55 | 9.9 | camel XE | |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 49.35 | | | | | | | 9.7 | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 49.14 | | | | | | | 9.57 | ClipCap (Transformer) | 2021-11-18 |
- | | 48.73 | 64.45 | 42.8 | 21.48 | 7.92 | 44.11 | 18.31 | 8.2 | Neural Baby Talk | |
- | | 43.2 | 66.14 | 44.7 | 24.58 | 10.14 | 45.72 | 19.95 | 9.35 | 7_10-7_40000_predict_test.json | |
- | | 39.39 | 60.95 | 38.3 | 17.19 | 6.11 | 42.46 | 16.97 | 7.62 | Yu-Wu | |
- | | 36.12 | 47.08 | 22.24 | 7.41 | 1.83 | 31.57 | 17.94 | 9.39 | Check | |
- | | 30.09 | 66.54 | 44.28 | 24.23 | 10.17 | 44.84 | 18.29 | 8.08 | nocaps_training | |
- | | 30.09 | 66.54 | 44.28 | 24.23 | 10.17 | 44.84 | 18.29 | 8.08 | UpDown | |
- | | 26.55 | 64.58 | 41.56 | 21.71 | 8.72 | 43.59 | 17.43 | 7.72 | area_attention | |
- | | 26.25 | 66.44 | 42.47 | 21.15 | 8.54 | 44.23 | 17.2 | 7.52 | YX | |
- | | 25.91 | 66.32 | 44.27 | 23.82 | 9.46 | 44.37 | 17.48 | 7.61 | B2 | |
- | | 23.07 | 61.62 | 38.55 | 18.45 | 7.55 | 41.58 | 16.07 | 7.4 | coco_all_19 | |
- | | 21.3 | 63.0 | 39.71 | 19.99 | 8.2 | 43.02 | 16.19 | 7.2 | CS395T | |
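
The metric columns follow the standard COCO caption-evaluation protocol (BLEU-1 through BLEU-4, ROUGE-L, METEOR, CIDEr, and SPICE), with scores reported on a 0-100 scale. The snippet below is a minimal sketch of how such scores are typically computed with the `pycocoevalcap` toolkit; the image ids and captions are illustrative placeholders, not entries from the table above.

```python
# Minimal sketch, assuming the `pycocoevalcap` package is installed.
# All ids and captions here are made-up placeholders.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Reference captions: image id -> list of ground-truth captions
# (the official pipeline first lowercases/tokenizes with PTBTokenizer).
gts = {
    "img_0": ["a dog runs across a grassy field",
              "a brown dog running on the grass"],
    "img_1": ["a plate of food on a wooden table",
              "food served on a white plate"],
}
# System output: image id -> exactly one generated caption per image.
res = {
    "img_0": ["a dog running through a field"],
    "img_1": ["a plate of food on a table"],
}

bleu, _ = Bleu(4).compute_score(gts, res)    # list: [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr
rouge, _ = Rouge().compute_score(gts, res)   # corpus-level ROUGE-L

print("BLEU-1..4:", [round(100 * b, 2) for b in bleu])
print("CIDEr:    ", round(100 * cider, 2))
print("ROUGE-L:  ", round(100 * rouge, 2))
# METEOR and SPICE expose the same compute_score(gts, res) interface
# (pycocoevalcap.meteor.meteor.Meteor, pycocoevalcap.spice.spice.Spice)
# but require a Java runtime, so they are left out of this sketch.
```

Each scorer takes two dicts keyed by image id, the references (one or more captions per image) and the system output (exactly one caption per image), and returns a corpus-level score together with per-image scores.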