OpenCodePapers

Video Captioning on YouCook2

Task: Video Captioning
Dataset: YouCook2
Results over time: interactive chart of metric scores by release date (omitted).
Leaderboard
| Paper | Code | BLEU-4 | BLEU-3 | CIDEr | ROUGE-L | METEOR | Model | Release Date |
|-------|------|--------|--------|-------|---------|--------|-------|--------------|
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | 18.2 | | 1.99 | | | VAST | 2023-05-29 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ | 17.92 | 24.12 | 1.90 | 47.04 | 22.56 | UniVL + MELTR | 2023-03-23 |
| UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | ✓ | 17.35 | 23.87 | 1.81 | 46.52 | 22.35 | UniVL | 2020-02-15 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 14.2 | | 1.28 | 37.7 | | VideoCoCa | 2022-12-09 |
| VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | ✓ | 12.27 | 17.78 | 1.3869 | 41.51 | 18.22 | VLM | 2021-05-20 |
| Multimodal Pretraining for Dense Video Captioning | ✓ | 12.04 | | 1.22 | 39.03 | 18.32 | E2vidD6-MASSvid-BiD | 2020-11-10 |
| Text with Knowledge Graph Augmented Transformer for Video Captioning | | 11.7 | | 1.33 | 40.2 | 14.8 | TextKG | 2023-03-22 |
| COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | ✓ | 11.30 | 17.97 | 0.57 | 37.94 | 19.85 | COOT | 2020-11-01 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ | 10.1 | | 1.31 | | | COSA | 2023-06-15 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 8.8 | | 116.4 | 37.3 | 15.9 | HowToCaption | 2023-10-07 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 8.72 | 12.87 | 1.16 | 36.09 | 14.83 | OmniVL | 2022-09-15 |
| End-to-End Dense Video Captioning with Masked Transformer | ✓ | 4.38 | 7.53 | 0.38 | 27.44 | 11.55 | Zhou | 2018-04-03 |
| VideoBERT: A Joint Model for Video and Language Representation Learning | ✓ | 4.33 | 7.59 | 0.55 | 28.80 | 11.94 | VideoBERT + S3D | 2019-04-03 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ | | | 1.31 | | 17.6 | MA-LMM | 2024-04-08 |

A ✓ in the Code column marks entries with a public code release; empty cells are metrics the paper does not report. CIDEr values are transcribed as listed at the source, and their scales vary across entries (most report the raw score, while HowToCaption reports it multiplied by 100).
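For readers checking these numbers: BLEU-3 and BLEU-4 are n-gram precision scores over generated captions, and most papers above compute them with the pycocoevalcap toolkit at corpus level. The snippet below is only a minimal sketch using NLTK's smoothed sentence-level BLEU, with hypothetical captions, so its outputs will not match the leaderboard scores exactly.

```python
# Minimal sketch of BLEU-3/BLEU-4 scoring for a single generated caption.
# Assumption: NLTK stands in for the pycocoevalcap toolkit most papers use,
# and the reference/hypothesis captions below are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "add the chopped onions to the pan".split()
hypothesis = "add chopped onions into the pan".split()

# Smoothing avoids zero scores when a higher-order n-gram never matches,
# which is common for short cooking-step captions.
smooth = SmoothingFunction().method1

# BLEU-4: geometric mean of 1- to 4-gram precisions.
bleu4 = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)

# BLEU-3: same idea, restricted to 1- to 3-grams.
bleu3 = sentence_bleu([reference], hypothesis,
                      weights=(1/3, 1/3, 1/3),
                      smoothing_function=smooth)

print(f"BLEU-4: {bleu4:.4f}  BLEU-3: {bleu3:.4f}")
```

CIDEr, ROUGE-L, and METEOR each require their own scorers (pycocoevalcap bundles all of them), and the leaderboard entries are corpus-level averages over the YouCook2 validation set rather than single-sentence scores.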