mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 80.0 | 34.9 | 70.1 | 57.8 | | mPLUG-2 | 2023-02-01 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 78.0 | | | 56.7 | | VAST | 2023-05-29 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 75.9 | 33.1 | 68.2 | 54.8 | 201.6 | GIT2 | 2022-05-27 |
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 74.9 | 33.4 | 68.3 | 54.6 | | VLAB | 2023-05-22 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 74.7 | | | 53.7 | | COSA | 2023-06-15 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 74.0 | 32.9 | 68.0 | 54.4 | | VALOR | 2023-04-17 |
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ Link | 73.6 | | | | | MaMMUT | 2023-03-29 |
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 73.2 | | 68.0 | 53.8 | | VideoCoCa | 2022-12-09 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 69.3 | | 66.1 | 49.6 | | RTQ | 2023-12-01 |
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ Link | 65.3 | 32.2 | 66.3 | 49.8 | | HowToCaption | 2023-10-07 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 65.1 | 30.7 | 65.0 | 49.2 | | HiTeA | 2022-12-30 |
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | ✓ Link | 64.6 | 30.8 | | | | Vid2Seq | 2023-02-27 |
Text with Knowledge Graph Augmented Transformer for Video Captioning | | 60.8 | 30.5 | 64.8 | 46.6 | | TextKG | 2023-03-22 |
IcoCap: Improving Video Captioning by Compounding Images | | 60.2 | 31.1 | 64.9 | 47.0 | | IcoCap (ViT-B/16) | 2023-10-05 |
End-to-end Generative Pretraining for Multimodal Video Captioning | | 60.0 | 38.7 | 64.0 | 48.9 | | MV-GPT | 2022-01-20 |
IcoCap: Improving Video Captioning by Compounding Images | | 59.1 | 30.3 | 64.3 | 46.1 | | IcoCap (ViT-B/32) | 2023-10-05 |
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter | ✓ Link | 58.7 | 31.3 | 64.8 | 48.2 | | CLIP-DCD | 2021-11-30 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 58.0 | | | | | VIOLETv2 | 2022-09-04 |
Accurate and Fast Compressed Video Captioning | ✓ Link | 57.2 | 30.3 | 63.4 | 44.4 | | CoCap (ViT-L/14) | 2023-09-22 |
Diverse Video Captioning by Adaptive Spatio-temporal Attention | ✓ Link | 56.08 | 30.24 | 62.9 | 44.21 | | VASTA (VATEX-backbone) | 2022-08-19 |
Diverse Video Captioning by Adaptive Spatio-temporal Attention | ✓ Link | 55.0 | 30.2 | 62.5 | 43.4 | | VASTA (Kinetics-backbone) | 2022-08-19 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 54.6 | 30.2 | 63.2 | 45.3 | | EMCL-Net | 2022-11-21 |
SEM-POS: Grammatically and Semantically Correct Video Captioning | | 53.1 | 30.7 | 64.1 | 45.2 | 192.6 | SEM-POS | 2023-03-26 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 52.77 | 29.26 | 62.35 | 44.17 | | UniVL + MELTR | 2023-03-23 |