OpenCodePapers

Video Captioning on YouCook2
Task: Video Captioning · Dataset: YouCook2
Results over time
(Interactive chart of leaderboard metrics plotted against model release date; not reproduced in this extract.)
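As a stand-in for the omitted chart, here is a minimal matplotlib sketch of what it plots: BLEU-4 (where reported) against model release date, for a representative subset of rows from the leaderboard below. The styling and the restriction to BLEU-4 are assumptions, not the site's own rendering code.

```python
# Minimal sketch of the "results over time" view: BLEU-4 vs. release date,
# using a subset of rows from the leaderboard table below.
# Styling is an assumption; this is not the site's own rendering code.
import matplotlib.pyplot as plt
from datetime import date

points = [  # (model, release date, BLEU-4), taken from the leaderboard below
    ("Zhou", date(2018, 4, 3), 4.38),
    ("VideoBERT + S3D", date(2019, 4, 3), 4.33),
    ("UniVL", date(2020, 2, 15), 17.35),
    ("COOT", date(2020, 11, 1), 11.30),
    ("VLM", date(2021, 5, 20), 12.27),
    ("VideoCoCa", date(2022, 12, 9), 14.2),
    ("UniVL + MELTR", date(2023, 3, 23), 17.92),
    ("VAST", date(2023, 5, 29), 18.2),
]

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter([d for _, d, _ in points], [b for _, _, b in points])
for name, d, b in points:
    # Label each point with its model name, offset slightly for readability.
    ax.annotate(name, (d, b), xytext=(4, 4), textcoords="offset points", fontsize=8)
ax.set_xlabel("Release date")
ax.set_ylabel("BLEU-4")
ax.set_title("Video captioning on YouCook2: BLEU-4 over time")
fig.autofmt_xdate()
plt.show()
```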
Leaderboard
| Paper | Code | BLEU-4 | BLEU-3 | CIDEr | ROUGE-L | METEOR | Model | Release date |
|---|---|---|---|---|---|---|---|---|
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | 18.2 | – | 1.99 | – | – | VAST | 2023-05-29 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ | 17.92 | 24.12 | 1.90 | 47.04 | 22.56 | UniVL + MELTR | 2023-03-23 |
| UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | ✓ | 17.35 | 23.87 | 1.81 | 46.52 | 22.35 | UniVL | 2020-02-15 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | – | 14.2 | – | 1.28 | 37.7 | – | VideoCoCa | 2022-12-09 |
| VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | ✓ | 12.27 | 17.78 | 1.3869 | 41.51 | 18.22 | VLM | 2021-05-20 |
| Multimodal Pretraining for Dense Video Captioning | ✓ | 12.04 | – | 1.22 | 39.03 | 18.32 | E2vidD6-MASSvid-BiD | 2020-11-10 |
| Text with Knowledge Graph Augmented Transformer for Video Captioning | – | 11.7 | – | 1.33 | 40.2 | 14.8 | TextKG | 2023-03-22 |
| COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | ✓ | 11.30 | 17.97 | 0.57 | 37.94 | 19.85 | COOT | 2020-11-01 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ | 10.1 | – | 1.31 | – | – | COSA | 2023-06-15 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 8.8 | – | 116.4* | 37.3 | 15.9 | HowToCaption | 2023-10-07 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | – | 8.72 | 12.87 | 1.16 | 36.09 | 14.83 | OmniVL | 2022-09-15 |
| End-to-End Dense Video Captioning with Masked Transformer | ✓ | 4.38 | 7.53 | 0.38 | 27.44 | 11.55 | Zhou | 2018-04-03 |
| VideoBERT: A Joint Model for Video and Language Representation Learning | ✓ | 4.33 | 7.59 | 0.55 | 28.80 | 11.94 | VideoBERT + S3D | 2019-04-03 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ | – | – | 1.31 | – | 17.6 | MA-LMM | 2024-04-08 |

Notes: "✓" marks papers with released code (link targets are not preserved in this extract); "–" means the metric was not reported. Rows are sorted by BLEU-4, descending, with entries lacking BLEU-4 last. *HowToCaption's CIDEr appears to be reported on a ×100 scale, i.e. roughly 1.16 on the 0–10 scale used by the other rows.
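For context on the metric columns: BLEU-n, CIDEr, ROUGE-L, and METEOR are reference-based caption metrics, conventionally computed with the COCO caption evaluation toolkit. The sketch below shows that pipeline with the pycocoevalcap package; the example captions are hypothetical, real evaluations first normalize text with the toolkit's PTBTokenizer, and METEOR is omitted here because it additionally needs a Java runtime.

```python
# Hedged sketch of how the leaderboard's metrics are conventionally computed,
# using the pycocoevalcap toolkit (pip install pycocoevalcap).
# The captions below are hypothetical examples, not YouCook2 data.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map a clip id to a list of caption strings (split on whitespace
# by the scorers); gts holds references, res holds model outputs.
gts = {"clip1": ["add the chopped onions to the hot pan"]}
res = {"clip1": ["add onions to the pan"]}

bleu, _ = Bleu(4).compute_score(gts, res)   # list: [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(gts, res)  # CIDEr on its native 0-10 scale
rouge, _ = Rouge().compute_score(gts, res)  # ROUGE-L F-measure in [0, 1]

print(f"BLEU-3 {bleu[2]:.4f}  BLEU-4 {bleu[3]:.4f}  "
      f"CIDEr {cider:.4f}  ROUGE-L {rouge:.4f}")
```

These scorers return values in [0, 1] (CIDEr excepted), while the table above follows the usual reporting convention of multiplying BLEU, ROUGE-L, and METEOR by 100; CIDEr is mostly left on its native 0–10 scale, with the HowToCaption row the apparent exception noted under the table.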