OpenCodePapers

Video Captioning on YouCook2

Task: Video Captioning
Dataset: YouCook2
Results over time: interactive chart of metric scores by release date (omitted).
Leaderboard
| Paper | Code | BLEU-4 | BLEU-3 | CIDEr | ROUGE-L | METEOR | Model | Release Date |
|-------|------|--------|--------|-------|---------|--------|-------|--------------|
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | 18.2 | | 1.99 | | | VAST | 2023-05-29 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ | 17.92 | 24.12 | 1.90 | 47.04 | 22.56 | UniVL + MELTR | 2023-03-23 |
| UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | ✓ | 17.35 | 23.87 | 1.81 | 46.52 | 22.35 | UniVL | 2020-02-15 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 14.2 | | 1.28 | 37.7 | | VideoCoCa | 2022-12-09 |
| VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | ✓ | 12.27 | 17.78 | 1.3869 | 41.51 | 18.22 | VLM | 2021-05-20 |
| Multimodal Pretraining for Dense Video Captioning | ✓ | 12.04 | | 1.22 | 39.03 | 18.32 | E2vidD6-MASSvid-BiD | 2020-11-10 |
| Text with Knowledge Graph Augmented Transformer for Video Captioning | | 11.7 | | 1.33 | 40.2 | 14.8 | TextKG | 2023-03-22 |
| COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | ✓ | 11.30 | 17.97 | 0.57 | 37.94 | 19.85 | COOT | 2020-11-01 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ | 10.1 | | 1.31 | | | COSA | 2023-06-15 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 8.8 | | 116.4 | 37.3 | 15.9 | HowToCaption | 2023-10-07 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 8.72 | 12.87 | 1.16 | 36.09 | 14.83 | OmniVL | 2022-09-15 |
| End-to-End Dense Video Captioning with Masked Transformer | ✓ | 4.38 | 7.53 | 0.38 | 27.44 | 11.55 | Zhou | 2018-04-03 |
| VideoBERT: A Joint Model for Video and Language Representation Learning | ✓ | 4.33 | 7.59 | 0.55 | 28.80 | 11.94 | VideoBERT + S3D | 2019-04-03 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ | | | 1.31 | | 17.6 | MA-LMM | 2024-04-08 |

A ✓ in the Code column marks entries with a public code release; empty cells are metrics the paper does not report. CIDEr values are transcribed as listed at the source, and their scales vary across entries (most report the raw score, while HowToCaption reports it multiplied by 100).
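For readers checking these numbers: BLEU-3 and BLEU-4 are n-gram precision scores over generated captions, and most papers above compute them with the pycocoevalcap toolkit at corpus level. The snippet below is only a minimal sketch using NLTK's smoothed sentence-level BLEU, with hypothetical captions, so its outputs will not match the leaderboard scores exactly.

```python
# Minimal sketch of BLEU-3/BLEU-4 scoring for a single generated caption.
# Assumption: NLTK stands in for the pycocoevalcap toolkit most papers use,
# and the reference/hypothesis captions below are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "add the chopped onions to the pan".split()
hypothesis = "add chopped onions into the pan".split()

# Smoothing avoids zero scores when a higher-order n-gram never matches,
# which is common for short cooking-step captions.
smooth = SmoothingFunction().method1

# BLEU-4: geometric mean of 1- to 4-gram precisions.
bleu4 = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)

# BLEU-3: same idea, restricted to 1- to 3-grams.
bleu3 = sentence_bleu([reference], hypothesis,
                      weights=(1/3, 1/3, 1/3),
                      smoothing_function=smooth)

print(f"BLEU-4: {bleu4:.4f}  BLEU-3: {bleu3:.4f}")
```

CIDEr, ROUGE-L, and METEOR each require their own scorers (pycocoevalcap bundles all of them), and the leaderboard entries are corpus-level averages over the YouCook2 validation set rather than single-sentence scores.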