OpenCodePapers

Video Captioning on MSR-VTT

[Figure: results over time for the models in the leaderboard below]
Leaderboard
| Paper | Code | CIDEr | METEOR | ROUGE-L | BLEU-4 | GS | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 80.0 | 34.9 | 70.1 | 57.8 | | mPLUG-2 | 2023-02-01 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 78.0 | | | 56.7 | | VAST | 2023-05-29 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 75.9 | 33.1 | 68.2 | 54.8 | 201.6 | GIT2 | 2022-05-27 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 74.9 | 33.4 | 68.3 | 54.6 | | VLAB | 2023-05-22 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 74.7 | | | 53.7 | | COSA | 2023-06-15 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 74.0 | 32.9 | 68.0 | 54.4 | | VALOR | 2023-04-17 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ Link | 73.6 | | | | | MaMMUT (ours) | 2023-03-29 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 73.2 | | 68.0 | 53.8 | | VideoCoCa | 2022-12-09 |
| RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 69.3 | | 66.1 | 49.6 | | RTQ | 2023-12-01 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ Link | 65.3 | 32.2 | 66.3 | 49.8 | | HowToCaption | 2023-10-07 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 65.1 | 30.7 | 65.0 | 49.2 | | HiTeA | 2022-12-30 |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | ✓ Link | 64.6 | 30.8 | | | | Vid2Seq | 2023-02-27 |
| Text with Knowledge Graph Augmented Transformer for Video Captioning | | 60.8 | 30.5 | 64.8 | 46.6 | | TextKG | 2023-03-22 |
| IcoCap: Improving Video Captioning by Compounding Images | | 60.2 | 31.1 | 64.9 | 47.0 | | IcoCap (ViT-B/16) | 2023-10-05 |
| End-to-end Generative Pretraining for Multimodal Video Captioning | | 60.0 | 38.7 | 64.0 | 48.9 | | MV-GPT | 2022-01-20 |
| IcoCap: Improving Video Captioning by Compounding Images | | 59.1 | 30.3 | 64.3 | 46.1 | | IcoCap (ViT-B/32) | 2023-10-05 |
| CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter | ✓ Link | 58.7 | 31.3 | 64.8 | 48.2 | | CLIP-DCD | 2021-11-30 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 58 | | | | | VIOLETv2 | 2022-09-04 |
| Accurate and Fast Compressed Video Captioning | ✓ Link | 57.2 | 30.3 | 63.4 | 44.4 | | CoCap (ViT-L/14) | 2023-09-22 |
| Diverse Video Captioning by Adaptive Spatio-temporal Attention | ✓ Link | 56.08 | 30.24 | 62.9 | 44.21 | | VASTA (Vatex-backbone) | 2022-08-19 |
| Diverse Video Captioning by Adaptive Spatio-temporal Attention | ✓ Link | 55 | 30.2 | 62.5 | 43.4 | | VASTA (Kinetics-backbone) | 2022-08-19 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 54.6 | 30.2 | 63.2 | 45.3 | | EMCL-Net | 2022-11-21 |
| SEM-POS: Grammatically and Semantically Correct Video Captioning | | 53.1 | 30.7 | 64.1 | 45.2 | 192.6 | SEM-POS | 2023-03-26 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 52.77 | 29.26 | 62.35 | 44.17 | | UniVL + MELTR | 2023-03-23 |
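For readers unfamiliar with the BLEU-4 column, the sketch below shows how sentence-level BLEU-4 is typically computed: clipped n-gram precisions for n = 1..4, combined geometrically and scaled by a brevity penalty. This is a minimal illustrative implementation, not the evaluation code behind these leaderboard numbers (papers generally use standard toolkits such as pycocoevalcap, which also smooth and aggregate differently at corpus level).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, references):
    """Sentence BLEU-4 with uniform weights and brevity penalty (illustrative)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precision = 0.0
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0  # candidate too short for n-grams of this order
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
        precision = clipped / sum(cand_counts.values())
        if precision == 0.0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(precision) / 4  # uniform weights w_n = 1/4
    # Brevity penalty against the reference whose length is closest.
    c_len = len(cand)
    r_len = min((abs(len(r) - c_len), len(r)) for r in refs)[1]
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_precision)
```

A candidate identical to a reference scores 1.0; shorter or partially matching captions are penalized by both the clipped precisions and the brevity penalty.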