OpenCodePapers

Video Captioning on MSVD

Video Captioning
Leaderboard
| Paper | Code | CIDEr | BLEU-4 | METEOR | ROUGE-L | GS | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ | 195.6 | - | - | - | - | MaMMUT | 2023-03-29 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - | 179.8 | 79.3 | 51.2 | 87.9 | - | VLAB | 2023-05-22 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ | 178.5 | 80.7 | 51.0 | 87.9 | - | VALOR | 2023-04-17 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ | 178.5 | 76.5 | - | - | - | COSA | 2023-06-15 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 165.8 | 70.5 | 48.4 | 85.3 | - | mPLUG-2 | 2023-02-01 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 154.2 | 70.4 | 46.4 | 83.2 | - | HowToCaption | 2023-10-07 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - | 146.9 | 71.0 | 45.3 | 81.4 | - | HiTeA | 2022-12-30 |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | ✓ | 146.2 | - | 45.3 | - | - | Vid2Seq | 2023-02-27 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ | 139.2 | - | - | - | - | VIOLETv2 | 2022-09-04 |
| RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ | 123.4 | 66.9 | - | 82.2 | - | RTQ | 2023-12-01 |
| Accurate and Fast Compressed Video Captioning | ✓ | 121.5 | 60.1 | 41.4 | 78.2 | - | CoCap (ViT/L14) | 2023-09-22 |
| Diverse Video Captioning by Adaptive Spatio-temporal Attention | ✓ | 119.7 | 59.2 | 40.65 | 76.7 | - | VASTA (Vatex-backbone) | 2022-08-19 |
| IcoCap: Improving Video Captioning by Compounding Images | - | 110.3 | 59.1 | 39.5 | 76.5 | - | IcoCap (ViT-B/16) | 2023-10-05 |
| SEM-POS: Grammatically and Semantically Correct Video Captioning | - | 108.3 | 60.1 | 38.5 | 76.0 | 607.1 | SEM-POS | 2023-03-26 |
| Diverse Video Captioning by Adaptive Spatio-temporal Attention | ✓ | 106.4 | 56.1 | 39.1 | 74.5 | - | VASTA (Kinetics-backbone) | 2022-08-19 |
| IcoCap: Improving Video Captioning by Compounding Images | - | 103.8 | 56.3 | 38.9 | 75.0 | - | IcoCap (ViT-B/32) | 2023-10-05 |