mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 80.0 | 34.9 | 70.1 | 57.8 | | mPLUG-2 | 2023-02-01 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 78.0 | | | 56.7 | | VAST | 2023-05-29 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 75.9 | 33.1 | 68.2 | 54.8 | 201.6 | GIT2 | 2022-05-27 |
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 74.9 | 33.4 | 68.3 | 54.6 | | VLAB | 2023-05-22 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 74.7 | | | 53.7 | | COSA | 2023-06-15 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 74.0 | 32.9 | 68.0 | 54.4 | | VALOR | 2023-04-17 |
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ Link | 73.6 | | | | | MaMMUT | 2023-03-29 |
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 73.2 | | 68.0 | 53.8 | | VideoCoCa | 2022-12-09 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 69.3 | | 66.1 | 49.6 | | RTQ | 2023-12-01 |
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ Link | 65.3 | 32.2 | 66.3 | 49.8 | | HowToCaption | 2023-10-07 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 65.1 | 30.7 | 65.0 | 49.2 | | HiTeA | 2022-12-30 |
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | ✓ Link | 64.6 | 30.8 | | | | Vid2Seq | 2023-02-27 |
Text with Knowledge Graph Augmented Transformer for Video Captioning | | 60.8 | 30.5 | 64.8 | 46.6 | | TextKG | 2023-03-22 |
IcoCap: Improving Video Captioning by Compounding Images | | 60.2 | 31.1 | 64.9 | 47.0 | | IcoCap (ViT-B/16) | 2023-10-05 |
End-to-end Generative Pretraining for Multimodal Video Captioning | | 60.0 | 38.7 | 64.0 | 48.9 | | MV-GPT | 2022-01-20 |
IcoCap: Improving Video Captioning by Compounding Images | | 59.1 | 30.3 | 64.3 | 46.1 | | IcoCap (ViT-B/32) | 2023-10-05 |
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter | ✓ Link | 58.7 | 31.3 | 64.8 | 48.2 | | CLIP-DCD | 2021-11-30 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 58.0 | | | | | VIOLETv2 | 2022-09-04 |
Accurate and Fast Compressed Video Captioning | ✓ Link | 57.2 | 30.3 | 63.4 | 44.4 | | CoCap (ViT-L/14) | 2023-09-22 |
Diverse Video Captioning by Adaptive Spatio-temporal Attention | ✓ Link | 56.08 | 30.24 | 62.9 | 44.21 | | VASTA (VATEX-backbone) | 2022-08-19 |
Diverse Video Captioning by Adaptive Spatio-temporal Attention | ✓ Link | 55.0 | 30.2 | 62.5 | 43.4 | | VASTA (Kinetics-backbone) | 2022-08-19 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 54.6 | 30.2 | 63.2 | 45.3 | | EMCL-Net | 2022-11-21 |
SEM-POS: Grammatically and Semantically Correct Video Captioning | | 53.1 | 30.7 | 64.1 | 45.2 | 192.6 | SEM-POS | 2023-03-26 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 52.77 | 29.26 | 62.35 | 44.17 | | UniVL + MELTR | 2023-03-23 |