Video retrieval results on LSMDC (Rohrbach et al., 2015). R@k is recall at rank k in percent (higher is better); MedR and MnR are the median and mean rank of the correct match (lower is better). Metrics are reported for both text-to-video (t2v) and video-to-text (v2t) retrieval; a short sketch of how they are computed follows the table.

Paper | Code | t2v R@1 | t2v R@5 | t2v R@10 | t2v MedR | t2v MnR | v2t R@1 | v2t R@5 | v2t R@10 | v2t MedR | v2t MnR | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 46.4 | | | | | 46.7 | | | | | InternVideo2-6B | 2024-03-22 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 43.1 | 64.5 | 71.4 | | | 40.7 | 63.6 | 70.2 | | | vid-TLDR (UMT-L) | 2024-03-20 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 43.0 | 65.5 | 73.0 | | | 41.4 | 64.3 | 71.5 | | | UMT-L (ViT-L/16) | 2023-03-28 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 40.4 | 80.1 | 92.8 | 2.0 | 3.9 | 34.6 | 71.8 | 91.8 | 2.0 | 4.3 | HunYuan_tvr (huge) | 2022-04-07 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 39.4 | | | | | | | | | | COSA | 2023-06-15 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 34.4 | 55.2 | 65.1 | | | | | | | | mPLUG-2 | 2023-02-01 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 34.2 | 56.0 | 64.1 | | | | | | | | VALOR | 2023-04-17 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 34.0 | | | | | 34.9 | | | | | InternVideo | 2022-12-06 |
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 30.7 | 51.4 | 60.6 | 5 | | | | | | | CLIP-ViP | 2022-09-14 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 29.7 | 46.4 | 55.4 | 7 | 56.4 | 30.1 | 47.5 | 55.7 | 7 | 48.9 | HunYuan_tvr | 2022-04-07 |
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 29.2 | 49.5 | 58.8 | 6 | | | | | | | STAN | 2023-01-26 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 28.7 | 50.3 | 59.0 | | | | | | | | HiTeA | 2022-12-30 |
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | | 26.9 | 46.7 | 55.9 | 6.7 | 48.0 | | | | | | MDMMT-2 | 2022-03-14 |
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 26.1 | | | | | 26.9 | | | | | X-CLIP | 2022-07-15 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 25.9 | 46.4 | 53.7 | | 8 | 26.7 | 44.7 | 54.4 | | 8 | EMCL-Net++ | 2022-11-21 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 25.9 | 46.1 | 53.7 | | 54.4 | | | | | | CAMoE | 2021-09-09 |
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | ✓ Link | 25.2 | 43.7 | 53.5 | 8.0 | 53.2 | 22.7 | 42.6 | 51.2 | 10.0 | 47.4 | X-Pool | 2022-03-28 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 24.8 | 44.0 | 54.5 | 8 | | | | | | | Clover | 2022-07-16 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 24.4 | 43.1 | 54.3 | 8.0 | 40.7 | 23.0 | 43.5 | 51.5 | 9.0 | 40.2 | DiffusionRet | 2023-03-17 |
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | ✓ Link | 24.2 | 46.2 | 55.9 | 8 | 47.3 | 24.5 | 46.4 | 55.8 | 7 | 41.3 | CenterCLIP (ViT-B/16) | 2022-05-02 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 24.0 | 43.5 | 54.1 | | | | | | | | VIOLETv2 | 2022-09-04 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 23.9 | 42.4 | 50.9 | | | 22.2 | 40.6 | 49.2 | | 12 | EMCL-Net | 2022-11-21 |
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 22.4 | 40.1 | 49.5 | 11.0 | | | | | | | QB-Norm+CLIP4Clip | 2021-12-23 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 21.6 | 41.8 | 49.8 | | 58.0 | | | | | | CLIP4Clip | 2021-04-18 |
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 18.8 | 38.5 | 47.9 | 12.3 | 58.0 | | | | | | MDMMT | 2021-03-19 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 17.4 | 34.1 | 44.1 | 15 | | | | | | | HD-VILA | 2021-11-19 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 15.0 | 30.8 | 39.8 | 20.0 | | | | | | | FROZEN | 2021-04-01 |
Video and Text Matching with Conditioned Embeddings | ✓ Link | 14.9 | 33.2 | | | | 15.3 | 34.1 | | | | Ours | 2021-10-21 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 13.5 | 29.9 | 40.1 | 19.3 | | | | | | | MMT-Pretrained | 2020-07-21 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 13.2 | 29.2 | 38.8 | 21 | | | | | | | MMT | 2020-07-21 |
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 11.3 | 22.7 | 29.2 | 56.5 | | 6.8 | 16.4 | 22.1 | 73 | | CLIP | 2021-02-24 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 11.2 | 26.9 | 34.8 | 25 | | | | | | | Collaborative Experts | 2019-07-31 |
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data | ✓ Link | 10.1 | 25.6 | 34.6 | 27 | | | | | | | MoEE | 2018-04-07 |
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 9.1 | 21.2 | 34.1 | 36 | | | | | | | JSFusion | 2018-08-07 |
Learning from Video and Text via Large-Scale Discriminative Clustering | ✓ Link | 7.3 | 19.2 | 27.1 | 52 | | | | | | | Large-Scale Discriminative Clustering | 2017-07-27 |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 7.2 | 19.6 | 27.9 | 40 | | | | | | | Text-Video Embedding | 2019-06-07 |
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering | | 5.1 | 16.3 | 25.2 | 46 | | | | | | | CT-SAN | 2016-10-10 |
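The recall and rank metrics above can all be derived from a model's text-video similarity matrix. Below is a minimal sketch in NumPy, assuming the standard evaluation protocol in which the matrix is square and the correct video for text query i sits at column i; the data here is random and purely illustrative, not any listed model's scores.

```python
import numpy as np

def retrieval_metrics(sim):
    """R@1/R@5/R@10 (%), median rank (MedR) and mean rank (MnR) from a
    query-by-candidate similarity matrix whose ground-truth pairs lie on
    the diagonal."""
    order = np.argsort(-sim, axis=1)            # candidates sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]       # ground-truth index for each query
    ranks = np.argmax(order == gt, axis=1) + 1  # 1-indexed rank of the correct match
    return {
        "R@1":  100.0 * np.mean(ranks <= 1),
        "R@5":  100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "MedR": float(np.median(ranks)),
        "MnR":  float(np.mean(ranks)),
    }

# Illustrative only: 1,000 text queries vs. 1,000 videos (LSMDC-style 1k test set).
rng = np.random.default_rng(0)
sim = rng.normal(size=(1000, 1000))
sim[np.arange(1000), np.arange(1000)] += 2.0    # give correct pairs a higher score
print(retrieval_metrics(sim))                   # t2v direction
print(retrieval_metrics(sim.T))                 # v2t direction uses the transpose
```

Transposing the similarity matrix swaps the query and candidate roles, which is why the t2v and v2t columns in the table come from the same matrix but generally differ.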