OpenCodePapers

video-retrieval-on-lsmdc

Video Retrieval
Leaderboard
Paper | Code | Metrics (reported values only, in column order: text-to-video R@1, R@5, R@10, Median Rank, Mean Rank; video-to-text R@1, R@10, R@5, Median Rank, Mean Rank) | ModelName | ReleaseDate
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 46.4, 46.7 | InternVideo2-6B | 2024-03-22
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 43.1, 64.5, 71.4, 40.7, 63.6, 70.2 | vid-TLDR (UMT-L) | 2024-03-20
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 43.0, 65.5, 73.0, 41.4, 71.5, 64.3 | UMT-L (ViT-L/16) | 2023-03-28
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 40.4, 80.1, 92.8, 2.0, 3.9, 34.6, 91.8, 71.8, 2.0, 4.3 | HunYuan_tvr (huge) | 2022-04-07
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 39.4 | COSA | 2023-06-15
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 34.4, 55.2, 65.1 | mPLUG-2 | 2023-02-01
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 34.2, 56.0, 64.1 | VALOR | 2023-04-17
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 34.0, 34.9 | InternVideo | 2022-12-06
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 30.7, 51.4, 60.6, 5 | CLIP-ViP | 2022-09-14
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 29.7, 46.4, 55.4, 7, 56.4, 30.1, 55.7, 47.5, 7, 48.9 | HunYuan_tvr | 2022-04-07
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 29.2, 49.5, 58.8, 6 | STAN | 2023-01-26
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 28.7, 50.3, 59.0 | HiTeA | 2022-12-30
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | | 26.9, 46.7, 55.9, 6.7, 48.0 | MDMMT-2 | 2022-03-14
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 26.1, 26.9 | X-CLIP | 2022-07-15
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 25.9, 46.4, 26.7, 54.4, 44.7, 8 | EMCL-Net++ | 2022-11-21
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 25.9, 46.1, 53.7, 54.4 | CAMoE | 2021-09-09
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | ✓ Link | 25.2, 43.7, 53.5, 8.0, 53.2, 22.7, 51.2, 42.6, 10.0, 47.4 | X-Pool | 2022-03-28
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 24.8, 44, 54.5, 8 | Clover | 2022-07-16
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 24.4, 43.1, 54.3, 8.0, 40.7, 23.0, 51.5, 43.5, 9.0, 40.2 | DiffusionRet | 2023-03-17
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | ✓ Link | 24.2, 46.2, 55.9, 8, 47.3, 24.5, 55.8, 46.4, 7, 41.3 | CenterCLIP (ViT-B/16) | 2022-05-02
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 24, 43.5, 54.1 | VIOLETv2 | 2022-09-04
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 23.9, 42.4, 50.9, 22.2, 49.2, 40.6, 12 | EMCL-Net | 2022-11-21
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 22.4, 40.1, 49.5, 11.0 | QB-Norm+CLIP4Clip | 2021-12-23
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 21.6, 41.8, 49.8, 58.0 | CLIP4Clip | 2021-04-18
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 18.8, 38.5, 47.9, 12.3, 58.0 | MDMMT | 2021-03-19
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 17.4, 34.1, 44.1, 15 | HD-VILA | 2021-11-19
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 15.0, 30.8, 39.8, 20.0 | FROZEN | 2021-04-01
Video and Text Matching with Conditioned Embeddings | ✓ Link | 14.9, 33.2, 15.3, 34.1 | Ours | 2021-10-21
Multi-modal Transformer for Video Retrieval | ✓ Link | 13.5, 29.9, 40.1, 19.3 | MMT-Pretrained | 2020-07-21
Multi-modal Transformer for Video Retrieval | ✓ Link | 13.2, 29.2, 38.8, 21 | MMT | 2020-07-21
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 11.3, 22.7, 29.2, 56.5, 6.8, 22.1, 16.4, 73 | CLIP | 2021-02-24
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 11.2, 26.9, 34.8, 25 | Collaborative Experts | 2019-07-31
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data | ✓ Link | 10.1, 25.6, 34.6, 27 | MoEE | 2018-04-07
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 9.1, 21.2, 34.1, 36 | JSFusion | 2018-08-07
Learning from Video and Text via Large-Scale Discriminative Clustering | ✓ Link | 7.3, 19.2, 27.1, 52 | Large-Scale Discriminative Clustering | 2017-07-27
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 7.2, 19.6, 27.9, 40 | Text-Video Embedding | 2019-06-07
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering | | 5.1, 16.3, 25.2, 46 | CT-SAN | 2016-10-10
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 53.7, 8 | EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015) | 2022-11-21
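
The metric columns in the leaderboard follow the usual retrieval conventions: R@K is the percentage of queries whose correct item appears in the top K results, and Median Rank and Mean Rank summarise the position of the correct item (lower is better). The sketch below shows one way these could be computed. It assumes a square similarity matrix in which `sim[i][j]` scores text query `i` against video `j` and the correct match for query `i` is video `i`; the function name `retrieval_metrics`, the matrix layout, and the tie handling are illustrative assumptions, not the evaluation code of any listed paper.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent), median rank and mean rank.

    sim[i][j] is the similarity of query i to candidate j. The correct
    candidate for query i is assumed to sit at index i (an assumption of
    this sketch, matching one-to-one text/video pairs).
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # 1-based rank: one plus the number of candidates scoring strictly
        # higher than the correct one (ties are resolved optimistically).
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)

    metrics = {
        f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks
    }
    metrics["MedianRank"] = median(ranks)
    metrics["MeanRank"] = mean(ranks)
    return metrics


# Toy example: three text queries against three videos.
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.3],  # query 1: correct video ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct video ranked 3rd
]
print(retrieval_metrics(sim, ks=(1, 2)))
```

Under the same one-to-one assumption, video-to-text numbers can be obtained by passing the transposed matrix, so that rows index videos and columns index captions.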