OpenCodePapers

Video Retrieval on MSR-VTT-1kA

Video Retrieval
Results over time: interactive chart of the leaderboard metrics by release date (not reproduced here).
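Both the chart and the leaderboard report Recall@K (R@1/R@5/R@10, higher is better) and median/mean rank (lower is better), computed separately for the text-to-video and video-to-text directions. The sketch below shows the conventional way these numbers are derived from a query-gallery similarity matrix on the paired 1k-A test set; the function name and array shapes are illustrative assumptions, not code from any listed paper.

```python
import numpy as np

def retrieval_metrics(sim):
    """Recall@K, median rank and mean rank from a similarity matrix.

    sim[i, j] is the score of gallery item j for query i; the ground-truth
    match of query i is assumed to be gallery item i (the usual paired setup
    on MSR-VTT 1k-A, where sim is 1000 x 1000).
    """
    num_queries = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # gallery indices sorted best-first per query
    # 1-based position at which the ground-truth item appears for each query.
    gt_rank = np.where(order == np.arange(num_queries)[:, None])[1] + 1
    return {
        "R@1": 100.0 * np.mean(gt_rank <= 1),
        "R@5": 100.0 * np.mean(gt_rank <= 5),
        "R@10": 100.0 * np.mean(gt_rank <= 10),
        "Median Rank": float(np.median(gt_rank)),
        "Mean Rank": float(np.mean(gt_rank)),
    }

# Text-to-video uses the caption-by-video matrix; video-to-text its transpose:
# t2v = retrieval_metrics(sim)
# v2t = retrieval_metrics(sim.T)
```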
Leaderboard
(T2V = text-to-video, V2T = video-to-text; R@K in %, MdR = median rank, MnR = mean rank; "-" = not reported. "✓ Link" marks entries with released code.)

Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | V2T MnR | Model | Release Date
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - | 62.9 | 84.5 | 90.8 | 1.0 | 9.3 | 64.8 | 84.9 | 91.1 | 1.0 | 5.5 | HunYuan_tvr (huge) | 2022-04-07
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 57.7 | 80.5 | 88.2 | 1.0 | - | - | - | - | - | - | CLIP-ViP | 2022-09-14
PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | - | 55.9 | 79.8 | 87.6 | 1.0 | 10.7 | 54.5 | 78.3 | 87.3 | 1.0 | 7.5 | PIDRo | 2023-01-01
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | ✓ Link | 55.5 | 79.4 | 87.1 | 1.0 | 10.0 | 55.7 | 79.2 | 87.2 | 1.0 | 7.3 | DMAE (ViT-B/16) | 2023-09-20
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - | 55.0 | - | - | - | - | 55.5 | 78.4 | 85.8 | 1.0 | 7.7 | HunYuan_tvr | 2022-04-07
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - | 54.7 | 77.7 | 86.0 | - | - | - | - | - | - | - | MuLTI | 2023-03-10
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 54.1 | 79.5 | 87.8 | 1 | - | - | - | - | - | - | STAN | 2023-01-26
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | ✓ Link | 54.1 | 78.8 | 86.9 | - | - | - | - | - | - | - | EERCF | 2024-01-01
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval | ✓ Link | 54.0 | 79.3 | 87.4 | - | - | - | - | - | - | - | TS2-Net | 2022-07-16
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 53.4 | 76.1 | 84.4 | - | - | - | - | - | - | - | RTQ | 2023-12-01
Disentangled Representation Learning for Text-Video Retrieval | ✓ Link | 53.3 | 80.3 | 87.6 | 1 | 11.4 | 56.2 | 79.9 | 87.4 | 1.0 | 7.6 | DRL | 2022-03-14
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 53.1 | 77.6 | 84.7 | - | - | - | - | - | - | - | mPLUG-2 | 2023-02-01
CLIP2TV: Align, Match and Distill for Video-Text Retrieval | - | 52.9 | 78.5 | 86.5 | 1 | 12.8 | 54.1 | 77.4 | 85.7 | 1 | 9.0 | CLIP2TV | 2021-11-10
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 52.3 | 75.5 | 84.2 | 1.0 | 12.8 | - | - | - | - | - | Side4Video | 2023-11-27
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 51.6 | 78.1 | 85.3 | 1 | - | 51.8 | 80.2 | 88 | 1 | - | EMCL-Net++ | 2022-11-21
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | ✓ Link | 51.4 | 75.7 | 83.9 | 1 | 12.4 | 49.0 | 75.2 | 85.0 | 2 | 8.0 | Cap4Video | 2022-12-31
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | ✓ Link | 49.8 | 75.1 | 83.9 | - | - | 47.3 | 76 | 84.3 | - | - | SuMA (ViT-B/16) | 2023-02-19
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 49.6 | 76.7 | 84.2 | - | - | - | - | - | - | - | X2-VLM (large) | 2022-11-22
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | ✓ Link | 49.4 | 72.1 | 83.5 | - | - | 47.1 | 74.3 | 83.0 | - | - | UCoFiA | 2023-09-18
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 49.3 | 75.8 | 84.8 | 2.0 | 12.2 | 48.9 | 76.8 | 84.5 | 2.0 | 8.1 | X-CLIP | 2022-07-15
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 49.0 | 75.2 | 82.7 | 2.0 | 12.1 | 47.7 | 73.8 | 84.5 | 2.0 | 8.8 | DiffusionRet | 2023-03-17
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 48.9 | 75.2 | 83.1 | 2.0 | 12.1 | 49.3 | 74.3 | 83.8 | 2.0 | 8.5 | DiffusionRet+QB-Norm | 2023-03-17
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 48.8 | 75.6 | 85.3 | 2 | 12.4 | 50.3 | 74.6 | 83.8 | 2 | 9.9 | CAMoE | 2021-09-09
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 48.6 | 74.6 | 83.4 | 2.0 | 12.0 | 46.8 | 74.3 | 84.3 | 2.0 | 8.9 | HBI | 2023-03-25
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | ✓ Link | 48.5 | 72.7 | 82.5 | 2.0 | 14.0 | 48.3 | 73.0 | 83.2 | 2.0 | 9.7 | PAU | 2023-09-29
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | ✓ Link | 48.4 | 73.8 | 82.0 | 2 | 13.8 | 47.7 | 75.0 | 83.3 | 2 | 10.2 | CenterCLIP (ViT-B/16) | 2022-05-02
Holistic Features are almost Sufficient for Text-to-Video Retrieval | ✓ Link | 48.0 | 75.9 | 83.5 | - | - | - | - | - | - | - | TeachCLIP (ViT-B/16) | 2024-01-01
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 47.6 | 74.1 | 84.2 | - | - | - | - | - | - | - | X2-VLM (base) | 2022-11-22
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 47.2 | 73.0 | 83.0 | 2 | - | - | - | - | - | - | QB-Norm+CLIP2Video | 2021-12-23
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | ✓ Link | 46.9 | 72.8 | 82.2 | 2 | 14.3 | 44.4 | 73.3 | 84.0 | 2.0 | 9.0 | X-Pool | 2022-03-28
Holistic Features are almost Sufficient for Text-to-Video Retrieval | ✓ Link | 46.8 | 74.3 | 82.6 | - | - | - | - | - | - | - | TeachCLIP | 2024-01-01
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 46.8 | 73.1 | 83.1 | 2 | - | 46.5 | 73.5 | 83.5 | 2 | - | EMCL-Net | 2022-11-21
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - | 46.8 | 71.2 | 81.9 | - | - | - | - | - | - | - | HiTeA | 2022-12-30
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 46.5 | 71.5 | 80.4 | - | - | - | - | - | - | - | VindLU | 2022-12-09
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | ✓ Link | 45.8 | 71.5 | 82 | - | - | - | - | - | - | - | LAFF | 2021-12-03
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | ✓ Link | 45.6 | 72.6 | 81.7 | 2 | 14.6 | 43.3 | 72.3 | 82.1 | 2 | 10.2 | CLIP2Video | 2021-06-21
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 41.5 | 68.7 | 77 | - | - | - | - | - | - | - | Singularity | 2022-06-07
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 41.3 | 73.5 | 82.5 | - | - | - | - | - | - | - | All-in-one + MELTR | 2023-03-23
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 40.5 | 69.8 | 79.4 | 2 | - | - | - | - | - | - | Clover | 2022-07-16
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 38.9 | 69.0 | 79.7 | 2 | 16.5 | - | - | - | - | - | MDMMT | 2021-03-19
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | - | 38.9 | 63.1 | 73.9 | 3 | - | - | - | - | - | - | MAC | 2022-12-02
All in One: Exploring Unified Video-Language Pre-training | ✓ Link | 37.9 | 68.1 | 77.1 | - | - | - | - | - | - | - | All-in-one-B | 2022-03-14
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 37.6 | 64.8 | 75.1 | 3 | - | - | - | - | - | - | BridgeFormer | 2022-01-13
Florence: A New Foundation Model for Computer Vision | ✓ Link | 37.6 | 63.8 | 72.6 | - | - | - | - | - | - | - | Florence | 2021-11-22
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | - | 36.8 | 63.8 | 73.2 | 2 | - | - | - | - | - | - | COTS | 2022-04-15
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 35.5 | 67.2 | 78.4 | 3 | - | - | - | - | - | - | VIOLET + MELTR | 2023-03-23
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 31.2 | 53.7 | 64.2 | 4 | - | 27.2 | 51.7 | 62.6 | 5 | - | CLIP | 2021-02-24
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 31.1 | 55.7 | 68.3 | 4 | - | - | - | - | - | - | UniVL + MELTR | 2023-03-23
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 31.0 | 59.5 | 70.5 | 3 | - | - | - | - | - | - | FROZEN | 2021-04-01
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ Link | 30.9 | 55.4 | 66.8 | - | - | - | - | - | - | - | VideoCLIP | 2021-09-28
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | - | 28.4 | 57.8 | 71.2 | 4 | - | - | - | - | - | - | TACo | 2021-08-23
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | ✓ Link | 28.10 | 55.50 | 67.40 | 4 | - | - | - | - | - | - | VLM | 2021-05-20
Multi-modal Transformer for Video Retrieval | ✓ Link | 26.6 | 57.1 | 69.6 | 4 | 24.0 | - | - | - | - | - | MMT-Pretrained | 2020-07-21
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 26 | 46.4 | 56.4 | 7 | - | - | - | - | - | - | BridgeFormer (Zero-shot) | 2022-01-13
Multi-modal Transformer for Video Retrieval | ✓ Link | 24.6 | 54.0 | 67.1 | 4 | 26.7 | - | - | - | - | - | MMT | 2020-07-21
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 20.9 | 48.8 | 62.4 | 6 | 28.2 | - | - | - | - | - | Collaborative Experts | 2019-07-31
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 14.9 | 40.2 | 52.8 | 9 | - | - | - | - | - | - | HT-Pretrained | 2019-06-07
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 12.1 | 35.0 | 48.0 | 12 | - | - | - | - | - | - | HT | 2019-06-07
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 10.2 | 31.2 | 43.2 | 13 | - | - | - | - | - | - | JSFusion | 2018-08-07
OmniVec: Learning robust representations with cross modal sharing | - | - | - | 89.4 | - | - | - | - | - | - | - | OmniVec | 2023-11-07
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | - | - | 81.6 | 2 | 15.3 | 42.7 | 70.9 | 80.6 | 2 | - | CLIP4Clip | 2021-04-18
OmniVec: Learning robust representations with cross modal sharing | - | - | - | 78.6 | - | - | - | - | - | - | - | OmniVec (pretrained) | 2023-11-07
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | ✓ Link | - | - | 42.8 | - | - | - | - | - | - | - | Socratic Models | 2022-04-01
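Most of the CLIP-based entries above (e.g. CLIP4Clip, CLIP2Video, X-CLIP, TS2-Net) score a caption against a video by comparing its text embedding with some aggregation of per-frame image embeddings. The snippet below is a minimal sketch of the simplest such baseline, mean-pooling frame features from an off-the-shelf CLIP checkpoint (roughly the parameter-free pooling variant studied in CLIP4Clip); the frame-sampling inputs, helper names, and batching are illustrative assumptions, not any leaderboard entry's actual implementation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_video(frames):
    """frames: list of PIL images sampled from one clip (sampling strategy is an assumption)."""
    pixel_values = processor(images=frames, return_tensors="pt")["pixel_values"]
    feats = model.get_image_features(pixel_values=pixel_values)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    video_feat = feats.mean(dim=0)  # parameter-free temporal pooling over frames
    return video_feat / video_feat.norm()

@torch.no_grad()
def encode_texts(captions):
    """captions: list of strings."""
    inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
    return feats / feats.norm(dim=-1, keepdim=True)

# sim[i, j] = cosine similarity of caption i and video j. Ranking videos within
# each row gives text-to-video retrieval; ranking captions within each column
# gives video-to-text retrieval. Feed sim to retrieval_metrics above to obtain
# R@K and median/mean rank in the leaderboard's format.
# video_feats = torch.stack([encode_video(f) for f in sampled_frames_per_video])
# text_feats = encode_texts(test_captions)
# sim = text_feats @ video_feats.T
```

Stronger entries in the table replace the mean-pooling step with learned temporal modules, token selection, or cross-modal interaction, but the evaluation protocol on the similarity matrix stays the same.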