OpenCodePapers

Video Retrieval on MSR-VTT

Video Retrieval
Leaderboard
Paper | Code | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | text-to-video Mean Rank | text-to-video Median Rank | video-to-text R@1 | video-to-text R@5 | video-to-text R@10 | video-to-text Median Rank | video-to-text Mean Rank | text-to-video MedianR | text-to-video Median Rank | Model Name | Release Date
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 64 89.3 64.8 91.5 | GRAM | 2024-12-16
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 63.9 84.3 89.6 | VAST | 2023-05-29
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 62.8 60.2 | InternVideo2-6B | 2024-03-22
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 59.9 83.5 89.6 | VALOR | 2023-04-17
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 58.8 81.0 87.1 58.6 81.6 86.5 | UMT-L (ViT-L/16) | 2023-03-28
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 58.1 81.0 81.6 58.7 81.6 86.9 | vid-TLDR (UMT-L) | 2024-03-20
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 57.9 | COSA | 2023-06-15
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 55.2 57.9 | InternVideo | 2022-12-06
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 55.1 78.8 87.6 | VLAB | 2023-05-22
| | 52.4 73.9 82 1 | Aurora (ours, r=64) |
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | | 52 76.6 86.1 | TEFAL | 2023-07-24
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | ✓ Link | 49.4 72.1 83.5 | UCoFiA | 2023-09-18
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 47.8 74.2 83.8 | OmniVL | 2022-09-15
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 44.5 71.4 81.6 | CLIP4Clip-seqTransf | 2021-04-18
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 38.6 74.4 84.7 | All-in-one + MELTR | 2023-03-23
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 37.2 64.8 75.8 | VIOLETv2 | 2022-09-04
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 35.6 65.3 78 3 | HD-VILA | 2021-11-19
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 34.3 57.8 67.0 64.7 85.2 91.4 | VideoCoCa (zero-shot) | 2022-12-09
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | | 33.7 60.5 70.8 37.8 3.0 | MDMMT-2 | 2022-03-14
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 33.6 63.7 77.8 3 | VIOLET + MELTR | 2023-03-23
CLIP2TV: Align, Match and Distill for Video-Text Retrieval | | 33.1 58.9 68.9 44.7 3 | CLIP2TV | 2021-11-10
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 32.9 58.3 68.4 42.6 3 59.8 86.2 92.8 1 3.8 | CAMoE | 2021-09-09
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 32.5 61.5 71.2 | FROZEN | 2021-04-01
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | | 32.1 60.8 70.2 3 | COTS | 2022-04-15
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 30.0 52.4 61.6 49.9 73.4 81.4 | CoCa (zero-shot) | 2022-05-04
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | ✓ Link | 29.8 55.5 66.2 45.4 4 54.6 82.1 90.8 1 5.3 | CLIP2Video | 2021-06-21
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | ✓ Link | 29.1 54.9 65.8 | LAFF | 2021-12-03
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 28.5 55.5 67.6 4 | UniVL + MELTR | 2023-03-23
Video and Text Matching with Conditioned Embeddings | ✓ Link | 26 56.7 3 26.7 56.5 3 | Ours | 2021-10-21
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 24.8 52.1 64.0 5 | TACo | 2021-08-23
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 23.1 49.8 61.8 52.8 6 | MDMMT | 2021-03-19
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 21.4 41.1 50.4 10 40.3 69.7 79.2 2 | CLIP | 2021-02-24
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | ✓ Link | 21.2 49.6 63.1 6 | UniVL | 2020-02-15
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 14.9 52.8 9 40.2 | Text-Video Embedding | 2019-06-07
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | ✓ Link | 10.7 29.6 41.2 17 | RoME | 2022-06-26
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 10.2 43.2 13 31.2 | JSFusion | 2018-08-07
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 10.0 29.0 41.2 86.8 16 15.6 40.9 55.2 8.3 38.1 | Collaborative Experts | 2019-07-31
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval | ✓ Link | 7.0 20.9 29.7 213.8 29.7 12.5 32.1 42.2 16 134 | JEMC | 2018-06-11
Temporal Tessellation: A Unified Approach for Video Analysis | ✓ Link | 4.7 24.1 41 16.6 | Kaufman | 2016-12-21
Learning Language-Visual Embedding for Movie Understanding with Natural-Language | | 4.2 19.9 55 12.9 | C+LSTM+SA+FC7 | 2016-09-26
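
The metric names in the column headers (R@K, Median Rank, Mean Rank) are the usual ranking metrics for retrieval. As a rough illustration only, the sketch below shows one common way to compute them from a query-by-candidate similarity matrix. The function name `retrieval_metrics`, the toy scores, and the assumption that query i has exactly one correct candidate at index i are all hypothetical. The individual papers listed above may use different protocols, for example several captions per video.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent), median rank, and mean rank.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption for this sketch: the correct candidate for query i is index i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct candidate: 1 + number of candidates scored higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy text-to-video example: 3 text queries (rows) against 3 videos (columns).
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.7, 0.2],
    [0.4, 0.6, 0.5],
]
toy = retrieval_metrics(toy_sim, ks=(1, 2))
print(toy)
```

Under the same one-to-one assumption, video-to-text numbers could be obtained by passing the transposed matrix, so that rows index videos and columns index texts. Higher R@K is better, while lower median and mean rank are better.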