OpenCodePapers

Zero-Shot Video Retrieval on MSR-VTT

Dataset: MSR-VTT

Models are pretrained on other sources and evaluated on MSR-VTT text-video retrieval without fine-tuning on it. Results are reported as recall at rank K (R@K, higher is better) and as the median/mean rank of the correct match (lower is better).
Results over time
Leaderboard
Abbreviations: T2V = text-to-video, V2T = video-to-text, R@K = recall at rank K (%), MdR = median rank, MnR = mean rank. Blank cells were not reported.

| Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | Model | Release date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 55.9 | 78.3 | 85.1 | | | 53.7 | 77.5 | 84.1 | | InternVideo2-6B | 2024-03-22 |
| Gramian Multimodal Representation Learning and Alignment | ✓ | 54.8 | | 83.9 | | | 52.9 | | 82.9 | | GRAM | 2024-12-16 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 51.9 | 75.3 | 82.5 | | | 50.9 | 73.4 | 81.8 | | InternVideo2-1B | 2024-03-22 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 50 | 73.2 | 81.4 | 1 | | | | | | VAST, HowToCaption-finetuned | 2023-10-07 |
| Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ | 49.9 | 71.0 | 79.6 | | | 49.4 | 73.9 | 82.4 | | FluxViT-B | 2025-03-18 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | 49.3 | 68.3 | 73.9 | | | | | | | VAST | 2023-05-29 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 47.1 | 69.7 | 79.0 | | | | | | | mPLUG-2 | 2023-02-01 |
| Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ | 45.0 | 67.5 | 75.8 | | | 44.9 | 68.2 | 76.5 | | FluxViT-S | 2025-03-18 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ | 44.8 | 70.0 | 78.7 | 2 | | 40.9 | 66.4 | 75.7 | 2 | LanguageBind (ViT-H/14) | 2023-10-03 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ | 42.8 | 67.5 | 76.0 | 2.0 | | 38.3 | 65.8 | 77.8 | 3.0 | LanguageBind (ViT-L/14) | 2023-10-03 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ | 42.6 | 64.4 | 73.1 | | | 38.6 | 59.8 | 69.6 | | UMT-L (ViT-L/16) | 2023-03-28 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ | 42.1 | 63.9 | 72.4 | | | 37.7 | 59.8 | 69.4 | | vid-TLDR (UMT-L) | 2024-03-20 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ | 40.9 | 64.7 | 73.5 | | | | | | | BT-Adapter | 2023-09-27 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 40.7 | | | | | 39.6 | | | | InternVideo | 2022-12-06 |
| Florence: A New Foundation Model for Computer Vision | ✓ | 37.6 | 63.8 | 72.6 | | | | | | | Florence | 2021-11-22 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 37.6 | 62 | 73.3 | 3 | | | | | | HowToCaption | 2023-10-07 |
| ImageBind: One Embedding Space To Bind Them All | ✓ | 36.8 | 61.8 | 70.0 | | | | | | | ImageBind | 2023-05-09 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 34.6 | 58.4 | 66.6 | | | | | | | OmniVL | 2022-09-15 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 34.4 | 60.0 | 69.9 | | | | | | | HiTeA-17M | 2022-12-30 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ | 34.0 | 56.7 | 66.7 | | | | | | | Singularity-17M | 2022-06-07 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ | 32.0 | 57.0 | 66.9 | 4 | 34.0 | | | | | CLIP4Clip | 2021-04-18 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | ✓ | 30.9 | 54.4 | 65.0 | | | | | | | Yatai Ji et al. | 2022-11-24 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 29.9 | 54.2 | 62.9 | | | | | | | HiTeA-5M | 2022-12-30 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ | 28.4 | 50.2 | 59.5 | | | | | | | Singularity-5M | 2022-06-07 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ | 26.4 | 49.5 | 60 | 6 | | | | | | Clover | 2022-07-16 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | ✓ | 26.1 | 47.2 | 56.9 | 7 | | | | | | MILES | 2022-04-26 |
| Bridging Video-text Retrieval with Multiple Choice Questions | ✓ | 26.0 | 46.4 | 56.4 | 7.0 | | | | | | Y. Ge et al. | 2022-01-13 |
| VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | ✓ | 25.9 | 49.5 | 59.7 | | | | | | | VIOLET | 2021-11-24 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ | 24.7 | 46.9 | 57.2 | 7.0 | | | | | | FROZEN | 2021-04-01 |
| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ | 24.1 | 44.7 | 55.4 | 8 | | | | | | ALPRO | 2021-12-17 |
| Object-aware Video-language Pre-training for Retrieval | ✓ | 23.4 | 47.5 | 55.6 | 8.0 | | | | | | OA-Trans | 2021-12-01 |
| LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | | 23.4 | 44.1 | 53.3 | 8 | | 17.2 | 36.2 | 47.9 | 12 | LaT | 2022-07-11 |
| Learning Audio-Video Modalities from Image Captions | | 19.4 | 39.5 | 50.3 | | | | | | | A. Nagrani et al. | 2022-04-01 |
| Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ | 14.6 | 34.4 | 44.1 | 15 | | | | | | HD-VILA | 2021-11-19 |
| Multi-granularity Correspondence Learning from Long-term Noisy Videos | ✓ | 10.7 | 24.1 | | | | | | | | Norton | 2024-01-30 |
| VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ | 10.4 | 22.2 | 30.0 | | | | | | | VideoCLIP | 2021-09-28 |
| End-to-End Learning of Visual Representations from Uncurated Instructional Videos | ✓ | 9.9 | 24.0 | 32.4 | 29.5 | | | | | | MIL-NCE | 2019-12-13 |
| TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 9.8 | 25.0 | 33.4 | | | | | | | TACo | 2021-08-23 |
| Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ | 8.0 | 21.3 | 29.3 | | | | | | | SSML | 2020-03-06 |
| Multi-modal Transformer for Video Retrieval | ✓ | | 14.4 | | 66 | 148.1 | | | | | MMT | 2020-07-21 |
| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ | | | 29.7 | 49 | | | | | | VATT-MBS | 2021-04-22 |
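The recall and rank columns above can be reproduced from a text-video similarity matrix. The sketch below is a minimal NumPy illustration, not code from any of the listed papers: it assumes a one-to-one pairing of N captions and N videos (as in the common MSR-VTT 1k-A protocol of 1,000 caption-video pairs), and the embedding arrays are hypothetical placeholders.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, median rank (MdR), and mean rank (MnR) for retrieval.

    sim: (N, N) similarity matrix where sim[i, j] scores query i against
         gallery item j; the ground-truth match for query i is item i.
    """
    n = sim.shape[0]
    # Sort gallery items from most to least similar for each query.
    order = np.argsort(-sim, axis=1)
    # 1-indexed rank at which the ground-truth item appears for each query.
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

# Hypothetical caption/video embeddings; a real evaluation would use a
# pretrained model's outputs instead of random vectors.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(1000, 512))
video_emb = rng.normal(size=(1000, 512))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
video_emb /= np.linalg.norm(video_emb, axis=1, keepdims=True)

# Text-to-video uses captions as queries; video-to-text is the transpose.
sim = text_emb @ video_emb.T
print("text-to-video:", retrieval_metrics(sim))
print("video-to-text:", retrieval_metrics(sim.T))
```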