OpenCodePapers

Video Retrieval on DiDeMo

Video Retrieval
Leaderboard
Metric columns, in order: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video R@50, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank, text-to-videoR@1. Each row lists only the metrics reported for that model.
Paper | Code | Metrics | Model | Release date
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 74.2, 71.9 | InternVideo2-6B | 2024-03-22
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 72.3, 91.2, 94.2, 68.5, 89.8, 93.8 | vid-TLDR (UMT-L) | 2024-03-20
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 72.0, 89.0, 91.4 | VAST | 2023-05-29
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 70.5 | COSA | 2023-06-15
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 70.4, 90.1, 93.5, 65.7, 89.6, 93.3 | UMT-L (ViT-L/16) | 2023-03-28
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 67.3, 90.1, 63.5, 91.6 | GRAM | 2024-12-16
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 61.5, 85.3, 90.4 | VALOR | 2023-04-17
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ Link | 61.2, 87.2, 91.5 | TESTA (ViT-B/16) | 2023-10-29
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 61.2, 85.8, 91.0 | VindLU | 2022-12-09
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 57.9, 59.1 | InternVideo | 2022-12-06
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 57.6, 84.1, 89.9 | RTQ | 2023-12-01
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 56.8, 81.6, 88.7 | VLAB | 2023-05-22
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 56.5, 81.7, 89.7 | HiTeA | 2022-12-30
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 56.5, 80.2, 87.0 | MuLTI | 2023-03-10
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 56.4, 79.1, 85.2 | mPLUG-2 | 2023-02-01
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 55.3, 82, 89.3, 1 | CLIP-ViP | 2022-09-14
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 54.6, 78.4, 85.1, 1 | STAN | 2023-01-26
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 53.9, 79.4, 86.9 | Singularity | 2022-06-07
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | ✓ Link | 52.7, 79.3, 86.6, 1.0, 10.5 | DMAE (ViT-B/32) | 2023-09-20
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 52.7, 77.8, 85.2, 1.0, 13.7, 54.1, 78.3, 86.8, 1.0, 9.1 | HunYuan_tvr (huge) | 2022-04-07
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 52.4, 79.5, 85.4 | OmniVL | 2022-09-15
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 52.1, 78.2, 85.7, 1, 11.1, 54.8, 79.9, 87.2, 1, 7.1 | HunYuan_tvr | 2022-04-07
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | ✓ Link | 52.0, 79.4, 87.5, 1, 10.5, 51.2, 78.5, 87.4, 1, 7.3 | Cap4Video | 2022-12-31
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 50.1, 76.7, 85.6, 1 | Clover | 2022-07-16
Disentangled Representation Learning for Text-Video Retrieval | ✓ Link | 49.0, 76.5, 84.5, 2.0, 11.5, 49.9, 83.3, 2, 7.9 | DRL | 2022-03-14
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 48.9, 75.5, 83.3, 2.0, 14.1, 50.3, 75.1, 82.9, 1.0, 10.3 | DiffusionRet+QB-Norm | 2023-03-17
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | ✓ Link | 48.6, 76.0, 84.5, 2.0, 12.9, 48.1, 74.2, 85.7, 2.0, 9.8 | PAU | 2023-09-29
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 47.9, 76.5, 84.1 | VIOLETv2 | 2022-09-04
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 47.8, 79.3, 12.6, 47.8, 76.8, 10.5 | X-CLIP | 2022-07-15
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 46.9, 74.9, 82.7, 2.0, 12.1, 46.2, 73.0, 82.7, 2.0, 8.7 | HBI | 2023-03-25
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 46.7, 74.7, 82.7, 2.0, 14.3, 46.2, 74.3, 82.2, 2.0, 10.7 | DiffusionRet | 2023-03-17
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 43.8, 71.4, 79.9, 2.0, 16.3, 45.5, 80.5, 2, 10.2 | CAMoE | 2021-09-09
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 43.5, 71.4, 80.9, 2.0 | QB-Norm+CLIP4Clip | 2021-12-23
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 43.4, 70.2, 80.6, 2.0, 17.5 | CLIP4Clip | 2021-04-18
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 35.9, 67.5, 78.8, 3 | ALPRO | 2021-12-17
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 31.0, 59.8, 72.4, 3 | FROZEN | 2021-04-01
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 28.8, 57.4, 69.1, 4 | HD-VILA | 2021-11-19
Rudder: A Cross Lingual Video and Text Retrieval Dataset | ✓ Link | 16.356.5840.21554.9839.6 | PO Loss | 2021-03-09
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 16.1, 41.1, 54.4, 82.7, 8.3, 43.7 | Collaborative Experts | 2019-07-31
| | 77.4, 85.3, 1, 53.1 | Aurora (ours, r=64) |
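
The leaderboard metrics (R@K, Median Rank, Mean Rank, in both retrieval directions) are usually derived from a query-by-candidate similarity matrix. The following is a minimal sketch of that computation, not the evaluation code of any listed paper. It assumes one correct candidate per query, located at the same index as the query. The function name `retrieval_metrics` and the toy scores are hypothetical, and individual papers may differ in tie handling and in how DiDeMo queries are built.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10, 50)):
    """Compute R@K, median rank and mean rank from a similarity matrix.

    sim[i][j] is the score between query i and candidate j. The correct
    candidate for query i is ASSUMED to be at index i (one-to-one pairing).
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the match.
        # Ties are resolved in favour of the match (an assumption of this sketch).
        ranks.append(1 + sum(1 for score in row if score > target))
    # R@K is the percentage of queries whose match is ranked within the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy text-to-video example: 4 text queries (rows) against 4 videos (columns).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],  # match ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # match ranked 2nd
    [0.1, 0.2, 0.6, 0.3],  # match ranked 1st
    [0.9, 0.8, 0.7, 0.1],  # match ranked 4th
]
t2v = retrieval_metrics(toy_sim, ks=(1, 5))

# Video-to-text uses the same scores with queries and candidates swapped,
# i.e. the transposed matrix.
v2t = retrieval_metrics([list(col) for col in zip(*toy_sim)], ks=(1, 5))
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank, which is why a model can lead on one column and trail on another.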