OpenCodePapers

Zero-Shot Video Retrieval on LSMDC
Leaderboard
| Paper | Code | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | text-to-video Median Rank | text-to-video Mean Rank | video-to-text R@1 | video-to-text R@5 | video-to-text R@10 | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 33.8 | 55.9 | 62.2 | | | 30.1 | 47.7 | 54.8 | InternVideo2-6B | 2024-03-22 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 32.0 | 52.4 | 59.4 | | | 27.3 | 44.2 | 51.6 | InternVideo2-1B | 2024-03-22 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 27.7 | 46.5 | 54.6 | 7 | | | | | VAST, HowToCaption-finetuned | 2023-10-07 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ | 25.2 | 43.0 | 50.5 | | | 23.2 | 37.7 | 44.2 | UMT-L (ViT-L/16) | 2023-03-28 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 24.1 | 43.8 | 52.0 | | | | | | mPLUG-2 | 2023-02-01 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ | 19.5 | 35.9 | 45.0 | | | | | | BT-Adapter | 2023-09-27 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 18.3 | 36.7 | 44.2 | | | | | | HiTeA-17M | 2022-12-30 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 17.6 | 32.4 | 40.2 | | | 13.2 | 27.8 | 34.9 | InternVideo | 2022-12-06 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 17.3 | 31.7 | 38.6 | 29 | | | | | HowToCaption | 2023-10-07 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | ✓ | 17.2 | 32.4 | 39.1 | | | | | | Yatai Ji et al. | 2022-11-24 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 15.5 | 31.1 | 39.8 | | | | | | HiTeA-5M | 2022-12-30 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ | 15.1 | 28.5 | 36.4 | 28 | 117 | | | | CLIP4Clip | 2021-04-18 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ | 14.7 | 29.2 | 38.2 | 24 | | | | | Clover | 2022-07-16 |
| Bridging Video-text Retrieval with Multiple Choice Questions | ✓ | 12.2 | 25.9 | 32.2 | 42.0 | | | | | Y. Ge et al. | 2022-01-13 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | ✓ | 11.1 | 24.7 | 30.6 | 50.7 | | | | | MILES | 2022-04-26 |
| Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ | 4.2 | 11.6 | 17.1 | | | | | | SSML | 2020-03-06 |
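
For readers unfamiliar with the leaderboard columns: R@K is the percentage of queries whose correct match appears in the top K results (higher is better), while median rank and mean rank summarise the position of the correct match (lower is better). The sketch below shows one common way to compute these values from a query-by-candidate similarity matrix. It is only an illustration: the function names, the toy similarity values, and the assumption that query i matches candidate i are hypothetical, and the listed papers may differ in details such as tie handling.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, median rank and mean rank.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption (for illustration): the correct candidate for query i
    is candidate i, i.e. the ground truth lies on the diagonal.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly higher
        # than the correct one (ties are resolved optimistically).
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


def transpose(sim):
    """Swap queries and candidates (text-to-video <-> video-to-text)."""
    return [list(col) for col in zip(*sim)]


# Toy 3x3 example: rows are text queries, columns are videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
t2v = retrieval_metrics(sim, ks=(1, 2))             # text-to-video
v2t = retrieval_metrics(transpose(sim), ks=(1, 2))  # video-to-text
print(t2v)
print(v2t)
```

Transposing the matrix swaps the roles of queries and candidates, which is how the same routine can serve both the text-to-video and video-to-text columns.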