Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | V2T MnR | Model | Date
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 62.9 | 84.5 | 90.8 | 1.0 | 9.3 | 64.8 | 84.9 | 91.1 | 1.0 | 5.5 | HunYuan_tvr (huge) | 2022-04-07 |
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 57.7 | 80.5 | 88.2 | 1.0 | | | | | | | CLIP-ViP | 2022-09-14 |
PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | | 55.9 | 79.8 | 87.6 | 1.0 | 10.7 | 54.5 | 78.3 | 87.3 | 1.0 | 7.5 | PIDRo | 2023-01-01 |
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | ✓ Link | 55.5 | 79.4 | 87.1 | 1.0 | 10.0 | 55.7 | 79.2 | 87.2 | 1.0 | 7.3 | DMAE (ViT-B/16) | 2023-09-20 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 55.0 | | | | | 55.5 | 78.4 | 85.8 | 1.0 | 7.7 | HunYuan_tvr | 2022-04-07 |
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 54.7 | 77.7 | 86.0 | | | | | | | | MuLTI | 2023-03-10 |
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 54.1 | 79.5 | 87.8 | 1 | | | | | | | STAN | 2023-01-26 |
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | ✓ Link | 54.1 | 78.8 | 86.9 | | | | | | | | EERCF | 2024-01-01 |
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval | ✓ Link | 54.0 | 79.3 | 87.4 | | | | | | | | TS2-Net | 2022-07-16 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 53.4 | 76.1 | 84.4 | | | | | | | | RTQ | 2023-12-01 |
Disentangled Representation Learning for Text-Video Retrieval | ✓ Link | 53.3 | 80.3 | 87.6 | 1 | 11.4 | 56.2 | 79.9 | 87.4 | 1.0 | 7.6 | DRL | 2022-03-14 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 53.1 | 77.6 | 84.7 | | | | | | | | mPLUG-2 | 2023-02-01 |
CLIP2TV: Align, Match and Distill for Video-Text Retrieval | | 52.9 | 78.5 | 86.5 | 1 | 12.8 | 54.1 | 77.4 | 85.7 | 1 | 9.0 | CLIP2TV | 2021-11-10 |
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 52.3 | 75.5 | 84.2 | 1.0 | 12.8 | | | | | | Side4Video | 2023-11-27 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 51.6 | 78.1 | 85.3 | 1 | | 51.8 | 80.2 | 88.0 | 1 | | EMCL-Net++ | 2022-11-21 |
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | ✓ Link | 51.4 | 75.7 | 83.9 | 1 | 12.4 | 49.0 | 75.2 | 85.0 | 2 | 8.0 | Cap4Video | 2022-12-31 |
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | ✓ Link | 49.8 | 75.1 | 83.9 | | | 47.3 | 76.0 | 84.3 | | | SuMA (ViT-B/16) | 2023-02-19 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 49.6 | 76.7 | 84.2 | | | | | | | | X2-VLM (large) | 2022-11-22 |
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | ✓ Link | 49.4 | 72.1 | 83.5 | | | 47.1 | 74.3 | 83.0 | | | UCoFiA | 2023-09-18 |
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 49.3 | 75.8 | 84.8 | 2.0 | 12.2 | 48.9 | 76.8 | 84.5 | 2.0 | 8.1 | X-CLIP | 2022-07-15 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 49.0 | 75.2 | 82.7 | 2.0 | 12.1 | 47.7 | 73.8 | 84.5 | 2.0 | 8.8 | DiffusionRet | 2023-03-17 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 48.9 | 75.2 | 83.1 | 2.0 | 12.1 | 49.3 | 74.3 | 83.8 | 2.0 | 8.5 | DiffusionRet+QB-Norm | 2023-03-17 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 48.8 | 75.6 | 85.3 | 2 | 12.4 | 50.3 | 74.6 | 83.8 | 2 | 9.9 | CAMoE | 2021-09-09 |
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 48.6 | 74.6 | 83.4 | 2.0 | 12.0 | 46.8 | 74.3 | 84.3 | 2.0 | 8.9 | HBI | 2023-03-25 |
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | ✓ Link | 48.5 | 72.7 | 82.5 | 2.0 | 14.0 | 48.3 | 73.0 | 83.2 | 2.0 | 9.7 | PAU | 2023-09-29 |
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | ✓ Link | 48.4 | 73.8 | 82.0 | 2 | 13.8 | 47.7 | 75.0 | 83.3 | 2 | 10.2 | CenterCLIP (ViT-B/16) | 2022-05-02 |
Holistic Features are almost Sufficient for Text-to-Video Retrieval | ✓ Link | 48.0 | 75.9 | 83.5 | | | | | | | | TeachCLIP (ViT-B/16) | 2024-01-01 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 47.6 | 74.1 | 84.2 | | | | | | | | X2-VLM (base) | 2022-11-22 |
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 47.2 | 73.0 | 83.0 | 2 | | | | | | | QB-Norm+CLIP2Video | 2021-12-23 |
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | ✓ Link | 46.9 | 72.8 | 82.2 | 2 | 14.3 | 44.4 | 73.3 | 84.0 | 2.0 | 9.0 | X-Pool | 2022-03-28 |
Holistic Features are almost Sufficient for Text-to-Video Retrieval | ✓ Link | 46.8 | 74.3 | 82.6 | | | | | | | | TeachCLIP | 2024-01-01 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 46.8 | 73.1 | 83.1 | 2 | | 46.5 | 73.5 | 83.5 | 2 | | EMCL-Net | 2022-11-21 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 46.8 | 71.2 | 81.9 | | | | | | | | HiTeA | 2022-12-30 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 46.5 | 71.5 | 80.4 | | | | | | | | VindLU | 2022-12-09 |
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | ✓ Link | 45.8 | 71.5 | 82.0 | | | | | | | | LAFF | 2021-12-03 |
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | ✓ Link | 45.6 | 72.6 | 81.7 | 2 | 14.6 | 43.3 | 72.3 | 82.1 | 2 | 10.2 | CLIP2Video | 2021-06-21 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 41.5 | 68.7 | 77.0 | | | | | | | | Singularity | 2022-06-07 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 41.3 | 73.5 | 82.5 | | | | | | | | All-in-one + MELTR | 2023-03-23 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 40.5 | 69.8 | 79.4 | 2 | | | | | | | Clover | 2022-07-16 |
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 38.9 | 69.0 | 79.7 | 2 | 16.5 | | | | | | MDMMT | 2021-03-19 |
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | | 38.9 | 63.1 | 73.9 | 3 | | | | | | | MAC | 2022-12-02 |
All in One: Exploring Unified Video-Language Pre-training | ✓ Link | 37.9 | 68.1 | 77.1 | | | | | | | | All-in-one-B | 2022-03-14 |
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 37.6 | 64.8 | 75.1 | 3 | | | | | | | BridgeFormer | 2022-01-13 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 37.6 | 63.8 | 72.6 | | | | | | | | Florence | 2021-11-22 |
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | | 36.8 | 63.8 | 73.2 | 2 | | | | | | | COTS | 2022-04-15 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 35.5 | 67.2 | 78.4 | 3 | | | | | | | VIOLET + MELTR | 2023-03-23 |
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 31.2 | 53.7 | 64.2 | 4 | | 27.2 | 51.7 | 62.6 | 5 | | CLIP | 2021-02-24 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 31.1 | 55.7 | 68.3 | 4 | | | | | | | UniVL + MELTR | 2023-03-23 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 31.0 | 59.5 | 70.5 | 3 | | | | | | | FROZEN | 2021-04-01 |
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ Link | 30.9 | 55.4 | 66.8 | | | | | | | | VideoCLIP | 2021-09-28 |
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 28.4 | 57.8 | 71.2 | 4 | | | | | | | TACo | 2021-08-23 |
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | ✓ Link | 28.1 | 55.5 | 67.4 | 4 | | | | | | | VLM | 2021-05-20 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 26.6 | 57.1 | 69.6 | 4 | 24.0 | | | | | | MMT-Pretrained | 2020-07-21 |
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 26.0 | 46.4 | 56.4 | 7 | | | | | | | BridgeFormer (Zero-shot) | 2022-01-13 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 24.6 | 54.0 | 67.1 | 4 | 26.7 | | | | | | MMT | 2020-07-21 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 20.9 | 48.8 | 62.4 | 6 | 28.2 | | | | | | Collaborative Experts | 2019-07-31 |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 14.9 | 40.2 | 52.8 | 9 | | | | | | | HT-Pretrained | 2019-06-07 |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 12.1 | 35.0 | 48.0 | 12 | | | | | | | HT | 2019-06-07 |
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 10.2 | 31.2 | 43.2 | 13 | | | | | | | JSFusion | 2018-08-07 |
OmniVec: Learning robust representations with cross modal sharing | | | | 89.4 | | | | | | | | OmniVec | 2023-11-07 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | | | 81.6 | 2 | 15.3 | 42.7 | 70.9 | 80.6 | 2 | | CLIP4Clip | 2021-04-18 |
OmniVec: Learning robust representations with cross modal sharing | | | | 78.6 | | | | | | | | OmniVec (pretrained) | 2023-11-07 |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | ✓ Link | | | | | | 42.8 | | | | | Socratic Models | 2022-04-01 |
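The metrics in the table (Recall@K, median rank MdR, mean rank MnR) are all derived from a query-by-candidate similarity matrix. A minimal sketch of the standard computation, assuming the correct candidate for query i sits on the diagonal (the usual convention for paired retrieval benchmarks; `retrieval_metrics` is an illustrative name, not from any of the listed codebases):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute Recall@K (as %), median rank and mean rank from a
    query-by-candidate similarity matrix, where the correct match
    for query i is assumed to be candidate i (the diagonal)."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    # 1-based rank of the correct candidate for each query
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    return {
        "R@1": 100.0 * np.mean(ranks <= 1),
        "R@5": 100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "MdR": float(np.median(ranks)),
        "MnR": float(np.mean(ranks)),
    }

# Toy example: 3 queries; queries 0 and 1 rank their match first,
# query 2 ranks its match last (rank 3).
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.5, 0.4, 0.2]])
m = retrieval_metrics(sim)
```

Note that lower is better for MdR and MnR, while higher is better for Recall@K; the text-to-video and video-to-text columns come from scoring the same similarity matrix along its two axes.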