InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 61.4 | | | | | | 85.2 | | | | | InternVideo2-6B | 2024-03-22 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 59.0 | 84.0 | 90.3 | 1.0 | 7.6 | | 73.0 | 94.5 | 96.6 | 1.0 | 7.6 | HunYuan_tvr (huge) | 2022-04-07 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 58.4 | | | | | | 76.3 | | | | | InternVideo | 2022-12-06 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 58.2 | 83.5 | 90.1 | 1 | 7.8 | | 69.1 | 91.5 | 95.0 | 1.0 | 3.8 | HunYuan_tvr | 2022-04-07 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 57.9 | 83.8 | 89.4 | | | | 82.7 | 94.5 | 96.3 | | | vid-TLDR (UMT-L) | 2024-03-20 |
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 57.5 | 83.6 | 89.9 | | | | | | | | | VLAB | 2023-05-22 |
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | | 56.8 | 83.1 | 89.2 | 1.0 | 8.8 | | | | | | | MDMMT-2 | 2022-03-14 |
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 56.1 | 81.7 | 88.8 | 1.0 | 8.4 | | | | | | | Side4Video | 2023-11-27 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 51.8 | 87.6 | 87.6 | 1 | 8.9 | | 69.3 | 90.6 | 94.6 | 1 | 3.1 | CAMoE | 2021-09-09 |
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | ✓ Link | 51.8 | 80.8 | 88.3 | 1 | 8.3 | | 70.0 | 93.2 | 96.2 | 1 | 2.4 | Cap4Video | 2022-12-31 |
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | ✓ Link | 50.6 | 80.3 | 88.4 | 1 | 8.4 | | 68.4 | 90.1 | 95.0 | 1 | 3.0 | CenterCLIP (ViT-B/16) | 2022-05-02 |
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 50.4 | 80.6 | | | 8.4 | | 66.8 | | 90.4 | | 4.2 | X-CLIP | 2022-07-15 |
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | ✓ Link | 48.7 | 78.4 | 86.3 | 2.0 | 9.8 | | | | | | | DMAE (ViT-B/32) | 2023-09-20 |
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 48.0 | 77.9 | 86.2 | 2.0 | | | | | | | | QB-Norm+CLIP2Video | 2021-12-23 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 47.9 | 77.2 | 84.8 | | 15.6 | | 60.3 | 86.4 | 92 | 1.0 | 4.5 | DiffusionRet+QB-Norm | 2023-03-17 |
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | ✓ Link | 47.3 | 77.4 | 85.5 | 2.0 | 9.6 | | 68.9 | 93.1 | 97.1 | 1.0 | 2.4 | PAU | 2023-09-29 |
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | ✓ Link | 47.2 | 77.4 | 86.0 | 2.0 | 9.3 | | 66.4 | 90.0 | 94.2 | 1.0 | 3.3 | X-Pool | 2022-03-28 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 46.6 | 75.9 | 84.1 | 2.0 | 15.7 | | 61.9 | 88.3 | 92.9 | 1.0 | 4.5 | DiffusionRet | 2023-03-17 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 46.2 | 76.1 | 84.6 | 2 | 10.0 | | 62.0 | 87.3 | 92.6 | 1 | | CLIP4Clip | 2021-04-18 |
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | ✓ Link | 45.4 | 76.0 | 84.6 | | | | | | | | | LAFF | 2021-12-03 |
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 37 | 64.1 | 73.8 | 3.0 | | | 59.9 | 85.2 | 90.7 | 1 | | CLIP | 2021-02-24 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 33.7 | 64.7 | 76.3 | 3 | | | | | | | | FROZEN | 2021-04-01 |
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ Link | 20.3 | 49.0 | 63.3 | 6.0 | | | | | | | | SSML | 2020-03-06 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 19.8 | 49.0 | 63.8 | 6.0 | 23.1 | 89.0 | | | | | | Collaborative Experts | 2019-07-31 |