Paper | Code | R@1 (T→V) | R@5 (T→V) | R@10 (T→V) | MnR (T→V) | MdR (T→V) | R@1 (V→T) | R@5 (V→T) | R@10 (V→T) | MdR (V→T) | MnR (V→T) | | | Model | Date
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 64.0 | | 89.3 | | | 64.8 | | 91.5 | | | | | GRAM | 2024-12-16
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 63.9 | 84.3 | 89.6 | | | | | | | | | | VAST | 2023-05-29 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 62.8 | | | | | 60.2 | | | | | | | InternVideo2-6B | 2024-03-22 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 59.9 | 83.5 | 89.6 | | | | | | | | | | VALOR | 2023-04-17 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 58.8 | 81.0 | 87.1 | | | 58.6 | 81.6 | 86.5 | | | | | UMT-L (ViT-L/16) | 2023-03-28 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 58.1 | 81.0 | 81.6 | | | 58.7 | 81.6 | 86.9 | | | | | vid-TLDR (UMT-L) | 2024-03-20 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 57.9 | | | | | | | | | | | | COSA | 2023-06-15 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 55.2 | | | | | 57.9 | | | | | | | InternVideo | 2022-12-06 |
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 55.1 | 78.8 | 87.6 | | | | | | | | | | VLAB | 2023-05-22 |
| | 52.4 | 73.9 | 82.0 | | | | | | | | | 1 | Aurora (ours, r=64) | |
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | | 52.0 | 76.6 | 86.1 | | | | | | | | | | TEFAL | 2023-07-24
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | ✓ Link | 49.4 | 72.1 | 83.5 | | | | | | | | | | UCoFiA | 2023-09-18 |
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 47.8 | 74.2 | 83.8 | | | | | | | | | | OmniVL | 2022-09-15 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 44.5 | 71.4 | 81.6 | | | | | | | | | | CLIP4Clip-seqTransf | 2021-04-18 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 38.6 | 74.4 | 84.7 | | | | | | | | | | All-in-one + MELTR | 2023-03-23 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 37.2 | 64.8 | 75.8 | | | | | | | | | | VIOLETv2 | 2022-09-04 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 35.6 | 65.3 | 78.0 | | | | | | | | 3 | | HD-VILA | 2021-11-19
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 34.3 | 57.8 | 67.0 | | | 64.7 | 85.2 | 91.4 | | | | | VideoCoCa (zero-shot) | 2022-12-09 |
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | | 33.7 | 60.5 | 70.8 | 37.8 | 3.0 | | | | | | | | MDMMT-2 | 2022-03-14 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 33.6 | 63.7 | 77.8 | | 3 | | | | | | | | VIOLET + MELTR | 2023-03-23 |
CLIP2TV: Align, Match and Distill for Video-Text Retrieval | | 33.1 | 58.9 | 68.9 | 44.7 | 3 | | | | | | | | CLIP2TV | 2021-11-10 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 32.9 | 58.3 | 68.4 | 42.6 | 3 | 59.8 | 86.2 | 92.8 | 1 | 3.8 | | | CAMoE | 2021-09-09 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 32.5 | 61.5 | 71.2 | | | | | | | | | | FROZEN | 2021-04-01 |
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | | 32.1 | 60.8 | 70.2 | | 3 | | | | | | | | COTS | 2022-04-15 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 30.0 | 52.4 | 61.6 | | | 49.9 | 73.4 | 81.4 | | | | | CoCa (zero-shot) | 2022-05-04 |
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | ✓ Link | 29.8 | 55.5 | 66.2 | 45.4 | 4 | 54.6 | 82.1 | 90.8 | 1 | 5.3 | | | CLIP2Video | 2021-06-21 |
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | ✓ Link | 29.1 | 54.9 | 65.8 | | | | | | | | | | LAFF | 2021-12-03 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 28.5 | 55.5 | 67.6 | | 4 | | | | | | | | UniVL + MELTR | 2023-03-23 |
Video and Text Matching with Conditioned Embeddings | ✓ Link | 26.0 | 56.7 | | | 3 | 26.7 | 56.5 | | 3 | | | | Ours | 2021-10-21
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 24.8 | 52.1 | 64.0 | | 5 | | | | | | | | TACo | 2021-08-23 |
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 23.1 | 49.8 | 61.8 | 52.8 | 6 | | | | | | | | MDMMT | 2021-03-19 |
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 21.4 | 41.1 | 50.4 | | 10 | 40.3 | 69.7 | 79.2 | 2 | | | | CLIP | 2021-02-24 |
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | ✓ Link | 21.2 | 49.6 | 63.1 | | 6 | | | | | | | | UniVL | 2020-02-15 |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 14.9 | | 52.8 | | 9 | | 40.2 | | | | | | Text-Video Embedding | 2019-06-07 |
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | ✓ Link | 10.7 | 29.6 | 41.2 | | 17 | | | | | | | | RoME | 2022-06-26 |
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 10.2 | | 43.2 | | 13 | | 31.2 | | | | | | JSFusion | 2018-08-07 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 10.0 | 29.0 | 41.2 | 86.8 | 16 | 15.6 | 40.9 | 55.2 | 8.3 | 38.1 | | | Collaborative Experts | 2019-07-31 |
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval | ✓ Link | 7.0 | 20.9 | 29.7 | 213.8 | 29.7 | 12.5 | 32.1 | 42.2 | 16 | 134 | | | JEMC | 2018-06-11 |
Temporal Tessellation: A Unified Approach for Video Analysis | ✓ Link | 4.7 | | 24.1 | | 41 | | 16.6 | | | | | | Kaufman | 2016-12-21 |
Learning Language-Visual Embedding for Movie Understanding with Natural-Language | | 4.2 | | 19.9 | | 55 | | 12.9 | | | | | | C+LSTM+SA+FC7 | 2016-09-26 |
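The metrics reported above (R@1/R@5/R@10, median rank MdR, mean rank MnR) are the standard retrieval measures: R@K is the percentage of queries whose ground-truth match appears in the top K ranked candidates, and MdR/MnR are the median and mean rank of that match. A minimal pure-Python sketch of how they are computed from a query-candidate similarity matrix (a hypothetical helper for illustration, not code from any listed paper):

```python
def retrieval_metrics(sim):
    """Compute R@K, MdR and MnR from sim[i][j] = score(query i, candidate j),
    where candidate i is the ground-truth match for query i."""
    n = len(sim)
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly above the true match.
        ranks.append(1 + sum(1 for j in range(n) if row[j] > row[i]))
    ranks.sort()
    recall = lambda k: 100.0 * sum(r <= k for r in ranks) / n  # percentage
    mid = n // 2
    mdr = ranks[mid] if n % 2 else (ranks[mid - 1] + ranks[mid]) / 2
    return {"R@1": recall(1), "R@5": recall(5), "R@10": recall(10),
            "MdR": mdr, "MnR": sum(ranks) / n}
```

Higher R@K is better, while lower MdR/MnR is better, which is why the strongest models in the table combine high R@1 with a median rank of 1.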