Paper | Code | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Model Name | Release Date |
---|---|---|---|---|---|---|
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 73.3 | 92.7 | 96.6 | UMT-L (ViT-L/16) | 2023-03-28 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 73.1 | 93.3 | 96.6 | vid-TLDR (UMT-L) | 2024-03-20 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 55.2 | 81.4 | 89.1 | HiTeA | 2022-12-30 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 53.1 | 81.8 | | VindLU | 2022-12-09 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 47.4 | 75.9 | 84.0 | Singularity-temporal | 2022-06-07 |