Paper | Code | Text-to-video R@1 | Text-to-video R@5 | Text-to-video R@10 | Model Name | Release Date |
---|---|---|---|---|---|---|
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ Link | 83.4 | 93.8 | 95.3 | TESTA (ViT-B/16) | 2023-10-29 |
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | ✓ Link | 69.7 | 85.7 | 90.3 | LF-VILA | 2022-10-12 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 67.8 | 86.3 | 81.8 | VINDLU | 2022-12-09 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 53.8 | 75.7 | 82.7 | Frozen | 2021-04-01 |
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 15.1 | | | QB-Norm+TT-CE+ | 2021-12-23 |