Paper | Code | Text-to-video R@1 | Text-to-video R@5 | Text-to-video R@10 | Model Name | Release Date |
---|---|---|---|---|---|---|
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ Link | 24.9 | 46.5 | 55.1 | TESTA (ViT-B/16) | 2023-10-29 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 18.4 | 36.4 | 44.3 | VINDLU | 2022-12-09 |
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | ✓ Link | 13.6 | 32.5 | 41.8 | LF-VILA | 2022-10-12 |