Paper | Code | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | video-to-text R@1 | video-to-text R@5 | video-to-text R@10 | ModelName | ReleaseDate |
---|---|---|---|---|---|---|---|---|---|
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 83.9 | | 99.5 | 82.7 | | 99 | GRAM | 2024-12-16 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 71.5 | 94.0 | 97.1 | 85.3 | 97.9 | 99.3 | InternVideo2-6B | 2024-03-22 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 70.4 | 93.4 | 96.9 | 85.4 | 97.6 | 99.1 | InternVideo2-1B | 2024-03-22 |
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 53.2 | 83.3 | 90.1 | 73.6 | 93.2 | 97.2 | VideoCoCa | 2022-12-09 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 49.5 | | | 69.5 | | | InternVideo | 2022-12-06 |