Paper | Code | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | video-to-text R@1 | video-to-text R@5 | video-to-text R@10 | Model Name | Release Date |
---|---|---|---|---|---|---|---|---|---|
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 46.3 | 70.5 | 79.6 | 42.4 | 65.9 | 75.4 | InternVL-G | 2023-12-21 |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | ✓ Link | 44.7 | 68.2 | 78.4 | 40.2 | 63.1 | 74.1 | InternVL-C | 2023-12-21 |
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 34.3 | 57.8 | 67.0 | 64.7 | 85.2 | 91.4 | VideoCoCa | 2022-12-09 |