Paper | Code | Accuracy | Model Name | Release Date |
---|---|---|---|---|
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 97.6 | VIOLETv2 | 2022-09-04 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 97.4 | HiTeA | 2022-12-30 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 95.5 | VindLU | 2022-12-09 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 95.2 | Clover | 2022-07-16 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 93.7 | Singularity-temporal | 2022-06-07 |
Multi-granularity Correspondence Learning from Long-term Noisy Videos | ✓ Link | 92.7 | Norton | 2024-01-30 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 92.1 | Singularity | 2022-06-07 |