VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 99.6 | | | | VideoMAE V2-g | 2023-03-29 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 97.5 | Kinetics400 | false | | MVD (ViT-B) | 2022-12-08 |
A Large-Scale Analysis on Self-Supervised Video Representation Learning | | 97.3 | Kinetics400 | false | | SSL-KD (R21D-18) | 2023-06-09 |
Masked Motion Encoding for Self-Supervised Video Representation Learning | ✓ Link | 96.5 | Kinetics400 | false | | M3Video | 2022-10-12 |
A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning | ✓ Link | 96.3 | Kinetics400 | false | | pBYOL | 2021-04-29 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 96.1 | Kinetics400 | false | | VideoMAE | 2022-03-23 |
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning | ✓ Link | 95.3 | Kinetics400 | false | | SCE (R3D-50) | 2022-12-21 |
Self-Supervised MultiModal Versatile Networks | ✓ Link | 95.2 | AudioSet + HowTo100M | false | | MMV TSM-50x2 | 2020-06-29 |
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning | ✓ Link | 94.1 | Kinetics400 | | | XKD (ViT-B/112/16) | 2022-11-25 |
Spatiotemporal Contrastive Video Representation Learning | ✓ Link | 93.9 | Kinetics600 | false | | CVRL (R3D-152 2x; K600) | 2020-08-09 |
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning | ✓ Link | 93.7 | Kinetics400 | false | | RSPNet | 2020-10-27 |
Spatiotemporal Contrastive Video Representation Learning | ✓ Link | 93.4 | Kinetics600 | false | | CVRL (R3D-50; K600) | 2020-08-09 |
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | ✓ Link | 93.4 | no extra data | false | | VideoMS (ViT-B) | 2022-11-19 |
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning | ✓ Link | 93.4 | | | | XKD-Modality-Agnostic (ViT-B/112/16) | 2022-11-25 |
Broaden Your Views for Self-Supervised Video Learning | ✓ Link | 93.1 | | false | | BraVe:V-FA (TSM-50x2) | 2021-03-30 |
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity | ✓ Link | 92.4 | AudioSet | false | | CrissCross (AudioSet) | 2021-11-09 |
Spatiotemporal Contrastive Video Representation Learning | ✓ Link | 92.2 | Kinetics400 | false | | CVRL (R3D-50; K400) | 2020-08-09 |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 91.5 | AudioSet (Audio+Video) | false | | AVID+CMA (Modified R2+1D-18 on Audioset) | 2020-04-27 |
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity | ✓ Link | 91.5 | Kinetics400 | false | | CrissCross (Kinetics400) | 2021-11-09 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 91.3 | no extra data | false | | VideoMAE(no extra data) | 2022-03-23 |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 91.0 | AudioSet (Audio+Video) | false | | AVID (Modified R2+1D-18 on Audioset) | 2020-04-27 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 90.5 | UCF101 | false | | ViCC (S3D; R+F) | 2021-06-18 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 88.8 | UCF101 | false | | ViCC (S3D; RGB) | 2021-06-18 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 88.8 | UCF101 | false | | ViCC (R2+1D; R+F) | 2021-06-18 |
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity | ✓ Link | 88.3 | Kinetics-Sound | false | | CrissCross (Kinetics-Sound) | 2021-11-09 |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 87.5 | Kinetics400 (Audio+Video) | false | | AVID+CMA (Modified R2+1D-18 on Kinetics) | 2020-04-27 |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 86.9 | Kinetics400 (Audio+Video) | false | | AVID (Modified R2+1D-18 on Kinetics) | 2020-04-27 |
Self-Supervised Video Representation Learning with Meta-Contrastive Network | | 85.4 | | | | MCN (R3D-18; RGB) | 2021-08-19 |
Self-Supervised Video Representation Learning with Meta-Contrastive Network | | 84.8 | | | | MCN (R2+1D; RGB) | 2021-08-19 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 82.8 | UCF101 | false | | ViCC (R2+1D; RGB) | 2021-06-18 |
TCLR: Temporal Contrastive Learning for Video Representation | ✓ Link | 82.4 | UCF101 | false | | TCLR (R3D-18) | 2021-01-20 |
Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning | ✓ Link | 82.3 | UCF101 | false | | PCL (ResNet-18) | 2020-10-29 |
Video Representation Learning by Dense Predictive Coding | ✓ Link | 75.7 | Kinetics400 | false | | DPC (Modified 3D Resnet-34) | 2019-09-10 |
Self-supervised Co-training for Video Representation Learning | ✓ Link | 74.5 | | false | | CoCLR | 2020-10-19 |
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework | ✓ Link | 74.4 | UCF101 | false | | IIC (R3D) | 2020-08-06 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 72.2 | UCF101 | true | | ViCC (S3D; RGB) | 2021-06-18 |
Temporally Coherent Embeddings for Self-Supervised Video Representation Learning | ✓ Link | 71.2 | Kinetics400 | false | | TCE (ResNet-50) | 2020-03-21 |
Temporally Coherent Embeddings for Self-Supervised Video Representation Learning | ✓ Link | 68.8 | Kinetics400 | false | | TCE (ResNet-18, Split 1) | 2020-03-21 |
Video Representation Learning by Dense Predictive Coding | ✓ Link | 68.2 | Kinetics400 | false | | DPC (3D ResNet-18) | 2019-09-10 |
Temporally Coherent Embeddings for Self-Supervised Video Representation Learning | ✓ Link | 68.2 | UCF101 | false | | TCE (ResNet18, Split 1) | 2020-03-21 |
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning | ✓ Link | 66.0 | UCF101 | false | | VCP (R3D) | 2020-01-02 |
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles | | 65.8 | Kinetics400 | false | | 3D Cubic Puzzles (3D ResNet-18) | 2018-11-24 |
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction | | 64.9 | UCF101 | false | | Video Clip Ordering (R3D) | 2019-06-01 |
Skip-Clip: Self-Supervised Spatiotemporal Representation Learning by Future Clip Order Ranking | | 64.4 | UCF101 | false | | Skip-Clip (3D ResNet-18) | 2019-10-28 |
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction | | 62.9 | Kinetics400 | false | | 3D RotNet (3D ResNet-18) | 2018-11-28 |
Video Representation Learning by Dense Predictive Coding | ✓ Link | 60.6 | UCF101 | false | | DPC (3D ResNet-18, Split 1) | 2019-09-10 |
Self-Supervised Video Representation Learning With Odd-One-Out Networks | | 60.3 | UCF101 | false | | O3N (AlexNet) | 2016-11-21 |
Contrastive Multiview Coding | ✓ Link | 59.1 | UCF101 | false | | Contrastive Multiview Coding (CaffeNet x2) | 2019-06-13 |
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics | ✓ Link | 58.8 | UCF101 | false | | Motion & Appearance (C3D) | 2019-04-07 |
Learning and Using the Arrow of Time | | 55.3 | UCF101 | false | | Arrow of Time (AlexNet) | 2018-06-01 |
Generating Videos with Scene Dynamics | | 52.1 | UCF101 | false | | VideoGan (C3D) | 2016-09-08 |
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification | | 50.9 | UCF101 | false | | Shuffle and Learn (AlexNet) | 2016-03-28 |
SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos | ✓ Link | | UCF101 | false | 83.2 | SLIC (R3D-18) | 2022-06-25 |