Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 79.7 | Kinetics400 | false | MVD (ViT-B) | 2022-12-08 |
Masked Motion Encoding for Self-Supervised Video Representation Learning | ✓ Link | 78.0 | Kinetics400 | false | M3Video | 2022-10-12 |
A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning | ✓ Link | 75.0 | Kinetics400 | false | pBYOL | 2021-04-29 |
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning | ✓ Link | 74.7 | Kinetics400 | false | SCE (R3D-50) | 2022-12-21 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 73.3 | Kinetics400 | false | VideoMAE | 2022-03-23 |
Broaden Your Views for Self-Supervised Video Learning | ✓ Link | 70.5 | | false | BraVe:V-FA (TSM-50x2) | 2021-03-30 |
Spatiotemporal Contrastive Video Representation Learning | ✓ Link | 69.9 | Kinetics600 | false | CVRL (R3D-152 2x; K600) | 2020-08-09 |
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning | ✓ Link | 69 | | | XKD (ViT-B/112/16) | 2022-11-25 |
Self-Supervised Learning by Cross-Modal Audio-Video Clustering | ✓ Link | 68.9 | IG-Kinetics | false | XDC | 2019-11-28 |
Spatiotemporal Contrastive Video Representation Learning | ✓ Link | 68.0 | Kinetics600 | false | CVRL (R3D-50; K600) | 2020-08-09 |
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity | ✓ Link | 66.8 | AudioSet | false | CrissCross (AudioSet) | 2021-11-09 |
Spatiotemporal Contrastive Video Representation Learning | ✓ Link | 66.7 | Kinetics400 | false | CVRL (R3D-50; K400) | 2020-08-09 |
Self-Supervised Learning by Cross-Modal Audio-Video Clustering | ✓ Link | 66.5 | IG-Random | false | XDC | 2019-11-28 |
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning | ✓ Link | 65.9 | | | XKD-Modality-Agnostic (ViT-B/112/16) | 2022-11-25 |
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | ✓ Link | 65.8 | no extra data | false | VideoMS (ViT-B) | 2022-11-19 |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 64.7 | Audioset (Video+Audio) | false | AVID+CMA (Modified R2+1D-18 on Audioset) | 2020-04-27 |
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning | ✓ Link | 64.7 | Kinetics400 | false | RSPNet | 2020-10-27 |
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity | ✓ Link | 64.7 | Kinetics400 | false | CrissCross (Kinetics400) | 2021-11-09 |
Evolving Losses for Unsupervised Video Representation Learning | | 64.5 | | false | ELo | 2020-02-26 |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 64.1 | Audioset (Video+Audio) | false | AVID (Modified R2+1D-18 on Audioset) | 2020-04-27 |
Self-Supervised Learning by Cross-Modal Audio-Video Clustering | ✓ Link | 63.7 | AudioSet | false | XDC | 2019-11-28 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 62.6 | no extra data | false | VideoMAE(no extra data) | 2022-03-23 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 62.2 | UCF101 | false | ViCC (S3D; R+F) | 2021-06-18 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 61.5 | UCF101 | false | ViCC (R2+1D; R+F) | 2021-06-18 |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 60.8 | Kinetics400 (Video+Audio) | false | AVID+CMA (Modified R2+1D-18 on Kinetics) | 2020-04-27 |
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity | ✓ Link | 60.5 | Kinetics-Sound | false | CrissCross (Kinetics-Sound) | 2021-11-09 |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | ✓ Link | 59.9 | Kinetics400 (Video+Audio) | false | AVID (Modified R2+1D-18 on Kinetics) | 2020-04-27 |
Self-Supervised Video Representation Learning with Meta-Contrastive Network | | 54.8 | UCF101 | false | MCN (R3D-18; RGB) | 2021-08-19 |
Self-Supervised Video Representation Learning with Meta-Contrastive Network | | 54.5 | UCF101 | false | MCN (R2+1D; RGB) | 2021-08-19 |
SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos | ✓ Link | 54.5 | UCF101 | false | SLIC (R3D-18) | 2022-06-25 |
TCLR: Temporal Contrastive Learning for Video Representation | ✓ Link | 52.9 | UCF101 | false | TCLR (R3D-18) | 2021-01-20 |
Self-Supervised Learning by Cross-Modal Audio-Video Clustering | ✓ Link | 52.6 | Kinetics400 | false | XDC | 2019-11-28 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 52.4 | UCF101 | false | ViCC (R2+1D; RGB) | 2021-06-18 |
Self-supervised Co-training for Video Representation Learning | ✓ Link | 46.1 | | false | CoCLR | 2020-10-19 |
Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning | ✓ Link | 43.2 | UCF101 | false | PCL (ResNet-18) | 2020-10-29 |
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting | ✓ Link | 38.5 | UCF101 | true | ViCC (S3D; RGB) | 2021-06-18 |
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework | ✓ Link | 38.3 | UCF101 | false | IIC (R3D) | 2020-08-06 |
Temporally Coherent Embeddings for Self-Supervised Video Representation Learning | ✓ Link | 36.6 | Kinetics400 | false | TCE (ResNet-50) | 2020-03-21 |
Video Representation Learning by Dense Predictive Coding | ✓ Link | 35.7 | Kinetics400 | false | DPC (Modified 3D ResNet-34) | 2019-09-10 |
Video Representation Learning by Dense Predictive Coding | ✓ Link | 34.5 | Kinetics400 | false | DPC (Modified 3D ResNet-18) | 2019-09-10 |
Temporally Coherent Embeddings for Self-Supervised Video Representation Learning | ✓ Link | 34.2 | Kinetics400 | false | TCE (ResNet-18) | 2020-03-21 |
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction | | 33.7 | Kinetics400 | false | 3D RotNet (3D ResNet-18) | 2018-11-28 |
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles | | 33.7 | Kinetics400 | false | 3D Cubic Puzzles (3D ResNet-18) | 2018-11-24 |
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning | ✓ Link | 31.5 | UCF101 | false | VCP (R3D) | 2020-01-02 |
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction | | 29.5 | UCF101 | false | Video Clip Ordering (R3D) | 2019-06-01 |
Unsupervised Representation Learning by Sorting Sequences | ✓ Link | 23.8 | UCF101 | false | OPN (VGG-M-2048) | 2017-08-03 |
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics | ✓ Link | 20.3 | UCF101 | false | Motion & Appearance (C3D) | 2019-04-07 |
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification | | 19.8 | UCF101 | false | Shuffle and Learn (AlexNet) | 2016-03-28 |