Paper | Code | Top-1 | Top-5 | Params | Model | Date
Vision Transformers Need Registers | ✓ Link | 87.1% | | 1100M | DINOv2+reg (ViT-g/14) | 2023-09-28
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 86.7% | | 1100M | DINOv2 (ViT-g/14 @448) | 2023-04-14 |
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 86.5% | | 1100M | DINOv2 (ViT-g/14) | 2023-04-14 |
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 86.3% | | 307M | DINOv2 distilled (ViT-L/14) | 2023-04-14 |
MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations | ✓ Link | 84.7% | | 632M | MIM-Refiner (D2V2-ViT-H/14) | 2024-02-15 |
MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations | ✓ Link | 84.5% | | 1890M | MIM-Refiner (MAE-ViT-2B/14) | 2024-02-15 |
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 84.5% | | 85M | DINOv2 distilled (ViT-B/14) | 2023-04-14 |
MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations | ✓ Link | 83.7% | | 632M | MIM-Refiner (MAE-ViT-H/14) | 2024-02-15
MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations | ✓ Link | 83.5% | | 307M | MIM-Refiner (D2V2-ViT-L/16) | 2024-02-15 |
MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations | ✓ Link | 82.8% | | 307M | MIM-Refiner (MAE-ViT-L/16) | 2024-02-15 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 82.3% | | 307M | iBOT (ViT-L/16) (IN22k) | 2021-11-15 |
Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget | ✓ Link | 82.2% | | 632M | MAE-CT (ViT-H/16) | 2023-04-20 |
Mugs: A Multi-Granular Self-Supervised Learning Framework | ✓ Link | 82.1% | | 307M | Mugs (ViT-L/16) | 2022-03-27
Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget | ✓ Link | 81.5% | | 307M | MAE-CT (ViT-L/16) | 2023-04-20
Efficient Self-supervised Vision Transformers for Representation Learning | ✓ Link | 81.3% | 95.5% | 87M | EsViT (Swin-B) | 2021-06-17
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 81.3% | | 307M | iBOT (ViT-L/16) | 2021-11-15 |
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 81.1% | | 21M | DINOv2 distilled (ViT-S/14) | 2023-04-14 |
An Empirical Study of Training Self-Supervised Vision Transformers | ✓ Link | 81.0% | | 304M | MoCo v3 (ViT-BN-L/7) | 2021-04-05 |
Efficient Self-supervised Vision Transformers for Representation Learning | ✓ Link | 80.8% | | 49M | EsViT (Swin-S) | 2021-06-17
Masked Siamese Networks for Label-Efficient Learning | ✓ Link | 80.7% | | 306M | MSN (ViT-L/7) | 2022-04-14 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 80.6% | | 250M | ReLICv2 (ResNet-200 x2) | 2022-01-13 |
Masked Reconstruction Contrastive Learning with Information Bottleneck Principle | | 80.4% | | | MR BarTwins | 2022-11-15
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective | ✓ Link | 80.3% | | 732M | DiGIT | 2024-10-16 |
DINO as a von Mises-Fisher mixture model | | 80.3% | | 85M | iBOT-vMF (ViT-B/16) | 2024-05-17 |
Emerging Properties in Self-Supervised Vision Transformers | ✓ Link | 80.3% | | 84M | DINO (xcit_medium_24_p8) | 2021-04-29 |
Perceptual Group Tokenizer: Building Perception with Iterative Grouping | | 80.3% | | 70M | PGT (PGT-B w/ Flow) | 2023-11-30 |
Emerging Properties in Self-Supervised Vision Transformers | ✓ Link | 80.1% | | 80M | DINO (ViT-B/8) | 2021-04-29 |
Big Self-Supervised Models are Strong Semi-Supervised Learners | ✓ Link | 79.8% | 94.9% | 795M | SimCLRv2 (ResNet-152 x3, SK) | 2020-06-17 |
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 79.8% | | 10000M | SEERv2 | 2022-02-16 |
Improving Visual Representation Learning through Perceptual Understanding | ✓ Link | 79.8% | | 80M | PercMAE (ViT-B, dVAE) | 2022-12-30 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 79.8% | | 63M | ReLICv2 (ResNet200) | 2022-01-13 |
Emerging Properties in Self-Supervised Vision Transformers | ✓ Link | 79.7% | | 21M | DINO (ViT-S/8) | 2021-04-29 |
Bootstrap your own latent: A new approach to self-supervised Learning | ✓ Link | 79.6% | 94.8% | 250M | BYOL (ResNet-200 x2) | 2020-06-13 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 79.4% | | 375M | ReLICv2 (ResNet-50 4x) | 2022-01-13 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 79.3% | | 58M | ReLICv2 (ResNet152) | 2022-01-13 |
Unsupervised Representation Learning by Balanced Self Attention Matching | ✓ Link | 79.3% | | | BAM (CAFormer-M36) | 2024-08-04 |
An Empirical Study of Training Self-Supervised Vision Transformers | ✓ Link | 79.1% | | 700M | MoCo v3 (ViT-BN-H) | 2021-04-05 |
Unicom: Universal and Compact Representation Learning for Image Retrieval | ✓ Link | 79.1% | | 80M | Unicom (ViT-B/16) | 2023-04-12 |
Unsupervised Visual Representation Learning by Synchronous Momentum Grouping | ✓ Link | 79.0% | 94.4% | 375M | SMoG (ResNet-50 x4) | 2022-07-13
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 79.0% | | 94M | ReLICv2 (ResNet-50 x2) | 2022-01-13
Compressive Visual Representations | ✓ Link | 78.8% | 94.5% | 94M | C-BYOL (ResNet-50 2x, 1000 epochs) | 2021-09-27 |
DINO as a von Mises-Fisher mixture model | | 78.8% | | 85M | DINO-vMF (ViT-B/16) | 2024-05-17 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 78.7% | | 44M | ReLICv2 (ResNet101) | 2022-01-13 |
Bootstrap your own latent: A new approach to self-supervised Learning | ✓ Link | 78.6% | 94.2% | 375M | BYOL (ResNet-50 x4) | 2020-06-13 |
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | ✓ Link | 78.5% | | 586M | SwAV (ResNet-50 x5) | 2020-06-17 |
Emerging Properties in Self-Supervised Vision Transformers | ✓ Link | 78.2% | | 85M | DINO (ViT-B/16) | 2021-04-29 |
An Empirical Study of Training Self-Supervised Vision Transformers | ✓ Link | 78.1% | | 632M | MoCo v3 (ViT-H) | 2021-04-05 |
Improving Visual Representation Learning through Perceptual Understanding | ✓ Link | 78.1% | | 80M | PercMAE (ViT-B) | 2022-12-30 |
Unsupervised Representation Learning by Balanced Self Attention Matching | ✓ Link | 78.1% | | 80M | BAM (ViT-B/16) | 2024-08-04 |
Unsupervised Visual Representation Learning by Synchronous Momentum Grouping | ✓ Link | 78.0% | 93.9% | 94M | SMoG (ResNet-50 x2) | 2022-07-13
An Empirical Study of Training Self-Supervised Vision Transformers | ✓ Link | 77.6% | | 307M | MoCo v3 (ViT-L) | 2021-04-05 |
Self-supervised Pretraining of Visual Features in the Wild | ✓ Link | 77.5% | | 1300M | SEER | 2021-03-02 |
Bootstrap your own latent: A new approach to self-supervised Learning | ✓ Link | 77.4% | 93.6% | 94M | BYOL (ResNet-50 x2) | 2020-06-13 |
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | ✓ Link | 77.3% | | 94M | SwAV (ResNet-50 x2) | 2020-06-17 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 77.1% | | 25M | ReLICv2 (ResNet-50) | 2022-01-13 |
Emerging Properties in Self-Supervised Vision Transformers | ✓ Link | 77.0% | | 21M | DINO (ViT-S/16) | 2021-04-29 |
DINO as a von Mises-Fisher mixture model | | 77.0% | | 21M | DINO-vMF (ViT-S/16) | 2024-05-17 |
An Empirical Study of Training Self-Supervised Vision Transformers | ✓ Link | 76.7% | | 86M | MoCo v3 (ViT-B/16) | 2021-04-05 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 76.6% | | 700M | MAE (ViT-H) | 2021-11-11 |
A Simple Framework for Contrastive Learning of Visual Representations | ✓ Link | 76.5% | 93.2% | 375M | SimCLR (ResNet-50 4x) | 2020-02-13 |
Unsupervised Visual Representation Learning by Online Constrained K-Means | ✓ Link | 76.4% | | 25M | CoKe (ResNet-50) | 2021-05-24 |
Unsupervised Visual Representation Learning by Synchronous Momentum Grouping | ✓ Link | 76.4% | | 25M | SMoG (ResNet-50) | 2022-07-13 |
Weak Augmentation Guided Relational Self-Supervised Learning | ✓ Link | 76.3% | | 24M | ReSSL (ResNet-50 w/ Predictor and Stronger Aug) | 2022-03-16 |
Weak Augmentation Guided Relational Self-Supervised Learning | ✓ Link | 76.0% | | 24M | ReSSL (ResNet-50 w/ Predictor) | 2022-03-16 |
Solving Inefficiency of Self-supervised Representation Learning | ✓ Link | 75.9% | | 23.56M | Triplet (ResNet-50) | 2021-04-18 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 75.8% | | 306M | MAE (ViT-L) | 2021-11-11 |
Divide and Contrast: Self-supervised Learning from Uncurated Data | | 75.8% | | 24M | DnC (ResNet-50) | 2021-05-17 |
CaCo: Both Positive and Negative Samples are Directly Learnable via Cooperative-adversarial Contrastive Learning | ✓ Link | 75.7% | | 24M | CaCo (ResNet-50) | 2022-03-27 |
Big Self-Supervised Models are Strong Semi-Supervised Learners | ✓ Link | 75.6% | 92.7% | 94M | SimCLRv2 (ResNet-50 x2) | 2020-06-17 |
Compressive Visual Representations | ✓ Link | 75.6% | 92.7% | 25M | C-BYOL (ResNet-50, 1000 epochs) | 2021-09-27 |
With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations | ✓ Link | 75.6% | 92.4% | 25M | NNCLR (ResNet-50, multi-crop) | 2021-04-29
Self-supervised Pre-training with Hard Examples Improves Visual Representations | | 75.5% | | 24M | HEXA | 2020-12-25 |
Similarity Contrastive Estimation for Self-Supervised Soft Contrastive Learning | ✓ Link | 75.4% | | 24M | SCE (ResNet-50, multi-crop) | 2021-11-29 |
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | ✓ Link | 75.3% | | 24M | SwAV (ResNet-50) | 2020-06-17 |
Emerging Properties in Self-Supervised Vision Transformers | ✓ Link | 75.3% | | 24M | DINO (ResNet-50) | 2021-04-29 |
What Makes for Good Views for Contrastive Learning? | ✓ Link | 75.2% | | 120M | InfoMin (ResNeXt-152) | 2020-05-20 |
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | ✓ Link | 75.2% | | 24M | DeepCluster-v2 (ResNet-50) | 2020-06-17 |
Unicom: Universal and Compact Representation Learning for Image Retrieval | ✓ Link | 75.0% | | 80M | Unicom (ViT-B/32) | 2023-04-12 |
Self-Supervised Learning with Swin Transformers | ✓ Link | 75.0% | | 29M | MoBY (Swin-T) | 2021-05-10
Representation Learning via Invariant Causal Mechanisms | ✓ Link | 74.8% | | 24M | ReLIC (ResNet-50) | 2020-10-15 |
ReSSL: Relational Self-Supervised Learning with Weak Augmentation | ✓ Link | 74.7% | 92.3% | 24M | ReSSL (ResNet-50) 200ep | 2021-07-20
Weakly Supervised Contrastive Learning | ✓ Link | 74.7% | | 24M | WCL (ResNet-50) | 2021-10-10 |
MV-MR: multi-views and multi-representations for self-supervised learning and knowledge distillation | ✓ Link | 74.5% | 92.1% | | MV-MR | 2023-03-21
Boosting Contrastive Self-Supervised Learning with False Negative Cancellation | ✓ Link | 74.4% | 91.8% | 24M | FNC (ResNet-50) | 2020-11-23 |
Bootstrap your own latent: A new approach to self-supervised Learning | ✓ Link | 74.3% | 91.6% | 24M | BYOL (ResNet-50) | 2020-06-13 |
A Simple Framework for Contrastive Learning of Visual Representations | ✓ Link | 74.2% | 92.0% | 94M | SimCLR (ResNet-50 2x) | 2020-02-13 |
Self-Supervised Classification Network | ✓ Link | 74.2% | | 24M | Self-Classifier (ResNet-50) | 2021-03-19 |
Learning by Sorting: Self-supervised Learning with Group Ordering Constraints | ✓ Link | 73.9% | 91.6% | 25M | GroCo (ResNet-50) | 2023-01-05
OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning | ✓ Link | 73.8% | 92.2% | 24M | OBoW (ResNet-50) | 2020-12-21 |
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning | ✓ Link | 73.2% | 91.1% | 24M | VICReg (ResNet50) | 2021-05-11
Barlow Twins: Self-Supervised Learning via Redundancy Reduction | ✓ Link | 73.2% | 91.0% | 24M | Barlow Twins (ResNet-50) | 2021-03-04
What Makes for Good Views for Contrastive Learning? | ✓ Link | 73.0% | 91.1% | 24M | InfoMin (ResNet-50) | 2020-05-20 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 72.8% | | 30M | DINO (ResMLP-24) | 2021-05-07 |
Self-Supervised Learning with Swin Transformers | ✓ Link | 72.8% | | 22M | MoBY (DeiT-S) | 2021-05-10 |
VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution | ✓ Link | 72.1% | 91.0% | 25M | I-VNE+ (ResNet-50) | 2023-04-04
Generative Pretraining from Pixels | ✓ Link | 72.0% | | 6801M | iGPT-XL (64x64, 15360 features) | 2020-07-17 |
Big Self-Supervised Models are Strong Semi-Supervised Learners | ✓ Link | 71.7% | 90.4% | 24M | SimCLRv2 (ResNet-50) | 2020-06-17 |
Data-Efficient Image Recognition with Contrastive Predictive Coding | ✓ Link | 71.5% | 90.1% | 305M | CPC v2 (ResNet-161) (arxiv v2) | 2019-05-22 |
Exploring Simple Siamese Representation Learning | ✓ Link | 71.3% | | 24M | SimSiam (ResNet-50) | 2020-11-20 |
Improved Baselines with Momentum Contrastive Learning | ✓ Link | 71.1% | 90.1% | 24M | MoCo v2 (ResNet-50) | 2020-03-09 |
SynCo: Synthetic Hard Negatives in Contrastive Learning for Better Unsupervised Visual Representations | ✓ Link | 70.6% | 89.8% | 24M | SynCo (ResNet-50) 800ep | 2024-10-03 |
Contrastive Multiview Coding | ✓ Link | 70.6% | 89.7% | 188M | CMC (ResNet-50 x2) (arxiv v5) | 2019-06-13 |
A Simple Framework for Contrastive Learning of Visual Representations | ✓ Link | 69.3% | 89.0% | 24M | SimCLR (ResNet-50) | 2020-02-13 |
Generative Pretraining from Pixels | ✓ Link | 68.7% | | 6800M | iGPT-XL (64x64, 3072 features) | 2020-07-17 |
Momentum Contrast for Unsupervised Visual Representation Learning | ✓ Link | 68.6% | | 375M | MoCo (ResNet-50 4x) | 2019-11-13 |
Learning Representations by Maximizing Mutual Information Across Views | ✓ Link | 68.1% | | 626M | AMDIM (large) (arxiv v2) | 2019-06-03 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 68.0% | | 80M | MAE (ViT-B) | 2021-11-11 |
SynCo: Synthetic Hard Negatives in Contrastive Learning for Better Unsupervised Visual Representations | ✓ Link | 67.9% | 88.0% | 24M | SynCo (ResNet-50) 200ep | 2024-10-03
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 67.5% | | 15M | DINO (ResMLP-12) | 2021-05-07 |
Contrastive Multiview Coding | ✓ Link | 66.2% | 87.0% | 47M | CMC (ResNet-50) (arxiv v5) | 2019-06-13 |
Prototypical Contrastive Learning of Unsupervised Representations | ✓ Link | 65.9% | | 25M | PCL (ResNet-50) | 2020-05-11 |
Momentum Contrast for Unsupervised Visual Representation Learning | ✓ Link | 65.4% | | 94M | MoCo (ResNet-50 2x) | 2019-11-13 |
Generative Pretraining from Pixels | ✓ Link | 65.2% | | 1400M | iGPT-L (48x48) | 2020-07-17 |
Contrastive Multiview Coding | ✓ Link | 65.0% | 86.0% | | CMC (ResNet-101) (arxiv v3) | 2019-06-13 |
Data-Efficient Image Recognition with Contrastive Predictive Coding | ✓ Link | 63.8% | 85.3% | 24M | CPC v2 (ResNet-50) (arxiv v2) | 2019-05-22 |
Max-Margin Contrastive Learning | ✓ Link | 63.8% | | | MMCL (100 epoch, 256 batch size) | 2021-12-21 |
Self-Supervised Learning of Pretext-Invariant Representations | ✓ Link | 63.6% | | 24M | PIRL | 2019-12-04 |
Learning Representations by Maximizing Mutual Information Across Views | ✓ Link | 63.5% | | 194M | AMDIM (small) (arxiv v2) | 2019-06-03 |
Self-labelling via simultaneous clustering and representation learning | ✓ Link | 61.5% | 84.0% | 24M | SeLa (ResNet50) (arxiv v3) | 2019-11-13
Large Scale Adversarial Representation Learning | ✓ Link | 61.3% | 81.9% | 86M | BigBiGAN (RevNet-50 ×4, BN+CReLU) | 2019-07-04 |
Data-Efficient Image Recognition with Contrastive Predictive Coding | ✓ Link | 61.0% | 83.0% | 305M | CPC v2 (ResNet-161) (arxiv v1) | 2019-05-22 |
Large Scale Adversarial Representation Learning | ✓ Link | 60.8% | 81.4% | 86M | BigBiGAN (RevNet-50 ×4) | 2019-07-04 |
Momentum Contrast for Unsupervised Visual Representation Learning | ✓ Link | 60.6% | | 24M | MoCo (ResNet-50) | 2019-11-13 |
Generative Pretraining from Pixels | ✓ Link | 60.3% | | 1400M | iGPT-L (32x32) | 2020-07-17 |
Learning Representations by Maximizing Mutual Information Across Views | ✓ Link | 60.2% | | 337M | AMDIM (arxiv v1) | 2019-06-03 |
Local Aggregation for Unsupervised Learning of Visual Embeddings | ✓ Link | 60.2% | | 24M | LocalAgg (ResNet-50) | 2019-03-29 |
Contrastive Multiview Coding | ✓ Link | 60.1% | 82.8% | 44M | CMC (ResNet-101) | 2019-06-13 |
Large Scale Adversarial Representation Learning | ✓ Link | 56.6% | 78.6% | 24M | BigBiGAN (ResNet-50, BN+CReLU) | 2019-07-04 |
Self-labelling via simultaneous clustering and representation learning | ✓ Link | 55.7% | 79.5% | 24M | SeLa (ResNet50) | 2019-11-13 |
Revisiting Self-Supervised Visual Representation Learning | ✓ Link | 55.4% | 77.9% | 86M | Revisited Rotation (RevNet-50 ×4) | 2019-01-25 |
Large Scale Adversarial Representation Learning | ✓ Link | 55.4% | 77.4% | 25M | BigBiGAN (ResNet-50) | 2019-07-04 |
Revisiting Self-Supervised Visual Representation Learning | ✓ Link | 51.4% | 74.0% | 94M | Revisited Rel.Patch.Loc (ResNet50 ×2) | 2019-01-25 |
Self-labelling via simultaneous clustering and representation learning | ✓ Link | 50.0% | | 61M | SeLa (AlexNet) (arxiv v3) | 2019-11-13 |
Representation Learning with Contrastive Predictive Coding | ✓ Link | 48.7% | 73.6% | 44M | CPC (ResNet-101 V2) | 2018-07-10 |
Revisiting Self-Supervised Visual Representation Learning | ✓ Link | 46.0% | 68.8% | 211M | Revisited Exemplar (ResNet-50 ×3) | 2019-01-25 |
Revisiting Self-Supervised Visual Representation Learning | ✓ Link | 44.6% | 68.0% | 94M | Revisited Jigsaw (ResNet50 ×2) | 2019-01-25 |
Contrastive Multiview Coding | ✓ Link | 42.6% | | 30M | CMC (Alexnet/2) | 2019-06-13 |
Deep Clustering for Unsupervised Learning of Visual Features | ✓ Link | 41.0% | | 61M | DeepCluster (AlexNet) | 2018-07-15
Multi-task Self-Supervised Visual Learning | | 39.6% | 62.5% | 44M | Colorisation (improved) (ResNet-101) | 2017-08-25
Unsupervised Representation Learning by Predicting Image Rotations | ✓ Link | 38.7% | | 86M | Rotation (AlexNet) | 2018-03-21
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction | ✓ Link | 35.4% | | 61M | Split-Brain (AlexNet) | 2016-11-29 |
Representation Learning by Learning to Count | ✓ Link | 34.3% | | 61M | Counting (AlexNet) | 2017-08-22
Colorful Image Colorization | ✓ Link | 32.6% | | 61M | Colorization (AlexNet) | 2016-03-28 |
Multi-task Self-Supervised Visual Learning | | | 70.2% | 44M | Multi-task SSL (ResNet-101) | 2017-08-25
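The rows above are plain pipe-delimited text, so they are easy to load programmatically. As a minimal sketch (the field names in the record dict are my own labels, not part of the source), two sample rows can be parsed and ranked by top-1 accuracy like this:

```python
# Sketch: parse pipe-delimited leaderboard rows into records and
# sort them by top-1 accuracy. Field names here are illustrative.

def parse_row(line: str) -> dict:
    """Split one 'Paper | Link | Top-1 | Top-5 | Params | Model | Date' row."""
    paper, link, top1, top5, params, model, date = [c.strip() for c in line.split("|")]
    return {
        "paper": paper,
        "code": link,
        # Empty accuracy cells become None; otherwise strip '%' and parse.
        "top1": float(top1.rstrip("%")) if top1 else None,
        "top5": float(top5.rstrip("%")) if top5 else None,
        "params": params,
        "model": model,
        "date": date,
    }

rows = [
    "DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 86.5% | | 1100M | DINOv2 (ViT-g/14) | 2023-04-14",
    "Vision Transformers Need Registers | ✓ Link | 87.1% | | 1100M | DINOv2+reg (ViT-g/14) | 2023-09-28",
]
records = sorted((parse_row(r) for r in rows), key=lambda r: r["top1"], reverse=True)
print(records[0]["model"])  # best-performing entry first
```

This assumes no cell ever contains a literal `|`, which holds for every row in the table.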