Paper | Code | Top-1 Accuracy | Params | Model | Date |
--- | --- | --- | --- | --- | --- |
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 88.9% | 1100M | DINOv2 (ViT-g/14, 448) | 2023-04-14 |
Improving Visual Representation Learning through Perceptual Understanding | ✓ Link | 88.6% | 307M | PercMAE (ViT-L, dVAE) | 2022-12-30 |
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 88.5% | 1100M | DINOv2 (ViT-g/14) | 2023-04-14 |
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers | ✓ Link | 88.3% | 632M | PeCo (ViT-H/14, 448) | 2021-11-24 |
Improving Visual Representation Learning through Perceptual Understanding | ✓ Link | 88.1% | 307M | PercMAE (ViT-L) | 2022-12-30 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 88.0% | 632M | dBOT (ViT-H/14) | 2022-09-08 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 87.8% | 632M | MAE (ViT-H/14, 448) | 2021-11-11 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 87.8% | 307M | iBOT (ViT-L/16, 512) | 2021-11-15 |
Masking meets Supervision: A Strong Learning Alliance | ✓ Link | 87.2% | 632M | MAE + AugSub finetune (ViT-H/14) | 2023-06-20 |
SimMIM: A Simple Framework for Masked Image Modeling | ✓ Link | 87.1% | 658M | SimMIM (SwinV2-H, 512) | 2021-11-18 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 86.9% | | MAE (ViT-H/14) | 2021-11-11 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 86.6% | 307M | iBOT (ViT-L/16) | 2021-11-15 |
Towards Sustainable Self-supervised Learning | ✓ Link | 86.5% | | TEC_MAE (ViT-L/16, 224) | 2022-10-20 |
BEiT: BERT Pre-Training of Image Transformers | ✓ Link | 86.3% | 307M | BEiT-L (ViT) | 2021-06-15 |
Context Autoencoder for Self-Supervised Representation Learning | ✓ Link | 86.3% | 307M | CAE (ViT-L/16) | 2022-02-07 |
Masked Image Residual Learning for Scaling Deeper Vision Transformers | ✓ Link | 86.2% | 341M | MIRL (ViT-B-48) | 2023-09-25 |
Masking meets Supervision: A Strong Learning Alliance | ✓ Link | 86.1% | 304M | MAE + AugSub finetune (ViT-L/16) | 2023-06-20 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 86.0% | 198M | SparK (ConvNeXt-Large, 384) | 2023-01-09 |
Bootstrapped Masked Autoencoders for Vision BERT Pretraining | ✓ Link | 85.9% | 307M | BootMAE (ViT-L) | 2022-07-14 |
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 85.8% | 10000M | SEER (Regnet10B) | 2022-02-16 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 85.7% | 307M | MaskFeat (ViT-L) | 2021-12-16 |
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ✓ Link | 85.6% | 473M | OFA (Large) | 2022-02-07 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 85.4% | 198M | SparK (ConvNeXt-Large) | 2023-01-09 |
SimMIM: A Simple Framework for Masked Image Modeling | ✓ Link | 85.4% | 197M | SimMIM (Swin-L) | 2021-11-18 |
Mugs: A Multi-Granular Self-Supervised Learning Framework | ✓ Link | 85.2% | 307M | Mugs (ViT-L/16) | 2022-03-27 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 84.8% | 307M | iBOT (ViT-L/16) | 2021-11-15 |
Masked Image Residual Learning for Scaling Deeper Vision Transformers | ✓ Link | 84.8% | 96M | MIRL (ViT-S-54) | 2023-09-25 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 84.8% | 89M | ConvNeXt-Base (SparK pre-training) | 2023-01-09 |
BEiT: BERT Pre-Training of Image Transformers | ✓ Link | 84.6% | 86M | BEiT-B (ViT) | 2021-06-15 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 84.5% | | A2MIM+ (ViT-B) | 2022-05-27 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 84.4% | 85M | iBOT (ViT-B/16) | 2021-11-15 |
Mugs: A Multi-Granular Self-Supervised Learning Framework | ✓ Link | 84.3% | 85M | Mugs (ViT-B/16) | 2022-03-27 |
Self-supervised Pretraining of Visual Features in the Wild | ✓ Link | 84.2% | 1300M | SEER (RegNetY-256GF) | 2021-03-02 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 84.2% | | A2MIM (ViT-B) | 2022-05-27 |
An Empirical Study of Training Self-Supervised Vision Transformers | ✓ Link | 84.1% | 304M | MoCo v3 (ViT-L/16) | 2021-04-05 |
mc-BEiT: Multi-choice Discretization for Image BERT Pre-training | ✓ Link | 84.1% | 86M | mc-BEiT (ViT-B/16) | 2022-03-29 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 84.1% | 50M | ConvNeXt-Small (SparK pre-training) | 2023-01-09 |
SimMIM: A Simple Framework for Masked Image Modeling | ✓ Link | 84.0% | 88M | SimMIM (Swin-B) | 2021-11-18 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 84.0% | 85M | iBOT (ViT-B/16) | 2021-11-15 |
Efficient Self-supervised Vision Transformers for Representation Learning | ✓ Link | 83.9% | 87M | EsViT (Swin-B) | 2021-06-17 |
Masking meets Supervision: A Strong Learning Alliance | ✓ Link | 83.9% | 87M | MAE + AugSub finetune (ViT-B/16) | 2023-06-20 |
Self-supervised Pretraining of Visual Features in the Wild | ✓ Link | 83.8% | 693M | SEER (RegNetY-128GF) | 2021-03-02 |
SimMIM: A Simple Framework for Masked Image Modeling | ✓ Link | 83.8% | 85M | SimMIM (ViT-B/16) | 2021-11-18 |
An Empirical Study of Training Self-Supervised Vision Transformers | ✓ Link | 83.2% | 86M | MoCo v3 (ViT-B/16) | 2021-04-05 |
Multiplexed Immunofluorescence Brain Image Analysis Using Self-Supervised Dual-Loss Adaptive Masked Autoencoder | ✓ Link | 83.2% | | DAMA (ViT-B/16) | 2022-05-10 |
Big Self-Supervised Models are Strong Semi-Supervised Learners | ✓ Link | 83.1% | 795M | SimCLRv2 (ResNet-152, 3×+SK) | 2020-06-17 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 83.1% | 65M | ResNet-200 (SparK pre-training) | 2023-01-09 |
Emerging Properties in Self-Supervised Vision Transformers | ✓ Link | 82.8% | 85M | DINO (ViT-B/16) | 2021-04-29 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 82.7% | 60M | ResNet-152 (SparK pre-training) | 2023-01-09 |
Mugs: A Multi-Granular Self-Supervised Learning Framework | ✓ Link | 82.6% | 21M | Mugs (ViT-S/16) | 2022-03-27 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 82.4% | | A2MIM+ (ViT-S) | 2022-05-27 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 82.2% | 44M | ResNet-101 (SparK pre-training) | 2023-01-09 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 82.2% | | A2MIM (ViT-S) | 2022-05-27 |
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | ✓ Link | 82.0% | 193M | SwAV (ResNeXt-101-32x16d) | 2020-06-17 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 80.6% | 26M | ResNet-50 (SparK pre-training) | 2023-01-09 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 80.5% | | A2MIM+ (ResNet-50 RSB-A2) | 2022-05-27 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 80.4% | | A2MIM (ResNet-50 RSB-A2) | 2022-05-27 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 78.9% | | A2MIM+ (ResNet-50 RSB-A3) | 2022-05-27 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 78.8% | | A2MIM (ResNet-50 RSB-A3) | 2022-05-27 |
Divide and Contrast: Self-supervised Learning from Uncurated Data | | 78.2% | | DnC (ResNet-50) | 2021-05-17 |
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | ✓ Link | 77.8% | 182M | SwAV (ResNet-50) | 2020-06-17 |
Momentum Contrast for Unsupervised Visual Representation Learning | ✓ Link | 77.3% | | MoCo (ResNet-50) | 2019-11-13 |
A Simple Framework for Contrastive Learning of Visual Representations | ✓ Link | 77.2% | | SimCLR (ResNet-50) | 2020-02-13 |
Momentum Contrast for Unsupervised Visual Representation Learning | ✓ Link | 77.0% | | MoCo (ResNet-50) | 2019-11-13 |
Unsupervised Pre-Training of Image Features on Non-Curated Data | ✓ Link | 74.9% | 138M | DeeperCluster (VGG16) | 2019-05-03 |