Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 88.5 | | Hiera-H (448px) | 2023-06-01 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 88.3 | | MAE (ViT-H, 448) | 2021-11-11 |
Grafit: Learning fine-grained image representations with coarse labels | | 84.1 | | Grafit (RegNetY-8GF) | 2020-11-25 |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | 83.9 | | MixMIM-L | 2022-05-26 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 83.7 | 186M | RDNet-L (224 res, IN-1K pretrained) | 2024-03-28 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 83.5 | 87M | RDNet-B (224 res, IN-1K pretrained) | 2024-03-28 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 82.9 | 50M | RDNet-S (224 res, IN-1K pretrained) | 2024-03-28 |
Conviformers: Convolutionally guided Vision Transformer | ✓ Link | 82.85 | | Conviformer-B | 2022-08-17 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 82.7 | | CeiT-S (384 finetune resolution) | 2021-03-22 |
Going deeper with Image Transformers | ✓ Link | 81.8 | | CaiT-M-36 U 224 | 2021-03-31 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 81.2 | 24M | RDNet-T (224 res, IN-1K pretrained) | 2024-03-28 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 78.9 | | CeiT-S | 2021-03-22 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 77.9 | | CeiT-T (384 finetune resolution) | 2021-03-22 |
ResNet strikes back: An improved training procedure in timm | ✓ Link | 75.0 | | ResNet50 (A2) | 2021-10-01 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 74.3 | | LeViT-384 | 2021-04-02 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 72.8 | | CeiT-T | 2021-03-22 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 72.5 | | ResMLP-24 | 2021-05-07 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 72.3 | | LeViT-256 | 2021-04-02 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 71.0 | | ResMLP-12 | 2021-05-07 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 70.8 | | LeViT-192 | 2021-04-02 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 68.4 | | LeViT-128 | 2021-04-02 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 66.5 | | LeViT-128S | 2021-04-02 |