Paper | Code | Accuracy | Params | Top-1 Accuracy | Params | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 91.78% | | | | Baseline (ViT-G/14) | 2022-03-10 |
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond | ✓ Link | 91.2% | 644M | | | ViTAE-H (MAE, 512) | 2022-02-21 |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 91.20% | 1843M | | | Model soups (ViT-G/14) | 2022-03-10 |
Meta Pseudo Labels | ✓ Link | 91.12% | | | | Meta Pseudo Labels (EfficientNet-B6-Wide) | 2020-03-23 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 91.1% | | | | MAWS (ViT-6.5B) | 2023-03-23 |
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ Link | 91.05% | 460M | | | TokenLearner L/8 (24+11) | 2021-06-21 |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 91.03% | 2440M | | | Model soups (BASIC-L) | 2022-03-10 |
Meta Pseudo Labels | ✓ Link | 91.02% | | | | Meta Pseudo Labels (EfficientNet-L2) | 2020-03-23 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 90.9% | | | | MAWS (ViT-2B) | 2023-03-23 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 90.9% | 480M | | | FixEfficientNet-L2 | 2020-03-18 |
Scaling Vision Transformers | ✓ Link | 90.81% | | | | ViT-G/14 | 2021-06-08 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 90.8% | | | | MAWS (ViT-H) | 2023-03-23 |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 90.7% | | | | SWAG (RegNetY 128GF) | 2022-01-20 |
CvT: Introducing Convolutions to Vision Transformers | ✓ Link | 90.6% | | 87.7% | 277M | CvT-W24 (384 res, ImageNet-22k pretrain) | 2021-03-29 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 90.6% | | | | VOLO-D5 | 2021-06-24 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 90.55% | 480M | | | EfficientNet-L2 | 2019-11-11 |
Big Transfer (BiT): General Visual Representation Learning | ✓ Link | 90.54% | 928M | | | BiT-L | 2019-12-24 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 90.5% | | | | VOLO-D4 | 2021-06-24 |
Going deeper with Image Transformers | ✓ Link | 90.2% | | | | CAIT-M36-448 | 2021-03-31 |
MLP-Mixer: An all-MLP Architecture for Vision | ✓ Link | 90.18% | 409M | | | Mixer-H/14-448 (JFT-300M pre-train) | 2021-05-04 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 90.0% | 87M | | | FixEfficientNet-B8 | 2020-03-18 |
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 89.8% | 10000M | | | SEER (RegNet10B) | 2022-02-16 |
Fixing the train-test resolution discrepancy | ✓ Link | 89.73% | 829M | | | FixResNeXt-101 32x48d | 2019-06-14 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 89.3% | 86M | | | DeiT-B-384 | 2020-12-23 |
Big Transfer (BiT): General Visual Representation Learning | ✓ Link | 89.02% | | | | BiT-M | 2019-12-24 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 88.7% | 86M | | | DeiT-B | 2020-12-23 |
Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network | ✓ Link | 88.65% | | | | Assemble-ResNet152 | 2020-01-17 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 88.1% | | | | CeiT-S (384 finetune res) | 2021-03-22 |
Sequencer: Deep LSTM for Image Classification | ✓ Link | 87.9% | | | | Sequencer2D-L | 2022-05-04 |
MLP-Mixer: An all-MLP Architecture for Vision | ✓ Link | 87.86% | 409M | | | Mixer-H/14 (JFT-300M pre-train) | 2021-05-04 |
Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network | ✓ Link | 87.82% | | | | Assemble-ResNet50 | 2020-01-17 |
Learning Transferable Architectures for Scalable Image Recognition | ✓ Link | 87.56% | | | | NASNet-A Large | 2017-07-21 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 87.5% | | | | LeViT-384 | 2021-04-02 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 87.3% | | | | CeiT-S | 2021-03-22 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 86.9% | | | | LeViT-256 | 2021-04-02 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 86.8% | 22M | | | DeiT-S | 2020-12-23 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 86.4% | | | | ResNet-152x2-SAM | 2021-06-03 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 85.8% | | | | LeViT-192 | 2021-04-02 |
ResNet strikes back: An improved training procedure in timm | ✓ Link | 85.7% | 25M | | | ResNet50 (A1) | 2021-10-01 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 85.6% | | | | LeViT-128 | 2021-04-02 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 85.6% | 45M | | | ResMLP-36 | 2021-05-07 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 85.3% | 30M | | | ResMLP-24 | 2021-05-07 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 85.2% | | | | ViT-B/16-SAM | 2021-06-03 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 84.6% | 15M | | | ResMLP-12 | 2021-05-07 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 84.4% | | | | Mixer-B/8-SAM | 2021-06-03 |
Revisiting a kNN-based Image Classification System with High-capacity Storage | | 84% | | | | kNN-CLIP | 2022-04-03 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 83.6% | | | | CeiT-T | 2021-03-22 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 82.6% | | | | LeViT-128S | 2021-04-02 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 82.1% | 5M | | | DeiT-Ti | 2020-12-23 |
Learning Transferable Architectures for Scalable Image Recognition | ✓ Link | 81.15% | | | | NASNet-A Mobile | 2017-07-21 |
Very Deep Convolutional Networks for Large-Scale Image Recognition | ✓ Link | 80.60% | | | | VGG-16 BN | 2014-09-04 |
Very Deep Convolutional Networks for Large-Scale Image Recognition | ✓ Link | 79.01% | | | | VGG-16 | 2014-09-04 |
ImageNet Classification with Deep Convolutional Neural Networks | ✓ Link | 62.88% | | | | AlexNet | 2012-12-01 |
DeiT III: Revenge of the ViT | ✓ Link | | | 87.7% | 304M | ViT-L @384 (DeiT III, 21k) | 2022-04-14 |
DeiT III: Revenge of the ViT | ✓ Link | | | 87.2% | 632M | ViT-H @224 (DeiT III, 21k) | 2022-04-14 |
DeiT III: Revenge of the ViT | ✓ Link | | | 87.0% | | ViT-L @224 (DeiT III, 21k) | 2022-04-14 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | | | 84.4% | | ResMLP-B24/8 (22k) | 2021-05-07 |
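Several of the top rows above cite the model-soups recipe, whose title summarizes the whole idea: averaging the weights of multiple fine-tuned models yields a single model, so accuracy improves with no extra inference cost. A minimal sketch of that uniform averaging is below, using plain NumPy arrays as stand-in checkpoints; the function and variable names are illustrative, not taken from the paper's released code.

```python
# Illustrative sketch of a uniform "model soup": element-wise averaging of
# parameter tensors across several fine-tuned checkpoints. Real checkpoints
# would be state dicts from a framework such as PyTorch; toy dicts of NumPy
# arrays stand in for them here.
import numpy as np

def uniform_soup(checkpoints):
    """Average each named parameter tensor across all checkpoints."""
    n = len(checkpoints)
    return {
        name: sum(ckpt[name] for ckpt in checkpoints) / n
        for name in checkpoints[0]
    }

# Three toy "fine-tuned" checkpoints, each with one 2x2 weight matrix.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(1, 4)]
soup = uniform_soup(ckpts)
print(soup["w"])  # every entry is (1 + 2 + 3) / 3 = 2.0
```

The paper's stronger "greedy soup" variant extends this by adding checkpoints to the average one at a time, keeping each only if held-out accuracy improves.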