Paper | Code | Accuracy | Model | Date |
--- | --- | --- | --- | --- |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 84.63 | Model soups (BASIC-L) | 2022-03-10 |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 84.3 | ViT-e | 2022-09-14 |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 84.22 | Model soups (ViT-G/14) | 2022-03-10 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 84.00 | SwinV2-G | 2021-11-18 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 84.0 | MAWS (ViT-6.5B) | 2023-03-23 |
Scaling Vision Transformers | ✓ Link | 83.33 | ViT-G/14 | 2021-06-08 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 83.0 | MAWS (ViT-2B) | 2023-03-23 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 81.5 | MOAT-4 (IN-22K pretraining) | 2022-10-04 |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 81.1 | SWAG (ViT H/14) | 2022-01-20 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 80.6 | MOAT-3 (IN-22K pretraining) | 2022-10-04 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 79.3 | MOAT-2 (IN-22K pretraining) | 2022-10-04 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 78.4 | MOAT-1 (IN-22K pretraining) | 2022-10-04 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 78.08 | SwinV2-B | 2021-11-18 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 78.0 | VOLO-D5 | 2021-06-24 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 77.8 | VOLO-D4 | 2021-06-24 |
Going deeper with Image Transformers | ✓ Link | 76.7 | CAIT-M36-448 | 2021-03-31 |
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 76.2 | SEER (RegNet10B) | 2022-02-16 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 74.2 | ResMLP-B24/8 22k | 2021-05-07 |
Three things everyone should know about Vision Transformers | ✓ Link | 73.9 | ViT-B-36x1 | 2022-03-18 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 73.4 | ResMLP-B24/8 | 2021-05-07 |
Sequencer: Deep LSTM for Image Classification | ✓ Link | 73.4 | Sequencer2D-L | 2022-05-04 |
Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models | ✓ Link | 71.7 | Discrete Adversarial Distillation (ViT-B, 224) | 2023-11-02 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 71.4 | LeViT-384 | 2021-04-02 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 69.9 | LeViT-256 | 2021-04-02 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 69.8 | ResMLP-S24/16 | 2021-05-07 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 69.6 | ResNet-152x2-SAM | 2021-06-03 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 68.7 | LeViT-192 | 2021-04-02 |
ResNet strikes back: An improved training procedure in timm | ✓ Link | 68.7 | ResNet50 (A1) | 2021-10-01 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 67.5 | LeViT-128 | 2021-04-02 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 67.5 | ViT-B/16-SAM | 2021-06-03 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 66.0 | ResMLP-S12/16 | 2021-05-07 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 65.5 | Mixer-B/8-SAM | 2021-06-03 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 63.9 | LeViT-128S | 2021-04-02 |
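The top-ranked entries come from model soups, whose title already summarizes the technique: average the weights of several fine-tuned models of the same architecture, instead of ensembling their predictions, so inference cost stays that of a single model. A minimal sketch of the uniform-soup idea, with toy dicts of floats standing in for real framework state dicts (the `uniform_soup` name and data layout are illustrative assumptions, not the paper's API):

```python
def uniform_soup(state_dicts):
    """Average a list of same-architecture state dicts parameter-by-parameter."""
    if not state_dicts:
        raise ValueError("need at least one model")
    n = len(state_dicts)
    soup = {}
    for key in state_dicts[0]:
        # Element-wise mean across models for this parameter tensor.
        soup[key] = [sum(vals) / n for vals in zip(*(sd[key] for sd in state_dicts))]
    return soup

# Two toy "fine-tuned models" sharing the same keys and shapes.
m1 = {"w": [1.0, 2.0], "b": [0.0]}
m2 = {"w": [3.0, 4.0], "b": [2.0]}

print(uniform_soup([m1, m2]))  # {'w': [2.0, 3.0], 'b': [1.0]}
```

With real models the same loop runs over `model.state_dict()` tensors; the averaged dict is then loaded back into a single network for standard inference.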