OpenCodePapers

Image Classification on ImageNet
Results over time
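Every entry below reports Top-1 (and occasionally Top-5) accuracy. As a minimal sketch of how these headline metrics are computed (plain Python, no framework assumed; `topk_accuracy` is an illustrative helper, not from any entry's codebase), a sample counts as a top-k hit when its true label is among the k highest-scored classes:

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scored classes.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices.
    """
    hits = 0
    for per_class, true_label in zip(scores, labels):
        # Indices of the k largest scores for this sample (stable sort breaks ties).
        topk = sorted(range(len(per_class)),
                      key=lambda i: per_class[i], reverse=True)[:k]
        hits += true_label in topk
    return hits / len(labels)

# Toy 3-class example: only the first sample is a top-1 hit.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
labels = [1, 2, 0]
top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
```

On ImageNet the same computation runs with k=1 and k=5 over 1000 classes, which is why Top-5 numbers are always at least as high as Top-1.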
Leaderboard
Paper | Code | Top-1 Accuracy | Number of params | GFLOPs | Hardware Burden | Top-5 Accuracy | Operations per network pass | Model Name | Release Date
CoCa: Contrastive Captioners are Image-Text Foundation Models✓ Link91.0%2100MCoCa (finetuned)2022-05-04
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time✓ Link90.98%2440MModel soups (BASIC-L)2022-03-10
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time✓ Link90.94%1843MModel soups (ViT-G/14)2022-03-10
DaViT: Dual Attention Vision Transformers✓ Link90.4%1437M1038DaViT-G2022-04-07
DaViT: Dual Attention Vision Transformers✓ Link90.2%362M334DaViT-H2022-04-07
Meta Pseudo Labels✓ Link90.2%480M95040G98.8Meta Pseudo Labels (EfficientNet-L2)2020-03-23
Swin Transformer V2: Scaling Up Capacity and Resolution✓ Link90.17%3000MSwinV2-G2021-11-18
The effectiveness of MAE pre-pretraining for billion-scale pretraining✓ Link90.1%6500MMAWS (ViT-6.5B)2023-03-23
Florence: A New Foundation Model for Computer Vision✓ Link90.05%893M99.02Florence-CoSwin-H2021-11-22
Meta Pseudo Labels✓ Link90%390MMeta Pseudo Labels (EfficientNet-B6-Wide)2020-03-23
Reversible Column Networks✓ Link90.0%2158MRevCol-H2022-12-22
The effectiveness of MAE pre-pretraining for billion-scale pretraining✓ Link89.8%2000MMAWS (ViT-2B)2023-03-23
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale✓ Link89.7%1000MEVA2022-11-14
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information✓ Link89.6%M3I Pre-training (InternImage-H)2022-11-17
Scaling Vision Transformers to 22 Billion Parameters✓ Link89.6%307MViT-L/16 (384res, distilled from ViT-22B)2023-02-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions✓ Link89.6%1080M1478InternImage-H2022-11-10
MaxViT: Multi-Axis Vision Transformer✓ Link89.53%MaxViT-XL (512res, JFT)2022-04-04
Multimodal Autoregressive Pre-training of Large Vision Encoders✓ Link89.5%AIMv2-3B (448 res)2024-11-21
The effectiveness of MAE pre-pretraining for billion-scale pretraining✓ Link89.5%650MMAWS (ViT-H)2023-03-23
MaxViT: Multi-Axis Vision Transformer✓ Link89.41%MaxViT-L (512res, JFT)2022-04-04
MaxViT: Multi-Axis Vision Transformer✓ Link89.36%MaxViT-XL (384res, JFT)2022-04-04
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning89.3%OmniVec22024-01-01
High-Performance Large-Scale Image Recognition Without Normalization✓ Link89.2%527M367NFNet-F4+2021-02-11
MaxViT: Multi-Axis Vision Transformer✓ Link89.12%MaxViT-L (384res, JFT)2022-04-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models✓ Link89.1%483.2M648.5MOAT-4 22K+1K2022-10-04
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation✓ Link89.0%307MFD (CLIP ViT-L-336)2022-05-27
Differentially Private Image Classification from Features✓ Link88.9%Last Layer Tuning with Newton Step (ViT-G/14)2022-11-24
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?✓ Link88.87%460MTokenLearner L/8 (24+11)2021-06-21
MaxViT: Multi-Axis Vision Transformer✓ Link88.82%MaxViT-B (512res, JFT)2022-04-04
The effectiveness of MAE pre-pretraining for billion-scale pretraining✓ Link88.8%MAWS (ViT-L)2023-03-23
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection✓ Link88.8%667M763.5MViTv2-H (512 res, ImageNet-21k pretrain)2021-12-02
MaxViT: Multi-Axis Vision Transformer✓ Link88.7%MaxViT-XL (512res, 21K)2022-04-04
MaxViT: Multi-Axis Vision Transformer✓ Link88.69%MaxViT-B (384res, JFT)2022-04-04
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision✓ Link88.64%480MALIGN (EfficientNet-L2)2021-02-11
Sharpness-Aware Minimization for Efficiently Improving Generalization✓ Link88.61%480MEfficientNet-L2-475 (SAM)2020-10-03
Scaling Vision Transformers to 22 Billion Parameters✓ Link88.6%86MViT-B/162023-02-10
BEiT: BERT Pre-Training of Image Transformers✓ Link88.60%331MBEiT-L (ViT; ImageNet-22K pretrain)2021-06-15
Revisiting Weakly Supervised Pre-Training of Visual Perception Models✓ Link88.6%633.5M1018.8SWAG (ViT H/14)2022-01-20
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale✓ Link88.55%ViT-H/142020-10-22
CoAtNet: Marrying Convolution and Attention for All Data Sizes✓ Link88.52%114CoAtNet-3 @3842021-06-09
MaxViT: Multi-Axis Vision Transformer✓ Link88.51%MaxViT-XL (384res, 21K)2022-04-04
Reproducible scaling laws for contrastive language-image learning✓ Link88.5%OpenCLIP ViT-H/142022-12-14
Multimodal Autoregressive Pre-training of Large Vision Encoders✓ Link88.5%AIMv2-3B2024-11-21
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link88.5%480M585FixEfficientNet-L22020-03-18
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond✓ Link88.5%644MViTAE-H + MAE (448)2022-02-21
MaxViT: Multi-Axis Vision Transformer✓ Link88.46%MaxViT-L (512res, 21K)2022-04-04
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection✓ Link88.4%218M140.7MViTv2-L (384 res, ImageNet-21k pretrain)2021-12-02
Self-training with Noisy Student improves ImageNet classification✓ Link88.4%480M51800GNoisyStudent (EfficientNet-L2)2019-11-11
MaxViT: Multi-Axis Vision Transformer✓ Link88.38%MaxViT-B (512res, 21K)2022-04-04
Differentiable Top-k Classification Learning✓ Link88.37%Top-k DiffSortNets (EfficientNet-L2)2022-06-15
A ConvNet for the 2020s✓ Link88.36%1827MAdlik-ViT-SG+Swin_large+Convnext_xlarge(384)2022-01-10
Scaling Vision with Sparse Mixture of Experts✓ Link88.36%7200MV-MoE-H/14 (Every-2)2021-06-10
MaxViT: Multi-Axis Vision Transformer✓ Link88.32%MaxViT-L (384res, 21K)2022-04-04
Unicom: Universal and Compact Representation Learning for Image Retrieval✓ Link88.3%Unicom (ViT-L/14@336px) (Finetuned)2023-04-12
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers✓ Link88.3%PeCo (ViT-H, 448)2021-11-24
Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion✓ Link88.21%DFN-5B H/14-378 + PrefixedIter Decoder2024-07-15
Exploring Target Representations for Masked Autoencoders✓ Link88.2%dBOT ViT-H (CLIP as Teacher)2022-09-08
MambaVision: A Hybrid Mamba-Transformer Vision Backbone✓ Link88.1%489.1MambaVision-L32024-07-10
MetaFormer Baselines for Vision✓ Link88.1%99M72.2CAFormer-B36 (384 res, 21K)2022-10-24
Multimodal Autoregressive Pre-training of Large Vision Encoders✓ Link88.1%1200MAIMv2-1B2024-11-21
Scaling Vision with Sparse Mixture of Experts✓ Link88.08%656MVIT-H/142021-06-10
Co-training $2^L$ Submodels for Visual Recognition✓ Link88.0%ViT-H@224 (cosub)2022-12-09
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link88%UniRepLKNet-XL++2023-11-27
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions✓ Link88%335M163InternImage-XL2022-11-10
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection✓ Link88%667M120.6MViTv2-H (ImageNet-21k pretrain)2021-12-02
MLP-Mixer: An all-MLP Architecture for Vision✓ Link87.94%Mixer-H/14 (JFT-300M pre-train)2021-05-04
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link87.9%UniRepLKNet-L++2023-11-27
Exploring Target Representations for Masked Autoencoders✓ Link87.8%dBOT ViT-L (CLIP as Teacher)2022-09-08
MogaNet: Multi-order Gated Aggregation Network✓ Link87.8%181M102MogaNet-XL (384res)2022-11-07
Visual Attention Network✓ Link87.8%200M114.3VAN-B6 (22K, 384res)2022-02-20
Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs✓ Link87.8%335M128.7RepLKNet-XL2022-03-13
A ConvNet for the 2020s✓ Link87.8%350M179ConvNeXt-XL (ImageNet-22k)2022-01-10
Masked Autoencoders Are Scalable Vision Learners✓ Link87.8%656MMAE (ViT-H, 448)2021-11-11
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale✓ Link87.76%ViT-L/162020-10-22
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions✓ Link87.7%101.8HorNet-L (GF)2022-07-28
CvT: Introducing Convolutions to Vision Transformers✓ Link87.7%CvT-W24 (384 res, ImageNet-22k pretrain)2021-03-29
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions✓ Link87.7%223M108InternImage-L2022-11-10
CoAtNet: Marrying Convolution and Attention for All Data Sizes✓ Link87.6%CoAtNet-3 (21k)2021-06-09
MetaFormer Baselines for Vision✓ Link87.6%100M66.5ConvFormer-B36 (384 res, 21K)2022-10-24
Big Transfer (BiT): General Visual Representation Learning✓ Link87.54%98.46BiT-L (ResNet)2019-12-24
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers✓ Link87.5%PeCo (ViT-H, 224)2021-11-24
Co-training $2^L$ Submodels for Visual Recognition✓ Link87.5%ViT-L@224 (cosub)2022-12-09
MetaFormer Baselines for Vision✓ Link87.5%56M42CAFormer-M36 (384 res, 21K)2022-10-24
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows✓ Link87.5%173M96.8CSWin-L (384 res,ImageNet-22k pretrain)2021-07-01
DaViT: Dual Attention Vision Transformers✓ Link87.5%196.8M103DaViT-L (ImageNet-22k)2022-04-07
Dilated Neighborhood Attention Transformer✓ Link87.5%200M92.4DiNAT-Large (11x11ks; 384res; Pretrained on IN22K@224)2022-09-29
Multimodal Autoregressive Pre-training of Large Vision Encoders✓ Link87.5%600MAIMv2-H2024-11-21
Scaling Vision with Sparse Mixture of Experts✓ Link87.41%3400MV-MoE-L/16 (Every-2)2021-06-10
Dilated Neighborhood Attention Transformer✓ Link87.4%89.7DiNAT-Large (384x384; Pretrained on ImageNet-22K @ 224x224)2022-09-29
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language✓ Link87.4%data2vec 2.02022-12-14
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link87.4%UniRepLKNet-B++2023-11-27
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space✓ Link87.4%HVT Huge2024-09-25
MetaFormer Baselines for Vision✓ Link87.4%99M23.2CAFormer-B36 (224 res, 21K)2022-10-24
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP✓ Link87.4%117M51UniNet-B62022-07-12
Dilated Neighborhood Attention Transformer✓ Link87.4%197M101.5DiNAT_s-Large (384res; Pretrained on IN22K@224)2022-09-29
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows✓ Link87.3%197M103.9Swin-L2021-03-25
EfficientNetV2: Smaller Models and Faster Training✓ Link87.3%208M94EfficientNetV2-XL (21k)2021-04-01
Improving Vision Transformers by Revisiting High-frequency Components✓ Link87.3%295.5M412VOLO-D5+HAT2022-04-03
PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions✓ Link87.2%EfficientNetV2 (PolyLoss)2022-04-26
ELSA: Enhanced Local Self-Attention for Vision Transformer✓ Link87.2%298M437ELSA-VOLO-D5 (512*512)2021-12-23
A Study on Transformer Configuration and Training Objective87.1%Bamboo (Bamboo-H)2022-05-21
Co-training $2^L$ Submodels for Visual Recognition✓ Link87.1%Swin-L@224 (cosub)2022-12-09
CoAtNet: Marrying Convolution and Attention for All Data Sizes✓ Link87.1%CoAtNet-2 (21k)2021-06-09
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link87.1%66M 82FixEfficientNet-B72020-03-18
Understanding The Robustness in Vision Transformers✓ Link87.1%76.8MFAN-L-Hybrid++2022-04-26
Swin Transformer V2: Scaling Up Capacity and Resolution✓ Link87.1%88MSwinV2-B2021-11-18
VOLO: Vision Outlooker for Visual Recognition✓ Link87.1%296M412VOLO-D52021-06-24
Augmenting Convolutional networks with attention-based aggregation✓ Link87.1%334.3MPatchConvNet-L120-21k-3842021-12-27
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?✓ Link87.07%16-TokenLearner B/16 (21)2021-06-21
Enhance the Visual Representation via Discrete Adversarial Training✓ Link87.02%MAE+DAT (ViT-H)2022-09-16
Visual Attention Network✓ Link87%50.6VAN-B5 (22K, 384res)2022-02-20
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP✓ Link87%72.9M20.4UniNet-B52022-07-12
MetaFormer Baselines for Vision✓ Link87.0%100M22.6ConvFormer-B36 (224 res, 21K)2022-10-24
Masked Autoencoders Are Scalable Vision Learners✓ Link86.9%MAE (ViT-H)2021-11-11
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles✓ Link86.9%Hiera-H2023-06-01
MetaFormer Baselines for Vision✓ Link86.9%39M26.0CAFormer-S36 (384 res, 21K)2022-10-24
MetaFormer Baselines for Vision✓ Link86.9%57M37.7ConvFormer-M36 (384 res, 21K)2022-10-24
Self-training with Noisy Student improves ImageNet classification✓ Link86.9%66M37NoisyStudent (EfficientNet-B7)2019-11-11
DaViT: Dual Attention Vision Transformers✓ Link86.9%87.9M46.4DaViT-B (ImageNet-22k)2022-04-07
Visual Attention Network✓ Link86.9%200M38.9VAN-B6 (22K)2022-02-20
The effectiveness of MAE pre-pretraining for billion-scale pretraining✓ Link86.8%MAWS (ViT-B)2023-03-23
EfficientNetV2: Smaller Models and Faster Training✓ Link86.8%120M53EfficientNetV2-L (21k)2021-04-01
VOLO: Vision Outlooker for Visual Recognition✓ Link86.8%193M197VOLO-D42021-06-24
Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error86.78%377.2MNFNet-F5 w/ SAM w/ augmult=162021-05-27
An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems✓ Link86.74%µ2Net (ViT-L/16)2022-05-25
DeiT III: Revenge of the ViT✓ Link86.7%ViT-B @384 (DeiT III, 21k)2022-04-14
MaxViT: Multi-Axis Vision Transformer✓ Link86.7%MaxViT-B (512res)2022-04-04
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link86.7%43MFixEfficientNet-B62020-03-18
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models✓ Link86.7%190M271MOAT-3 1K only2022-10-04
An Algorithm for Routing Vectors in Sequences✓ Link86.7%312.8MHeinsen Routing + BEiT-large 16 2242022-11-20
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network✓ Link86.61%51.93CLCNet (S:ViT+D:EffNet-B7) (retrain)2022-05-19
MetaFormer Baselines for Vision✓ Link86.6%56M13.2CAFormer-M36 (224 res, 21K)2022-10-24
Visual Attention Network✓ Link86.6%60M35.9VAN-B4 (22K, 384res)2022-02-20
Multimodal Autoregressive Pre-training of Large Vision Encoders✓ Link86.6%300MAIMv2-L2024-11-21
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language✓ Link86.6%656Mdata2vec (ViT-H)2022-02-07
Dilated Neighborhood Attention Transformer✓ Link86.5%34.5DiNAT_s-Large (224x224; Pretrained on ImageNet-22K @ 224x224)2022-09-29
Meta Knowledge Distillation86.5%MKD ViT-L2022-02-16
TinyViT: Fast Pretraining Distillation for Small Vision Transformers✓ Link86.5%21M27.0TinyViT-21M-512-distill (512 res, 21k)2022-07-21
Augmenting Convolutional networks with attention-based aggregation✓ Link86.5%99.4MPatchConvNet-B60-21k-3842021-12-27
Going deeper with Image Transformers✓ Link86.5%438M377.3CaiT-M-48-4482021-03-31
High-Performance Large-Scale Image Recognition Without Normalization✓ Link86.5%438.4M377.28NFNet-F6 w/ SAM2021-02-11
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network✓ Link86.46%57.46CLCNet (S:ViT+D:VOLO-D3) (retrain)2022-05-19
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network✓ Link86.42%45.43CLCNet (S:ConvNeXt-L+D:EffNet-B7) (retrain)2022-05-19
MaxViT: Multi-Axis Vision Transformer✓ Link86.4%MaxViT-L (384res)2022-04-04
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link86.4%UniRepLKNet-S++2023-11-27
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link86.4%30MFixEfficientNet-B52020-03-18
MetaFormer Baselines for Vision✓ Link86.4%40M22.4ConvFormer-S36 (384 res, 21K)2022-10-24
Self-training with Noisy Student improves ImageNet classification✓ Link86.4%43MNoisyStudent (EfficientNet-B6)2019-11-11
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows✓ Link86.4%88M47Swin-B2021-03-25
MetaFormer Baselines for Vision✓ Link86.4%99M72.2CAFormer-B36 (384 res)2022-10-24
All Tokens Matter: Token Labeling for Training Better Vision Transformers✓ Link86.4%151M214.8LV-ViT-L2021-04-22
Fixing the train-test resolution discrepancy✓ Link86.4%829M62G98.0%FixResNeXt-101 32x48d2019-06-14
MaxViT: Multi-Axis Vision Transformer✓ Link86.34%MaxViT-B (384res)2022-04-04
A Study on Transformer Configuration and Training Objective86.3%Bamboo (Bamboo-L)2022-05-21
Co-training $2^L$ Submodels for Visual Recognition✓ Link86.3%ViT-B@224 (cosub)2022-12-09
SP-ViT: Learning 2D Spatial Priors for Vision Transformers✓ Link86.3%SP-ViT-L|3842022-06-15
VOLO: Vision Outlooker for Visual Recognition✓ Link86.3%86M67.9VOLO-D32021-06-24
BEiT: BERT Pre-Training of Image Transformers✓ Link86.3%86MBEiT-L (ViT; ImageNet 1k pretrain)2021-06-15
Visual Attention Network✓ Link86.3%90M17.2VAN-B5 (22K)2022-02-20
UniFormer: Unifying Convolution and Self-attention for Visual Recognition✓ Link86.3%100M39.2UniFormer-L (384 res)2022-01-24
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection✓ Link86.3%218M140.2MViTv2-L (384 res)2021-12-02
Going deeper with Image Transformers✓ Link86.3%271M247.8CAIT-M36-4482021-03-31
High-Performance Large-Scale Image Recognition Without Normalization✓ Link86.3%377.2M289.76NFNet-F5 w/ SAM2021-02-11
Tiny Models are the Computational Saver for Large Models✓ Link86.24%31.17TinySaver(ConvNeXtV2_h, 0.01 Acc drop)2024-03-26
Co-training $2^L$ Submodels for Visual Recognition✓ Link86.2%Swin-B@224 (cosub)2022-12-09
TinyViT: Fast Pretraining Distillation for Small Vision Transformers✓ Link86.2%21M13.8TinyViT-21M-384-distill (384 res, 21k)2022-07-21
EfficientNetV2: Smaller Models and Faster Training✓ Link86.2%54M24EfficientNetV2-M (21k)2021-04-01
MetaFormer Baselines for Vision✓ Link86.2%56M42.0CAFormer-M36 (384 res)2022-10-24
TransNeXt: Robust Foveal Visual Perception for Vision Transformers✓ Link86.2%89.7M56.3TransNeXt-Base (IN-1K supervised, 384)2023-11-28
Masked Image Residual Learning for Scaling Deeper Vision Transformers✓ Link86.2%341M67.0MIRL (ViT-B-48)2023-09-25
MaxViT: Multi-Axis Vision Transformer✓ Link86.19%MaxViT-S (512res)2022-04-04
Self-training with Noisy Student improves ImageNet classification✓ Link86.1%30MNoisyStudent (EfficientNet-B5)2019-11-11
MetaFormer Baselines for Vision✓ Link86.1%57M12.8ConvFormer-M36 (224 res, 21K)2022-10-24
Going deeper with Image Transformers✓ Link86.1%270.9M173.3CAIT-M-362021-03-31
Refiner: Refining Self-attention for Vision Transformers✓ Link86.0381MRefiner-ViT-L2021-06-07
Generalized Parametric Contrastive Learning✓ Link86.01%GPaCo (ViT-L)2022-09-26
Omnivore: A Single Model for Many Visual Modalities✓ Link86.0%Omnivore (Swin-L)2022-01-20
SP-ViT: Learning 2D Spatial Priors for Vision Transformers✓ Link86%SP-ViT-M|3842022-06-15
TransNeXt: Robust Foveal Visual Perception for Vision Transformers✓ Link86.0%49.7M32.1TransNeXt-Small (IN-1K supervised, 384)2023-11-28
VOLO: Vision Outlooker for Visual Recognition✓ Link86%59MVOLO-D22021-06-24
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction✓ Link86%64M20EfficientViT-L2 (r384)2022-05-29
XCiT: Cross-Covariance Image Transformers✓ Link86%189M417.9XCiT-L242021-06-17
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling✓ Link86.0%198MSparK (ConvNeXt-Large, 384)2023-01-09
High-Performance Large-Scale Image Recognition Without Normalization✓ Link86.0%377.2M289.76NFNet-F52021-02-11
Masked Autoencoders Are Scalable Vision Learners✓ Link85.9%MAE (ViT-L)2021-11-11
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link85.9%19MFixEfficientNet-B42020-03-18
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention✓ Link85.9%94M49.7DAT-B++ (384x384)2023-09-04
High-Performance Large-Scale Image Recognition Without Normalization✓ Link85.9%316.1M215.24NFNet-F42021-02-11
Co-training $2^L$ Submodels for Visual Recognition✓ Link85.8%ConvNeXt-B@224 (cosub)2022-12-09
Co-training $2^L$ Submodels for Visual Recognition✓ Link85.8%PiT-B@224 (cosub)2022-12-09
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation✓ Link85.8%GTP-ViT-B-Patch8/P202023-11-06
MetaFormer Baselines for Vision✓ Link85.8%39M8.0CAFormer-S36 (224 res, 21K)2022-10-24
XCiT: Cross-Covariance Image Transformers✓ Link85.8%84M188XCiT-M242021-06-17
MaxUp: A Simple Way to Improve Generalization of Neural Network Training✓ Link85.8%87.42MFix-EfficientNet-B8 (MaxUp + CutMix)2020-02-20
Circumventing Outliers of AutoAugment with Knowledge Distillation✓ Link85.8%88MKDforAA (EfficientNet-B8)2020-03-25
Going deeper with Image Transformers✓ Link85.8%185.9M116.1CAIT-M-242021-03-31
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs✓ Link85.8%186M34.7RDNet-L (384 res)2024-03-28
DeiT III: Revenge of the ViT✓ Link85.8%304.8M191.2ViT-L2022-04-14
FasterViT: Fast Vision Transformers with Hierarchical Attention✓ Link85.8%1360M142FasterViT-62023-06-09
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision✓ Link85.8%10000MSEER (RG-10B)2022-02-16
Tiny Models are the Computational Saver for Large Models✓ Link85.75%19.41TinySaver(ConvNeXtV2_h, 0.5 Acc drop)2024-03-26
Tiny Models are the Computational Saver for Large Models✓ Link85.74%TinySaver(Swin_large, 0.5 Acc drop)2024-03-26
MaxViT: Multi-Axis Vision Transformer✓ Link85.72%MaxViT-T (384res)2022-04-04
Visual Attention Network✓ Link85.7%12.2VAN-B4 (22K)2022-02-20
EfficientNetV2: Smaller Models and Faster Training✓ Link85.7%53EfficientNetV2-L2021-04-01
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link85.7%FixEfficientNet-B82020-03-18
DeiT III: Revenge of the ViT✓ Link85.7%ViT-B @224 (DeiT III, 21k)2022-04-14
Exploring Target Representations for Masked Autoencoders✓ Link85.7%dBOT ViT-B (CLIP as Teacher)2022-09-08
MetaFormer Baselines for Vision✓ Link85.7%39M26.0CAFormer-S36 (384 res)2022-10-24
MetaFormer Baselines for Vision✓ Link85.7%100M66.5ConvFormer-B36 (384 res)2022-10-24
High-Performance Large-Scale Image Recognition Without Normalization✓ Link85.7%254.9M114.76NFNet-F32021-02-11
Masking meets Supervision: A Strong Learning Alliance✓ Link85.7%632MViT-H @224 (DeiT-III + AugSub)2023-06-20
XCiT: Cross-Covariance Image Transformers✓ Link85.6%48M106XCiT-S242021-06-17
MetaFormer Baselines for Vision✓ Link85.6%57M37.7ConvFormer-M36 (384 res)2022-10-24
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction✓ Link85.6%64M11EfficientViT-L2 (r288)2022-05-29
UniFormer: Unifying Convolution and Self-attention for Visual Recognition✓ Link85.6%100M12.6UniFormer-L2022-01-24
FasterViT: Fast Vision Transformers with Hierarchical Attention✓ Link85.6%957.5M113FasterViT-52023-06-09
Three things everyone should know about Vision Transformers✓ Link85.5%ViT-L@384 (attn finetune)2022-03-18
SP-ViT: Learning 2D Spatial Priors for Vision Transformers✓ Link85.5%SP-ViT-L2022-06-15
MiniViT: Compressing Vision Transformers with Weight Multiplexing✓ Link85.5%47M98.8Mini-Swin-B@3842022-04-14
Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning✓ Link85.5%57.5M14.8Wave-ViT-L2022-07-11
Circumventing Outliers of AutoAugment with Knowledge Distillation✓ Link85.5%66MKDforAA (EfficientNet-B7)2020-03-25
Scaling Local Self-Attention for Parameter Efficient Visual Backbones✓ Link85.5%87MHaloNet4 (base 128, Conv-12)2021-03-23
Adversarial Examples Improve Image Recognition✓ Link85.5%88MAdvProp (EfficientNet-B8)2019-11-21
MetaFormer Baselines for Vision✓ Link85.5%99M23.2CAFormer-B36 (224 res)2022-10-24
A ConvNet for the 2020s✓ Link85.5%198M101ConvNeXt-L (384 res)2022-01-10
RandAugment: Practical automated data augmentation with a reduced search space✓ Link85.4%EfficientNet-B8 (RandAugment)2019-09-30
BiFormer: Vision Transformer with Bi-Level Routing Attention✓ Link85.4%BiFormer-B* (IN1k pretrain)2023-03-15
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation✓ Link85.4%GTP-EVA-L/P82023-11-06
Augmenting Convolutional networks with attention-based aggregation✓ Link85.4%25.2MPatchConvNet-S60-21k-5122021-12-27
MetaFormer Baselines for Vision✓ Link85.4%26M13.4CAFormer-S18 (384 res, 21K)2022-10-24
MetaFormer Baselines for Vision✓ Link85.4%40M7.6ConvFormer-S36 (224 res, 21K)2022-10-24
MetaFormer Baselines for Vision✓ Link85.4%40M22.4ConvFormer-S36 (384 res)2022-10-24
Going deeper with Image Transformers✓ Link85.4%68.2M48CAIT-S-362021-03-31
FasterViT: Fast Vision Transformers with Hierarchical Attention✓ Link85.4%424.6M36.6FasterViT-42023-06-09
Exploring the Limits of Weakly Supervised Pretraining✓ Link85.4%829M306ResNeXt-101 32x48d2018-05-02
Big Transfer (BiT): General Visual Representation Learning✓ Link85.39%928MBiT-M (ResNet)2019-12-24
MLP-Mixer: An all-MLP Architecture for Vision✓ Link85.3%ViT-L/16 Dosovitskiy et al. (2021)2021-05-04
Omnivore: A Single Model for Many Visual Modalities✓ Link85.3%Omnivore (Swin-B)2022-01-20
Self-training with Noisy Student improves ImageNet classification✓ Link85.3%19MNoisyStudent (EfficientNet-B4)2019-11-11
Going deeper with Image Transformers✓ Link85.3%89.5M63.8CAIT-S-482021-03-31
Masking meets Supervision: A Strong Learning Alliance✓ Link85.3%304MViT-L @224 (DeiT-III + AugSub)2023-06-20
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network✓ Link85.28%47.43CLCNet (S:D1+D:D5)2022-05-19
Tiny Models are the Computational Saver for Large Models✓ Link85.24%TinySaver(Swin_large, 1.0 Acc drop)2024-03-26
DeiT III: Revenge of the ViT✓ Link85.2%ViT-H @224 (DeiT III)2022-04-14
HyenaPixel: Global Image Context with Convolutions✓ Link85.2%HyenaPixel-Bidirectional-Former-B362024-02-29
VOLO: Vision Outlooker for Visual Recognition✓ Link85.2%27MVOLO-D12021-06-24
MetaFormer Baselines for Vision✓ Link85.2%56M13.2CAFormer-M36 (224 res)2022-10-24
Adversarial Examples Improve Image Recognition✓ Link85.2%66MAdvProp (EfficientNet-B7)2019-11-21
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP85.2%73.5M23.2UniNet-B52021-10-08
Training data-efficient image transformers & distillation through attention✓ Link85.2%87MDeiT-B 3842020-12-23
MaxViT: Multi-Axis Vision Transformer✓ Link85.17%212M43.9MaxViT-L (224res)2022-04-04
EfficientNetV2: Smaller Models and Faster Training✓ Link85.1%EfficientNetV2-M2021-04-01
Meta Knowledge Distillation85.1%MKD ViT-B2022-02-16
SP-ViT: Learning 2D Spatial Priors for Vision Transformers✓ Link85.1%SP-ViT-S|3842022-06-15
XCiT: Cross-Covariance Image Transformers✓ Link85.1%26M55.6XCiT-S122021-06-17
Going deeper with Image Transformers✓ Link85.1%46.9M32.2CAIT-S-242021-03-31
Semi-Supervised Recognition under a Noisy and Fine-grained Dataset✓ Link85.1%76MResNet200_vd_26w_4s_ssld2020-06-18
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers✓ Link85.1%88M16.3MixMIM-B2022-05-26
High-Performance Large-Scale Image Recognition Without Normalization✓ Link85.1%193.8M62.59NFNet-F22021-02-11
Exploring the Limits of Weakly Supervised Pretraining✓ Link85.1%466M174ResNeXt-101 32x32d2018-05-02
Discrete Representations Strengthen Vision Transformer Robustness✓ Link85.07%DiscreteViT2021-11-20
Co-training $2^L$ Submodels for Visual Recognition✓ Link85.0%ViT-M@224 (cosub)2022-12-09
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders✓ Link85%ViC-MAE (ViT-L)2023-03-21
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space✓ Link85%HVT Large2024-09-25
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link85%12MFixEfficientNet-B32020-03-18
MetaFormer Baselines for Vision✓ Link85.0%26M13.4CAFormer-S18 (384 res)2022-10-24
MetaFormer Baselines for Vision✓ Link85.0%27M11.6ConvFormer-S18 (384 res, 21K)2022-10-24
RandAugment: Practical automated data augmentation with a reduced search space✓ Link85%66MEfficientNet-B7 (RandAugment)2019-09-30
DeiT III: Revenge of the ViT✓ Link85.0%87MViT-B @384 (DeiT III)2022-04-14
MambaVision: A Hybrid Mamba-Transformer Vision Backbone✓ Link85%227.9M34.9MambaVision-L2024-07-10
MaxViT: Multi-Axis Vision Transformer✓ Link84.94%120M23.4MaxViT-B (224res)2022-04-04
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link84.91%CaiT-S242023-08-18
DeiT III: Revenge of the ViT✓ Link84.9%ViT-L @224 (DeiT III)2022-04-14
SP-ViT: Learning 2D Spatial Priors for Vision Transformers✓ Link84.9%SP-ViT-M2022-06-15
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization✓ Link84.9%FastViT-MA362023-03-24
HyenaPixel: Global Image Context with Convolutions✓ Link84.9%HyenaPixel-Former-B362024-02-29
EfficientNetV2: Smaller Models and Faster Training✓ Link84.9%22M8.8EfficientNetV2-S (21k)2021-04-01
CvT: Introducing Convolutions to Vision Transformers✓ Link84.9%32M25CvT-21 (384 res, ImageNet-22k pretrain)2021-03-29
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention✓ Link84.9%93M16.6DAT-B++ (224x224)2023-09-04
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions✓ Link84.9%97M16InternImage-B2022-11-10
FasterViT: Fast Vision Transformers with Hierarchical Attention✓ Link84.9%159.5M18.2FasterViT-32023-06-09
TinyViT: Fast Pretraining Distillation for Small Vision Transformers✓ Link84.8%21M4.3TinyViT-21M-distill (21k)2022-07-21
Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning✓ Link84.8%33.5M7.2Wave-ViT-B2022-07-11
Going deeper with Image Transformers✓ Link84.8%38.6M28.8CAIT-XS-362021-03-31
Sliced Recursive Transformer✓ Link84.8%71.2MSReT-B (384 res, ImageNet-1K only)2021-11-09
Multiscale Vision Transformers✓ Link84.8%72.9M32.7MViT-B-242021-04-22
Active Token Mixer✓ Link84.8%76.4M36.4ActiveMLP-L2022-03-11
Vision Transformer with Deformable Attention✓ Link84.8%88M49.8DAT-B (384 res, IN-1K only)2022-01-03
Masked Image Residual Learning for Scaling Deeper Vision Transformers✓ Link84.8%96M18.8MIRL(ViT-S-54)2023-09-25
MetaFormer Baselines for Vision✓ Link84.8%100M22.6ConvFormer-B36 (224 res)2022-10-24
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs✓ Link84.8%186M34.7RDNet-L2024-03-28
Billion-scale semi-supervised learning for image classification✓ Link84.8%193MResNeXt-101 32x16d (semi-weakly sup.)2019-05-02
ELSA: Enhanced Local Self-Attention for Vision Transformer✓ Link84.7%27M8ELSA-VOLO-D12021-12-23
TransNeXt: Robust Foveal Visual Perception for Vision Transformers✓ Link84.7%49.7M10.3TransNeXt-Small (IN-1K supervised, 224)2023-11-28
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios✓ Link84.7%57.8M32Next-ViT-L @3842022-07-12
Vicinity Vision Transformer✓ Link84.7%61.8M31.8VVT-L (384 res)2022-06-21
Bottleneck Transformers for Visual Recognition✓ Link84.7%75.1MBoTNet T72021-01-27
MogaNet: Multi-order Gated Aggregation Network✓ Link84.7%83M15.9MogaNet-L2022-11-07
Fast Vision Transformers with HiLo Attention✓ Link84.7%87M39.7LITv2-B|3842022-05-26
High-Performance Large-Scale Image Recognition Without Normalization✓ Link84.7%132.6M35.54NFNet-F12021-02-11
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention✓ Link84.6%53M9.4DAT-S++2023-09-04
Sequencer: Deep LSTM for Image Classification✓ Link84.6%54M50.7Sequencer2D-L↑3922022-05-04
Contextual Transformer Networks for Visual Recognition✓ Link84.6%55.8M26.5SE-CoTNetD-1522021-07-26
Asymmetric Masked Distillation for Pre-Training Small Foundation Models84.6%87MAMD(ViT-B/16)2023-11-06
DaViT: Dual Attention Vision Transformers✓ Link84.6%87.9M15.5DaViT-B2022-04-07
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization✓ Link84.5%FastViT-SA362023-03-24
Rethinking Channel Dimensions for Efficient Model Design✓ Link84.5%34.8MReXNet-R_3.02020-07-02
MetaFormer Baselines for Vision✓ Link84.5%39M8.0CAFormer-S36 (224 res)2022-10-24
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction✓ Link84.5%53M5.3EfficientViT-L1 (r224)2022-05-29
MetaFormer Baselines for Vision✓ Link84.5%57M12.8ConvFormer-M36 (224 res)2022-10-24
Global Context Vision Transformers✓ Link84.5%90M14.8GC ViT-B2022-06-20
ResNeSt: Split-Attention Networks✓ Link84.5%111MResNeSt-2692020-04-19
CoAtNet: Marrying Convolution and Attention for All Data Sizes✓ Link84.5%168M34.7CoAtNet-32021-06-09
MaxViT: Multi-Axis Vision Transformer✓ Link84.45%69M11.7MaxViT-S (224res)2022-04-04
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism✓ Link84.4%GPIPE2018-11-16
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention✓ Link84.4%DeBiFormer-B2024-10-11
MetaFormer Baselines for Vision✓ Link84.4%27M11.6ConvFormer-S18 (384 res)2022-10-24
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks✓ Link84.4%66M37EfficientNet-B72019-05-28
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs✓ Link84.4%87M15.4RDNet-B2024-03-28
Dilated Neighborhood Attention Transformer✓ Link84.4%90M13.7DiNAT-Base2022-09-29
Revisiting ResNets: Improved Training and Scaling Strategies✓ Link84.4%192M4.6ResNet-RS-50 (160 image res)2021-03-13
ColorNet: Investigating the importance of color spaces for image classification✓ Link84.32%ColorNet (RHYLH with Conv Layer)2019-02-01
Three things everyone should know about Vision Transformers✓ Link84.3%ViT-B@384 (attn finetune)2022-03-18
BiFormer: Vision Transformer with Bi-Level Routing Attention✓ Link84.3%BiFormer-S* (IN1k pretrain)2023-03-15
Sliced Recursive Transformer✓ Link84.3%21.3M42.8SReT-S (512 res, ImageNet-1K only)2021-11-09
LambdaNetworks: Modeling Long-Range Interactions Without Attention✓ Link84.3%42MLambdaResNet2002021-02-17
MogaNet: Multi-order Gated Aggregation Network✓ Link84.3%44M9.9MogaNet-B2022-11-07
TResNet: High Performance GPU-Dedicated Architecture✓ Link84.3%77MTResNet-XL2020-03-30
Billion-scale semi-supervised learning for image classification✓ Link84.3%88MResNeXt-101 32x8d (semi-weakly sup.)2019-05-02
Neighborhood Attention Transformer✓ Link84.3%90M13.7NAT-Base2022-04-14
Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network✓ Link84.2%15.8Assemble-ResNet1522020-01-17
Bottleneck Transformers for Visual Recognition✓ Link84.2%BoTNet T7-3202021-01-27
Visual Parser: Representing Part-whole Hierarchies with Transformers✓ Link84.2%ViP-B|3842021-07-13
A Study on Transformer Configuration and Training Objective84.2%Bamboo (Bamboo-B)2022-05-21
Co-training $2^L$ Submodels for Visual Recognition✓ Link84.2%RegnetY16GF@224 (cosub)2022-12-09
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction✓ Link84.2%49M6.5EfficientViT-B3 (r288)2022-05-29
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions✓ Link84.2%50M8InternImage-S2022-11-10
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP84.2%73.5M9.9UniNet-B42021-10-08
FasterViT: Fast Vision Transformers with Hierarchical Attention✓ Link84.2%75.9M8.7FasterViT-22023-06-09
Training data-efficient image transformers & distillation through attention✓ Link84.2%86MDeiT-B2020-12-23
Masking meets Supervision: A Strong Learning Alliance✓ Link84.2%86.6MViT-B @224 (DeiT-III + AugSub)2023-06-20
MambaVision: A Hybrid Mamba-Transformer Vision Backbone✓ Link84.2%97.7M15MambaVision-B2024-07-10
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network✓ Link84.2%142.3M38.1RevBiFPN-S62022-06-28
Exploring the Limits of Weakly Supervised Pretraining✓ Link84.2%194M72ResNeXt-101 32×16d2018-05-02
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run84.1%2.1FBNetV5-F-CLS2021-11-19
Three things everyone should know about Vision Transformers✓ Link84.1%ViT-B-36x12022-03-18
Three things everyone should know about Vision Transformers✓ Link84.1%ViT-B-18x22022-03-18
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link84.1%XCiT-M (+MixPro)2023-04-24
Performance of Gaussian Mixture Model Classifiers on Embedded Feature Spaces✓ Link84.1%DGMMC-S2024-10-17
Self-training with Noisy Student improves ImageNet classification✓ Link84.1%12MNoisyStudent (EfficientNet-B3)2019-11-11
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications✓ Link84.1%21.76M3.597CAS-ViT-T2024-08-07
MetaFormer Baselines for Vision✓ Link84.1%26M4.1CAFormer-S18 (224 res, 21K)2022-10-24
Going deeper with Image Transformers✓ Link84.1%26.6M19.3CAIT-XS-242021-03-31
MetaFormer Baselines for Vision✓ Link84.1%40M7.6ConvFormer-S36 (224 res)2022-10-24
All Tokens Matter: Token Labeling for Training Better Vision Transformers✓ Link84.1%56M16LV-ViT-M2021-04-22
Vicinity Vision Transformer✓ Link84.1%61.8M10.8VVT-L (224 res)2022-06-21
CoAtNet: Marrying Convolution and Attention for All Data Sizes✓ Link84.1%75M15.7CoAtNet-22021-06-09
Conformer: Local Features Coupling Global Representations for Visual Recognition✓ Link84.1%83.3M46.6Conformer-B2021-05-09
Augmenting Convolutional networks with attention-based aggregation✓ Link84.1%188.6MPatchConvNet-B1202021-12-27
Generalized Parametric Contrastive Learning✓ Link84.0%GPaCo (Vit-B)2022-09-26
Scalable Pre-training of Large Autoregressive Image Models✓ Link84.0%AIM-7B2024-01-16
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link84.0%19MFixEfficientNetB42020-03-18
Semi-Supervised Recognition under a Noisy and Fine-grained Dataset✓ Link84.0%25.58MFix_ResNet50_vd_ssld2020-06-18
TransNeXt: Robust Foveal Visual Perception for Vision Transformers✓ Link84.0%28.2M5.7TransNeXt-Tiny (IN-1K supervised, 224)2023-11-28
LambdaNetworks: Modeling Long-Range Interactions Without Attention✓ Link84.0%35MLambdaResNet1522021-02-17
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks✓ Link84%43M19EfficientNet-B62019-05-28
Global Context Vision Transformers✓ Link84.0%51M8.5GC ViT-S2022-06-20
Bottleneck Transformers for Visual Recognition✓ Link84%53.9MBoTNet T62021-01-27
Rethinking Spatial Dimensions of Vision Transformers✓ Link84%73.8M12.5PiT-B2021-03-30
DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network✓ Link84%89M15.4DeepMAD-89M2023-03-05
EfficientNetV2: Smaller Models and Faster Training✓ Link83.9%EfficientNetV2-S2021-04-01
SP-ViT: Learning 2D Spatial Priors for Vision Transformers✓ Link83.9%SP-ViT-S2022-06-15
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link83.9%UniRepLKNet-S2023-11-27
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention✓ Link83.9%DeBiFormer-S2024-10-11
Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning✓ Link83.9%22.7M4.7Wave-ViT-S2022-07-11
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention✓ Link83.9%24M4.3DAT-T++2023-09-04
Adaptive Split-Fusion Transformer✓ Link83.9%56.7MASF-former-B2022-04-26
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification✓ Link83.9%57.1MDynamicViT-LV-M/0.82021-06-03
Transformer in Transformer✓ Link83.9%65.6MTNT-B2021-02-27
ResNeSt: Split-Attention Networks✓ Link83.9%70MResNeSt-2002020-04-19
Regularized Evolution for Image Classifier Architecture Search✓ Link83.9%469M208AmoebaNet-A2018-02-05
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network✓ Link83.88%18.58CLCNet (S:B4+D:B7)2022-05-19
Revisiting ResNets: Improved Training and Scaling Strategies✓ Link83.8%54ResNet-RS-270 (256 image res)2021-03-13
Bottleneck Transformers for Visual Recognition✓ Link83.8%SENet-3502021-01-27
DeiT III: Revenge of the ViT✓ Link83.8%ViT-B @224 (DeiT III)2022-04-14
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders✓ Link83.8%ColorMAE-Green-ViTB-16002024-07-17
Sliced Recursive Transformer✓ Link83.8%21M18.5SReT-S (384 res, ImageNet-1K only)2021-11-09
Dilated Neighborhood Attention Transformer✓ Link83.8%51M7.8DiNAT-Small2022-09-29
Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding✓ Link83.8%68M17.9Transformer local-attention (NesT-B)2021-05-26
PVT v2: Improved Baselines with Pyramid Vision Transformer✓ Link83.8%82M11.8PVTv2-B42021-06-25
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link83.7%CA-Swin-S (+MixPro)2023-04-24
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation✓ Link83.7%GTP-ViT-L/P82023-11-06
MetaFormer Baselines for Vision✓ Link83.7%27M3.9ConvFormer-S18 (224 res, 21K)2022-10-24
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs✓ Link83.7%50M8.7RDNet-S2024-03-28
Vision Transformer with Deformable Attention✓ Link83.7%50M9.0DAT-S2022-01-03
Neighborhood Attention Transformer✓ Link83.7%51M7.8NAT-Small2022-04-14
Learned Queries for Efficient Local Attention✓ Link83.7%56M9.7QnA-ViT-Base2021-12-21
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network✓ Link83.7%82M21.8RevBiFPN-S52022-06-28
Vision GNN: An Image is Worth Graph of Nodes✓ Link83.7%92.6M16.8Pyramid ViG-B2022-06-01
Twins: Revisiting the Design of Spatial Attention in Vision Transformers✓ Link83.7%99.2M15.1Twins-SVT-L2021-04-28
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link83.67%22.05MTransBoost-ViT-S2022-05-26
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link83.65%XCiT-S2023-08-18
MaxViT: Multi-Axis Vision Transformer✓ Link83.62%31M5.6MaxViT-T (224res)2022-04-04
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link83.61%Wave-ViT-S2023-08-18
Fast Vision Transformers with HiLo Attention✓ Link83.6%13.2LITv2-B2022-05-26
MultiGrain: a unified image embedding for classes and instances✓ Link83.6%MultiGrain PNASNet (500px)2019-02-14
Masked Autoencoders Are Scalable Vision Learners✓ Link83.6%MAE (ViT-L)2021-11-11
Pattern Attention Transformer with Doughnut Kernel83.6%PAT-B2022-11-30
HyenaPixel: Global Image Context with Convolutions✓ Link83.6%HyenaPixel-Attention-Former-S182024-02-29
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link83.6%9.2MFixEfficientNet-B22020-03-18
MetaFormer Baselines for Vision✓ Link83.6%26M4.1CAFormer-S18 (224 res)2022-10-24
IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation✓ Link83.6%39.3M7.8IPT-B2022-12-06
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias✓ Link83.6%48.5M27.6ViTAE-B-Stage2021-06-07
ResT: An Efficient Transformer for Visual Recognition✓ Link83.6%51.63M7.9ResT-Large2021-05-28
High-Performance Large-Scale Image Recognition Without Normalization✓ Link83.6%71.5M12.38NFNet-F02021-02-11
Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training✓ Link83.6%98M38.2SE-ResNeXt-101, 64x4d, S=2(320px)2020-11-30
ResMLP: Feedforward networks for image classification with data-efficient training✓ Link83.6%116MResMLP-B24/82021-05-07
Tiny Models are the Computational Saver for Large Models✓ Link83.52%TinySaver(EfficientFormerV2_l, 0.01 Acc drop)2024-03-26
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction✓ Link83.5%4EfficientViT-B3 (r224)2022-05-29
Bottleneck Transformers for Visual Recognition✓ Link83.5%19.3BoTNet T52021-01-27
HyenaPixel: Global Image Context with Convolutions✓ Link83.5%HyenaPixel-Bidirectional-Former-S182024-02-29
Augmenting Convolutional networks with attention-based aggregation✓ Link83.5%99.4MPatchConvNet-B602021-12-27
Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion✓ Link83.46%SigLIP B/16 + PrefixedIter Decoder2024-07-15
Three things everyone should know about Vision Transformers✓ Link83.4%ViT-B (hMLP + BeiT)2022-03-18
MobileNetV4 -- Universal Models for the Mobile Ecosystem✓ Link83.4%MNv4-Hybrid-L2024-04-16
UniFormer: Unifying Convolution and Self-attention for Visual Recognition✓ Link83.4%22M3.6UniFormer-S2022-01-24
DeiT III: Revenge of the ViT✓ Link83.4%22M15.5ViT-S @384 (DeiT III)2022-04-14
MogaNet: Multi-order Gated Aggregation Network✓ Link83.4%25M5MogaNet-S2022-11-07
Global Context Vision Transformers✓ Link83.4%28M4.7GC ViT-T2022-06-20
Billion-scale semi-supervised learning for image classification✓ Link83.4%42MResNeXt-101 32x4d (semi-weakly sup.)2019-05-02
Sequencer: Deep LSTM for Image Classification✓ Link83.4%54M16.6Sequencer2D-L2022-05-04
Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?✓ Link83.4%65.9MsMLPNet-B (ImageNet-1k)2021-09-12
Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training✓ Link83.34%98M61.1SE-ResNeXt-101, 64x4d, S=2(416px)2020-11-30
CvT: Introducing Convolutions to Vision Transformers✓ Link83.3%24.9CvT-21 (384 res)2021-03-29
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet✓ Link83.3%34.2T2T-ViT-14|3842021-01-28
Incorporating Convolution Designs into Visual Transformers✓ Link83.3%24.2M12.9CeiT-S (384 finetune res)2021-03-22
All Tokens Matter: Token Labeling for Training Better Vision Transformers✓ Link83.3%26M6.6LV-ViT-S2021-04-22
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models✓ Link83.3%27.8M5.7MOAT-0 1K only2022-10-04
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks✓ Link83.3%30M9.9EfficientNet-B52019-05-28
Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding✓ Link83.3%38M10.4Transformer local-attention (NesT-S)2021-05-26
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding✓ Link83.3%39.7M8.7ViL-Medium-D2021-03-29
CoAtNet: Marrying Convolution and Attention for All Data Sizes✓ Link83.3%42M8.4CoAtNet-12021-06-09
Fast Vision Transformers with HiLo Attention✓ Link83.3%49M7.5LITv2-M2022-05-26
MambaVision: A Hybrid Mamba-Transformer Vision Backbone✓ Link83.3%50.1M7.5MambaVision-S2024-07-10
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism✓ Link83.3%88M15.2Shift-B2022-01-26
MultiGrain: a unified image embedding for classes and instances✓ Link83.2%MultiGrain PNASNet (450px)2019-02-14
Meta Pseudo Labels✓ Link83.2%Meta Pseudo Labels (ResNet-50)2020-03-23
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link83.2%UniRepLKNet-T2023-11-27
HyenaPixel: Global Image Context with Convolutions✓ Link83.2%HyenaPixel-Former-S182024-02-29
TinyViT: Fast Pretraining Distillation for Small Vision Transformers✓ Link83.2%11M2.0TinyViT-11M-distill (21k)2022-07-21
Rethinking Channel Dimensions for Efficient Model Design✓ Link83.2%16.5MReXNet-R_2.02020-07-02
Learned Queries for Efficient Local Attention✓ Link83.2%25M4.4QnA-ViT-Small2021-12-21
Neighborhood Attention Transformer✓ Link83.2%28M4.3NAT-Tiny2022-04-14
Contextual Transformer Networks for Visual Recognition✓ Link83.2%40.9M8.5SE-CoTNetD-1012021-07-26
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios✓ Link83.2%44.8M8.3Next-ViT-B2022-07-12
PVT v2: Improved Baselines with Pyramid Vision Transformer✓ Link83.2%45.2M6.9PVTv2-B32021-06-25
Augmenting Convolutional networks with attention-based aggregation✓ Link83.2%47.7MPatchConvNet-S1202021-12-27
FasterViT: Fast Vision Transformers with Hierarchical Attention✓ Link83.2%53.4M5.3FasterViT-12023-06-09
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding✓ Link83.2%55.7M13.4ViL-Base-D2021-03-29
CycleMLP: A MLP-like Architecture for Dense Prediction✓ Link83.2%76M12.3CycleMLP-B52021-07-21
MultiGrain: a unified image embedding for classes and instances✓ Link83.1%MultiGrain SENet154 (450px)2019-02-14
DeepViT: Towards Deeper Vision Transformer✓ Link83.1%DeepVit-L* (DeiT training recipe)2021-03-22
DeiT III: Revenge of the ViT✓ Link83.1%ViT-S @224 (DeiT III, 21k)2022-04-14
Meta Knowledge Distillation83.1%MKD ViT-S2022-02-16
Co-training $2^L$ Submodels for Visual Recognition✓ Link83.1%ViT-S@224 (cosub)2022-12-09
Pattern Attention Transformer with Doughnut Kernel83.1%PAT-S2022-11-30
TinyViT: Fast Pretraining Distillation for Small Vision Transformers✓ Link83.1%21M4.3TinyViT-21M2022-07-21
Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?✓ Link83.1%48.6MsMLPNet-S (ImageNet-1k)2021-09-12
Vision GNN: An Image is Worth Graph of Nodes✓ Link83.1%51.7M8.9Pyramid ViG-M2022-06-01
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link83.09%SwinV2-Ti2023-08-18
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window83.01%39.8M7.0gSwin-S2022-08-24
MultiGrain: a unified image embedding for classes and instances✓ Link83.0%MultiGrain SENet154 (400px)2019-02-14
Graph Convolutions Enrich the Self-Attention in Transformers!✓ Link83%Swin-S + GFSA2023-12-07
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications✓ Link83.0%12.42M1.887CAS-ViT-M2024-08-07
CvT: Introducing Convolutions to Vision Transformers✓ Link83%20M16.3CvT-13 (384 res)2021-03-29
Semi-Supervised Recognition under a Noisy and Fine-grained Dataset✓ Link83.0%25.58MResNet50_vd_ssld2020-06-18
MetaFormer Baselines for Vision✓ Link83.0%27M3.9ConvFormer-S18 (224 res)2022-10-24
Multiscale Vision Transformers✓ Link83.0%37M7.8MViT-B-162021-04-22
ResNeSt: Split-Attention Networks✓ Link83.0%48MResNeSt-1012020-04-19
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network✓ Link83%48.7M10.6RevBiFPN-S42022-06-28
Zen-NAS: A Zero-Shot NAS for High-Performance Deep Image Recognition✓ Link83.0%183M13.9ZenNAS (0.8ms)2021-02-01
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training✓ Link82.9%1.881NASViT (supernet)2021-09-29
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link82.9%DeiT-B (+MixPro)2023-04-24
MobileNetV4 -- Universal Models for the Mobile Ecosystem✓ Link82.9%MNv4-Conv-L2024-04-16
IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation✓ Link82.9%24.3M4.7IPT-S2022-12-06
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding✓ Link82.9%39.8MViL-Medium-W2021-03-29
Global Filter Networks for Image Classification✓ Link82.9%54M8.6GFNet-H-B2021-07-01
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution✓ Link82.9%66.8M22.220771G2.22GOct-ResNet-152 (SE)2019-04-10
Progressive Neural Architecture Search✓ Link82.9%86.1M5096.22.5GPNASNet-52017-12-02
Harmonic Convolutional Networks based on Discrete Cosine Transform✓ Link82.85%88.2M31.4Harm-SE-RNX-101 64x4d (320x320, Mean-Max Pooling)2020-01-18
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation✓ Link82.8%8GTP-LV-ViT-M/P82023-11-06
Knowledge distillation: A good teacher is patient and consistent✓ Link82.8%FunMatch - T384+224 (ResNet-50)2021-06-09
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link82.8%CA-Swin-T (+MixPro)2023-04-24
Graph Convolutions Enrich the Self-Attention in Transformers!✓ Link82.8%CaiT-S + GFSA2023-12-07
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs✓ Link82.8%24M5.0RDNet-T2024-03-28
Visual Attention Network✓ Link82.8%26.6M5VAN-B22022-02-20
DaViT: Dual Attention Vision Transformers✓ Link82.8%28.3MDaViT-T2022-04-07
Rethinking Channel Dimensions for Efficient Model Design✓ Link82.8%34.7M3.4ReXNet_3.02020-07-02
Sequencer: Deep LSTM for Image Classification✓ Link82.8%38M11.1Sequencer2D-M2022-05-04
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification✓ Link82.8%44.3M9.5CrossViT-18+2021-03-27
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism✓ Link82.8%50M8.5Shift-S2022-01-26
HRFormer: High-Resolution Transformer for Dense Prediction✓ Link82.8%50.3M13.7HRFormer-B2021-10-18
Bottleneck Transformers for Visual Recognition✓ Link82.8%54.7M10.9BoTNet T42021-01-27
Kolmogorov-Arnold Transformer✓ Link82.8%86.6M17.06KAT-B*2024-09-16
MultiGrain: a unified image embedding for classes and instances✓ Link82.7%MultiGrain SENet154 (500px)2019-02-14
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link82.7%PVT-M (+MixPro)2023-04-24
Adaptive Split-Fusion Transformer✓ Link82.7%19.3MASF-former-S2022-04-26
Container: Context Aggregation Network✓ Link82.7%22.1M8.1Container2021-06-02
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP82.7%22.5M2.4UniNet-B22021-10-08
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction✓ Link82.7%24M2.1EfficientViT-B2 (r256)2022-05-29
Dilated Neighborhood Attention Transformer✓ Link82.7%28M4.3DiNAT-Tiny2022-09-29
ELSA: Enhanced Local Self-Attention for Vision Transformer✓ Link82.7%28M4.8ELSA-Swin-T2021-12-23
MambaVision: A Hybrid Mamba-Transformer Vision Backbone✓ Link82.7%35.1M5.1MambaVision-T22024-07-10
Learning Transferable Architectures for Scalable Image Recognition✓ Link82.7%88.9M23.81648G2.38GNASNET-A(6)2017-07-21
Towards Robust Vision Transformer✓ Link82.7%91.8M17.7RVT-B*2021-05-17
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations✓ Link82.64%CMA(ViT-B/16)2025-03-24
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run82.6%1FBNetV5-C-CLS2021-11-19
MultiGrain: a unified image embedding for classes and instances✓ Link82.6%MultiGrain PNASNet (400px)2019-02-14
Three things everyone should know about Vision Transformers✓ Link82.6%ViT-S-24x22022-03-18
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization✓ Link82.6%FastViT-SA242023-03-24
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link82.6%7.8MFixEfficientNet-B12020-03-18
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks✓ Link82.6%19M4.2EfficientNet-B42019-05-28
Training data-efficient image transformers & distillation through attention✓ Link82.6%22MDeiT-B2020-12-23
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet✓ Link82.6%64.4M30T2T-ViTt-242021-01-28
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link82.54%ViT-S2023-08-18
CvT: Introducing Convolutions to Vision Transformers✓ Link82.5%7.1CvT-212021-03-29
TransNeXt: Robust Foveal Visual Perception for Vision Transformers✓ Link82.5%12.8M2.7TransNeXt-Micro (IN-1K supervised, 224)2023-11-28
Fixing the train-test resolution discrepancy✓ Link82.5%25.6MFixResNet-50 Billion-scale@2242019-06-14
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios✓ Link82.5%31.7M5.8Next-ViT-S2022-07-12
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference✓ Link82.5%39.4M2.334LeViT-3842021-04-02
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification✓ Link82.5%43.3M9CrossViT-182021-03-27
MetaFormer Is Actually What You Need for Vision✓ Link82.5%73M23.2MetaFormer PoolFormer-M482021-11-22
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases✓ Link82.5%152M30ConViT-B+2021-03-19
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link82.46%28.59MTransBoost-ConvNext-T2022-05-26
ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections✓ Link82.4%ReViT-B2024-02-17
Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks✓ Link82.4%M2D-T2024-12-20
Self-training with Noisy Student improves ImageNet classification✓ Link82.4%9.2MNoisyStudent (EfficientNet-B2)2019-11-11
AutoFormer: Searching Transformers for Visual Recognition✓ Link82.4%54M11AutoFormer-base2021-07-01
ResNet strikes back: An improved training procedure in timm✓ Link82.4%60.2MResNet-152 (A2 + reg)2021-10-01
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases✓ Link82.4%86M17ConViT-B2021-03-19
Rethinking and Improving Relative Position Encoding for Vision Transformer✓ Link82.4%87M35.368DeiT-B with iRPE-K2021-07-29
Mega: Moving Average Equipped Gated Attention✓ Link82.4%90MMega2022-09-21
Spatial-Channel Token Distillation for Vision MLPs✓ Link82.4%122.6M24.1ResMLP-B24 + STD2022-07-23
TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers✓ Link82.37%ViT-B/16-224+HTM2022-10-14
ColorNet: Investigating the importance of color spaces for image classification✓ Link82.35%ColorNet2019-02-01
Polynomial, trigonometric, and tropical activations✓ Link82.34%28M96.03ConvNeXt-T-Hermite2025-02-03
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet✓ Link82.3%27.6T2T-ViT-242021-01-28
Three things everyone should know about Vision Transformers✓ Link82.3%ViT-S-48x12022-03-18
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection✓ Link82.3%24M4.7MViTv2-T2021-12-02
SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search✓ Link82.3%27.8M8.412G0.42GSCARLET-A42019-08-16
Sequencer: Deep LSTM for Image Classification✓ Link82.3%28M8.4Sequencer2D-S2022-05-04
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification✓ Link82.3%28.2M6.1CrossViT-15+2021-03-27
MambaVision: A Hybrid Mamba-Transformer Vision Backbone✓ Link82.3%31.8M4.4MambaVision-T2024-07-10
GLiT: Neural Architecture Search for Global and Local Image Transformer✓ Link82.3%96.1M17GLiT-Bases2021-07-07
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link82.29%EViT (delete)2023-08-18
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link82.22%STViT-Swin-Ti2023-08-18
BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search✓ Link82.2%15.8BossNet-T12021-03-23
Going deeper with Image Transformers✓ Link82.2%17.3M14.3CAIT-XXS-362021-03-31
CvT: Introducing Convolutions to Vision Transformers✓ Link82.2%18M4.1CvT-13-NAS2021-03-29
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias✓ Link82.2%19.2M12.0ViTAE-S-Stage2021-06-07
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet✓ Link82.2%39.2M19.6T2T-ViTt-192021-01-28
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer✓ Link82.2%39.6MEvo-LeViT-384*2021-08-03
Visformer: The Vision-friendly Transformer✓ Link82.2%40.2M4.9Visformer-S2021-04-26
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases✓ Link82.2%48M10ConViT-S+2021-03-19
Patches Are All You Need?✓ Link82.20%51.6MConvMixer-1536/202022-01-24
DeepViT: Towards Deeper Vision Transformer✓ Link82.2%55MDeepVit-L2021-03-22
Bottleneck Transformers for Visual Recognition✓ Link82.2%66.6MSENet-1522021-01-27
Exploring the Limits of Weakly Supervised Pretraining✓ Link82.2%88MResNeXt-101 32x8d2018-05-02
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link82.16%71.71MTransBoost-Swin-T2022-05-26
Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training✓ Link82.13%88.6M18.8ResNeXt-101, 64x4d, S=2(224px)2020-11-30
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link82.11%ToMe-ViT-S2023-08-18
Asymmetric Masked Distillation for Pre-Training Small Foundation Models82.1%22MAMD(ViT-S/16)2023-11-06
Augmenting Convolutional networks with attention-based aggregation✓ Link82.1%25.2MPatchConvNet-S602021-12-27
Vision GNN: An Image is Worth Graph of Nodes✓ Link82.1%27.3M4.6Pyramid ViG-S2022-06-01
A ConvNet for the 2020s✓ Link82.1%29M4.5ConvNeXt-T2022-01-10
Spatial-Channel Token Distillation for Vision MLPs✓ Link82.1%30.1M4.0CycleMLP-B2 + STD2022-07-23
FasterViT: Fast Vision Transformers with Hierarchical Attention✓ Link82.1%31.4M3.3FasterViT-02023-06-09
Incorporating Convolution Designs into Visual Transformers✓ Link82%4.5CeiT-S2021-03-22
Differentiable Model Compression via Pseudo Quantization Noise✓ Link82.0%DIFFQ (λ=1e−2)2021-04-20
From Xception to NEXcepTion: New Design Decisions and Neural Architecture Search✓ Link82%NEXcepTion-S2022-12-16
Global Context Vision Transformers✓ Link82.0%20M2.6GC ViT-XT2022-06-20
Container: Context Aggregation Network✓ Link82%20M3.2Container-Light2021-06-02
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding✓ Link82%24.6M4.86ViL-Small2021-03-29
PVT v2: Improved Baselines with Pyramid Vision Transformer✓ Link82%25.4M4PVTv2-B22021-06-25
Active Token Mixer✓ Link82%27.2M4ActiveMLP-T2022-03-11
Fast Vision Transformers with HiLo Attention✓ Link82%28M3.7LITv2-S2022-05-26
Vision Transformer with Deformable Attention✓ Link82.0%29M4.6DAT-T2022-01-03
81.97%Swin-T (SAMix+DM)
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link81.96%EViT (fuse)2023-08-18
81.92%Swin-T (AutoMix+DM)
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation✓ Link81.9%4.8GTP-LV-ViT-S/P82023-11-06
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet✓ Link81.9%17.0T2T-ViT-192021-01-28
A Fast Knowledge Distillation Framework for Visual Recognition✓ Link81.9%ResNet-101 (224 res, Fast Knowledge Distillation)2021-12-02
Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models✓ Link81.9%Discrete Adversarial Distillation (ViT-B, 224)2023-11-02
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention✓ Link81.9%DeBiFormer-T2024-10-11
Towards Robust Vision Transformer✓ Link81.9%23.3M4.7RVT-S*2021-05-17
Rethinking Spatial Dimensions of Vision Transformers✓ Link81.9%23.5M2.9PiT-S2021-03-30
Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?✓ Link81.9%24.1MsMLPNet-T (ImageNet-1k)2021-09-12
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding✓ Link81.9%79M6.74ViL-Base-W2021-03-29
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles✓ Link81.89%Swin-T+SSA2023-06-02
Attentive Normalization✓ Link81.87%7.51AOGNet-40M-AN2019-08-04
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run81.8%0.726FBNetV52021-11-19
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training✓ Link81.8%0.757NASViT-A52021-09-29
Parametric Contrastive Learning✓ Link81.8%ResNet-2002021-07-26
RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality✓ Link81.8%RepMLPNet-L2562021-12-21
From Xception to NEXcepTion: New Design Decisions and Neural Architecture Search✓ Link81.8%NEXcepTion-TP2022-12-16
Neighborhood Attention Transformer✓ Link81.8%20M2.7NAT-Mini2022-04-14
Dilated Neighborhood Attention Transformer✓ Link81.8%20M2.7DiNAT-Mini2022-09-29
ResNet strikes back: An improved training procedure in timm✓ Link81.8%60.2MResNet-152 (A2)2021-10-01
Kolmogorov-Arnold Transformer✓ Link81.8%86.6M16.87DeiT-B2024-09-16
MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks✓ Link81.72%25.6MMEAL V2 (ResNet-50) (380 res)2020-09-17
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window81.71%21.8M3.6gSwin-T2022-08-24
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run81.7%0.685FBNetV5-A-CLS2021-11-19
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks✓ Link81.7%T2T-ViT-142021-05-05
Learned Queries for Efficient Local Attention✓ Link81.7%16M2.5QnA-ViT-Tiny2021-12-21
AutoFormer: Searching Transformers for Visual Recognition✓ Link81.7%22.9M5.1AutoFormer-small2021-07-01
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism✓ Link81.7%28M4.4Shift-T2022-01-26
Bottleneck Transformers for Visual Recognition✓ Link81.7%33.5M7.3BoTNet T32021-01-27
CvT: Introducing Convolutions to Vision Transformers✓ Link81.6%4.5CvT-132021-03-29
Sharpness-Aware Minimization for Efficiently Improving Generalization✓ Link81.6%ResNet-152 (SAM)2020-10-03
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link81.6%UniRepLKNet-N2023-11-27
Rethinking Local Perception in Lightweight Vision Transformer✓ Link81.6%12.3M2CloFormer-S2023-03-31
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference✓ Link81.6%17.8M1.066LeViT-2562021-04-02
Rethinking Channel Dimensions for Efficient Model Design✓ Link81.6%19M1.5ReXNet_2.02020-07-02
Contextual Transformer Networks for Visual Recognition✓ Link81.6%23.1M4.1SE-CoTNetD-502021-07-26
CoAtNet: Marrying Convolution and Attention for All Data Sizes✓ Link81.6%25M4.2CoAtNet-02021-06-09
Pay Attention to MLPs✓ Link81.6%73M31.6gMLP-B2021-05-17
Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs81.5%0.214CoE-Large + CondConv2021-07-08
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation✓ Link81.5%13.1GTP-DeiT-B/P82023-11-06
From Xception to NEXcepTion: New Design Decisions and Neural Architecture Search✓ Link81.5%NEXcepTion-T2022-12-16
Graph Convolutions Enrich the Self-Attention in Transformers!✓ Link81.5%DeiT-S-24 + GFSA2023-12-07
Self-training with Noisy Student improves ImageNet classification✓ Link81.5%7.8MNoisyStudent (EfficientNet-B1)2019-11-11
TinyViT: Fast Pretraining Distillation for Small Vision Transformers✓ Link81.5%11M2.0TinyViT-11M2022-07-21
Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding✓ Link81.5%17M5.8Transformer local-attention (NesT-T)2021-05-26
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet✓ Link81.5%21.5M9.6T2T-ViT-142021-01-28
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification✓ Link81.5%27.4M5.8CrossViT-152021-03-27
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition✓ Link81.49%42.3MPyConvResNet-1012020-06-20
Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields✓ Link81.484%ViT-B/16 (RPE w/ GAB)2023-05-08
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training✓ Link81.4%0.591NASViT-A42021-09-29
MobileOne: An Improved One millisecond Mobile Backbone✓ Link81.4%2.9MobileOne-S4 (distill)2022-06-08
Rethinking and Improving Relative Position Encoding for Vision Transformer✓ Link81.4%9.770DeiT-S with iRPE-QKV2021-07-29
DeiT III: Revenge of the ViT✓ Link81.4%ViT-S @224 (DeiT III)2022-04-14
BiFormer: Vision Transformer with Bi-Level Routing Attention✓ Link81.4%BiFormer-T (IN1k pretrain)2023-03-15
Bottleneck Transformers for Visual Recognition✓ Link81.4%49.2MSENet-1012021-01-27
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link81.33%GFNet-S2023-08-18
Adversarial AutoAugment81.32%ResNet-200 (Adversarial Autoaugment)2019-12-24
MultiGrain: a unified image embedding for classes and instances✓ Link81.3%MultiGrain PNASNet (300px)2019-02-14
Parametric Contrastive Learning✓ Link81.3%ResNet-1522021-07-26
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases✓ Link81.3%27M5.4ConViT-S2021-03-19
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows✓ Link81.3%29M4.5Swin-T2021-03-25
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures✓ Link81.24%9.5MSimpleNetV1-9m-correct-labels2016-08-22
Res2Net: A New Multi-scale Backbone Architecture✓ Link81.23%Res2Net-1012019-04-02
Shape-Texture Debiased Neural Network Training✓ Link81.2%ResNeXt-101 (Debiased+CutMix)2020-10-12
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link81.2%PVT-S (+MixPro)2023-04-24
81.16%Swin-T (PuzzleMix+DM)
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link81.15%25.56MTransBoost-ResNet50-StrikesBack2022-05-26
ResNeSt: Split-Attention Networks✓ Link81.13%27.5M5.39ResNeSt-502020-04-19
81.12%DeiT-S (SAMix+DM)
Rethinking and Improving Relative Position Encoding for Vision Transformer✓ Link81.1%9.412DeiT-S with iRPE-QK2021-07-29
Graph Convolutions Enrich the Self-Attention in Transformers!✓ Link81.1%DeiT-S-12 + GFSA2023-12-07
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications✓ Link81.1%5.76M0.932CAS-ViT-S2024-08-07
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks✓ Link81.1%12MEfficientNet-B32019-05-28
Visual Attention Network✓ Link81.1%13.9M2.5VAN-B12022-02-20
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network✓ Link81.1%19.6M3.33RevBiFPN-S32022-06-28
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations✓ Link81.1%236MResNet-152x2-SAM2021-06-03
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link81.09%DynamicViT-S2023-08-18
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup✓ Link81.08%44.6MResNet-101 (SAMix)2021-11-30
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training✓ Link81.0%0.528NASViT-A32021-09-29
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias✓ Link81%13.2M6.8ViTAE-13M2021-06-07
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers✓ Link80.98%44.6MResNet-101 (AutoMix)2021-03-24
80.91%DeiT-S (AutoMix+DM)
Parametric Contrastive Learning✓ Link80.9%ResNet-1012021-07-26
Going deeper with Image Transformers✓ Link80.9%12M9.6CAIT-XXS-242021-03-31
Rethinking and Improving Relative Position Encoding for Vision Transformer✓ Link80.9%22M9.318DeiT-S with iRPE-K2021-07-29
Centroid Transformers: Learning to Abstract with Attention80.9%22.3M9.4CentroidViT-S (arXiv, 2021-02)2021-02-17
Aggregated Residual Transformations for Deep Neural Networks✓ Link80.9%83.6M31.594.7ResNeXt-101 64x42016-11-16
AlphaNet: Improved Training of Supernets with Alpha-Divergence✓ Link80.8%0.709AlphaNet-A62021-02-16
Supervised Contrastive Learning✓ Link80.8%ResNet-200 (Supervised Contrastive)2020-04-23
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP✓ Link80.8%11.5M0.555UniNet-B02022-07-12
LocalViT: Bringing Locality to Vision Transformers✓ Link80.8%22.4M4.6LocalViT-S2021-04-12
A Dot Product Attention Free Transformer80.8%23MDAFT-conv (384 heads, 300 epochs)2021-09-29
ResMLP: Feedforward networks for image classification with data-efficient training✓ Link80.8%30M6ResMLP-S242021-05-07
MobileNetV4 -- Universal Models for the Mobile Ecosystem✓ Link80.7%MNv4-Hybrid-M2024-04-16
TinyViT: Fast Pretraining Distillation for Small Vision Transformers✓ Link80.7%5.4M1.3TinyViT-5M-distill (21k)2022-07-21
Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs80.7%95.3M0.194CoE-Large2021-07-08
MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks✓ Link80.67%MEAL V2 (ResNet-50) (224 res)2020-09-17
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link80.66%TokenLearner-ViT-82023-08-18
ResNeSt: Split-Attention Networks✓ Link80.64%27.5M4.34ResNeSt-50-fast2020-04-19
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link80.64%60.19MTransBoost-ResNet1522022-05-26
Fast AutoAugment✓ Link80.6%ResNet-200 (Fast AA)2019-05-01
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link80.6%CaiT-XXS (+MixPro)2023-04-24
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization✓ Link80.6%FastViT-SA122023-03-24
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features✓ Link80.53%ResNeXt-101 (CutMix)2019-05-13
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training✓ Link80.5%0.421NASViT-A22021-09-29
Residual Attention Network for Image Classification✓ Link80.5%Attention-922017-04-23
Neural Architecture Transfer✓ Link80.5%9.1MNAT-M42020-05-12
IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation✓ Link80.5%14.0M2.3IPT-T2022-12-06
GLiT: Neural Architecture Search for Global and Local Image Transformer✓ Link80.5%24.6M4.4GLiT-Smalls2021-07-07
Gated Convolutional Networks with Hybrid Connectivity for Image Classification✓ Link80.5%42.2M7.1HCGNet-C2019-08-26
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition✓ Link80.43%1.7DVT (T2T-ViT-12)2021-05-31
GhostNetV3: Exploring the Training Strategies for Compact Models✓ Link80.4%95.2GhostNetV3 1.6x2024-04-17
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP80.4%14M0.99UniNet-B12021-10-08
ResNet strikes back: An improved training procedure in timm✓ Link80.4%22MDeiT-S (T2)2021-10-01
ResNet strikes back: An improved training procedure in timm✓ Link80.4%25MResNet50 (A1)2021-10-01
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window80.32%15.5M2.3gSwin-VT2022-08-24
AlphaNet: Improved Training of Supernets with Alpha-Divergence✓ Link80.3%0.491AlphaNet-A52021-02-16
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks✓ Link80.3%ResNet-50+AutoDropout+RandAugment2021-01-05
Rethinking Channel Dimensions for Efficient Model Design✓ Link80.3%9.7M0.86ReXNet_1.52020-07-02
80.25%DeiT-S (PuzzleMix+DM)
Attentional Feature Fusion✓ Link80.22%34.7MiAFF-ResNeXt-50-32x4d2020-09-29
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link80.2%UniRepLKNet-P2023-11-27
Fixing the train-test resolution discrepancy: FixEfficientNet✓ Link80.2%5.3M1.60FixEfficientNet-B02020-03-18
A Dot Product Attention Free Transformer80.2%20.3MDAFT-conv (16 heads)2021-09-29
ConvMLP: Hierarchical Convolutional MLPs for Vision✓ Link80.2%42.7MConvMLP-L2021-09-09
A Fast Knowledge Distillation Framework for Visual Recognition✓ Link80.1%ResNet-50 (224 res, Fast Knowledge Distillation)2021-12-02
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space✓ Link80.1%HVT Base2024-09-25
A Dot Product Attention Free Transformer80.1%23MDAFT-conv (384 heads, 200 epochs)2021-09-29
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning✓ Link80.1%55.8MInception ResNet V22016-02-23
Exploring Randomly Wired Neural Networks for Image Recognition✓ Link80.1%61.5M7.9RandWire-WS2019-04-02
Go Wider Instead of Deeper✓ Link80.09%63MWideNet-H2021-07-25
Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs80%0.100CoE-Small + CondConv + PWLU2021-07-08
BasisNet: Two-stage Model Synthesis for Efficient Inference80%0.198BasisNet-MV32021-05-07
AlphaNet: Improved Training of Supernets with Alpha-Divergence✓ Link80.0%0.444AlphaNet-A42021-02-16
MogaNet: Multi-order Gated Aggregation Network✓ Link80%5.2M1.44MogaNet-T (256res)2022-11-07
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference✓ Link80%10.4M0.624LeViT-1922021-04-02
Bottleneck Transformers for Visual Recognition✓ Link80%44.4MResNet-1012021-01-27
Identity Mappings in Deep Residual Networks✓ Link79.9%ResNet-2002016-03-16
MobileNetV4 -- Universal Models for the Mobile Ecosystem✓ Link79.9%MNv4-Conv-M2024-04-16
Designing Network Design Spaces✓ Link79.9%39.2M8RegNetY-8.0GF2020-03-30
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations✓ Link79.9%87MViT-B/16-SAM2021-06-03
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link79.86%44.55MTransBoost-ResNet1012022-05-26
Selective Kernel Networks✓ Link79.81%48.9M8.46SKNet-1012019-03-15
Fixing the train-test resolution discrepancy✓ Link79.8%FixResNet-50 CutMix2019-06-14
Mish: A Self Regularized Non-Monotonic Activation Function✓ Link79.8%CSPResNeXt-50 + Mish2019-08-23
Revisiting a kNN-based Image Classification System with High-capacity Storage79.8%kNN-CLIP2022-04-03
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization✓ Link79.8%FastViT-S122023-03-24
Rethinking Local Perception in Lightweight Vision Transformer✓ Link79.8%7.2M1.1CloFormer-XS2023-03-31
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks✓ Link79.8%9.2M1EfficientNet-B22019-05-28
Global Context Vision Transformers✓ Link79.8%12M2.1GC ViT-XXT2022-06-20
CSPNet: A New Backbone that can Enhance Learning Capability of CNN✓ Link79.8%20.5MCSPResNeXt-50 (Mish+Aug)2019-11-27
A Dot Product Attention Free Transformer79.8%22.6MDAFT-full2021-09-29
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition✓ Link79.74%0.7DVT (T2T-ViT-10)2021-05-31
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training✓ Link79.7%0.309NASViT-A12021-09-29
Generalized Parametric Contrastive Learning✓ Link79.7%GPaCo (ResNet-50)2022-09-26
ResMLP: Feedforward networks for image classification with data-efficient training✓ Link79.7%45MResMLP-362021-05-07
Grafit: Learning fine-grained image representations with coarse labels79.6%Grafit (ResNet-50)2020-11-25
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference✓ Link79.6%8.8M0.376LeViT-1282021-04-02
ResT: An Efficient Transformer for Visual Recognition✓ Link79.6%13.66M1.9ResT-Small2021-05-28
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation✓ Link79.5%3.4GTP-DeiT-S/P82023-11-06
Rethinking Channel Dimensions for Efficient Model Design✓ Link79.5%7.6M0.66ReXNet_1.32020-07-02
Go Wider Instead of Deeper✓ Link79.49%40MWideNet-L2021-07-25
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup✓ Link79.41%25.6MResNet-50 (SAMix)2021-11-30
AlphaNet: Improved Training of Supernets with Alpha-Divergence✓ Link79.4%0.357AlphaNet-A32021-02-16
MultiGrain: a unified image embedding for classes and instances✓ Link79.4%MultiGrain R50-AA-5002019-02-14
Adversarial AutoAugment79.4%ResNet-50 (Adversarial Autoaugment)2019-12-24
ResMLP: Feedforward networks for image classification with data-efficient training✓ Link79.4%ResMLP-242021-05-07
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications✓ Link79.4%5.6M2.6EdgeNeXt-S2022-06-21
Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets✓ Link79.4%11.9M0.591TinyNet (GhostNet-A)2020-10-28
MobileOne: An Improved One millisecond Mobile Backbone✓ Link79.4%14.8M2.978MobileOne-S42022-06-08
Designing Network Design Spaces✓ Link79.4%20.6M4RegNetY-4.0GF2020-03-30
Bottleneck Transformers for Visual Recognition✓ Link79.4%28.02MSENet-502021-01-27
Data-Driven Neuron Allocation for Scale Aggregation Networks✓ Link79.38%11.2ScaleNet-1522019-04-20
LIP: Local Importance-based Pooling✓ Link79.33%42.9MLIP-ResNet-1012019-08-12
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features✓ Link79.3%5.8M1.841MobileViTv3-S2022-09-30
Involution: Inverting the Inherence of Convolution for Visual Recognition✓ Link79.3%34M6.8RedNet-1522021-03-10
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers✓ Link79.25%25.6MResNet-50 (AutoMix)2021-03-24
Self-Knowledge Distillation with Progressive Refinement of Targets✓ Link79.24%PS-KD (ResNet-152 + CutMix)2020-06-22
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era✓ Link79.2%ResNet-101 (JFT-300M Finetuning)2017-07-10
Towards Robust Vision Transformer✓ Link79.2%10.9M1.3RVT-Ti*2021-05-17
Multiscale Deep Equilibrium Models✓ Link79.2%81MMultiscale DEQ (MDEQ-XL)2020-06-15
How to Use Dropout Correctly on Residual Networks with Batch Normalization✓ Link79.152%DenseNet-169 (H4*)2023-02-13
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures✓ Link79.12%5.7MSimpleNetV1-5m-correct-labels2016-08-22
AlphaNet: Improved Training of Supernets with Alpha-Divergence✓ Link79.1%0.317AlphaNet-A22021-02-16
GhostNetV3: Exploring the Training Strategies for Compact Models✓ Link79.1%94.5GhostNetV3 1.3x2024-04-17
Attention Augmented Convolutional Networks✓ Link79.1%AA-ResNet-1522019-04-22
Fixing the train-test resolution discrepancy✓ Link79.1%FixResNet-502019-06-14
MobileOne: An Improved One millisecond Mobile Backbone✓ Link79.1%MobileOne-S2 (distill)2022-06-08
Your Diffusion Model is Secretly a Zero-Shot Classifier✓ Link79.1%Diffusion Classifier2023-03-28
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization✓ Link79.1%FastViT-T122023-03-24
TinyViT: Fast Pretraining Distillation for Small Vision Transformers✓ Link79.1%5.4M1.3TinyViT-5M2022-07-21
Rethinking Spatial Dimensions of Vision Transformers✓ Link79.1%10.6M1.4PiT-XS2021-03-30
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP79.1%11.9M0.56UniNet-B02021-10-08
Involution: Inverting the Inherence of Convolution for Visual Recognition✓ Link79.1%25.6M4.7RedNet-1012021-03-10
Kolmogorov-Arnold Transformer✓ Link79.1%86.6M16.87ViT-B/162024-09-16
Unsupervised Data Augmentation for Consistency Training✓ Link79.04%ResNet-50 (UDA)2019-04-29
Data-Driven Neuron Allocation for Scale Aggregation Networks✓ Link79.03%7.5ScaleNet-1012019-04-20
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link79.03%TransBoost-ResNet502022-05-26
Contextual Convolutional Neural Networks✓ Link79.03%60MCo-ResNet-1522021-08-17
Semi-Supervised Recognition under a Noisy and Fine-grained Dataset✓ Link79.0%5.47MMobileNetV3_large_x1_0_ssld2020-06-18
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network✓ Link79%10.6M1.37RevBiFPN-S22022-06-28
ConvMLP: Hierarchical Convolutional MLPs for Vision✓ Link79%17.4MConvMLP-M2021-09-09
Xception: Deep Learning with Depthwise Separable Convolutions✓ Link79%22.855952M87G0.838GXception2016-10-07
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization✓ Link79%60.5M9.1SpineNet-1432019-12-10
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations✓ Link79%64MMixer-B/8-SAM2021-06-03
Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks✓ Link78.95%InceptionV3 (FRN layer)2019-11-21
Averaging Weights Leads to Wider Optima and Better Generalization✓ Link78.94%ResNet-152 + SWA2018-03-14
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks✓ Link78.92%57.40M10.83ECA-Net (ResNet-152)2019-10-08
AlphaNet: Improved Training of Supernets with Alpha-Divergence✓ Link78.9%0.279AlphaNet-A12021-02-16
MixConv: Mixed Depthwise Convolutional Kernels✓ Link78.9%7.3M0.565MixNet-L2019-07-22
Incorporating Convolution Designs into Visual Transformers✓ Link78.8%3.6CeiT-T (384 finetune res)2021-03-22
78.8%94.4Inception V3
Self-training with Noisy Student improves ImageNet classification✓ Link78.8%5.3MNoisyStudent (EfficientNet-B0)2019-11-11
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks✓ Link78.8%7.8M0.7EfficientNet-B12019-05-28
Bottleneck Transformers for Visual Recognition✓ Link78.8%25.5MResNet-502021-01-27
Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks✓ Link78.798%44.55M7.858SGE-ResNet1012019-05-23
RepVGG: Making VGG-style ConvNets Great Again✓ Link78.78%80.31M18.4RepVGG-B22021-01-11
Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup✓ Link78.76%ResNet-502020-09-15
78.75%SAMix+DM (ResNet-50 RSB A3)
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks✓ Link78.7%ResNet-502021-01-05
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications✓ Link78.7%3.2M0.56CAS-ViT-XS2024-08-07
A Fast Knowledge Distillation Framework for Visual Recognition✓ Link78.7%5M1.2SReT-LT (Fast Knowledge Distillation)2021-12-02
PVT v2: Improved Baselines with Pyramid Vision Transformer✓ Link78.7%13.1M2.1PVTv2-B12021-06-25
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks✓ Link78.65%42.49M7.35ECA-Net (ResNet-101)2019-10-08
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features✓ Link78.64%1.876MobileViTv3-1.02022-09-30
ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer✓ Link78.63%5M3.48EdgeFormer-S2022-03-08
78.62%AutoMix+DM (ResNet-50 RSB A3)
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link78.6%UniRepLKNet-F2023-11-27
RCKD: Response-Based Cross-Task Knowledge Distillation for Pathological Image Analysis78.6%3MCSAT2023-10-29
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link78.60%5.29MTransBoost-EfficientNetB02022-05-26
Visformer: The Vision-friendly Transformer✓ Link78.6%10.3M1.3Visformer-Ti2021-04-26
ResMLP: Feedforward networks for image classification with data-efficient training✓ Link78.6%17.7M3ResMLP-12 (distilled, class-MLP)2021-05-07
RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition✓ Link78.60%52.77MRepMLP-Res502021-05-05
Res2Net: A New Multi-scale Backbone Architecture✓ Link78.59%Res2Net-50-2992019-04-02
Deep Residual Learning for Image Recognition✓ Link78.57%11.3ResNet-1522015-12-10
Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation✓ Link78.5%ResNet-50-DW (Deformable Kernels)2019-10-07
HRFormer: High-Resolution Transformer for Dense Prediction✓ Link78.5%8.0M1.8HRFormer-T2021-10-18
Gated Convolutional Networks with Hybrid Connectivity for Image Classification✓ Link78.5%12.9M2.0HCGNet-B2019-08-26
RepVGG: Making VGG-style ConvNets Great Again✓ Link78.5%55.77M11.3RepVGG-B2g42021-01-11
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition✓ Link78.48%0.6DVT (T2T-ViT-7)2021-05-31
SRM : A Style-based Recalibration Module for Convolutional Neural Networks✓ Link78.47%SRM-ResNet-1012019-03-26
Averaging Weights Leads to Wider Optima and Better Generalization✓ Link78.44%DenseNet-161 + SWA2018-03-14
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link78.42%CoaT-Ti2023-08-18
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run78.4%0.280FBNetV5-AC-CLS2021-11-19
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features✓ Link78.4%ResNet-50 (CutMix)2019-05-13
Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels✓ Link78.4%4.8MReXNet_1.0-relabel2021-01-13
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer✓ Link78.4%5.6MMobileViT-S2021-10-05
Involution: Inverting the Inherence of Convolution for Visual Recognition✓ Link78.4%15.5M2.7RedNet-502021-03-10
78.36%ResNet-50 (SAMix+DM)
DropBlock: A regularization method for convolutional networks✓ Link78.35%ResNet-50 + DropBlock (0.9 kp, 0.1 label smoothing)2018-10-30
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link78.34%Poly-SA-ViT-S2023-08-18
CondConv: Conditionally Parameterized Convolutions for Efficient Inference✓ Link78.3%0.826EfficientNet-B0 (CondConv)2019-04-10
Deep Residual Learning for Image Recognition✓ Link78.25%40M7.6ResNet-1012015-12-10
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training✓ Link78.2%0.208NASViT-A02021-09-29
MultiGrain: a unified image embedding for classes and instances✓ Link78.2%MultiGrain R50-AA-2242019-02-14
Vision GNN: An Image is Worth Graph of Nodes✓ Link78.2%10.7M1.7Pyramid ViG-Ti2022-06-01
LocalViT: Bringing Locality to Vision Transformers✓ Link78.2%13.5M4.8LocalViT-PVT2021-04-12
78.15%ResNet-50 (AutoMix+DM)
78.15%PuzzleMix+DM (ResNet-50 RSB A3)
LIP: Local Importance-based Pooling✓ Link78.15%25.8MResNet-50 (LIP Bottleneck-256)2019-08-12
Wide Residual Networks✓ Link78.1%WRN-50-2-bottleneck2016-05-23
Separable Self-attention for Mobile Vision Transformers✓ Link78.1%4.9M1.8MobileViTv2-1.02022-06-06
MobileOne: An Improved One millisecond Mobile Backbone✓ Link78.1%10.1M1.896MobileOne-S32022-06-08
ResNet strikes back: An improved training procedure in timm✓ Link78.1%25MResNet50 (A3)2021-10-01
Zen-NAS: A Zero-Shot NAS for High-Performance Deep Image Recognition✓ Link78%5.7M0.820ZenNet-400M-SE2021-02-01
Designing Network Design Spaces✓ Link78%11.2M1.6RegNetY-1.6GF2020-03-30
Scalable Vision Transformers with Hierarchical Pooling✓ Link78.00%21.74M2.4HVT-S-12021-03-19
Perceiver: General Perception with Iterative Attention✓ Link78%44.9M707.2Perceiver (FF)2021-03-04
Rethinking Channel Dimensions for Efficient Model Design✓ Link77.9%4.8M0.40ReXNet_1.02020-07-02
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias✓ Link77.9%6.5M4ViTAE-6M2021-06-07
Densely Connected Convolutional Networks✓ Link77.85%DenseNet-2642016-08-25
AlphaNet: Improved Training of Supernets with Alpha-Divergence✓ Link77.8%0.203AlphaNet-A02021-02-16
Data-Driven Neuron Allocation for Scale Aggregation Networks✓ Link77.8%3.8ScaleNet-502019-04-20
ResMLP: Feedforward networks for image classification with data-efficient training✓ Link77.8%15.4MResMLP-S122021-05-07
77.71%ResNet-50 (PuzzleMix+DM)
Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets✓ Link77.7%5.1M0.339TinyNet-A + RA2020-10-28
Fast AutoAugment✓ Link77.6%ResNet-50 (Fast AA)2019-05-01
Sliced Recursive Transformer✓ Link77.6%4.8M1.1SReT-T2021-11-09
Involution: Inverting the Inherence of Convolution for Visual Recognition✓ Link77.6%12.4M2.2RedNet-382021-03-10
Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks✓ Link77.584%25.56M4.127SGE-ResNet502019-05-23
Go Wider Instead of Deeper✓ Link77.54%29MWideNet-B2021-07-25
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks✓ Link77.5%EfficientNet-B02021-01-05
Adaptively Connected Neural Networks✓ Link77.5%29.38MACNet (ResNet-50)2019-04-07
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks✓ Link77.48%24.37M3.86ECA-Net (ResNet-50)2019-10-08
Densely Connected Convolutional Networks✓ Link77.42%DenseNet-2012016-08-25
MobileOne: An Improved One millisecond Mobile Backbone✓ Link77.4%7.8M1.299MobileOne-S22022-06-08
Expeditious Saliency-guided Mix-up through Random Gradient Thresholding✓ Link77.39%R-Mix (ResNet-50)2022-12-09
Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks✓ Link77.21%ResnetV2 50 (FRN layer)2019-11-21
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run77.2%0.215FBNetV5-AR-CLS2021-11-19
MogaNet: Multi-order Gated Aggregation Network✓ Link77.2%3M1.04MogaNet-XT (256res)2022-11-07
Rethinking Channel Dimensions for Efficient Model Design✓ Link77.2%4.1M0.35ReXNet_0.92020-07-02
Deep Polynomial Neural Networks✓ Link77.17%Prodpoly2020-06-20
Bag of Tricks for Image Classification with Convolutional Neural Networks✓ Link77.16%25MResNet-50-D2018-12-04
What do Deep Networks Like to See?✓ Link77.12%Inception v32018-03-22
GhostNetV3: Exploring the Training Strategies for Compact Models✓ Link77.1%93.3GhostNetV3 1.0x2024-04-17
Meta Knowledge Distillation77.1%MKD ViT-T2022-02-16
GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet77.1%6.5M0.366GreedyNAS-A2020-03-25
Bias Loss for Mobile Neural Networks✓ Link77.1%7.1M0.364SkipblockNet-L2021-07-23
Compress image to patches for Vision Transformer✓ Link77%6.442CI2P-ViT2025-02-14
Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks✓ Link77.0%SSAL-Resnet502021-01-07
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition✓ Link77%UniRepLKNet-A2023-11-27
Rethinking Local Perception in Lightweight Vision Transformer✓ Link77%4.2M0.6CloFormer-XXS2023-03-31
MixConv: Mixed Depthwise Convolutional Kernels✓ Link77%5.0M0.360MixNet-M2019-07-22
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective✓ Link76.91%ResNet50 (FSGDM)2024-11-29
SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search✓ Link76.9%6.7M0.730SCARLET-A2019-08-16
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link76.81%5.48MTransBoost-MobileNetV3-L2022-05-26
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias✓ Link76.8%4.8M4.6ViTAE-T-Stage2021-06-07
GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet76.8%5.2M0.324GreedyNAS-B2020-03-25
ConvMLP: Hierarchical Convolutional MLPs for Vision✓ Link76.8%9MConvMLP-S2021-09-09
Learning Visual Representations for Transfer Learning by Suppressing Texture✓ Link76.71%Perona Malik (Perona and Malik, 1990)2020-11-03
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link76.7%PVT-T (+MixPro)2023-04-24
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features✓ Link76.7%2.5M0.927MobileViTv3-XS2022-09-30
MnasNet: Platform-Aware Neural Architecture Search for Mobile✓ Link76.7%5.2M0.8060.0403GMnasNet-A32018-07-31
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding✓ Link76.7%6.7M1.3ViL-Tiny-RPB2021-03-29
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases✓ Link76.7%10M2ConViT-Ti+2021-03-19
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link76.70%21.8MTransBoost-ResNet342022-05-26
LIP: Local Importance-based Pooling✓ Link76.64%8.7MLIP-DenseNet-BC-1212019-08-12
X-volution: On the unification of convolution and self-attention76.6%ResNet-50 (X-volution, stage3)2021-06-04
MUXConv: Information Multiplexing in Convolutional Neural Networks✓ Link76.6%4.0M0.636MUXNet-l2020-03-31
Training data-efficient image transformers & distillation through attention✓ Link76.6%5MDeiT-B2020-12-23
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features✓ Link76.55%3M1.064MobileViTv3-0.752022-09-30
MLP-Mixer: An all-MLP Architecture for Vision✓ Link76.44%46MMixer-B/162021-05-04
Perceiver: General Perception with Iterative Attention✓ Link76.4%Perceiver2021-03-04
Incorporating Convolution Designs into Visual Transformers✓ Link76.4%6.4M1.2CeiT-T2021-03-22
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup✓ Link76.35%21.8MResNet-34 (SAMix)2021-11-30
76.3%93.2VGG
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks✓ Link76.3%5.3M0.39EfficientNet-B02019-05-28
Designing Network Design Spaces✓ Link76.3%6.3M0.8RegNetY-800MF2020-03-30
SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search✓ Link76.3%6.5M0.658SCARLET-B2019-08-16
GLiT: Neural Architecture Search for Global and Local Image Transformer✓ Link76.3%7.2M1.4GLiT-Tinys2021-07-07
Densely Connected Convolutional Networks✓ Link76.2%DenseNet-1692016-08-25
GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet76.2%4.7M0.284GreedyNAS-C2020-03-25
Bias Loss for Mobile Neural Networks✓ Link76.2%5.5M0.246SkipblockNet-M2021-07-23
A Simple Episodic Linear Probe Improves Visual Recognition in the Wild✓ Link76.13%ELP (naive ResNet50)2022-01-01
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers✓ Link76.1%21.8MResNet-34 (AutoMix)2021-03-24
A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes75.92%ResNet-50 MLPerf v0.7 - 2512 steps2021-02-12
Densely Connected Search Space for More Flexible Neural Architecture Search✓ Link75.9%DenseNAS-A2019-06-23
MobileOne: An Improved One millisecond Mobile Backbone✓ Link75.9%4.8M0.825MobileOne-S12022-06-08
MoGA: Searching Beyond MobileNetV3✓ Link75.9%5.1M0.6080.0304GMoGA-A2019-08-04
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network✓ Link75.9%5.11M0.62RevBiFPN-S12022-06-28
LocalViT: Bringing Locality to Vision Transformers✓ Link75.9%6.3M1.4LocalViT-TNT2021-04-12
Semantic-Aware Local-Global Vision Transformer75.9%6.5MSALG-ST2022-11-27
Involution: Inverting the Inherence of Convolution for Visual Recognition✓ Link75.9%9.2M1.7RedNet-262021-03-10
FractalNet: Ultra-Deep Neural Networks without Residuals✓ Link75.88%FractalNet-342016-05-24
MixConv: Mixed Depthwise Convolutional Kernels✓ Link75.8%4.1M0.256MixNet-S2019-07-22
An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution✓ Link75.74%CoordConv ResNet-502018-07-09
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference✓ Link75.7%4.7M0.288LeViT-128S2021-04-02
GhostNet: More Features from Cheap Operations✓ Link75.7%7.3M0.226GhostNet ×1.32019-11-27
Local Relation Networks for Image Recognition✓ Link75.7%14.7M2.6LR-Net-262019-04-25
Spatial-Channel Token Distillation for Vision MLPs✓ Link75.7%22.2M4.3Mixer-S16 + STD2022-07-23
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures✓ Link75.66%3MSimpleNetV1-small-075-correct-labels2016-08-22
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization✓ Link75.6%FastViT-T82023-03-24
Separable Self-attention for Mobile Vision Transformers✓ Link75.6%2.9M1.0MobileViTv2-0.752022-06-06
MnasNet: Platform-Aware Neural Architecture Search for Mobile✓ Link75.6%4.8M0.680MnasNet-A22018-07-31
SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search✓ Link75.6%6M0.560SCARLET-C2019-08-16
Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples✓ Link75.5%PAWS (ResNet-50, 10% labels)2021-04-28
Designing Network Design Spaces✓ Link75.5%6.1M0.6RegNetY-600MF2020-03-30
ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design✓ Link75.4%0.597ShuffleNet V22018-07-30
Visual Attention Network✓ Link75.4%4.1M0.9VAN-B02022-02-20
AsymmNet: Towards ultralight convolution neural networks using asymmetrical bottlenecks✓ Link75.4%5.99M0.4338AsymmNet-Large ×1.02021-04-15
FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search✓ Link75.34%4.6M0.776FairNAS-A2019-07-03
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias✓ Link75.3%3.0ViTAE-T2021-06-07
MUXConv: Information Multiplexing in Convolutional Neural Networks✓ Link75.3%3.4M0.436MUXNet-m2020-03-31
Deep Residual Learning for Image Recognition✓ Link75.3%25M3.8ResNet-502015-12-10
MnasNet: Platform-Aware Neural Architecture Search for Mobile✓ Link75.2%3.9M0.624MnasNet-A12018-07-31
Searching for MobileNetV3✓ Link75.2%5.4M0.438MobileNet V3-Large 1.02019-05-06
DiCENet: Dimension-wise Convolutions for Efficient Networks✓ Link75.1%0.553DiCENet2019-06-08
MultiGrain: a unified image embedding for classes and instances✓ Link75.1%MultiGrain NASNet-A-Mobile (350px)2019-02-14
FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search✓ Link75.10%4.5M0.690FairNAS-B2019-07-03
X-volution: On the unification of convolution and self-attention75%ResNet-34 (X-volution, stage3)2021-06-04
GhostNet: More Features from Cheap Operations✓ Link75%13M2.2Ghost-ResNet-50 (s=2)2019-11-27
Densely Connected Convolutional Networks✓ Link74.98%DenseNet-1212016-08-25
Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours✓ Link74.96%Single-Path NAS2019-04-05
WaveMix: A Resource-efficient Neural Network for Image Analysis✓ Link74.93%WaveMix-192/16 (level 3)2022-05-28
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet✓ Link74.9%FF2021-05-06
FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search✓ Link74.9%5.5M0.375FBNet-C2018-12-09
ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network✓ Link74.9%5.9M0.602ESPNetv22018-11-28
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer✓ Link74.8%2.3M0.7MobileViT-XS2021-10-05
LocalViT: Bringing Locality to Vision Transformers✓ Link74.8%5.9M1.3LocalViT-T2021-04-12
Exploring Randomly Wired Neural Networks for Image Recognition✓ Link74.7%5.6M0.583RandWire-WS (small)2019-04-02
AutoFormer: Searching Transformers for Visual Recognition✓ Link74.7%5.7M1.3AutoFormer-tiny2021-07-01
MobileNetV2: Inverted Residuals and Linear Bottlenecks✓ Link74.7%6.9M1.170MobileNetV2 (1.4)2018-01-13
FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search✓ Link74.69%4.4M0.642FairNAS-C2019-07-03
Rethinking Channel Dimensions for Efficient Model Design✓ Link74.6%2.7MReXNet_0.62020-07-02
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware✓ Link74.6%4.0MProxyless2018-12-02
Rethinking Spatial Dimensions of Vision Transformers✓ Link74.6%4.9M0.7PiT-Ti2021-03-30
Dynamic Convolution: Attention over Convolution Kernels✓ Link74.4%11.1M0.626DY-MobileNetV2 ×1.02019-12-07
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures✓ Link74.17%9.5MSimpleNetV1-9m2016-08-22
Designing Network Design Spaces✓ Link74.1%4.3M0.4RegNetY-400MF2020-03-30
GhostNet: More Features from Cheap Operations✓ Link74.1%6.5M1.2Ghost-ResNet-50 (s=4)2019-11-27
Sliced Recursive Transformer✓ Link74.0%4M0.7SReT-ExT2021-11-09
GhostNet: More Features from Cheap Operations✓ Link73.9%5.2M0.141GhostNet ×1.02019-11-27
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer✓ Link73.8%DeiT-T (+MixPro)2023-04-24
MobileNetV4 -- Universal Models for the Mobile Ecosystem✓ Link73.8%MNv4-Conv-S2024-04-16
Rethinking and Improving Relative Position Encoding for Vision Transformer✓ Link73.7%6M2.568DeiT-Ti with iRPE-K2021-07-29
Distilled Gradual Pruning with Pruned Fine-tuning✓ Link73.66%2.56M0.4DGPPF-ResNet502024-02-15
TransBoost: Improving the Best ImageNet Performance using Deep Transduction✓ Link73.36%11.69MTransBoost-ResNet182022-05-26
What's Hidden in a Randomly Weighted Neural Network?✓ Link73.3%20.6MWide ResNet-50 (edge-popup)2019-11-29
MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks✓ Link73.19%ResNet-18 (MEAL V2)2020-09-17
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases✓ Link73.1%6M1ConViT-Ti2021-03-19
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network✓ Link72.8%3.42M0.31RevBiFPN-S02022-06-28
Dynamic Convolution: Attention over Convolution Kernels✓ Link72.8%7M0.435DY-MobileNetV2 ×0.752019-12-07
Dynamic Convolution: Attention over Convolution Kernels✓ Link72.7%42.7M3.7DY-ResNet-182019-12-07
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks✓ Link72.56%3.34M0.320ECA-Net (MobileNetV2)2019-10-08
Compact Global Descriptor for Neural Networks✓ Link72.56%4.26M1.198MobileNet-224 (CGD)2019-07-23
MobileOne: An Improved One millisecond Mobile Backbone✓ Link72.5%2.1M0.275MobileOne-S0 (distill)2022-06-08
LocalViT: Bringing Locality to Vision Transformers✓ Link72.5%4.3M1.2LocalViT-T2T2021-04-12
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features✓ Link72.33%1.4M0.481MobileViTv3-0.52022-09-30
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup✓ Link72.33%11.7MResNet-18 (SAMix)2021-11-30
On the adequacy of untuned warmup for adaptive optimization✓ Link72.1%ResNet-502019-10-09
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers✓ Link72.05%11.7MResNet-18 (AutoMix)2021-03-24
MobileNetV2: Inverted Residuals and Linear Bottlenecks✓ Link72%3.4M0.600MobileNetV22018-01-13
QuantNet: Learning to Quantize by Learning within Fully Differentiable Framework71.97%Ours2020-09-10
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures✓ Link71.94%5.7MSimpleNetV1-5m2016-08-22
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation✓ Link71.71%ResNet-18 (PAD-L2 w/ ResNet-34 teacher)2020-11-25
MUXConv: Information Multiplexing in Convolutional Neural Networks✓ Link71.6%2.4M0.234MUXNet-s2020-03-31
Augmenting Deep Classifiers with Polynomial Neural Networks✓ Link71.6%11.51MPDC2021-04-16
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation✓ Link71.56%ResNet-18 (FT w/ ResNet-34 teacher)2020-11-25
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers✓ Link71.53%EfficientFormer-V2-S02023-08-18
MobileOne: An Improved One millisecond Mobile Backbone✓ Link71.4%2.1MMobileOne-S02022-06-08
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation✓ Link71.37%ResNet-18 (KD w/ ResNet-34 teacher)2020-11-25
Differentiable Spike: Rethinking Gradient-Descent for Training Spiking Neural Networks71.24%Dspike (VGG-16)2021-12-01
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications✓ Link71.2%1.3M0.522EdgeNeXt-XXS2022-06-21
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation✓ Link71.08%ResNet-18 (L2 w/ ResNet-34 teacher)2020-11-25
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features✓ Link70.98%1.2M0.289MobileViTv3-XXS2022-09-30
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation✓ Link70.93%ResNet-18 (CRD w/ ResNet-34 teacher)2020-11-25
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices✓ Link70.9%ShuffleNet2017-07-04
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications✓ Link70.6%1.138MobileNet-224 ×1.252017-04-17
70.54%PSN (SEW ResNet-34)
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation✓ Link70.52%ResNet-18 (tf-KD w/ ResNet-18 teacher)2020-11-25
PVT v2: Improved Baselines with Pyramid Vision Transformer✓ Link70.5%3.4M0.6PVTv2-B02021-06-25
Gated Attention Coding for Training High-performance and Efficient Spiking Neural Networks✓ Link70.42%GAC-SNN MS-ResNet-342023-08-12
Separable Self-attention for Mobile Vision Transformers✓ Link70.2%1.4M0.5MobileViTv2-0.52022-06-06
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation✓ Link70.09%ResNet-18 (SSKD w/ ResNet-34 teacher)2020-11-25
Dynamic Convolution: Attention over Convolution Kernels✓ Link69.7%4.8M0.137DY-MobileNetV3-Small2019-12-07
Scalable Vision Transformers with Hierarchical Pooling✓ Link69.64%5.74M0.64HVT-Ti-12021-03-19
GhostNetV3: Exploring the Training Strategies for Compact Models✓ Link69.4%88.5GhostNetV3 0.5x2024-04-17
Dynamic Convolution: Attention over Convolution Kernels✓ Link69.4%4M0.203DY-MobileNetV2 ×0.52019-12-07
AsymmNet: Towards ultralight convolution neural networks using asymmetrical bottlenecks✓ Link69.2%2.8M0.1344AsymmNet-Large ×0.52021-04-15
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures✓ Link69.11%1.5MSimpleNetV1-small-05-correct-labels2016-08-22
Correlated Input-Dependent Label Noise in Large-Scale Image Classification68.6%Heteroscedastic (InceptionResNet-v2)2021-05-19
AsymmNet: Towards ultralight convolution neural networks using asymmetrical bottlenecks✓ Link68.4%3.1M0.1154AsymmNet-Small ×1.02021-04-15
FireCaffe: near-linear acceleration of deep neural network training on compute clusters68.3%FireCaffe (GoogLeNet)2015-10-31
Graph-RISE: Graph-Regularized Image Semantic Embedding✓ Link68.29%Graph-RISE (40M)2019-02-14
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures✓ Link68.15%3MSimpleNetV1-small-0752016-08-22
"BNN - BN = ?": Training Binary Neural Networks without Batch Normalization✓ Link68.0%ReActNet-A (BN-Free)2021-04-16
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective✓ Link67.74%ResNet34 (FSGDM)2024-11-29
Dynamic Convolution: Attention over Convolution Kernels✓ Link67.7%18.6M1.82DY-ResNet-102019-12-07
WaveMix-Lite: A Resource-efficient Neural Network for Image Analysis✓ Link67.7%32.4MWaveMixLite-256/242022-10-13
67.63%PSN (SEW ResNet-18)
MUXConv: Information Multiplexing in Convolutional Neural Networks✓ Link66.7%1.8M0.132MUXNet-xs2020-03-31
Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples✓ Link66.5%PAWS (ResNet-50, 1% labels)2021-04-28
GhostNet: More Features from Cheap Operations✓ Link66.2%2.6M0.042GhostNet ×0.52019-11-27
66.04%86.76OverFeat
Distilled Gradual Pruning with Pruned Fine-tuning✓ Link65.59%1.03M0.1DGPPF-MobileNetV22024-02-15
Distilled Gradual Pruning with Pruned Fine-tuning✓ Link65.22%1.15M0.2DGPPF-ResNet182024-02-15
Online Training Through Time for Spiking Neural Networks✓ Link65.15%OTTT2022-10-09
Dynamic Convolution: Attention over Convolution Kernels✓ Link64.9%2.8M0.124DY-MobileNetV2 ×0.352019-12-07
63.3%62M84.6AlexNet
Balanced Binary Neural Networks with Gated Residual✓ Link62.6%BBG (ResNet-34)2019-09-26
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures✓ Link61.52%1.5MSimpleNetV1-small-052016-08-22
Balanced Binary Neural Networks with Gated Residual✓ Link59.4%BBG (ResNet-18)2019-09-26
FireCaffe: near-linear acceleration of deep neural network training on compute clusters58.9%FireCaffe (AlexNet)2015-10-31
0/1 Deep Neural Networks via Block Coordinate Descent38.3%HMAX2022-06-19
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale✓ Link24%ViT-Large2020-10-22
Escaping the Big Data Paradigm with Compact Transformers✓ Link22.36M11.06CCT-14/7x22021-04-12
MambaVision: A Hybrid Mamba-Transformer Vision Backbone✓ Link241.5MMambaVision-L22024-07-10
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities✓ Link1520MONE-PEACE2023-05-18
Multimodal Autoregressive Pre-training of Large Vision Encoders✓ Link2700MAIMv2-2B2024-11-21
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions✓ Link3000MInternImage-DCNv3-G (M3I Pre-training)2022-11-10
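The Top-1 and Top-5 accuracy columns above share one definition: a prediction counts as top-k correct when the true label appears among the k highest-scoring classes. A minimal sketch of that metric (a toy illustration with made-up scores, not tied to any model in the table):

```python
def top_k_accuracy(logits, labels, k=1):
    """logits: per-sample lists of class scores; labels: true class indices."""
    correct = 0
    for scores, label in zip(logits, labels):
        # indices of the k largest scores
        topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)

# toy example: 3 samples, 4 classes
logits = [[0.1, 0.7, 0.1, 0.1],   # argmax = class 1
          [0.5, 0.1, 0.3, 0.1],   # argmax = class 0
          [0.2, 0.3, 0.4, 0.1]]   # argmax = class 2
labels = [1, 2, 2]
print(top_k_accuracy(logits, labels, k=1))  # 2/3: sample 2 misses at k=1
print(top_k_accuracy(logits, labels, k=2))  # 1.0: class 2 is sample 2's runner-up
```

On ImageNet the same computation runs over the 50,000-image validation set with 1,000 classes, which is why Top-5 figures in the table are consistently higher than Top-1.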