Paper | Code | Top-1 Accuracy | Params | GFLOPs | | Top-5 Accuracy | | Model | Date
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 91.0% | 2100M | | | | | CoCa (finetuned) | 2022-05-04 |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 90.98% | 2440M | | | | | Model soups (BASIC-L) | 2022-03-10 |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 90.94% | 1843M | | | | | Model soups (ViT-G/14) | 2022-03-10 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 90.4% | 1437M | 1038 | | | | DaViT-G | 2022-04-07 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 90.2% | 362M | 334 | | | | DaViT-H | 2022-04-07 |
Meta Pseudo Labels | ✓ Link | 90.2% | 480M | | 95040G | 98.8% | | Meta Pseudo Labels (EfficientNet-L2) | 2020-03-23 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 90.17% | 3000M | | | | | SwinV2-G | 2021-11-18 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 90.1% | 6500M | | | | | MAWS (ViT-6.5B) | 2023-03-23 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 90.05% | 893M | | | 99.02% | | Florence-CoSwin-H | 2021-11-22 |
Meta Pseudo Labels | ✓ Link | 90% | 390M | | | | | Meta Pseudo Labels (EfficientNet-B6-Wide) | 2020-03-23 |
Reversible Column Networks | ✓ Link | 90.0% | 2158M | | | | | RevCol-H | 2022-12-22 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 89.8% | 2000M | | | | | MAWS (ViT-2B) | 2023-03-23 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 89.7% | 1000M | | | | | EVA | 2022-11-14 |
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | ✓ Link | 89.6% | | | | | | M3I Pre-training (InternImage-H) | 2022-11-17 |
Scaling Vision Transformers to 22 Billion Parameters | ✓ Link | 89.6% | 307M | | | | | ViT-L/16 (384res, distilled from ViT-22B) | 2023-02-10 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 89.6% | 1080M | 1478 | | | | InternImage-H | 2022-11-10 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 89.53% | | | | | | MaxViT-XL (512res, JFT) | 2022-04-04 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 89.5% | | | | | | AIMv2-3B (448 res) | 2024-11-21 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 89.5% | 650M | | | | | MAWS (ViT-H) | 2023-03-23 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 89.41% | | | | | | MaxViT-L (512res, JFT) | 2022-04-04 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 89.36% | | | | | | MaxViT-XL (384res, JFT) | 2022-04-04 |
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 89.3% | | | | | | OmniVec2 | 2024-01-01 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 89.2% | 527M | 367 | | | | NFNet-F4+ | 2021-02-11 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 89.12% | | | | | | MaxViT-L (384res, JFT) | 2022-04-04 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 89.1% | 483.2M | 648.5 | | | | MOAT-4 22K+1K | 2022-10-04 |
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | ✓ Link | 89.0% | 307M | | | | | FD (CLIP ViT-L-336) | 2022-05-27 |
Differentially Private Image Classification from Features | ✓ Link | 88.9% | | | | | | Last Layer Tuning with Newton Step (ViT-G/14) | 2022-11-24 |
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ Link | 88.87% | 460M | | | | | TokenLearner L/8 (24+11) | 2021-06-21 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 88.82% | | | | | | MaxViT-B (512res, JFT) | 2022-04-04 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 88.8% | | | | | | MAWS (ViT-L) | 2023-03-23 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 88.8% | 667M | 763.5 | | | | MViTv2-H (512 res, ImageNet-21k pretrain) | 2021-12-02 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 88.7% | | | | | | MaxViT-XL (512res, 21K) | 2022-04-04 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 88.69% | | | | | | MaxViT-B (384res, JFT) | 2022-04-04 |
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ✓ Link | 88.64% | 480M | | | | | ALIGN (EfficientNet-L2) | 2021-02-11 |
Sharpness-Aware Minimization for Efficiently Improving Generalization | ✓ Link | 88.61% | 480M | | | | | EfficientNet-L2-475 (SAM) | 2020-10-03 |
Scaling Vision Transformers to 22 Billion Parameters | ✓ Link | 88.6% | 86M | | | | | ViT-B/16 | 2023-02-10 |
BEiT: BERT Pre-Training of Image Transformers | ✓ Link | 88.60% | 331M | | | | | BEiT-L (ViT; ImageNet-22K pretrain) | 2021-06-15 |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 88.6% | 633.5M | 1018.8 | | | | SWAG (ViT-H/14) | 2022-01-20 |
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ✓ Link | 88.55% | | | | | | ViT-H/14 | 2020-10-22 |
CoAtNet: Marrying Convolution and Attention for All Data Sizes | ✓ Link | 88.52% | | 114 | | | | CoAtNet-3 @384 | 2021-06-09 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 88.51% | | | | | | MaxViT-XL (384res, 21K) | 2022-04-04 |
Reproducible scaling laws for contrastive language-image learning | ✓ Link | 88.5% | | | | | | OpenCLIP ViT-H/14 | 2022-12-14 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 88.5% | | | | | | AIMv2-3B | 2024-11-21 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 88.5% | 480M | 585 | | | | FixEfficientNet-L2 | 2020-03-18 |
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond | ✓ Link | 88.5% | 644M | | | | | ViTAE-H + MAE (448) | 2022-02-21 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 88.46% | | | | | | MaxViT-L (512res, 21K) | 2022-04-04 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 88.4% | 218M | 140.7 | | | | MViTv2-L (384 res, ImageNet-21k pretrain) | 2021-12-02 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 88.4% | 480M | | 51800G | | | NoisyStudent (EfficientNet-L2) | 2019-11-11 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 88.38% | | | | | | MaxViT-B (512res, 21K) | 2022-04-04 |
Differentiable Top-k Classification Learning | ✓ Link | 88.37% | | | | | | Top-k DiffSortNets (EfficientNet-L2) | 2022-06-15 |
A ConvNet for the 2020s | ✓ Link | 88.36% | 1827M | | | | | Adlik-ViT-SG+Swin_large+Convnext_xlarge(384) | 2022-01-10 |
Scaling Vision with Sparse Mixture of Experts | ✓ Link | 88.36% | 7200M | | | | | V-MoE-H/14 (Every-2) | 2021-06-10 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 88.32% | | | | | | MaxViT-L (384res, 21K) | 2022-04-04 |
Unicom: Universal and Compact Representation Learning for Image Retrieval | ✓ Link | 88.3% | | | | | | Unicom (ViT-L/14@336px) (Finetuned) | 2023-04-12 |
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers | ✓ Link | 88.3% | | | | | | PeCo (ViT-H, 448) | 2021-11-24 |
Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion | ✓ Link | 88.21% | | | | | | DFN-5B H/14-378 + PrefixedIter Decoder | 2024-07-15 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 88.2% | | | | | | dBOT ViT-H (CLIP as Teacher) | 2022-09-08 |
MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ✓ Link | 88.1% | | 489.1 | | | | MambaVision-L3 | 2024-07-10 |
MetaFormer Baselines for Vision | ✓ Link | 88.1% | 99M | 72.2 | | | | CAFormer-B36 (384 res, 21K) | 2022-10-24 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 88.1% | 1200M | | | | | AIMv2-1B | 2024-11-21 |
Scaling Vision with Sparse Mixture of Experts | ✓ Link | 88.08% | 656M | | | | | ViT-H/14 | 2021-06-10 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 88.0% | | | | | | ViT-H@224 (cosub) | 2022-12-09 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 88% | | | | | | UniRepLKNet-XL++ | 2023-11-27 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 88% | 335M | 163 | | | | InternImage-XL | 2022-11-10 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 88% | 667M | 120.6 | | | | MViTv2-H (ImageNet-21k pretrain) | 2021-12-02 |
MLP-Mixer: An all-MLP Architecture for Vision | ✓ Link | 87.94% | | | | | | Mixer-H/14 (JFT-300M pre-train) | 2021-05-04 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 87.9% | | | | | | UniRepLKNet-L++ | 2023-11-27 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 87.8% | | | | | | dBOT ViT-L (CLIP as Teacher) | 2022-09-08 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 87.8% | 181M | 102 | | | | MogaNet-XL (384res) | 2022-11-07 |
Visual Attention Network | ✓ Link | 87.8% | 200M | 114.3 | | | | VAN-B6 (22K, 384res) | 2022-02-20 |
Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs | ✓ Link | 87.8% | 335M | 128.7 | | | | RepLKNet-XL | 2022-03-13 |
A ConvNet for the 2020s | ✓ Link | 87.8% | 350M | 179 | | | | ConvNeXt-XL (ImageNet-22k) | 2022-01-10 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 87.8% | 656M | | | | | MAE (ViT-H, 448) | 2021-11-11 |
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ✓ Link | 87.76% | | | | | | ViT-L/16 | 2020-10-22 |
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions | ✓ Link | 87.7% | | 101.8 | | | | HorNet-L (GF) | 2022-07-28 |
CvT: Introducing Convolutions to Vision Transformers | ✓ Link | 87.7% | | | | | | CvT-W24 (384 res, ImageNet-22k pretrain) | 2021-03-29 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 87.7% | 223M | 108 | | | | InternImage-L | 2022-11-10 |
CoAtNet: Marrying Convolution and Attention for All Data Sizes | ✓ Link | 87.6% | | | | | | CoAtNet-3 (21k) | 2021-06-09 |
MetaFormer Baselines for Vision | ✓ Link | 87.6% | 100M | 66.5 | | | | ConvFormer-B36 (384 res, 21K) | 2022-10-24 |
Big Transfer (BiT): General Visual Representation Learning | ✓ Link | 87.54% | | | | 98.46% | | BiT-L (ResNet) | 2019-12-24 |
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers | ✓ Link | 87.5% | | | | | | PeCo (ViT-H, 224) | 2021-11-24 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 87.5% | | | | | | ViT-L@224 (cosub) | 2022-12-09 |
MetaFormer Baselines for Vision | ✓ Link | 87.5% | 56M | 42 | | | | CAFormer-M36 (384 res, 21K) | 2022-10-24 |
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | ✓ Link | 87.5% | 173M | 96.8 | | | | CSWin-L (384 res, ImageNet-22k pretrain) | 2021-07-01 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 87.5% | 196.8M | 103 | | | | DaViT-L (ImageNet-22k) | 2022-04-07 |
Dilated Neighborhood Attention Transformer | ✓ Link | 87.5% | 200M | 92.4 | | | | DiNAT-Large (11x11ks; 384res; Pretrained on IN22K@224) | 2022-09-29 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 87.5% | 600M | | | | | AIMv2-H | 2024-11-21 |
Scaling Vision with Sparse Mixture of Experts | ✓ Link | 87.41% | 3400M | | | | | V-MoE-L/16 (Every-2) | 2021-06-10 |
Dilated Neighborhood Attention Transformer | ✓ Link | 87.4% | | 89.7 | | | | DiNAT-Large (384x384; Pretrained on ImageNet-22K @ 224x224) | 2022-09-29 |
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language | ✓ Link | 87.4% | | | | | | data2vec 2.0 | 2022-12-14 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 87.4% | | | | | | UniRepLKNet-B++ | 2023-11-27 |
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space | ✓ Link | 87.4% | | | | | | HVT Huge | 2024-09-25 |
MetaFormer Baselines for Vision | ✓ Link | 87.4% | 99M | 23.2 | | | | CAFormer-B36 (224 res, 21K) | 2022-10-24 |
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP | ✓ Link | 87.4% | 117M | 51 | | | | UniNet-B6 | 2022-07-12 |
Dilated Neighborhood Attention Transformer | ✓ Link | 87.4% | 197M | 101.5 | | | | DiNAT_s-Large (384res; Pretrained on IN22K@224) | 2022-09-29 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 87.3% | 197M | 103.9 | | | | Swin-L | 2021-03-25 |
EfficientNetV2: Smaller Models and Faster Training | ✓ Link | 87.3% | 208M | 94 | | | | EfficientNetV2-XL (21k) | 2021-04-01 |
Improving Vision Transformers by Revisiting High-frequency Components | ✓ Link | 87.3% | 295.5M | 412 | | | | VOLO-D5+HAT | 2022-04-03 |
PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions | ✓ Link | 87.2% | | | | | | EfficientNetV2 (PolyLoss) | 2022-04-26 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 87.2% | 298M | 437 | | | | ELSA-VOLO-D5 (512*512) | 2021-12-23 |
A Study on Transformer Configuration and Training Objective | | 87.1% | | | | | | Bamboo (Bamboo-H) | 2022-05-21 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 87.1% | | | | | | Swin-L@224 (cosub) | 2022-12-09 |
CoAtNet: Marrying Convolution and Attention for All Data Sizes | ✓ Link | 87.1% | | | | | | CoAtNet-2 (21k) | 2021-06-09 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 87.1% | 66M | 82 | | | | FixEfficientNet-B7 | 2020-03-18 |
Understanding The Robustness in Vision Transformers | ✓ Link | 87.1% | 76.8M | | | | | FAN-L-Hybrid++ | 2022-04-26 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 87.1% | 88M | | | | | SwinV2-B | 2021-11-18 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 87.1% | 296M | 412 | | | | VOLO-D5 | 2021-06-24 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 87.1% | 334.3M | | | | | PatchConvNet-L120-21k-384 | 2021-12-27 |
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ Link | 87.07% | | | | | | 16-TokenLearner B/16 (21) | 2021-06-21 |
Enhance the Visual Representation via Discrete Adversarial Training | ✓ Link | 87.02% | | | | | | MAE+DAT (ViT-H) | 2022-09-16 |
Visual Attention Network | ✓ Link | 87% | | 50.6 | | | | VAN-B5 (22K, 384res) | 2022-02-20 |
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP | ✓ Link | 87% | 72.9M | 20.4 | | | | UniNet-B5 | 2022-07-12 |
MetaFormer Baselines for Vision | ✓ Link | 87.0% | 100M | 22.6 | | | | ConvFormer-B36 (224 res, 21K) | 2022-10-24 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 86.9% | | | | | | MAE (ViT-H) | 2021-11-11 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 86.9% | | | | | | Hiera-H | 2023-06-01 |
MetaFormer Baselines for Vision | ✓ Link | 86.9% | 39M | 26.0 | | | | CAFormer-S36 (384 res, 21K) | 2022-10-24 |
MetaFormer Baselines for Vision | ✓ Link | 86.9% | 57M | 37.7 | | | | ConvFormer-M36 (384 res, 21K) | 2022-10-24 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 86.9% | 66M | 37 | | | | NoisyStudent (EfficientNet-B7) | 2019-11-11 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 86.9% | 87.9M | 46.4 | | | | DaViT-B (ImageNet-22k) | 2022-04-07 |
Visual Attention Network | ✓ Link | 86.9% | 200M | 38.9 | | | | VAN-B6 (22K) | 2022-02-20 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 86.8% | | | | | | MAWS (ViT-B) | 2023-03-23 |
EfficientNetV2: Smaller Models and Faster Training | ✓ Link | 86.8% | 120M | 53 | | | | EfficientNetV2-L (21k) | 2021-04-01 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 86.8% | 193M | 197 | | | | VOLO-D4 | 2021-06-24 |
Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error | | 86.78% | 377.2M | | | | | NFNet-F5 w/ SAM w/ augmult=16 | 2021-05-27 |
An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems | ✓ Link | 86.74% | | | | | | µ2Net (ViT-L/16) | 2022-05-25 |
DeiT III: Revenge of the ViT | ✓ Link | 86.7% | | | | | | ViT-B @384 (DeiT III, 21k) | 2022-04-14 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 86.7% | | | | | | MaxViT-B (512res) | 2022-04-04 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 86.7% | 43M | | | | | FixEfficientNet-B6 | 2020-03-18 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 86.7% | 190M | 271 | | | | MOAT-3 1K only | 2022-10-04 |
An Algorithm for Routing Vectors in Sequences | ✓ Link | 86.7% | 312.8M | | | | | Heinsen Routing + BEiT-large 16 224 | 2022-11-20 |
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network | ✓ Link | 86.61% | | 51.93 | | | | CLCNet (S:ViT+D:EffNet-B7) (retrain) | 2022-05-19 |
MetaFormer Baselines for Vision | ✓ Link | 86.6% | 56M | 13.2 | | | | CAFormer-M36 (224 res, 21K) | 2022-10-24 |
Visual Attention Network | ✓ Link | 86.6% | 60M | 35.9 | | | | VAN-B4 (22K, 384res) | 2022-02-20 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 86.6% | 300M | | | | | AIMv2-L | 2024-11-21 |
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | ✓ Link | 86.6% | 656M | | | | | data2vec (ViT-H) | 2022-02-07 |
Dilated Neighborhood Attention Transformer | ✓ Link | 86.5% | | 34.5 | | | | DiNAT_s-Large (224x224; Pretrained on ImageNet-22K @ 224x224) | 2022-09-29 |
Meta Knowledge Distillation | | 86.5% | | | | | | MKD ViT-L | 2022-02-16 |
TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ✓ Link | 86.5% | 21M | 27.0 | | | | TinyViT-21M-512-distill (512 res, 21k) | 2022-07-21 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 86.5% | 99.4M | | | | | PatchConvNet-B60-21k-384 | 2021-12-27 |
Going deeper with Image Transformers | ✓ Link | 86.5% | 438M | 377.3 | | | | CaiT-M-48-448 | 2021-03-31 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 86.5% | 438.4M | 377.28 | | | | NFNet-F6 w/ SAM | 2021-02-11 |
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network | ✓ Link | 86.46% | | 57.46 | | | | CLCNet (S:ViT+D:VOLO-D3) (retrain) | 2022-05-19 |
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network | ✓ Link | 86.42% | | 45.43 | | | | CLCNet (S:ConvNeXt-L+D:EffNet-B7) (retrain) | 2022-05-19 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 86.4% | | | | | | MaxViT-L (384res) | 2022-04-04 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 86.4% | | | | | | UniRepLKNet-S++ | 2023-11-27 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 86.4% | 30M | | | | | FixEfficientNet-B5 | 2020-03-18 |
MetaFormer Baselines for Vision | ✓ Link | 86.4% | 40M | 22.4 | | | | ConvFormer-S36 (384 res, 21K) | 2022-10-24 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 86.4% | 43M | | | | | NoisyStudent (EfficientNet-B6) | 2019-11-11 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 86.4% | 88M | 47 | | | | Swin-B | 2021-03-25 |
MetaFormer Baselines for Vision | ✓ Link | 86.4% | 99M | 72.2 | | | | CAFormer-B36 (384 res) | 2022-10-24 |
All Tokens Matter: Token Labeling for Training Better Vision Transformers | ✓ Link | 86.4% | 151M | 214.8 | | | | LV-ViT-L | 2021-04-22 |
Fixing the train-test resolution discrepancy | ✓ Link | 86.4% | 829M | | 62G | 98.0% | | FixResNeXt-101 32x48d | 2019-06-14 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 86.34% | | | | | | MaxViT-B (384res) | 2022-04-04 |
A Study on Transformer Configuration and Training Objective | | 86.3% | | | | | | Bamboo (Bamboo-L) | 2022-05-21 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 86.3% | | | | | | ViT-B@224 (cosub) | 2022-12-09 |
SP-ViT: Learning 2D Spatial Priors for Vision Transformers | ✓ Link | 86.3% | | | | | | SP-ViT-L|384 | 2022-06-15 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 86.3% | 86M | 67.9 | | | | VOLO-D3 | 2021-06-24 |
BEiT: BERT Pre-Training of Image Transformers | ✓ Link | 86.3% | 86M | | | | | BEiT-L (ViT; ImageNet 1k pretrain) | 2021-06-15 |
Visual Attention Network | ✓ Link | 86.3% | 90M | 17.2 | | | | VAN-B5 (22K) | 2022-02-20 |
UniFormer: Unifying Convolution and Self-attention for Visual Recognition | ✓ Link | 86.3% | 100M | 39.2 | | | | UniFormer-L (384 res) | 2022-01-24 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 86.3% | 218M | 140.2 | | | | MViTv2-L (384 res) | 2021-12-02 |
Going deeper with Image Transformers | ✓ Link | 86.3% | 271M | 247.8 | | | | CaiT-M36-448 | 2021-03-31 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 86.3% | 377.2M | 289.76 | | | | NFNet-F5 w/ SAM | 2021-02-11 |
Tiny Models are the Computational Saver for Large Models | ✓ Link | 86.24% | | 31.17 | | | | TinySaver(ConvNeXtV2_h, 0.01 Acc drop) | 2024-03-26 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 86.2% | | | | | | Swin-B@224 (cosub) | 2022-12-09 |
TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ✓ Link | 86.2% | 21M | 13.8 | | | | TinyViT-21M-384-distill (384 res, 21k) | 2022-07-21 |
EfficientNetV2: Smaller Models and Faster Training | ✓ Link | 86.2% | 54M | 24 | | | | EfficientNetV2-M (21k) | 2021-04-01 |
MetaFormer Baselines for Vision | ✓ Link | 86.2% | 56M | 42.0 | | | | CAFormer-M36 (384 res) | 2022-10-24 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 86.2% | 89.7M | 56.3 | | | | TransNeXt-Base (IN-1K supervised, 384) | 2023-11-28 |
Masked Image Residual Learning for Scaling Deeper Vision Transformers | ✓ Link | 86.2% | 341M | 67.0 | | | | MIRL (ViT-B-48) | 2023-09-25 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 86.19% | | | | | | MaxViT-S (512res) | 2022-04-04 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 86.1% | 30M | | | | | NoisyStudent (EfficientNet-B5) | 2019-11-11 |
MetaFormer Baselines for Vision | ✓ Link | 86.1% | 57M | 12.8 | | | | ConvFormer-M36 (224 res, 21K) | 2022-10-24 |
Going deeper with Image Transformers | ✓ Link | 86.1% | 270.9M | 173.3 | | | | CaiT-M-36 | 2021-03-31 |
Refiner: Refining Self-attention for Vision Transformers | ✓ Link | 86.03% | 81M | | | | | Refiner-ViT-L | 2021-06-07 |
Generalized Parametric Contrastive Learning | ✓ Link | 86.01% | | | | | | GPaCo (ViT-L) | 2022-09-26 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 86.0% | | | | | | Omnivore (Swin-L) | 2022-01-20 |
SP-ViT: Learning 2D Spatial Priors for Vision Transformers | ✓ Link | 86% | | | | | | SP-ViT-M|384 | 2022-06-15 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 86.0% | 49.7M | 32.1 | | | | TransNeXt-Small (IN-1K supervised, 384) | 2023-11-28 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 86% | 59M | | | | | VOLO-D2 | 2021-06-24 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 86% | 64M | 20 | | | | EfficientViT-L2 (r384) | 2022-05-29 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 86% | 189M | 417.9 | | | | XCiT-L24 | 2021-06-17 |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling | ✓ Link | 86.0% | 198M | | | | | SparK (ConvNeXt-Large, 384) | 2023-01-09 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 86.0% | 377.2M | 289.76 | | | | NFNet-F5 | 2021-02-11 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 85.9% | | | | | | MAE (ViT-L) | 2021-11-11 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 85.9% | 19M | | | | | FixEfficientNet-B4 | 2020-03-18 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 85.9% | 94M | 49.7 | | | | DAT-B++ (384x384) | 2023-09-04 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 85.9% | 316.1M | 215.24 | | | | NFNet-F4 | 2021-02-11 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 85.8% | | | | | | ConvNeXt-B@224 (cosub) | 2022-12-09 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 85.8% | | | | | | PiT-B@224 (cosub) | 2022-12-09 |
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation | ✓ Link | 85.8% | | | | | | GTP-ViT-B-Patch8/P20 | 2023-11-06 |
MetaFormer Baselines for Vision | ✓ Link | 85.8% | 39M | 8.0 | | | | CAFormer-S36 (224 res, 21K) | 2022-10-24 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 85.8% | 84M | 188 | | | | XCiT-M24 | 2021-06-17 |
MaxUp: A Simple Way to Improve Generalization of Neural Network Training | ✓ Link | 85.8% | 87.42M | | | | | Fix-EfficientNet-B8 (MaxUp + CutMix) | 2020-02-20 |
Circumventing Outliers of AutoAugment with Knowledge Distillation | ✓ Link | 85.8% | 88M | | | | | KDforAA (EfficientNet-B8) | 2020-03-25 |
Going deeper with Image Transformers | ✓ Link | 85.8% | 185.9M | 116.1 | | | | CaiT-M-24 | 2021-03-31 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 85.8% | 186M | 34.7 | | | | RDNet-L (384 res) | 2024-03-28 |
DeiT III: Revenge of the ViT | ✓ Link | 85.8% | 304.8M | 191.2 | | | | ViT-L | 2022-04-14 |
FasterViT: Fast Vision Transformers with Hierarchical Attention | ✓ Link | 85.8% | 1360M | 142 | | | | FasterViT-6 | 2023-06-09 |
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 85.8% | 10000M | | | | | SEER (RG-10B) | 2022-02-16 |
Tiny Models are the Computational Saver for Large Models | ✓ Link | 85.75% | | 19.41 | | | | TinySaver(ConvNeXtV2_h, 0.5 Acc drop) | 2024-03-26 |
Tiny Models are the Computational Saver for Large Models | ✓ Link | 85.74% | | | | | | TinySaver(Swin_large, 0.5 Acc drop) | 2024-03-26 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 85.72% | | | | | | MaxViT-T (384res) | 2022-04-04 |
Visual Attention Network | ✓ Link | 85.7% | | 12.2 | | | | VAN-B4 (22K) | 2022-02-20 |
EfficientNetV2: Smaller Models and Faster Training | ✓ Link | 85.7% | | 53 | | | | EfficientNetV2-L | 2021-04-01 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 85.7% | | | | | | FixEfficientNet-B8 | 2020-03-18 |
DeiT III: Revenge of the ViT | ✓ Link | 85.7% | | | | | | ViT-B @224 (DeiT III, 21k) | 2022-04-14 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 85.7% | | | | | | dBOT ViT-B (CLIP as Teacher) | 2022-09-08 |
MetaFormer Baselines for Vision | ✓ Link | 85.7% | 39M | 26.0 | | | | CAFormer-S36 (384 res) | 2022-10-24 |
MetaFormer Baselines for Vision | ✓ Link | 85.7% | 100M | 66.5 | | | | ConvFormer-B36 (384 res) | 2022-10-24 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 85.7% | 254.9M | 114.76 | | | | NFNet-F3 | 2021-02-11 |
Masking meets Supervision: A Strong Learning Alliance | ✓ Link | 85.7% | 632M | | | | | ViT-H @224 (DeiT-III + AugSub) | 2023-06-20 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 85.6% | 48M | 106 | | | | XCiT-S24 | 2021-06-17 |
MetaFormer Baselines for Vision | ✓ Link | 85.6% | 57M | 37.7 | | | | ConvFormer-M36 (384 res) | 2022-10-24 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 85.6% | 64M | 11 | | | | EfficientViT-L2 (r288) | 2022-05-29 |
UniFormer: Unifying Convolution and Self-attention for Visual Recognition | ✓ Link | 85.6% | 100M | 12.6 | | | | UniFormer-L | 2022-01-24 |
FasterViT: Fast Vision Transformers with Hierarchical Attention | ✓ Link | 85.6% | 957.5M | 113 | | | | FasterViT-5 | 2023-06-09 |
Three things everyone should know about Vision Transformers | ✓ Link | 85.5% | | | | | | ViT-L@384 (attn finetune) | 2022-03-18 |
SP-ViT: Learning 2D Spatial Priors for Vision Transformers | ✓ Link | 85.5% | | | | | | SP-ViT-L | 2022-06-15 |
MiniViT: Compressing Vision Transformers with Weight Multiplexing | ✓ Link | 85.5% | 47M | 98.8 | | | | Mini-Swin-B@384 | 2022-04-14 |
Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning | ✓ Link | 85.5% | 57.5M | 14.8 | | | | Wave-ViT-L | 2022-07-11 |
Circumventing Outliers of AutoAugment with Knowledge Distillation | ✓ Link | 85.5% | 66M | | | | | KDforAA (EfficientNet-B7) | 2020-03-25 |
Scaling Local Self-Attention for Parameter Efficient Visual Backbones | ✓ Link | 85.5% | 87M | | | | | HaloNet4 (base 128, Conv-12) | 2021-03-23 |
Adversarial Examples Improve Image Recognition | ✓ Link | 85.5% | 88M | | | | | AdvProp (EfficientNet-B8) | 2019-11-21 |
MetaFormer Baselines for Vision | ✓ Link | 85.5% | 99M | 23.2 | | | | CAFormer-B36 (224 res) | 2022-10-24 |
A ConvNet for the 2020s | ✓ Link | 85.5% | 198M | 101 | | | | ConvNeXt-L (384 res) | 2022-01-10 |
RandAugment: Practical automated data augmentation with a reduced search space | ✓ Link | 85.4% | | | | | | EfficientNet-B8 (RandAugment) | 2019-09-30 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | 85.4% | | | | | | BiFormer-B* (IN1k pretrain) | 2023-03-15 |
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation | ✓ Link | 85.4% | | | | | | GTP-EVA-L/P8 | 2023-11-06 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 85.4% | 25.2M | | | | | PatchConvNet-S60-21k-512 | 2021-12-27 |
MetaFormer Baselines for Vision | ✓ Link | 85.4% | 26M | 13.4 | | | | CAFormer-S18 (384 res, 21K) | 2022-10-24 |
MetaFormer Baselines for Vision | ✓ Link | 85.4% | 40M | 7.6 | | | | ConvFormer-S36 (224 res, 21K) | 2022-10-24 |
MetaFormer Baselines for Vision | ✓ Link | 85.4% | 40M | 22.4 | | | | ConvFormer-S36 (384 res) | 2022-10-24 |
Going deeper with Image Transformers | ✓ Link | 85.4% | 68.2M | 48 | | | | CaiT-S-36 | 2021-03-31 |
FasterViT: Fast Vision Transformers with Hierarchical Attention | ✓ Link | 85.4% | 424.6M | 36.6 | | | | FasterViT-4 | 2023-06-09 |
Exploring the Limits of Weakly Supervised Pretraining | ✓ Link | 85.4% | 829M | 306 | | | | ResNeXt-101 32x48d | 2018-05-02 |
Big Transfer (BiT): General Visual Representation Learning | ✓ Link | 85.39% | 928M | | | | | BiT-M (ResNet) | 2019-12-24 |
MLP-Mixer: An all-MLP Architecture for Vision | ✓ Link | 85.3% | | | | | | ViT-L/16 Dosovitskiy et al. (2021) | 2021-05-04 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 85.3% | | | | | | Omnivore (Swin-B) | 2022-01-20 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 85.3% | 19M | | | | | NoisyStudent (EfficientNet-B4) | 2019-11-11 |
Going deeper with Image Transformers | ✓ Link | 85.3% | 89.5M | 63.8 | | | | CaiT-S-48 | 2021-03-31 |
Masking meets Supervision: A Strong Learning Alliance | ✓ Link | 85.3% | 304M | | | | | ViT-L @224 (DeiT-III + AugSub) | 2023-06-20 |
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network | ✓ Link | 85.28% | | 47.43 | | | | CLCNet (S:D1+D:D5) | 2022-05-19 |
Tiny Models are the Computational Saver for Large Models | ✓ Link | 85.24% | | | | | | TinySaver(Swin_large, 1.0 Acc drop) | 2024-03-26 |
DeiT III: Revenge of the ViT | ✓ Link | 85.2% | | | | | | ViT-H @224 (DeiT III) | 2022-04-14 |
HyenaPixel: Global Image Context with Convolutions | ✓ Link | 85.2% | | | | | | HyenaPixel-Bidirectional-Former-B36 | 2024-02-29 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 85.2% | 27M | | | | | VOLO-D1 | 2021-06-24 |
MetaFormer Baselines for Vision | ✓ Link | 85.2% | 56M | 13.2 | | | | CAFormer-M36 (224 res) | 2022-10-24 |
Adversarial Examples Improve Image Recognition | ✓ Link | 85.2% | 66M | | | | | AdvProp (EfficientNet-B7) | 2019-11-21 |
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP | | 85.2% | 73.5M | 23.2 | | | | UniNet-B5 | 2021-10-08 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 85.2% | 87M | | | | | DeiT-B 384 | 2020-12-23 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 85.17% | 212M | 43.9 | | | | MaxViT-L (224res) | 2022-04-04 |
EfficientNetV2: Smaller Models and Faster Training | ✓ Link | 85.1% | | | | | | EfficientNetV2-M | 2021-04-01 |
Meta Knowledge Distillation | | 85.1% | | | | | | MKD ViT-B | 2022-02-16 |
SP-ViT: Learning 2D Spatial Priors for Vision Transformers | ✓ Link | 85.1% | | | | | | SP-ViT-S|384 | 2022-06-15 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 85.1% | 26M | 55.6 | | | | XCiT-S12 | 2021-06-17 |
Going deeper with Image Transformers | ✓ Link | 85.1% | 46.9M | 32.2 | | | | CaiT-S-24 | 2021-03-31 |
Semi-Supervised Recognition under a Noisy and Fine-grained Dataset | ✓ Link | 85.1% | 76M | | | | | ResNet200_vd_26w_4s_ssld | 2020-06-18 |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | 85.1% | 88M | 16.3 | | | | MixMIM-B | 2022-05-26 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 85.1% | 193.8M | 62.59 | | | | NFNet-F2 | 2021-02-11 |
Exploring the Limits of Weakly Supervised Pretraining | ✓ Link | 85.1% | 466M | 174 | | | | ResNeXt-101 32x32d | 2018-05-02 |
Discrete Representations Strengthen Vision Transformer Robustness | ✓ Link | 85.07% | | | | | | DiscreteViT | 2021-11-20 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 85.0% | | | | | | ViT-M@224 (cosub) | 2022-12-09 |
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | ✓ Link | 85% | | | | | | ViC-MAE (ViT-L) | 2023-03-21 |
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space | ✓ Link | 85% | | | | | | HVT Large | 2024-09-25 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 85% | 12M | | | | | FixEfficientNet-B3 | 2020-03-18 |
MetaFormer Baselines for Vision | ✓ Link | 85.0% | 26M | 13.4 | | | | CAFormer-S18 (384 res) | 2022-10-24 |
MetaFormer Baselines for Vision | ✓ Link | 85.0% | 27M | 11.6 | | | | ConvFormer-S18 (384 res, 21K) | 2022-10-24 |
RandAugment: Practical automated data augmentation with a reduced search space | ✓ Link | 85% | 66M | | | | | EfficientNet-B7 (RandAugment) | 2019-09-30 |
DeiT III: Revenge of the ViT | ✓ Link | 85.0% | 87M | | | | | ViT-B @384 (DeiT III) | 2022-04-14 |
MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ✓ Link | 85% | 227.9M | 34.9 | | | | MambaVision-L | 2024-07-10 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 84.94% | 120M | 23.4 | | | | MaxViT-B (224res) | 2022-04-04 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 84.91% | | | | | | CaiT-S24 | 2023-08-18 |
DeiT III: Revenge of the ViT | ✓ Link | 84.9% | | | | | | ViT-L @224 (DeiT III) | 2022-04-14 |
SP-ViT: Learning 2D Spatial Priors for Vision Transformers | ✓ Link | 84.9% | | | | | | SP-ViT-M | 2022-06-15 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | 84.9% | | | | | | FastViT-MA36 | 2023-03-24 |
HyenaPixel: Global Image Context with Convolutions | ✓ Link | 84.9% | | | | | | HyenaPixel-Former-B36 | 2024-02-29 |
EfficientNetV2: Smaller Models and Faster Training | ✓ Link | 84.9% | 22M | 8.8 | | | | EfficientNetV2-S (21k) | 2021-04-01 |
CvT: Introducing Convolutions to Vision Transformers | ✓ Link | 84.9% | 32M | 25 | | | | CvT-21 (384 res, ImageNet-22k pretrain) | 2021-03-29 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 84.9% | 93M | 16.6 | | | | DAT-B++ (224x224) | 2023-09-04 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 84.9% | 97M | 16 | | | | InternImage-B | 2022-11-10 |
FasterViT: Fast Vision Transformers with Hierarchical Attention | ✓ Link | 84.9% | 159.5M | 18.2 | | | | FasterViT-3 | 2023-06-09 |
TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ✓ Link | 84.8% | 21M | 4.3 | | | | TinyViT-21M-distill (21k) | 2022-07-21 |
Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning | ✓ Link | 84.8% | 33.5M | 7.2 | | | | Wave-ViT-B | 2022-07-11 |
Going deeper with Image Transformers | ✓ Link | 84.8% | 38.6M | 28.8 | | | | CaiT-XS-36 | 2021-03-31 |
Sliced Recursive Transformer | ✓ Link | 84.8% | 71.2M | | | | | SReT-B (384 res, ImageNet-1K only) | 2021-11-09 |
Multiscale Vision Transformers | ✓ Link | 84.8% | 72.9M | 32.7 | | | | MViT-B-24 | 2021-04-22 |
Active Token Mixer | ✓ Link | 84.8% | 76.4M | 36.4 | | | | ActiveMLP-L | 2022-03-11 |
Vision Transformer with Deformable Attention | ✓ Link | 84.8% | 88M | 49.8 | | | | DAT-B (384 res, IN-1K only) | 2022-01-03 |
Masked Image Residual Learning for Scaling Deeper Vision Transformers | ✓ Link | 84.8% | 96M | 18.8 | | | | MIRL(ViT-S-54) | 2023-09-25 |
MetaFormer Baselines for Vision | ✓ Link | 84.8% | 100M | 22.6 | | | | ConvFormer-B36 (224 res) | 2022-10-24 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 84.8% | 186M | 34.7 | | | | RDNet-L | 2024-03-28 |
Billion-scale semi-supervised learning for image classification | ✓ Link | 84.8% | 193M | | | | | ResNeXt-101 32x16d (semi-weakly sup.) | 2019-05-02 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 84.7% | 27M | 8 | | | | ELSA-VOLO-D1 | 2021-12-23 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 84.7% | 49.7M | 10.3 | | | | TransNeXt-Small (IN-1K supervised, 224) | 2023-11-28 |
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios | ✓ Link | 84.7% | 57.8M | 32 | | | | Next-ViT-L @384 | 2022-07-12 |
Vicinity Vision Transformer | ✓ Link | 84.7% | 61.8M | 31.8 | | | | VVT-L (384 res) | 2022-06-21 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 84.7% | 75.1M | | | | | BoTNet T7 | 2021-01-27 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 84.7% | 83M | 15.9 | | | | MogaNet-L | 2022-11-07 |
Fast Vision Transformers with HiLo Attention | ✓ Link | 84.7% | 87M | 39.7 | | | | LITv2-B|384 | 2022-05-26 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 84.7% | 132.6M | 35.54 | | | | NFNet-F1 | 2021-02-11 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 84.6% | 53M | 9.4 | | | | DAT-S++ | 2023-09-04 |
Sequencer: Deep LSTM for Image Classification | ✓ Link | 84.6% | 54M | 50.7 | | | | Sequencer2D-L↑392 | 2022-05-04 |
Contextual Transformer Networks for Visual Recognition | ✓ Link | 84.6% | 55.8M | 26.5 | | | | SE-CoTNetD-152 | 2021-07-26 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 84.6% | 87M | | | | | AMD (ViT-B/16) | 2023-11-06 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 84.6% | 87.9M | 15.5 | | | | DaViT-B | 2022-04-07 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | 84.5% | | | | | | FastViT-SA36 | 2023-03-24 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 84.5% | 34.8M | | | | | ReXNet-R_3.0 | 2020-07-02 |
MetaFormer Baselines for Vision | ✓ Link | 84.5% | 39M | 8.0 | | | | CAFormer-S36 (224 res) | 2022-10-24 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 84.5% | 53M | 5.3 | | | | EfficientViT-L1 (r224) | 2022-05-29 |
MetaFormer Baselines for Vision | ✓ Link | 84.5% | 57M | 12.8 | | | | ConvFormer-M36 (224 res) | 2022-10-24 |
Global Context Vision Transformers | ✓ Link | 84.5% | 90M | 14.8 | | | | GC ViT-B | 2022-06-20 |
ResNeSt: Split-Attention Networks | ✓ Link | 84.5% | 111M | | | | | ResNeSt-269 | 2020-04-19 |
CoAtNet: Marrying Convolution and Attention for All Data Sizes | ✓ Link | 84.5% | 168M | 34.7 | | | | CoAtNet-3 | 2021-06-09 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 84.45% | 69M | 11.7 | | | | MaxViT-S (224res) | 2022-04-04 |
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | ✓ Link | 84.4% | | | | | | GPIPE | 2018-11-16 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | 84.4% | | | | | | DeBiFormer-B | 2024-10-11 |
MetaFormer Baselines for Vision | ✓ Link | 84.4% | 27M | 11.6 | | | | ConvFormer-S18 (384 res) | 2022-10-24 |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ✓ Link | 84.4% | 66M | 37 | | | | EfficientNet-B7 | 2019-05-28 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 84.4% | 87M | 15.4 | | | | RDNet-B | 2024-03-28 |
Dilated Neighborhood Attention Transformer | ✓ Link | 84.4% | 90M | 13.7 | | | | DiNAT-Base | 2022-09-29 |
Revisiting ResNets: Improved Training and Scaling Strategies | ✓ Link | 84.4% | 192M | 4.6 | | | | ResNet-RS-50 (160 image res) | 2021-03-13 |
ColorNet: Investigating the importance of color spaces for image classification | ✓ Link | 84.32% | | | | | | ColorNet (RHYLH with Conv Layer) | 2019-02-01 |
Three things everyone should know about Vision Transformers | ✓ Link | 84.3% | | | | | | ViT-B@384 (attn finetune) | 2022-03-18 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | 84.3% | | | | | | BiFormer-S* (IN1k pretrain) | 2023-03-15 |
Sliced Recursive Transformer | ✓ Link | 84.3% | 21.3M | 42.8 | | | | SReT-S (512 res, ImageNet-1K only) | 2021-11-09 |
LambdaNetworks: Modeling Long-Range Interactions Without Attention | ✓ Link | 84.3% | 42M | | | | | LambdaResNet200 | 2021-02-17 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 84.3% | 44M | 9.9 | | | | MogaNet-B | 2022-11-07 |
TResNet: High Performance GPU-Dedicated Architecture | ✓ Link | 84.3% | 77M | | | | | TResNet-XL | 2020-03-30 |
Billion-scale semi-supervised learning for image classification | ✓ Link | 84.3% | 88M | | | | | ResNeXt-101 32x8d (semi-weakly sup.) | 2019-05-02 |
Neighborhood Attention Transformer | ✓ Link | 84.3% | 90M | 13.7 | | | | NAT-Base | 2022-04-14 |
Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network | ✓ Link | 84.2% | | 15.8 | | | | Assemble-ResNet152 | 2020-01-17 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 84.2% | | | | | | BoTNet T7-320 | 2021-01-27 |
Visual Parser: Representing Part-whole Hierarchies with Transformers | ✓ Link | 84.2% | | | | | | ViP-B|384 | 2021-07-13 |
A Study on Transformer Configuration and Training Objective | | 84.2% | | | | | | Bamboo (Bamboo-B) | 2022-05-21 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 84.2% | | | | | | RegnetY16GF@224 (cosub) | 2022-12-09 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 84.2% | 49M | 6.5 | | | | EfficientViT-B3 (r288) | 2022-05-29 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 84.2% | 50M | 8 | | | | InternImage-S | 2022-11-10 |
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP | | 84.2% | 73.5M | 9.9 | | | | UniNet-B4 | 2021-10-08 |
FasterViT: Fast Vision Transformers with Hierarchical Attention | ✓ Link | 84.2% | 75.9M | 8.7 | | | | FasterViT-2 | 2023-06-09 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 84.2% | 86M | | | | | DeiT-B | 2020-12-23 |
Masking meets Supervision: A Strong Learning Alliance | ✓ Link | 84.2% | 86.6M | | | | | ViT-B @224 (DeiT-III + AugSub) | 2023-06-20 |
MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ✓ Link | 84.2% | 97.7M | 15 | | | | MambaVision-B | 2024-07-10 |
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network | ✓ Link | 84.2% | 142.3M | 38.1 | | | | RevBiFPN-S6 | 2022-06-28 |
Exploring the Limits of Weakly Supervised Pretraining | ✓ Link | 84.2% | 194M | 72 | | | | ResNeXt-101 32x16d | 2018-05-02 |
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 84.1% | | 2.1 | | | | FBNetV5-F-CLS | 2021-11-19 |
Three things everyone should know about Vision Transformers | ✓ Link | 84.1% | | | | | | ViT-B-36x1 | 2022-03-18 |
Three things everyone should know about Vision Transformers | ✓ Link | 84.1% | | | | | | ViT-B-18x2 | 2022-03-18 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 84.1% | | | | | | XCiT-M (+MixPro) | 2023-04-24 |
Performance of Gaussian Mixture Model Classifiers on Embedded Feature Spaces | ✓ Link | 84.1% | | | | | | DGMMC-S | 2024-10-17 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 84.1% | 12M | | | | | NoisyStudent (EfficientNet-B3) | 2019-11-11 |
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications | ✓ Link | 84.1% | 21.76M | 3.597 | | | | CAS-ViT-T | 2024-08-07 |
MetaFormer Baselines for Vision | ✓ Link | 84.1% | 26M | 4.1 | | | | CAFormer-S18 (224 res, 21K) | 2022-10-24 |
Going deeper with Image Transformers | ✓ Link | 84.1% | 26.6M | 19.3 | | | | CaiT-XS-24 | 2021-03-31 |
MetaFormer Baselines for Vision | ✓ Link | 84.1% | 40M | 7.6 | | | | ConvFormer-S36 (224 res) | 2022-10-24 |
All Tokens Matter: Token Labeling for Training Better Vision Transformers | ✓ Link | 84.1% | 56M | 16 | | | | LV-ViT-M | 2021-04-22 |
Vicinity Vision Transformer | ✓ Link | 84.1% | 61.8M | 10.8 | | | | VVT-L (224 res) | 2022-06-21 |
CoAtNet: Marrying Convolution and Attention for All Data Sizes | ✓ Link | 84.1% | 75M | 15.7 | | | | CoAtNet-2 | 2021-06-09 |
Conformer: Local Features Coupling Global Representations for Visual Recognition | ✓ Link | 84.1% | 83.3M | 46.6 | | | | Conformer-B | 2021-05-09 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 84.1% | 188.6M | | | | | PatchConvNet-B120 | 2021-12-27 |
Generalized Parametric Contrastive Learning | ✓ Link | 84.0% | | | | | | GPaCo (Vit-B) | 2022-09-26 |
Scalable Pre-training of Large Autoregressive Image Models | ✓ Link | 84.0% | | | | | | AIM-7B | 2024-01-16 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 84.0% | 19M | | | | | FixEfficientNet-B4 | 2020-03-18 |
Semi-Supervised Recognition under a Noisy and Fine-grained Dataset | ✓ Link | 84.0% | 25.58M | | | | | Fix_ResNet50_vd_ssld | 2020-06-18 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 84.0% | 28.2M | 5.7 | | | | TransNeXt-Tiny (IN-1K supervised, 224) | 2023-11-28 |
LambdaNetworks: Modeling Long-Range Interactions Without Attention | ✓ Link | 84.0% | 35M | | | | | LambdaResNet152 | 2021-02-17 |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ✓ Link | 84% | 43M | 19 | | | | EfficientNet-B6 | 2019-05-28 |
Global Context Vision Transformers | ✓ Link | 84.0% | 51M | 8.5 | | | | GC ViT-S | 2022-06-20 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 84% | 53.9M | | | | | BoTNet T6 | 2021-01-27 |
Rethinking Spatial Dimensions of Vision Transformers | ✓ Link | 84% | 73.8M | 12.5 | | | | PiT-B | 2021-03-30 |
DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network | ✓ Link | 84% | 89M | 15.4 | | | | DeepMAD-89M | 2023-03-05 |
EfficientNetV2: Smaller Models and Faster Training | ✓ Link | 83.9% | | | | | | EfficientNetV2-S | 2021-04-01 |
SP-ViT: Learning 2D Spatial Priors for Vision Transformers | ✓ Link | 83.9% | | | | | | SP-ViT-S | 2022-06-15 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 83.9% | | | | | | UniRepLKNet-S | 2023-11-27 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | 83.9% | | | | | | DeBiFormer-S | 2024-10-11 |
Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning | ✓ Link | 83.9% | 22.7M | 4.7 | | | | Wave-ViT-S | 2022-07-11 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 83.9% | 24M | 4.3 | | | | DAT-T++ | 2023-09-04 |
Adaptive Split-Fusion Transformer | ✓ Link | 83.9% | 56.7M | | | | | ASF-former-B | 2022-04-26 |
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification | ✓ Link | 83.9% | 57.1M | | | | | DynamicViT-LV-M/0.8 | 2021-06-03 |
Transformer in Transformer | ✓ Link | 83.9% | 65.6M | | | | | TNT-B | 2021-02-27 |
ResNeSt: Split-Attention Networks | ✓ Link | 83.9% | 70M | | | | | ResNeSt-200 | 2020-04-19 |
Regularized Evolution for Image Classifier Architecture Search | ✓ Link | 83.9% | 469M | 208 | | | | AmoebaNet-A | 2018-02-05 |
CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network | ✓ Link | 83.88% | | 18.58 | | | | CLCNet (S:B4+D:B7) | 2022-05-19 |
Revisiting ResNets: Improved Training and Scaling Strategies | ✓ Link | 83.8% | | 54 | | | | ResNet-RS-270 (256 image res) | 2021-03-13 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 83.8% | | | | | | SENet-350 | 2021-01-27 |
DeiT III: Revenge of the ViT | ✓ Link | 83.8% | | | | | | ViT-B @224 (DeiT III) | 2022-04-14 |
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders | ✓ Link | 83.8% | | | | | | ColorMAE-Green-ViTB-1600 | 2024-07-17 |
Sliced Recursive Transformer | ✓ Link | 83.8% | 21M | 18.5 | | | | SReT-S (384 res, ImageNet-1K only) | 2021-11-09 |
Dilated Neighborhood Attention Transformer | ✓ Link | 83.8% | 51M | 7.8 | | | | DiNAT-Small | 2022-09-29 |
Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding | ✓ Link | 83.8% | 68M | 17.9 | | | | Transformer local-attention (NesT-B) | 2021-05-26 |
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 83.8% | 82M | 11.8 | | | | PVTv2-B4 | 2021-06-25 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 83.7% | | | | | | CA-Swin-S (+MixPro) | 2023-04-24 |
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation | ✓ Link | 83.7% | | | | | | GTP-ViT-L/P8 | 2023-11-06 |
MetaFormer Baselines for Vision | ✓ Link | 83.7% | 27M | 3.9 | | | | ConvFormer-S18 (224 res, 21K) | 2022-10-24 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 83.7% | 50M | 8.7 | | | | RDNet-S | 2024-03-28 |
Vision Transformer with Deformable Attention | ✓ Link | 83.7% | 50M | 9.0 | | | | DAT-S | 2022-01-03 |
Neighborhood Attention Transformer | ✓ Link | 83.7% | 51M | 7.8 | | | | NAT-Small | 2022-04-14 |
Learned Queries for Efficient Local Attention | ✓ Link | 83.7% | 56M | 9.7 | | | | QnA-ViT-Base | 2021-12-21 |
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network | ✓ Link | 83.7% | 82M | 21.8 | | | | RevBiFPN-S5 | 2022-06-28 |
Vision GNN: An Image is Worth Graph of Nodes | ✓ Link | 83.7% | 92.6M | 16.8 | | | | Pyramid ViG-B | 2022-06-01 |
Twins: Revisiting the Design of Spatial Attention in Vision Transformers | ✓ Link | 83.7% | 99.2M | 15.1 | | | | Twins-SVT-L | 2021-04-28 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 83.67% | 22.05M | | | | | TransBoost-ViT-S | 2022-05-26 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 83.65% | | | | | | XCiT-S | 2023-08-18 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 83.62% | 31M | 5.6 | | | | MaxViT-T (224res) | 2022-04-04 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 83.61% | | | | | | Wave-ViT-S | 2023-08-18 |
Fast Vision Transformers with HiLo Attention | ✓ Link | 83.6% | | 13.2 | | | | LITv2-B | 2022-05-26 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 83.6% | | | | | | MultiGrain PNASNet (500px) | 2019-02-14 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 83.6% | | | | | | MAE (ViT-L) | 2021-11-11 |
Pattern Attention Transformer with Doughnut Kernel | | 83.6% | | | | | | PAT-B | 2022-11-30 |
HyenaPixel: Global Image Context with Convolutions | ✓ Link | 83.6% | | | | | | HyenaPixel-Attention-Former-S18 | 2024-02-29 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 83.6% | 9.2M | | | | | FixEfficientNet-B2 | 2020-03-18 |
MetaFormer Baselines for Vision | ✓ Link | 83.6% | 26M | 4.1 | | | | CAFormer-S18 (224 res) | 2022-10-24 |
IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation | ✓ Link | 83.6% | 39.3M | 7.8 | | | | IPT-B | 2022-12-06 |
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias | ✓ Link | 83.6% | 48.5M | 27.6 | | | | ViTAE-B-Stage | 2021-06-07 |
ResT: An Efficient Transformer for Visual Recognition | ✓ Link | 83.6% | 51.63M | 7.9 | | | | ResT-Large | 2021-05-28 |
High-Performance Large-Scale Image Recognition Without Normalization | ✓ Link | 83.6% | 71.5M | 12.38 | | | | NFNet-F0 | 2021-02-11 |
Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training | ✓ Link | 83.6% | 98M | 38.2 | | | | SE-ResNeXt-101, 64x4d, S=2 (320px) | 2020-11-30 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 83.6% | 116M | | | | | ResMLP-B24/8 | 2021-05-07 |
Tiny Models are the Computational Saver for Large Models | ✓ Link | 83.52% | | | | | | TinySaver(EfficientFormerV2_l, 0.01 Acc drop) | 2024-03-26 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 83.5% | | 4 | | | | EfficientViT-B3 (r224) | 2022-05-29 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 83.5% | | 19.3 | | | | BoTNet T5 | 2021-01-27 |
HyenaPixel: Global Image Context with Convolutions | ✓ Link | 83.5% | | | | | | HyenaPixel-Bidirectional-Former-S18 | 2024-02-29 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 83.5% | 99.4M | | | | | PatchConvNet-B60 | 2021-12-27 |
Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion | ✓ Link | 83.46% | | | | | | SigLIP B/16 + PrefixedIter Decoder | 2024-07-15 |
Three things everyone should know about Vision Transformers | ✓ Link | 83.4% | | | | | | ViT-B (hMLP + BeiT) | 2022-03-18 |
MobileNetV4 -- Universal Models for the Mobile Ecosystem | ✓ Link | 83.4% | | | | | | MNv4-Hybrid-L | 2024-04-16 |
UniFormer: Unifying Convolution and Self-attention for Visual Recognition | ✓ Link | 83.4% | 22M | 3.6 | | | | UniFormer-S | 2022-01-24 |
DeiT III: Revenge of the ViT | ✓ Link | 83.4% | 22M | 15.5 | | | | ViT-S @384 (DeiT III) | 2022-04-14 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 83.4% | 25M | 5 | | | | MogaNet-S | 2022-11-07 |
Global Context Vision Transformers | ✓ Link | 83.4% | 28M | 4.7 | | | | GC ViT-T | 2022-06-20 |
Billion-scale semi-supervised learning for image classification | ✓ Link | 83.4% | 42M | | | | | ResNeXt-101 32x4d (semi-weakly sup.) | 2019-05-02 |
Sequencer: Deep LSTM for Image Classification | ✓ Link | 83.4% | 54M | 16.6 | | | | Sequencer2D-L | 2022-05-04 |
Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? | ✓ Link | 83.4% | 65.9M | | | | | sMLPNet-B (ImageNet-1k) | 2021-09-12 |
Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training | ✓ Link | 83.34% | 98M | 61.1 | | | | SE-ResNeXt-101, 64x4d, S=2 (416px) | 2020-11-30 |
CvT: Introducing Convolutions to Vision Transformers | ✓ Link | 83.3% | | 24.9 | | | | CvT-21 (384 res) | 2021-03-29 |
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet | ✓ Link | 83.3% | | 34.2 | | | | T2T-ViT-14|384 | 2021-01-28 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 83.3% | 24.2M | 12.9 | | | | CeiT-S (384 finetune res) | 2021-03-22 |
All Tokens Matter: Token Labeling for Training Better Vision Transformers | ✓ Link | 83.3% | 26M | 6.6 | | | | LV-ViT-S | 2021-04-22 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 83.3% | 27.8M | 5.7 | | | | MOAT-0 1K only | 2022-10-04 |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ✓ Link | 83.3% | 30M | 9.9 | | | | EfficientNet-B5 | 2019-05-28 |
Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding | ✓ Link | 83.3% | 38M | 10.4 | | | | Transformer local-attention (NesT-S) | 2021-05-26 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 83.3% | 39.7M | 8.7 | | | | ViL-Medium-D | 2021-03-29 |
CoAtNet: Marrying Convolution and Attention for All Data Sizes | ✓ Link | 83.3% | 42M | 8.4 | | | | CoAtNet-1 | 2021-06-09 |
Fast Vision Transformers with HiLo Attention | ✓ Link | 83.3% | 49M | 7.5 | | | | LITv2-M | 2022-05-26 |
MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ✓ Link | 83.3% | 50.1M | 7.5 | | | | MambaVision-S | 2024-07-10 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 83.3% | 88M | 15.2 | | | | Shift-B | 2022-01-26 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 83.2% | | | | | | MultiGrain PNASNet (450px) | 2019-02-14 |
Meta Pseudo Labels | ✓ Link | 83.2% | | | | | | Meta Pseudo Labels (ResNet-50) | 2020-03-23 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 83.2% | | | | | | UniRepLKNet-T | 2023-11-27 |
HyenaPixel: Global Image Context with Convolutions | ✓ Link | 83.2% | | | | | | HyenaPixel-Former-S18 | 2024-02-29 |
TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ✓ Link | 83.2% | 11M | 2.0 | | | | TinyViT-11M-distill (21k) | 2022-07-21 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 83.2% | 16.5M | | | | | ReXNet-R_2.0 | 2020-07-02 |
Learned Queries for Efficient Local Attention | ✓ Link | 83.2% | 25M | 4.4 | | | | QnA-ViT-Small | 2021-12-21 |
Neighborhood Attention Transformer | ✓ Link | 83.2% | 28M | 4.3 | | | | NAT-Tiny | 2022-04-14 |
Contextual Transformer Networks for Visual Recognition | ✓ Link | 83.2% | 40.9M | 8.5 | | | | SE-CoTNetD-101 | 2021-07-26 |
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios | ✓ Link | 83.2% | 44.8M | 8.3 | | | | Next-ViT-B | 2022-07-12 |
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 83.2% | 45.2M | 6.9 | | | | PVTv2-B3 | 2021-06-25 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 83.2% | 47.7M | | | | | PatchConvNet-S120 | 2021-12-27 |
FasterViT: Fast Vision Transformers with Hierarchical Attention | ✓ Link | 83.2% | 53.4M | 5.3 | | | | FasterViT-1 | 2023-06-09 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 83.2% | 55.7M | 13.4 | | | | ViL-Base-D | 2021-03-29 |
CycleMLP: A MLP-like Architecture for Dense Prediction | ✓ Link | 83.2% | 76M | 12.3 | | | | CycleMLP-B5 | 2021-07-21 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 83.1% | | | | | | MultiGrain SENet154 (450px) | 2019-02-14 |
DeepViT: Towards Deeper Vision Transformer | ✓ Link | 83.1% | | | | | | DeepVit-L* (DeiT training recipe) | 2021-03-22 |
DeiT III: Revenge of the ViT | ✓ Link | 83.1% | | | | | | ViT-S @224 (DeiT III, 21k) | 2022-04-14 |
Meta Knowledge Distillation | | 83.1% | | | | | | MKD ViT-S | 2022-02-16 |
Co-training $2^L$ Submodels for Visual Recognition | ✓ Link | 83.1% | | | | | | ViT-S@224 (cosub) | 2022-12-09 |
Pattern Attention Transformer with Doughnut Kernel | | 83.1% | | | | | | PAT-S | 2022-11-30 |
TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ✓ Link | 83.1% | 21M | 4.3 | | | | TinyViT-21M | 2022-07-21 |
Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? | ✓ Link | 83.1% | 48.6M | | | | | sMLPNet-S (ImageNet-1k) | 2021-09-12 |
Vision GNN: An Image is Worth Graph of Nodes | ✓ Link | 83.1% | 51.7M | 8.9 | | | | Pyramid ViG-M | 2022-06-01 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 83.09% | | | | | | SwinV2-Ti | 2023-08-18 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 83.01% | 39.8M | 7.0 | | | | gSwin-S | 2022-08-24 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 83.0% | | | | | | MultiGrain SENet154 (400px) | 2019-02-14 |
Graph Convolutions Enrich the Self-Attention in Transformers! | ✓ Link | 83% | | | | | | Swin-S + GFSA | 2023-12-07 |
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications | ✓ Link | 83.0% | 12.42M | 1.887 | | | | CAS-ViT-M | 2024-08-07 |
CvT: Introducing Convolutions to Vision Transformers | ✓ Link | 83% | 20M | 16.3 | | | | CvT-13 (384 res) | 2021-03-29 |
Semi-Supervised Recognition under a Noisy and Fine-grained Dataset | ✓ Link | 83.0% | 25.58M | | | | | ResNet50_vd_ssld | 2020-06-18 |
MetaFormer Baselines for Vision | ✓ Link | 83.0% | 27M | 3.9 | | | | ConvFormer-S18 (224 res) | 2022-10-24 |
Multiscale Vision Transformers | ✓ Link | 83.0% | 37M | 7.8 | | | | MViT-B-16 | 2021-04-22 |
ResNeSt: Split-Attention Networks | ✓ Link | 83.0% | 48M | | | | | ResNeSt-101 | 2020-04-19 |
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network | ✓ Link | 83% | 48.7M | 10.6 | | | | RevBiFPN-S4 | 2022-06-28 |
Zen-NAS: A Zero-Shot NAS for High-Performance Deep Image Recognition | ✓ Link | 83.0% | 183M | 13.9 | | | | ZenNAS (0.8ms) | 2021-02-01 |
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training | ✓ Link | 82.9% | | 1.881 | | | | NASViT (supernet) | 2021-09-29 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 82.9% | | | | | | DeiT-B (+MixPro) | 2023-04-24 |
MobileNetV4 -- Universal Models for the Mobile Ecosystem | ✓ Link | 82.9% | | | | | | MNv4-Conv-L | 2024-04-16 |
IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation | ✓ Link | 82.9% | 24.3M | 4.7 | | | | IPT-S | 2022-12-06 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 82.9% | 39.8M | | | | | ViL-Medium-W | 2021-03-29 |
Global Filter Networks for Image Classification | ✓ Link | 82.9% | 54M | 8.6 | | | | GFNet-H-B | 2021-07-01 |
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution | ✓ Link | 82.9% | 66.8M | 22.2 | 20771G | | 2.22G | Oct-ResNet-152 (SE) | 2019-04-10 |
Progressive Neural Architecture Search | ✓ Link | 82.9% | 86.1M | 50 | | 96.2 | 2.5G | PNASNet-5 | 2017-12-02 |
Harmonic Convolutional Networks based on Discrete Cosine Transform | ✓ Link | 82.85% | 88.2M | 31.4 | | | | Harm-SE-RNX-101 64x4d (320x320, Mean-Max Pooling) | 2020-01-18 |
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation | ✓ Link | 82.8% | | 8 | | | | GTP-LV-ViT-M/P8 | 2023-11-06 |
Knowledge distillation: A good teacher is patient and consistent | ✓ Link | 82.8% | | | | | | FunMatch - T384+224 (ResNet-50) | 2021-06-09 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 82.8% | | | | | | CA-Swin-T (+MixPro) | 2023-04-24 |
Graph Convolutions Enrich the Self-Attention in Transformers! | ✓ Link | 82.8% | | | | | | CaiT-S + GFSA | 2023-12-07 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 82.8% | 24M | 5.0 | | | | RDNet-T | 2024-03-28 |
Visual Attention Network | ✓ Link | 82.8% | 26.6M | 5 | | | | VAN-B2 | 2022-02-20 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 82.8% | 28.3M | | | | | DaViT-T | 2022-04-07 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 82.8% | 34.7M | 3.4 | | | | ReXNet_3.0 | 2020-07-02 |
Sequencer: Deep LSTM for Image Classification | ✓ Link | 82.8% | 38M | 11.1 | | | | Sequencer2D-M | 2022-05-04 |
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification | ✓ Link | 82.8% | 44.3M | 9.5 | | | | CrossViT-18+ | 2021-03-27 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 82.8% | 50M | 8.5 | | | | Shift-S | 2022-01-26 |
HRFormer: High-Resolution Transformer for Dense Prediction | ✓ Link | 82.8% | 50.3M | 13.7 | | | | HRFormer-B | 2021-10-18 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 82.8% | 54.7M | 10.9 | | | | BoTNet T4 | 2021-01-27 |
Kolmogorov-Arnold Transformer | ✓ Link | 82.8% | 86.6M | 17.06 | | | | KAT-B* | 2024-09-16 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 82.7% | | | | | | MultiGrain SENet154 (500px) | 2019-02-14 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 82.7% | | | | | | PVT-M (+MixPro) | 2023-04-24 |
Adaptive Split-Fusion Transformer | ✓ Link | 82.7% | 19.3M | | | | | ASF-former-S | 2022-04-26 |
Container: Context Aggregation Network | ✓ Link | 82.7% | 22.1M | 8.1 | | | | Container | 2021-06-02 |
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP | | 82.7% | 22.5M | 2.4 | | | | UniNet-B2 | 2021-10-08 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 82.7% | 24M | 2.1 | | | | EfficientViT-B2 (r256) | 2022-05-29 |
Dilated Neighborhood Attention Transformer | ✓ Link | 82.7% | 28M | 4.3 | | | | DiNAT-Tiny | 2022-09-29 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 82.7% | 28M | 4.8 | | | | ELSA-Swin-T | 2021-12-23 |
MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ✓ Link | 82.7% | 35.1M | 5.1 | | | | MambaVision-T2 | 2024-07-10 |
Learning Transferable Architectures for Scalable Image Recognition | ✓ Link | 82.7% | 88.9M | 23.8 | 1648G | | 2.38G | NASNET-A(6) | 2017-07-21 |
Towards Robust Vision Transformer | ✓ Link | 82.7% | 91.8M | 17.7 | | | | RVT-B* | 2021-05-17 |
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations | ✓ Link | 82.64% | | | | | | CMA(ViT-B/16) | 2025-03-24 |
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 82.6% | | 1 | | | | FBNetV5-C-CLS | 2021-11-19 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 82.6% | | | | | | MultiGrain PNASNet (400px) | 2019-02-14 |
Three things everyone should know about Vision Transformers | ✓ Link | 82.6% | | | | | | ViT-S-24x2 | 2022-03-18 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | 82.6% | | | | | | FastViT-SA24 | 2023-03-24 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 82.6% | 7.8M | | | | | FixEfficientNet-B1 | 2020-03-18 |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ✓ Link | 82.6% | 19M | 4.2 | | | | EfficientNet-B4 | 2019-05-28 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 82.6% | 22M | | | | | DeiT-S (distilled, 1000 epochs) | 2020-12-23 |
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet | ✓ Link | 82.6% | 64.4M | 30 | | | | T2T-ViTt-24 | 2021-01-28 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 82.54% | | | | | | ViT-S | 2023-08-18 |
CvT: Introducing Convolutions to Vision Transformers | ✓ Link | 82.5% | | 7.1 | | | | CvT-21 | 2021-03-29 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 82.5% | 12.8M | 2.7 | | | | TransNeXt-Micro (IN-1K supervised, 224) | 2023-11-28 |
Fixing the train-test resolution discrepancy | ✓ Link | 82.5% | 25.6M | | | | | FixResNet-50 Billion-scale@224 | 2019-06-14 |
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios | ✓ Link | 82.5% | 31.7M | 5.8 | | | | Next-ViT-S | 2022-07-12 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 82.5% | 39.4M | 2.334 | | | | LeViT-384 | 2021-04-02 |
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification | ✓ Link | 82.5% | 43.3M | 9 | | | | CrossViT-18 | 2021-03-27 |
MetaFormer Is Actually What You Need for Vision | ✓ Link | 82.5% | 73M | 23.2 | | | | MetaFormer PoolFormer-M48 | 2021-11-22 |
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases | ✓ Link | 82.5% | 152M | 30 | | | | ConViT-B+ | 2021-03-19 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 82.46% | 28.59M | | | | | TransBoost-ConvNext-T | 2022-05-26 |
ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections | ✓ Link | 82.4% | | | | | | ReViT-B | 2024-02-17 |
Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks | ✓ Link | 82.4% | | | | | | M2D-T | 2024-12-20 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 82.4% | 9.2M | | | | | NoisyStudent (EfficientNet-B2) | 2019-11-11 |
AutoFormer: Searching Transformers for Visual Recognition | ✓ Link | 82.4% | 54M | 11 | | | | AutoFormer-base | 2021-07-01 |
ResNet strikes back: An improved training procedure in timm | ✓ Link | 82.4% | 60.2M | | | | | ResNet-152 (A2 + reg) | 2021-10-01 |
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases | ✓ Link | 82.4% | 86M | 17 | | | | ConViT-B | 2021-03-19 |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 82.4% | 87M | 35.368 | | | | DeiT-B with iRPE-K | 2021-07-29 |
Mega: Moving Average Equipped Gated Attention | ✓ Link | 82.4% | 90M | | | | | Mega | 2022-09-21 |
Spatial-Channel Token Distillation for Vision MLPs | ✓ Link | 82.4% | 122.6M | 24.1 | | | | ResMLP-B24 + STD | 2022-07-23 |
TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers | ✓ Link | 82.37% | | | | | | ViT-B/16-224+HTM | 2022-10-14 |
ColorNet: Investigating the importance of color spaces for image classification | ✓ Link | 82.35% | | | | | | ColorNet | 2019-02-01 |
Polynomial, trigonometric, and tropical activations | ✓ Link | 82.34% | 28M | | | 96.03 | | ConvNeXt-T-Hermite | 2025-02-03 |
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet | ✓ Link | 82.3% | | 27.6 | | | | T2T-ViT-24 | 2021-01-28 |
Three things everyone should know about Vision Transformers | ✓ Link | 82.3% | | | | | | ViT-S-48x1 | 2022-03-18 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 82.3% | 24M | 4.7 | | | | MViTv2-T | 2021-12-02 |
SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search | ✓ Link | 82.3% | 27.8M | 8.4 | 12G | | 0.42G | SCARLET-A4 | 2019-08-16 |
Sequencer: Deep LSTM for Image Classification | ✓ Link | 82.3% | 28M | 8.4 | | | | Sequencer2D-S | 2022-05-04 |
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification | ✓ Link | 82.3% | 28.2M | 6.1 | | | | CrossViT-15+ | 2021-03-27 |
MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ✓ Link | 82.3% | 31.8M | 4.4 | | | | MambaVision-T | 2024-07-10 |
GLiT: Neural Architecture Search for Global and Local Image Transformer | ✓ Link | 82.3% | 96.1M | 17 | | | | GLiT-Bases | 2021-07-07 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 82.29% | | | | | | EViT (delete) | 2023-08-18 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 82.22% | | | | | | STViT-Swin-Ti | 2023-08-18 |
BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search | ✓ Link | 82.2% | | 15.8 | | | | BossNet-T1 | 2021-03-23 |
Going deeper with Image Transformers | ✓ Link | 82.2% | 17.3M | 14.3 | | | | CAIT-XXS-36 | 2021-03-31 |
CvT: Introducing Convolutions to Vision Transformers | ✓ Link | 82.2% | 18M | 4.1 | | | | CvT-13-NAS | 2021-03-29 |
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias | ✓ Link | 82.2% | 19.2M | 12.0 | | | | ViTAE-S-Stage | 2021-06-07 |
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet | ✓ Link | 82.2% | 39.2M | 19.6 | | | | T2T-ViTt-19 | 2021-01-28 |
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer | ✓ Link | 82.2% | 39.6M | | | | | Evo-LeViT-384* | 2021-08-03 |
Visformer: The Vision-friendly Transformer | ✓ Link | 82.2% | 40.2M | 4.9 | | | | Visformer-S | 2021-04-26 |
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases | ✓ Link | 82.2% | 48M | 10 | | | | ConViT-S+ | 2021-03-19 |
Patches Are All You Need? | ✓ Link | 82.2% | 51.6M | | | | | ConvMixer-1536/20 | 2022-01-24 |
DeepViT: Towards Deeper Vision Transformer | ✓ Link | 82.2% | 55M | | | | | DeepVit-L | 2021-03-22 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 82.2% | 66.6M | | | | | SENet-152 | 2021-01-27 |
Exploring the Limits of Weakly Supervised Pretraining | ✓ Link | 82.2% | 88M | | | | | ResNeXt-101 32x8d | 2018-05-02 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 82.16% | 71.71M | | | | | TransBoost-Swin-T | 2022-05-26 |
Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training | ✓ Link | 82.13% | 88.6M | 18.8 | | | | ResNeXt-101, 64x4d, S=2(224px) | 2020-11-30 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 82.11% | | | | | | ToMe-ViT-S | 2023-08-18 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 82.1% | 22M | | | | | AMD(ViT-S/16) | 2023-11-06 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 82.1% | 25.2M | | | | | PatchConvNet-S60 | 2021-12-27 |
Vision GNN: An Image is Worth Graph of Nodes | ✓ Link | 82.1% | 27.3M | 4.6 | | | | Pyramid ViG-S | 2022-06-01 |
A ConvNet for the 2020s | ✓ Link | 82.1% | 29M | 4.5 | | | | ConvNeXt-T | 2022-01-10 |
Spatial-Channel Token Distillation for Vision MLPs | ✓ Link | 82.1% | 30.1M | 4.0 | | | | CycleMLP-B2 + STD | 2022-07-23 |
FasterViT: Fast Vision Transformers with Hierarchical Attention | ✓ Link | 82.1% | 31.4M | 3.3 | | | | FasterViT-0 | 2023-06-09 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 82% | | 4.5 | | | | CeiT-S | 2021-03-22 |
Differentiable Model Compression via Pseudo Quantization Noise | ✓ Link | 82.0% | | | | | | DIFFQ (λ=1e-2) | 2021-04-20 |
From Xception to NEXcepTion: New Design Decisions and Neural Architecture Search | ✓ Link | 82% | | | | | | NEXcepTion-S | 2022-12-16 |
Global Context Vision Transformers | ✓ Link | 82.0% | 20M | 2.6 | | | | GC ViT-XT | 2022-06-20 |
Container: Context Aggregation Network | ✓ Link | 82% | 20M | 3.2 | | | | Container-Light | 2021-06-02 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 82% | 24.6M | 4.86 | | | | ViL-Small | 2021-03-29 |
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 82% | 25.4M | 4 | | | | PVTv2-B2 | 2021-06-25 |
Active Token Mixer | ✓ Link | 82% | 27.2M | 4 | | | | ActiveMLP-T | 2022-03-11 |
Fast Vision Transformers with HiLo Attention | ✓ Link | 82% | 28M | 3.7 | | | | LITv2-S | 2022-05-26 |
Vision Transformer with Deformable Attention | ✓ Link | 82.0% | 29M | 4.6 | | | | DAT-T | 2022-01-03 |
| | 81.97% | | | | | | Swin-T (SAMix+DM) | |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 81.96% | | | | | | EViT (fuse) | 2023-08-18 |
| | 81.92% | | | | | | Swin-T (AutoMix+DM) | |
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation | ✓ Link | 81.9% | | 4.8 | | | | GTP-LV-ViT-S/P8 | 2023-11-06 |
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet | ✓ Link | 81.9% | | 17.0 | | | | T2T-ViT-19 | 2021-01-28 |
A Fast Knowledge Distillation Framework for Visual Recognition | ✓ Link | 81.9% | | | | | | ResNet-101 (224 res, Fast Knowledge Distillation) | 2021-12-02 |
Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models | ✓ Link | 81.9% | | | | | | Discrete Adversarial Distillation (ViT-B, 224) | 2023-11-02 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | 81.9% | | | | | | DeBiFormer-T | 2024-10-11 |
Towards Robust Vision Transformer | ✓ Link | 81.9% | 23.3M | 4.7 | | | | RVT-S* | 2021-05-17 |
Rethinking Spatial Dimensions of Vision Transformers | ✓ Link | 81.9% | 23.5M | 2.9 | | | | PiT-S | 2021-03-30 |
Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? | ✓ Link | 81.9% | 24.1M | | | | | sMLPNet-T (ImageNet-1k) | 2021-09-12 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 81.9% | 79M | 6.74 | | | | ViL-Base-W | 2021-03-29 |
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 81.89% | | | | | | Swin-T+SSA | 2023-06-02 |
Attentive Normalization | ✓ Link | 81.87% | | 7.51 | | | | AOGNet-40M-AN | 2019-08-04 |
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 81.8% | | 0.726 | | | | FBNetV5 | 2021-11-19 |
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training | ✓ Link | 81.8% | | 0.757 | | | | NASViT-A5 | 2021-09-29 |
Parametric Contrastive Learning | ✓ Link | 81.8% | | | | | | ResNet-200 | 2021-07-26 |
RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality | ✓ Link | 81.8% | | | | | | RepMLPNet-L256 | 2021-12-21 |
From Xception to NEXcepTion: New Design Decisions and Neural Architecture Search | ✓ Link | 81.8% | | | | | | NEXcepTion-TP | 2022-12-16 |
Neighborhood Attention Transformer | ✓ Link | 81.8% | 20M | 2.7 | | | | NAT-Mini | 2022-04-14 |
Dilated Neighborhood Attention Transformer | ✓ Link | 81.8% | 20M | 2.7 | | | | DiNAT-Mini | 2022-09-29 |
ResNet strikes back: An improved training procedure in timm | ✓ Link | 81.8% | 60.2M | | | | | ResNet-152 (A2) | 2021-10-01 |
Kolmogorov-Arnold Transformer | ✓ Link | 81.8% | 86.6M | 16.87 | | | | DeiT-B | 2024-09-16 |
MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks | ✓ Link | 81.72% | 25.6M | | | | | MEAL V2 (ResNet-50) (380 res) | 2020-09-17 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 81.71% | 21.8M | 3.6 | | | | gSwin-T | 2022-08-24 |
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 81.7% | | 0.685 | | | | FBNetV5-A-CLS | 2021-11-19 |
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks | ✓ Link | 81.7% | | | | | | T2T-ViT-14 | 2021-05-05 |
Learned Queries for Efficient Local Attention | ✓ Link | 81.7% | 16M | 2.5 | | | | QnA-ViT-Tiny | 2021-12-21 |
AutoFormer: Searching Transformers for Visual Recognition | ✓ Link | 81.7% | 22.9M | 5.1 | | | | AutoFormer-small | 2021-07-01 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 81.7% | 28M | 4.4 | | | | Shift-T | 2022-01-26 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 81.7% | 33.5M | 7.3 | | | | BoTNet T3 | 2021-01-27 |
CvT: Introducing Convolutions to Vision Transformers | ✓ Link | 81.6% | | 4.5 | | | | CvT-13 | 2021-03-29 |
Sharpness-Aware Minimization for Efficiently Improving Generalization | ✓ Link | 81.6% | | | | | | ResNet-152 (SAM) | 2020-10-03 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 81.6% | | | | | | UniRepLKNet-N | 2023-11-27 |
Rethinking Local Perception in Lightweight Vision Transformer | ✓ Link | 81.6% | 12.3M | 2 | | | | CloFormer-S | 2023-03-31 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 81.6% | 17.8M | 1.066 | | | | LeViT-256 | 2021-04-02 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 81.6% | 19M | 1.5 | | | | ReXNet_2.0 | 2020-07-02 |
Contextual Transformer Networks for Visual Recognition | ✓ Link | 81.6% | 23.1M | 4.1 | | | | SE-CoTNetD-50 | 2021-07-26 |
CoAtNet: Marrying Convolution and Attention for All Data Sizes | ✓ Link | 81.6% | 25M | 4.2 | | | | CoAtNet-0 | 2021-06-09 |
Pay Attention to MLPs | ✓ Link | 81.6% | 73M | 31.6 | | | | gMLP-B | 2021-05-17 |
Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs | | 81.5% | | 0.214 | | | | CoE-Large + CondConv | 2021-07-08 |
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation | ✓ Link | 81.5% | | 13.1 | | | | GTP-DeiT-B/P8 | 2023-11-06 |
From Xception to NEXcepTion: New Design Decisions and Neural Architecture Search | ✓ Link | 81.5% | | | | | | NEXcepTion-T | 2022-12-16 |
Graph Convolutions Enrich the Self-Attention in Transformers! | ✓ Link | 81.5% | | | | | | DeiT-S-24 + GFSA | 2023-12-07 |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 81.5% | 7.8M | | | | | NoisyStudent (EfficientNet-B1) | 2019-11-11 |
TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ✓ Link | 81.5% | 11M | 2.0 | | | | TinyViT-11M | 2022-07-21 |
Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding | ✓ Link | 81.5% | 17M | 5.8 | | | | Transformer local-attention (NesT-T) | 2021-05-26 |
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet | ✓ Link | 81.5% | 21.5M | 9.6 | | | | T2T-ViT-14 | 2021-01-28 |
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification | ✓ Link | 81.5% | 27.4M | 5.8 | | | | CrossViT-15 | 2021-03-27 |
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition | ✓ Link | 81.49% | 42.3M | | | | | PyConvResNet-101 | 2020-06-20 |
Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields | ✓ Link | 81.484% | | | | | | ViT-B/16 (RPE w/ GAB) | 2023-05-08 |
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training | ✓ Link | 81.4% | | 0.591 | | | | NASViT-A4 | 2021-09-29 |
MobileOne: An Improved One millisecond Mobile Backbone | ✓ Link | 81.4% | | 2.9 | | | | MobileOne-S4 (distill) | 2022-06-08 |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 81.4% | | 9.770 | | | | DeiT-S with iRPE-QKV | 2021-07-29 |
DeiT III: Revenge of the ViT | ✓ Link | 81.4% | | | | | | ViT-S @224 (DeiT III) | 2022-04-14 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | 81.4% | | | | | | BiFormer-T (IN1k pretrain) | 2023-03-15 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 81.4% | 49.2M | | | | | SENet-101 | 2021-01-27 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 81.33% | | | | | | GFNet-S | 2023-08-18 |
Adversarial AutoAugment | | 81.32% | | | | | | ResNet-200 (Adversarial Autoaugment) | 2019-12-24 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 81.3% | | | | | | MultiGrain PNASNet (300px) | 2019-02-14 |
Parametric Contrastive Learning | ✓ Link | 81.3% | | | | | | ResNet-152 | 2021-07-26 |
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases | ✓ Link | 81.3% | 27M | 5.4 | | | | ConViT-S | 2021-03-19 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 81.3% | 29M | 4.5 | | | | Swin-T | 2021-03-25 |
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures | ✓ Link | 81.24% | 9.5M | | | | | SimpleNetV1-9m-correct-labels | 2016-08-22 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 81.23% | | | | | | Res2Net-101 | 2019-04-02 |
Shape-Texture Debiased Neural Network Training | ✓ Link | 81.2% | | | | | | ResNeXt-101 (Debiased+CutMix) | 2020-10-12 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 81.2% | | | | | | PVT-S (+MixPro) | 2023-04-24 |
| | 81.16% | | | | | | Swin-T (PuzzleMix+DM) | |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 81.15% | 25.56M | | | | | TransBoost-ResNet50-StrikesBack | 2022-05-26 |
ResNeSt: Split-Attention Networks | ✓ Link | 81.13% | 27.5M | 5.39 | | | | ResNeSt-50 | 2020-04-19 |
| | 81.12% | | | | | | DeiT-S (SAMix+DM) | |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 81.1% | | 9.412 | | | | DeiT-S with iRPE-QK | 2021-07-29 |
Graph Convolutions Enrich the Self-Attention in Transformers! | ✓ Link | 81.1% | | | | | | DeiT-S-12 + GFSA | 2023-12-07 |
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications | ✓ Link | 81.1% | 5.76M | 0.932 | | | | CAS-ViT-S | 2024-08-07 |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ✓ Link | 81.1% | 12M | | | | | EfficientNet-B3 | 2019-05-28 |
Visual Attention Network | ✓ Link | 81.1% | 13.9M | 2.5 | | | | VAN-B1 | 2022-02-20 |
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network | ✓ Link | 81.1% | 19.6M | 3.33 | | | | RevBiFPN-S3 | 2022-06-28 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 81.1% | 236M | | | | | ResNet-152x2-SAM | 2021-06-03 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 81.09% | | | | | | DynamicViT-S | 2023-08-18 |
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup | ✓ Link | 81.08% | 44.6M | | | | | ResNet-101 (SAMix) | 2021-11-30 |
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training | ✓ Link | 81.0% | | 0.528 | | | | NASViT-A3 | 2021-09-29 |
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias | ✓ Link | 81% | 13.2M | 6.8 | | | | ViTAE-13M | 2021-06-07 |
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers | ✓ Link | 80.98% | 44.6M | | | | | ResNet-101 (AutoMix) | 2021-03-24 |
| | 80.91% | | | | | | DeiT-S (AutoMix+DM) | |
Parametric Contrastive Learning | ✓ Link | 80.9% | | | | | | ResNet-101 | 2021-07-26 |
Going deeper with Image Transformers | ✓ Link | 80.9% | 12M | 9.6 | | | | CAIT-XXS-24 | 2021-03-31 |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 80.9% | 22M | 9.318 | | | | DeiT-S with iRPE-K | 2021-07-29 |
Centroid Transformers: Learning to Abstract with Attention | | 80.9% | 22.3M | 9.4 | | | | CentroidViT-S | 2021-02-17 |
Aggregated Residual Transformations for Deep Neural Networks | ✓ Link | 80.9% | 83.6M | 31.5 | | 94.7 | | ResNeXt-101 64x4d | 2016-11-16 |
AlphaNet: Improved Training of Supernets with Alpha-Divergence | ✓ Link | 80.8% | | 0.709 | | | | AlphaNet-A6 | 2021-02-16 |
Supervised Contrastive Learning | ✓ Link | 80.8% | | | | | | ResNet-200 (Supervised Contrastive) | 2020-04-23 |
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP | ✓ Link | 80.8% | 11.5M | 0.555 | | | | UniNet-B0 | 2022-07-12 |
LocalViT: Bringing Locality to Vision Transformers | ✓ Link | 80.8% | 22.4M | 4.6 | | | | LocalViT-S | 2021-04-12 |
A Dot Product Attention Free Transformer | | 80.8% | 23M | | | | | DAFT-conv (384 heads, 300 epochs) | 2021-09-29 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 80.8% | 30M | 6 | | | | ResMLP-S24 | 2021-05-07 |
MobileNetV4 -- Universal Models for the Mobile Ecosystem | ✓ Link | 80.7% | | | | | | MNv4-Hybrid-M | 2024-04-16 |
TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ✓ Link | 80.7% | 5.4M | 1.3 | | | | TinyViT-5M-distill (21k) | 2022-07-21 |
Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs | | 80.7% | 95.3M | 0.194 | | | | CoE-Large | 2021-07-08 |
MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks | ✓ Link | 80.67% | | | | | | MEAL V2 (ResNet-50) (224 res) | 2020-09-17 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 80.66% | | | | | | TokenLearner-ViT-8 | 2023-08-18 |
ResNeSt: Split-Attention Networks | ✓ Link | 80.64% | 27.5M | 4.34 | | | | ResNeSt-50-fast | 2020-04-19 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 80.64% | 60.19M | | | | | TransBoost-ResNet152 | 2022-05-26 |
Fast AutoAugment | ✓ Link | 80.6% | | | | | | ResNet-200 (Fast AA) | 2019-05-01 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 80.6% | | | | | | CaiT-XXS (+MixPro) | 2023-04-24 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | 80.6% | | | | | | FastViT-SA12 | 2023-03-24 |
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features | ✓ Link | 80.53% | | | | | | ResNeXt-101 (CutMix) | 2019-05-13 |
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training | ✓ Link | 80.5% | | 0.421 | | | | NASViT-A2 | 2021-09-29 |
Residual Attention Network for Image Classification | ✓ Link | 80.5% | | | | | | Attention-92 | 2017-04-23 |
Neural Architecture Transfer | ✓ Link | 80.5% | 9.1M | | | | | NAT-M4 | 2020-05-12 |
IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation | ✓ Link | 80.5% | 14.0M | 2.3 | | | | IPT-T | 2022-12-06 |
GLiT: Neural Architecture Search for Global and Local Image Transformer | ✓ Link | 80.5% | 24.6M | 4.4 | | | | GLiT-Smalls | 2021-07-07 |
Gated Convolutional Networks with Hybrid Connectivity for Image Classification | ✓ Link | 80.5% | 42.2M | 7.1 | | | | HCGNet-C | 2019-08-26 |
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition | ✓ Link | 80.43% | | 1.7 | | | | DVT (T2T-ViT-12) | 2021-05-31 |
GhostNetV3: Exploring the Training Strategies for Compact Models | ✓ Link | 80.4% | | | | 95.2 | | GhostNetV3 1.6x | 2024-04-17 |
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP | | 80.4% | 14M | 0.99 | | | | UniNet-B1 | 2021-10-08 |
ResNet strikes back: An improved training procedure in timm | ✓ Link | 80.4% | 22M | | | | | DeiT-S (T2) | 2021-10-01 |
ResNet strikes back: An improved training procedure in timm | ✓ Link | 80.4% | 25M | | | | | ResNet50 (A1) | 2021-10-01 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 80.32% | 15.5M | 2.3 | | | | gSwin-VT | 2022-08-24 |
AlphaNet: Improved Training of Supernets with Alpha-Divergence | ✓ Link | 80.3% | | 0.491 | | | | AlphaNet-A5 | 2021-02-16 |
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks | ✓ Link | 80.3% | | | | | | ResNet-50+AutoDropout+RandAugment | 2021-01-05 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 80.3% | 9.7M | 0.86 | | | | ReXNet_1.5 | 2020-07-02 |
| | 80.25% | | | | | | DeiT-S (PuzzleMix+DM) | |
Attentional Feature Fusion | ✓ Link | 80.22% | 34.7M | | | | | iAFF-ResNeXt-50-32x4d | 2020-09-29 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 80.2% | | | | | | UniRepLKNet-P | 2023-11-27 |
Fixing the train-test resolution discrepancy: FixEfficientNet | ✓ Link | 80.2% | 5.3M | 1.60 | | | | FixEfficientNet-B0 | 2020-03-18 |
A Dot Product Attention Free Transformer | | 80.2% | 20.3M | | | | | DAFT-conv (16 heads) | 2021-09-29 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 80.2% | 42.7M | | | | | ConvMLP-L | 2021-09-09 |
A Fast Knowledge Distillation Framework for Visual Recognition | ✓ Link | 80.1% | | | | | | ResNet-50 (224 res, Fast Knowledge Distillation) | 2021-12-02 |
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space | ✓ Link | 80.1% | | | | | | HVT Base | 2024-09-25 |
A Dot Product Attention Free Transformer | | 80.1% | 23M | | | | | DAFT-conv (384 heads, 200 epochs) | 2021-09-29 |
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning | ✓ Link | 80.1% | 55.8M | | | | | Inception ResNet V2 | 2016-02-23 |
Exploring Randomly Wired Neural Networks for Image Recognition | ✓ Link | 80.1% | 61.5M | 7.9 | | | | RandWire-WS | 2019-04-02 |
Go Wider Instead of Deeper | ✓ Link | 80.09% | 63M | | | | | WideNet-H | 2021-07-25 |
Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs | | 80% | | 0.100 | | | | CoE-Small + CondConv + PWLU | 2021-07-08 |
BasisNet: Two-stage Model Synthesis for Efficient Inference | | 80% | | 0.198 | | | | BasisNet-MV3 | 2021-05-07 |
AlphaNet: Improved Training of Supernets with Alpha-Divergence | ✓ Link | 80.0% | | 0.444 | | | | AlphaNet-A4 | 2021-02-16 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 80% | 5.2M | 1.44 | | | | MogaNet-T (256res) | 2022-11-07 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 80% | 10.4M | 0.624 | | | | LeViT-192 | 2021-04-02 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 80% | 44.4M | | | | | ResNet-101 | 2021-01-27 |
Identity Mappings in Deep Residual Networks | ✓ Link | 79.9% | | | | | | ResNet-200 | 2016-03-16 |
MobileNetV4 -- Universal Models for the Mobile Ecosystem | ✓ Link | 79.9% | | | | | | MNv4-Conv-M | 2024-04-16 |
Designing Network Design Spaces | ✓ Link | 79.9% | 39.2M | 8 | | | | RegNetY-8.0GF | 2020-03-30 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 79.9% | 87M | | | | | ViT-B/16-SAM | 2021-06-03 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 79.86% | 44.55M | | | | | TransBoost-ResNet101 | 2022-05-26 |
Selective Kernel Networks | ✓ Link | 79.81% | 48.9M | 8.46 | | | | SKNet-101 | 2019-03-15 |
Fixing the train-test resolution discrepancy | ✓ Link | 79.8% | | | | | | FixResNet-50 CutMix | 2019-06-14 |
Mish: A Self Regularized Non-Monotonic Activation Function | ✓ Link | 79.8% | | | | | | CSPResNeXt-50 + Mish | 2019-08-23 |
Revisiting a kNN-based Image Classification System with High-capacity Storage | | 79.8% | | | | | | kNN-CLIP | 2022-04-03 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | 79.8% | | | | | | FastViT-S12 | 2023-03-24 |
Rethinking Local Perception in Lightweight Vision Transformer | ✓ Link | 79.8% | 7.2M | 1.1 | | | | CloFormer-XS | 2023-03-31 |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ✓ Link | 79.8% | 9.2M | 1 | | | | EfficientNet-B2 | 2019-05-28 |
Global Context Vision Transformers | ✓ Link | 79.8% | 12M | 2.1 | | | | GC ViT-XXT | 2022-06-20 |
CSPNet: A New Backbone that can Enhance Learning Capability of CNN | ✓ Link | 79.8% | 20.5M | | | | | CSPResNeXt-50 (Mish+Aug) | 2019-11-27 |
A Dot Product Attention Free Transformer | | 79.8% | 22.6M | | | | | DAFT-full | 2021-09-29 |
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition | ✓ Link | 79.74% | | 0.7 | | | | DVT (T2T-ViT-10) | 2021-05-31 |
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training | ✓ Link | 79.7% | | 0.309 | | | | NASViT-A1 | 2021-09-29 |
Generalized Parametric Contrastive Learning | ✓ Link | 79.7% | | | | | | GPaCo (ResNet-50) | 2022-09-26 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 79.7% | 45M | | | | | ResMLP-36 | 2021-05-07 |
Grafit: Learning fine-grained image representations with coarse labels | | 79.6% | | | | | | Grafit (ResNet-50) | 2020-11-25 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 79.6% | 8.8M | 0.376 | | | | LeViT-128 | 2021-04-02 |
ResT: An Efficient Transformer for Visual Recognition | ✓ Link | 79.6% | 13.66M | 1.9 | | | | ResT-Small | 2021-05-28 |
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation | ✓ Link | 79.5% | | 3.4 | | | | GTP-DeiT-S/P8 | 2023-11-06 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 79.5% | 7.6M | 0.66 | | | | ReXNet_1.3 | 2020-07-02 |
Go Wider Instead of Deeper | ✓ Link | 79.49% | 40M | | | | | WideNet-L | 2021-07-25 |
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup | ✓ Link | 79.41% | 25.6M | | | | | ResNet-50 (SAMix) | 2021-11-30 |
AlphaNet: Improved Training of Supernets with Alpha-Divergence | ✓ Link | 79.4% | | 0.357 | | | | AlphaNet-A3 | 2021-02-16 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 79.4% | | | | | | MultiGrain R50-AA-500 | 2019-02-14 |
Adversarial AutoAugment | | 79.4% | | | | | | ResNet-50 (Adversarial Autoaugment) | 2019-12-24 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 79.4% | | | | | | ResMLP-24 | 2021-05-07 |
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications | ✓ Link | 79.4% | 5.6M | 2.6 | | | | EdgeNeXt-S | 2022-06-21 |
Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets | ✓ Link | 79.4% | 11.9M | 0.591 | | | | TinyNet (GhostNet-A) | 2020-10-28 |
MobileOne: An Improved One millisecond Mobile Backbone | ✓ Link | 79.4% | 14.8M | 2.978 | | | | MobileOne-S4 | 2022-06-08 |
Designing Network Design Spaces | ✓ Link | 79.4% | 20.6M | 4 | | | | RegNetY-4.0GF | 2020-03-30 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 79.4% | 28.02M | | | | | SENet-50 | 2021-01-27 |
Data-Driven Neuron Allocation for Scale Aggregation Networks | ✓ Link | 79.38% | | 11.2 | | | | ScaleNet-152 | 2019-04-20 |
LIP: Local Importance-based Pooling | ✓ Link | 79.33% | 42.9M | | | | | LIP-ResNet-101 | 2019-08-12 |
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features | ✓ Link | 79.3% | 5.8M | 1.841 | | | | MobileViTv3-S | 2022-09-30 |
Involution: Inverting the Inherence of Convolution for Visual Recognition | ✓ Link | 79.3% | 34M | 6.8 | | | | RedNet-152 | 2021-03-10 |
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers | ✓ Link | 79.25% | 25.6M | | | | | ResNet-50 (AutoMix) | 2021-03-24 |
Self-Knowledge Distillation with Progressive Refinement of Targets | ✓ Link | 79.24% | | | | | | PS-KD (ResNet-152 + CutMix) | 2020-06-22 |
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era | ✓ Link | 79.2% | | | | | | ResNet-101 (JFT-300M Finetuning) | 2017-07-10 |
Towards Robust Vision Transformer | ✓ Link | 79.2% | 10.9M | 1.3 | | | | RVT-Ti* | 2021-05-17 |
Multiscale Deep Equilibrium Models | ✓ Link | 79.2% | 81M | | | | | Multiscale DEQ (MDEQ-XL) | 2020-06-15 |
How to Use Dropout Correctly on Residual Networks with Batch Normalization | ✓ Link | 79.152% | | | | | | DenseNet-169 (H4*) | 2023-02-13 |
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures | ✓ Link | 79.12% | 5.7M | | | | | SimpleNetV1-5m-correct-labels | 2016-08-22 |
AlphaNet: Improved Training of Supernets with Alpha-Divergence | ✓ Link | 79.1% | | 0.317 | | | | AlphaNet-A2 | 2021-02-16 |
GhostNetV3: Exploring the Training Strategies for Compact Models | ✓ Link | 79.1% | | | | 94.5 | | GhostNetV3 1.3x | 2024-04-17 |
Attention Augmented Convolutional Networks | ✓ Link | 79.1% | | | | | | AA-ResNet-152 | 2019-04-22 |
Fixing the train-test resolution discrepancy | ✓ Link | 79.1% | | | | | | FixResNet-50 | 2019-06-14 |
MobileOne: An Improved One millisecond Mobile Backbone | ✓ Link | 79.1% | | | | | | MobileOne-S2 (distill) | 2022-06-08 |
Your Diffusion Model is Secretly a Zero-Shot Classifier | ✓ Link | 79.1% | | | | | | Diffusion Classifier | 2023-03-28 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | 79.1% | | | | | | FastViT-T12 | 2023-03-24 |
TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ✓ Link | 79.1% | 5.4M | 1.3 | | | | TinyViT-5M | 2022-07-21 |
Rethinking Spatial Dimensions of Vision Transformers | ✓ Link | 79.1% | 10.6M | 1.4 | | | | PiT-XS | 2021-03-30 |
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP | | 79.1% | 11.9M | 0.56 | | | | UniNet-B0 | 2021-10-08 |
Involution: Inverting the Inherence of Convolution for Visual Recognition | ✓ Link | 79.1% | 25.6M | 4.7 | | | | RedNet-101 | 2021-03-10 |
Kolmogorov-Arnold Transformer | ✓ Link | 79.1% | 86.6M | 16.87 | | | | ViT-B/16 | 2024-09-16 |
Unsupervised Data Augmentation for Consistency Training | ✓ Link | 79.04% | | | | | | ResNet-50 (UDA) | 2019-04-29 |
Data-Driven Neuron Allocation for Scale Aggregation Networks | ✓ Link | 79.03% | | 7.5 | | | | ScaleNet-101 | 2019-04-20 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 79.03% | | | | | | TransBoost-ResNet50 | 2022-05-26 |
Contextual Convolutional Neural Networks | ✓ Link | 79.03% | 60M | | | | | Co-ResNet-152 | 2021-08-17 |
Semi-Supervised Recognition under a Noisy and Fine-grained Dataset | ✓ Link | 79.0% | 5.47M | | | | | MobileNetV3_large_x1_0_ssld | 2020-06-18 |
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network | ✓ Link | 79% | 10.6M | 1.37 | | | | RevBiFPN-S2 | 2022-06-28 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 79% | 17.4M | | | | | ConvMLP-M | 2021-09-09 |
Xception: Deep Learning with Depthwise Separable Convolutions | ✓ Link | 79% | 22.9M | | 87G | | 0.838G | Xception | 2016-10-07 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 79% | 60.5M | 9.1 | | | | SpineNet-143 | 2019-12-10 |
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations | ✓ Link | 79% | 64M | | | | | Mixer-B/8-SAM | 2021-06-03 |
Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks | ✓ Link | 78.95% | | | | | | InceptionV3 (FRN layer) | 2019-11-21 |
Averaging Weights Leads to Wider Optima and Better Generalization | ✓ Link | 78.94% | | | | | | ResNet-152 + SWA | 2018-03-14 |
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks | ✓ Link | 78.92% | 57.40M | 10.83 | | | | ECA-Net (ResNet-152) | 2019-10-08 |
AlphaNet: Improved Training of Supernets with Alpha-Divergence | ✓ Link | 78.9% | | 0.279 | | | | AlphaNet-A1 | 2021-02-16 |
MixConv: Mixed Depthwise Convolutional Kernels | ✓ Link | 78.9% | 7.3M | 0.565 | | | | MixNet-L | 2019-07-22 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 78.8% | | 3.6 | | | | CeiT-T (384 finetune res) | 2021-03-22 |
| | 78.8% | | | | 94.4 | | Inception V3 | |
Self-training with Noisy Student improves ImageNet classification | ✓ Link | 78.8% | 5.3M | | | | | NoisyStudent (EfficientNet-B0) | 2019-11-11 |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ✓ Link | 78.8% | 7.8M | 0.7 | | | | EfficientNet-B1 | 2019-05-28 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 78.8% | 25.5M | | | | | ResNet-50 | 2021-01-27 |
Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks | ✓ Link | 78.798% | 44.55M | 7.858 | | | | SGE-ResNet101 | 2019-05-23 |
RepVGG: Making VGG-style ConvNets Great Again | ✓ Link | 78.78% | 80.31M | 18.4 | | | | RepVGG-B2 | 2021-01-11 |
Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup | ✓ Link | 78.76% | | | | | | ResNet-50 (PuzzleMix) | 2020-09-15 |
| | 78.75% | | | | | | SAMix+DM (ResNet-50 RSB A3) | |
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks | ✓ Link | 78.7% | | | | | | ResNet-50 + AutoDropout | 2021-01-05 |
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications | ✓ Link | 78.7% | 3.2M | 0.56 | | | | CAS-ViT-XS | 2024-08-07 |
A Fast Knowledge Distillation Framework for Visual Recognition | ✓ Link | 78.7% | 5M | 1.2 | | | | SReT-LT (Fast Knowledge Distillation) | 2021-12-02 |
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 78.7% | 13.1M | 2.1 | | | | PVTv2-B1 | 2021-06-25 |
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks | ✓ Link | 78.65% | 42.49M | 7.35 | | | | ECA-Net (ResNet-101) | 2019-10-08 |
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features | ✓ Link | 78.64% | | 1.876 | | | | MobileViTv3-1.0 | 2022-09-30 |
ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer | ✓ Link | 78.63% | 5M | 3.48 | | | | EdgeFormer-S | 2022-03-08 |
| | 78.62% | | | | | | AutoMix+DM (ResNet-50 RSB A3) | |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 78.6% | | | | | | UniRepLKNet-F | 2023-11-27 |
RCKD: Response-Based Cross-Task Knowledge Distillation for Pathological Image Analysis | | 78.6% | 3M | | | | | CSAT | 2023-10-29 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 78.60% | 5.29M | | | | | TransBoost-EfficientNetB0 | 2022-05-26 |
Visformer: The Vision-friendly Transformer | ✓ Link | 78.6% | 10.3M | 1.3 | | | | Visformer-Ti | 2021-04-26 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 78.6% | 17.7M | 3 | | | | ResMLP-12 (distilled, class-MLP) | 2021-05-07 |
RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition | ✓ Link | 78.60% | 52.77M | | | | | RepMLP-Res50 | 2021-05-05 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 78.59% | | | | | | Res2Net-50-299 | 2019-04-02 |
Deep Residual Learning for Image Recognition | ✓ Link | 78.57% | | 11.3 | | | | ResNet-152 | 2015-12-10 |
Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation | ✓ Link | 78.5% | | | | | | ResNet-50-DW (Deformable Kernels) | 2019-10-07 |
HRFormer: High-Resolution Transformer for Dense Prediction | ✓ Link | 78.5% | 8.0M | 1.8 | | | | HRFormer-T | 2021-10-18 |
Gated Convolutional Networks with Hybrid Connectivity for Image Classification | ✓ Link | 78.5% | 12.9M | 2.0 | | | | HCGNet-B | 2019-08-26 |
RepVGG: Making VGG-style ConvNets Great Again | ✓ Link | 78.5% | 55.77M | 11.3 | | | | RepVGG-B2g4 | 2021-01-11 |
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition | ✓ Link | 78.48% | | 0.6 | | | | DVT (T2T-ViT-7) | 2021-05-31 |
SRM : A Style-based Recalibration Module for Convolutional Neural Networks | ✓ Link | 78.47% | | | | | | SRM-ResNet-101 | 2019-03-26 |
Averaging Weights Leads to Wider Optima and Better Generalization | ✓ Link | 78.44% | | | | | | DenseNet-161 + SWA | 2018-03-14 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 78.42% | | | | | | CoaT-Ti | 2023-08-18 |
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 78.4% | | 0.280 | | | | FBNetV5-AC-CLS | 2021-11-19 |
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features | ✓ Link | 78.4% | | | | | | ResNet-50 (CutMix) | 2019-05-13 |
Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels | ✓ Link | 78.4% | 4.8M | | | | | ReXNet_1.0-relabel | 2021-01-13 |
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer | ✓ Link | 78.4% | 5.6M | | | | | MobileViT-S | 2021-10-05 |
Involution: Inverting the Inherence of Convolution for Visual Recognition | ✓ Link | 78.4% | 15.5M | 2.7 | | | | RedNet-50 | 2021-03-10 |
| | 78.36% | | | | | | ResNet-50 (SAMix+DM) | |
DropBlock: A regularization method for convolutional networks | ✓ Link | 78.35% | | | | | | ResNet-50 + DropBlock (0.9 kp, 0.1 label smoothing) | 2018-10-30 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 78.34% | | | | | | Poly-SA-ViT-S | 2023-08-18 |
CondConv: Conditionally Parameterized Convolutions for Efficient Inference | ✓ Link | 78.3% | | 0.826 | | | | EfficientNet-B0 (CondConv) | 2019-04-10 |
Deep Residual Learning for Image Recognition | ✓ Link | 78.25% | 40M | 7.6 | | | | ResNet-101 | 2015-12-10 |
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training | ✓ Link | 78.2% | | 0.208 | | | | NASViT-A0 | 2021-09-29 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 78.2% | | | | | | MultiGrain R50-AA-224 | 2019-02-14 |
Vision GNN: An Image is Worth Graph of Nodes | ✓ Link | 78.2% | 10.7M | 1.7 | | | | Pyramid ViG-Ti | 2022-06-01 |
LocalViT: Bringing Locality to Vision Transformers | ✓ Link | 78.2% | 13.5M | 4.8 | | | | LocalViT-PVT | 2021-04-12 |
| | 78.15% | | | | | | ResNet-50 (AutoMix+DM) | |
| | 78.15% | | | | | | PuzzleMix+DM (ResNet-50 RSB A3) | |
LIP: Local Importance-based Pooling | ✓ Link | 78.15% | 25.8M | | | | | ResNet-50 (LIP Bottleneck-256) | 2019-08-12 |
Wide Residual Networks | ✓ Link | 78.1% | | | | | | WRN-50-2-bottleneck | 2016-05-23 |
Separable Self-attention for Mobile Vision Transformers | ✓ Link | 78.1% | 4.9M | 1.8 | | | | MobileViTv2-1.0 | 2022-06-06 |
MobileOne: An Improved One millisecond Mobile Backbone | ✓ Link | 78.1% | 10.1M | 1.896 | | | | MobileOne-S3 | 2022-06-08 |
ResNet strikes back: An improved training procedure in timm | ✓ Link | 78.1% | 25M | | | | | ResNet50 (A3) | 2021-10-01 |
Zen-NAS: A Zero-Shot NAS for High-Performance Deep Image Recognition | ✓ Link | 78% | 5.7M | 0.820 | | | | ZenNet-400M-SE | 2021-02-01 |
Designing Network Design Spaces | ✓ Link | 78% | 11.2M | 1.6 | | | | RegNetY-1.6GF | 2020-03-30 |
Scalable Vision Transformers with Hierarchical Pooling | ✓ Link | 78.00% | 21.74M | 2.4 | | | | HVT-S-1 | 2021-03-19 |
Perceiver: General Perception with Iterative Attention | ✓ Link | 78% | 44.9M | 707.2 | | | | Perceiver (FF) | 2021-03-04 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 77.9% | 4.8M | 0.40 | | | | ReXNet_1.0 | 2020-07-02 |
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias | ✓ Link | 77.9% | 6.5M | 4 | | | | ViTAE-6M | 2021-06-07 |
Densely Connected Convolutional Networks | ✓ Link | 77.85% | | | | | | DenseNet-264 | 2016-08-25 |
AlphaNet: Improved Training of Supernets with Alpha-Divergence | ✓ Link | 77.8% | | 0.203 | | | | AlphaNet-A0 | 2021-02-16 |
Data-Driven Neuron Allocation for Scale Aggregation Networks | ✓ Link | 77.8% | | 3.8 | | | | ScaleNet-50 | 2019-04-20 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 77.8% | 15.4M | | | | | ResMLP-S12 | 2021-05-07 |
| | 77.71% | | | | | | ResNet-50 (PuzzleMix+DM) | |
Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets | ✓ Link | 77.7% | 5.1M | 0.339 | | | | TinyNet-A + RA | 2020-10-28 |
Fast AutoAugment | ✓ Link | 77.6% | | | | | | ResNet-50 (Fast AA) | 2019-05-01 |
Sliced Recursive Transformer | ✓ Link | 77.6% | 4.8M | 1.1 | | | | SReT-T | 2021-11-09 |
Involution: Inverting the Inherence of Convolution for Visual Recognition | ✓ Link | 77.6% | 12.4M | 2.2 | | | | RedNet-38 | 2021-03-10 |
Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks | ✓ Link | 77.584% | 25.56M | 4.127 | | | | SGE-ResNet50 | 2019-05-23 |
Go Wider Instead of Deeper | ✓ Link | 77.54% | 29M | | | | | WideNet-B | 2021-07-25 |
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks | ✓ Link | 77.5% | | | | | | EfficientNet-B0 + AutoDropout | 2021-01-05 |
Adaptively Connected Neural Networks | ✓ Link | 77.5% | 29.38M | | | | | ACNet (ResNet-50) | 2019-04-07 |
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks | ✓ Link | 77.48% | 24.37M | 3.86 | | | | ECA-Net (ResNet-50) | 2019-10-08 |
Densely Connected Convolutional Networks | ✓ Link | 77.42% | | | | | | DenseNet-201 | 2016-08-25 |
MobileOne: An Improved One millisecond Mobile Backbone | ✓ Link | 77.4% | 7.8M | 1.299 | | | | MobileOne-S2 | 2022-06-08 |
Expeditious Saliency-guided Mix-up through Random Gradient Thresholding | ✓ Link | 77.39% | | | | | | R-Mix (ResNet-50) | 2022-12-09 |
Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks | ✓ Link | 77.21% | | | | | | ResnetV2 50 (FRN layer) | 2019-11-21 |
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 77.2% | | 0.215 | | | | FBNetV5-AR-CLS | 2021-11-19 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 77.2% | 3M | 1.04 | | | | MogaNet-XT (256res) | 2022-11-07 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 77.2% | 4.1M | 0.35 | | | | ReXNet_0.9 | 2020-07-02 |
Deep Polynomial Neural Networks | ✓ Link | 77.17% | | | | | | Prodpoly | 2020-06-20 |
Bag of Tricks for Image Classification with Convolutional Neural Networks | ✓ Link | 77.16% | 25M | | | | | ResNet-50-D | 2018-12-04 |
What do Deep Networks Like to See? | ✓ Link | 77.12% | | | | | | Inception v3 | 2018-03-22 |
GhostNetV3: Exploring the Training Strategies for Compact Models | ✓ Link | 77.1% | | | | 93.3 | | GhostNetV3 1.0x | 2024-04-17 |
Meta Knowledge Distillation | | 77.1% | | | | | | MKD ViT-T | 2022-02-16 |
GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet | | 77.1% | 6.5M | 0.366 | | | | GreedyNAS-A | 2020-03-25 |
Bias Loss for Mobile Neural Networks | ✓ Link | 77.1% | 7.1M | 0.364 | | | | SkipblockNet-L | 2021-07-23 |
Compress image to patches for Vision Transformer | ✓ Link | 77% | | 6.442 | | | | CI2P-ViT | 2025-02-14 |
Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks | ✓ Link | 77.0% | | | | | | SSAL-Resnet50 | 2021-01-07 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 77% | | | | | | UniRepLKNet-A | 2023-11-27 |
Rethinking Local Perception in Lightweight Vision Transformer | ✓ Link | 77% | 4.2M | 0.6 | | | | CloFormer-XXS | 2023-03-31 |
MixConv: Mixed Depthwise Convolutional Kernels | ✓ Link | 77% | 5.0M | 0.360 | | | | MixNet-M | 2019-07-22 |
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective | ✓ Link | 76.91% | | | | | | ResNet50 (FSGDM) | 2024-11-29 |
SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search | ✓ Link | 76.9% | 6.7M | 0.730 | | | | SCARLET-A | 2019-08-16 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 76.81% | 5.48M | | | | | TransBoost-MobileNetV3-L | 2022-05-26 |
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias | ✓ Link | 76.8% | 4.8M | 4.6 | | | | ViTAE-T-Stage | 2021-06-07 |
GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet | | 76.8% | 5.2M | 0.324 | | | | GreedyNAS-B | 2020-03-25 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 76.8% | 9M | | | | | ConvMLP-S | 2021-09-09 |
Learning Visual Representations for Transfer Learning by Suppressing Texture | ✓ Link | 76.71% | | | | | | Perona Malik (Perona and Malik, 1990) | 2020-11-03 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 76.7% | | | | | | PVT-T (+MixPro) | 2023-04-24 |
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features | ✓ Link | 76.7% | 2.5M | 0.927 | | | | MobileViTv3-XS | 2022-09-30 |
MnasNet: Platform-Aware Neural Architecture Search for Mobile | ✓ Link | 76.7% | 5.2M | 0.806 | | | 0.0403G | MnasNet-A3 | 2018-07-31 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 76.7% | 6.7M | 1.3 | | | | ViL-Tiny-RPB | 2021-03-29 |
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases | ✓ Link | 76.7% | 10M | 2 | | | | ConViT-Ti+ | 2021-03-19 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 76.70% | 21.8M | | | | | TransBoost-ResNet34 | 2022-05-26 |
LIP: Local Importance-based Pooling | ✓ Link | 76.64% | 8.7M | | | | | LIP-DenseNet-BC-121 | 2019-08-12 |
X-volution: On the unification of convolution and self-attention | | 76.6% | | | | | | ResNet-50 (X-volution, stage3) | 2021-06-04 |
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ Link | 76.6% | 4.0M | 0.636 | | | | MUXNet-l | 2020-03-31 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 76.6% | 5M | | | | | DeiT-Ti (distilled, 1000 epochs) | 2020-12-23 |
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features | ✓ Link | 76.55% | 3M | 1.064 | | | | MobileViTv3-0.75 | 2022-09-30 |
MLP-Mixer: An all-MLP Architecture for Vision | ✓ Link | 76.44% | 46M | | | | | Mixer-B/16 | 2021-05-04 |
Perceiver: General Perception with Iterative Attention | ✓ Link | 76.4% | | | | | | Perceiver | 2021-03-04 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 76.4% | 6.4M | 1.2 | | | | CeiT-T | 2021-03-22 |
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup | ✓ Link | 76.35% | 21.8M | | | | | ResNet-34 (SAMix) | 2021-11-30 |
| | 76.3% | | | | 93.2 | | VGG | |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ✓ Link | 76.3% | 5.3M | 0.39 | | | | EfficientNet-B0 | 2019-05-28 |
Designing Network Design Spaces | ✓ Link | 76.3% | 6.3M | 0.8 | | | | RegNetY-800MF | 2020-03-30 |
SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search | ✓ Link | 76.3% | 6.5M | 0.658 | | | | SCARLET-B | 2019-08-16 |
GLiT: Neural Architecture Search for Global and Local Image Transformer | ✓ Link | 76.3% | 7.2M | 1.4 | | | | GLiT-Tinys | 2021-07-07 |
Densely Connected Convolutional Networks | ✓ Link | 76.2% | | | | | | DenseNet-169 | 2016-08-25 |
GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet | | 76.2% | 4.7M | 0.284 | | | | GreedyNAS-C | 2020-03-25 |
Bias Loss for Mobile Neural Networks | ✓ Link | 76.2% | 5.5M | 0.246 | | | | SkipblockNet-M | 2021-07-23 |
A Simple Episodic Linear Probe Improves Visual Recognition in the Wild | ✓ Link | 76.13% | | | | | | ELP (naive ResNet50) | 2022-01-01 |
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers | ✓ Link | 76.1% | 21.8M | | | | | ResNet-34 (AutoMix) | 2021-03-24 |
A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes | | 75.92% | | | | | | ResNet-50 MLPerf v0.7 - 2512 steps | 2021-02-12 |
Densely Connected Search Space for More Flexible Neural Architecture Search | ✓ Link | 75.9% | | | | | | DenseNAS-A | 2019-06-23 |
MobileOne: An Improved One millisecond Mobile Backbone | ✓ Link | 75.9% | 4.8M | 0.825 | | | | MobileOne-S1 | 2022-06-08 |
MoGA: Searching Beyond MobileNetV3 | ✓ Link | 75.9% | 5.1M | 0.608 | | | 0.0304G | MoGA-A | 2019-08-04 |
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network | ✓ Link | 75.9% | 5.11M | 0.62 | | | | RevBiFPN-S1 | 2022-06-28 |
LocalViT: Bringing Locality to Vision Transformers | ✓ Link | 75.9% | 6.3M | 1.4 | | | | LocalViT-TNT | 2021-04-12 |
Semantic-Aware Local-Global Vision Transformer | | 75.9% | 6.5M | | | | | SALG-ST | 2022-11-27 |
Involution: Inverting the Inherence of Convolution for Visual Recognition | ✓ Link | 75.9% | 9.2M | 1.7 | | | | RedNet-26 | 2021-03-10 |
FractalNet: Ultra-Deep Neural Networks without Residuals | ✓ Link | 75.88% | | | | | | FractalNet-34 | 2016-05-24 |
MixConv: Mixed Depthwise Convolutional Kernels | ✓ Link | 75.8% | 4.1M | 0.256 | | | | MixNet-S | 2019-07-22 |
An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution | ✓ Link | 75.74% | | | | | | CoordConv ResNet-50 | 2018-07-09 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 75.7% | 4.7M | 0.288 | | | | LeViT-128S | 2021-04-02 |
GhostNet: More Features from Cheap Operations | ✓ Link | 75.7% | 7.3M | 0.226 | | | | GhostNet ×1.3 | 2019-11-27 |
Local Relation Networks for Image Recognition | ✓ Link | 75.7% | 14.7M | 2.6 | | | | LR-Net-26 | 2019-04-25 |
Spatial-Channel Token Distillation for Vision MLPs | ✓ Link | 75.7% | 22.2M | 4.3 | | | | Mixer-S16 + STD | 2022-07-23 |
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures | ✓ Link | 75.66% | 3M | | | | | SimpleNetV1-small-075-correct-labels | 2016-08-22 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | 75.6% | | | | | | FastViT-T8 | 2023-03-24 |
Separable Self-attention for Mobile Vision Transformers | ✓ Link | 75.6% | 2.9M | 1.0 | | | | MobileViTv2-0.75 | 2022-06-06 |
MnasNet: Platform-Aware Neural Architecture Search for Mobile | ✓ Link | 75.6% | 4.8M | 0.680 | | | | MnasNet-A2 | 2018-07-31 |
SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search | ✓ Link | 75.6% | 6M | 0.560 | | | | SCARLET-C | 2019-08-16 |
Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples | ✓ Link | 75.5% | | | | | | PAWS (ResNet-50, 10% labels) | 2021-04-28 |
Designing Network Design Spaces | ✓ Link | 75.5% | 6.1M | 0.6 | | | | RegNetY-600MF | 2020-03-30 |
ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design | ✓ Link | 75.4% | | 0.597 | | | | ShuffleNet V2 | 2018-07-30 |
Visual Attention Network | ✓ Link | 75.4% | 4.1M | 0.9 | | | | VAN-B0 | 2022-02-20 |
AsymmNet: Towards ultralight convolution neural networks using asymmetrical bottlenecks | ✓ Link | 75.4% | 5.99M | 0.4338 | | | | AsymmNet-Large ×1.0 | 2021-04-15 |
FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search | ✓ Link | 75.34% | 4.6M | 0.776 | | | | FairNAS-A | 2019-07-03 |
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias | ✓ Link | 75.3% | | 3.0 | | | | ViTAE-T | 2021-06-07 |
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ Link | 75.3% | 3.4M | 0.436 | | | | MUXNet-m | 2020-03-31 |
Deep Residual Learning for Image Recognition | ✓ Link | 75.3% | 25M | 3.8 | | | | ResNet-50 | 2015-12-10 |
MnasNet: Platform-Aware Neural Architecture Search for Mobile | ✓ Link | 75.2% | 3.9M | 0.624 | | | | MnasNet-A1 | 2018-07-31 |
Searching for MobileNetV3 | ✓ Link | 75.2% | 5.4M | 0.438 | | | | MobileNet V3-Large 1.0 | 2019-05-06 |
DiCENet: Dimension-wise Convolutions for Efficient Networks | ✓ Link | 75.1% | | 0.553 | | | | DiCENet | 2019-06-08 |
MultiGrain: a unified image embedding for classes and instances | ✓ Link | 75.1% | | | | | | MultiGrain NASNet-A-Mobile (350px) | 2019-02-14 |
FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search | ✓ Link | 75.10% | 4.5M | 0.690 | | | | FairNAS-B | 2019-07-03 |
X-volution: On the unification of convolution and self-attention | | 75% | | | | | | ResNet-34 (X-volution, stage3) | 2021-06-04 |
GhostNet: More Features from Cheap Operations | ✓ Link | 75% | 13M | 2.2 | | | | Ghost-ResNet-50 (s=2) | 2019-11-27 |
Densely Connected Convolutional Networks | ✓ Link | 74.98% | | | | | | DenseNet-121 | 2016-08-25 |
Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours | ✓ Link | 74.96% | | | | | | Single-Path NAS | 2019-04-05 |
WaveMix: A Resource-efficient Neural Network for Image Analysis | ✓ Link | 74.93% | | | | | | WaveMix-192/16 (level 3) | 2022-05-28 |
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet | ✓ Link | 74.9% | | | | | | FF | 2021-05-06 |
FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search | ✓ Link | 74.9% | 5.5M | 0.375 | | | | FBNet-C | 2018-12-09 |
ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network | ✓ Link | 74.9% | 5.9M | 0.602 | | | | ESPNetv2 | 2018-11-28 |
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer | ✓ Link | 74.8% | 2.3M | 0.7 | | | | MobileViT-XS | 2021-10-05 |
LocalViT: Bringing Locality to Vision Transformers | ✓ Link | 74.8% | 5.9M | 1.3 | | | | LocalViT-T | 2021-04-12 |
Exploring Randomly Wired Neural Networks for Image Recognition | ✓ Link | 74.7% | 5.6M | 0.583 | | | | RandWire-WS (small) | 2019-04-02 |
AutoFormer: Searching Transformers for Visual Recognition | ✓ Link | 74.7% | 5.7M | 1.3 | | | | AutoFormer-tiny | 2021-07-01 |
MobileNetV2: Inverted Residuals and Linear Bottlenecks | ✓ Link | 74.7% | 6.9M | 1.170 | | | | MobileNetV2 (1.4) | 2018-01-13 |
FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search | ✓ Link | 74.69% | 4.4M | 0.642 | | | | FairNAS-C | 2019-07-03 |
Rethinking Channel Dimensions for Efficient Model Design | ✓ Link | 74.6% | 2.7M | | | | | ReXNet_0.6 | 2020-07-02 |
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware | ✓ Link | 74.6% | 4.0M | | | | | Proxyless | 2018-12-02 |
Rethinking Spatial Dimensions of Vision Transformers | ✓ Link | 74.6% | 4.9M | 0.7 | | | | PiT-Ti | 2021-03-30 |
Dynamic Convolution: Attention over Convolution Kernels | ✓ Link | 74.4% | 11.1M | 0.626 | | | | DY-MobileNetV2 ×1.0 | 2019-12-07 |
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures | ✓ Link | 74.17% | 9.5M | | | | | SimpleNetV1-9m | 2016-08-22 |
Designing Network Design Spaces | ✓ Link | 74.1% | 4.3M | 0.4 | | | | RegNetY-400MF | 2020-03-30 |
GhostNet: More Features from Cheap Operations | ✓ Link | 74.1% | 6.5M | 1.2 | | | | Ghost-ResNet-50 (s=4) | 2019-11-27 |
Sliced Recursive Transformer | ✓ Link | 74.0% | 4M | 0.7 | | | | SReT-ExT | 2021-11-09 |
GhostNet: More Features from Cheap Operations | ✓ Link | 73.9% | 5.2M | 0.141 | | | | GhostNet ×1.0 | 2019-11-27 |
MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer | ✓ Link | 73.8% | | | | | | DeiT-T (+MixPro) | 2023-04-24 |
MobileNetV4 -- Universal Models for the Mobile Ecosystem | ✓ Link | 73.8% | | | | | | MNv4-Conv-S | 2024-04-16 |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 73.7% | 6M | 2.568 | | | | DeiT-Ti with iRPE-K | 2021-07-29 |
Distilled Gradual Pruning with Pruned Fine-tuning | ✓ Link | 73.66% | 2.56M | 0.4 | | | | DGPPF-ResNet50 | 2024-02-15 |
TransBoost: Improving the Best ImageNet Performance using Deep Transduction | ✓ Link | 73.36% | 11.69M | | | | | TransBoost-ResNet18 | 2022-05-26 |
What's Hidden in a Randomly Weighted Neural Network? | ✓ Link | 73.3% | 20.6M | | | | | Wide ResNet-50 (edge-popup) | 2019-11-29 |
MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks | ✓ Link | 73.19% | | | | | | ResNet-18 (MEAL V2) | 2020-09-17 |
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases | ✓ Link | 73.1% | 6M | 1 | | | | ConViT-Ti | 2021-03-19 |
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network | ✓ Link | 72.8% | 3.42M | 0.31 | | | | RevBiFPN-S0 | 2022-06-28 |
Dynamic Convolution: Attention over Convolution Kernels | ✓ Link | 72.8% | 7M | 0.435 | | | | DY-MobileNetV2 ×0.75 | 2019-12-07 |
Dynamic Convolution: Attention over Convolution Kernels | ✓ Link | 72.7% | 42.7M | 3.7 | | | | DY-ResNet-18 | 2019-12-07 |
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks | ✓ Link | 72.56% | 3.34M | 0.320 | | | | ECA-Net (MobileNetV2) | 2019-10-08 |
Compact Global Descriptor for Neural Networks | ✓ Link | 72.56% | 4.26M | 1.198 | | | | MobileNet-224 (CGD) | 2019-07-23 |
MobileOne: An Improved One millisecond Mobile Backbone | ✓ Link | 72.5% | 2.1M | 0.275 | | | | MobileOne-S0 (distill) | 2022-06-08 |
LocalViT: Bringing Locality to Vision Transformers | ✓ Link | 72.5% | 4.3M | 1.2 | | | | LocalViT-T2T | 2021-04-12 |
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features | ✓ Link | 72.33% | 1.4M | 0.481 | | | | MobileViTv3-0.5 | 2022-09-30 |
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup | ✓ Link | 72.33% | 11.7M | | | | | ResNet-18 (SAMix) | 2021-11-30 |
On the adequacy of untuned warmup for adaptive optimization | ✓ Link | 72.1% | | | | | | ResNet-50 | 2019-10-09 |
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers | ✓ Link | 72.05% | 11.7M | | | | | ResNet-18 (AutoMix) | 2021-03-24 |
MobileNetV2: Inverted Residuals and Linear Bottlenecks | ✓ Link | 72% | 3.4M | 0.600 | | | | MobileNetV2 | 2018-01-13 |
QuantNet: Learning to Quantize by Learning within Fully Differentiable Framework | | 71.97% | | | | | | QuantNet | 2020-09-10 |
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures | ✓ Link | 71.94% | 5.7M | | | | | SimpleNetV1-5m | 2016-08-22 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 71.71% | | | | | | ResNet-18 (PAD-L2 w/ ResNet-34 teacher) | 2020-11-25 |
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ Link | 71.6% | 2.4M | 0.234 | | | | MUXNet-s | 2020-03-31 |
Augmenting Deep Classifiers with Polynomial Neural Networks | ✓ Link | 71.6% | 11.51M | | | | | PDC | 2021-04-16 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 71.56% | | | | | | ResNet-18 (FT w/ ResNet-34 teacher) | 2020-11-25 |
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers | ✓ Link | 71.53% | | | | | | EfficientFormer-V2-S0 | 2023-08-18 |
MobileOne: An Improved One millisecond Mobile Backbone | ✓ Link | 71.4% | 2.1M | | | | | MobileOne-S0 | 2022-06-08 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 71.37% | | | | | | ResNet-18 (KD w/ ResNet-34 teacher) | 2020-11-25 |
Differentiable Spike: Rethinking Gradient-Descent for Training Spiking Neural Networks | | 71.24% | | | | | | Dspike (VGG-16) | 2021-12-01 |
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications | ✓ Link | 71.2% | 1.3M | 0.522 | | | | EdgeNeXt-XXS | 2022-06-21 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 71.08% | | | | | | ResNet-18 (L2 w/ ResNet-34 teacher) | 2020-11-25 |
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features | ✓ Link | 70.98% | 1.2M | 0.289 | | | | MobileViTv3-XXS | 2022-09-30 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 70.93% | | | | | | ResNet-18 (CRD w/ ResNet-34 teacher) | 2020-11-25 |
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices | ✓ Link | 70.9% | | | | | | ShuffleNet | 2017-07-04 |
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications | ✓ Link | 70.6% | | 1.138 | | | | MobileNet-224 ×1.25 | 2017-04-17 |
| | 70.54% | | | | | | PSN (SEW ResNet-34) | |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 70.52% | | | | | | ResNet-18 (tf-KD w/ ResNet-18 teacher) | 2020-11-25 |
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 70.5% | 3.4M | 0.6 | | | | PVTv2-B0 | 2021-06-25 |
Gated Attention Coding for Training High-performance and Efficient Spiking Neural Networks | ✓ Link | 70.42% | | | | | | GAC-SNN MS-ResNet-34 | 2023-08-12 |
Separable Self-attention for Mobile Vision Transformers | ✓ Link | 70.2% | 1.4M | 0.5 | | | | MobileViTv2-0.5 | 2022-06-06 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 70.09% | | | | | | ResNet-18 (SSKD w/ ResNet-34 teacher) | 2020-11-25 |
Dynamic Convolution: Attention over Convolution Kernels | ✓ Link | 69.7% | 4.8M | 0.137 | | | | DY-MobileNetV3-Small | 2019-12-07 |
Scalable Vision Transformers with Hierarchical Pooling | ✓ Link | 69.64% | 5.74M | 0.64 | | | | HVT-Ti-1 | 2021-03-19 |
GhostNetV3: Exploring the Training Strategies for Compact Models | ✓ Link | 69.4% | | | | 88.5 | | GhostNetV3 0.5x | 2024-04-17 |
Dynamic Convolution: Attention over Convolution Kernels | ✓ Link | 69.4% | 4M | 0.203 | | | | DY-MobileNetV2 ×0.5 | 2019-12-07 |
AsymmNet: Towards ultralight convolution neural networks using asymmetrical bottlenecks | ✓ Link | 69.2% | 2.8M | 0.1344 | | | | AsymmNet-Large ×0.5 | 2021-04-15 |
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures | ✓ Link | 69.11% | 1.5M | | | | | SimpleNetV1-small-05-correct-labels | 2016-08-22 |
Correlated Input-Dependent Label Noise in Large-Scale Image Classification | | 68.6% | | | | | | Heteroscedastic (InceptionResNet-v2) | 2021-05-19 |
AsymmNet: Towards ultralight convolution neural networks using asymmetrical bottlenecks | ✓ Link | 68.4% | 3.1M | 0.1154 | | | | AsymmNet-Small ×1.0 | 2021-04-15 |
FireCaffe: near-linear acceleration of deep neural network training on compute clusters | | 68.3% | | | | | | FireCaffe (GoogLeNet) | 2015-10-31 |
Graph-RISE: Graph-Regularized Image Semantic Embedding | ✓ Link | 68.29% | | | | | | Graph-RISE (40M) | 2019-02-14 |
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures | ✓ Link | 68.15% | 3M | | | | | SimpleNetV1-small-075 | 2016-08-22 |
"BNN - BN = ?": Training Binary Neural Networks without Batch Normalization | ✓ Link | 68.0% | | | | | | ReActNet-A (BN-Free) | 2021-04-16 |
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective | ✓ Link | 67.74% | | | | | | ResNet34 (FSGDM) | 2024-11-29 |
Dynamic Convolution: Attention over Convolution Kernels | ✓ Link | 67.7% | 18.6M | 1.82 | | | | DY-ResNet-10 | 2019-12-07 |
WaveMix-Lite: A Resource-efficient Neural Network for Image Analysis | ✓ Link | 67.7% | 32.4M | | | | | WaveMixLite-256/24 | 2022-10-13 |
| | 67.63% | | | | | | PSN (SEW ResNet-18) | |
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ Link | 66.7% | 1.8M | 0.132 | | | | MUXNet-xs | 2020-03-31 |
Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples | ✓ Link | 66.5% | | | | | | PAWS (ResNet-50, 1% labels) | 2021-04-28 |
GhostNet: More Features from Cheap Operations | ✓ Link | 66.2% | 2.6M | 0.042 | | | | GhostNet ×0.5 | 2019-11-27 |
| | 66.04% | | | | 86.76 | | OverFeat | |
Distilled Gradual Pruning with Pruned Fine-tuning | ✓ Link | 65.59% | 1.03M | 0.1 | | | | DGPPF-MobileNetV2 | 2024-02-15 |
Distilled Gradual Pruning with Pruned Fine-tuning | ✓ Link | 65.22% | 1.15M | 0.2 | | | | DGPPF-ResNet18 | 2024-02-15 |
Online Training Through Time for Spiking Neural Networks | ✓ Link | 65.15% | | | | | | OTTT | 2022-10-09 |
Dynamic Convolution: Attention over Convolution Kernels | ✓ Link | 64.9% | 2.8M | 0.124 | | | | DY-MobileNetV2 ×0.35 | 2019-12-07 |
| | 63.3% | 62M | | | 84.6 | | AlexNet | |
Balanced Binary Neural Networks with Gated Residual | ✓ Link | 62.6% | | | | | | BBG (ResNet-34) | 2019-09-26 |
Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures | ✓ Link | 61.52% | 1.5M | | | | | SimpleNetV1-small-05 | 2016-08-22 |
Balanced Binary Neural Networks with Gated Residual | ✓ Link | 59.4% | | | | | | BBG (ResNet-18) | 2019-09-26 |
FireCaffe: near-linear acceleration of deep neural network training on compute clusters | | 58.9% | | | | | | FireCaffe (AlexNet) | 2015-10-31 |
0/1 Deep Neural Networks via Block Coordinate Descent | | 38.3% | | | | | | HMAX | 2022-06-19 |
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ✓ Link | 24% | | | | | | ViT-Large | 2020-10-22 |
Escaping the Big Data Paradigm with Compact Transformers | ✓ Link | | 22.36M | 11.06 | | | | CCT-14/7x2 | 2021-04-12 |
MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ✓ Link | | 241.5M | | | | | MambaVision-L2 | 2024-07-10 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | | 1520M | | | | | ONE-PEACE | 2023-05-18 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | | 2700M | | | | | AIMv2-2B | 2024-11-21 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | | 3000M | | | | | InternImage-DCNv3-G (M3I Pre-training) | 2022-11-10 |
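
For reference, the top-1 and top-5 columns above are conventionally computed by checking whether the ground-truth label appears among a model's k highest-scoring classes on the ImageNet validation set. The sketch below is illustrative only and is not taken from any of the papers listed; it assumes PyTorch, and the tensor shapes are the usual `[N, 1000]` logits and `[N]` integer labels.

```python
# Minimal sketch: top-k accuracy as reported in the table's top-1/top-5 columns.
# Assumes PyTorch; `logits` and `targets` are placeholders for real model outputs
# and ImageNet validation labels.
import torch

@torch.no_grad()
def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, ks=(1, 5)):
    """Return {k: accuracy} for logits of shape [N, C] and targets of shape [N]."""
    maxk = max(ks)
    # Indices of the maxk highest-scoring classes per example: shape [N, maxk]
    _, pred = logits.topk(maxk, dim=1)
    # Boolean [N, maxk]: does any of the top-k predictions match the label?
    correct = pred.eq(targets.unsqueeze(1))
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}

# Toy usage with random scores (4 examples, 10 classes):
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(topk_accuracy(logits, targets))  # e.g. {1: 0.25, 5: 0.5}
```

Accumulating these counts over the full 50,000-image validation set (rather than averaging per-batch percentages) gives the exact figures leaderboards report; the params and GFLOPs columns, by contrast, are taken from each paper's own accounting.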