Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 85.9 | | | | AIMv2-3B (448 res) | 2024-11-21 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 83.8 | | | | Hiera-H (448px) | 2023-06-01 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 83.4 | | | | MAE (ViT-H, 448) | 2021-11-11 |
MetaFormer: A Unified Meta Framework for Fine-Grained Recognition | ✓ Link | 83.4 | | | | MetaFormer (MetaFormer-2,384,extra_info) | 2022-03-05 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 81.5 | | | | AIMv2-3B | 2024-11-21 |
ViT-NeT: Interpretable Vision Transformers with Neural Tree Decoder | ✓ Link | 81.2 | | | | ViT-NeT (SwinV2-B) | 2022-07-17 |
MetaFormer: A Unified Meta Framework for Fine-Grained Recognition | ✓ Link | 80.4 | | | | MetaFormer (MetaFormer-2,384) | 2022-03-05 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 79.7 | | | | AIMv2-1B | 2024-11-21 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 77.9 | | | | AIMv2-H | 2024-11-21 |
Multimodal Autoregressive Pre-training of Large Vision Encoders | ✓ Link | 76 | | | | AIMv2-L | 2024-11-21 |
Fixing the train-test resolution discrepancy | ✓ Link | 75.4 | | | | FixSENet-154 | 2019-06-14 |
On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual Recognition | ✓ Link | 72.3 | | | | SEB+EfficientNet-B5 | 2022-05-26 |
TransFG: A Transformer Architecture for Fine-grained Recognition | ✓ Link | 71.7 | | | | TransFG | 2021-03-14 |
The iNaturalist Species Classification and Detection Dataset | ✓ Link | 67.3 | 87.5 | | | IncResNetV2 SE | 2017-07-20 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 63.6 | 84.8 | | | SpineNet-143 | 2019-12-10 |
MetaSAug: Meta Semantic Augmentation for Long-Tailed Visual Recognition | ✓ Link | 63.28 | | | | MetaSAug | 2021-03-23 |
Graph-RISE: Graph-Regularized Image Semantic Embedding | ✓ Link | 31.12 | 52.76 | | | Graph-RISE (40M) | 2019-02-14 |
Deep CNNs Meet Global Covariance Pooling: Better Representation and Generalization | ✓ Link | | | 14.625 | | iSQRT-COV-Net | 2019-04-15 |
DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets | ✓ Link | | | | 75.1 | b_22DeiT-LT(ours) | 2024-04-03 |