Paper | Code | Top-1 Acc. | Params | Model / Method | Date
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 94.6% | | OmniVec2 | 2024-01-01
OmniVec: Learning robust representations with cross modal sharing | | 93.8% | | OmniVec | 2023-11-07
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 92.6% | | InternImage-H | 2022-11-10 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 91.3% | | MAWS (ViT-2B) | 2023-03-23 |
MetaFormer: A Unified Meta Framework for Fine-Grained Recognition | ✓ Link | 88.7% | | MetaFormer (MetaFormer-2, 384, extra_info) | 2022-03-05
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 87.3% | | Hiera-H (448px) | 2023-06-01 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 86.8% | | MAE (ViT-H, 448) | 2021-11-11 |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 86.0% | | SWAG (ViT H/14) | 2022-01-20 |
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 84.7% | | SEER (RegNet10B - finetuned - 384px) | 2022-02-16 |
MetaFormer: A Unified Meta Framework for Fine-Grained Recognition | ✓ Link | 84.3% | | MetaFormer (MetaFormer-2, 384) | 2022-03-05
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 84.1% | | OMNIVORE (Swin-L) | 2022-01-20 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 81.8% | 186M | RDNet-L (224 res, IN-1K pretrained) | 2024-03-28 |
Grafit: Learning fine-grained image representations with coarse labels | | 81.2% | | RegNet-8GF | 2020-11-25 |
VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition | ✓ Link | 81.0% | | VL-LTR (ViT-B-16) | 2021-11-26 |
A Continual Development Methodology for Large-scale Multitask Dynamic ML Systems | ✓ Link | 80.97% | | µ2Net+ (ViT-L/16) | 2022-09-15
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 80.5% | 87M | RDNet-B (224 res, IN-1K pretrained) | 2024-03-28
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | 80.3% | | MixMIM-L | 2022-05-26 |
Training data-efficient image transformers & distillation through attention | ✓ Link | 79.5% | | DeiT-B | 2020-12-23 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 79.4% | | CeiT-S (384 finetune resolution) | 2021-03-22 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 79.1% | 50M | RDNet-S (224 res, IN-1K pretrained) | 2024-03-28
Generalized Parametric Contrastive Learning | ✓ Link | 78.1% | | GPaCo (ResNet-152) | 2022-09-26 |
Going deeper with Image Transformers | ✓ Link | 78.0% | | CaiT-M-36 U 224 | 2021-03-31
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | 77.5% | | MixMIM-B | 2022-05-26 |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs | ✓ Link | 77.0% | 24M | RDNet-T (224 res, IN-1K pretrained) | 2024-03-28
Generalized Parametric Contrastive Learning | ✓ Link | 75.4% | | GPaCo (ResNet-50) | 2022-09-26 |
Class-Balanced Distillation for Long-Tailed Visual Recognition | ✓ Link | 75.3% | | CBD-ENS (ResNet-101) | 2021-04-12 |
Three things everyone should know about Vision Transformers | ✓ Link | 75.3% | | ViT-L (attn finetune) | 2022-03-18 |
Parametric Contrastive Learning | ✓ Link | 75.2% | | PaCo(ResNet-152) | 2021-07-26 |
VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition | ✓ Link | 74.6% | | VL-LTR (ResNet-50) | 2021-11-26 |
The Majority Can Help The Minority: Context-rich Minority Oversampling for Long-tailed Classification | ✓ Link | 74.0% | | BS-CMO (ResNet-50) | 2021-12-01 |
Class-Balanced Distillation for Long-Tailed Visual Recognition | ✓ Link | 73.6% | | CBD-ENS (ResNet-50) | 2021-04-12 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 73.3% | | CeiT-S | 2021-03-22 |
Self-Supervised Aggregation of Diverse Experts for Test-Agnostic Long-Tailed Recognition | ✓ Link | 72.9% | | TADE (ResNet-50) | 2021-07-20 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 72.2% | | CeiT-T (384 finetune resolution) | 2021-03-22 |
Long-tailed Recognition by Routing Diverse Distribution-Aware Experts | ✓ Link | 72.2% | | RIDE (ResNet-50) | 2020-10-05 |
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup | ✓ Link | 70.54% | | ResNeXt-101 (SAMix) | 2021-11-30 |
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers | ✓ Link | 70.49% | | ResNeXt-101 (AutoMix) | 2021-03-24 |
Disentangling Label Distribution for Long-tailed Visual Recognition | ✓ Link | 70.0% | | LADE | 2020-12-01 |
Grafit: Learning fine-grained image representations with coarse labels | | 69.8% | | ResNet-50 | 2020-11-25 |
Feature Space Augmentation for Long-Tailed Data | | 69.08% | | ResNet-152 | 2020-08-09 |
Class-Balanced Loss Based on Effective Number of Samples | ✓ Link | 69.05% | | ResNet-152 | 2019-01-16 |
MetaSAug: Meta Semantic Augmentation for Long-Tailed Visual Recognition | ✓ Link | 68.75% | | MetaSAug | 2021-03-23 |
Feature Space Augmentation for Long-Tailed Data | | 68.39% | | ResNet-101 | 2020-08-09 |
Class-Balanced Loss Based on Effective Number of Samples | ✓ Link | 67.98% | | ResNet-101 | 2019-01-16 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 66.9% | | LeViT-384 | 2021-04-02 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 66.2% | | LeViT-256 | 2021-04-02 |
Feature Space Augmentation for Long-Tailed Data | | 65.91% | | ResNet-50 | 2020-08-09 |
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup | ✓ Link | 64.84% | | ResNet-50 (SAMix) | 2021-11-30 |
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers | ✓ Link | 64.73% | | ResNet-50 (AutoMix) | 2021-03-24 |
Incorporating Convolution Designs into Visual Transformers | ✓ Link | 64.3% | | CeiT-T | 2021-03-22 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 64.3% | | ResMLP-24 | 2021-05-07
Class-Balanced Loss Based on Effective Number of Samples | ✓ Link | 64.16% | | ResNet-50 | 2019-01-16 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 60.4% | | LeViT-192 | 2021-04-02 |
The iNaturalist Species Classification and Detection Dataset | ✓ Link | 60.20% | | Inception-V3 | 2017-07-20 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 60.2% | | ResMLP-12 | 2021-05-07
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 55.2% | | LeViT-128S | 2021-04-02 |
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | ✓ Link | 54.0% | | LeViT-128 | 2021-04-02
ClusterFit: Improving Generalization of Visual Representations | ✓ Link | 49.7% | | ResNet-50 | 2019-12-06 |
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | ✓ Link | 48.6% | | ResNet-50 | 2020-06-17
Barlow Twins: Self-Supervised Learning via Redundancy Reduction | ✓ Link | 46.5% | | Barlow Twins (ResNet-50) | 2021-03-04
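The rows above follow a fixed pipe-delimited layout: title | code link | top-1 accuracy | params | model | date. A minimal sketch of turning such rows into sortable records follows; the field order and the `✓ Link` code marker are assumptions taken from this table, and the two sample rows are copied from it.

```python
def parse_row(line: str) -> dict:
    """Split one pipe-delimited leaderboard row into a record.

    Assumed field order (from the table above):
    title | code | top-1 accuracy | params | model | date
    """
    title, code, acc, params, model, date = [f.strip() for f in line.split("|")]
    return {
        "title": title,
        "has_code": code.startswith("✓"),   # "✓ Link" marks a code release
        "top1": float(acc.rstrip("%")),      # "86.8%" -> 86.8
        "params": params or None,            # empty cell -> None
        "model": model,
        "date": date,
    }

sample = [
    "Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 86.8% | | MAE (ViT-H, 448) | 2021-11-11",
    "OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 94.6% | | OmniVec2 | 2024-01-01",
]

# Sort descending by top-1 accuracy, as the leaderboard itself is ordered.
records = sorted((parse_row(r) for r in sample), key=lambda r: r["top1"], reverse=True)
print(records[0]["model"], records[0]["top1"])  # → OmniVec2 94.6
```

Splitting on `|` is safe here only because the titles and model names in this table contain no pipe characters; a real export should use a proper CSV/TSV parser.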