Paper | Code | Top-1 Accuracy (%) | Top-5 Accuracy (%) | Model | Date
:--- | :---: | :---: | :---: | :--- | :---
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 82.7 | | CoCa | 2022-05-04
LiT: Zero-Shot Transfer with Locked-image text Tuning | ✓ Link | 82.5 | | LiT | 2021-11-15 |
Combined Scaling for Zero-shot Transfer Learning | | 82.3 | | BASIC | 2021-11-19 |
EVA-CLIP: Improved Training Techniques for CLIP at Scale | ✓ Link | 79.6 | | EVA-02-CLIP-E/14+ | 2023-03-27 |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 79.03 | | Baseline (ViT-G/14) | 2022-03-10 |
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time | ✓ Link | 78.52 | | Model soups (ViT-G/14) | 2022-03-10 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 77.9 | | MAWS (ViT-6.5B) | 2023-03-23 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 75.8 | | MAWS (ViT-2B) | 2023-03-23 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 72.6 | | MAWS (ViT-H) | 2023-03-23 |
Learning Transferable Visual Models From Natural Language Supervision | ✓ Link | 72.3 | | CLIP | 2021-02-26 |
Combined Scaling for Zero-shot Transfer Learning | | 72.2 | | ALIGN | 2021-11-19 |
Robust fine-tuning of zero-shot models | ✓ Link | 72.1 | | WiSE-FT | 2021-09-04 |
PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ Link | 72.0 | | ViT-e | 2022-09-14 |
Scaling Vision Transformers | ✓ Link | 70.53 | | ViT-G/14 | 2021-06-08 |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 69.5 | | SWAG (ViT H/14) | 2022-01-20 |
Scaling Vision Transformers | ✓ Link | 68.5 | | Noisy Student (EfficientNet-L2) | 2021-06-08
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 64.3 | | RegNetY 128GF (Platt) | 2022-01-20 |
A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others | ✓ Link | 60.78 | | LLE (ViT-H/14, MAE, Edge Aug) | 2022-12-09 |
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 60.2 | | SEER (RegNet10B) | 2022-02-16 |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 60.0 | | ViT H/14 (Platt) | 2022-01-20
Big Transfer (BiT): General Visual Representation Learning | ✓ Link | 58.7 | 80.0 | BiT-L (ResNet-152x4) | 2019-12-24
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 57.3 | | ViT L/16 (Platt) | 2022-01-20 |
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy | ✓ Link | 53.9 | | ViT-B/16 (Bamboo) | 2022-03-15
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 52.0 | 73.5 | AR-L (Opt Relevance) | 2022-06-02 |
Matryoshka Representation Learning | ✓ Link | 51.6 | | ALIGN-MRL | 2022-05-26 |
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations | | 50.7 | | ViT-B/16 (ANN-1.3B) | 2021-08-12 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 49.39 | | ViT-B/16 (512x512) + Pyramid | 2021-11-30 |
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations | | 49.1 | | ResNet-101 (JFT-300M) | 2021-08-12 |
Revisiting Weakly Supervised Pre-Training of Visual Perception Models | ✓ Link | 48.9 | | ViT B/16 | 2022-01-20 |
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations | | 48.4 | | ViT-B/32 | 2021-08-12 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 47.53 | | ViT-B/16 (512x512) + Pixel | 2021-11-30 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 47.1 | 70.0 | AR-B (Opt Relevance) | 2022-06-02
Big Transfer (BiT): General Visual Representation Learning | ✓ Link | 47.0 | 69.0 | BiT-M (ResNet-152x4) | 2019-12-24
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 46.68 | | ViT-B/16 (512x512) | 2021-11-30 |
Discrete Representations Strengthen Vision Transformer Robustness | ✓ Link | 46.62 | | ViT-B (Discrete 512x512) | 2021-11-20 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 46.5 | 68.3 | AR-L | 2022-06-02 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 43.2 | 65.8 | ViT-L (Opt Relevance) | 2022-06-02 |
Optimal Representations for Covariate Shift | ✓ Link | 42.80 | | CLIP L | 2021-12-31 |
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations | | 42.5 | | ResNet-50 (JFT-300M) | 2021-08-12 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 42.2 | 65.1 | ViT-B (Opt Relevance) | 2022-06-02 |
Optimal Representations for Covariate Shift | ✓ Link | 42.10 | | CLIP L (LAION) | 2021-12-31 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 41.4 | 63.7 | AR-B | 2022-06-02 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 39.79 | | RegViT (384x384) + Adv Pyramid | 2021-11-30
Generative Interventions for Causal Learning | ✓ Link | 39.38 | 61.43 | ResNet-152 + GenInt with Transfer | 2020-12-22 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 39.3 | 61.7 | AR-S (Opt Relevance) | 2022-06-02 |
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy | ✓ Link | 38.8 | | ResNet-50 (Bamboo) | 2022-03-15 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 37.41 | | RegViT (384x384) + Adv Pixel | 2021-11-30
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 37.4 | 59.5 | ViT-L | 2022-06-02 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 36.3 | 56.6 | DeiT-L (Opt Relevance) | 2022-06-02 |
Big Transfer (BiT): General Visual Representation Learning | ✓ Link | 36.0 | 57.0 | BiT-S (ResNet-152x4) | 2019-12-24
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models | | 35.77 | 56.05 | NASNet-A | 2019-12-01 |
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models | | 35.63 | 54.95 | PNASNet-5L | 2019-12-01 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 35.59 | | RegViT (384x384) | 2021-11-30
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 35.1 | 56.4 | ViT-B | 2022-06-02 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 34.83 | | RegViT (384x384) + Random Pyramid | 2021-11-30
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 34.3 | 55.8 | AR-S | 2022-06-02 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 34.12 | | RegViT (384x384) + Random Pixel | 2021-11-30
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 32.92 | | RegViT (RandAug) + Adv Pyramid | 2021-11-30 |
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models | | 32.24 | 51.98 | Inception-v4 | 2019-12-01 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 31.6 | 53.0 | DeiT-S (Opt Relevance) | 2022-06-02
Context-Gated Convolution | ✓ Link | 31.53 | 50.16 | ResNet-50 + CGC | 2019-10-12 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 31.4 | 48.5 | DeiT-L | 2022-06-02 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 30.98 | | Discrete ViT + Pixel | 2021-11-30 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 30.28 | | Discrete ViT + Pyramid | 2021-11-30 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 30.11 | | RegViT (RandAug) + Adv Pixel | 2021-11-30 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 29.95 | | Discrete ViT | 2021-11-30 |
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models | | 29.59 | 49.4 | ResNet-152 | 2019-12-01 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 29.41 | | RegViT (RandAug) + Random Pyramid | 2021-11-30 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 29.3 | | RegViT (RandAug) | 2021-11-30 |
Improving robustness against common corruptions by covariate shift adaptation | ✓ Link | 29.2 | 50.2 | ResNet-50 + GroupNorm | 2020-06-30 |
Improving robustness against common corruptions by covariate shift adaptation | ✓ Link | 29.2 | | ResNet-50 + RoHL | 2020-06-30 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 28.72 | | RegViT (RandAug) + Random Pixel | 2021-11-30 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 28.6 | | MLP-Mixer + Pyramid | 2021-11-30 |
Improving robustness against common corruptions by covariate shift adaptation | ✓ Link | 28.5 | 48.6 | ResNet-50 + FixUp | 2020-06-30 |
On Mixup Regularization | ✓ Link | 28.37 | | ResNet-50 + MixUp (rescaled) | 2020-06-10 |
Optimizing Relevance Maps of Vision Transformers Improves Robustness | ✓ Link | 28.3 | 47.3 | DeiT-S | 2022-06-02 |
Generative Interventions for Causal Learning | ✓ Link | 27.03 | 48.02 | ResNet-18 + GenInt with Transfer | 2020-12-22 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 25.9 | | MLP-Mixer | 2021-11-30 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 25.9 | | RELICv2 | 2022-01-13 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 25.65 | | ViT + MixUp | 2021-11-30 |
Compressive Visual Representations | ✓ Link | 25.5 | | C-BYOL | 2021-09-27 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 24.75 | | MLP-Mixer + Pixel | 2021-11-30 |
Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations | | 23.9 | | BYOL (BG_RM) | 2021-03-23 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 23.8 | | RELIC | 2022-01-13 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 23.0 | | BYOL | 2022-01-13
Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations | | 21.9 | | SwAV (BG_RM) | 2021-03-23 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 21.61 | | ViT + CutMix | 2021-11-30 |
Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations | | 20.8 | | MoCo-v2 (BG_Swaps) | 2021-03-23 |
Compressive Visual Representations | ✓ Link | 20.8 | | C-SimCLR | 2021-09-27 |
Measuring the Interpretability of Unsupervised Representations via Quantized Reversed Probing | | 20.61 | 48.83 | SeLa(v2) (reverse linear probing) | 2021-09-29 |
Representation Learning by Detecting Incorrect Location Embeddings | ✓ Link | 20.51 | | DILEMMA | 2022-04-10 |
Measuring the Interpretability of Unsupervised Representations via Quantized Reversed Probing | | 19.73 | 46.81 | DeepCluster(v2) (reverse linear probing) | 2021-09-29 |
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models | | 19.13 | 37.15 | VGG-19 | 2019-12-01
Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP) | ✓ Link | 18.70 | | ResNet-50 (ImageNet-Captions) | 2022-05-03 |
Measuring the Interpretability of Unsupervised Representations via Quantized Reversed Probing | | 17.71 | 43.64 | SwAV (reverse linear probing) | 2021-09-29 |
Pyramid Adversarial Training Improves ViT Performance | ✓ Link | 17.36 | | ViT | 2021-11-30 |
Compact and Optimal Deep Learning with Recurrent Parameter Generators | ✓ Link | 16.5 | | ResNet34-RPG | 2021-07-15 |
Robust Cross-Modal Representation Learning with Progressive Self-Distillation | | 15.24 | | CLIP (CC12M pretrain) | 2022-04-10 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 14.6 | | SimCLR | 2022-01-13 |
Class-agnostic Object Detection | | 13.2 | 29.7 | ResNet-152 (FRCNN-ag-ad, VOC) | 2020-11-28 |
Measuring the Interpretability of Unsupervised Representations via Quantized Reversed Probing | | 12.67 | 31.45 | MoCo(v2) (reverse linear probing) | 2021-09-29 |
Measuring the Interpretability of Unsupervised Representations via Quantized Reversed Probing | | 12.64 | 31.71 | MoCHi (reverse linear probing) | 2021-09-29 |
Measuring the Interpretability of Unsupervised Representations via Quantized Reversed Probing | | 12.23 | 31.72 | OBoW (reverse linear probing) | 2021-09-29 |
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models | | 6.78 | 17.6 | AlexNet | 2019-12-01 |
Self-Supervised Learning for Large-Scale Unsupervised Image Clustering | ✓ Link | 4.92 | | BigBiGAN (RevNet-50 4×) | 2020-08-24 |
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ✓ Link | | 82.1 | ViT-H/14 | 2020-10-22 |
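
The zero-shot entries near the top of the table (CLIP, LiT, BASIC, CoCa, EVA-CLIP) are all scored with the same protocol: encode one text prompt per class, encode each test image, and predict the class whose text embedding is nearest by cosine similarity. Below is a minimal sketch of that protocol, assuming OpenAI's `clip` package; the three-class list and the prompt template are illustrative placeholders (ObjectNet itself has 313 classes), not what any listed paper used.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative subset; ObjectNet has 313 classes. Prompt template is a placeholder.
class_names = ["alarm clock", "banana", "bench"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    # Encode the class prompts once; L2-normalize so dot products are cosines.
    text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def zero_shot_top1(images: torch.Tensor) -> torch.Tensor:
    """images: a batch already run through `preprocess`, shape (N, 3, 224, 224)."""
    with torch.no_grad():
        img_feats = model.encode_image(images.to(device))
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        logits = 100.0 * img_feats @ text_feats.T  # scaled cosine similarities
    return logits.argmax(dim=-1)  # predicted class index per image
```

Reported Top-1 is then the fraction of test images whose argmax matches the ground-truth label; the fine-tuned and supervised entries in the table instead score an ordinary trained classification head the same way.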