The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 63.6 | | 1610 | | | | ViT-P (InternImage-H) | 2025-05-26 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 63.0 | | 1500 | | | | ONE-PEACE | 2023-05-18 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 62.9 | | 1310 | | 4635 | | InternImage-H | 2022-11-10 |
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | ✓ Link | 62.9 | | 1310 | | | | M3I Pre-training (InternImage-H) | 2022-11-17 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 62.8 | | 1900 | | | | BEiT-3 | 2022-08-22 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 62.3 | | 1074 | | | | EVA | 2022-11-14 |
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 61.6 | | 1400 | | | | ViT-P (OneFormer, InternImage-H) | 2025-05-26 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 61.5 | | 571 | | | | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | 2022-05-17 |
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | ✓ Link | 61.4 | | 3000 | | | | FD-SwinV2-G | 2022-05-27 |
Reversible Column Networks | ✓ Link | 61.0 | | 2439 | | | | RevCol-H (Mask2Former) | 2022-12-22 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 60.8 | | 223 | | | | Mask DINO (SwinL, multi-scale) | 2022-06-06 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.5 | | 571 | | | | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 2022-05-17 |
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 60.2 | | 1080 | | | | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | 2023-04-14 |
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 59.9 | | 309 | | | | ViT-P (OneFormer, DiNAT-L) | 2025-05-26 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 59.9 | | | | | | SwinV2-G(UperNet) | 2021-11-18 |
Parameter-Inverted Image Pyramid Networks | ✓ Link | 59.9 | | | | | | PIIP-LH6B(UperNet) | 2024-06-06 |
SERNet-Former: Semantic Segmentation by Efficient Residual Network with Attention-Boosting Gates and Attention-Fusion Networks | ✓ Link | 59.35 | | | | | | SERNet-Former | 2024-01-28 |
Focal Modulation Networks | ✓ Link | 58.5 | | | | | | FocalNet-L (Mask2Former) | 2022-03-22 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 58.4 | | 451 | | | | ViT-Adapter-L (UperNet, BEiT pretrain) | 2022-05-17 |
Representation Separation for Semantic Segmentation with Vision Transformers | | 58.4 | | 330 | | | | RSSeg-ViT-L (BEiT pretrain) | 2022-12-28 |
Your ViT is Secretly an Image Segmentation Model | ✓ Link | 58.4 | | 316 | 721 | 721 | 58.4 | EoMT (DINOv2-L, single-scale, 512x512) | 2025-03-24 |
SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers | ✓ Link | 58.2 | | | 637.9 | | | SegViT-v2 (BEiT-v2-Large) | 2023-06-09 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 58.2 | | | | | | SeMask (SeMask Swin-L FaPN-Mask2Former) | 2021-12-23 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 58.2 | | | | | | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 2021-12-23 |
Dilated Neighborhood Attention Transformer | ✓ Link | 58.1 | | | | | | DiNAT-L (Mask2Former) | 2022-09-29 |
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions | ✓ Link | 57.9 | | | | | | HorNet-L (Mask2Former) | 2022-07-28 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 57.7 | | | | | | Mask2Former (SwinL-FaPN) | 2021-12-02 |
Dynamic Focus-aware Positional Queries for Semantic Segmentation | ✓ Link | 57.7 | | | | | | FASeg (SwinL) | 2022-04-04 |
Region Rebalance for Long-Tailed Semantic Segmentation | ✓ Link | 57.7 | | | | | | RR (BEiT-L) | 2022-04-05 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 57.6 | | 496 | | | | MOAT-4 (IN-22K pretraining, single-scale) | 2022-10-04 |
Could Giant Pretrained Image Models Extract Universal Representations? | | 57.6 | | | | | | Frozen Backbone, SwinV2-G-ext22K (Mask2Former) | 2022-11-03 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 57.5 | | | | | | SeMask (SeMask Swin-L Mask2Former) | 2021-12-23 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 57.3 | | | | | | Mask2Former (SwinL) | 2021-12-02 |
Efficient Self-Ensemble for Semantic Segmentation | ✓ Link | 57.1 | | | | | | SenFormer (BEiT-L) | 2021-11-26 |
BEiT: BERT Pre-Training of Image Transformers | ✓ Link | 57.0 | | | | | | BEiT-L (ViT+UperNet) | 2021-06-15 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 57.0 | | | | | | SeMask(SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 2021-12-23 |
Harnessing Diffusion Models for Visual Perception with Meta Prompts | ✓ Link | 56.8 | | | | | | MetaPrompt-SD | 2023-12-22 |
FaPN: Feature-aligned Pyramid Network for Dense Image Prediction | ✓ Link | 56.7 | | | | | | FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | 2021-08-16 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 56.5 | | 198 | | | | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 56.4 | | | | | | Mask2Former (Swin-L-FaPN) | 2021-12-02 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 56.2 | | | | | | SeMask (SeMask Swin-L MaskFormer) | 2021-12-23 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 56.2 | | | | | | dBOT ViT-L (CLIP) | 2022-09-08 |
Conditional Boundary Loss for Semantic Segmentation | ✓ Link | 56.1 | | | | | | Mask2Former+CBL(Swin-B) | 2023-07-05 |
Text-image Alignment for Diffusion-based Perception | ✓ Link | 55.9 | | | | | | TADP | 2023-09-29 |
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | ✓ Link | 55.70 | | | | | | CSWin-L (UperNet, ImageNet-22k pretrain) | 2021-07-01 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 55.6 | | | | | | UniRepLKNet-XL | 2023-11-27 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 55.40 | | | | | | Focal-L (UperNet, ImageNet-22k pretrain) | 2021-07-01 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 55.3 | | 368 | | 3142 | | InternImage-XL | 2022-11-10 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 55.2 | | | | | | dBOT ViT-L | 2022-09-08 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 55.1 | | | | | | Mask2Former(Swin-B) | 2021-12-02 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 55 | | | | | | ConvNeXt V2-H (FCMAE) | 2023-01-02 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 55 | | | | | | UniRepLKNet-L++ | 2023-11-27 |
Dilated Neighborhood Attention Transformer | ✓ Link | 54.9 | | | | | | DiNAT-Large (UperNet) | 2022-09-29 |
Conditional Boundary Loss for Semantic Segmentation | ✓ Link | 54.9 | | | | | | MaskFormer+CBL(Swin-B) | 2023-07-05 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 54.7 | | 109 | | | | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | 2023-11-28 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 54.7 | | 81 | | | | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04 |
Context Autoencoder for Self-Supervised Representation Learning | ✓ Link | 54.7 | | | | | | CAE (ViT-L, UperNet) | 2022-02-07 |
Visual Attention Network | ✓ Link | 54.7 | | | | | | VAN-B6 | 2022-02-20 |
Dilated Neighborhood Attention Transformer | ✓ Link | 54.6 | | | | | | DiNAT_s-Large (UperNet) | 2022-09-29 |
DDP: Diffusion Model for Dense Visual Prediction | ✓ Link | 54.4 | | 207 | | | | DDP (Swin-L, step-3) | 2023-03-30 |
Vision Transformers with Patch Diversification | ✓ Link | 54.4 | | | | | | PatchDiverse + Swin-L (multi-scale test, upernet, ImageNet22k pretrain) | 2021-04-26 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 54.3 | | | | | | VOLO-D5 | 2021-06-24 |
K-Net: Towards Unified Image Segmentation | ✓ Link | 54.3 | | | | | | K-Net | 2021-06-28 |
Generalized Parametric Contrastive Learning | ✓ Link | 54.3 | | | | | | GPaCo (Swin-L) | 2022-09-26 |
Efficient Self-Ensemble for Semantic Segmentation | ✓ Link | 54.2 | | | | | | SenFormer (Swin-L) | 2021-11-26 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 54.2 | | | | | | Swin V2-H | 2023-01-02 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 54.1 | | 256 | | 2526 | | InternImage-L | 2022-11-10 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 54.1 | | 69 | | | | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | 2023-11-28 |
A ConvNet for the 2020s | ✓ Link | 54 | | 391 | 3335 | | | ConvNeXt-XL++ | 2022-01-10 |
Sequential Ensembling for Semantic Segmentation | | 54 | | 216.3 | | | | Sequential Ensemble (SegFormer) | 2022-10-08 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 54 | | | | | | MogaNet-XL (UperNet) | 2022-11-07 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 53.9 | | | | | | UniRepLKNet-B++ | 2023-11-27 |
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ Link | 53.8 | | | | | | MaskFormer(Swin-B) | 2021-07-13 |
A ConvNet for the 2020s | ✓ Link | 53.7 | | 235 | 2458 | | | ConvNeXt-L++ | 2022-01-10 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 53.7 | | | | | | SwinV2-G-HTC++ | 2021-11-18 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 53.7 | | | | | | ConvNeXt V2-L | 2023-01-02 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 53.63 | | | | | | Seg-L-Mask/16 (MS) | 2021-05-12 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 53.6 | | | | | | MAE (ViT-L, UperNet) | 2021-11-11 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 53.52 | | | | | | SeMask (SeMask Swin-L FPN) | 2021-12-23 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 53.50 | 62.8 | | | | | Swin-L (UperNet, ImageNet-22k pretrain) | 2021-03-25 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 53.5 | | | | | | Swin-L | 2023-01-02 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 53.4 | | 47.5 | | | | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | 2023-11-28 |
A ConvNet for the 2020s | ✓ Link | 53.1 | | 122 | 1828 | | | ConvNeXt-B++ | 2022-01-10 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 52.9 | | | | | | PatchConvNet-L120 (UperNet) | 2021-12-27 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 52.9 | | | | | | dBOT ViT-B (CLIP) | 2022-09-08 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 52.8 | | | | | | PatchConvNet-B120 (UperNet) | 2021-12-27 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 52.8 | | | | | | Swin-B | 2023-01-02 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 52.7 | | | | | | UniRepLKNet-S++ | 2023-11-27 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 52.1 | | | | | | ConvNeXt V2-B | 2023-01-02 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | 52.0 | | | | | | DeBiFormer-B (IN1k pretrain, Upernet 160k) | 2024-10-11 |
All Tokens Matter: Token Labeling for Training Better Vision Transformers | ✓ Link | 51.8 | | 209 | | | | LV-ViT-L (UperNet, MS) | 2021-04-22 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 51.8 | | 84.7 | | | | SegFormer-B5 | 2021-05-31 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | 51.7 | | | | | | BiFormer-B (IN1k pretrain, Upernet 160k) | 2023-03-15 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 51.6 | | | | | | ConvNeXt V2-L (Supervised) | 2023-01-02 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 51.5 | | 61.1 | 71.8 | | | Light-Ham (VAN-Huge) | 2021-09-09 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 51.5 | | | | | | DAT-B++ | 2023-09-04 |
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention | ✓ Link | 51.4 | | | | | | CrossFormer (ImageNet1k-pretrain, UPerNet, multi-scale test) | 2021-07-31 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 51.3 | | 128 | | 1185 | | InternImage-B | 2022-11-10 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 51.2 | | | | | | DAT-S++ | 2023-09-04 |
Active Token Mixer | ✓ Link | 51.1 | | 108 | | | | ActiveMLP-L(UperNet) | 2022-03-11 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 51.1 | | 64.1 | | | | SegFormer-B4 | 2021-05-31 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 51.1 | | | | | | PatchConvNet-B60 (UperNet) | 2021-12-27 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 51.0 | | 45.6 | 55.0 | | | Light-Ham (VAN-Large) | 2021-09-09 |
Towards Sustainable Self-supervised Learning | ✓ Link | 51.0 | | | | | | TEC (ViT-B, UperNet) | 2022-10-20 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 51 | | | | | | UniRepLKNet-S | 2023-11-27 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 50.98 | | 96 | | | | SeMask (SeMask Swin-B FPN) | 2021-12-23 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 50.9 | | 80 | | 1017 | | InternImage-S | 2022-11-10 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 50.9 | | | 1176 | | | MogaNet-L (UperNet) | 2022-11-07 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 50.8 | | | | | | dBOT ViT-B | 2022-09-08 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | 50.8 | | | | | | Upernet-BiFormer-S (IN1k pretrain, Upernet 160k) | 2023-03-15 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 50.5 | | | | | | UperNet Shuffle-B | 2021-06-07 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 50.5 | | | | | | ConvNeXt V1-L | 2023-01-02 |
Dilated Neighborhood Attention Transformer | ✓ Link | 50.4 | | | | | | DiNAT-Base (UperNet) | 2022-09-29 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 50.3 | | | | | | ELSA-Swin-S | 2021-12-23 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 50.3 | | | | | | DAT-T++ | 2023-09-04 |
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers | ✓ Link | 50.28 | | | | | | SETR-MLA (160k, MS) | 2020-12-31 |
Visual Attention Network | ✓ Link | 50.2 | | 55 | | | | VAN-Large (HamNet) | 2022-02-20 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 50.2 | | 28.7 | 67.9 | | | HRViT-b3 (SegFormer, SS) | 2021-11-01 |
Twins: Revisiting the Design of Spatial Attention in Vision Transformers | ✓ Link | 50.2 | | | | | | Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 2021-04-28 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 50.1 | | | 1050 | | | MogaNet-B (UperNet) | 2022-11-07 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 50.0 | | | | | | Seg-B-Mask/16(MS, ViT-B) | 2021-05-12 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 50.0 | | | | | | iBOT (ViT-B/16) | 2021-11-15 |
A ConvNet for the 2020s | ✓ Link | 49.9 | | 122 | 1170 | | | ConvNeXt-B | 2022-01-10 |
Dilated Neighborhood Attention Transformer | ✓ Link | 49.9 | | | | | | DiNAT-Small (UperNet) | 2022-09-29 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 49.9 | | | | | | ConvNeXt V1-B | 2023-01-02 |
Neighborhood Attention Transformer | ✓ Link | 49.7 | | 123 | 1137 | | | NAT-Base | 2022-04-14 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 49.7 | | | | | | Swin-B (UperNet, ImageNet-1k pretrain) | 2021-03-25 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 49.61 | | | | | | Seg-B/8 (MS, ViT-B) | 2021-05-12 |
A ConvNet for the 2020s | ✓ Link | 49.6 | | 82 | 1027 | | | ConvNeXt-S | 2022-01-10 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 49.6 | | 27.4 | 34.4 | | | Light-Ham (VAN-Base) | 2021-09-09 |
Neighborhood Attention Transformer | ✓ Link | 49.5 | | 82 | 1010 | | | NAT-Small | 2022-04-14 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 49.4 | | | | | | DaViT-B | 2022-04-07 |
Vision Transformer with Deformable Attention | ✓ Link | 49.38 | | 121 | | | | DAT-B (UperNet) | 2022-01-03 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 49.3 | | | | | | PatchConvNet-S60 (UperNet) | 2021-12-27 |
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders | ✓ Link | 49.3 | | | | | | ColorMAE-Green-ViTB-1600 | 2024-07-17 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 49.2 | | | 946 | | | MogaNet-S (UperNet) | 2022-11-07 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 49.2 | | | | | | Shift-B (UperNet) | 2022-01-26 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 49.1 | | | | | | UniRepLKNet-T | 2023-11-27 |
Vision Transformers for Dense Prediction | ✓ Link | 49.02 | | | | | | DPT-Hybrid | 2021-03-24 |
Global Context Vision Transformers | ✓ Link | 49 | | 125 | 1348 | | | GC ViT-B | 2022-06-20 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 49 | | | | | | A2MIM (ViT-B) | 2022-05-27 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 49 | | | | | | EfficientViT-B3 (r512) | 2022-05-29 |
Dilated Neighborhood Attention Transformer | ✓ Link | 48.8 | | | | | | DiNAT-Tiny (UperNet) | 2022-09-29 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 48.76 | | 20.8 | 28.0 | | | HRViT-b2 (SegFormer, SS) | 2021-11-01 |
Neighborhood Attention Transformer | ✓ Link | 48.4 | | 58 | 934 | | | NAT-Tiny | 2022-04-14 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.4 | | | | | | XCiT-M24/8 (UperNet) | 2021-06-17 |
ResNeSt: Split-Attention Networks | ✓ Link | 48.36 | | | | | | ResNeSt-200 | 2020-04-19 |
Vision Transformer with Deformable Attention | ✓ Link | 48.31 | | 81 | | | | DAT-S (UperNet) | 2022-01-03 |
Global Context Vision Transformers | ✓ Link | 48.3 | | 84 | 1163 | | | GC ViT-S | 2022-06-20 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 48.1 | | 59 | | 944 | | InternImage-T | 2022-11-10 |
Visual Attention Network | ✓ Link | 48.1 | | 49 | | | | VAN-Large | 2022-02-20 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.1 | | | | | | XCiT-S24/8 (UperNet) | 2021-06-17 |
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ Link | 48.1 | | | | | | MaskFormer(ResNet-101) | 2021-07-13 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 48.1 | | | | | | MAE (ViT-B, UperNet) | 2021-11-11 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 47.98 | | | | | | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 2019-09-24 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 47.9 | | | | | | Shift-B | 2022-01-26 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 47.8 | | | | | | Shift-S | 2022-01-26 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 47.7 | | | 189 | | | MogaNet-S (Semantic FPN) | 2022-11-07 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 47.63 | | 56 | | | | SeMask (SeMask Swin-S FPN) | 2021-12-23 |
ResNeSt: Split-Attention Networks | ✓ Link | 47.60 | | | | | | ResNeSt-269 | 2020-04-19 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 47.6 | | | | | | UperNet Shuffle-T | 2021-06-07 |
CondNet: Conditional Classifier for Scene Segmentation | ✓ Link | 47.54 | | | | | | CondNet(ResNest-101) | 2021-09-21 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 47.5 | | 24 | | | | tiny-MOAT-3 (IN-1K pretraining, single scale) | 2022-10-04 |
CondNet: Conditional Classifier for Scene Segmentation | ✓ Link | 47.38 | | | | | | CondNet(ResNet-101) | 2021-09-21 |
Dilated Neighborhood Attention Transformer | ✓ Link | 47.2 | | | | | | DiNAT-Mini (UperNet) | 2022-09-29 |
DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation | | 47.12 | | | | | | DCNAS | 2020-03-26 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 47.1 | | | | | | XCiT-S24/8 (Semantic-FPN) | 2021-06-17 |
ResNeSt: Split-Attention Networks | ✓ Link | 46.91 | | | | | | ResNeSt-101 | 2020-04-19 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 46.9 | | | | | | XCiT-M24/8 (Semantic-FPN) | 2021-06-17 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 46.8 | | | | | | HamNet (ResNet-101) | 2021-09-09 |
Sequential Ensembling for Semantic Segmentation | | 46.8 | | | | | | Sequential Ensemble (DeepLabv3+) | 2022-10-08 |
A ConvNet for the 2020s | ✓ Link | 46.7 | | 60 | 939 | | | ConvNeXt-T | 2022-01-10 |
Visual Attention Network | ✓ Link | 46.7 | | | | | | VAN-Base (Semantic-FPN) | 2022-02-20 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 46.6 | | | | | | XCiT-S12/8 (UperNet) | 2021-06-17 |
Global Context Vision Transformers | ✓ Link | 46.5 | | 58 | 947 | | | GC ViT-T | 2022-06-20 |
Neighborhood Attention Transformer | ✓ Link | 46.4 | | 50 | 900 | | | NAT-Mini | 2022-04-14 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 46.3 | | | | | | Shift-T | 2022-01-26 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 46.3 | | | | | | DaViT-T | 2022-04-07 |
Context Prior for Scene Segmentation | ✓ Link | 46.27 | | | | | | CPN(ResNet-101) | 2020-04-03 |
MultiMAE: Multi-modal Multi-task Masked Autoencoders | ✓ Link | 46.2 | | | | | | MultiMAE (ViT-B) | 2022-04-04 |
Scene Segmentation with Dual Relation-aware Attention Network | ✓ Link | 46.18 | | | | | | DRAN(ResNet-101) | 2020-08-05 |
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition | ✓ Link | 45.99 | 56.52 | | | | | PyConvSegNet-152 | 2020-06-20 |
Disentangled Non-Local Neural Networks | ✓ Link | 45.97 | | | | | | DNL | 2020-06-11 |
Adaptive Context Network for Scene Parsing | | 45.90 | | | | | | ACNet (ResNet-101) | 2019-11-05 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 45.88 | | 8.2 | 14.6 | | | HRViT-b1 (SegFormer, SS) | 2021-11-01 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 45.66 | | | | | | OCR(HRNetV2-W48) | 2019-09-24 |
Strip Pooling: Rethinking Spatial Pooling for Scene Parsing | ✓ Link | 45.6 | | | | | | SPNet (ResNet-101) | 2020-03-30 |
Self-Supervised Learning with Swin Transformers | ✓ Link | 45.58 | | | | | | Swin-T (UPerNet) MoBY | 2021-05-10 |
Vision Transformer with Deformable Attention | ✓ Link | 45.54 | | 60 | | | | DAT-T (UperNet) | 2022-01-03 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 45.4 | | | | | | iBOT (ViT-S/16) | 2021-11-15 |
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks | ✓ Link | 45.33 | | | | | | EANet (ResNet-101) | 2021-05-05 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 45.28 | | | | | | OCR (ResNet-101) | 2019-09-24 |
Asymmetric Non-local Neural Networks for Semantic Segmentation | ✓ Link | 45.24 | | | | | | Asymmetric ALNN | 2019-08-21 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 45.2 | | 13.8 | 15.8 | | | Light-Ham (VAN-Small, D=256) | 2021-09-09 |
Location-aware Upsampling for Semantic Segmentation | ✓ Link | 45.02 | 56.32 | | | | | LaU-regression-loss | 2019-11-13 |
Pyramid Scene Parsing Network | ✓ Link | 44.94 | 55.38 | | | | | PSPNet | 2016-12-04 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 44.9 | | 13 | | | | tiny-MOAT-2 (IN-1K pretraining, single scale) | 2022-10-04 |
Co-Occurrent Features in Semantic Segmentation | ✓ Link | 44.89 | | | | | | CFNet(ResNet-101) | 2019-06-01 |
Context Encoding for Semantic Segmentation | ✓ Link | 44.65 | 55.67 | | | | | EncNet | 2018-03-23 |
Location-aware Upsampling for Semantic Segmentation | ✓ Link | 44.55 | 56.41 | | | | | LaU-offset-loss | 2019-11-13 |
FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation | ✓ Link | 44.34 | 55.84 | | | | | EncNet + JPU | 2019-03-28 |
Symbolic Graph Reasoning Meets Convolutions | ✓ Link | 44.32 | | | | | | SGR (ResNet-101) | 2018-12-01 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 44.2 | | | | | | XCiT-S12/8 (Semantic-FPN) | 2021-06-17 |
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation | ✓ Link | 43.98 | | | | | | Auto-DeepLab-L | 2019-01-10 |
PSANet: Point-wise Spatial Attention Network for Scene Parsing | ✓ Link | 43.77 | | | | | | PSANet (ResNet-101) | 2018-09-01 |
Dynamic-structured Semantic Propagation Network | | 43.68 | | | | | | DSSPN (ResNet-101) | 2018-03-16 |
Pyramid Scene Parsing Network | ✓ Link | 43.51 | | | | | | PSPNet (ResNet-152) | 2016-12-04 |
Pyramid Scene Parsing Network | ✓ Link | 43.29 | | | | | | PSPNet (ResNet-101) | 2016-12-04 |
High-Resolution Representations for Labeling Pixels and Regions | ✓ Link | 43.2 | | | | | | HRNetV2 | 2019-04-09 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 43.16 | | 35 | | | | SeMask (SeMask Swin-T FPN) | 2021-12-23 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 43.1 | | 8 | | | | tiny-MOAT-1 (IN-1K pretraining, single scale) | 2022-10-04 |
Visual Attention Network | ✓ Link | 42.9 | | 18 | | | | VAN-Small | 2022-02-20 |
MetaFormer Is Actually What You Need for Vision | ✓ Link | 42.7 | | | | | | PoolFormer-M48 | 2021-11-22 |
Unified Perceptual Parsing for Scene Understanding | ✓ Link | 42.66 | | | | | | UperNet (ResNet-101) | 2018-07-26 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 41.2 | | 6 | | | | tiny-MOAT-0 (IN-1K pretraining, single scale) | 2022-10-04 |
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | ✓ Link | 40.7 | | | | | | RefineNet | 2016-11-20 |
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 40.4 | | | | | | FBNetV5 | 2021-11-19 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 40 | | | | | | ConvMLP-L | 2021-09-09 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 38.6 | | | | | | ConvMLP-M | 2021-09-09 |
Visual Attention Network | ✓ Link | 38.5 | | 8 | | | | VAN-Tiny | 2022-02-20 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 38.3 | | | | | | A2MIM (ResNet-50) | 2022-05-27 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 38.3 | | | | | | iBOT (ViT-B/16) (linear head) | 2021-11-15 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 37.4 | | 3.8 | | | | SegFormer-B0 | 2021-05-31 |
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ Link | 35.8 | | | | | | MUXNet-m + PPM | 2020-03-31 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 35.8 | | | | | | ConvMLP-S | 2021-09-09 |
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ Link | 32.42 | | | | | | MUXNet-m + C1 | 2020-03-31 |
Multi-Scale Context Aggregation by Dilated Convolutions | ✓ Link | 32.31 | | | | | | DilatedNet | 2015-11-23 |
Fully Convolutional Networks for Semantic Segmentation | ✓ Link | 29.39 | | | | | | FCN | 2014-11-14 |
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation | ✓ Link | 21.64 | | | | | | SegNet | 2015-11-02 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | | | 1310 | | | | InternImage-H (M3I Pre-training) | 2022-11-10 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | | | | | | 44.6 | FastViT-MA36 | 2023-03-24 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | | | | | | 42.9 | FastViT-SA36 | 2023-03-24 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | | | | | | 41 | FastViT-SA24 | 2023-03-24 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | | | | | | 38 | FastViT-SA12 | 2023-03-24 |