OpenCodePapers

Semantic Segmentation on ADE20K

Semantic Segmentation
Results over time
[Chart: leaderboard metrics plotted against model release dates]
Leaderboard
Paper | Code | Validation mIoU | Test Score | Params (M) | GFLOPs | Model | Release Date
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ | 63.6 | | 1610 | | ViT-P (InternImage-H) | 2025-05-26
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ | 63.0 | | 1500 | | ONE-PEACE | 2023-05-18
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 62.9 | | 1310 | 4635 | InternImage-H | 2022-11-10
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | ✓ | 62.9 | | 1310 | | M3I Pre-training (InternImage-H) | 2022-11-17
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ | 62.8 | | 1900 | | BEiT-3 | 2022-08-22
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ | 62.3 | | 1074 | | EVA | 2022-11-14
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ | 61.6 | | 1400 | | ViT-P (OneFormer, InternImage-H) | 2025-05-26
Vision Transformer Adapter for Dense Predictions | ✓ | 61.5 | | 571 | | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | 2022-05-17
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | ✓ | 61.4 | | 3000 | | FD-SwinV2-G | 2022-05-27
Reversible Column Networks | ✓ | 61.0 | | 2439 | | RevCol-H (Mask2Former) | 2022-12-22
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ | 60.8 | | 223 | | Mask DINO (SwinL, multi-scale) | 2022-06-06
Vision Transformer Adapter for Dense Predictions | ✓ | 60.5 | | 571 | | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 2022-05-17
DINOv2: Learning Robust Visual Features without Supervision | ✓ | 60.2 | | 1080 | | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2Former) | 2023-04-14
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ | 59.9 | | 309 | | ViT-P (OneFormer, DiNAT-L) | 2025-05-26
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ | 59.9 | | | | SwinV2-G (UperNet) | 2021-11-18
Parameter-Inverted Image Pyramid Networks | ✓ | 59.9 | | | | PIIP-LH6B (UperNet) | 2024-06-06
SERNet-Former: Semantic Segmentation by Efficient Residual Network with Attention-Boosting Gates and Attention-Fusion Networks | ✓ | 59.35 | | | | SERNet-Former | 2024-01-28
Focal Modulation Networks | ✓ | 58.5 | | | | FocalNet-L (Mask2Former) | 2022-03-22
Vision Transformer Adapter for Dense Predictions | ✓ | 58.4 | | 451 | | ViT-Adapter-L (UperNet, BEiT pretrain) | 2022-05-17
Representation Separation for Semantic Segmentation with Vision Transformers | | 58.4 | | 330 | | RSSeg-ViT-L (BEiT pretrain) | 2022-12-28
Your ViT is Secretly an Image Segmentation Model | ✓ | 58.4 | | 316 | 721 | EoMT (DINOv2-L, single-scale, 512x512) | 2025-03-24
SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers | ✓ | 58.2 | | | 637.9 | SegViT-v2 (BEiT-v2-Large) | 2023-06-09
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 58.2 | | | | SeMask (SeMask Swin-L FaPN-Mask2Former) | 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 58.2 | | | | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 2021-12-23
Dilated Neighborhood Attention Transformer | ✓ | 58.1 | | | | DiNAT-L (Mask2Former) | 2022-09-29
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions | ✓ | 57.9 | | | | HorNet-L (Mask2Former) | 2022-07-28
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ | 57.7 | | | | Mask2Former (SwinL-FaPN) | 2021-12-02
Dynamic Focus-aware Positional Queries for Semantic Segmentation | ✓ | 57.7 | | | | FASeg (SwinL) | 2022-04-04
Region Rebalance for Long-Tailed Semantic Segmentation | ✓ | 57.7 | | | | RR (BEiT-L) | 2022-04-05
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 57.6 | | 496 | | MOAT-4 (IN-22K pretraining, single-scale) | 2022-10-04
Could Giant Pretrained Image Models Extract Universal Representations? | | 57.6 | | | | Frozen Backbone, SwinV2-G-ext22K (Mask2Former) | 2022-11-03
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 57.5 | | | | SeMask (SeMask Swin-L Mask2Former) | 2021-12-23
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ | 57.3 | | | | Mask2Former (SwinL) | 2021-12-02
Efficient Self-Ensemble for Semantic Segmentation | ✓ | 57.1 | | | | SenFormer (BEiT-L) | 2021-11-26
BEiT: BERT Pre-Training of Image Transformers | ✓ | 57.0 | | | | BEiT-L (ViT+UperNet) | 2021-06-15
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 57.0 | | | | SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 2021-12-23
Harnessing Diffusion Models for Visual Perception with Meta Prompts | ✓ | 56.8 | | | | MetaPrompt-SD | 2023-12-22
FaPN: Feature-aligned Pyramid Network for Dense Image Prediction | ✓ | 56.7 | | | | FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | 2021-08-16
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 56.5 | | 198 | | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ | 56.4 | | | | Mask2Former (Swin-L-FaPN) | 2021-12-02
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 56.2 | | | | SeMask (SeMask Swin-L MaskFormer) | 2021-12-23
Exploring Target Representations for Masked Autoencoders | ✓ | 56.2 | | | | dBOT ViT-L (CLIP) | 2022-09-08
Conditional Boundary Loss for Semantic Segmentation | ✓ | 56.1 | | | | Mask2Former+CBL (Swin-B) | 2023-07-05
Text-image Alignment for Diffusion-based Perception | ✓ | 55.9 | | | | TADP | 2023-09-29
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | ✓ | 55.70 | | | | CSWin-L (UperNet, ImageNet-22k pretrain) | 2021-07-01
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ | 55.6 | | | | UniRepLKNet-XL | 2023-11-27
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ | 55.40 | | | | Focal-L (UperNet, ImageNet-22k pretrain) | 2021-07-01
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 55.3 | | 368 | 3142 | InternImage-XL | 2022-11-10
Exploring Target Representations for Masked Autoencoders | ✓ | 55.2 | | | | dBOT ViT-L | 2022-09-08
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ | 55.1 | | | | Mask2Former (Swin-B) | 2021-12-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 55 | | | | ConvNeXt V2-H (FCMAE) | 2023-01-02
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ | 55 | | | | UniRepLKNet-L++ | 2023-11-27
Dilated Neighborhood Attention Transformer | ✓ | 54.9 | | | | DiNAT-Large (UperNet) | 2022-09-29
Conditional Boundary Loss for Semantic Segmentation | ✓ | 54.9 | | | | MaskFormer+CBL (Swin-B) | 2023-07-05
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ | 54.7 | | 109 | | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | 2023-11-28
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 54.7 | | 81 | | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04
Context Autoencoder for Self-Supervised Representation Learning | ✓ | 54.7 | | | | CAE (ViT-L, UperNet) | 2022-02-07
Visual Attention Network | ✓ | 54.7 | | | | VAN-B6 | 2022-02-20
Dilated Neighborhood Attention Transformer | ✓ | 54.6 | | | | DiNAT_s-Large (UperNet) | 2022-09-29
DDP: Diffusion Model for Dense Visual Prediction | ✓ | 54.4 | | 207 | | DDP (Swin-L, step-3) | 2023-03-30
Vision Transformers with Patch Diversification | ✓ | 54.4 | | | | PatchDiverse + Swin-L (multi-scale test, UperNet, ImageNet-22k pretrain) | 2021-04-26
VOLO: Vision Outlooker for Visual Recognition | ✓ | 54.3 | | | | VOLO-D5 | 2021-06-24
K-Net: Towards Unified Image Segmentation | ✓ | 54.3 | | | | K-Net | 2021-06-28
Generalized Parametric Contrastive Learning | ✓ | 54.3 | | | | GPaCo (Swin-L) | 2022-09-26
Efficient Self-Ensemble for Semantic Segmentation | ✓ | 54.2 | | | | SenFormer (Swin-L) | 2021-11-26
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 54.2 | | | | Swin V2-H | 2023-01-02
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 54.1 | | 256 | 2526 | InternImage-L | 2022-11-10
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ | 54.1 | | 69 | | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | 2023-11-28
A ConvNet for the 2020s | ✓ | 54 | | 391 | 3335 | ConvNeXt-XL++ | 2022-01-10
Sequential Ensembling for Semantic Segmentation | | 54 | | | 216.3 | Sequential Ensemble (SegFormer) | 2022-10-08
MogaNet: Multi-order Gated Aggregation Network | ✓ | 54 | | | | MogaNet-XL (UperNet) | 2022-11-07
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ | 53.9 | | | | UniRepLKNet-B++ | 2023-11-27
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ | 53.8 | | | | MaskFormer (Swin-B) | 2021-07-13
A ConvNet for the 2020s | ✓ | 53.7 | | 235 | 2458 | ConvNeXt-L++ | 2022-01-10
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ | 53.7 | | | | SwinV2-G-HTC++ (Liu et al., 2021a) | 2021-11-18
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 53.7 | | | | ConvNeXt V2-L | 2023-01-02
Segmenter: Transformer for Semantic Segmentation | ✓ | 53.63 | | | | Seg-L-Mask/16 (MS) | 2021-05-12
Masked Autoencoders Are Scalable Vision Learners | ✓ | 53.6 | | | | MAE (ViT-L, UperNet) | 2021-11-11
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 53.52 | | | | SeMask (SeMask Swin-L FPN) | 2021-12-23
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ | 53.50 | 62.8 | | | Swin-L (UperNet, ImageNet-22k pretrain) | 2021-03-25
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 53.5 | | | | Swin-L | 2023-01-02
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ | 53.4 | | 47.5 | | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | 2023-11-28
A ConvNet for the 2020s | ✓ | 53.1 | | 122 | 1828 | ConvNeXt-B++ | 2022-01-10
Augmenting Convolutional networks with attention-based aggregation | ✓ | 52.9 | | | | PatchConvNet-L120 (UperNet) | 2021-12-27
Exploring Target Representations for Masked Autoencoders | ✓ | 52.9 | | | | dBOT ViT-B (CLIP) | 2022-09-08
Augmenting Convolutional networks with attention-based aggregation | ✓ | 52.8 | | | | PatchConvNet-B120 (UperNet) | 2021-12-27
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 52.8 | | | | Swin-B | 2023-01-02
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ | 52.7 | | | | UniRepLKNet-S++ | 2023-11-27
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 52.1 | | | | ConvNeXt V2-B | 2023-01-02
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ | 52.0 | | | | DeBiFormer-B (IN1k pretrain, Upernet 160k) | 2024-10-11
All Tokens Matter: Token Labeling for Training Better Vision Transformers | ✓ | 51.8 | | 209 | | LV-ViT-L (UperNet, MS) | 2021-04-22
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ | 51.8 | | 84.7 | | SegFormer-B5 | 2021-05-31
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ | 51.7 | | | | BiFormer-B (IN1k pretrain, Upernet 160k) | 2023-03-15
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 51.6 | | | | ConvNeXt V2-L (Supervised) | 2023-01-02
Is Attention Better Than Matrix Decomposition? | ✓ | 51.5 | | 61.1 | 71.8 | Light-Ham (VAN-Huge) | 2021-09-09
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ | 51.5 | | | | DAT-B++ | 2023-09-04
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention | ✓ | 51.4 | | | | CrossFormer (ImageNet1k-pretrain, UPerNet, multi-scale test) | 2021-07-31
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 51.3 | | 128 | 1185 | InternImage-B | 2022-11-10
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ | 51.2 | | | | DAT-S++ | 2023-09-04
Active Token Mixer | ✓ | 51.1 | | 108 | | ActiveMLP-L (UperNet) | 2022-03-11
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ | 51.1 | | 64.1 | | SegFormer-B4 | 2021-05-31
Augmenting Convolutional networks with attention-based aggregation | ✓ | 51.1 | | | | PatchConvNet-B60 (UperNet) | 2021-12-27
Is Attention Better Than Matrix Decomposition? | ✓ | 51.0 | | 45.6 | 55.0 | Light-Ham (VAN-Large) | 2021-09-09
Towards Sustainable Self-supervised Learning | ✓ | 51.0 | | | | TEC (ViT-B, UperNet) | 2022-10-20
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ | 51 | | | | UniRepLKNet-S | 2023-11-27
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 50.98 | | 96 | | SeMask (SeMask Swin-B FPN) | 2021-12-23
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 50.9 | | 80 | 1017 | InternImage-S | 2022-11-10
MogaNet: Multi-order Gated Aggregation Network | ✓ | 50.9 | | | 1176 | MogaNet-L (UperNet) | 2022-11-07
Exploring Target Representations for Masked Autoencoders | ✓ | 50.8 | | | | dBOT ViT-B | 2022-09-08
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ | 50.8 | | | | Upernet-BiFormer-S (IN1k pretrain, Upernet 160k) | 2023-03-15
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ | 50.5 | | | | UperNet Shuffle-B | 2021-06-07
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 50.5 | | | | ConvNeXt V1-L | 2023-01-02
Dilated Neighborhood Attention Transformer | ✓ | 50.4 | | | | DiNAT-Base (UperNet) | 2022-09-29
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ | 50.3 | | | | ELSA-Swin-S | 2021-12-23
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ | 50.3 | | | | DAT-T++ | 2023-09-04
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers | ✓ | 50.28 | | | | SETR-MLA (160k, MS) | 2020-12-31
Visual Attention Network | ✓ | 50.2 | | 55 | | VAN-Large (HamNet) | 2022-02-20
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ | 50.2 | | 28.7 | 67.9 | HRViT-b3 (SegFormer, SS) | 2021-11-01
Twins: Revisiting the Design of Spatial Attention in Vision Transformers | ✓ | 50.2 | | | | Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 2021-04-28
MogaNet: Multi-order Gated Aggregation Network | ✓ | 50.1 | | | 1050 | MogaNet-B (UperNet) | 2022-11-07
Segmenter: Transformer for Semantic Segmentation | ✓ | 50.0 | | | | Seg-B-Mask/16 (MS, ViT-B) | 2021-05-12
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ | 50.0 | | | | iBOT (ViT-B/16) | 2021-11-15
A ConvNet for the 2020s | ✓ | 49.9 | | 122 | 1170 | ConvNeXt-B | 2022-01-10
Dilated Neighborhood Attention Transformer | ✓ | 49.9 | | | | DiNAT-Small (UperNet) | 2022-09-29
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ | 49.9 | | | | ConvNeXt V1-B | 2023-01-02
Neighborhood Attention Transformer | ✓ | 49.7 | | 123 | 1137 | NAT-Base | 2022-04-14
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ | 49.7 | | | | Swin-B (UperNet, ImageNet-1k pretrain) | 2021-03-25
Segmenter: Transformer for Semantic Segmentation | ✓ | 49.61 | | | | Seg-B/8 (MS, ViT-B) | 2021-05-12
A ConvNet for the 2020s | ✓ | 49.6 | | 82 | 1027 | ConvNeXt-S | 2022-01-10
Is Attention Better Than Matrix Decomposition? | ✓ | 49.6 | | 27.4 | 34.4 | Light-Ham (VAN-Base) | 2021-09-09
Neighborhood Attention Transformer | ✓ | 49.5 | | 82 | 1010 | NAT-Small | 2022-04-14
DaViT: Dual Attention Vision Transformers | ✓ | 49.4 | | | | DaViT-B | 2022-04-07
Vision Transformer with Deformable Attention | ✓ | 49.38 | | 121 | | DAT-B (UperNet) | 2022-01-03
Augmenting Convolutional networks with attention-based aggregation | ✓ | 49.3 | | | | PatchConvNet-S60 (UperNet) | 2021-12-27
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders | ✓ | 49.3 | | | | ColorMAE-Green-ViTB-1600 | 2024-07-17
MogaNet: Multi-order Gated Aggregation Network | ✓ | 49.2 | | | 946 | MogaNet-S (UperNet) | 2022-11-07
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ | 49.2 | | | | Shift-B (UperNet) | 2022-01-26
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ | 49.1 | | | | UniRepLKNet-T | 2023-11-27
Vision Transformers for Dense Prediction | ✓ | 49.02 | | | | DPT-Hybrid | 2021-03-24
Global Context Vision Transformers | ✓ | 49 | | 125 | 1348 | GC ViT-B | 2022-06-20
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ | 49 | | | | A2MIM (ViT-B) | 2022-05-27
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ | 49 | | | | EfficientViT-B3 (r512) | 2022-05-29
Dilated Neighborhood Attention Transformer | ✓ | 48.8 | | | | DiNAT-Tiny (UperNet) | 2022-09-29
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ | 48.76 | | 20.8 | 28.0 | HRViT-b2 (SegFormer, SS) | 2021-11-01
Neighborhood Attention Transformer | ✓ | 48.4 | | 58 | 934 | NAT-Tiny | 2022-04-14
XCiT: Cross-Covariance Image Transformers | ✓ | 48.4 | | | | XCiT-M24/8 (UperNet) | 2021-06-17
ResNeSt: Split-Attention Networks | ✓ | 48.36 | | | | ResNeSt-200 | 2020-04-19
Vision Transformer with Deformable Attention | ✓ | 48.31 | | 81 | | DAT-S (UperNet) | 2022-01-03
Global Context Vision Transformers | ✓ | 48.3 | | 84 | 1163 | GC ViT-S | 2022-06-20
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 48.1 | | 59 | 944 | InternImage-T | 2022-11-10
Visual Attention Network | ✓ | 48.1 | | 49 | | VAN-Large | 2022-02-20
XCiT: Cross-Covariance Image Transformers | ✓ | 48.1 | | | | XCiT-S24/8 (UperNet) | 2021-06-17
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ | 48.1 | | | | MaskFormer (ResNet-101) | 2021-07-13
Masked Autoencoders Are Scalable Vision Learners | ✓ | 48.1 | | | | MAE (ViT-B, UperNet) | 2021-11-11
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ | 47.98 | | | | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 2019-09-24
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ | 47.9 | | | | Shift-B | 2022-01-26
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ | 47.8 | | | | Shift-S | 2022-01-26
MogaNet: Multi-order Gated Aggregation Network | ✓ | 47.7 | | | 189 | MogaNet-S (Semantic FPN) | 2022-11-07
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 47.63 | | 56 | | SeMask (SeMask Swin-S FPN) | 2021-12-23
ResNeSt: Split-Attention Networks | ✓ | 47.60 | | | | ResNeSt-269 | 2020-04-19
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ | 47.6 | | | | UperNet Shuffle-T | 2021-06-07
CondNet: Conditional Classifier for Scene Segmentation | ✓ | 47.54 | | | | CondNet (ResNeSt-101) | 2021-09-21
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 47.5 | | 24 | | tiny-MOAT-3 (IN-1K pretraining, single scale) | 2022-10-04
CondNet: Conditional Classifier for Scene Segmentation | ✓ | 47.38 | | | | CondNet (ResNet-101) | 2021-09-21
Dilated Neighborhood Attention Transformer | ✓ | 47.2 | | | | DiNAT-Mini (UperNet) | 2022-09-29
DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation | | 47.12 | | | | DCNAS | 2020-03-26
XCiT: Cross-Covariance Image Transformers | ✓ | 47.1 | | | | XCiT-S24/8 (Semantic-FPN) | 2021-06-17
ResNeSt: Split-Attention Networks | ✓ | 46.91 | | | | ResNeSt-101 | 2020-04-19
XCiT: Cross-Covariance Image Transformers | ✓ | 46.9 | | | | XCiT-M24/8 (Semantic-FPN) | 2021-06-17
Is Attention Better Than Matrix Decomposition? | ✓ | 46.8 | | | | HamNet (ResNet-101) | 2021-09-09
Sequential Ensembling for Semantic Segmentation | | 46.8 | | | | Sequential Ensemble (DeepLabv3+) | 2022-10-08
A ConvNet for the 2020s | ✓ | 46.7 | | 60 | 939 | ConvNeXt-T | 2022-01-10
Visual Attention Network | ✓ | 46.7 | | | | VAN-Base (Semantic-FPN) | 2022-02-20
XCiT: Cross-Covariance Image Transformers | ✓ | 46.6 | | | | XCiT-S12/8 (UperNet) | 2021-06-17
Global Context Vision Transformers | ✓ | 46.5 | | 58 | 947 | GC ViT-T | 2022-06-20
Neighborhood Attention Transformer | ✓ | 46.4 | | 50 | 900 | NAT-Mini | 2022-04-14
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ | 46.3 | | | | Shift-T | 2022-01-26
DaViT: Dual Attention Vision Transformers | ✓ | 46.3 | | | | DaViT-T | 2022-04-07
Context Prior for Scene Segmentation | ✓ | 46.27 | | | | CPN (ResNet-101) | 2020-04-03
MultiMAE: Multi-modal Multi-task Masked Autoencoders | ✓ | 46.2 | | | | MultiMAE (ViT-B) | 2022-04-04
Scene Segmentation with Dual Relation-aware Attention Network | ✓ | 46.18 | | | | DRAN (ResNet-101) | 2020-08-05
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition | ✓ | 45.99 | 56.52 | | | PyConvSegNet-152 | 2020-06-20
Disentangled Non-Local Neural Networks | ✓ | 45.97 | | | | DNL | 2020-06-11
Adaptive Context Network for Scene Parsing | | 45.90 | | | | ACNet (ResNet-101) | 2019-11-05
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ | 45.88 | | 8.2 | 14.6 | HRViT-b1 (SegFormer, SS) | 2021-11-01
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ | 45.66 | | | | OCR (HRNetV2-W48) | 2019-09-24
Strip Pooling: Rethinking Spatial Pooling for Scene Parsing | ✓ | 45.6 | | | | SPNet (ResNet-101) | 2020-03-30
Self-Supervised Learning with Swin Transformers | ✓ | 45.58 | | | | Swin-T (UPerNet) MoBY | 2021-05-10
Vision Transformer with Deformable Attention | ✓ | 45.54 | | 60 | | DAT-T (UperNet) | 2022-01-03
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ | 45.4 | | | | iBOT (ViT-S/16) | 2021-11-15
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks | ✓ | 45.33 | | | | EANet (ResNet-101) | 2021-05-05
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ | 45.28 | | | | OCR (ResNet-101) | 2019-09-24
Asymmetric Non-local Neural Networks for Semantic Segmentation | ✓ | 45.24 | | | | Asymmetric ALNN | 2019-08-21
Is Attention Better Than Matrix Decomposition? | ✓ | 45.2 | | 13.8 | 15.8 | Light-Ham (VAN-Small, D=256) | 2021-09-09
Location-aware Upsampling for Semantic Segmentation | ✓ | 45.02 | 56.32 | | | LaU-regression-loss | 2019-11-13
Pyramid Scene Parsing Network | ✓ | 44.94 | 55.38 | | | PSPNet | 2016-12-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 44.9 | | 13 | | tiny-MOAT-2 (IN-1K pretraining, single scale) | 2022-10-04
Co-Occurrent Features in Semantic Segmentation | ✓ | 44.89 | | | | CFNet (ResNet-101) | 2019-06-01
Context Encoding for Semantic Segmentation | ✓ | 44.65 | 55.67 | | | EncNet | 2018-03-23
Location-aware Upsampling for Semantic Segmentation | ✓ | 44.55 | 56.41 | | | LaU-offset-loss | 2019-11-13
FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation | ✓ | 44.34 | 55.84 | | | EncNet + JPU | 2019-03-28
Symbolic Graph Reasoning Meets Convolutions | ✓ | 44.32 | | | | SGR (ResNet-101) | 2018-12-01
XCiT: Cross-Covariance Image Transformers | ✓ | 44.2 | | | | XCiT-S12/8 (Semantic-FPN) | 2021-06-17
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation | ✓ | 43.98 | | | | Auto-DeepLab-L | 2019-01-10
PSANet: Point-wise Spatial Attention Network for Scene Parsing | ✓ | 43.77 | | | | PSANet (ResNet-101) | 2018-09-01
Dynamic-structured Semantic Propagation Network | | 43.68 | | | | DSSPN (ResNet-101) | 2018-03-16
Pyramid Scene Parsing Network | ✓ | 43.51 | | | | PSPNet (ResNet-152) | 2016-12-04
Pyramid Scene Parsing Network | ✓ | 43.29 | | | | PSPNet (ResNet-101) | 2016-12-04
High-Resolution Representations for Labeling Pixels and Regions | ✓ | 43.2 | | | | HRNetV2 | 2019-04-09
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ | 43.16 | | 35 | | SeMask (SeMask Swin-T FPN) | 2021-12-23
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 43.1 | | 8 | | tiny-MOAT-1 (IN-1K pretraining, single scale) | 2022-10-04
Visual Attention Network | ✓ | 42.9 | | 18 | | VAN-Small | 2022-02-20
MetaFormer Is Actually What You Need for Vision | ✓ | 42.7 | | | | PoolFormer-M48 | 2021-11-22
Unified Perceptual Parsing for Scene Understanding | ✓ | 42.66 | | | | UperNet (ResNet-101) | 2018-07-26
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 41.2 | | 6 | | tiny-MOAT-0 (IN-1K pretraining, single scale) | 2022-10-04
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | ✓ | 40.7 | | | | RefineNet | 2016-11-20
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 40.4 | | | | FBNetV5 | 2021-11-19
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ | 40 | | | | ConvMLP-L | 2021-09-09
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ | 38.6 | | | | ConvMLP-M | 2021-09-09
Visual Attention Network | ✓ | 38.5 | | 8 | | VAN-Tiny | 2022-02-20
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ | 38.3 | | | | A2MIM (ResNet-50) | 2022-05-27
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ | 38.3 | | | | iBOT (ViT-B/16) (linear head) | 2021-11-15
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ | 37.4 | | 3.8 | | SegFormer-B0 | 2021-05-31
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ | 35.8 | | | | MUXNet-m + PPM | 2020-03-31
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ | 35.8 | | | | ConvMLP-S | 2021-09-09
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ | 32.42 | | | | MUXNet-m + C1 | 2020-03-31
Multi-Scale Context Aggregation by Dilated Convolutions | ✓ | 32.31 | | | | DilatedNet | 2015-11-23
Fully Convolutional Networks for Semantic Segmentation | ✓ | 29.39 | | | | FCN | 2014-11-14
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation | ✓ | 21.64 | | | | SegNet | 2015-11-02
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | | | 1310 | | InternImage-H (M3I Pre-training) | 2022-11-10
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ | 44.6 | | | | FastViT-MA36 | 2023-03-24
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ | 42.9 | | | | FastViT-SA36 | 2023-03-24
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ | 41 | | | | FastViT-SA24 | 2023-03-24
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ | 38 | | | | FastViT-SA12 | 2023-03-24
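The ranking metric above, mean IoU over ADE20K's 150 semantic classes, is straightforward to reproduce. The sketch below is illustrative only and is not taken from any of the listed papers; the helper names update_confusion and mean_iou are hypothetical, and details such as label remapping, the ignore index, and single- vs. multi-scale inference follow each paper's own evaluation protocol.

import numpy as np

NUM_CLASSES = 150     # ADE20K semantic classes
IGNORE_INDEX = 255    # pixels excluded from scoring (convention varies by codebase)

def update_confusion(conf, pred, gt):
    """Accumulate a NUM_CLASSES x NUM_CLASSES confusion matrix (rows: GT, cols: prediction)."""
    mask = gt != IGNORE_INDEX
    idx = NUM_CLASSES * gt[mask].astype(np.int64) + pred[mask].astype(np.int64)
    conf += np.bincount(idx, minlength=NUM_CLASSES ** 2).reshape(NUM_CLASSES, NUM_CLASSES)
    return conf

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN); classes absent from both GT and prediction are skipped."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))

# Usage: loop over (prediction, ground-truth) label maps for the 2000 validation images,
# accumulating one confusion matrix over the whole set before averaging.
conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
pred = np.random.randint(0, NUM_CLASSES, size=(512, 512))  # placeholder prediction
gt = np.random.randint(0, NUM_CLASSES, size=(512, 512))    # placeholder ground truth
conf = update_confusion(conf, pred, gt)
print(f"mIoU: {100 * mean_iou(conf):.2f}")

Accumulating a single confusion matrix over the full validation set (rather than averaging per-image IoUs) matches the dataset-level mIoU that the leaderboard reports.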