The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 63.6 | | 1610 | | | | ViT-P (InternImage-H) | 2025-05-26 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 63.0 | | 1500 | | | | ONE-PEACE | 2023-05-18 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 62.9 | | 1310 | | 4635 | | InternImage-H | 2022-11-10 |
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | ✓ Link | 62.9 | | 1310 | | | | M3I Pre-training (InternImage-H) | 2022-11-17 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 62.8 | | 1900 | | | | BEiT-3 | 2022-08-22 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 62.3 | | 1074 | | | | EVA | 2022-11-14 |
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 61.6 | | 1400 | | | | ViT-P (OneFormer, InternImage-H) | 2025-05-26 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 61.5 | | 571 | | | | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | 2022-05-17 |
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | ✓ Link | 61.4 | | 3000 | | | | FD-SwinV2-G | 2022-05-27 |
Reversible Column Networks | ✓ Link | 61.0 | | 2439 | | | | RevCol-H (Mask2Former) | 2022-12-22 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 60.8 | | 223 | | | | Mask DINO (SwinL, multi-scale) | 2022-06-06 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.5 | | 571 | | | | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 2022-05-17 |
DINOv2: Learning Robust Visual Features without Supervision | ✓ Link | 60.2 | | 1080 | | | | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | 2023-04-14 |
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 59.9 | | 309 | | | | ViT-P (OneFormer, DiNAT-L) | 2025-05-26 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 59.9 | | | | | | SwinV2-G(UperNet) | 2021-11-18 |
Parameter-Inverted Image Pyramid Networks | ✓ Link | 59.9 | | | | | | PIIP-LH6B(UperNet) | 2024-06-06 |
SERNet-Former: Semantic Segmentation by Efficient Residual Network with Attention-Boosting Gates and Attention-Fusion Networks | ✓ Link | 59.35 | | | | | | SERNet-Former | 2024-01-28 |
Focal Modulation Networks | ✓ Link | 58.5 | | | | | | FocalNet-L (Mask2Former) | 2022-03-22 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 58.4 | | 451 | | | | ViT-Adapter-L (UperNet, BEiT pretrain) | 2022-05-17 |
Representation Separation for Semantic Segmentation with Vision Transformers | | 58.4 | | 330 | | | | RSSeg-ViT-L (BEiT pretrain) | 2022-12-28 |
Your ViT is Secretly an Image Segmentation Model | ✓ Link | 58.4 | | 316 | 721 | 721 | 58.4 | EoMT (DINOv2-L, single-scale, 512x512) | 2025-03-24 |
SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers | ✓ Link | 58.2 | | | 637.9 | | | SegViT-v2 (BEiT-v2-Large) | 2023-06-09 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 58.2 | | | | | | SeMask (SeMask Swin-L FaPN-Mask2Former) | 2021-12-23 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 58.2 | | | | | | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 2021-12-23 |
Dilated Neighborhood Attention Transformer | ✓ Link | 58.1 | | | | | | DiNAT-L (Mask2Former) | 2022-09-29 |
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions | ✓ Link | 57.9 | | | | | | HorNet-L (Mask2Former) | 2022-07-28 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 57.7 | | | | | | Mask2Former (SwinL-FaPN) | 2021-12-02 |
Dynamic Focus-aware Positional Queries for Semantic Segmentation | ✓ Link | 57.7 | | | | | | FASeg (SwinL) | 2022-04-04 |
Region Rebalance for Long-Tailed Semantic Segmentation | ✓ Link | 57.7 | | | | | | RR (BEiT-L) | 2022-04-05 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 57.6 | | 496 | | | | MOAT-4 (IN-22K pretraining, single-scale) | 2022-10-04 |
Could Giant Pretrained Image Models Extract Universal Representations? | | 57.6 | | | | | | Frozen Backbone, SwinV2-G-ext22K (Mask2Former) | 2022-11-03 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 57.5 | | | | | | SeMask (SeMask Swin-L Mask2Former) | 2021-12-23 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 57.3 | | | | | | Mask2Former (SwinL) | 2021-12-02 |
Efficient Self-Ensemble for Semantic Segmentation | ✓ Link | 57.1 | | | | | | SenFormer (BEiT-L) | 2021-11-26 |
BEiT: BERT Pre-Training of Image Transformers | ✓ Link | 57.0 | | | | | | BEiT-L (ViT+UperNet) | 2021-06-15 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 57.0 | | | | | | SeMask(SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 2021-12-23 |
Harnessing Diffusion Models for Visual Perception with Meta Prompts | ✓ Link | 56.8 | | | | | | MetaPrompt-SD | 2023-12-22 |
FaPN: Feature-aligned Pyramid Network for Dense Image Prediction | ✓ Link | 56.7 | | | | | | FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | 2021-08-16 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 56.5 | | 198 | | | | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 56.4 | | | | | | Mask2Former (Swin-L-FaPN) | 2021-12-02 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 56.2 | | | | | | SeMask (SeMask Swin-L MaskFormer) | 2021-12-23 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 56.2 | | | | | | dBOT ViT-L (CLIP) | 2022-09-08 |
Conditional Boundary Loss for Semantic Segmentation | ✓ Link | 56.1 | | | | | | Mask2Former+CBL(Swin-B) | 2023-07-05 |
Text-image Alignment for Diffusion-based Perception | ✓ Link | 55.9 | | | | | | TADP | 2023-09-29 |
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | ✓ Link | 55.70 | | | | | | CSWin-L (UperNet, ImageNet-22k pretrain) | 2021-07-01 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 55.6 | | | | | | UniRepLKNet-XL | 2023-11-27 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 55.40 | | | | | | Focal-L (UperNet, ImageNet-22k pretrain) | 2021-07-01 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 55.3 | | 368 | | 3142 | | InternImage-XL | 2022-11-10 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 55.2 | | | | | | dBOT ViT-L | 2022-09-08 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 55.1 | | | | | | Mask2Former(Swin-B) | 2021-12-02 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 55 | | | | | | ConvNeXt V2-H (FCMAE) | 2023-01-02 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 55 | | | | | | UniRepLKNet-L++ | 2023-11-27 |
Dilated Neighborhood Attention Transformer | ✓ Link | 54.9 | | | | | | DiNAT-Large (UperNet) | 2022-09-29 |
Conditional Boundary Loss for Semantic Segmentation | ✓ Link | 54.9 | | | | | | MaskFormer+CBL(Swin-B) | 2023-07-05 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 54.7 | | 109 | | | | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | 2023-11-28 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 54.7 | | 81 | | | | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04 |
Context Autoencoder for Self-Supervised Representation Learning | ✓ Link | 54.7 | | | | | | CAE (ViT-L, UperNet) | 2022-02-07 |
Visual Attention Network | ✓ Link | 54.7 | | | | | | VAN-B6 | 2022-02-20 |
Dilated Neighborhood Attention Transformer | ✓ Link | 54.6 | | | | | | DiNAT_s-Large (UperNet) | 2022-09-29 |
DDP: Diffusion Model for Dense Visual Prediction | ✓ Link | 54.4 | | 207 | | | | DDP (Swin-L, step-3) | 2023-03-30 |
Vision Transformers with Patch Diversification | ✓ Link | 54.4 | | | | | | PatchDiverse + Swin-L (multi-scale test, upernet, ImageNet22k pretrain) | 2021-04-26 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 54.3 | | | | | | VOLO-D5 | 2021-06-24 |
K-Net: Towards Unified Image Segmentation | ✓ Link | 54.3 | | | | | | K-Net | 2021-06-28 |
Generalized Parametric Contrastive Learning | ✓ Link | 54.3 | | | | | | GPaCo (Swin-L) | 2022-09-26 |
Efficient Self-Ensemble for Semantic Segmentation | ✓ Link | 54.2 | | | | | | SenFormer (Swin-L) | 2021-11-26 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 54.2 | | | | | | Swin V2-H | 2023-01-02 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 54.1 | | 256 | | 2526 | | InternImage-L | 2022-11-10 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 54.1 | | 69 | | | | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | 2023-11-28 |
A ConvNet for the 2020s | ✓ Link | 54 | | 391 | 3335 | | | ConvNeXt-XL++ | 2022-01-10 |
Sequential Ensembling for Semantic Segmentation | | 54 | | 216.3 | | | | Sequential Ensemble (SegFormer) | 2022-10-08 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 54 | | | | | | MogaNet-XL (UperNet) | 2022-11-07 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 53.9 | | | | | | UniRepLKNet-B++ | 2023-11-27 |
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ Link | 53.8 | | | | | | MaskFormer(Swin-B) | 2021-07-13 |
A ConvNet for the 2020s | ✓ Link | 53.7 | | 235 | 2458 | | | ConvNeXt-L++ | 2022-01-10 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 53.7 | | | | | | SwinV2-G-HTC++ | 2021-11-18 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 53.7 | | | | | | ConvNeXt V2-L | 2023-01-02 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 53.63 | | | | | | Seg-L-Mask/16 (MS) | 2021-05-12 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 53.6 | | | | | | MAE (ViT-L, UperNet) | 2021-11-11 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 53.52 | | | | | | SeMask (SeMask Swin-L FPN) | 2021-12-23 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 53.50 | 62.8 | | | | | Swin-L (UperNet, ImageNet-22k pretrain) | 2021-03-25 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 53.5 | | | | | | Swin-L | 2023-01-02 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 53.4 | | 47.5 | | | | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | 2023-11-28 |
A ConvNet for the 2020s | ✓ Link | 53.1 | | 122 | 1828 | | | ConvNeXt-B++ | 2022-01-10 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 52.9 | | | | | | PatchConvNet-L120 (UperNet) | 2021-12-27 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 52.9 | | | | | | dBOT ViT-B (CLIP) | 2022-09-08 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 52.8 | | | | | | PatchConvNet-B120 (UperNet) | 2021-12-27 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 52.8 | | | | | | Swin-B | 2023-01-02 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 52.7 | | | | | | UniRepLKNet-S++ | 2023-11-27 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 52.1 | | | | | | ConvNeXt V2-B | 2023-01-02 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | 52.0 | | | | | | DeBiFormer-B (IN1k pretrain, Upernet 160k) | 2024-10-11 |
All Tokens Matter: Token Labeling for Training Better Vision Transformers | ✓ Link | 51.8 | | 209 | | | | LV-ViT-L (UperNet, MS) | 2021-04-22 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 51.8 | | 84.7 | | | | SegFormer-B5 | 2021-05-31 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | 51.7 | | | | | | BiFormer-B (IN1k pretrain, Upernet 160k) | 2023-03-15 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 51.6 | | | | | | ConvNeXt V2-L (Supervised) | 2023-01-02 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 51.5 | | 61.1 | 71.8 | | | Light-Ham (VAN-Huge) | 2021-09-09 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 51.5 | | | | | | DAT-B++ | 2023-09-04 |
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention | ✓ Link | 51.4 | | | | | | CrossFormer (ImageNet1k-pretrain, UPerNet, multi-scale test) | 2021-07-31 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 51.3 | | 128 | | 1185 | | InternImage-B | 2022-11-10 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 51.2 | | | | | | DAT-S++ | 2023-09-04 |
Active Token Mixer | ✓ Link | 51.1 | | 108 | | | | ActiveMLP-L(UperNet) | 2022-03-11 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 51.1 | | 64.1 | | | | SegFormer-B4 | 2021-05-31 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 51.1 | | | | | | PatchConvNet-B60 (UperNet) | 2021-12-27 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 51.0 | | 45.6 | 55.0 | | | Light-Ham (VAN-Large) | 2021-09-09 |
Towards Sustainable Self-supervised Learning | ✓ Link | 51.0 | | | | | | TEC (ViT-B, UperNet) | 2022-10-20 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 51 | | | | | | UniRepLKNet-S | 2023-11-27 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 50.98 | | 96 | | | | SeMask (SeMask Swin-B FPN) | 2021-12-23 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 50.9 | | 80 | | 1017 | | InternImage-S | 2022-11-10 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 50.9 | | | 1176 | | | MogaNet-L (UperNet) | 2022-11-07 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 50.8 | | | | | | dBOT ViT-B | 2022-09-08 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | 50.8 | | | | | | Upernet-BiFormer-S (IN1k pretrain, Upernet 160k) | 2023-03-15 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 50.5 | | | | | | UperNet Shuffle-B | 2021-06-07 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 50.5 | | | | | | ConvNeXt V1-L | 2023-01-02 |
Dilated Neighborhood Attention Transformer | ✓ Link | 50.4 | | | | | | DiNAT-Base (UperNet) | 2022-09-29 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 50.3 | | | | | | ELSA-Swin-S | 2021-12-23 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 50.3 | | | | | | DAT-T++ | 2023-09-04 |
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers | ✓ Link | 50.28 | | | | | | SETR-MLA (160k, MS) | 2020-12-31 |
Visual Attention Network | ✓ Link | 50.2 | | 55 | | | | VAN-Large (HamNet) | 2022-02-20 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 50.2 | | 28.7 | 67.9 | | | HRViT-b3 (SegFormer, SS) | 2021-11-01 |
Twins: Revisiting the Design of Spatial Attention in Vision Transformers | ✓ Link | 50.2 | | | | | | Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 2021-04-28 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 50.1 | | | 1050 | | | MogaNet-B (UperNet) | 2022-11-07 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 50.0 | | | | | | Seg-B-Mask/16(MS, ViT-B) | 2021-05-12 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 50.0 | | | | | | iBOT (ViT-B/16) | 2021-11-15 |
A ConvNet for the 2020s | ✓ Link | 49.9 | | 122 | 1170 | | | ConvNeXt-B | 2022-01-10 |
Dilated Neighborhood Attention Transformer | ✓ Link | 49.9 | | | | | | DiNAT-Small (UperNet) | 2022-09-29 |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | ✓ Link | 49.9 | | | | | | ConvNeXt V1-B | 2023-01-02 |
Neighborhood Attention Transformer | ✓ Link | 49.7 | | 123 | 1137 | | | NAT-Base | 2022-04-14 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 49.7 | | | | | | Swin-B (UperNet, ImageNet-1k pretrain) | 2021-03-25 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 49.61 | | | | | | Seg-B/8 (MS, ViT-B) | 2021-05-12 |
A ConvNet for the 2020s | ✓ Link | 49.6 | | 82 | 1027 | | | ConvNeXt-S | 2022-01-10 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 49.6 | | 27.4 | 34.4 | | | Light-Ham (VAN-Base) | 2021-09-09 |
Neighborhood Attention Transformer | ✓ Link | 49.5 | | 82 | 1010 | | | NAT-Small | 2022-04-14 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 49.4 | | | | | | DaViT-B | 2022-04-07 |
Vision Transformer with Deformable Attention | ✓ Link | 49.38 | | 121 | | | | DAT-B (UperNet) | 2022-01-03 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 49.3 | | | | | | PatchConvNet-S60 (UperNet) | 2021-12-27 |
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders | ✓ Link | 49.3 | | | | | | ColorMAE-Green-ViTB-1600 | 2024-07-17 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 49.2 | | | 946 | | | MogaNet-S (UperNet) | 2022-11-07 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 49.2 | | | | | | Shift-B (UperNet) | 2022-01-26 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | 49.1 | | | | | | UniRepLKNet-T | 2023-11-27 |
Vision Transformers for Dense Prediction | ✓ Link | 49.02 | | | | | | DPT-Hybrid | 2021-03-24 |
Global Context Vision Transformers | ✓ Link | 49 | | 125 | 1348 | | | GC ViT-B | 2022-06-20 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 49 | | | | | | A2MIM (ViT-B) | 2022-05-27 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 49 | | | | | | EfficientViT-B3 (r512) | 2022-05-29 |
Dilated Neighborhood Attention Transformer | ✓ Link | 48.8 | | | | | | DiNAT-Tiny (UperNet) | 2022-09-29 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 48.76 | | 20.8 | 28.0 | | | HRViT-b2 (SegFormer, SS) | 2021-11-01 |
Neighborhood Attention Transformer | ✓ Link | 48.4 | | 58 | 934 | | | NAT-Tiny | 2022-04-14 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.4 | | | | | | XCiT-M24/8 (UperNet) | 2021-06-17 |
ResNeSt: Split-Attention Networks | ✓ Link | 48.36 | | | | | | ResNeSt-200 | 2020-04-19 |
Vision Transformer with Deformable Attention | ✓ Link | 48.31 | | 81 | | | | DAT-S (UperNet) | 2022-01-03 |
Global Context Vision Transformers | ✓ Link | 48.3 | | 84 | 1163 | | | GC ViT-S | 2022-06-20 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 48.1 | | 59 | | 944 | | InternImage-T | 2022-11-10 |
Visual Attention Network | ✓ Link | 48.1 | | 49 | | | | VAN-Large | 2022-02-20 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.1 | | | | | | XCiT-S24/8 (UperNet) | 2021-06-17 |
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ Link | 48.1 | | | | | | MaskFormer(ResNet-101) | 2021-07-13 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 48.1 | | | | | | MAE (ViT-B, UperNet) | 2021-11-11 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 47.98 | | | | | | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 2019-09-24 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 47.9 | | | | | | Shift-B | 2022-01-26 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 47.8 | | | | | | Shift-S | 2022-01-26 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 47.7 | | | 189 | | | MogaNet-S (Semantic FPN) | 2022-11-07 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 47.63 | | 56 | | | | SeMask (SeMask Swin-S FPN) | 2021-12-23 |
ResNeSt: Split-Attention Networks | ✓ Link | 47.60 | | | | | | ResNeSt-269 | 2020-04-19 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 47.6 | | | | | | UperNet Shuffle-T | 2021-06-07 |
CondNet: Conditional Classifier for Scene Segmentation | ✓ Link | 47.54 | | | | | | CondNet(ResNest-101) | 2021-09-21 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 47.5 | | 24 | | | | tiny-MOAT-3 (IN-1K pretraining, single scale) | 2022-10-04 |
CondNet: Conditional Classifier for Scene Segmentation | ✓ Link | 47.38 | | | | | | CondNet(ResNet-101) | 2021-09-21 |
Dilated Neighborhood Attention Transformer | ✓ Link | 47.2 | | | | | | DiNAT-Mini (UperNet) | 2022-09-29 |
DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation | | 47.12 | | | | | | DCNAS | 2020-03-26 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 47.1 | | | | | | XCiT-S24/8 (Semantic-FPN) | 2021-06-17 |
ResNeSt: Split-Attention Networks | ✓ Link | 46.91 | | | | | | ResNeSt-101 | 2020-04-19 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 46.9 | | | | | | XCiT-M24/8 (Semantic-FPN) | 2021-06-17 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 46.8 | | | | | | HamNet (ResNet-101) | 2021-09-09 |
Sequential Ensembling for Semantic Segmentation | | 46.8 | | | | | | Sequential Ensemble (DeepLabv3+) | 2022-10-08 |
A ConvNet for the 2020s | ✓ Link | 46.7 | | 60 | 939 | | | ConvNeXt-T | 2022-01-10 |
Visual Attention Network | ✓ Link | 46.7 | | | | | | VAN-Base (Semantic-FPN) | 2022-02-20 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 46.6 | | | | | | XCiT-S12/8 (UperNet) | 2021-06-17 |
Global Context Vision Transformers | ✓ Link | 46.5 | | 58 | 947 | | | GC ViT-T | 2022-06-20 |
Neighborhood Attention Transformer | ✓ Link | 46.4 | | 50 | 900 | | | NAT-Mini | 2022-04-14 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | 46.3 | | | | | | Shift-T | 2022-01-26 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 46.3 | | | | | | DaViT-T | 2022-04-07 |
Context Prior for Scene Segmentation | ✓ Link | 46.27 | | | | | | CPN(ResNet-101) | 2020-04-03 |
MultiMAE: Multi-modal Multi-task Masked Autoencoders | ✓ Link | 46.2 | | | | | | MultiMAE (ViT-B) | 2022-04-04 |
Scene Segmentation with Dual Relation-aware Attention Network | ✓ Link | 46.18 | | | | | | DRAN(ResNet-101) | 2020-08-05 |
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition | ✓ Link | 45.99 | 56.52 | | | | | PyConvSegNet-152 | 2020-06-20 |
Disentangled Non-Local Neural Networks | ✓ Link | 45.97 | | | | | | DNL | 2020-06-11 |
Adaptive Context Network for Scene Parsing | | 45.90 | | | | | | ACNet (ResNet-101) | 2019-11-05 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 45.88 | | 8.2 | 14.6 | | | HRViT-b1 (SegFormer, SS) | 2021-11-01 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 45.66 | | | | | | OCR(HRNetV2-W48) | 2019-09-24 |
Strip Pooling: Rethinking Spatial Pooling for Scene Parsing | ✓ Link | 45.6 | | | | | | SPNet (ResNet-101) | 2020-03-30 |
Self-Supervised Learning with Swin Transformers | ✓ Link | 45.58 | | | | | | Swin-T (UPerNet) MoBY | 2021-05-10 |
Vision Transformer with Deformable Attention | ✓ Link | 45.54 | | 60 | | | | DAT-T (UperNet) | 2022-01-03 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 45.4 | | | | | | iBOT (ViT-S/16) | 2021-11-15 |
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks | ✓ Link | 45.33 | | | | | | EANet (ResNet-101) | 2021-05-05 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 45.28 | | | | | | OCR (ResNet-101) | 2019-09-24 |
Asymmetric Non-local Neural Networks for Semantic Segmentation | ✓ Link | 45.24 | | | | | | Asymmetric ALNN | 2019-08-21 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 45.2 | | 13.8 | 15.8 | | | Light-Ham (VAN-Small, D=256) | 2021-09-09 |
Location-aware Upsampling for Semantic Segmentation | ✓ Link | 45.02 | 56.32 | | | | | LaU-regression-loss | 2019-11-13 |
Pyramid Scene Parsing Network | ✓ Link | 44.94 | 55.38 | | | | | PSPNet | 2016-12-04 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 44.9 | | 13 | | | | tiny-MOAT-2 (IN-1K pretraining, single scale) | 2022-10-04 |
Co-Occurrent Features in Semantic Segmentation | ✓ Link | 44.89 | | | | | | CFNet(ResNet-101) | 2019-06-01 |
Context Encoding for Semantic Segmentation | ✓ Link | 44.65 | 55.67 | | | | | EncNet | 2018-03-23 |
Location-aware Upsampling for Semantic Segmentation | ✓ Link | 44.55 | 56.41 | | | | | LaU-offset-loss | 2019-11-13 |
FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation | ✓ Link | 44.34 | 55.84 | | | | | EncNet + JPU | 2019-03-28 |
Symbolic Graph Reasoning Meets Convolutions | ✓ Link | 44.32 | | | | | | SGR (ResNet-101) | 2018-12-01 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 44.2 | | | | | | XCiT-S12/8 (Semantic-FPN) | 2021-06-17 |
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation | ✓ Link | 43.98 | | | | | | Auto-DeepLab-L | 2019-01-10 |
PSANet: Point-wise Spatial Attention Network for Scene Parsing | ✓ Link | 43.77 | | | | | | PSANet (ResNet-101) | 2018-09-01 |
Dynamic-structured Semantic Propagation Network | | 43.68 | | | | | | DSSPN (ResNet-101) | 2018-03-16 |
Pyramid Scene Parsing Network | ✓ Link | 43.51 | | | | | | PSPNet (ResNet-152) | 2016-12-04 |
Pyramid Scene Parsing Network | ✓ Link | 43.29 | | | | | | PSPNet (ResNet-101) | 2016-12-04 |
High-Resolution Representations for Labeling Pixels and Regions | ✓ Link | 43.2 | | | | | | HRNetV2 | 2019-04-09 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 43.16 | | 35 | | | | SeMask (SeMask Swin-T FPN) | 2021-12-23 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 43.1 | | 8 | | | | tiny-MOAT-1 (IN-1K pretraining, single scale) | 2022-10-04 |
Visual Attention Network | ✓ Link | 42.9 | | 18 | | | | VAN-Small | 2022-02-20 |
MetaFormer Is Actually What You Need for Vision | ✓ Link | 42.7 | | | | | | PoolFormer-M48 | 2021-11-22 |
Unified Perceptual Parsing for Scene Understanding | ✓ Link | 42.66 | | | | | | UperNet (ResNet-101) | 2018-07-26 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 41.2 | | 6 | | | | tiny-MOAT-0 (IN-1K pretraining, single scale) | 2022-10-04 |
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | ✓ Link | 40.7 | | | | | | RefineNet | 2016-11-20 |
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run | | 40.4 | | | | | | FBNetV5 | 2021-11-19 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 40 | | | | | | ConvMLP-L | 2021-09-09 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 38.6 | | | | | | ConvMLP-M | 2021-09-09 |
Visual Attention Network | ✓ Link | 38.5 | | 8 | | | | VAN-Tiny | 2022-02-20 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 38.3 | | | | | | A2MIM (ResNet-50) | 2022-05-27 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 38.3 | | | | | | iBOT (ViT-B/16) (linear head) | 2021-11-15 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 37.4 | | 3.8 | | | | SegFormer-B0 | 2021-05-31 |
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ Link | 35.8 | | | | | | MUXNet-m + PPM | 2020-03-31 |
ConvMLP: Hierarchical Convolutional MLPs for Vision | ✓ Link | 35.8 | | | | | | ConvMLP-S | 2021-09-09 |
MUXConv: Information Multiplexing in Convolutional Neural Networks | ✓ Link | 32.42 | | | | | | MUXNet-m + C1 | 2020-03-31 |
Multi-Scale Context Aggregation by Dilated Convolutions | ✓ Link | 32.31 | | | | | | DilatedNet | 2015-11-23 |
Fully Convolutional Networks for Semantic Segmentation | ✓ Link | 29.39 | | | | | | FCN | 2014-11-14 |
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation | ✓ Link | 21.64 | | | | | | SegNet | 2015-11-02 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | | | 1310 | | | | InternImage-H (M3I Pre-training) | 2022-11-10 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | | | | | | 44.6 | FastViT-MA36 | 2023-03-24 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | | | | | | 42.9 | FastViT-SA36 | 2023-03-24 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | | | | | | 41 | FastViT-SA24 | 2023-03-24 |
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization | ✓ Link | | | | | | 38 | FastViT-SA12 | 2023-03-24 |