Paper | Code | mIoU | Pixel Accuracy | Model | Date |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 62.8 | | BEiT-3 | 2022-08-22 |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions | ✓ Link | 62.1 | | ViT-CoMer | 2024-03-13 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 61.5 | | EVA | 2022-11-14 |
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | ✓ Link | 61.4 | | FD-SwinV2-G | 2022-05-27 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 60.8 | | MaskDINO-SwinL | 2022-06-06 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 60.8 | | OneFormer (InternImage-H, emb_dim=256, multi-scale, 896x896) | 2022-11-10 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.5 | | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 2022-05-17 |
SERNet-Former: Semantic Segmentation by Efficient Residual Network with Attention-Boosting Gates and Attention-Fusion Networks | ✓ Link | 59.35 | | SERNet-Former_v2 | 2024-01-28 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 58.6 | | OneFormer (DiNAT-L, multi-scale, 896x896) | 2022-11-10 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 58.4 | | ViT-Adapter-L (UperNet, BEiT pretrain) | 2022-05-17 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 58.4 | | OneFormer (DiNAT-L, multi-scale, 640x640) | 2022-11-10 |
Representation Separation for Semantic Segmentation with Vision Transformers | | 58.4 | | RSSeg-ViT-L(BEiT pretrain) | 2022-12-28 |
Your ViT is Secretly an Image Segmentation Model | ✓ Link | 58.4 | | EoMT (DINOv2-L, single-scale, 512x512) | 2025-03-24 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 58.3 | | OneFormer (Swin-L, multi-scale, 896x896) | 2022-11-10 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 58.2 | | SeMask (SeMask Swin-L FaPN-Mask2Former) | 2021-12-23 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 58.2 | | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 2021-12-23 |
Dilated Neighborhood Attention Transformer | ✓ Link | 58.1 | | DiNAT-L (Mask2Former) | 2022-09-29 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 57.7 | | Mask2Former (Swin-L-FaPN, multiscale) | 2021-12-02 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 57.7 | | OneFormer (Swin-L, multi-scale, 640x640) | 2022-11-10 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 57.5 | | SeMask (SeMask Swin-L Mask2Former) | 2021-12-23 |
Efficient Self-Ensemble for Semantic Segmentation | ✓ Link | 57.1 | | SenFormer (BEiT-L) | 2021-11-26 |
BEiT: BERT Pre-Training of Image Transformers | ✓ Link | 57.0 | | BEiT-L (ViT+UperNet, ImageNet-22k pretrain) | 2021-06-15 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 57.0 | | SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 2021-12-23 |
FaPN: Feature-aligned Pyramid Network for Dense Image Prediction | ✓ Link | 56.7 | | FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | 2021-08-16 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 56.4 | | Mask2Former (Swin-L-FaPN) | 2021-12-02 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 56.2 | | SeMask (SeMask Swin-L MaskFormer) | 2021-12-23 |
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | ✓ Link | 55.7 | | CSWin-L (UperNet, ImageNet-22k pretrain) | 2021-07-01 |
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ Link | 55.6 | | MaskFormer (Swin-L, ImageNet-22k pretrain) | 2021-07-13 |
DeiT III: Revenge of the ViT | ✓ Link | 55.6 | | DeiT-L | 2022-04-14 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 55.4 | | Focal-L (UperNet, ImageNet-22k pretrain) | 2021-07-01 |
SegViT: Semantic Segmentation with Plain Vision Transformers | ✓ Link | 55.2 | | SegViT ViT-Large | 2022-10-12 |
Vision Transformers with Patch Diversification | ✓ Link | 54.4 | | PatchDiverse + Swin-L (multi-scale test, upernet, ImageNet22k pretrain) | 2021-04-26 |
K-Net: Towards Unified Image Segmentation | ✓ Link | 54.3 | | K-Net | 2021-06-28 |
Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective | ✓ Link | 54.3 | | DEPICT-SA (ViT-L 640x640 multi-scale) | 2024-11-05 |
Efficient Self-Ensemble for Semantic Segmentation | ✓ Link | 54.2 | | SenFormer (Swin-L) | 2021-11-26 |
DeiT III: Revenge of the ViT | ✓ Link | 54.1 | | DeiT-B | 2022-04-14 |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | 53.8 | | MixMIM-L | 2022-05-26 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 53.63 | | Seg-L-Mask/16 (MS, ViT-L) | 2021-05-12 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 53.5 | | Swin-L (UperNet, ImageNet-22k pretrain) | 2021-03-25 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 53.5 | | SeMask (SeMask Swin-L FPN) | 2021-12-23 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 52.9 | | PatchConvNet-L120 (UperNet) | 2021-12-27 |
Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective | ✓ Link | 52.9 | | DEPICT-SA (ViT-L 640x640 single-scale) | 2024-11-05 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 52.8 | | PatchConvNet-B120 (UperNet) | 2021-12-27 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 51.8 | | SegFormer-B5(MS, 87M #Params, ImageNet-1K pretrain) | 2021-05-31 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 51.5 | | Light-Ham (VAN-Huge, 61M, IN-1k, MS) | 2021-09-09 |
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention | ✓ Link | 51.4 | 84.0 | CrossFormer (ImageNet1k-pretrain, UPerNet, multi-scale test) | 2021-07-31 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 51.1 | | PatchConvNet-B60 (UperNet) | 2021-12-27 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 51.0 | | Light-Ham (VAN-Large, 46M, IN-1k, MS) | 2021-09-09 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 50.5 | | UperNet Shuffle-B | 2021-06-07 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 50.3 | | ELSA-Swin-S | 2021-12-23 |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | 50.3 | | MixMIM-B | 2022-05-26 |
Twins: Revisiting the Design of Spatial Attention in Vision Transformers | ✓ Link | 50.2 | | Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 2021-04-28 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 50.0 | | Seg-B-Mask/16 (MS, ViT-B) | 2021-05-12 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 49.7 | | Swin-B (UperNet, ImageNet-1k pretrain) | 2021-03-25 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 49.69 | 83.43 | gSwin-S | 2022-08-24 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 49.61 | 83.37 | Seg-B/8 (MS, ViT-B) | 2021-05-12 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 49.6 | | UperNet Shuffle-S | 2021-06-07 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 49.6 | | Light-Ham (VAN-Base, 27M, IN-1k, MS) | 2021-09-09 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 49.3 | | PatchConvNet-S60 (UperNet) | 2021-12-27 |
Vision Transformers for Dense Prediction | ✓ Link | 49.02 | 83.11 | DPT-Hybrid | 2021-03-24 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 48.8 | | DaViT-S (UperNet) | 2022-04-07 |
ResNeSt: Split-Attention Networks | ✓ Link | 48.36 | | ResNeSt-200 | 2020-04-19 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 47.98 | | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 2019-09-24 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 47.63 | 82.60 | gSwin-T | 2022-08-24 |
ResNeSt: Split-Attention Networks | ✓ Link | 47.60 | | ResNeSt-269 | 2020-04-19 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 47.6 | | UperNet Shuffle-T | 2021-06-07 |
DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation | | 47.12 | | DCNAS | 2020-03-26 |
ResNeSt: Split-Attention Networks | ✓ Link | 46.91 | | ResNeSt-101 | 2020-04-19 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 46.9 | | Seg-S-Mask/16 (MS, ViT-S) | 2021-05-12 |
Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields | ✓ Link | 46.41 | | Swin-S (RPE w/ GAB) | 2023-05-08 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 46.3 | | DaViT-B (UperNet) | 2022-04-07 |
Context Prior for Scene Segmentation | ✓ Link | 46.27 | | CPN(ResNet-101) | 2020-04-03 |
MultiMAE: Multi-modal Multi-task Masked Autoencoders | ✓ Link | 46.2 | | MultiMAE (ViT-B) | 2022-04-04 |
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition | ✓ Link | 45.99 | 82.49 | PyConvSegNet-152 | 2020-06-20 |
Disentangled Non-Local Neural Networks | ✓ Link | 45.97 | | DNL | 2020-06-11 |
CTNet: Context-based Tandem Network for Semantic Segmentation | ✓ Link | 45.94 | | CTNet | 2021-04-20 |
Adaptive Context Network for Scene Parsing | | 45.90 | | ACNet (ResNet-101) | 2019-11-05 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 45.66 | | OCR (HRNetV2-W48) | 2019-09-24 |
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks | ✓ Link | 45.33 | | EANet (ResNet-101) | 2021-05-05 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 45.28 | | OCR (ResNet-101) | 2019-09-24 |
Asymmetric Non-local Neural Networks for Semantic Segmentation | ✓ Link | 45.24 | | Asymmetric ALNN | 2019-08-21 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 45.07 | 81.79 | gSwin-VT | 2022-08-24 |
Location-aware Upsampling for Semantic Segmentation | ✓ Link | 45.02 | | LaU-regression-loss | 2019-11-13 |
Context Encoding for Semantic Segmentation | ✓ Link | 44.65 | | EncNet (ResNet-101) | 2018-03-23 |
Symbolic Graph Reasoning Meets Convolutions | ✓ Link | 44.32 | | SGR (ResNet-101) | 2018-12-01 |
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation | ✓ Link | 43.98 | 81.72 | Auto-DeepLab-L | 2019-01-10 |
PSANet: Point-wise Spatial Attention Network for Scene Parsing | ✓ Link | 43.77 | | PSANet (ResNet-101) | 2018-09-01 |
Dynamic-structured Semantic Propagation Network | | 43.68 | | DSSPN (ResNet-101) | 2018-03-16 |
Pyramid Scene Parsing Network | ✓ Link | 43.51 | | PSPNet (ResNet-152) | 2016-12-04 |
Pyramid Scene Parsing Network | ✓ Link | 43.29 | | PSPNet (ResNet-101) | 2016-12-04 |
High-Resolution Representations for Labeling Pixels and Regions | ✓ Link | 42.99 | | HRNetV2 (HRNetV2-W48) | 2019-04-09 |
Unified Perceptual Parsing for Scene Understanding | ✓ Link | 42.66 | | UperNet (ResNet-101) | 2018-07-26 |
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | ✓ Link | 40.70 | | RefineNet (ResNet-152) | 2016-11-20 |
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | ✓ Link | 40.20 | | RefineNet (ResNet-101) | 2016-11-20 |