The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 87.4 | | | ViT-P (InternImage-H) | 2025-05-26 |
SERNet-Former: Semantic Segmentation by Efficient Residual Network with Attention-Boosting Gates and Attention-Fusion Networks | ✓ Link | 87.35 | | 87.35 | SERNet-Former | 2024-01-28 |
Harnessing Diffusion Models for Visual Perception with Meta Prompts | ✓ Link | 87.1 | | | MetaPrompt-SD | 2023-12-22 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 87 | | | InternImage-H | 2022-11-10 |
Polarized Self-Attention: Towards High-quality Pixel-wise Regression | ✓ Link | 86.93 | | | HRNetV2-OCR+PSA | 2021-07-02 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 86.4 | | | InternImage-XL | 2022-11-10 |
Hierarchical Multi-Scale Attention for Semantic Segmentation | ✓ Link | 86.3 | | | HRNet-OCR | 2020-05-21 |
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data | ✓ Link | 86.2 | | | Depth Anything | 2024-01-19 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 85.8 | | | OneFormer (ConvNeXt-XL, Mapillary, multi-scale) | 2022-11-10 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 85.8 | | | ViT-Adapter-L | 2022-05-17 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 84.98 | | | SeMask (SeMask Swin-L Mask2Former) | 2021-12-23 |
Sequential Ensembling for Semantic Segmentation | | 84.8 | | | Sequential Ensemble (MiT-B5 + HRNet) | 2022-10-08 |
Soft labelling for semantic segmentation: Bringing coherence to label down-sampling | ✓ Link | 84.8 | | | Soft Labels (HRnet) | 2023-02-27 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 84.6 | | | OneFormer (ConvNeXt-XL, multi-scale) | 2022-11-10 |
Dilated Neighborhood Attention Transformer | ✓ Link | 84.5 | | | DiNAT-L (Mask2Former) | 2022-09-29 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 84.4 | | | OneFormer (Swin-L, multi-scale) | 2022-11-10 |
VPNeXt -- Rethinking Dense Decoding for Plain Vision Transformer | | 84.4 | | | VPNeXt | 2025-02-23 |
VOLO: Vision Outlooker for Visual Recognition | ✓ Link | 84.3 | | | VOLO-D4 (MS, ImageNet1k pretrain) | 2021-06-24 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 84.3 | | | Mask2Former (Swin-L) | 2021-12-02 |
Your ViT is Secretly an Image Segmentation Model | ✓ Link | 84.2 | 25 | 84.2 | EoMT (DINOv2-L, single-scale, 1024x1024) | 2025-03-24 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 84.0 | | | SegFormer (MiT-B5, Mapillary) | 2021-05-31 |
DDP: Diffusion Model for Dense Visual Prediction | ✓ Link | 83.9 | | | DDP (ConvNeXt-L, step-3) | 2023-03-30 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 83.6 | | | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 2019-09-24 |
Vision Transformers with Patch Diversification | ✓ Link | 83.6 | | | PatchDiverse + Swin-L (multi-scale test, upernet, ImageNet22k pretrain) | 2021-04-26 |
Pixel-wise Anomaly Detection in Complex Driving Scenes | ✓ Link | 83.5 | | | SynBoost | 2021-03-09 |
Conditional Boundary Loss for Semantic Segmentation | ✓ Link | 83.4 | | | HRNetV2+OCR+CBL(ImageNet pretrained) | 2023-07-05 |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction | ✓ Link | 83.2 | | | EfficientViT-B3 (r1184x2368) | 2022-05-29 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 83.16 | | | HRViT-b3 (SegFormer, SS) | 2021-11-01 |
Dilated SpineNet for Semantic Segmentation | | 83.04 | | | SpineNet-S143+ (single-scale test) | 2021-03-23 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 82.81 | | | HRViT-b2 (SegFormer, SS) | 2021-11-01 |
Fully Attentional Networks with Self-emerging Token Labeling | ✓ Link | 82.8 | | | FAN-L-Hybrid+STL | 2024-01-08 |
ResNeSt: Split-Attention Networks | ✓ Link | 82.7 | | | ResNeSt-200 | 2020-04-19 |
WaveMix: A Resource-efficient Neural Network for Image Analysis | ✓ Link | 82.7 | | | WaveMix | 2022-05-28 |
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | ✓ Link | 82.6 | | | CMX (B4) | 2022-03-09 |
WaveMix: A Resource-efficient Neural Network for Image Analysis | ✓ Link | 82.60 | | | WaveMix-256/16 (Level-4) | 2022-05-28 |
Understanding The Robustness in Vision Transformers | ✓ Link | 82.3 | | | FAN-L-Hybrid | 2022-04-26 |
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers | ✓ Link | 82.15 | | | SETR-PUP (80k, MS) | 2020-12-31 |
DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation | ✓ Link | 82.0 | | | DSNet-Base(single-scale) | 2024-06-06 |
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks | ✓ Link | 81.7 | | | EANet | 2021-05-05 |
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | ✓ Link | 81.63 | | | HRViT-b1 (SegFormer, SS) | 2021-11-01 |
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | ✓ Link | 81.6 | | | CMX (B2) | 2022-03-09 |
Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance | ✓ Link | 81.54 | | | Trans4Trans | 2021-08-20 |
Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation | ✓ Link | 81.5 | | | Panoptic-DeepLab | 2019-11-22 |
Soft labelling for semantic segmentation: Bringing coherence to label down-sampling | | 81.5 | | | Soft Labels (Deeplab) | |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 81.1 | | | HRNetV2 (HRNetV2-W48) | 2019-08-20 |
Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation | ✓ Link | 81.1 | | | Trans4PASS (Small) | 2022-03-02 |
Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective | ✓ Link | 81.0 | | | DEPICT-SA (ViT-L multi-scale) | 2024-11-05 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 80.6 | | | OCR (ResNet-101-FCN) | 2019-09-24 |
RepVGG: Making VGG-style ConvNets Great Again | ✓ Link | 80.57 | | | RepVGG-B2 | 2021-01-11 |
DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation | ✓ Link | 80.4 | 81.9 | | DSNet(single-scale) | 2024-06-06 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 80.39 | | | SeMask (SeMask Swin-L FPN) | 2021-12-23 |
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation | ✓ Link | 80.33 | | | Auto-DeepLab-L | 2019-01-10 |
Standardized Max Logits: A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation | ✓ Link | 80.33 | | | SML | 2021-07-23 |
Multiscale Deep Equilibrium Models | ✓ Link | 80.3 | | | Multiscale DEQ (MDEQ-XL) | 2020-06-15 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 80.2 | | | HRNetV2 (HRNetV2-W40) | 2019-08-20 |
Pyramid Scene Parsing Network | ✓ Link | 79.7 | | | PSPNet (Dilated-ResNet-101) | 2016-12-04 |
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation | ✓ Link | 79.6 | | | DeepLabv3+ (Dilated-Xception-71) | 2018-02-07 |
Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation | ✓ Link | 79.1 | | | Trans4PASS (Tiny) | 2022-03-02 |
Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective | ✓ Link | 78.8 | | | DEPICT-SA (ViT-L single-scale) | 2024-11-05 |
PointRend: Image Segmentation as Rendering | ✓ Link | 78.6 | | | SemanticFPN P2-P5 + PointRend | 2019-12-17 |
Rethinking Atrous Convolution for Semantic Image Segmentation | ✓ Link | 78.5 | | | DeepLabv3 (Dilated-ResNet-101) | 2017-06-17 |
Representation Recycling for Streaming Video Analysis | ✓ Link | 78.2 | 1.1 | | StreamDEQ (8 iterations) | 2022-04-28 |
Multiscale Deep Equilibrium Models | ✓ Link | 77.8 | | | Multiscale DEQ (MDEQ-large) | 2020-06-15 |
Hyperbolic Active Learning for Semantic Segmentation under Domain Shift | ✓ Link | 77.8 | | | HALO | 2023-06-19 |
Efficient Visual Pretraining with Contrastive Detection | ✓ Link | 77.0 | | | DetCon_B | 2021-03-19 |
EEEA-Net: An Early Exit Evolutionary Neural Architecture Search | ✓ Link | 76.8 | | | EEEA-Net-C2 (ours) | 2021-08-13 |
WaveMix-Lite: A Resource-efficient Neural Network for Image Analysis | ✓ Link | 76.79 | | | WaveMixLite-256/16 | 2022-10-13 |
SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images | ✓ Link | 76.41 | | | SwinMTL | 2024-03-15 |
CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes | ✓ Link | 76.36 | 72.3 (3090) | | CSFNet-2 | 2024-07-01 |
RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality | ✓ Link | 76.27 | | | RepMLPNet-D256 | 2021-12-21 |
Deep Residual Learning for Image Recognition | ✓ Link | 75.7 | | | Dilated-ResNet (Dilated-ResNet-101) | 2015-12-10 |
UNet++: A Nested U-Net Architecture for Medical Image Segmentation | ✓ Link | 75.5 | | | UNet++ (ResNet-101) | 2018-07-18 |
SqueezeNAS: Fast neural architecture search for faster semantic segmentation | ✓ Link | 75.2 | | | SqueezeNAS (LAT XLarge) | 2019-08-05 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 75.2 | | | ReLICv2 | 2022-01-13 |
CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes | ✓ Link | 74.73 | 106.1 (3090) | | CSFNet-1 | 2024-07-01 |
Gated-SCNN: Gated Shape CNNs for Semantic Segmentation | ✓ Link | 74.7 | | | GSCNN (ResNet-101) | 2019-07-12 |
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? | ✓ Link | 74.6 | | | BYOL | 2022-01-13 |
Waterfall Atrous Spatial Pooling Architecture for Efficient Semantic Segmentation | ✓ Link | 74 | | | WASPnet (ours) | 2019-12-06 |
SqueezeNAS: Fast neural architecture search for faster semantic segmentation | ✓ Link | 73.6 | | | SqueezeNAS (LAT Large) | 2019-08-05 |
FasterSeg: Searching for Faster Real-time Semantic Segmentation | ✓ Link | 73.1 | | | FasterSeg | 2019-12-23 |
Gated-SCNN: Gated Shape CNNs for Semantic Segmentation | ✓ Link | 73.0 | | | GSCNN (ResNet-50) | 2019-07-12 |
Aerial-PASS: Panoramic Annular Scene Segmentation in Drone Videos | | 72.8 | | | Aerial-PASS (ResNet-18) | 2021-05-15 |
Real-time Fusion Network for RGB-D Semantic Segmentation Incorporating Unexpected Obstacle Detection for Road-driving Images | ✓ Link | 72.5 | | | RFNet (ResNet-18) | 2020-02-24 |
ERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic Segmentation | ✓ Link | 72.1 | | | ERFNet (PyTorch) | 2017-10-09 |
DS-PASS: Detail-Sensitive Panoramic Annular Semantic Segmentation through SwaftNet for Surrounding Sensing | ✓ Link | 72.1 | | | SwaftNet (ResNet-18) | 2019-09-17 |
Representation Recycling for Streaming Video Analysis | ✓ Link | 71.5 | 1.9 | | StreamDEQ (4 iterations) | 2022-04-28 |
Template-Based Automatic Search of Compact Semantic Segmentation Architectures | ✓ Link | 69.5 | | | Template-Based NAS-arch1 | 2019-04-04 |
Fast-SCNN: Fast Semantic Segmentation Network | ✓ Link | 69.19 | | | Fast-SCNN + Coarse + ImageNet | 2019-02-12 |
Incorporating Luminance, Depth and Color Information by a Fusion-based Network for Semantic Segmentation | ✓ Link | 68.48 | | | LDFNet | 2018-09-24 |
Template-Based Automatic Search of Compact Semantic Segmentation Architectures | ✓ Link | 68.1 | | | Template-Based NAS-arch0 | 2019-04-04 |
SqueezeNAS: Fast neural architecture search for faster semantic segmentation | ✓ Link | 68.0 | | | SqueezeNAS (LAT Small) | 2019-08-05 |
ContextNet: Exploring Context and Detail for Semantic Segmentation in Real-time | ✓ Link | 65.9 | | | ContextNet | 2018-05-11 |
DiCENet: Dimension-wise Convolutions for Efficient Networks | ✓ Link | 63.4 | | | DiCENet | 2019-06-08 |
Exploring Semantic Segmentation on the DCT Representation | | 61.6 | | | DCT-EDANet | 2019-07-23 |
Representation Recycling for Streaming Video Analysis | ✓ Link | 57.9 | 2.9 | | StreamDEQ (2 iterations) | 2022-04-28 |
Representation Recycling for Streaming Video Analysis | ✓ Link | 45.5 | 4.3 | | StreamDEQ (1 iteration) | 2022-04-28 |
MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation | ✓ Link | 42.4 | | | MRFP+(Ours) Resnet50 | 2023-11-30 |
MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation | ✓ Link | 34.66 | | | Resnet50 | 2023-11-30 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | | | 76.2 | SegFormer-B0 | 2021-05-31 |