Paper | Code | mIoU | Pixel Accuracy | Model | Date |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 62.8 | | BEiT-3 | 2022-08-22 |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions | ✓ Link | 62.1 | | ViT-CoMer | 2024-03-13 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 61.5 | | EVA | 2022-11-14 |
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | ✓ Link | 61.4 | | FD-SwinV2-G | 2022-05-27 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 60.8 | | MaskDINO-SwinL | 2022-06-06 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 60.8 | | OneFormer (InternImage-H, emb_dim=256, multi-scale, 896x896) | 2022-11-10 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.5 | | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 2022-05-17 |
SERNet-Former: Semantic Segmentation by Efficient Residual Network with Attention-Boosting Gates and Attention-Fusion Networks | ✓ Link | 59.35 | | SERNet-Former_v2 | 2024-01-28 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 58.6 | | OneFormer (DiNAT-L, multi-scale, 896x896) | 2022-11-10 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 58.4 | | ViT-Adapter-L (UperNet, BEiT pretrain) | 2022-05-17 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 58.4 | | OneFormer (DiNAT-L, multi-scale, 640x640) | 2022-11-10 |
Representation Separation for Semantic Segmentation with Vision Transformers | | 58.4 | | RSSeg-ViT-L(BEiT pretrain) | 2022-12-28 |
Your ViT is Secretly an Image Segmentation Model | ✓ Link | 58.4 | | EoMT (DINOv2-L, single-scale, 512x512) | 2025-03-24 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 58.3 | | OneFormer (Swin-L, multi-scale, 896x896) | 2022-11-10 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 58.2 | | SeMask (SeMask Swin-L FaPN-Mask2Former) | 2021-12-23 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 58.2 | | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 2021-12-23 |
Dilated Neighborhood Attention Transformer | ✓ Link | 58.1 | | DiNAT-L (Mask2Former) | 2022-09-29 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 57.7 | | Mask2Former (Swin-L-FaPN, multiscale) | 2021-12-02 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 57.7 | | OneFormer (Swin-L, multi-scale, 640x640) | 2022-11-10 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 57.5 | | SeMask (SeMask Swin-L Mask2Former) | 2021-12-23 |
Efficient Self-Ensemble for Semantic Segmentation | ✓ Link | 57.1 | | SenFormer (BEiT-L) | 2021-11-26 |
BEiT: BERT Pre-Training of Image Transformers | ✓ Link | 57.0 | | BEiT-L (ViT+UperNet, ImageNet-22k pretrain) | 2021-06-15 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 57.0 | | SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 2021-12-23 |
FaPN: Feature-aligned Pyramid Network for Dense Image Prediction | ✓ Link | 56.7 | | FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | 2021-08-16 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 56.4 | | Mask2Former (Swin-L-FaPN) | 2021-12-02 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 56.2 | | SeMask (SeMask Swin-L MaskFormer) | 2021-12-23 |
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | ✓ Link | 55.7 | | CSWin-L (UperNet, ImageNet-22k pretrain) | 2021-07-01 |
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ Link | 55.6 | | MaskFormer (Swin-L, ImageNet-22k pretrain) | 2021-07-13 |
DeiT III: Revenge of the ViT | ✓ Link | 55.6 | | DeiT-L | 2022-04-14 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 55.4 | | Focal-L (UperNet, ImageNet-22k pretrain) | 2021-07-01 |
SegViT: Semantic Segmentation with Plain Vision Transformers | ✓ Link | 55.2 | | SegViT ViT-Large | 2022-10-12 |
Vision Transformers with Patch Diversification | ✓ Link | 54.4 | | PatchDiverse + Swin-L (multi-scale test, upernet, ImageNet22k pretrain) | 2021-04-26 |
K-Net: Towards Unified Image Segmentation | ✓ Link | 54.3 | | K-Net | 2021-06-28 |
Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective | ✓ Link | 54.3 | | DEPICT-SA (ViT-L 640x640 multi-scale) | 2024-11-05 |
Efficient Self-Ensemble for Semantic Segmentation | ✓ Link | 54.2 | | SenFormer (Swin-L) | 2021-11-26 |
DeiT III: Revenge of the ViT | ✓ Link | 54.1 | | DeiT-B | 2022-04-14 |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | 53.8 | | MixMIM-L | 2022-05-26 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 53.63 | | Seg-L-Mask/16 (MS, ViT-L) | 2021-05-12 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 53.5 | | Swin-L (UperNet, ImageNet-22k pretrain) | 2021-03-25 |
SeMask: Semantically Masked Transformers for Semantic Segmentation | ✓ Link | 53.5 | | SeMask (SeMask Swin-L FPN) | 2021-12-23 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 52.9 | | PatchConvNet-L120 (UperNet) | 2021-12-27 |
Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective | ✓ Link | 52.9 | | DEPICT-SA (ViT-L 640x640 single-scale) | 2024-11-05 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 52.8 | | PatchConvNet-B120 (UperNet) | 2021-12-27 |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | ✓ Link | 51.8 | | SegFormer-B5(MS, 87M #Params, ImageNet-1K pretrain) | 2021-05-31 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 51.5 | | Light-Ham (VAN-Huge, 61M, IN-1k, MS) | 2021-09-09 |
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention | ✓ Link | 51.4 | 84.0 | CrossFormer (ImageNet1k-pretrain, UPerNet, multi-scale test) | 2021-07-31 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 51.1 | | PatchConvNet-B60 (UperNet) | 2021-12-27 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 51.0 | | Light-Ham (VAN-Large, 46M, IN-1k, MS) | 2021-09-09 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 50.5 | | UperNet Shuffle-B | 2021-06-07 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 50.3 | | ELSA-Swin-S | 2021-12-23 |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | 50.3 | | MixMIM-B | 2022-05-26 |
Twins: Revisiting the Design of Spatial Attention in Vision Transformers | ✓ Link | 50.2 | | Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 2021-04-28 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 50.0 | | Seg-B-Mask/16 (MS, ViT-B) | 2021-05-12 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 49.7 | | Swin-B (UperNet, ImageNet-1k pretrain) | 2021-03-25 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 49.69 | 83.43 | gSwin-S | 2022-08-24 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 49.61 | 83.37 | Seg-B/8 (MS, ViT-B) | 2021-05-12 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 49.6 | | UperNet Shuffle-S | 2021-06-07 |
Is Attention Better Than Matrix Decomposition? | ✓ Link | 49.6 | | Light-Ham (VAN-Base, 27M, IN-1k, MS) | 2021-09-09 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 49.3 | | PatchConvNet-S60 (UperNet) | 2021-12-27 |
Vision Transformers for Dense Prediction | ✓ Link | 49.02 | 83.11 | DPT-Hybrid | 2021-03-24 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 48.8 | | DaViT-S (UperNet) | 2022-04-07 |
ResNeSt: Split-Attention Networks | ✓ Link | 48.36 | | ResNeSt-200 | 2020-04-19 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 47.98 | | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 2019-09-24 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 47.63 | 82.60 | gSwin-T | 2022-08-24 |
ResNeSt: Split-Attention Networks | ✓ Link | 47.60 | | ResNeSt-269 | 2020-04-19 |
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | ✓ Link | 47.6 | | UperNet Shuffle-T | 2021-06-07 |
DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation | | 47.12 | | DCNAS | 2020-03-26 |
ResNeSt: Split-Attention Networks | ✓ Link | 46.91 | | ResNeSt-101 | 2020-04-19 |
Segmenter: Transformer for Semantic Segmentation | ✓ Link | 46.9 | | Seg-S-Mask/16 (MS, ViT-S) | 2021-05-12 |
Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields | ✓ Link | 46.41 | | Swin-S (RPE w/ GAB) | 2023-05-08 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 46.3 | | DaViT-B (UperNet) | 2022-04-07 |
Context Prior for Scene Segmentation | ✓ Link | 46.27 | | CPN(ResNet-101) | 2020-04-03 |
MultiMAE: Multi-modal Multi-task Masked Autoencoders | ✓ Link | 46.2 | | MultiMAE (ViT-B) | 2022-04-04 |
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition | ✓ Link | 45.99 | 82.49 | PyConvSegNet-152 | 2020-06-20 |
Disentangled Non-Local Neural Networks | ✓ Link | 45.97 | | DNL | 2020-06-11 |
CTNet: Context-based Tandem Network for Semantic Segmentation | ✓ Link | 45.94 | | CTNet | 2021-04-20 |
Adaptive Context Network for Scene Parsing | | 45.90 | | ACNet (ResNet-101) | 2019-11-05 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 45.66 | | OCR (HRNetV2-W48) | 2019-09-24 |
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks | ✓ Link | 45.33 | | EANet (ResNet-101) | 2021-05-05 |
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | ✓ Link | 45.28 | | OCR (ResNet-101) | 2019-09-24 |
Asymmetric Non-local Neural Networks for Semantic Segmentation | ✓ Link | 45.24 | | Asymmetric ALNN | 2019-08-21 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 45.07 | 81.79 | gSwin-VT | 2022-08-24 |
Location-aware Upsampling for Semantic Segmentation | ✓ Link | 45.02 | | LaU-regression-loss | 2019-11-13 |
Context Encoding for Semantic Segmentation | ✓ Link | 44.65 | | EncNet (ResNet-101) | 2018-03-23 |
Symbolic Graph Reasoning Meets Convolutions | ✓ Link | 44.32 | | SGR (ResNet-101) | 2018-12-01 |
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation | ✓ Link | 43.98 | 81.72 | Auto-DeepLab-L | 2019-01-10 |
PSANet: Point-wise Spatial Attention Network for Scene Parsing | ✓ Link | 43.77 | | PSANet (ResNet-101) | 2018-09-01 |
Dynamic-structured Semantic Propagation Network | | 43.68 | | DSSPN (ResNet-101) | 2018-03-16 |
Pyramid Scene Parsing Network | ✓ Link | 43.51 | | PSPNet (ResNet-152) | 2016-12-04 |
Pyramid Scene Parsing Network | ✓ Link | 43.29 | | PSPNet (ResNet-101) | 2016-12-04 |
High-Resolution Representations for Labeling Pixels and Regions | ✓ Link | 42.99 | | HRNetV2 (HRNetV2-W48) | 2019-04-09 |
Unified Perceptual Parsing for Scene Understanding | ✓ Link | 42.66 | | UperNet (ResNet-101) | 2018-07-26 |
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | ✓ Link | 40.70 | | RefineNet (ResNet-152) | 2016-11-20 |
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | ✓ Link | 40.20 | | RefineNet (ResNet-101) | 2016-11-20 |