Paper | Code | PQ | mIoU | AP | Model | Date |
--- | --- | --- | --- | --- | --- | --- |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 54.5 | 60.4 | 40.2 | OneFormer (InternImage-H, emb_dim=256, single-scale, 896x896) | 2022-11-10 |
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 54.0 | | | ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280, COCO_pretrain) | 2025-05-26 |
A Simple Framework for Open-Vocabulary Segmentation and Detection | ✓ Link | 53.7 | | | OpenSeeD (Swin-L, single-scale, 1280x1280) | 2023-03-14 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 53.4 | 58.9 | | OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-Pretrain) | 2022-11-10 |
Your ViT is Secretly an Image Segmentation Model | ✓ Link | 52.8 | | | EoMT (DINOv2-g, single-scale, 1280x1280, COCO pre-trained) | 2025-03-24 |
Generalized Decoding for Pixel, Image, and Language | ✓ Link | 52.4 | 59.1 | 38.7 | X-Decoder (DaViT-d5, Deform, single-scale, 1280x1280) | 2022-12-21 |
The Missing Point in Vision Transformers for Universal Image Segmentation | ✓ Link | 51.9 | | | ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280) | 2025-05-26 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 51.5 | 58.3 | 37.1 | OneFormer (DiNAT-L, single-scale, 1280x1280) | 2022-11-10 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 51.4 | 57.0 | 37.8 | OneFormer (Swin-L, single-scale, 1280x1280) | 2022-11-10 |
kMaX-DeepLab: k-means Mask Transformer | ✓ Link | 50.9 | 55.2 | - | kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281) | 2022-07-08 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 50.5 | 58.3 | 36.0 | OneFormer (DiNAT-L, single-scale, 640x640) | 2022-11-10 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 50.1 | 57.4 | 36.3 | OneFormer (ConvNeXt-XL, single-scale, 640x640) | 2022-11-10 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 50.0 | 56.6 | 36.2 | OneFormer (ConvNeXt-L, single-scale, 640x640) | 2022-11-10 |
OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 49.8 | 57.0 | 35.9 | OneFormer (Swin-L, single-scale, 640x640) | 2022-11-10 |
Generalized Decoding for Pixel, Image, and Language | ✓ Link | 49.6 | 58.1 | 35.8 | X-Decoder (L) | 2022-12-21 |
Dilated Neighborhood Attention Transformer | ✓ Link | 49.4 | 56.3 | 35.0 | DiNAT-L (Mask2Former, 640x640) | 2022-09-29 |
kMaX-DeepLab: k-means Mask Transformer | ✓ Link | 48.7 | 54.8 | - | kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641) | 2022-07-08 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 48.1 | 54.5 | 34.2 | Mask2Former (Swin-L) | 2021-12-02 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 46.2 | 55.4 | 33.2 | Mask2Former (Swin-L + FAPN, 640x640) | 2021-12-02 |
kMaX-DeepLab: k-means Mask Transformer | ✓ Link | 42.3 | 45.3 | - | kMaX-DeepLab (ResNet50, single-scale, 1281x1281) | 2022-07-08 |
kMaX-DeepLab: k-means Mask Transformer | ✓ Link | 41.5 | 45.0 | - | kMaX-DeepLab (ResNet50, single-scale, 641x641) | 2022-07-08 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 39.7 | 46.1 | 26.5 | Mask2Former (ResNet-50, 640x640) | 2021-12-02 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 37.9 | 50.0 | | Panoptic-DeepLab (SWideRNet) | 2021-12-02 |
Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ Link | 35.7 | | | MaskFormer (R101 + 6 Enc) | 2021-07-13 |
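For reference, the PQ column is Panoptic Quality as defined by Kirillov et al.: predicted and ground-truth segments are matched as true positives when their IoU exceeds 0.5, and

$$
\mathrm{PQ} \;=\; \frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
\;=\; \underbrace{\frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}|}}_{\text{SQ}}
\;\times\;
\underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{RQ}}
$$

The mIoU and AP columns are semantic-segmentation and instance-segmentation scores evaluated from the same panoptic model, where the underlying papers report them.

Several of the listed checkpoints have ports in Hugging Face `transformers`. Below is a minimal panoptic-inference sketch; it assumes the `shi-labs/oneformer_ade20k_swin_large` checkpoint corresponds to the OneFormer (Swin-L, 640x640) row, which you should verify against the official release.

```python
# Minimal panoptic-inference sketch using the OneFormer port in Hugging Face `transformers`.
# Assumption: "shi-labs/oneformer_ade20k_swin_large" matches the OneFormer (Swin-L) row above.
import requests
import torch
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

checkpoint = "shi-labs/oneformer_ade20k_swin_large"
processor = OneFormerProcessor.from_pretrained(checkpoint)
model = OneFormerForUniversalSegmentation.from_pretrained(checkpoint)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# OneFormer is task-conditioned: pass the "panoptic" task token to get the panoptic prediction.
inputs = processor(images=image, task_inputs=["panoptic"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Merge the predicted masks and resize them back to the input resolution.
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]])[0]
segmentation = result["segmentation"]    # (H, W) tensor of segment ids
segments_info = result["segments_info"]  # per-segment label_id / score metadata
print(segmentation.shape, len(segments_info))
```

The same pattern applies to the Mask2Former rows via `Mask2FormerForUniversalSegmentation`, with the appropriate checkpoint swapped in.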