| Paper | Code | PQ | PQst | PQth | RQ | SQ | RQst | RQth | SQst | SQth | AP | mIoU | Model | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HyperSeg: Towards Universal Visual Segmentation with Large Language Model | ✓ Link | 61.2 | | | | | | | | | | | HyperSeg (Swin-B) | 2024-11-26 |
| OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 60.0 | 49.2 | 67.1 | | | | | | | 52.0 | 68.8 | OneFormer (InternImage-H, single-scale) | 2022-11-10 |
| A Simple Framework for Open-Vocabulary Segmentation and Detection | ✓ Link | 59.5 | | | | | | | | | 53.2 | | OpenSeeD (SwinL, single-scale) | 2023-03-14 |
| UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding | ✓ Link | 59.5 | | | | | | | | | 50.7 | 69.7 | UMG-CLIP-E/14 | 2024-01-12 |
| Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 59.4 | | | | | | | | | 50.9 | | Mask DINO (SwinL, single-scale) | 2022-06-06 |
| Your ViT is Secretly an Image Segmentation Model | ✓ Link | 59.2 | | | | | | | | | | | EoMT (DINOv2-g, single-scale, 1280x1280) | 2025-03-24 |
| UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding | ✓ Link | 58.9 | | | | | | | | | 49.7 | 68.9 | UMG-CLIP-L/14 | 2024-01-12 |
| Dilated Neighborhood Attention Transformer | ✓ Link | 58.5 | 48.8 | 64.9 | | | | | | | 49.2 | 68.3 | DiNAT-L (single-scale, Mask2Former) | 2022-09-29 |
| Vision Transformer Adapter for Dense Predictions | ✓ Link | 58.4 | 48.4 | 65.0 | | | | | | | 48.9 | | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) | 2022-05-17 |
| Visual Attention Network | ✓ Link | 58.2 | 48.2 | 64.8 | | | | | | | | | Visual Attention Network (VAN-B6 + Mask2Former) | 2022-02-20 |
| kMaX-DeepLab: k-means Mask Transformer | ✓ Link | 58.1 | 48.8 | 64.3 | | | | | | | | | kMaX-DeepLab (single-scale, pseudo-labels) | 2022-07-08 |
| Hierarchical Open-vocabulary Universal Image Segmentation | ✓ Link | 58.1 | | | | | | | | | | 66.8 | HIPIE (ViT-H, single-scale) | 2023-07-03 |
| kMaX-DeepLab: k-means Mask Transformer | ✓ Link | 58.0 | 48.6 | 64.2 | | | | | | | | | kMaX-DeepLab (single-scale, drop query with 256 queries) | 2022-07-08 |
| OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 58.0 | 48.4 | 64.3 | | | | | | | 49.2 | 68.1 | OneFormer (DiNAT-L, single-scale) | 2022-11-10 |
| kMaX-DeepLab: k-means Mask Transformer | ✓ Link | 57.9 | 48.6 | 64.0 | | | | | | | | | kMaX-DeepLab (single-scale) | 2022-07-08 |
| OneFormer: One Transformer to Rule Universal Image Segmentation | ✓ Link | 57.9 | 48.0 | 64.4 | | | | | | | 49.0 | 67.4 | OneFormer (Swin-L, single-scale) | 2022-11-10 |
| Focal Modulation Networks | ✓ Link | 57.9 | | | | | | | | | 48.4 | | FocalNet-L (Mask2Former, 200 queries) | 2022-03-22 |
| Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 57.8 | 48.1 | 64.2 | | | | | | | 48.6 | | Mask2Former (single-scale) | 2021-12-02 |
| Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers | ✓ Link | 55.8 | 46.9 | 61.7 | | | | | | | | | Panoptic SegFormer (single-scale) | 2021-09-08 |
| CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation | ✓ Link | 55.3 | 46.6 | 61.0 | | | | | | | | | CMT-DeepLab (single-scale) | 2022-06-17 |
| Per-Pixel Classification is Not All You Need for Semantic Segmentation | ✓ Link | 52.7 | 44.0 | 58.5 | 63.5 | 81.8 | | | | | | | MaskFormer (single-scale) | 2021-07-13 |
| MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers | ✓ Link | 51.1 | 42.2 | 57.0 | | | | | | | | | MaX-DeepLab-L (single-scale) | 2020-12-01 |
| Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers | ✓ Link | 50.6 | 43.2 | 55.5 | | | | | | | | | Panoptic SegFormer (ResNet-101) | 2021-09-08 |
| ResNeSt: Split-Attention Networks | ✓ Link | 47.9 | 37.0 | 55.1 | | | | | | | | | PanopticFPN + ResNeSt (single-scale) | 2020-04-19 |
| End-to-End Object Detection with Transformers | ✓ Link | 45.1 | 37.0 | 50.5 | 55.5 | 79.9 | 46.0 | 61.7 | 78.5 | 80.9 | 33.0 | | DETR-R101 (ResNet-101) | 2020-05-26 |
| Fully Convolutional Networks for Panoptic Segmentation | ✓ Link | 44.3 | 35.6 | 50.0 | 53.0 | 80.7 | 43.5 | 59.3 | 76.7 | 83.4 | | | Panoptic FCN* (ResNet-50-FPN) | 2020-12-01 |
| End-to-End Object Detection with Transformers | ✓ Link | 44.1 | 33.6 | 51.0 | 53.3 | 79.5 | 42.1 | 60.6 | 74.0 | 83.2 | 39.7 | | PanopticFPN++ | 2020-05-26 |
| Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | ✓ Link | 43.9 | 36.8 | 48.6 | | | | | | | | | Axial-DeepLab-L (multi-scale) | 2020-03-17 |
| Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | ✓ Link | 43.4 | 35.6 | 48.5 | | | | | | | | | Axial-DeepLab-L (single-scale) | 2020-03-17 |
| Fully Convolutional Networks for Panoptic Segmentation | ✓ Link | | | 58.5 | 61.6 | 83.2 | 51.1 | 68.6 | 81.1 | 84.6 | | | Panoptic FCN* (Swin-L, single-scale) | 2020-12-01 |
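
For reference, the PQ/SQ/RQ column families above follow the standard panoptic quality metrics of Kirillov et al. ("Panoptic Segmentation", CVPR 2019): predicted and ground-truth segments of a category are matched at IoU > 0.5, and PQ factors into segmentation quality (SQ) and recognition quality (RQ); the `st` and `th` suffixes denote averages over stuff classes and thing classes only. A minimal statement of the per-category definition:

```latex
% Panoptic quality for one category. TP is the set of matched
% (prediction, ground truth) segment pairs, FP the unmatched
% predictions, FN the unmatched ground-truth segments; a pair
% matches when IoU(p, g) > 0.5.
\[
\mathrm{PQ}
  = \frac{\sum_{(p,g) \in \mathit{TP}} \mathrm{IoU}(p, g)}
         {|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
  = \underbrace{\frac{\sum_{(p,g) \in \mathit{TP}} \mathrm{IoU}(p, g)}
                     {|\mathit{TP}|}}_{\text{SQ}}
    \times
    \underbrace{\frac{|\mathit{TP}|}
                     {|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{RQ}}
\]
```

Because the table reports averages over categories, the listed PQ is generally not the product of the listed SQ and RQ (e.g., for DETR-R101, 79.9 × 55.5 / 100 ≈ 44.3, while its averaged PQ is 45.1); the identity PQ = SQ × RQ holds exactly only within a single category.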