DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy | ✓ Link | 85.41 | | | | | | | 85.72 | DeRIS-L | 2025-07-02 |
HyperSeg: Towards Universal Visual Segmentation with Large Language Model | ✓ Link | 84.8 | | | | | | | | HyperSeg | 2024-11-26 |
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | ✓ Link | 83.6 | | | | | | | | PSALM | 2024-03-21 |
Multi-label Cluster Discrimination for Visual Representation Learning | ✓ Link | 83.6 | | | | | | | | MLCD-Seg-7B | 2024-07-24 |
Hierarchical Open-vocabulary Universal Image Segmentation | ✓ Link | 82.8 | | | | | | | | HIPIE | 2023-07-03 |
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | ✓ Link | 82.4 | | | | | | | | EVF-SAM | 2024-06-28 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 82.19 | | | | | | | | UNINEXT-H | 2023-03-12 |
Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 81.74 | | | | | | | | UniLSeg-100 | 2023-12-04 |
Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation | ✓ Link | 81.0 | | | | | | | | DETRIS | 2025-01-15 |
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | ✓ Link | 80.89 | | | | | | | | C3VG | 2025-01-12 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 80.0 | | | | | | | | GLEE-Pro | 2023-12-14 |
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | ✓ Link | 79.7 | | | | | | | | SegAgent | 2025-03-11 |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 78.71 | | | | | | | | MaskRIS (Swin-B, combined DB) | 2024-11-28 |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | 78.5 | | | | | | | | GROUNDHOG | 2024-02-26 |
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation | | 77.21 | | | | | | | | SafaRi-B | 2024-07-02 |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 76.49 | | | | | | | 78.35 | MaskRIS (Swin-B) | 2024-11-28 |
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 75.96 | | | | | | | 76.94 | PolyFormer-L | 2023-02-14 |
Mask Grounding for Referring Image Segmentation | ✓ Link | 75.24 | | | | | | | | MagNet | 2023-12-19 |
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 74.82 | | | | | | | | PolyFormer-B | 2023-02-14 |
GRES: Generalized Referring Expression Segmentation | ✓ Link | 73.82 | | | | | | | | ReLA | 2023-06-01 |
Unleashing Text-to-Image Diffusion Models for Visual Perception | ✓ Link | 73.25 | | | | | | | | VPD | 2023-03-03 |
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 72.96 | | | | | | | | VLT | 2022-10-28 |
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation | ✓ Link | 71.06 | | | | | | | | ETRIS | 2023-07-21 |
Referring Transformer: A One-step Approach to Multi-task Visual Grounding | ✓ Link | 70.56 | | | | | | | | RefTR | 2021-06-06 |
CRIS: CLIP-Driven Referring Image Segmentation | ✓ Link | 70.47 | | | | | | | | CRIS | 2021-11-30 |
MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation | | 70.13 | | | | | | | | MaIL | 2021-11-21 |
Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 65.65 | | | | | | | | VLT | 2021-08-12 |
Comprehensive Multi-Modal Interactions for Referring Image Segmentation | ✓ Link | 65.32 | | 75.18 | 69.36 | 61.21 | 46.16 | 16.23 | | SHNet | 2021-04-21 |
Referring Image Segmentation via Cross-Modal Progressive Comprehension | ✓ Link | 61.36 | | | | | | | | CMPC | 2020-10-01 |
Bi-Directional Relationship Inferring Network for Referring Image Segmentation | | 61.35 | | | | | | | | BRINet | 2020-06-01 |
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation | ✓ Link | 59.45 | | | | | | | | RefVOS with BERT + MLM loss | 2020-10-01 |
Referring Expression Object Segmentation with Caption-Aware Consistency | ✓ Link | 58.90 | | | | | | | | LANG2SEG | 2019-10-10 |
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation | ✓ Link | 58.65 | | | | | | | | RefVOS with BERT Pre-train | 2020-10-01 |
Cross-Modal Self-Attention Network for Referring Image Segmentation | ✓ Link | 58.32 | | | | | | | | CMSA | 2019-04-09 |
See-Through-Text Grouping for Referring Image Segmentation | | 56.58 | | | | | | | | STEP (1-fold) | 2019-10-01 |
MAttNet: Modular Attention Network for Referring Expression Comprehension | ✓ Link | 56.51 | | | | | | | | MAttNet | 2018-01-24 |
Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding | ✓ Link | | 78.16 | | | | | | | VATEX | 2024-04-12 |