| Multi-label Cluster Discrimination for Visual Representation Learning | ✓ Link | 79.4 | | MLCD-Seg-7B | 2024-07-24 |
| DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy | ✓ Link | 79.01 | 81.28 | DeRIS-L | 2025-07-02 |
| HyperSeg: Towards Universal Visual Segmentation with Large Language Model | ✓ Link | 79.0 | | HyperSeg | 2024-11-26 |
| EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | ✓ Link | 76.5 | | EVF-SAM | 2024-06-28 |
| Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation | ✓ Link | 75.2 | | DETRIS | 2025-01-15 |
| Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | ✓ Link | 74.68 | | C3VG | 2025-01-12 |
| Hierarchical Open-vocabulary Universal Image Segmentation | ✓ Link | 73.9 | | HIPIE | 2023-07-03 |
| Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 73.18 | | UniLSeg-100 | 2023-12-04 |
| Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 72.70 | | UniLSeg-20 | 2023-12-04 |
| SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | ✓ Link | 72.49 | | SegAgent | 2025-03-11 |
| Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 72.47 | | UNINEXT-H | 2023-03-12 |
| SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation | | 70.78 | | SafaRi-B | 2024-07-02 |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | 70.5 | | GROUNDHOG | 2024-02-26 |
| MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 70.26 | | MaskRIS (Swin-B, combined DB) | 2024-11-28 |
| General Object Foundation Model for Images and Videos at Scale | ✓ Link | 69.6 | | GLEE-Pro | 2023-12-14 |
| PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 69.33 | 72.15 | PolyFormer-L | 2023-02-14 |
| PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 67.64 | 70.65 | PolyFormer-B | 2023-02-14 |
| MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 67.54 | 71.68 | MaskRIS (Swin-B) | 2024-11-28 |
| Mask Grounding for Referring Image Segmentation | ✓ Link | 66.16 | | MagNet | 2023-12-19 |
| GRES: Generalized Referring Expression Segmentation | ✓ Link | 66.04 | | ReLA | 2023-06-01 |
| VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 63.53 | | VLT | 2022-10-28 |
| CRIS: CLIP-Driven Referring Image Segmentation | ✓ Link | 62.27 | | CRIS | 2021-11-30 |
| MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation | | 62.23 | | MaIL | 2021-11-21 |
| LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | ✓ Link | 62.14 | | LAVT | 2021-12-04 |
| Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 55.50 | | VLT | 2021-08-12 |
| Comprehensive Multi-Modal Interactions for Referring Image Segmentation | ✓ Link | 52.75 | | SHNet | 2021-04-21 |
| Referring Image Segmentation via Cross-Modal Progressive Comprehension | ✓ Link | 49.56 | | CMPC | 2020-10-01 |
| Bi-Directional Relationship Inferring Network for Referring Image Segmentation | | 48.57 | | BRINet | 2020-06-01 |
| See-Through-Text Grouping for Referring Image Segmentation | | 48.18 | | STEP (5-fold) | 2019-10-01 |
| MAttNet: Modular Attention Network for Referring Expression Comprehension | ✓ Link | 46.67 | | MAttNet | 2018-01-24 |
| RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation | ✓ Link | 44.71 | | RefVOS with BERT + MLM loss | 2020-10-01 |
| Cross-Modal Self-Attention Network for Referring Image Segmentation | ✓ Link | 43.76 | | CMSA | 2019-04-09 |
| Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding | ✓ Link | | 70.02 | VATEX | 2024-04-12 |