Multi-label Cluster Discrimination for Visual Representation Learning | ✓ Link | 75.6 | | | MLCD-Seg-7B | 2024-07-24 |
HyperSeg: Towards Universal Visual Segmentation with Large Language Model | ✓ Link | 75.2 | | | HyperSeg | 2024-11-26 |
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | ✓ Link | 71.9 | | | EVF-SAM | 2024-06-28 |
Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation | ✓ Link | 70.2 | | | DETRIS | 2025-01-15 |
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | ✓ Link | 68.95 | | | C3VG | 2025-01-12 |
Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 68.15 | | | UniLSeg-100 | 2023-12-04 |
Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 66.99 | | | UniLSeg-20 | 2023-12-04 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 66.22 | | | UNINEXT-H | 2023-03-12 |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | 64.9 | | | GROUNDHOG | 2024-02-26 |
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation | | 64.88 | | | SafaRi-B | 2024-07-02 |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 62.83 | | | MaskRIS (Swin-B, combined DB) | 2024-11-28 |
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 61.87 | 66.73 | | PolyFormer-L | 2023-02-14 |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 59.39 | 64.5 | | MaskRIS (Swin-B) | 2024-11-28 |
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 59.33 | 64.64 | | PolyFormer-B | 2023-02-14 |
Mask Grounding for Referring Image Segmentation | ✓ Link | 58.14 | | | MagNet | 2023-12-19 |
GRES: Generalized Referring Expression Segmentation | ✓ Link | 57.65 | | | ReLA | 2023-06-01 |
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 56.92 | | | VLT | 2022-10-28 |
MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation | | 56.06 | | | MaIL | 2021-11-21 |
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | ✓ Link | 55.1 | | | LAVT | 2021-12-04 |
CRIS: CLIP-Driven Referring Image Segmentation | ✓ Link | 53.68 | | | CRIS | 2021-11-30 |
Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 49.36 | | | VLT | 2021-08-12 |
Comprehensive Multi-Modal Interactions for Referring Image Segmentation | ✓ Link | 44.12 | | | SHNet | 2021-04-21 |
Referring Image Segmentation via Cross-Modal Progressive Comprehension | ✓ Link | 43.23 | | | CMPC | 2020-10-01 |
Bi-Directional Relationship Inferring Network for Referring Image Segmentation | | 42.13 | | | BRINet | 2020-06-01 |
See-Through-Text Grouping for Referring Image Segmentation | | 40.41 | | | STEP (5-fold) | 2019-10-01 |
MAttNet: Modular Attention Network for Referring Expression Comprehension | ✓ Link | 40.08 | | | MAttNet | 2018-01-24 |
Cross-Modal Self-Attention Network for Referring Image Segmentation | ✓ Link | 37.89 | | | CMSA | 2019-04-09 |
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation | ✓ Link | 36.17 | | | RefVOS with BERT + MLM loss | 2020-10-01 |
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy | ✓ Link | | 78.59 | | DeRIS-L | 2025-07-02 |
Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding | ✓ Link | | | 62.52 | VATEX | 2024-04-12 |