Multi-label Cluster Discrimination for Visual Representation Learning | ✓ Link | 79.4 | | MLCD-Seg-7B | 2024-07-24 |
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy | ✓ Link | 79.01 | 81.28 | DeRIS-L | 2025-07-02 |
HyperSeg: Towards Universal Visual Segmentation with Large Language Model | ✓ Link | 79.0 | | HyperSeg | 2024-11-26 |
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | ✓ Link | 76.5 | | EVF-SAM | 2024-06-28 |
Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation | ✓ Link | 75.2 | | DETRIS | 2025-01-15 |
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | ✓ Link | 74.68 | | C3VG | 2025-01-12 |
Hierarchical Open-vocabulary Universal Image Segmentation | ✓ Link | 73.9 | | HIPIE | 2023-07-03 |
Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 73.18 | | UniLSeg-100 | 2023-12-04 |
Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 72.70 | | UniLSeg-20 | 2023-12-04 |
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | ✓ Link | 72.49 | | SegAgent | 2025-03-11 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 72.47 | | UNINEXT-H | 2023-03-12 |
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation | | 70.78 | | SafaRi-B | 2024-07-02 |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | 70.5 | | GROUNDHOG | 2024-02-26 |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 70.26 | | MaskRIS (Swin-B, combined DB) | 2024-11-28 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 69.6 | | GLEE-Pro | 2023-12-14 |
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 69.33 | 72.15 | PolyFormer-L | 2023-02-14 |
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 67.64 | 70.65 | PolyFormer-B | 2023-02-14 |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 67.54 | 71.68 | MaskRIS (Swin-B) | 2024-11-28 |
Mask Grounding for Referring Image Segmentation | ✓ Link | 66.16 | | MagNet | 2023-12-19 |
GRES: Generalized Referring Expression Segmentation | ✓ Link | 66.04 | | ReLA | 2023-06-01 |
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 63.53 | | VLT | 2022-10-28 |
CRIS: CLIP-Driven Referring Image Segmentation | ✓ Link | 62.27 | | CRIS | 2021-11-30 |
MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation | | 62.23 | | MaIL | 2021-11-21 |
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | ✓ Link | 62.14 | | LAVT | 2021-12-04 |
Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 55.50 | | VLT | 2021-08-12 |
Comprehensive Multi-Modal Interactions for Referring Image Segmentation | ✓ Link | 52.75 | | SHNet | 2021-04-21 |
Referring Image Segmentation via Cross-Modal Progressive Comprehension | ✓ Link | 49.56 | | CMPC | 2020-10-01 |
Bi-Directional Relationship Inferring Network for Referring Image Segmentation | | 48.57 | | BRINet | 2020-06-01 |
See-Through-Text Grouping for Referring Image Segmentation | | 48.18 | | STEP (5-fold) | 2019-10-01 |
MAttNet: Modular Attention Network for Referring Expression Comprehension | ✓ Link | 46.67 | | MAttNet | 2018-01-24 |
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation | ✓ Link | 44.71 | | RefVOS with BERT + MLM loss | 2020-10-01 |
Cross-Modal Self-Attention Network for Referring Image Segmentation | ✓ Link | 43.76 | | CMSA | 2019-04-09 |
Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding | ✓ Link | | 70.02 | VATEX | 2024-04-12 |