Multi-label Cluster Discrimination for Visual Representation Learning | ✓ Link | 79.9 | | | | MLCD-Seg-7B | 2024-07-24 |
HyperSeg: Towards Universal Visual Segmentation with Large Language Model | ✓ Link | 79.4 | | | | HyperSeg | 2024-11-26 |
Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 79.27 | | | | UniLSeg-100 | 2023-12-04 |
Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 78.41 | | | | UniLSeg-20 | 2023-12-04 |
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | ✓ Link | 78.2 | | | | EVF-SAM | 2024-06-28 |
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | ✓ Link | 75.11 | | | | SegAgent | 2025-03-11 |
Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation | ✓ Link | 74.6 | | | | DETRIS | 2025-01-15 |
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | ✓ Link | 74.43 | | | | C3VG | 2025-01-12 |
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | 74.1 | | | | GROUNDHOG | 2024-02-26 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 72.9 | | | | GLEE-Pro | 2023-12-14 |
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation | | 70.48 | | | | SafaRi-B | 2024-07-02 |
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 69.2 | 71.15 | | | PolyFormer-L | 2023-02-14 |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 69.12 | | | | MaskRIS (Swin-B, combined DB) | 2024-11-28 |
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | ✓ Link | 67.76 | 69.36 | | | PolyFormer-B | 2023-02-14 |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | ✓ Link | 65.55 | 69.31 | | | MaskRIS (Swin-B) | 2024-11-28 |
Mask Grounding for Referring Image Segmentation | ✓ Link | 65.36 | | | | MagNet | 2023-12-19 |
Generalized Decoding for Pixel, Image, and Language | ✓ Link | 64.6 | | | | X-Decoder (Davit-d5) | 2022-12-21 |
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 63.49 | | | | VLT (Swin-B) | 2022-10-28 |
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | ✓ Link | 61.24 | | | | LAVT | 2021-12-04 |
Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 52.99 | | | | VLT (Darknet53) | 2021-08-12 |
Comprehensive Multi-Modal Interactions for Referring Image Segmentation | ✓ Link | 49.90 | | | | SHNet | 2021-04-21 |
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy | ✓ Link | | 80.01 | | | DeRIS-L | 2025-07-02 |
Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding | ✓ Link | | | 0.7554 | 69.73 | VATEX | 2024-04-12 |