HyperSeg: Towards Universal Visual Segmentation with Large Language Model | ✓ Link | 64.6 | HyperSeg | 2024-11-26 |
SILC: Improving Vision Language Pretraining with Self-Distillation | | 63.5 | SILC | 2023-10-20 |
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation | ✓ Link | 63.3 | CAT-Seg | 2023-03-21 |
MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation | ✓ Link | 62.5 | MaskCLIP++ | 2024-12-16 |
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | ✓ Link | 62.3 | CLIPSelf | 2023-10-02 |
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding | ✓ Link | 61.0 | UMG-CLIP-L/14 | 2024-01-12 |
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation | ✓ Link | 60.6 | SED | 2023-11-27 |
Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation | ✓ Link | 60.4 | Mask-Adapter | 2024-12-05 |
Open-Vocabulary Semantic Segmentation with Image Embedding Balancing | ✓ Link | 60.2 | EBSeg-L | 2024-06-14 |
Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation | ✓ Link | 59.4 | MAFT+ | 2024-08-01 |
Open-Vocabulary Segmentation with Semantic-Assisted Calibration | ✓ Link | 59.3 | SCAN | 2023-12-07 |
Learning Mask-aware CLIP Representations for Zero-Shot Segmentation | ✓ Link | 58.5 | MAFT-ViTL | 2023-09-30 |
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP | ✓ Link | 58.4 | FC-CLIP | 2023-08-04 |
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models | ✓ Link | 57.3 | ODISE | 2023-03-08 |
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP | ✓ Link | 55.7 | OVSeg Swin-B | 2022-10-09 |
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning | ✓ Link | 50.1 | PACL | 2022-12-09 |
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | ✓ Link | 47.7 | SimSeg | 2021-12-29 |
Open-Vocabulary Universal Image Segmentation with MaskCLIP | ✓ Link | 45.9 | MaskCLIP | 2022-08-18 |
TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification | ✓ Link | 37.6 | TagAlign (trained with image-text pairs) | 2023-12-21 |
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias | ✓ Link | 37.4 | TTD (TCL) | 2024-03-30 |
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | ✓ Link | 34.7 | LaVG | 2024-08-09 |
Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs | ✓ Link | 33.9 | TCL | 2022-12-01 |
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias | ✓ Link | 31.0 | TTD (MaskCLIP) | 2024-03-30 |
A Closer Look at the Explainability of Contrastive Language-Image Pre-training | ✓ Link | 29.3 | CLIP Surgery (original CLIP without any fine-tuning) | 2023-04-12 |