Paper | Code | J&F | J | F | Model | Date |
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation | ✓ Link | 73.9 | 71.7 | 76.1 | MPG-SAM 2 | 2025-01-23 |
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation | ✓ Link | 71 | 69 | 73.1 | VRS-HQ (Chat-UniVi-13B) | 2025-01-15 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 70.6 | 68.2 | 72.9 | GLEE-Pro | 2023-12-14 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 70.1 | 67.6 | 72.7 | UNINEXT-H | 2023-03-12 |
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations | | 69.3 | 67.0 | 71.5 | ReferDINO (Swin-B) | 2025-01-24 |
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation | ✓ Link | 68.4 | 66.4 | 70.4 | MUTR | 2023-05-25 |
Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation | | 67.6 | 65.3 | 69.8 | VLP (VLMo-L) | 2024-05-17 |
Segment Every Reference Object in Spatial and Temporal Spaces | | 67.4 | 65.5 | 69.2 | UniRef-L (Swin-L) | 2023-01-01 |
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | ✓ Link | 67.3±0.5 | 65.3 | 69.3 | SOC (Joint training, Video-Swin-B) | 2023-05-26 |
Temporally Consistent Referring Video Object Segmentation with Hybrid Memory | ✓ Link | 67.1 | 65.3 | 68.9 | HTR (Pre-training) | 2024-03-28 |
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation | ✓ Link | 67.1 | 65 | 69.1 | DsHmp (Video-Swin-Base) | 2024-04-04 |
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces | ✓ Link | 66.9 | 64.8 | 69.0 | UniRef++-L | 2023-12-25 |
ViLLa: Video Reasoning Segmentation with Large Language Model | ✓ Link | 66.5 | 64.6 | 68.6 | ViLLa | 2024-07-18 |
Tracking Anything with Decoupled Video Segmentation | ✓ Link | 66.0 | | | DEVA (ReferFormer) | 2023-09-07 |
Spectrum-guided Multi-granularity Referring Video Object Segmentation | ✓ Link | 65.7 | 63.9 | 67.4 | SgMg (Pre-training) | 2023-07-25 |
GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation | | 65.5 | 64.1 | 66.9 | GroPrompt | 2024-06-18 |
Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation | | 65 | 62.9 | 67.2 | EPCFormer (ViT-H) | 2023-08-08 |
Universal Segmentation at Arbitrary Granularity with Language Instruction | ✓ Link | 64.9 | 62.8 | 67.0 | UniLSeg-100 | 2023-12-04 |
LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation | ✓ Link | 64.2 | 62.5 | 66.0 | LoSh-R | 2023-06-14 |
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | ✓ Link | 63.8 | 61.9 | 65.6 | VLT | 2022-10-28 |
OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation | ✓ Link | 63.5 | 61.6 | 65.5 | OnlineRefer (Swin-L, online) | 2023-07-18 |
Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus | ✓ Link | 61.3 | 59.6 | 63.1 | R2VOS (Video-Swin-T) | 2022-07-04 |
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | ✓ Link | 59.2 | 57.8 | 60.5 | SOC (Video-Swin-T) | 2023-05-26 |
UniVS: Unified and Universal Video Segmentation with Prompts as Queries | ✓ Link | 58.0 | 56.8 | 59.5 | UniVS (Swin-L) | 2024-02-28 |
Language as Queries for Referring Video Object Segmentation | ✓ Link | 57.3 | 56.1 | 58.4 | ReferFormer (ResNet-101) | 2022-01-03 |
Multi-Attention Network for Compressed Video Referring Object Segmentation | ✓ Link | 55.63 | 54.75 | 56.51 | MANET | 2022-07-26 |
Language as Queries for Referring Video Object Segmentation | ✓ Link | 55.6 | 54.8 | 56.6 | ReferFormer (ResNet-50) | 2022-01-03 |
End-to-End Referring Video Object Segmentation with Multimodal Transformers | ✓ Link | 55.32 | 54.00 | 56.64 | MTTR (w=12) | 2021-11-29 |
Local-Global Context Aware Transformer for Language-Guided Video Segmentation | ✓ Link | 50 | 48.8 | 51.1 | Locater | 2022-03-18 |
Multi-Level Representation Learning With Semantic Alignment for Referring Video Object Segmentation | | 49.70 | 50.96 | 48.43 | MLRLSA | 2022-01-01 |
Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation | | 49.56 | 48.44 | 50.67 | VLIDE | 2022-03-30 |
URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark | ✓ Link | 48.9 | 47.0 | 50.8 | URVOS | |
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | ✓ Link | 34.2 | | | InternVideo2.5 | 2025-01-21 |