Spectrum-guided Multi-granularity Referring Video Object Segmentation | ✓ Link | 0.450 | 0.737 | 0.725 | 0.972 | 0.917 | 0.714 | 0.225 | 0.003 | SgMg (Video-Swin-B) | 2023-07-25 |
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | ✓ Link | 0.446 | 0.736 | 0.723 | 0.969 | 0.914 | 0.711 | 0.213 | 0.001 | SOC (Video-Swin-B) | 2023-05-26 |
Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation | | 0.441 | 0.680 | 0.666 | 0.874 | 0.791 | 0.586 | 0.182 | 0.30 | VLIDE | 2022-03-30 |
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | ✓ Link | 0.397 | 0.707 | 0.701 | 0.947 | 0.864 | 0.627 | 0.179 | 0.001 | SOC (Video-Swin-T) | 2023-05-26 |
End-to-End Referring Video Object Segmentation with Multimodal Transformers | ✓ Link | 0.392 | 0.701 | 0.698 | 0.939 | 0.852 | 0.616 | 0.166 | 0.001 | MTTR (w=10) | 2021-11-29 |
End-to-End Referring Video Object Segmentation with Multimodal Transformers | ✓ Link | 0.366 | 0.674 | 0.679 | 0.910 | 0.815 | 0.570 | 0.144 | 0.001 | MTTR (w=8) | 2021-11-29 |
Cross-Modal Progressive Comprehension for Referring Segmentation | ✓ Link | 0.342 | 0.616 | 0.617 | 0.813 | 0.657 | 0.371 | 0.070 | 0.000 | CMPC-V | 2021-05-15 |
Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation | | 0.335 | 0.598 | 0.604 | 0.783 | 0.639 | 0.378 | 0.076 | 0.000 | Hui et al. | 2021-05-14 |
Actor and Action Modular Network for Text-based Video Segmentation | | 0.321 | 0.583 | 0.576 | 0.773 | 0.627 | 0.360 | 0.044 | 0.000 | AAMN | 2020-11-02 |
Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries | | 0.301 | 0.554 | 0.576 | 0.742 | 0.587 | 0.316 | 0.047 | 0.000 | CMDy | 2020-04-03 |
Polar Relative Positional Encoding for Video-Language Segmentation | | 0.294 | | | 0.572 | 0.690 | 0.319 | 0.060 | 0.001 | PRPE | 2020-07-20 |
Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query | ✓ Link | 0.289 | 0.576 | 0.584 | 0.756 | 0.564 | 0.287 | 0.034 | 0.000 | ACGA | 2019-10-01 |
Actor and Action Video Segmentation from a Sentence | ✓ Link | 0.267 | 0.555 | 0.570 | 0.712 | 0.518 | 0.264 | 0.030 | 0.000 | Gavrilyuk et al. (Optical flow) | 2018-03-20 |
Visual-Textual Capsule Routing for Text-Based Video Segmentation | | 0.261 | 0.535 | 0.550 | 0.677 | 0.513 | 0.283 | 0.051 | 0.000 | VT-Capsule | 2020-06-01 |
Actor and Action Video Segmentation from a Sentence | ✓ Link | 0.233 | 0.541 | 0.542 | 0.699 | 0.460 | 0.173 | 0.014 | 0.000 | Gavrilyuk et al. | 2018-03-20 |
Segmentation from Natural Language Expressions | ✓ Link | 0.178 | 0.546 | 0.528 | 0.633 | 0.350 | 0.085 | 0.002 | 0.000 | Hu et al. | 2016-03-20 |
Tracking by Natural Language Specification | | 0.173 | 0.529 | 0.491 | 0.578 | 0.335 | 0.103 | 0.060 | 0.000 | Li et al. | 2017-07-01 |
Hierarchical interaction network for video object segmentation from referring expressions | | | 0.652 | 0.627 | 0.819 | 0.736 | 0.542 | 0.168 | 0.4 | HINet | 2021-11-22 |
ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | | | 0.644 | 0.655 | 0.880 | 0.796 | 0.566 | 0.147 | 0.002 | ClawCraneNet | 2021-03-19 |
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network | | | 0.628 | 0.581 | 0.764 | 0.625 | 0.389 | 0.090 | 0.001 | CMSA+CFSA | 2021-02-09 |
Hierarchical interaction network for video object segmentation from referring expressions | | | 0.606 | 0.568 | 0.731 | 0.620 | 0.392 | 0.088 | 0.000 | RefVOS | 2021-11-22 |