Spectrum-guided Multi-granularity Referring Video Object Segmentation | ✓ Link | 0.585 | 0.799 | 0.720 | 0.843 | 0.822 | 0.767 | 0.617 | 0.259 | SgMg (Video-Swin-B) | 2023-07-25 |
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | ✓ Link | 0.573 | 0.807 | 0.725 | 0.851 | 0.827 | 0.765 | 0.607 | 0.252 | SOC (Video-Swin-B) | 2023-05-26 |
Language as Queries for Referring Video Object Segmentation | ✓ Link | 0.550 | 0.786 | 0.703 | 0.831 | 0.804 | 0.741 | 0.579 | 0.212 | ReferFormer (Video-Swin-B) | 2022-01-03 |
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | ✓ Link | 0.504 | 0.747 | 0.669 | 0.79 | 0.756 | 0.687 | 0.535 | 0.195 | SOC (Video-Swin-T) | 2023-05-26 |
Multi-Attention Network for Compressed Video Referring Object Segmentation | ✓ Link | 0.471 | 0.726 | 0.632 | 0.734 | 0.682 | 0.579 | 0.389 | 0.132 | MANET | 2022-07-26 |
Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation | | 0.469 | 0.714 | 0.598 | 0.702 | 0.663 | 0.585 | 0.428 | 0.151 | VLIDE | 2022-03-30 |
Local-Global Context Aware Transformer for Language-Guided Video Segmentation | ✓ Link | 0.465 | 0.69 | 0.597 | 0.709 | 0.64 | 0.525 | 0.351 | 0.101 | Locater | 2022-03-18 |
End-to-End Referring Video Object Segmentation with Multimodal Transformers | ✓ Link | 0.461 | 0.72 | 0.64 | 0.754 | 0.712 | 0.638 | 0.485 | 0.169 | MTTR (w=10) | 2021-11-29 |
End-to-End Referring Video Object Segmentation with Multimodal Transformers | ✓ Link | 0.447 | 0.702 | 0.618 | 0.721 | 0.684 | 0.607 | 0.456 | 0.164 | MTTR (w=8) | 2021-11-29 |
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation | ✓ Link | 0.419 | 0.673 | 0.558 | 0.645 | 0.597 | 0.523 | 0.375 | 0.13 | mmmmtbvs | 2022-04-06 |
Cross-Modal Progressive Comprehension for Referring Segmentation | ✓ Link | 0.404 | 0.653 | 0.573 | 0.655 | 0.592 | 0.506 | 0.342 | 0.098 | CMPC-V (I3D) | 2021-05-15 |
Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation | | 0.399 | 0.662 | 0.561 | 0.654 | 0.589 | 0.497 | 0.333 | 0.091 | Hui et al. | 2021-05-14 |
Actor and Action Modular Network for Text-based Video Segmentation | | 0.396 | 0.617 | 0.552 | 0.681 | 0.629 | 0.523 | 0.296 | 0.029 | AAMN | 2020-11-02 |
Polar Relative Positional Encoding for Video-Language Segmentation | | 0.388 | 0.661 | 0.529 | 0.634 | 0.579 | 0.483 | 0.322 | 0.083 | PRPE | 2020-07-20 |
Cross-Modal Progressive Comprehension for Referring Segmentation | ✓ Link | 0.351 | 0.649 | 0.515 | 0.590 | 0.527 | 0.434 | 0.284 | 0.068 | CMPC-V (R2D) | 2021-05-15 |
Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries | | 0.333 | 0.623 | 0.531 | 0.607 | 0.525 | 0.405 | 0.235 | 0.045 | CMDy | 2020-04-03 |
Visual-Textual Capsule Routing for Text-Based Video Segmentation | | 0.303 | 0.568 | 0.460 | 0.526 | 0.450 | 0.345 | 0.207 | 0.036 | VT-Capsule | 2020-06-01 |
Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query | ✓ Link | 0.274 | 0.601 | 0.490 | 0.557 | 0.459 | 0.319 | 0.16 | 0.02 | ACGA | 2019-10-01 |
Actor and Action Video Segmentation from a Sentence | ✓ Link | 0.215 | 0.551 | 0.426 | 0.5 | 0.376 | 0.231 | 0.094 | 0.004 | Gavrilyuk et al. (Optical flow) | 2018-03-20 |
Actor and Action Video Segmentation from a Sentence | ✓ Link | 0.198 | 0.536 | 0.421 | 0.475 | 0.347 | 0.211 | 0.08 | 0.002 | Gavrilyuk et al. | 2018-03-20 |
Tracking by Natural Language Specification | | 0.163 | 0.515 | 0.354 | 0.387 | 0.290 | 0.175 | 0.066 | 0.001 | Li et al. | 2017-07-01 |
Segmentation from Natural Language Expressions | ✓ Link | 0.132 | 0.474 | 0.350 | 0.348 | 0.236 | 0.133 | 0.033 | 0.000 | Hu et al. | 2016-03-20 |
Hierarchical interaction network for video object segmentation from referring expressions | | | 0.679 | 0.529 | 0.611 | 0.559 | 0.486 | 0.342 | 0.12 | HINet | 2021-11-22 |
Hierarchical interaction network for video object segmentation from referring expressions | | | 0.672 | 0.497 | 0.578 | 0.534 | 0.456 | 0.311 | 0.093 | RefVOS | 2021-11-22 |
ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | | | 0.644 | 0.655 | 0.704 | 0.677 | 0.617 | 0.489 | 0.171 | ClawCraneNet | 2021-03-19 |
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network | | | 0.618 | 0.432 | 0.487 | 0.431 | 0.358 | 0.231 | 0.052 | CMSA+CFSA | 2021-02-09 |
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation | ✓ Link | | 0.599 | 0.599 | 0.495 | | | | 0.064 | RefVOS | 2020-10-01 |