DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries | ✓ Link | 57.1 | 83.8 | 62.9 | | | | | | DVIS-DAQ(VIT-L, Offline) | 2024-03-29 |
Context-Aware Video Instance Segmentation | ✓ Link | 57.1 | 82.6 | 63.5 | | | 21.2 | | 61.8 | CAVIS(VIT-L, Offline) | 2024-07-03 |
DVIS++: Improved Decoupled Framework for Universal Video Segmentation | ✓ Link | 53.4 | 78.9 | 58.5 | | | | | | DVIS++(VIT-L, Offline) | 2023-12-20 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 50.4 | | 55.5 | | | | | | GLEE-Pro | 2023-12-14 |
DVIS: Decoupled Video Instance Segmentation Framework | ✓ Link | 49.9 | 75.9 | 53.0 | | | 19.4 | | 55.3 | DVIS(Swin-L, Offline) | 2023-06-06 |
DVIS++: Improved Decoupled Framework for Universal Video Segmentation | ✓ Link | 49.6 | 72.5 | 55.0 | 27.1 | 56.6 | 20.8 | 69.9 | 54.6 | DVIS++(VIT-L, Online) | 2023-12-20 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 49.0 | 72.5 | 52.2 | | | | | | UNINEXT (ViT-H, Online) | 2023-03-12 |
DVIS: Decoupled Video Instance Segmentation Framework | ✓ Link | 47.1 | 71.9 | 49.2 | | | 19.4 | | 52.5 | DVIS(Swin-L, Online) | 2023-06-06 |
CTVIS: Consistent Training for Online Video Instance Segmentation | ✓ Link | 46.9 | 71.5 | 47.5 | 19.1 | 52.1 | | | | CTVIS (Swin-L) | 2023-07-24 |
RefineVIS: Video Instance Segmentation with Temporal Attention Refinement | | 46.0 | 70.4 | 48.4 | | | 19.1 | | 51.2 | RefineVIS (Swin-L, offline) | 2023-06-07 |
GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation | ✓ Link | 45.7 | 69.1 | 47.8 | | | 19.2 | | 49.4 | GRAtt-VIS (Swin-L) | 2023-05-26 |
A Generalized Framework for Video Instance Segmentation | ✓ Link | 45.4 | 69.2 | 47.8 | | | 18.9 | | 49.0 | GenVIS (Swin-L) | 2022-11-16 |
NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation | | 43.5 | 68.3 | 43.8 | | | 19.4 | | 46.9 | NOVIS (Swin-L) | 2023-08-29 |
TarViS: A Unified Approach for Target-based Video Segmentation | ✓ Link | 43.2 | 67.8 | 44.6 | | | 18.0 | | 50.4 | TarViS (Swin-L) | 2023-01-06 |
MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos | ✓ Link | 42.6 | 67.8 | 44.3 | 21.6 | 49.3 | 18.3 | 65.1 | 46.5 | MDQE(SwinL) | 2023-03-25 |
In Defense of Online Models for Video Instance Segmentation | ✓ Link | 42.6 | 65.7 | 45.2 | | | 17.9 | | 49.6 | IDOL (Swin-L) | 2022-07-21 |
Robust Online Video Instance Segmentation with Track Queries | ✓ Link | 42.6 | 64.7 | 42.6 | | | 18.4 | | 49.1 | ROVIS (Swin-L) | 2022-11-16 |
UniVS: Unified and Universal Video Segmentation with Prompts as Queries | ✓ Link | 41.7 | | | | | | | | UniVS(Swin-L) | 2024-02-28 |
DVIS++: Improved Decoupled Framework for Universal Video Segmentation | ✓ Link | 41.2 | 68.9 | 40.9 | | | 16.8 | | 47.3 | DVIS++(R50, Offline) | 2023-12-20 |
BoxVIS: Video Instance Segmentation with Box Annotations | ✓ Link | 40.6 | 68.4 | 39.9 | 20.9 | 45.8 | | 59.4 | | BoxVIS(Swin-L & Box-sup) | 2023-03-26 |
MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training | ✓ Link | 39.4 | 61.5 | 41.3 | | | 18.1 | | 43.3 | MinVIS (Swin-L) | 2022-08-03 |
DVIS++: Improved Decoupled Framework for Universal Video Segmentation | ✓ Link | 37.2 | 62.8 | 37.3 | | | 15.8 | | 42.9 | DVIS++(R50, Online) | 2023-12-20 |
GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation | ✓ Link | 36.2 | 60.8 | 36.8 | | | 16.8 | | 40.1 | GRAtt-VIS (ResNet-50) | 2023-05-26 |
CTVIS: Consistent Training for Online Video Instance Segmentation | ✓ Link | 35.5 | 60.8 | 34.9 | 16.1 | 41.9 | | | | CTVIS (ResNet-50) | 2023-07-24 |
DeVIS: Making Deformable Transformers Work for Video Instance Segmentation | ✓ Link | 35.5 | 59.3 | 38.3 | | | 16.6 | | 39.8 | DeVIS (Swin-L) | 2022-07-22 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 34.0 | 55.5 | 35.6 | | | | | | UNINEXT (ResNet-50, Online) | 2023-03-12 |
TarViS: A Unified Approach for Target-based Video Segmentation | ✓ Link | 34.0 | 55.0 | 34.4 | | | 16.1 | | 40.9 | TarViS (Swin-T) | 2023-01-06 |
NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation | | 32.7 | 56.2 | 32.6 | | | 15.7 | | 37.1 | NOVIS (ResNet-50) | 2023-08-29 |
TarViS: A Unified Approach for Target-based Video Segmentation | ✓ Link | 31.1 | 52.5 | 30.4 | | | 15.9 | | 39.9 | TarViS (ResNet-50) | 2023-01-06 |
In Defense of Online Models for Video Instance Segmentation | ✓ Link | 30.2 | 51.3 | 30.0 | | | 15.0 | | 37.5 | IDOL (ResNet-50) | 2022-07-21 |
Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation | ✓ Link | 29.5 | 51.5 | 30.2 | | | 15.5 | | 34.5 | Tube-Link(ResNet-50) | 2023-03-22 |
VITA: Video Instance Segmentation via Object Token Association | ✓ Link | 27.7 | 51.9 | 24.9 | | | 14.9 | | 33.0 | VITA (Swin-L) | 2022-06-09 |
DeVIS: Making Deformable Transformers Work for Video Instance Segmentation | ✓ Link | 23.7 | 47.6 | 20.8 | | | 12.0 | | 28.9 | DeVIS (ResNet-50) | 2022-07-22 |
InstanceFormer: An Online Video Instance Segmentation Framework | ✓ Link | 22.8 | 42.5 | 21.61 | | | 12.9 | | 29.3 | InstanceFormer (Swin-L) | 2022-08-22 |
InstanceFormer: An Online Video Instance Segmentation Framework | ✓ Link | 20.0 | 40.7 | 18.1 | | | 12.0 | | 27.1 | InstanceFormer(ResNet-50) | 2022-08-22 |
Crossover Learning for Fast Online Video Instance Segmentation | ✓ Link | 18.1 | 35.5 | 16.9 | | | | | | CrossVIS (ResNet-50, calibration) | 2021-04-13 |
Temporally Efficient Vision Transformer for Video Instance Segmentation | ✓ Link | 17.4 | 34.9 | 15.0 | | | | | | TeViT (ResNet-50) | 2022-04-18 |
Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation | ✓ Link | 17.3 | 35.4 | 15.2 | 23.7 | 14.7 | 8.4 | 11.1 | 23.1 | STMask(R101-DCN-FPN) | 2021-04-06 |
Mask2Former for Video Instance Segmentation | ✓ Link | 16.6 | 36.9 | 14.1 | | | 9.9 | | 24.7 | Mask2Former-VIS | 2021-12-20 |
STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation | | 15.5 | 33.5 | 13.4 | | | | | | STC (ResNet-50) | 2022-02-08 |
Occluded Video Instance Segmentation: A Benchmark | ✓ Link | 15.4 | 33.9 | 13.1 | 4.1 | 18.7 | | 28.6 | | CMaskTrack R-CNN (ResNet-50) | 2021-02-02 |
D2Conv3D: Dynamic Dilated Convolutions for Object Segmentation in Videos | ✓ Link | 15.2 | 33.8 | 13.7 | | | | | | D2Conv3D (ResNet-50) | 2021-11-15 |
Crossover Learning for Fast Online Video Instance Segmentation | ✓ Link | 14.9 | 32.7 | 12.1 | | | | | | CrossVIS (ResNet-50) | 2021-04-13 |
Occluded Video Instance Segmentation: A Benchmark | ✓ Link | 14.3 | 29.9 | 12.5 | 2.7 | 12.8 | | 23.0 | | CSipMask (ResNet-50) | 2021-02-02 |