Paper | Code | mask AP | AP50 | AP75 | APL | APM | APS | FLOPs (G) | Params (M) | box AP | Model | Date
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 56.6 | 79.7 | 62.8 | 74.6 | 59.7 | 38.9 | | | | Co-DETR | 2022-11-22 |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions | ✓ Link | 55.9 | | | | | | | | | ViT-CoMer-L (Mask RCNN, DINOv2) | 2024-03-13 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 55.4 | 80.1 | 61.5 | 74.4 | 58.4 | 37.9 | | | | InternImage-H | 2022-11-10 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 55.0 | 79.4 | 60.9 | 72.0 | 58.4 | 37.6 | | | | EVA | 2022-11-14 |
Mask Frozen-DETR: High Quality Instance Segmentation with One GPU | | 54.9 | 78.9 | 60.8 | 72.9 | 58.4 | 37.2 | | | | Mask Frozen-DETR | 2023-08-07 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 54.5 | | | | | | | | | Mask DINO (SwinL, multi-scale) | 2022-06-06 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 54.2 | | | | | | | | | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | 2022-05-17 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 54.2 | | | | | | | | | GLEE-Pro | 2023-12-14 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 53.7 | | | | | | | | | SwinV2-G (HTC++) | 2021-11-18 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 53.1 | | | | | | | | | ViTDet, ViT-H Cascade (multiscale) | 2022-03-30 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 53.0 | | | | | | | | | GLEE-Plus | 2023-12-14 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 52.6 | | | | | | | | | Mask DINO (SwinL) | 2022-06-06 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 52.5 | | | | | | | | | Soft Teacher + Swin-L(HTC++, multi-scale) | 2021-06-16 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 52.5 | | | | | | | | | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 2022-05-17 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 52.2 | | | | | | | | | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 2022-05-17 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 52.0 | | | | | | | | | ViTDet, ViT-H Cascade | 2022-03-30 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 51.9 | | | | | | | | | Soft Teacher + Swin-L(HTC++, single-scale) | 2021-06-16 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 51.8 | | | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
Could Giant Pretrained Image Models Extract Universal Representations? | | 51.6 | | | | | | | | | Frozen Backbone, SwinV2-G-ext22K (HTC) | 2022-11-03 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 51.0 | | | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 50.9 | | | | | | | | | Focal-L (HTC++, multi-scale) | 2021-07-01 |
Dilated Neighborhood Attention Transformer | ✓ Link | 50.8 | 75.0 | | | | | | | | DiNAT-L (single-scale, Mask2Former) | 2022-09-29 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 50.5 | | | | | | | | | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | 2021-12-02 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 50.4 | | | | | | | | | Swin-L (HTC++, multi scale) | 2021-03-25 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 50.3 | | | | | | | | | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 50.1 | | | | | | | | | Mask2Former (Swin-L) | 2021-12-02 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 49.5 | | | | | | | | | Swin-L (HTC++, single scale) | 2021-03-25 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 49.3 | | | | | | | | | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 49.0 | | | | | | | | | MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
Instances as Queries | ✓ Link | 48.9 | 74.0 | 53.9 | 68.3 | 52.6 | 30.8 | | | | QueryInst (single scale) | 2021-05-05 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 48.9 | | | | | | | | | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 2020-12-13 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 48.8 | | | | | | 1782 | 387 | | InternImage-XL | 2022-11-10 |
X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion | ✓ Link | 48.8 | | | | | | | | | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste) | 2022-12-07 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 48.6 | | | | | | | | | Hiera-L | 2023-06-01 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 48.5 | | | | | | 1399 | 277 | 56.1 | InternImage-L | 2022-11-10 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 48.5 | | | | | | | | | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 48.4 | | | | | | | | | GLEE-Lite | 2023-12-14 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 47.4 | | | | | | | | | MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 47.1 | | | | | | | | | MViTv2-L (Cascade Mask R-CNN, single-scale) | 2021-12-02 |
MPViT: Multi-Path Vision Transformer for Dense Prediction | ✓ Link | 47.0 | | | | | | | | | MPViT-B (Cascade Mask R-CNN, multi-scale, IN1k pre-train) | 2021-12-21 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 47.0 | | | | | | | | | tiny-MOAT-3 (IN-1K pretraining, single-scale) | 2022-10-04 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 46.8 | | | | | | | | | Cascade Eff-B7 NAS-FPN (1280) | 2020-12-13 |
ResNeSt: Split-Attention Networks | ✓ Link | 46.25 | | | | | | | | | ResNeSt-200 (multi-scale) | 2020-04-19 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 46.2 | | | | | | | | | MViT-L (Mask R-CNN, single-scale) | 2021-12-02 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 46.1 | | | | | | | | | RetinaNet (SpineNet-190, 1536x1536) | 2019-12-10 |
MPViT: Multi-Path Vision Transformer for Dense Prediction | ✓ Link | 45.8 | | | | | | | | | MPViT-B (Cascade R-CNN, single-scale, IN-1K pre-train) | 2021-12-21 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 45.7 | | 49.9 | | | | | | | Mask R-CNN (ViL Base, multi-scale, 3x lr) | 2021-03-29 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 45.1 | 67.2 | 49.3 | | | | | | | Mask R-CNN (ViL Base, 1x lr) | 2021-03-29 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 45.0 | | | | | | | | | tiny-MOAT-2 (IN-1K pretraining, single-scale) | 2022-10-04 |
Global Context Networks | ✓ Link | 44.7 | 67.9 | 48.4 | | | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2020-12-24 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 44.6 | | | | | | | | | tiny-MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 44.5 | | | | | | 340 | 69 | 49.7 | InternImage-S | 2022-11-10 |
ResNeSt: Split-Attention Networks | ✓ Link | 44.5 | | | | | | | | | ResNeSt-200-DCN (single-scale) | 2020-04-19 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 44.4 | 67.8 | 47.8 | | | | | | | ELSA-S (Cascade Mask RCNN) | 2021-12-23 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 44.4 | | | | | | | | | BoTNet 200 (Mask R-CNN, single scale, 72 epochs) | 2021-01-27 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 44.3 | | | | | | | | | DaViT-T (Mask R-CNN, 36 epochs) | 2022-04-07 |
ResNeSt: Split-Attention Networks | ✓ Link | 44.21 | | | | | | | | | ResNeSt-200 (single-scale) | 2020-04-19 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 43.7 | | | | | | 270 | 49 | 49.1 | InternImage-T | 2022-11-10 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 43.7 | | | | | | | | | BoTNet 152 (Mask R-CNN, single scale, 72 epochs) | 2021-01-27 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 43.7 | | | | | | | | | XCiT-M24/8 | 2021-06-17 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 43.3 | | | | | | | | | tiny-MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 43.0 | 67.3 | 46.4 | | | | | | | ELSA-S (Mask RCNN) | 2021-12-23 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 43.0 | | | | | | | | | XCiT-S24/8 | 2021-06-17 |
CenterMask: Real-Time Anchor-Free Instance Segmentation | ✓ Link | 42.5 | | | | | | | | | CenterMask-VoVNetV2-99 (multi-scale) | 2019-11-15 |
ResNeSt: Split-Attention Networks | ✓ Link | 41.56 | | | | | | | | | ResNeSt-101 (single-scale) | 2020-04-19 |
Scaling up Multi-domain Semantic Segmentation with Sentence Embeddings | | 41.4 | | | | | | | | | SIW | 2022-02-04 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 41.3 | | | | | | | | | Res2Net-101+HTC | 2019-04-02 |
Deep High-Resolution Representation Learning for Human Pose Estimation | ✓ Link | 41.0 | | | | | | | | | HTC (HRNetV2p-W48) | 2019-02-25 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 41.0 | | | | | | | | | HTC (HRNetV2p-W48) | 2019-08-20 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 40.9 | | | | | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r16) | 2019-04-25 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 40.7 | | | | | | | | | BoTNet 50 (72 epochs) | 2021-01-27 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 40.4 | 61.3 | 44.0 | 56.1 | 43.6 | 22.3 | | | | R3-CNN (ResNet-50-FPN, DCN) | 2021-04-03 |
Non-local Neural Networks | ✓ Link | 40.3 | | | | | | | | | Mask R-CNN (ResNext-152, +1 NL) | 2017-11-21 |
Attentive Normalization | ✓ Link | 40.2 | 63.2 | 43.3 | | | | | | | Mask R-CNN-FPN (AOGNet-40M) | 2019-08-04 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 40.2 | 61.1 | 43.5 | | 42.8 | 22.6 | | | | R3-CNN (ResNet-50-FPN, GC-Net) | 2021-04-03 |
CenterMask: Real-Time Anchor-Free Instance Segmentation | ✓ Link | 40.2 | | | | | | | | | CenterMask-VoVNetV2-99-3x | 2019-11-15 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 39.1 | 58.8 | 42.3 | 54.3 | 42.1 | 20.7 | | | | R3-CNN (ResNet-50-FPN, GRoIE) | 2021-04-03 |
Mask Scoring R-CNN | ✓ Link | 39.1 | | | | | | | | | Mask Scoring R-CNN (ResNet-101-FPN-DCN) | 2019-03-01 |
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization | ✓ Link | 38.34 | 61.07 | 40.82 | 56.08 | 41.73 | 18.32 | | | | Mask R-CNN-FPN (ResNeXt-101, GN+WS) | 2019-03-25 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 38.2 | 58.0 | 41.4 | 52.8 | 41.0 | 20.4 | | | | R3-CNN (ResNet-50-FPN) | 2021-04-03 |
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 38.2 | | | | | | | | | HTC (ResNet-50) | 2019-01-22 |
Mask Scoring R-CNN | ✓ Link | 38.2 | | | | | | | | | Mask Scoring R-CNN (ResNet-101 FPN) | 2019-03-01 |
Path Aggregation Network for Instance Segmentation | ✓ Link | 37.8 | | | | | | | | | PANet (ResNet-50) | 2018-03-05 |
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 37.2 | 59.3 | 39.8 | 51.2 | 41.0 | 20.2 | | | | GCNet (ResNet-50-FPN, GRoIE) | 2020-04-28 |
X-volution: On the unification of convolution and self-attention | | 37.2 | | | 53.1 | 40.0 | 19.2 | | | | Mask R-CNN (FPN, X-volution, SA) | 2021-06-04 |
Non-local Neural Networks | ✓ Link | 37.1 | | | | | | | | | Mask R-CNN (ResNet-101, +1 NL) | 2017-11-21 |
Mask Scoring R-CNN | ✓ Link | 36.0 | | | | | | | | | Mask Scoring R-CNN (ResNet-50 FPN) | 2019-03-01 |
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 35.8 | 57.1 | 38.0 | 48.7 | 39.0 | 19.1 | | | | Mask R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 35.6 | 57.6 | | 53.7 | 37.9 | 15.7 | | | | Faster R-CNN (Res2Net-50) | 2019-04-02 |
Non-local Neural Networks | ✓ Link | 35.5 | | | | | | | | | Mask R-CNN (ResNet-50, +1 NL) | 2017-11-21 |
Adaptively Connected Neural Networks | ✓ Link | 35.2 | | | | | | | | | Mask R-CNN (ResNet-50, ACNet) | 2019-04-07 |
YOLACT: Real-time Instance Segmentation | ✓ Link | 29.9 | | | | | | | | | YOLACT-550 (ResNet-50) | 2019-04-04 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | | | | | | | 501 | 115 | | InternImage-B | 2022-11-10 |
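All scores in the table are COCO instance-segmentation mask AP: precision averaged over IoU thresholds 0.50:0.05:0.95, with AP50 and AP75 the single entries at those fixed thresholds (the real evaluation uses pycocotools against val2017 annotations). As a minimal sketch on toy masks, the IoU test that decides whether a predicted mask matches a ground-truth one looks like:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

# COCO mask AP averages over these 10 IoU thresholds;
# AP50 and AP75 are the entries at 0.50 and 0.75.
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)

# Toy example: two overlapping 4x4 squares on an 8x8 grid.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 3:7] = True

iou = mask_iou(pred, gt)             # intersection 12, union 20 -> 0.6
hits = IOU_THRESHOLDS <= iou + 1e-9  # True at 0.50, 0.55, 0.60
```

At IoU 0.6 this prediction counts as a true positive for AP50 but not AP75, which is why the AP50 column runs well ahead of the overall mask AP throughout the table.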