Paper | Code | box AP | AP50 | AP75 | APS | APM | APL | Params (M) | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Perception Encoder: The best visual embeddings are not at the output of the network | ✓ Link | 66.0 | | | | | | 1900 | PE_spatial (DETA) | 2025-04-17 |
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 65.9 | | | | | | 314 | Co-DETR | 2022-11-22 |
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | ✓ Link | 65.0 | | | | | | | M3I Pre-training (InternImage-H) | 2022-11-17 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 65.0 | | | | | | | InternImage-H | 2022-11-10 |
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 64.7 | | | | | | 218 | Co-DETR (Swin-L) | 2022-11-22 |
A Strong and Reproducible Object Detector with Only Public Datasets | ✓ Link | 64.6 | 81.5 | 71.4 | 50.4 | 68.5 | 78.5 | 689 | Focal-Stable-DINO (Focal-Huge, no TTA) | 2023-04-25 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 64.5 | 82.1 | 70.8 | 49.4 | 68.4 | 78.5 | | EVA | 2022-11-14 |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions | ✓ Link | 64.3 | | | | | | 363 | ViT-CoMer | 2024-03-13 |
Focal Modulation Networks | ✓ Link | 64.2 | | | | | | | FocalNet-H (DINO) | 2022-03-22 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 64.2 | | | | | | | InternImage-XL | 2022-11-10 |
CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection | | 64.1 | | | | | | | CP-DETR-L Swin-L (fine-tuned separately on COCO) | 2024-12-13 |
Reversible Column Networks | ✓ Link | 63.8 | | | | | | | RevCol-H (DINO) | 2022-12-22 |
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 63.2 | | | | | | | DINO (Swin-L) | 2022-03-07 |
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | ✓ Link | 63.0 | | | | | | | Grounding DINO | 2023-03-09 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 62.5 | | | | | | | SwinV2-G (HTC++) | 2021-11-18 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 62.0 | | | | | | | Florence-CoSwin-H | 2021-11-22 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 62.0 | | | | | | | GLEE-Pro | 2023-12-14 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 61.3 | | | | | | | ViTDet, ViT-H Cascade (multi-scale) | 2022-03-30 |
Grounded Language-Image Pre-training | ✓ Link | 60.8 | | | | | | | GLIP (Swin-L, multi-scale) | 2021-12-07 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 60.7 | | | | | | | Soft Teacher + Swin-L (HTC++, multi-scale) | 2021-06-16 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 60.6 | 77.5 | 66.7 | 45.1 | 64.8 | 75.3 | | UNINEXT-H | 2023-03-12 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.5 | | | | | | | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 2022-05-17 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 60.4 | | | | | | | ViTDet, ViT-H Cascade | 2022-03-30 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 60.4 | | | | | | | GLEE-Plus | 2023-12-14 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 60.3 | 78.2 | | | | 74.2 | | DyHead (Swin-L, multi-scale, self-training) | 2021-06-15 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.2 | | | | | | | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 2022-05-17 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 60.1 | | | | | | | Soft Teacher + Swin-L (HTC++, single-scale) | 2021-06-16 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 59.6 | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
Could Giant Pretrained Image Models Extract Universal Representations? | | 59.3 | | | | | | | Frozen Backbone, SwinV2-G-ext22K (HTC) | 2022-11-03 |
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions | ✓ Link | 59.2 | | | | | | | HorNet-L | 2022-07-28 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 59.2 | | | | | | | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 59.1 | | | | | | | CBNetV2 (Dual-Swin-L HTC, single-scale) | 2021-07-01 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 58.7 | 77.2 | | | | 73.4 | | Focal-L (DyHead, multi-scale) | 2021-07-01 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 58.7 | | | | | | | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | 2021-12-02 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 58.5 | | | | | | | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 58.4 | 76.8 | | 44.5 | 62.2 | 73.2 | | DyHead (Swin-L, multi-scale) | 2021-06-15 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 58.0 | | | | | | | Swin-L (HTC++, multi-scale) | 2021-03-25 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 57.7 | | | | | | | MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality | ✓ Link | 57.4 | | | | | | | UM-MAE (HTC++, Swin-L, IN1K) | 2022-05-20 |
YOLOv6 v3.0: A Full-Scale Reloading | ✓ Link | 57.2 | 74.5 | | | | | | YOLOv6-L6 (46 fps, 1280, V100) | 2023-01-13 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 57.1 | | | | | | | Swin-L (HTC++, single-scale) | 2021-03-25 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 57.1 | | | | | | | TransNeXt-Base (IN-1K pretrain, DINO 1x) | 2023-11-28 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 57.0 | | | | | | | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 2020-12-13 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 56.6 | | | | | | | TransNeXt-Small (IN-1K pretrain, DINO 1x) | 2023-11-28 |
Instances as Queries | ✓ Link | 56.1 | 75.8 | 61.7 | 40.2 | 59.8 | 71.5 | | QueryInst (single scale) | 2021-05-05 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 56.1 | | | | | | | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 55.9 | | | | | | | MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 55.7 | | | | | | | TransNeXt-Tiny (IN-1K pretrain, DINO 1x) | 2023-11-28 |
Scaled-YOLOv4: Scaling Cross Stage Partial Network | ✓ Link | 55.4 | 73.3 | 60.7 | 38.1 | 59.5 | 67.4 | | YOLOv4-P7 CSP-P7 (single-scale, 16 fps) | 2020-11-16 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 55.2 | | | | | | | tiny-MOAT-3 (IN-1K pretraining, single-scale) | 2022-10-04 |
Understanding The Robustness in Vision Transformers | ✓ Link | 55.1 | | | | | | | FAN-L-Hybrid | 2022-04-26 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 55.0 | | | | | | | Hiera-L | 2023-06-01 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 55.0 | | | | | | | GLEE-Lite | 2023-12-14 |
Towards Sustainable Self-supervised Learning | ✓ Link | 54.6 | | | | | | | TEC (ViT-B, Mask R-CNN) | 2022-10-20 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 54.5 | | | | | | | Cascade Eff-B7 NAS-FPN (1280) | 2020-12-13 |
Context Autoencoder for Self-Supervised Representation Learning | ✓ Link | 54.5 | | | | | | | CAE (ViT-L, Mask R-CNN, 1x schedule) | 2022-02-07 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 54.3 | | | | | | | MViTv2-L (Cascade Mask R-CNN, single-scale) | 2021-12-02 |
Rethinking Pre-training and Self-training | ✓ Link | 54.2 | | | | | | | SpineNet-190 (1280, with Self-training on OpenImages, single-scale) | 2020-06-11 |
Simple Training Strategies and Model Scaling for Object Detection | ✓ Link | 53.6 | | | 34.5 | 56.7 | 70.6 | | Cascade RCNN-RS (SpineNet-143L, single-scale) | 2021-06-30 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 53.5 | 70.8 | 58.9 | 36.9 | 57.5 | 68.1 | | UniverseNet-20.08d (Res2Net-101, DCN, multi-scale) | 2021-03-25 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 53.3 | | | | | | | MAE (ViT-L, Mask R-CNN) | 2021-11-11 |
Simple Training Strategies and Model Scaling for Object Detection | ✓ Link | 53.1 | | | 33.9 | 56.2 | 70.3 | | Cascade RCNN-RS (ResNet-200, single-scale) | 2021-06-30 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 53.0 | | | | | | | tiny-MOAT-2 (IN-1K pretraining, single-scale) | 2022-10-04 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 52.7 | | | | | | | MViT-L (Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02 |
ResNeSt: Split-Attention Networks | ✓ Link | 52.47 | 71.00 | 57.07 | 36.80 | 56.36 | 66.29 | | ResNeSt-200 (multi-scale) | 2020-04-19 |
Active Token Mixer | ✓ Link | 52.3 | | | | | | | ActiveMLP-B (Cascade Mask R-CNN) | 2022-03-11 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 52.2 | | | | | | | RetinaNet (SpineNet-190, 1536x1536) | 2019-12-10 |
EfficientDet: Scalable and Efficient Object Detection | ✓ Link | 52.1 | | | | | | | EfficientDet-D7 (1536) | 2019-11-20 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 51.9 | | | | | | | tiny-MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
Global Context Networks | ✓ Link | 51.8 | 70.4 | 56.1 | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2020-12-24 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 51.6 | 70.5 | 56.0 | | | | | ELSA-S (Cascade Mask RCNN) | 2021-12-23 |
Focal Modulation Networks | ✓ Link | 51.5 | 70.3 | 56.0 | | | | | FocalNet-T (LRF, Cascade Mask R-CNN) | 2022-03-22 |
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 51.3 | 69.1 | 56.0 | 34.5 | 54.2 | 65.8 | | DINO-5scale (24 epochs) | 2022-03-07 |
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 51.2 | 69.0 | 55.8 | 35.0 | 54.3 | 65.3 | | DINO-5scale (36 epochs) | 2022-03-07 |
ResNeSt: Split-Attention Networks | ✓ Link | 50.91 | 69.53 | 55.40 | 32.67 | 54.66 | 65.83 | | ResNeSt-200-DCN (single-scale) | 2020-04-19 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 50.9 | 69.5 | 55.4 | 33.5 | 55.5 | 65.8 | | UniverseNet-20.08d (Res2Net-101, DCN, single-scale) | 2021-03-25 |
ResNeSt: Split-Attention Networks | ✓ Link | 50.54 | 68.78 | 55.17 | | 54.2 | 63.9 | | ResNeSt-200 (single-scale) | 2020-04-19 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 50.5 | | | | | | | tiny-MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 50.3 | | | | | | | MAE (ViT-B, Mask R-CNN) | 2021-11-11 |
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 50.1 | 69.5 | 54.9 | | | | | Sparse R-CNN (PVTv2-B2) | 2021-06-25 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 50.0 | | | | | | | Pix2seq (ViT-L) | 2021-09-22 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 49.9 | | | | | | | DaViT-T (Mask R-CNN, 36 epochs) | 2022-04-07 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 49.7 | 71.3 | 54.6 | | | | | BoTNet 200 (Mask R-CNN, single-scale, 72 epochs) | 2021-01-27 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 49.5 | 71.0 | 54.2 | | | | | BoTNet 152 (Mask R-CNN, single-scale, 72 epochs) | 2021-01-27 |
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising | ✓ Link | 49.5 | 67.6 | 53.8 | 31.3 | 52.6 | 65.4 | 47 | DN-Deformable-DETR-R50++ | 2022-03-02 |
Recurrent Glimpse-based Decoder for Detection with Transformer | ✓ Link | 49.1 | 67.5 | 53.1 | 30 | 52.6 | 65 | | REGO-Deformable DETR-X101 | 2021-12-09 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 48.6 | 67.8 | | | | | | CenterMask+VoVNet99 (multi-scale) | 2019-11-15 |
Rethinking ImageNet Pre-training | ✓ Link | 48.6 | 66.8 | 52.9 | | | | | Mask R-CNN (ResNeXt-152-FPN, cascade) | 2018-11-21 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 48.5 | 67.0 | 52.6 | 30.6 | 52.7 | 62.7 | | UniverseNet-20.08 (Res2Net-50, DCN, single-scale) | 2021-03-25 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.5 | | | | | | | XCiT-M24/8 | 2021-06-17 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 48.3 | 70.4 | 52.9 | | | | | ELSA-S (Mask RCNN) | 2021-12-23 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.1 | | | | | | | XCiT-S24/8 | 2021-06-17 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 47.9 | 66.9 | 52.2 | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r16) | 2019-04-25 |
MAE-DET: Revisiting Maximum Entropy Principle in Zero-Shot NAS for Efficient Object Detection | ✓ Link | 47.8 | 65.5 | 52.2 | 30.3 | 51.9 | 61.1 | | MAE-DET (MAE-Det-L + GFLV2) | 2021-11-26 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 47.5 | 66.5 | 51.3 | 28.6 | 51.6 | 62.1 | | Res2Net101+HTC | 2019-04-02 |
Rethinking ImageNet Pre-training | ✓ Link | 47.4 | | | | | | | Mask R-CNN (ResNet-101-FPN, GN, Cascade) | 2018-11-21 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 47.3 | | | | | | | Pix2seq (R50-C4) | 2021-09-22 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 47.1 | | | | | | | Pix2seq (ViT-B) | 2021-09-22 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 47.0 | | | 28.8 | 50.3 | 62.2 | | HTC (HRNetV2p-W48) | 2019-08-20 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 47.0 | | | | | | | PatchConvNet-S120 (Mask R-CNN) | 2021-12-27 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 46.8 | | | | | | | RPDet (ResNeXt-101-DCN, multi-scale) | 2019-04-25 |
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR | ✓ Link | 46.6 | 67.0 | 50.2 | 28.1 | 50.5 | 64.1 | 63 | DAB-DETR-DC5-R101 | 2022-01-28 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 46.5 | | | | | | | DyHead (ResNet-101) | 2021-06-15 |
Rethinking ImageNet Pre-training | ✓ Link | 46.4 | 67.1 | 51.1 | | | | | Mask R-CNN (ResNeXt-152-FPN) | 2018-11-21 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 46.4 | | | | | | | RPDet (ResNet-101-DCN, multi-scale) | 2019-04-25 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 46.4 | | | | | | | PatchConvNet-S60 (Mask R-CNN) | 2021-12-27 |
Deep Residual Learning for Image Recognition | ✓ Link | 46.3 | 64.3 | 50.5 | | | | | Cascade Mask R-CNN (ResNet-50) | 2015-12-10 |
HoughNet: Integrating near and long-range evidence for bottom-up object detection | ✓ Link | 46.1 | 64.6 | 50.3 | 30.0 | 48.8 | 59.7 | | HoughNet (HG-104, MS) | 2020-07-05 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 46.0 | | | 27.5 | | 60.1 | | Mask R-CNN (HRNetV2p-W48, cascade) | 2019-08-20 |
Conditional DETR for Fast Training Convergence | ✓ Link | 45.9 | 66.8 | 49.5 | 27.2 | 50.3 | 63.3 | 63 | Conditional DETR-DC5-R101 | 2021-08-13 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 45.9 | | | | | | | BoTNet 50 (72 epochs) | 2021-01-27 |
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 45.6 | 64.6 | 49.5 | 28.3 | 48.3 | 61.6 | | Sparse R-CNN (ResNet-101, learnable proposals, random crop aug, FPN) | 2020-11-25 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 45.6 | | | 29.2 | | 58.8 | | CenterMask+VoVNetV2-99 (single-scale) | 2019-11-15 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 45.3 | | | 27.0 | 48.4 | 59.5 | | HTC (HRNetV2p-W32) | 2019-08-20 |
Anchor DETR: Query Design for Transformer-Based Object Detection | ✓ Link | 45.1 | 65.7 | 48.8 | 25.8 | 49.4 | 61.6 | | Anchor DETR-DC5-R101 | 2021-09-15 |
Conditional DETR for Fast Training Convergence | ✓ Link | 45.1 | 65.4 | 48.5 | 25.3 | 49.0 | 62.2 | 44 | Conditional DETR-DC5-R50 | 2021-08-13 |
Non-local Neural Networks | ✓ Link | 45.0 | 67.8 | 48.9 | | | | | Mask R-CNN (ResNeXt-152 + 1 NL) | 2017-11-21 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 45.0 | 63.2 | 48.6 | 28.2 | 48.9 | 60.4 | | Pix2seq (R101-DC5) | 2021-09-22 |
Attentive Normalization | ✓ Link | 44.9 | 66.2 | 49.1 | | | | | Mask R-CNN-FPN (AOGNet-40M) | 2019-08-04 |
End-to-End Object Detection with Transformers | ✓ Link | 44.9 | 64.7 | 47.7 | 23.7 | 49.5 | 62.3 | | DETR-DC5 (ResNet-101) | 2020-05-26 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.9 | | | 28.5 | | 57.7 | | Mask R-CNN (VoVNetV2-99, single-scale) | 2019-11-15 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 44.8 | 64.3 | 48.9 | 26.6 | 48.3 | 59.6 | | R3-CNN (ResNet-50-FPN, DCN) | 2021-04-03 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 44.8 | | | | | | | RPDet (ResNet-101-DCN, multi-scale train) | 2019-04-25 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 44.7 | | 47.6 | 29.9 | 48.0 | 58.1 | | RetinaNet (ViL-Base, multi-scale, 3x) | 2021-03-29 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 44.6 | 62.7 | 48.7 | 26.3 | 48.1 | 58.5 | | Cascade R-CNN (HRNetV2p-W48) | 2019-08-20 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.6 | | | 27.7 | 48.3 | | | CenterMask+VoVNetV2-57 (single-scale) | 2019-11-15 |
Conditional DETR for Fast Training Convergence | ✓ Link | 44.5 | 65.6 | 47.5 | 23.6 | 48.4 | 63.6 | 63 | Conditional DETR-R101 | 2021-08-13 |
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 44.5 | 63.4 | 48.2 | 26.9 | 47.2 | 59.5 | | Sparse R-CNN (ResNet-50, learnable proposals, random crop aug, FPN) | 2020-11-25 |
Deep Residual Learning for Image Recognition | ✓ Link | 44.5 | 63.0 | 48.3 | | | | | GFL (ResNet-50) | 2015-12-10 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 44.5 | | | | | | | RPDet (ResNeXt-101-DCN) | 2019-04-25 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.4 | | | 26.7 | | 57.1 | | CenterMask+X101-32x8d (single-scale) | 2019-11-15 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 44.3 | 65.5 | 47.1 | 28.9 | 47.9 | 58.3 | | RetinaNet (ViL-Base) | 2021-03-29 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 44.3 | 64.1 | 48.4 | 27 | 47.1 | 58.9 | | R3-CNN (ResNet-50-FPN, GC-Net) | 2021-04-03 |
Anchor DETR: Query Design for Transformer-Based Object Detection | ✓ Link | 44.2 | 64.7 | 47.5 | 24.7 | 48.2 | 60.6 | | Anchor DETR-DC5-R50 | 2021-09-15 |
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR | ✓ Link | 44.1 | 64.7 | 47.2 | 24.1 | 48.2 | 62.9 | 63 | DAB-DETR-R101 | 2022-01-28 |
End-to-End Object Detection with Transformers | ✓ Link | 44.0 | 63.9 | 47.8 | 27.2 | 48.1 | 56.0 | | Faster RCNN-R101-FPN+ | 2020-05-26 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 43.7 | 61.7 | 47.7 | 25.6 | 46.5 | 57.4 | | Cascade R-CNN (HRNetV2p-W32) | 2019-08-20 |
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 43.5 | 62.1 | 47.2 | 26.1 | 46.3 | 59.7 | | Sparse R-CNN (ResNet-101, FPN) | 2020-11-25 |
Deep Residual Learning for Image Recognition | ✓ Link | 43.5 | 61.9 | 47.0 | | | | | ATSS (ResNet-50) | 2015-12-10 |
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | ✓ Link | 43.4 | 63.6 | 46.1 | 26.1 | 46.0 | 59.5 | | PVT-Large (RetinaNet 3x,MS) | 2021-02-24 |
Bottom-up Object Detection by Grouping Extreme and Center Points | ✓ Link | 43.3 | 59.6 | 46.8 | 25.7 | 46.6 | 59.4 | | ExtremeNet (Hourglass-104, multi-scale) | 2019-01-23 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 43.2 | 61.0 | 46.1 | 26.6 | 47.0 | 58.6 | | Pix2seq (R50-DC5) | 2021-09-22 |
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 43.2 | 59.4 | 40.7 | 20.3 | 40.9 | 52.3 | | HTC (cascade) | 2019-01-22 |
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization | ✓ Link | 43.12 | 64.15 | 47.11 | 25.49 | 47.19 | 56.39 | | Mask R-CNN-FPN (ResNeXt-101, GN+WS) | 2019-03-25 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 43.1 | | | 26.6 | 46.0 | | | HTC (HRNetV2p-W18) | 2019-08-20 |
Deformable ConvNets v2: More Deformable, Better Results | ✓ Link | 43.1 | | | | | | | Mask R-CNN (ResNet-101, DCNv2) | 2018-11-27 |
Conditional DETR for Fast Training Convergence | ✓ Link | 43.0 | 64.0 | 45.7 | 22.7 | 46.7 | 61.5 | 44 | Conditional DETR-R50 | 2021-08-13 |
HoughNet: Integrating near and long-range evidence for bottom-up object detection | ✓ Link | 43.0 | 62.2 | 46.9 | 25.5 | 47.6 | 55.8 | | HoughNet (HG-104) | 2020-07-05 |
X-volution: On the unification of convolution and self-attention | | 42.8 | 64.0 | 46.4 | 26.9 | 46.0 | 55.0 | | Faster R-CNN (FPN, X-volution) | 2021-06-04 |
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 42.7 | 61.6 | 46.6 | 23.8 | 46.2 | 57.4 | | Cascade R-CNN (ResNet-101-FPN+, cascade) | 2017-12-03 |
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | ✓ Link | 42.6 | 63.7 | 45.4 | 25.8 | 46.0 | 58.4 | | PVT-Large (RetinaNet 1x) | 2021-02-24 |
CornerNet-Lite: Efficient Keypoint Based Object Detection | ✓ Link | 42.6 | | | 25.5 | 44.3 | 58.4 | | CornerNet-Saccade (Hourglass-54) | 2019-04-18 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 42.6 | | | | | | | Pix2seq (R50) | 2021-09-22 |
Group Normalization | ✓ Link | 42.3 | 62.8 | 46.2 | | | | | Mask R-CNN (ResNet-101-FPN, GroupNorm, long) | 2018-03-22 |
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 42.3 | 61.2 | 45.7 | 26.7 | 44.6 | 57.6 | | Sparse R-CNN (ResNet-50, FPN) | 2020-11-25 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 42.3 | | | 25.0 | 45.4 | | | Mask R-CNN (HRNetV2p-W32) | 2019-08-20 |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 42.3 | | | | | | | DETR-ResNet50 with iRPE-K (300 epochs) | 2021-07-29 |
Scale-Aware Trident Networks for Object Detection | ✓ Link | 42.0 | 63.5 | 45.5 | 24.9 | 47.0 | 56.9 | | TridentNet (ResNet-101) | 2019-01-07 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 42.0 | 61.0 | 46.3 | 24.5 | 45.2 | 55.7 | | R3-CNN (ResNet-50-FPN) | 2021-04-03 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 41.8 | 62.8 | 45.9 | | 44.7 | 54.6 | | Faster R-CNN (HRNetV2p-W48) | 2019-08-20 |
LIP: Local Importance-based Pooling | ✓ Link | 41.7 | 63.6 | 45.6 | 25.2 | 45.8 | | | Faster R-CNN (LIP-ResNet-101) | 2019-08-12 |
Deformable ConvNets v2: More Deformable, Better Results | ✓ Link | 41.7 | | | 22.2 | 45.8 | 58.7 | | Faster R-CNN (ResNet-101, DCNv2) | 2018-11-27 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 41.6 | 62.4 | | | | | | FSAF (ResNeXt-101, anchor-based branches) | 2019-03-02 |
CornerNet-Lite: Efficient Keypoint Based Object Detection | ✓ Link | 41.4 | | | 23.8 | 43.5 | 57.1 | | CornerNet-Saccade (Hourglass-104) | 2019-04-18 |
Grid R-CNN | ✓ Link | 41.3 | 60.3 | 44.4 | 23.4 | 45.8 | 54.1 | | Grid R-CNN (ResNet-101-FPN) | 2018-11-29 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 41.3 | 59.2 | 44.9 | 23.7 | 44.2 | 54.1 | | Cascade R-CNN (HRNetV2p-W18) | 2019-08-20 |
CenterNet: Keypoint Triplets for Object Detection | ✓ Link | 41.3 | 59.2 | 43.9 | 23.6 | 43.8 | 55.8 | | CenterNet511 (Hourglass-52) | 2019-04-17 |
RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free | ✓ Link | 41.1 | 60.2 | 44.1 | | | | | RetinaMask (ResNet-101-FPN) | 2019-01-10 |
MetaFormer Is Actually What You Need for Vision | ✓ Link | 41.0 | 63.1 | 44.8 | | | | | PoolFormer-S36 (Mask R-CNN) | 2021-11-22 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 40.9 | 61.8 | 44.8 | 24.4 | 43.7 | 53.3 | | Faster R-CNN (HRNetV2p-W32) | 2019-08-20 |
VirTex: Learning Visual Representations from Textual Annotations | ✓ Link | 40.9 | | | | | | | VirTex Mask R-CNN (ResNet-50-FPN) | 2020-06-11 |
Non-local Neural Networks | ✓ Link | 40.8 | 63.1 | 44.5 | | | | | Mask R-CNN (ResNet-101 + 1 NL) | 2017-11-21 |
Group Normalization | ✓ Link | 40.8 | 61.6 | 44.4 | | | | | Mask R-CNN (ResNet-50-FPN, GroupNorm, long) | 2018-03-22 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 40.8 | | | | | | | RPDet (ResNet-50, multi-scale train) | 2019-04-25 |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 40.8 | | | | | | | DETR-ResNet50 with iRPE-K (150 epochs) | 2021-07-29 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 40.7 | 60.7 | 43.3 | | | | | Faster R-CNN+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28 |
Reducing Label Noise in Anchor-Free Object Detection | ✓ Link | 40.5 | 59.5 | 44.2 | 25.4 | 44.7 | 52.3 | | PPDet (ResNet-101-FPN) | 2020-08-03 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 40.3 | 62.4 | 44.0 | 24.2 | 44.4 | 52.5 | | GCNet (ResNet-50-FPN, GRoIE) | 2019-04-25 |
Group Normalization | ✓ Link | 40.3 | 61.0 | 44.0 | | | | | Mask R-CNN (ResNet-50-FPN, GroupNorm) | 2018-03-22 |
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 40.3 | 59.4 | 43.7 | 22.9 | 43.7 | 54.1 | | Cascade R-CNN (ResNet-50-FPN+) | 2017-12-03 |
Bottom-up Object Detection by Grouping Extreme and Center Points | ✓ Link | 40.3 | 55.1 | 43.7 | 21.6 | 44.0 | 56.1 | | ExtremeNet (Hourglass-104, single-scale) | 2019-01-23 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 40.3 | | | | | | | RPDet (ResNet-101) | 2019-04-25 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 40.2 | 60.3 | 42.3 | | | | | RetinaNet+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28 |
Mask R-CNN | ✓ Link | 40.0 | | | | | | | Mask R-CNN (ResNet-101-FPN) | 2017-03-20 |
Feature Pyramid Networks for Object Detection | ✓ Link | 39.8 | 61.3 | 43.3 | 22.9 | 43.3 | 52.6 | | FPN+ | 2016-12-09 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 39.7 | 58.8 | 41.5 | | | | | FoveaBox+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28 |
Grid R-CNN | ✓ Link | 39.6 | 58.3 | 42.4 | 22.6 | 43.8 | 51.5 | | Grid R-CNN (ResNet-50-FPN) | 2018-11-29 |
Adaptively Connected Neural Networks | ✓ Link | 39.5 | | | | | | | Mask R-CNN (ResNet-50, ACNet) | 2019-04-07 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 39.3 | 59.2 | | | | | | FSAF (ResNet-101, anchor-based branches) | 2019-03-02 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 39.2 | | | | 41.7 | 51.0 | | Mask R-CNN (HRNetV2p-W18) | 2019-08-20 |
Non-local Neural Networks | ✓ Link | 39.0 | 61.1 | 41.9 | | | | | Mask R-CNN (ResNet-50 + 1 NL) | 2017-11-21 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38.9 | 58.4 | 41.5 | 22.3 | 43.5 | 51.7 | | FoveaBox (ResNet-101-FPN, 800x800) | 2019-04-08 |
FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 38.6 | 57.4 | 41.4 | 22.3 | 42.5 | 49.8 | | FCOS (ResNet-50-FPN + improvements) | 2019-04-02 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 38.6 | | | | | | | RPDet (ResNet-50) | 2019-04-25 |
Libra R-CNN: Towards Balanced Learning for Object Detection | ✓ Link | 38.5 | 59.3 | 42.0 | 22.9 | 42.1 | 50.5 | | Libra R-CNN (ResNet-50 FPN) | 2019-04-04 |
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 38.4 | 59.9 | 41.7 | 22.9 | 42.1 | 49.7 | | Mask R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28 |
CornerNet: Detecting Objects as Paired Keypoints | ✓ Link | 38.4 | 53.8 | 40.9 | 18.6 | 40.5 | 51.8 | | CornerNet511 (Hourglass-104) | 2018-08-03 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38.1 | 57.8 | 40.5 | | | | | FoveaBox+Retina (ResNet-50) | 2019-04-08 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 38.0 | 58.9 | 41.5 | 22.6 | 40.8 | 49.6 | | Faster R-CNN (HRNetV2p-W18) | 2019-08-20 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38.0 | 57.8 | 40.2 | 19.5 | 42.2 | 52.7 | | FoveaBox (ResNet-101-FPN, 600x600) | 2019-04-08 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 37.9 | 58.0 | | | | | | FSAF (ResNet-101) | 2019-03-02 |
Mask R-CNN | ✓ Link | 37.7 | | | | | | | Mask R-CNN (ResNet-50-FPN) | 2017-03-20 |
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 37.5 | 59.2 | 40.6 | 22.3 | 41.5 | 47.8 | | Faster R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28 |
Mask R-CNN | ✓ Link | 36.7 | 59.5 | 38.9 | | | | | Mask R-CNN (ResNeXt-101-FPN) | 2017-03-20 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 36.0 | 55.2 | 37.9 | 18.6 | 39.4 | 50.5 | | FoveaBox (ResNet-50-FPN, 600x600) | 2019-04-08 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 35.9 | 55.0 | 37.9 | 19.8 | 39.6 | 48.2 | | FSAF (ResNet-50) | 2019-03-02 |
Gradient Harmonized Single-stage Detector | ✓ Link | 35.8 | 55.5 | 38.1 | 19.6 | 39.6 | 46.7 | | GHM-C + GHM-R (RetinaNet-FPN-ResNet-50, M=30) | 2018-11-13 |
Generating Positive Bounding Boxes for Balanced Training of Object Detectors | ✓ Link | 35.6 | 55.3 | | | | | | Online Fg Bal. Sampling+Hard Negative Mining (ResNet-50) | 2019-09-21 |
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 34.1 | 53.7 | | 15.9 | 39.5 | 49.3 | | M2Det (ResNet-101, 320x320) | 2018-11-12 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 33.7 | 53.6 | | 14.0 | 38.3 | 51.1 | | Faster R-CNN (Res2Net-50) | 2019-04-02 |
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 33.2 | 52.2 | | 15.0 | 38.2 | 49.1 | | M2Det (VGG-16, 320x320) | 2018-11-12 |
SOLQ: Segmenting Objects by Learning Queries | ✓ Link | | 74.9 | 61.3 | | | 71.9 | | SOLQ (Swin-L, single-scale) | 2021-06-04 |
You Only Learn One Representation: Unified Network for Multiple Tasks | ✓ Link | | 73.5 | 60.6 | 40.4 | 60.1 | 68.7 | | YOLOR-D6 (1280, single-scale, 31 fps) | 2021-05-10 |
EfficientDet: Scalable and Efficient Object Detection | ✓ Link | | 73.4 | 59.0 | 40.0 | 58.0 | 67.9 | | EfficientDet-D7x (single-scale) | 2019-11-20 |
You Only Learn One Representation: Unified Network for Multiple Tasks | ✓ Link | | 70.6 | 57.4 | 37.4 | 57.3 | 65.2 | | YOLOR-P6 (1280, single-scale, 72 fps) | 2021-05-10 |
Focal Modulation Networks | ✓ Link | | 70.1 | 55.8 | | | | | FocalNet-T (SRF, Cascade Mask R-CNN) | 2022-03-22 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | | 61.2 | 45.6 | 24.4 | | | | R3-CNN (ResNet-50-FPN, GRoIE) | 2021-04-03 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | | | | 26.1 | 47.9 | | | Mask R-CNN (HRNetV2p-W32, cascade) | 2019-08-20 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | | | | | 42.3 | | | Shift-T | 2022-01-26 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | | | | | | 66.3 | | DyHead (ResNeXt-64x4d-101-DCN, multi-scale) | 2021-06-15 |