DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 66.0 | | | | | | | 304 | | | Co-DETR | 2022-11-22 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 65.5 | | | | | | | 2180 | | | InternImage-H (M3I Pre-training) | 2022-11-10 |
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | ✓ Link | 65.4 | | | | | | | | | | M3I Pre-training (InternImage-H) | 2022-11-17 |
MoCaE: Mixture of Calibrated Experts Significantly Improves Object Detection | ✓ Link | 65.1 | | | | | | | | | | MoCaE | 2023-09-26 |
A Strong and Reproducible Object Detector with Only Public Datasets | ✓ Link | 64.8 | 81.7 | 71.5 | 48.6 | 67.6 | 78 | | 689 | | | Focal-Stable-DINO (Focal-Huge, no TTA) | 2023-04-25 |
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 64.8 | | | | | | | 218 | | | Co-DETR (Swin-L) | 2022-11-22 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 64.7 | 81.9 | 71.7 | 48.5 | 67.7 | 77.9 | | | | | EVA | 2022-11-14 |
Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining | | 64.5 | 81.8 | 71.1 | 48.4 | 67.2 | 77.1 | | | | | Group DETR v2 | 2022-11-07 |
Focal Modulation Networks | ✓ Link | 64.4 | | | | | | | | | | FocalNet-H (DINO) | 2022-03-22 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 64.3 | | | | | | | 602 | | | InternImage-XL | 2022-11-10 |
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | ✓ Link | 64.2 | | | | | | | | | | FD-SwinV2-G | 2022-05-27 |
DETR Does Not Need Multi-Scale or Locality Design | ✓ Link | 63.9 | 82.1 | 70.7 | 48.2 | 66.8 | 76.7 | | 228 | | | Plain-DETR (Swin-L) | 2023-01-01 |
Reversible Column Networks | ✓ Link | 63.8 | | | | | | | | | | RevCol-H(DINO) | 2022-12-22 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 63.7 | | | | | | | | | | BEiT-3 | 2022-08-22 |
Relation DETR: Exploring Explicit Position Relation Prior for Object Detection | ✓ Link | 63.5 | 80.8 | 69.1 | 47.2 | 66.9 | 77.0 | | 214 | | | Relation-DETR (Focal-L) | 2024-07-16 |
NMS Strikes Back | ✓ Link | 63.5 | 80.4 | 70.2 | 46.1 | 66.9 | 76.9 | | | | | DETA (Swin-L) | 2022-12-12 |
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 63.3 | | | | | | | | | | DINO (Swin-L,multi-scale, TTA) | 2022-03-07 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 63.1 | | | | | | | 3000 | | | SwinV2-G (HTC++) | 2021-11-18 |
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | ✓ Link | 63.0 | | | | | | | | | | Grounding DINO | 2023-03-09 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 62.4 | | | | | | | | | | Florence-CoSwin-H | 2021-11-22 |
GLIPv2: Unifying Localization and Vision-Language Understanding | ✓ Link | 62.4 | | | | | | | | | | GLIPv2 (CoSwin-H, multi-scale) | 2022-06-12 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 62.3 | | | | | | | | | | GLEE-Pro | 2023-12-14 |
Grounded Language-Image Pre-training | ✓ Link | 61.5 | 79.5 | 67.7 | 45.3 | 64.9 | 75.0 | | | | | GLIP (Swin-L, multi-scale) | 2021-12-07 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 61.3 | | | | | | | | | | Soft Teacher + Swin-L (HTC++, multi-scale) | 2021-06-16 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.9 | | | | | | | | | | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 2022-05-17 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 60.6 | 78.5 | 66.6 | | 64.0 | 74.2 | | | | | DyHead (Swin-L, multi scale, self-training) | 2021-06-15 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 60.6 | | | | | | | | | | GLEE-Plus | 2023-12-14 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.4 | | | | | | | | | | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 2022-05-17 |
GRiT: A Generative Region-to-text Transformer for Object Understanding | ✓ Link | 60.4 | | | | | | | | | | GRiT (ViT-H, single-scale testing) | 2022-12-01 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 60.1 | | | | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
Parameter-Inverted Image Pyramid Networks | ✓ Link | 60.0 | 79.0 | 65.4 | | | | | | | | PIIP-H6B (DINO) | 2024-06-06 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 59.4 | | | | | | | | | | CBNetV2 (Dual-Swin-L HTC, single-scale) | 2021-07-01 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 58.9 | | | | | | | | | | Focal-L (DyHead, multi-scale) | 2021-07-01 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 58.7 | 77.1 | 64.5 | | 62.0 | 72.8 | | | | | DyHead (Swin-L, multi scale) | 2021-06-15 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 58.7 | | | | | | | | | | Swin-L (HTC++, multi scale) | 2021-03-25 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 57.7 | | | | | | | | | | Swin-L (HTC++, single scale) | 2021-03-25 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 57.3 | | | | | | | | | | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 2020-12-13 |
CenterNet++ for Object Detection | ✓ Link | 57.1 | 73.7 | 62.4 | 38.7 | 59.2 | 71.3 | | | | | PyCenterNet (Swin-L, multi-scale) | 2022-04-18 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 56.8 | | | | | | | | | | dBOT ViT-L (CLIP) | 2022-09-08 |
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors | ✓ Link | 56.6 | | | | | | | | | | YOLOv7-D6 (44 fps) | 2022-07-06 |
SOLQ: Segmenting Objects by Learning Queries | ✓ Link | 56.5 | 74.6 | 60.5 | 37.6 | 60 | 70.6 | | | | | SOLQ (Swin-L, single scale) | 2021-06-04 |
Probabilistic two-stage detection | ✓ Link | 56.4 | 74.0 | 61.6 | 38.7 | 59.7 | 68.6 | | | | | CenterNet2 (Res2Net-101-DCN-BiFPN, self-training, 1560 single-scale) | 2021-03-12 |
ISTR: End-to-End Instance Segmentation with Transformers | ✓ Link | 56.4 | | | 27.8 | 48.7 | 59.9 | | | | | ISTR (ResNet50-FPN-3x, single-scale) | 2021-05-03 |
Instances as Queries | ✓ Link | 56.1 | 75.9 | 61.9 | 37.4 | 58.9 | 70.3 | 17G | | | | QueryInst (single-scale) | 2021-05-05 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 56.1 | | | | | | | | | | dBOT ViT-L | 2022-09-08 |
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors | ✓ Link | 56 | | | | | | | | | | YOLOv7-E6 (56 fps) | 2022-07-06 |
Scaled-YOLOv4: Scaling Cross Stage Partial Network | ✓ Link | 55.8 | 73.2 | 61.2 | | | | | | | | YOLOv4-P7 with TTA | 2020-11-16 |
DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution | ✓ Link | 55.7 | 74.2 | 61.1 | 37.7 | 58.4 | 68.1 | | | | | DetectoRS (ResNeXt-101-64x4d, multi-scale) | 2020-06-03 |
You Only Learn One Representation: Unified Network for Multiple Tasks | ✓ Link | 55.4 | 73.3 | 60.6 | | | | | | | | YOLOR-D6 (1280, single-scale, 30 fps) | 2021-05-10 |
Scaled-YOLOv4: Scaling Cross Stage Partial Network | ✓ Link | 54.9 | 72.6 | 60.2 | | | | | | | | YOLOv4-P6 with TTA | 2020-11-16 |
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors | ✓ Link | 54.9 | | | | | | | | | | YOLOv7-W6 (84 fps) | 2022-07-06 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 54.8 | | | | | | | | | | Cascade Eff-B7 NAS-FPN (1280) | 2020-12-13 |
DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution | ✓ Link | 54.7 | 73.5 | 60.1 | 37.4 | 57.3 | 66.4 | | | | | DetectoRS (ResNeXt-101-32x4d, multi-scale) | 2020-06-03 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 54.7 | | | | | | | | | | GLEE-Lite | 2023-12-14 |
Scaled-YOLOv4: Scaling Cross Stage Partial Network | ✓ Link | 54.3 | 72.3 | 59.5 | 36.6 | 58.2 | 65.5 | | | | | YOLOv4-P6 CSP-P6 (single-scale, 32 fps) | 2020-11-16 |
Rethinking Pre-training and Self-training | ✓ Link | 54.3 | | | | | | | | | | SpineNet-190 (1280, with Self-training on OpenImages, single-scale) | 2020-06-11 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 54.1 | 71.6 | 59.9 | 35.8 | 57.2 | 67.4 | | | | | UniverseNet-20.08d (Res2Net-101, DCN, multi-scale) | 2021-03-25 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 54 | 72.1 | 59.3 | | | | | | | | DyHead (ResNeXt-64x4d-101-DCN, multi scale) | 2021-06-15 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 53.6 | | | | | | | | | | dBOT ViT-B (CLIP) | 2022-09-08 |
Probabilistic Anchor Assignment with IoU Prediction for Object Detection | ✓ Link | 53.5 | 71.6 | 59.1 | 36.0 | 56.3 | 66.9 | | | | | PAA (ResNeXt-152-32x8d + DCN, multi-scale) | 2020-07-16 |
Location-Sensitive Visual Recognition with Cross-IOU Loss | ✓ Link | 53.5 | 71.1 | 59.2 | 35.2 | 56.4 | 65.8 | | | | | LSNet (Res2Net-101 + DCN, multi-scale) | 2021-04-11 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 53.5 | | | | | | | | | | dBOT ViT-B | 2022-09-08 |
ResNeSt: Split-Attention Networks | ✓ Link | 53.3 | 72.0 | 58.0 | 35.1 | 56.2 | 66.8 | | | | | ResNeSt-200 (multi-scale) | 2020-04-19 |
CBNet: A Novel Composite Backbone Network Architecture for Object Detection | ✓ Link | 53.3 | 71.9 | 58.5 | 35.5 | 55.8 | 66.7 | | | | | Cascade Mask R-CNN (Triple-ResNeXt152, multi-scale) | 2019-09-09 |
DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution | ✓ Link | 53.3 | 71.6 | 58.5 | 33.9 | 56.5 | 66.9 | | | | | DetectoRS (ResNeXt-101-32x4d, single-scale) | 2020-06-03 |
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection | ✓ Link | 53.3 | 70.9 | 59.2 | 35.7 | 56.1 | 65.6 | | | | | GFLV2 (Res2Net-101, DCN, multiscale) | 2020-11-25 |
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors | ✓ Link | 53.1 | | | | | | | | | | YOLOv7-X (114 fps) | 2022-07-06 |
RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder | ✓ Link | 52.7 | | | | | | | | | | RelationNet++ (ResNeXt-64x4d-101-DCN) | 2020-10-29 |
EfficientDet: Scalable and Efficient Object Detection | ✓ Link | 52.6 | 71.6 | 56.9 | | | | | | | | EfficientDet-D7 (1536) | 2019-11-20 |
Scaled-YOLOv4: Scaling Cross Stage Partial Network | ✓ Link | 52.5 | 70.3 | 58 | | | | | | | | YOLOv4-P5 with TTA | 2020-11-16 |
Deformable DETR: Deformable Transformers for End-to-End Object Detection | ✓ Link | 52.3 | 71.9 | 58.1 | 34.4 | 54.4 | 65.6 | 17G | | 17.3G | | Deformable DETR (ResNeXt-101+DCN) | 2020-10-08 |
Global Context Networks | ✓ Link | 52.3 | 70.9 | 56.9 | | | | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2020-12-24 |
PP-YOLOE: An evolved version of YOLO | ✓ Link | 52.2 | 69.9 | 56.5 | 33.3 | 56.3 | 66.4 | | | | | PP-YOLOE-x(CSPRepResNet-x, 640x640, single-scale) | 2022-03-30 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 52.1 | 71.8 | 56.5 | 35.4 | 55 | 63.6 | | | | | RetinaNet (SpineNet-190, 1280x1280) | 2019-12-10 |
RepPoints V2: Verification Meets Regression for Object Detection | ✓ Link | 52.1 | 70.1 | 57.5 | 34.5 | 54.6 | 63.6 | | | | | RepPoints v2 (ResNeXt-101, DCN, multi-scale) | 2020-07-16 |
Attention-guided Context Feature Pyramid Network for Object Detection | ✓ Link | 51.9 | 70.4 | 57 | 34.2 | 54.8 | 64.7 | | | | | AC-FPN Cascade R-CNN (X-152-32x8d-FPN-IN5k, multi scale, only CEM) | 2020-05-23 |
OTA: Optimal Transport Assignment for Object Detection | ✓ Link | 51.5 | 68.6 | 57.1 | 34.1 | 53.7 | 64.1 | | | | | OTA (ResNeXt-101+DCN, multiscale) | 2021-03-26 |
YOLOX: Exceeding YOLO Series in 2021 | ✓ Link | 51.5 | | | | | | | | | | YOLOX-x(Modified CSP v5, 640x640, single-scale) | 2021-07-18 |
PP-YOLOE: An evolved version of YOLO | ✓ Link | 51.4 | 68.9 | 55.6 | 31.4 | 55.3 | 66.1 | | | | | PP-YOLOE-l(CSPRepResNet-l, 640x640, single-scale) | 2022-03-30 |
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors | ✓ Link | 51.4 | | | | | | | | | | YOLOv7 (161 fps) | 2022-07-06 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 51.3 | 70.0 | 55.8 | 31.7 | 55.3 | 64.9 | | | | | UniverseNet-20.08d (Res2Net-101, DCN, single-scale) | 2021-03-25 |
Revisiting the Sibling Head in Object Detector | ✓ Link | 51.2 | 71.9 | 56.0 | 33.8 | 54.8 | 64.2 | | | | | TSD(SENet154-DCN,multi-scale) | 2020-03-17 |
YOLOX: Exceeding YOLO Series in 2021 | ✓ Link | 51.2 | 69.6 | 55.7 | 31.2 | 56.1 | 66.1 | | 99.1 | | | YOLOX-X (Modified CSP v5) | 2021-07-18 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 51.2 | | | | | | | | | | iBOT (ViT-B/16) | 2021-11-15 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 50.7 | 70.4 | 54.9 | 33.6 | 53.9 | 62.1 | | | | | RetinaNet (SpineNet-143, 1280x1280) | 2019-12-10 |
Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection | ✓ Link | 50.7 | 68.9 | 56.3 | 33.2 | 52.9 | 62.4 | | | | | ATSS (ResNeXt-64x4d-101+DCN, multi-scale) | 2019-12-05 |
Learning Data Augmentation Strategies for Object Detection | ✓ Link | 50.7 | | | 34.2 | 55.5 | 64.5 | | | | | NAS-FPN (AmoebaNet-D, learned aug) | 2019-06-26 |
Boosting R-CNN: Reweighting R-CNN Samples by RPN's Error for Underwater Object Detection | ✓ Link | 50.7 | | | | | | | | | | Boosting R-CNN* | 2022-06-28 |
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection | ✓ Link | 50.6 | 69 | 55.3 | 31.3 | 54.3 | 63.5 | | | | | GFLV2 (Res2Net-101, DCN) | 2020-11-25 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 50.2 | 70.3 | 53.9 | 32.0 | 53.1 | 63.0 | | | | | aLRP Loss (ResNeXt-101-64x4d, DCN, multiscale test) | 2020-09-28 |
Scale-Equalizing Pyramid Convolution for Object Detection | ✓ Link | 50.1 | 69.8 | 54.3 | 31.3 | 53.3 | 63.7 | | | | | FreeAnchor + SEPC (DCN, ResNeXt-101-64x4d) | 2020-05-06 |
D2Det: Towards High Quality Object Detection and Instance Segmentation | ✓ Link | 50.1 | 69.4 | 54.9 | 32.7 | 52.7 | 62.1 | | | | | D2Det (ResNet-101-DCN, multi-scale test) | 2020-06-01 |
Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training | ✓ Link | 50.1 | 68.3 | 55.6 | 32.8 | 53.0 | 61.2 | | | | | Dynamic R-CNN (ResNet-101-DCN, multi-scale) | 2020-04-13 |
Revisiting the Sibling Head in Object Detector | ✓ Link | 49.4 | 69.6 | 54.4 | 32.7 | 52.5 | 61.0 | | | | | TSD(ResNet-101-Deformable, Image Pyramid) | 2020-03-17 |
RepPoints V2: Verification Meets Regression for Object Detection | ✓ Link | 49.4 | 68.9 | 53.4 | 30.3 | 52.1 | 62.3 | | | | | RepPoints v2 (ResNeXt-101, DCN) | 2020-07-16 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 49.4 | | | | | | | | | | A2MIM (ViT-B) | 2022-05-27 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 49.4 | | | | | | | | | | iBOT (ViT-S/16) | 2021-11-15 |
Corner Proposal Network for Anchor-free, Two-stage Object Detection | ✓ Link | 49.2 | 67.3 | 53.7 | 31.0 | 51.9 | 62.4 | | | | | CPNDet (Hourglass-104, multi-scale) | 2020-07-27 |
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection | ✓ Link | 49 | 67.6 | 53.5 | 29.7 | 52.4 | 61.4 | 3G | | | | GFLV2 (ResNeXt-101, 32x4d, DCN) | 2020-11-25 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 48.9 | 69.3 | 52.5 | 30.8 | 51.5 | 62.1 | | | | | aLRP Loss (ResNeXt-101-64x4d, DCN, single scale) | 2020-09-28 |
PP-YOLOE: An evolved version of YOLO | ✓ Link | 48.9 | 66.5 | 53.0 | 28.6 | 52.9 | 63.8 | | | | | PP-YOLOE-m(CSPRepResNet-m, 640x640, single-scale) | 2022-03-30 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 48.8 | 67.5 | 53.0 | 30.1 | 52.3 | 61.1 | | | | | UniverseNet-20.08 (Res2Net-50, DCN, single-scale) | 2021-03-25 |
SOLQ: Segmenting Objects by Learning Queries | ✓ Link | 48.7 | | | | | | | | | | SOLQ (ResNet101, single scale) | 2021-06-04 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 48.6 | 68.4 | 52.5 | 32 | 52.3 | 62 | | | | | RetinaNet (SpineNet-96, 1024x1024) | 2019-12-10 |
Scale-Aware Trident Networks for Object Detection | ✓ Link | 48.4 | 69.7 | 53.5 | 31.8 | 51.3 | 60.3 | | | | | TridentNet (ResNet-101-Deformable, Image Pyramid) | 2019-01-07 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 48.4 | 67.6 | 52.7 | | | | | | 54.8G | | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2019-04-25 |
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection | ✓ Link | 48.3 | 66.5 | 52.8 | 28.8 | 51.9 | 60.7 | 3G | | | | GFLV2 (ResNet-101-DCN) | 2020-11-25 |
Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields | ✓ Link | 48.23 | | | | | | | | | | Swin-S (RPE w/ GAB) | 2023-05-08 |
Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection | ✓ Link | 48.2 | 67.4 | 52.6 | 29.2 | 51.7 | 60.2 | | | | | GFL (X-101-32x4d-DCN, single-scale) | 2020-06-08 |
ISTR: End-to-End Instance Segmentation with Transformers | ✓ Link | 48.1 | | | 28.7 | 50.4 | 61.5 | | | | | ISTR (ResNet101-FPN-3x, single-scale) | 2021-05-03 |
YOLOX: Exceeding YOLO Series in 2021 | ✓ Link | 48.0 | | | | | | | | | | YOLOX-Darknet53(Darknet53, 640x640, single-scale) | 2021-07-18 |
Vision Transformer with Deformable Attention | ✓ Link | 47.9 | 69.6 | 51.2 | 32.3 | 51.8 | 63.4 | | | | | DAT-S (RetinaNet) | 2022-01-03 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 47.8 | 68.4 | 51.1 | 30.2 | 50.8 | 59.1 | | | | | aLRP Loss (ResNeXt-101-64x4d, single scale) | 2020-09-28 |
Matrix Nets: A New Deep Architecture for Object Detection | ✓ Link | 47.8 | 66.2 | 52.3 | 29.7 | 50.4 | 60.7 | | | | | MatrixNet Corners (ResNet-152, multi-scale) | 2019-08-13 |
SOLQ: Segmenting Objects by Learning Queries | ✓ Link | 47.8 | | | | | | | | | | SOLQ (ResNet50, single scale) | 2021-06-04 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 47.7 | 65.7 | 51.9 | | | | | | | | DyHead (ResNeXt-64x4d-101) | 2021-06-15 |
Soft Anchor-Point Object Detection | ✓ Link | 47.4 | 67.4 | 51.1 | 28.1 | 50.3 | 61.5 | | | | | SAPD (ResNeXt-101, single-scale) | 2019-11-27 |
Path Aggregation Network for Instance Segmentation | ✓ Link | 47.4 | 67.2 | 51.8 | 30.1 | 51.7 | 60.0 | | | | | PANet (ResNeXt-101, multi-scale) | 2018-03-05 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 47.3 | 65.9 | 51.2 | 28.0 | 49.7 | 59.8 | 15G | | 71.7G | | HTC (HRNetV2p-W48) | 2019-08-20 |
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 47.1 | 63.9 | 44.7 | 22.8 | 43.9 | 54.6 | | | | | HTC (ResNeXt-101-FPN) | 2019-01-22 |
CenterNet: Keypoint Triplets for Object Detection | ✓ Link | 47.0 | 64.5 | 50.7 | 28.9 | 49.9 | 58.9 | | | | | CenterNet511 (Hourglass-104, multi-scale) | 2019-04-17 |
Multiple Anchor Learning for Visual Object Detection | ✓ Link | 47.0 | | | | | | | | | | MAL (ResNeXt101, multi-scale) | 2019-12-04 |
ISTR: End-to-End Instance Segmentation with Transformers | ✓ Link | 46.8 | | | | | | | | | | ISTR (ResNet50-FPN-3x) | 2021-05-03 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 46.7 | 66.3 | 50.6 | 29.1 | 50.1 | 61.7 | | | | | RetinaNet (SpineNet-49, 896x896) | 2019-12-10 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 46.5 | 67.4 | 50.9 | 30.3 | 49.7 | 57.1 | | | | | RPDet (ResNet-101-DCN, multi-scale) | 2019-04-25 |
HoughNet: Integrating near and long-range evidence for bottom-up object detection | ✓ Link | 46.4 | 65.1 | 50.7 | 29.1 | 48.5 | 58.1 | | | | | HoughNet (MS) | 2020-07-05 |
Reducing Label Noise in Anchor-Free Object Detection | ✓ Link | 46.3 | 64.8 | 51.6 | 31.4 | 49.9 | 56.4 | | | | | PPDet (ResNeXt-101-FPN, multiscale) | 2020-08-03 |
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection | ✓ Link | 46.2 | 64.3 | 50.5 | 27.8 | 49.9 | 57 | | | | | GFLV2 (ResNet-101) | 2020-11-25 |
SNIPER: Efficient Multi-Scale Training | ✓ Link | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1 | 29G | | | | SNIPER (ResNet-101) | 2018-05-23 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 46.1 | 64.0 | 50.3 | 27.1 | 48.6 | 58.3 | 15G | | 61.8G | | Mask R-CNN (HRNetV2p-W48 + cascade) | 2019-08-20 |
NAS-FCOS: Fast Neural Architecture Search for Object Detection | ✓ Link | 46.1 | | | | | | | | | | ResNeXt-64x4d-101 NAS-FCOS @128-256 w/improvements | 2019-06-11 |
Deformable ConvNets v2: More Deformable, Better Results | ✓ Link | 46.0 | 67.9 | 50.8 | 27.8 | 49.1 | 59.5 | | | | | DCNv2 (ResNet-101, multi-scale) | 2018-11-27 |
Localization Uncertainty Estimation for Anchor-Free Object Detection | | 46 | | | | | | | | | | Gaussian-FCOS | 2020-06-28 |
InstaBoost: Boosting Instance Segmentation via Probability Map Guided Copy-Pasting | ✓ Link | 45.9 | 64.2 | 50 | 26.3 | 49 | 58.6 | | | | | Cascade R-CNN-FPN (ResNet-101, map-guided) | 2019-08-21 |
Multiple Anchor Learning for Visual Object Detection | ✓ Link | 45.9 | | | | | | | | | | MAL (ResNeXt101, single-scale) | 2019-12-04 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 45.8 | 64.5 | | 27.8 | 48.3 | 57.6 | | | | | CenterMask+VoVNetV2-99 (single-scale) | 2019-11-15 |
An Analysis of Scale Invariance in Object Detection - SNIP | | 45.7 | 67.3 | 51.1 | 29.3 | 48.8 | 57.1 | | | | | D-RFCN + SNIP (DPN-98 with flip, multi-scale) | 2017-11-22 |
Scaled-YOLOv4: Scaling Cross Stage Partial Network | ✓ Link | 45.5 | 64.1 | 49.5 | 27 | 49 | 56.7 | | | | | YOLOv4 (CD53) | 2020-11-16 |
Attention-guided Context Feature Pyramid Network for Object Detection | ✓ Link | 45 | 64.4 | 49 | 26.9 | 47.7 | 56.6 | | | | | AC-FPN Cascade R-CNN(ResNet-101, single scale) | 2020-05-23 |
FreeAnchor: Learning to Match Anchors for Visual Object Detection | ✓ Link | 44.8 | 64.3 | 48.4 | 27 | 47.9 | 56 | | | | | FreeAnchor (ResNeXt-101) | 2019-09-05 |
FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 44.7 | 64.1 | 48.4 | 27.6 | 47.5 | 55.6 | | | | | FCOS (ResNeXt-64x4d-101-FPN 4 + improvements) | 2019-04-02 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.7 | 63.1 | 48.6 | 27.1 | | 55.9 | | | | | CenterMask+VoVNetV2-57 (single-scale) | 2019-11-15 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 44.6 | 65.2 | 48.6 | 29.7 | 47.1 | 54.6 | | | | | FSAF (ResNeXt-101, multi-scale) | 2019-03-02 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 44.6 | 65.0 | 47.5 | 24.6 | 48.1 | 58.3 | | | | | aLRP Loss (ResNeXt-101, DCN, 500 scale) | 2020-09-28 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.6 | 63.4 | 48.4 | | 47.2 | | | | | | CenterMask + X-101-32x8d (single-scale) | 2019-11-15 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 44.3 | 63.8 | 47.6 | 25.9 | 47.7 | 61.1 | | | | | RetinaNet (SpineNet-49, 640x640) | 2019-12-10 |
You Only Look One-level Feature | ✓ Link | 44.3 | 62.9 | 47.5 | 24.0 | 48.5 | 60.4 | | | | | YOLOF-DC5 | 2021-03-17 |
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection | ✓ Link | 44.3 | 62.3 | 48.5 | 26.8 | 47.7 | 54.1 | | | | | GFLV2 (ResNet-50) | 2020-11-25 |
Feature Intertwiner for Object Detection | ✓ Link | 44.2 | 67.5 | 51.1 | 27.2 | 50.3 | 57.7 | | | | | InterNet (ResNet-101-FPN, multi-scale) | 2019-03-28 |
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 44.2 | 64.6 | 49.3 | 29.2 | 47.9 | 55.1 | 34G | | | | M2Det (VGG-16, multi-scale) | 2018-11-12 |
LIP: Local Importance-based Pooling | ✓ Link | 43.9 | 65.7 | 48.1 | 25.4 | 46.7 | 56.3 | | | | | Faster R-CNN (LIP-ResNet-101-MD w FPN) | 2019-08-12 |
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 43.9 | 64.4 | 48 | 29.6 | 49.6 | 54.3 | 27G | | | | M2Det (ResNet-101, multi-scale) | 2018-11-12 |
Learning Spatial Fusion for Single-Shot Object Detection | ✓ Link | 43.9 | 64.1 | 49.2 | 27.0 | 46.6 | 53.4 | | | | | YOLOv3 @800 + ASFF* (Darknet-53) | 2019-11-21 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 43.9 | 63.5 | 47.7 | 26.8 | 46.9 | 55.6 | | | | | FoveaBox (ResNeXt-101) | 2019-04-08 |
Bottom-up Object Detection by Grouping Extreme and Center Points | ✓ Link | 43.7 | 60.5 | 47.0 | 24.1 | 46.9 | 57.6 | 180G | | | | ExtremeNet (Hourglass-104, multi-scale) | 2019-01-23 |
YOLOv4: Optimal Speed and Accuracy of Object Detection | ✓ Link | 43.5 | 65.7 | 47.3 | 26.7 | 46.7 | 53.3 | | | | | YOLOv4-608 | 2020-04-23 |
SNIPER: Efficient Multi-Scale Training | ✓ Link | 43.5 | 65.0 | 48.6 | 26.1 | 46.3 | 56.0 | 29G | | | | SNIPER (ResNet-50) | 2018-05-23 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 43.5 | | 46.5 | 22.2 | | 57.8 | 16G | | 21.7G | | CenterNet (HRNetV2-W48) | 2019-08-20 |
An Analysis of Scale Invariance in Object Detection - SNIP | | 43.4 | 65.5 | 48.4 | 27.2 | 46.5 | 54.9 | | | | | D-RFCN + SNIP (ResNet-101, multi-scale) | 2017-11-22 |
Grid R-CNN | ✓ Link | 43.2 | 63.0 | 46.6 | 25.1 | 46.5 | 55.2 | | | | | Grid R-CNN (ResNeXt-101-FPN) | 2018-11-29 |
FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 43.2 | 62.8 | 46.6 | 26.5 | 46.2 | 53.3 | | | | | FCOS (ResNeXt-101-64x4d-FPN) | 2019-04-02 |
CornerNet-Lite: Efficient Keypoint Based Object Detection | ✓ Link | 43.2 | | | 24.4 | 44.6 | 57.3 | | | | | CornerNet-Saccade (Hourglass-104, multi-scale) | 2019-04-18 |
PP-YOLOE: An evolved version of YOLO | ✓ Link | 43.1 | 60.5 | 46.6 | 23.2 | 46.4 | 56.9 | | | | | PP-YOLOE-s(CSPRepResNet-s, 640x640, single-scale) | 2022-03-30 |
Libra R-CNN: Towards Balanced Learning for Object Detection | ✓ Link | 43.0 | 64 | 47 | 25.3 | 45.6 | 54.6 | | | | | Libra R-CNN (ResNeXt-101-FPN) | 2019-04-04 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 43 | 60.7 | 46.8 | | | | | | | | DyHead (ResNet-50) | 2021-06-15 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 42.8 | 65.0 | 46.3 | 24.9 | 46.2 | 54.7 | | | | | RPDet (ResNet-101-DCN) | 2019-04-25 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 42.8 | 62.3 | 46.1 | 23.7 | 45.2 | 57.3 | | | | | SpineNet-49 (640, RetinaNet, single-scale) | 2019-12-10 |
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 | | | | | Cascade R-CNN (ResNet-101-FPN+, cascade) | 2017-12-03 |
Cascade R-CNN: High Quality Object Detection and Instance Segmentation | ✓ Link | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 | 15G | | | | Cascade R-CNN | 2019-06-24 |
Scale-Aware Trident Networks for Object Detection | ✓ Link | 42.7 | 63.6 | 46.5 | 23.9 | 46.6 | 56.6 | | | | | TridentNet (ResNet-101) | 2019-01-07 |
FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 42.7 | 62.2 | 46.1 | 26.0 | 45.6 | 52.6 | | | | | FCOS (ResNeXt-32x8d-101-FPN) | 2019-04-02 |
RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free | ✓ Link | 42.6 | 62.5 | 46.0 | 24.8 | 45.6 | 53.8 | 12G | | | | RetinaMask (ResNeXt-101-FPN-GN) | 2019-01-10 |
TOOD: Task-aligned One-stage Object Detection | ✓ Link | 42.5 | 60.3 | 46.4 | | | | | | | | TAL + TAP | 2021-08-17 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 42.4 | 63.6 | 46.4 | 24.9 | 44.6 | 53.0 | 16G | | 20.8G | | Faster R-CNN (HRNetV2p-W48) | 2019-08-20 |
Hierarchical Shot Detector | ✓ Link | 42.3 | 61.2 | 46.9 | 22.8 | 47.3 | 55.9 | | | | | HSD (Rest101, 768x768, single-scale test) | 2019-10-01 |
CornerNet: Detecting Objects as Paired Keypoints | ✓ Link | 42.1 | 57.8 | 45.3 | 20.8 | 44.8 | 56.7 | | | | | CornerNet511 (Hourglass-104, multi-scale) | 2018-08-03 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 42.1 | | | | | | | | | | FoveaBox (ResNeXt-101) | 2019-04-08 |
FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 42.0 | 60.4 | 45.3 | 25.4 | 45.0 | 51.0 | | | | | FCOS (HRNet-W32-5l) | 2019-04-02 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 41.9 | | | | | | | | | | FoveaBox (ResNeXt-101) | 2019-04-08 |
Single-Shot Refinement Neural Network for Object Detection | ✓ Link | 41.8 | 62.9 | 45.7 | 25.6 | 45.1 | 54.1 | | | | | RefineDet512+ (ResNet-101) | 2017-11-18 |
Gradient Harmonized Single-stage Detector | ✓ Link | 41.6 | 62.8 | 44.2 | 22.3 | 45.1 | 55.3 | | | | | GHM-C + GHM-R (RetinaNet-FPN-ResNeXt-101) | 2018-11-13 |
Objects as Points | ✓ Link | 41.6 | | | 21.5 | 43.9 | 56.0 | 26G | | | | CenterNet-DLA (DLA-34, multi-scale) | 2019-04-16 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 41.5 | 60.5 | 44.6 | 23.3 | 45 | 58 | | | | | RetinaNet (SpineNet-49S, 640x640) | 2019-12-10 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 41 | 62.9 | 44.3 | 23.6 | 44.1 | 51.7 | | | | | RPDet (ResNet-101) | 2019-04-25 |
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 41.0 | 59.7 | 45 | 22.1 | 46.5 | 53.8 | 34G | | | | M2Det (VGG-16, single-scale) | 2018-11-12 |
LeYOLO, New Scalable and Efficient CNN Architecture for Object Detection | ✓ Link | 41.0 | | | | | | | 2.4 | | 8.4 | LeYOLO (Large@768) | 2024-06-20 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 40.9 | 61.5 | 44 | 24 | 44.2 | 51.3 | 38G | | | | FSAF (ResNet-101, single-scale) | 2019-03-02 |
Focal Loss for Dense Object Detection | ✓ Link | 40.8 | 61.1 | 44.1 | 24.1 | 44.2 | 51.2 | 4G | | | | RetinaNet (ResNeXt-101-FPN) | 2017-08-07 |
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 40.6 | 59.9 | 44 | 22.6 | 42.7 | 52.1 | 12G | | | | Cascade R-CNN (ResNet-50-FPN+, cascade) | 2017-12-03 |
Cascade RPN: Delving into High-Quality Region Proposal Network with Adaptive Convolution | ✓ Link | 40.6 | 58.9 | 44.5 | 22.0 | 42.8 | 52.6 | 5G | | | | Faster R-CNN (Cascade RPN) | 2019-09-15 |
Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation | ✓ Link | 40.6 | | | 24.6 | 43.9 | 53.3 | | | | | ResNet-50-DW-DPN (Deformable Kernels) | 2019-10-07 |
Acquisition of Localization Confidence for Accurate Object Detection | ✓ Link | 40.6 | | | | | | | | | | IoU-Net | 2018-07-30 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 40.5 | 59.3 | | 23.4 | 42.6 | 51.0 | 16G | | 27.3G | | FCOS (HRNetV2p-W48) | 2019-08-20 |
Bounding Box Regression with Uncertainty for Accurate Object Detection | ✓ Link | 40.4 | | | | | | | | | | ResNet-50-FPN Mask R-CNN + KL Loss + var voting + soft-NMS | 2018-09-23 |
RDSNet: A New Deep Architecture for Reciprocal Object Detection and Instance Segmentation | ✓ Link | 40.3 | 60.1 | 43 | 22.1 | 43.5 | 51.5 | | | | | RDSNet (ResNet-101, RetinaNet, mask, MBRM) | 2019-12-11 |
Bottom-up Object Detection by Grouping Extreme and Center Points | ✓ Link | 40.2 | 55.5 | 43.2 | 20.4 | 43.2 | 53.1 | 180G | | | | ExtremeNet (Hourglass-104, single-scale) | 2019-01-23 |
Cross-Iteration Batch Normalization | ✓ Link | 40.1 | 60.5 | 44.1 | 35.8 | 57.3 | 38.5 | | | | | Mask R-CNN (ResNet-101-FPN, CBN) | 2020-02-13 |
Cascade RPN: Delving into High-Quality Region Proposal Network with Adaptive Convolution | ✓ Link | 40.1 | 59.4 | 43.8 | 22.1 | 42.4 | 51.6 | 5G | | | | Fast R-CNN (Cascade RPN) | 2019-09-15 |
Mask R-CNN | ✓ Link | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2 | 9G | | | | Mask R-CNN (ResNeXt-101-FPN) | 2017-03-20 |
Region Proposal by Guided Anchoring | ✓ Link | 39.8 | 59.2 | 43.5 | 21.8 | 42.6 | 50.7 | | | | | GA-Faster-RCNN | 2019-01-10 |
NAS-FCOS: Fast Neural Architecture Search for Object Detection | ✓ Link | 39.8 | | | | | | | | | | ResNet-50 NAS-FCOS @256 | 2019-06-11 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 39.8 | | | | | | | | | | A2MIM (ResNet-50 2x) | 2022-05-27 |
ChainerCV: a Library for Deep Learning in Computer Vision | ✓ Link | 39.5 | | | | | | | | | | FPN (ResNet101 backbone) | 2017-08-28 |
RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free | ✓ Link | 39.4 | 58.6 | 42.3 | 21.9 | 42.0 | 51.0 | 9G | | | | RetinaMask (ResNet-50-FPN) | 2019-01-10 |
LeYOLO, New Scalable and Efficient CNN Architecture for Object Detection | ✓ Link | 39.3 | | | | | | | | | 5.8 | LeYOLO (Medium@640) | 2024-06-20 |
Attention Augmented Convolutional Networks | ✓ Link | 39.2 | | | | | | | | 24.5G | | AA-ResNet-101 + RetinaNet | 2019-04-22 |
Multiple Anchor Learning for Visual Object Detection | ✓ Link | 39.2 | | | | | | | | | | MAL (ResNet50, single-scale) | 2019-12-04 |
Focal Loss for Dense Object Detection | ✓ Link | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 | 4G | | | | RetinaNet (ResNet-101-FPN) | 2017-08-07 |
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 38.8 | 61.1 | 41.9 | 21.3 | 41.8 | 49.8 | 3G | | | | Cascade R-CNN (ResNet-101-FPN+) | 2017-12-03 |
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 38.8 | 59.4 | 41.7 | 20.5 | 43.9 | 53.4 | 27G | | | | M2Det (ResNet-101, single-scale) | 2018-11-12 |
SaccadeNet: A Fast and Accurate Object Detector | ✓ Link | 38.5 | 55.6 | 41.4 | 19.2 | 42.1 | 50.6 | 46G | | | | SaccadeNet (DLA-34-DCN) | 2020-03-26 |
Mask R-CNN | ✓ Link | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 | 9G | | | | Mask R-CNN (ResNet-101-FPN) | 2017-03-20 |
LeYOLO, New Scalable and Efficient CNN Architecture for Object Detection | ✓ Link | 38.2 | | | | | | | 1.9 | | 4.51 | LeYOLO (Small@640) | 2024-06-20 |
Segmentation is All You Need | | 38.1 | | | | | | | | | | WSMA-Seg | 2019-04-30 |
Compact Global Descriptor for Neural Networks | ✓ Link | 37.9 | | | | | | | | | | Faster R-CNN + FPN + CGD | 2019-07-23 |
CornerNet: Detecting Objects as Paired Keypoints | ✓ Link | 37.8 | 53.7 | 40.1 | 17.0 | 39.0 | 50.5 | | | | | CornerNet511 (Hourglass-52, single-scale) | 2018-08-03 |
Single-Shot Refinement Neural Network for Object Detection | ✓ Link | 37.6 | 58.7 | 40.8 | 22.7 | 40.3 | 48.3 | | | | | RefineDet512+ (VGG-16) | 2017-11-18 |
Deformable Convolutional Networks | ✓ Link | 37.5 | 58.0 | | 19.4 | 40.1 | 52.5 | | | | | DeformConv-R-FCN (Aligned-Inception-ResNet) | 2017-03-17 |
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era | ✓ Link | 37.4 | 58.0 | 40.1 | 17.5 | 41.1 | 51.2 | | | | | Faster R-CNN (ImageNet+300M) | 2017-07-10 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 36.9 | | | | | | | | | | Mask R-CNN (Bottleneck-injected ResNet-50, FPN) | 2020-11-25 |
Beyond Skip Connections: Top-Down Modulation for Object Detection | ✓ Link | 36.8 | | | | | | | | | | Faster R-CNN + TDM | 2016-12-20 |
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 36.5 | 59.0 | 39.2 | 20.3 | 38.8 | 46.4 | 3G | | | | Cascade R-CNN (ResNet-50-FPN+) | 2017-12-03 |
Single-Shot Refinement Neural Network for Object Detection | ✓ Link | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4 | | | | | RefineDet512 (ResNet-101) | 2017-11-18 |
Feature Pyramid Networks for Object Detection | ✓ Link | 36.2 | | | | | | 2G | | | | Faster R-CNN + FPN | 2016-12-09 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 35.9 | | | | | | | | | | Faster R-CNN (Bottleneck-injected ResNet-50 and FPN) | 2020-11-25 |