EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 57.8 | 28.86 | EVA | 2022-11-14 |
NMS Strikes Back | ✓ Link | 48.5 | 20.15 | DETA (Swin-L) | 2022-12-12 |
Grounded Language-Image Pre-training | ✓ Link | 48.0 | 24.89 | GLIP-L (Swin-L) | 2021-12-07 |
GRiT: A Generative Region-to-text Transformer for Object Understanding | ✓ Link | 42.9 | 15.72 | GRiT (ViT-H) | 2022-12-01 |
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 42.1 | 15.76 | DINO (Swin-L) | 2022-03-07 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 39.0 | 12.36 | CBNetV2 (Swin-L) | 2021-07-01 |
A ConvNet for the 2020s | ✓ Link | 37.5 | 12.68 | ConvNeXt-XL (Cascade Mask R-CNN) | 2022-01-10 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 37.0 | 11.72 | InternImage-L (Cascade Mask R-CNN) | 2022-11-10 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 35.3 | 10.00 | DyHead (Swin-L) | 2021-06-15 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 34.3 | | ViTDet (ViT-H) | 2022-03-30 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 34.25 | 7.79 | ViT-Adapter (BEiTv2-L) | 2022-05-17 |
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | ✓ Link | 33.7 | 11.43 | FIBER-B (Swin-B) | 2022-06-15 |
Instances as Queries | ✓ Link | 33.2 | 8.26 | QueryInst (Swin-L) | 2021-05-05 |
YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications | ✓ Link | 32.5 | 6.73 | YOLOv6-L6 | 2022-09-07 |
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors | ✓ Link | 32.0 | 6.42 | YOLOv7-E6E | 2022-07-06 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 30.9 | 5.62 | MViTV2-H (Cascade Mask R-CNN) | 2021-12-02 |
Robust and Accurate Object Detection via Adversarial Learning | ✓ Link | 30.8 | 7.34 | Det-AdvProp (EfficientNet-B5) | 2021-03-23 |
YOLOv4: Optimal Speed and Accuracy of Object Detection | ✓ Link | 30.4 | 5.89 | YOLOv4-P6 | 2020-04-23 |
YOLOX: Exceeding YOLO Series in 2021 | ✓ Link | 30.3 | 7.26 | YOLOX-X | 2021-07-18 |
Probabilistic two-stage detection | ✓ Link | 29.5 | 4.29 | CenterNet2 (R2-101-DCN) | 2021-03-12 |
Grounded Language-Image Pre-training | ✓ Link | 29.1 | 8.11 | GLIP-T (Swin-T) | 2021-12-07 |
EfficientDet: Scalable and Efficient Object Detection | ✓ Link | 28.5 | 5.44 | EfficientDet-D5 (EfficientNet-B5) | 2019-11-20 |
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 28.2 | 6.85 | PVTv2-B5 (Mask R-CNN) | 2021-06-25 |
VarifocalNet: An IoU-aware Dense Object Detector | ✓ Link | 28.0 | 5.27 | VFNet (RX-101-64x4d) | 2020-08-31 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 26.0 | 4.38 | GCNet (RX-101-32x4d-DCN) | 2019-04-25 |
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection | ✓ Link | 25.1 | 2.6 | GFLv2 (R2-101-DCN) | 2020-11-25 |
RepPoints V2: Verification Meets Regression for Object Detection | ✓ Link | 24.9 | 2.7 | RepPointsV2 (RX-101-64x4d-DCN) | 2020-07-16 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 24.8 | | UniverseNet (R2-101-DCN) | 2021-03-25 |
YOLOX: Exceeding YOLO Series in 2021 | ✓ Link | 20.6 | 2.48 | YOLOX-S | 2021-07-18 |
You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection | ✓ Link | 20.0 | 1.05 | YOLOS-B (ViT-B) | 2021-06-01 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 19.3 | 0.16 | DyHead (ResNet-50) | 2021-06-15 |
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 19.1 | 0.08 | HTC (ResNet-50) | 2019-01-22 |
Deformable DETR: Deformable Transformers for End-to-End Object Detection | ✓ Link | 18.5 | -1.49 | Deformable-DETR (ResNet-50) | 2020-10-08 |
Cascade R-CNN: High Quality Object Detection and Instance Segmentation | ✓ Link | 18.2 | 0.02 | Cascade R-CNN (ResNet-50) | 2019-06-24 |
Mask R-CNN | ✓ Link | 17.1 | | Mask R-CNN (ResNet-50) | 2017-03-20 |
End-to-End Object Detection with Transformers | ✓ Link | 17.1 | -1.82 | DETR (ResNet-50) | 2020-05-26 |
Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection | ✓ Link | 16.8 | -0.91 | ATSS (ResNet-50) | 2019-12-05 |
FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 16.7 | 0.25 | FCOS (ResNet-50) | 2019-04-02 |
Focal Loss for Dense Object Detection | ✓ Link | 16.6 | 0.18 | RetinaNet (ResNet-50) | 2017-08-07 |
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | ✓ Link | 16.4 | -0.41 | Faster R-CNN (ResNet-50-FPN) | 2015-06-04 |
YOLOv3: An Incremental Improvement | ✓ Link | 14.8 | -0.37 | YOLOv3 (DarkNet-53) | 2018-04-08 |
SSD: Single Shot MultiBox Detector | ✓ Link | 13.6 | 0.36 | SSD (VGG-16) | 2015-12-08 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | | 7.89 | ViTDet (ViT-H) | 2022-03-30 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | | 1.86 | UniverseNet (R2-101-DCN) | 2021-03-25 |
Mask R-CNN | ✓ Link | | -0.11 | Mask R-CNN (ResNet-50) | 2017-03-20 |