MaxViT: Multi-Axis Vision Transformer | ✓ Link | 53.4 | | | 72.9 | 58.1 | 45.7 | 70.3 | 50.0 | MaxViT-B | 2022-04-04 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 53.1 | | | 72.5 | 58.1 | 45.4 | 69.8 | 49.5 | MaxViT-S | 2022-04-04 |
MaxViT: Multi-Axis Vision Transformer | ✓ Link | 52.1 | | | 71.9 | 56.8 | 44.6 | 69.1 | 48.4 | MaxViT-T | 2022-04-04 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 50.2 | | | | | | | | DAT-S++ | 2023-09-04 |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention | ✓ Link | 49.2 | | | | | | | | DAT-T++ | 2023-09-04 |
Stochastic Subsampling With Average Pooling | | 42.1 | | | 59.4 | 45.9 | | | | DyHead (SAP) | 2024-09-25 |
On the Ideal Number of Groups for Isometric Gradient Propagation | | 40.7 | | | 61.2 | 44.6 | | | | Faster R-CNN (ideal number of groups) | 2023-02-07 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | | 56.4 | | | | | | | UniRepLKNet-XL++ | 2023-11-27 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | | 55.8 | | | | | | | UniRepLKNet-L++ | 2023-11-27 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | | 54.8 | | | | | | | UniRepLKNet-B++ | 2023-11-27 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | | 54.3 | | | | | | | UniRepLKNet-S++ | 2023-11-27 |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | | 54.1 | | | | | | | MixMIM-L | 2022-05-26 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | | 53.0 | | | | | | | UniRepLKNet-S | 2023-11-27 |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | ✓ Link | | 52.2 | | | | | | | MixMIM-B | 2022-05-26 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition | ✓ Link | | 51.7 | | | | | | | UniRepLKNet-T | 2023-11-27 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | | 48.6 | | | | | | | BiFormer-B (IN1k pretrain, MaskRCNN 12ep) | 2023-03-15 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | | 48.5 | | | | | | | DeBiFormer-B (IN1k pretrain, MaskRCNN 12ep) | 2024-10-11 |
BiFormer: Vision Transformer with Bi-Level Routing Attention | ✓ Link | | 47.8 | | | | | | | BiFormer-S (IN1k pretrain, MaskRCNN 12ep) | 2023-03-15 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | | 47.5 | | | | | | | DeBiFormer-S (IN1k pretrain, MaskRCNN 12ep) | 2024-10-11 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | | 47.1 | | | | | | | DeBiFormer-B (IN1k pretrain, Retina) | 2024-10-11 |
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention | ✓ Link | | 45.6 | | | | | | | DeBiFormer-S (IN1k pretrain, Retina) | 2024-10-11 |
YOLO-Drone: Airborne real-time detection of dense small objects from high-altitude perspective | | | 35.45 | | | | | | | YOLO-Drone | 2023-04-14 |
Benchmark for Generic Product Detection: A Low Data Baseline for Dense Object Detection | ✓ Link | | | 3153 | | | | | | retinanet | 2019-12-19 |
Paint Transformer: Feed Forward Neural Painting with Stroke Prediction | ✓ Link | | | 4.2 | | | | | | Lpixel | 2021-08-09 |