TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ Link | 66.3 | | TokenLearner | 2021-06-21 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 66.2 | | TubeViT-L | 2022-12-06 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 63.2 | | MoViNet-A6 | 2021-03-21 |
Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors | | 62.29 | | DEEP-HAL with ODF+SDF (AssembleNet++) | 2020-01-14 |
AssembleNet++: Assembling Modality Representations via Attention Connections | ✓ Link | 59.8 | | AssembleNet++ 50 | 2020-08-18 |
AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures | ✓ Link | 58.6 | | AssembleNet | 2019-05-30 |
AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures | ✓ Link | 58.6 | | AssembleNet-101 | 2019-05-30 |
VicTR: Video-conditioned Text Representations for Activity Recognition | | 57.6 | | VicTR (ViT-L/14) | 2023-04-05 |
AssembleNet++: Assembling Modality Representations via Attention Connections | ✓ Link | 54.98 | | AssembleNet++ 50 without object | 2020-08-18 |
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | ✓ Link | 50.7 | | BIKE | 2022-12-31 |
Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors | | 50.16 | | DEEP-HAL with ODF+SDF (I3D) | 2020-01-14 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 48.5 | | MoViNet-A4 | 2021-03-21 |
Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition | | 47.8 | | AdaFocus (weak supervision, MViT-B-24, 32x3) | 2023-11-28 |
Multiscale Vision Transformers | ✓ Link | 47.7 | | MViT-B-24, 32x3 (Kinetics-600 pretraining) | 2021-04-22 |
VidTr: Video Transformer Without Convolutions | | 47.3 | | En-VidTr-L | 2021-04-23 |
Multiscale Vision Transformers | ✓ Link | 47.1 | | MViT-B, 32x3 (Kinetics-600 pretraining) | 2021-04-22 |
Multiscale Vision Transformers | ✓ Link | 46.3 | | MViT-B-24, 32x3 (Kinetics-400 pretraining) | 2021-04-22 |
SlowFast Networks for Video Recognition | ✓ Link | 45.2 | | SlowFast (Kinetics-600 pretraining, NL) | 2018-12-10 |
Multiscale Vision Transformers | ✓ Link | 44.3 | | MViT-B, 32x3 (Kinetics-400 pretraining) | 2021-04-22 |
ActionCLIP: A New Paradigm for Video Action Recognition | ✓ Link | 44.3 | | ActionCLIP (ViT-B/16) | 2021-09-17 |
Multiscale Vision Transformers | ✓ Link | 43.9 | | MViT-B, 16x4 (Kinetics-600 pretraining) | 2021-04-22 |
VidTr: Video Transformer Without Convolutions | | 43.5 | | VidTr-L | 2021-04-23 |
Pose And Joint-Aware Action Recognition | ✓ Link | 43.23 | | JMRN + R101-NL-LFB | 2020-10-16 |
Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition with CNNs | | 43.1 | | HAF+BoW/FV/OFF halluc. +MSK×8/PN | 2019-06-13 |
Long-Term Feature Banks for Detailed Video Understanding | ✓ Link | 42.5 | | LFB | 2018-12-12 |
SlowFast Networks for Video Recognition | ✓ Link | 42.5 | | SlowFast (Kinetics-400 pretraining, NL) | 2018-12-10 |
SlowFast Networks for Video Recognition | ✓ Link | 42.1 | | SlowFast (Kinetics-600 pretraining) | 2018-12-10 |
Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition | | 41.4 | | AdaFocus (weak supervision, MViT-B-K400-pretrain, 16x4) | 2023-11-28 |
Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition | | 41.2 | | AdaFocus (weak supervision, X3D-L, 32x3) | 2023-11-28 |
Timeception for Complex Action Recognition | ✓ Link | 41.1 | | Timeception (R3D) | 2018-12-04 |
PA3D: Pose-Action 3D Machine for Video Recognition | | 41 | | PA3D + (GCN + I3D + NL I3D) | 2019-06-01 |
PoTion: Pose MoTion Representation for Action Recognition | | 40.8 | | PoTion + (GCN + I3D + NL I3D) | 2018-06-01 |
Multiscale Vision Transformers | ✓ Link | 40 | | MViT-B, 16x4 (Kinetics-400 pretraining) | 2021-04-22 |
Videos as Space-Time Region Graphs | | 39.7 | | STRG | 2018-06-05 |
Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition | | 39.3 | | AdaFocus (weak supervision, SlowFast-R50, 16x8) | 2023-11-28 |
Revisiting spatio-temporal layouts for compositional action recognition | ✓ Link | 38.5 | | STLT + I3D | 2021-11-02 |
Evolving Space-Time Neural Architectures for Videos | | 38.1 | | EvaNet | 2018-11-26 |
Timeception for Complex Action Recognition | ✓ Link | 37.2 | | Timeception (I3D) | 2018-12-04 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 32.9 | | I3D | 2017-05-22 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 32.5 | | MoViNet-A2 | 2021-03-21 |
Timeception for Complex Action Recognition | ✓ Link | 31.6 | | Timeception (R2D) | 2018-12-04 |
Temporal Relational Reasoning in Videos | ✓ Link | 25.2 | | MultiScale TRN | 2017-11-22 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 25.2 | 6.9x1 | Co Slow_64 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 24.1 | 54.9x1 | Slow-8×8 | 2021-05-31 |
Asynchronous Temporal Fields for Action Recognition | ✓ Link | 22.4 | | Asyn-TF | 2016-12-19 |
Compressed Video Action Recognition | ✓ Link | 21.9 | | CoViAR | 2017-12-02 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 21.5 | 6.9x1 | Co Slow_8 | 2021-05-31 |
Two-Stream Convolutional Networks for Action Recognition in Videos | ✓ Link | 18.6 | | 2-Strm | 2014-06-09 |
Pose And Joint-Aware Action Recognition | ✓ Link | 16.2 | | JMRN (Pose only) | 2020-10-16 |