VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 88.7 | VideoMAE V2-g | 2023-03-29 |
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification | ✓ Link | 88.6 | DejaVid | 2025-01-01 |
Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors | | 87.56 | DEEP-HAL with ODF+SDF(I3D) | 2020-01-14 |
High-order Tensor Pooling with Attention for Action Recognition | | 87.21 | TO+MaxExp+IDT | 2021-10-11 |
Tensor Representations for Action Recognition | ✓ Link | 86.11 | SCK⊕(I3D)+IDT | 2020-12-28 |
High-order Tensor Pooling with Attention for Action Recognition | | 85.70 | SO+MaxExp+IDT | 2021-10-11 |
Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition | ✓ Link | 85.10 | R2+1D-BERT | 2020-08-03 |
Pose And Joint-Aware Action Recognition | ✓ Link | 84.53 | Ours + ResNext101 BERT | 2020-10-16 |
SMART Frame Selection for Action Recognition | | 84.36 | SMART | 2020-12-19 |
Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 83.8 | OmniSource (SlowOnly-8x8-R101-RGB + I3D Flow) | 2020-03-29 |
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | ✓ Link | 83.4 | ZeroI2V ViT-L/14 | 2023-10-02 |
PERF-Net: Pose Empowered RGB-Flow Net | | 83.2 | PERF-Net (distilled S3D-G) | 2020-09-28 |
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | ✓ Link | 83.1 | BIKE | 2022-12-31 |
Bubblenet: A Disperse Recurrent Structure To Recognize Activities | | 82.60 | BubbleNET | 2020-10-30 |
Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition with CNNs | | 82.48 | HAF+BoW/FV halluc | 2019-06-13 |
Cooperative Cross-Stream Network for Discriminative Action Representation | | 81.9 | CCS + TSN (ImageNet+Kinetics pretrained) | 2019-08-27 |
Representation Flow for Action Recognition | ✓ Link | 81.1 | RepFlow-50 ([2+1]D CNN, FcF, Non-local block) | 2018-10-02 |
Contextual Action Cues from Camera Sensor for Multi-Stream Action Recognition | | 80.92 | Multi-stream I3D | 2019-03-20 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 80.9 | MARS+RGB+Flow (64 frames, Kinetics pretrained) | 2019-06-01 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 80.9 | Two-stream I3D | 2017-05-22 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 80.7 | Two-Stream I3D (ImageNet+Kinetics pre-training) | 2017-05-22 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 80.5 | LGD-3D Two-stream | 2019-06-13 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 80.5 | D3D + D3D | 2018-12-19 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 79.6 | AMD(ViT-B/16) | 2023-11-06 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 79.3 | D3D (Kinetics-600 pretraining) | 2018-12-19 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 78.9 | LGD-3D Flow | 2019-06-13 |
Hidden Two-Stream Convolutional Networks for Action Recognition | ✓ Link | 78.7 | Hidden Two-Stream | 2017-04-02 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 78.7 | R[2+1]D-TwoStream (Kinetics pretrained) | 2017-11-30 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 78.7 | D3D (Kinetics-400 pretraining) | 2018-12-19 |
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | | 77.8 | I3D RGB + DMC-Net (I3D) | 2019-01-11 |
Busy-Quiet Video Disentangling for Video Classification | ✓ Link | 77.6 | BQN | 2021-03-29 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 77.4 | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 77.3 | Flow-I3D (Kinetics pre-training) | 2017-05-22 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 77.1 | Flow-I3D (ImageNet+Kinetics pre-training) | 2017-05-22 |
Large Scale Holistic Video Understanding | ✓ Link | 76.5 | HATNet (32 frames) | 2019-04-25 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 76.4 | R[2+1]D-Flow (Kinetics pretrained) | 2017-11-30 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 75.9 | S3D-G (ImageNet, Kinetics-400 pretrained) | 2017-12-13 |
FASTER Recurrent Networks for Efficient Video Classification | | 75.7 | FASTER32 (Kinetics pretrain) | 2019-06-10 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 75.7 | LGD-3D RGB | 2019-06-13 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 74.8 | RGB-I3D (ImageNet+Kinetics pre-training) | 2017-05-22 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 74.5 | R[2+1]D-RGB (Kinetics pretrained) | 2017-11-30 |
VidTr: Video Transformer Without Convolutions | | 74.4 | VidTr-L | 2021-04-23 |
Contrastive Video Representation Learning via Adversarial Perturbations | | 74.3 | ADL+ResNet+IDT | 2018-07-24 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 74.3 | RGB-I3D (Kinetics pre-training) | 2017-05-22 |
Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition | ✓ Link | 74.2 | Optical Flow Guided Feature | 2017-11-29 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 72.7 | R[2+1]D-TwoStream (Sports1M pretrained) | 2017-11-30 |
End-to-End Learning of Motion Representation for Video Understanding | ✓ Link | 72.6 | TVNet+IDT | 2018-04-02 |
Spatiotemporal Multiplier Networks for Video Action Recognition | ✓ Link | 72.2 | STM Network+IDT | 2017-07-01 |
Attention Distillation for Learning Video Representations | | 72.0 | Prob-Distill | 2019-04-05 |
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | | 71.8 | DMC-Net (I3D) | 2019-01-11 |
Learning spatio-temporal representations with temporal squeeze pooling | | 71.5 | TesNet (ImageNet pretrained) | 2020-02-11 |
Hierarchical Feature Aggregation Networks for Video Action Recognition | | 71.13 | HF-ECOLite (ImageNet+Kinetics pretrain) | 2019-05-29 |
Appearance-and-Relation Networks for Video Classification | ✓ Link | 70.9 | ARTNet w/ TSN | 2017-11-24 |
Spatiotemporal Residual Networks for Video Action Recognition | ✓ Link | 70.3 | ST-ResNet + IDT | 2016-11-07 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 70.1 | R[2+1]D-Flow (Sports1M pretrained) | 2017-11-30 |
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | ✓ Link | 69.4 | Temporal Segment Networks | 2016-08-02 |
TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition | ✓ Link | 69 | TS-LSTM | 2017-03-30 |
Self-supervised Video Transformer | ✓ Link | 67.2 | SVT | 2021-12-02 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 66.6 | R[2+1]D-RGB (Sports1M pretrained) | 2017-11-30 |
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors | ✓ Link | 65.9 | TDD + IDT | 2015-05-19 |
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | ✓ Link | 65.9 | VIMPAC | 2021-06-21 |
Convolutional Two-Stream Network Fusion for Video Action Recognition | ✓ Link | 65.4 | S:VGG-16, T:VGG-16 (ImageNet pretrained) | 2016-04-22 |
Dynamic Image Networks for Action Recognition | ✓ Link | 65.2 | Dynamic Image Networks + IDT | 2016-06-01 |
Long-term Temporal Convolutions for Action Recognition | ✓ Link | 64.8 | LTC | 2016-04-15 |
R-STAN: Residual Spatial-Temporal Attention Network for Action Recognition | | 62.8 | R-STAN-50 | 2019-06-19 |
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | | 62.8 | DMC-Net (ResNet-18) | 2019-01-11 |
SUSiNet: See, Understand and Summarize it | | 62.7 | SUSiNet (multi, Kinetics pretrained) | 2018-12-03 |
Two-Stream Convolutional Networks for Action Recognition in Videos | ✓ Link | 59.4 | Two-Stream (ImageNet pretrained) | 2014-06-09 |
ActionFlowNet: Learning Motion Representation for Action Recognition | | 56.4 | ActionFlowNet | 2016-12-09 |
R-STAN: Residual Spatial-Temporal Attention Network for Action Recognition | | 55.16 | R-STAN-152 | 2019-06-19 |
ConvNet Architecture Search for Spatiotemporal Feature Learning | ✓ Link | 54.9 | Res3D | 2017-08-16 |
DistInit: Learning Video Representations Without a Single Labeled Video | | 54.8 | R(2+1)D-18 (DistInit pretraining) | 2019-01-26 |
Pose And Joint-Aware Action Recognition | ✓ Link | 54.2 | JRMN | 2020-10-16 |
Towards Universal Representation for Unseen Action Recognition | | 51.8 | CD-UAR | 2018-03-22 |
Learning Spatiotemporal Features with 3D Convolutional Networks | ✓ Link | 51.6 | C3D | 2014-12-02 |
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples | ✓ Link | 49.2 | R[2+1]D (VideoMoCo) | 2021-03-10 |
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples | ✓ Link | 43.6 | 3D-ResNet-18 (VideoMoCo) | 2021-03-10 |