Enhancing Video Transformers for Action Understanding with VLM-aided Training | | 99.7 | | | FTP-UniFormerV2-L/14 | 2024-03-24 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 99.6 | | | VideoMAE V2-g | 2023-03-29 |
OmniVec: Learning robust representations with cross modal sharing | | 99.6 | | | OmniVec | 2023-11-07 |
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 99.6 | | | OmniVec2 | 2024-01-01 |
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | ✓ Link | 98.8 | | | BIKE | 2022-12-31 |
SMART Frame Selection for Action Recognition | | 98.64 | | | SMART | 2020-12-19 |
Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 98.6 | | | OmniSource (SlowOnly-8x8-R101-RGB + I3D-Flow) | 2020-03-29 |
PERF-Net: Pose Empowered RGB-Flow Net | | 98.6 | | | PERF-Net (multi-distilled S3D) | 2020-09-28 |
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | ✓ Link | 98.6 | | | ZeroI2V ViT-L/14 | 2023-10-02 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 98.2 | | | LGD-3D Two-stream | 2019-06-13 |
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | ✓ Link | 98.2 | | | Text4Vis | 2022-07-04 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 98.0 | | | Two-Stream I3D (Imagenet+Kinetics pre-training) | 2017-05-22 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 97.8 | | | MARS+RGB+Flow (64 frames, Kinetics pretrained) | 2019-06-01 |
Large Scale Holistic Video Understanding | ✓ Link | 97.8 | | | HATNet (32 frames) | 2019-04-25 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 97.8 | | | Two-Stream I3D (Kinetics pre-training) | 2017-05-22 |
Bubblenet: A Disperse Recurrent Structure To Recognize Activities | | 97.62 | | | BubbleNET | 2020-10-30 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 97.6 | | | D3D + D3D | 2018-12-19 |
Busy-Quiet Video Disentangling for Video Classification | ✓ Link | 97.6 | | | BQN | 2021-03-29 |
Cooperative Cross-Stream Network for Discriminative Action Representation | | 97.4 | | | CCS + TSN (ImageNet+Kinetics pretrained) | 2019-08-27 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 97.3 | | | R[2+1]D-TwoStream (Kinetics pretrained) | 2017-11-30 |
Contextual Action Cues from Camera Sensor for Multi-Stream Action Recognition | | 97.2 | | | Multi-stream I3D | 2019-03-20 |
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | | 97.2 | | | CA2ST(B/16) | 2025-03-30 |
Hidden Two-Stream Convolutional Networks for Action Recognition | ✓ Link | 97.1 | | | Hidden Two-Stream | 2017-04-02 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 97.1 | | | D3D (Kinetics-600 pretraining) | 2018-12-19 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 97.1 | | | AMD(ViT-B/16) | 2023-11-06 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 97 | | | D3D (Kinetics-400 pretraining) | 2018-12-19 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 97 | | | LGD-3D RGB | 2019-06-13 |
An Image is Worth 16x16 Words, What is a Video Worth? | ✓ Link | 97 | | | STAM-32 (ImageNet/Kinetics pretraining) | 2021-03-25 |
FASTER Recurrent Networks for Efficient Video Classification | | 96.9 | | | FASTER32 | 2019-06-10 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 96.8 | | | R[2+1]D-RGB (Kinetics pretrained) | 2017-11-30 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 96.8 | | | S3D-G (ImageNet, Kinetics-400 pretrained) | 2017-12-13 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 96.8 | | | LGD-3D Flow | 2019-06-13 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 96.7 | | | Flow-I3D (Imagenet+Kinetics pre-training) | 2017-05-22 |
VidTr: Video Transformer Without Convolutions | | 96.7 | | | VidTr-L | 2021-04-23 |
Two-Stream Video Classification with Cross-Modality Attention | | 96.5 | | | CMA iter1-S | 2019-08-01 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 96.5 | | | Flow-I3D (Kinetics pre-training) | 2017-05-22 |
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | | 96.5 | | | I3D RGB + DMC-Net (I3D) | 2019-01-11 |
$A^2$-Nets: Double Attention Networks | | 96.4 | | | A2-Net (ResNet-50) | 2018-10-27 |
Multi-Fiber Networks for Video Recognition | | 96.0 | | | MF-Net, RGB only (ImageNet+Kinetics pretrained) | 2018-07-30 |
Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition | ✓ Link | 96 | | | Optical Flow Guided Feature | 2017-11-29 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 95.8 | | | MARS+RGB+Flow (16 frames) | 2019-06-01 |
Attention Distillation for Learning Video Representations | | 95.7 | | | Prob-Distill | 2019-04-05 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 95.6 | | | RGB-I3D (Imagenet+Kinetics pre-training) | 2017-05-22 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 95.5 | | | R[2+1]D-Flow (Kinetics pretrained) | 2017-11-30 |
End-to-End Learning of Motion Representation for Video Understanding | ✓ Link | 95.4 | | | TVNet+IDT | 2018-04-02 |
Learning spatio-temporal representations with temporal squeeze pooling | | 95.2 | | | TesNet (ImageNet pretrained) | 2020-02-11 |
I3D-LSTM: A New Model for Human Action Recognition | ✓ Link | 95.1 | | | I3D-LSTM | 2019-08-09 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 95.1 | | | RGB-I3D (Kinetics pre-training) | 2017-05-22 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 95 | | | R[2+1]D-TwoStream (Sports-1M pretrained) | 2017-11-30 |
LIGAR: Lightweight General-purpose Action Recognition | ✓ Link | 94.85 | | | X3D MobileNet-V3 LGD-GC | 2021-08-30 |
Spatiotemporal Residual Networks for Video Action Recognition | ✓ Link | 94.6 | | | ST-ResNet + IDT | 2016-11-07 |
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? | ✓ Link | 94.5 | | | ResNeXt-101 (64f) | 2017-11-27 |
R-STAN: Residual Spatial-Temporal Attention Network for Action Recognition | | 94.5 | | | R-STAN-101 | 2019-06-19 |
Temporal-Spatial Mapping for Action Recognition | | 94.3 | | | TSN+TSM | 2018-09-11 |
Appearance-and-Relation Networks for Video Classification | ✓ Link | 94.3 | | | ARTNet w/ TSN | 2017-11-24 |
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | ✓ Link | 94.2 | | | Temporal Segment Networks | 2016-08-02 |
TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition | ✓ Link | 94.1 | | | TS-LSTM | 2017-03-30 |
Self-supervised Video Transformer | ✓ Link | 93.7 | | | SVT | 2021-12-02 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 93.6 | | | R[2+1]D-RGB (Sports-1M pretrained) | 2017-11-30 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 93.4 | | | Two-stream I3D | 2017-05-22 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 93.3 | | | R[2+1]D-Flow (Sports-1M pretrained) | 2017-11-30 |
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | ✓ Link | 92.7 | | | VIMPAC | 2021-06-21 |
Convolutional Two-Stream Network Fusion for Video Action Recognition | ✓ Link | 92.5 | | | S:VGG-16, T:VGG-16 (ImageNet pretrain) | 2016-04-22 |
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | | 92.3 | | | DMC-Net (I3D) | 2019-01-11 |
Dance with Flow: Two-in-One Stream Action Detection | ✓ Link | 92 | | | two-in-one two stream | 2019-04-01 |
Long-term Temporal Convolutions for Action Recognition | ✓ Link | 91.7 | | | LTC | 2016-04-15 |
R-STAN: Residual Spatial-Temporal Attention Network for Action Recognition | | 91.5 | | | R-STAN-50 | 2019-06-19 |
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors | ✓ Link | 91.5 | | | TDD + IDT | 2015-05-19 |
Towards Good Practices for Very Deep Two-Stream ConvNets | ✓ Link | 91.4 | | | Very deep two-stream ConvNet | 2015-07-08 |
Efficient Action Recognition Using Confidence Distillation | | 91.2 | | | 3D ResNeXt-101 + Confidence Distillation | 2021-09-05 |
Multi-region two-stream R-CNN for action detection | | 91.1 | | | MR Two-Stream R-CNN | 2016-09-17 |
Dynamic Image Networks for Action Recognition | ✓ Link | 89.1 | | | Dynamic Image Networks + IDT | 2016-06-01 |
Beyond Short Snippets: Deep Networks for Video Classification | ✓ Link | 88.6 | | | Two-stream+LSTM | 2015-03-31 |
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks | ✓ Link | 88.6 | | | P3D (ImageNet + Sports1M) | 2017-11-28 |
Two-Stream Convolutional Networks for Action Recognition in Videos | ✓ Link | 88.0 | | | Two-Stream (ImageNet pretrained) | 2014-06-09 |
Real-time Action Recognition with Enhanced Motion Vector CNNs | ✓ Link | 86.4 | | | MV-CNN | 2016-04-26 |
Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer | ✓ Link | 86.1 | | | Dynamics 2 for DenseNet-201 Transformer | 2023-02-17 |
DistInit: Learning Video Representations Without a Single Labeled Video | | 85.8 | | | R(2+1)D-18 (DistInit pretraining) | 2019-01-26 |
ConvNet Architecture Search for Spatiotemporal Feature Learning | ✓ Link | 85.8 | | | Res3D | 2017-08-16 |
ActionFlowNet: Learning Motion Representation for Action Recognition | | 83.9 | | | ActionFlowNet | 2016-12-09 |
Learning Spatiotemporal Features with 3D Convolutional Networks | ✓ Link | 82.3 | | | C3D | 2014-12-02 |
HalluciNet-ing Spatiotemporal Representations Using a 2D-CNN | ✓ Link | 79.83 | | | HalluciNet (ResNet-50) | 2019-12-10 |
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples | ✓ Link | 78.7 | | | R[2+1]D (VideoMoCo) | 2021-03-10 |
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples | ✓ Link | 74.1 | | | 3D-ResNet-18 (VideoMoCo) | 2021-03-10 |
Large-Scale Video Classification with Convolutional Neural Networks | ✓ Link | 65.4 | | | Slow Fusion + Finetune top 3 layers | 2014-06-23 |
MLGCN: Multi-Laplacian Graph Convolutional Networks for Human Action Recognition | | 63.27 | | | MLGCN | 2019-09-11 |
Towards Universal Representation for Unseen Action Recognition | | 42.5 | | | CD-UAR | 2018-03-22 |
 | | 35.2 | | | SL | |
PoTion: Pose MoTion Representation for Action Recognition | | 29.3 | | | I3D + PoTion | 2018-06-01 |
Federated Self-supervised Learning for Video Understanding | ✓ Link | | 73.16 | | R3D-18 | 2022-07-05 |
Adaptive frame selection in two dimensional convolutional neural network action recognition | ✓ Link | | | 98.05 | ResNet50 | 2022-12-28 |