OpenCodePapers

action-recognition-in-videos-on-something-1

Action Recognition
Leaderboard
| Paper | Code | Top 1 Accuracy | Top 5 Accuracy | Param. | GFLOPs | Model Name | Release Date |
|---|---|---|---|---|---|---|---|
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 70.0 | | | | InternVideo | 2022-12-06 |
| VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ | 68.7 | 91.9 | | | VideoMAE V2-g | 2023-03-29 |
| Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ | 67.3 | 88.8 | | | Side4Video (EVA ViT-E/14 | 2023-11-27 |
| What Can Simple Arithmetic Operations Do for Temporal Modeling? | ✓ | 65.6 | 88.6 | | | ATM | 2023-07-18 |
| Temporally-Adaptive Models for Efficient Video Understanding | ✓ | 63.7 | | | | TAdaFormer-L/14 | 2023-08-10 |
| TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning | ✓ | 63.0 | 87.8 | | | TDS-CLIP-ViT-L/14(8frames) | 2024-08-20 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ | 62.7 | 88.0 | | | UniFormerV2-L | 2022-09-22 |
| Learning Correlation Structures for Vision Transformers | | 61.3 | | | | StructVit-B-4-1 | 2024-04-05 |
| UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ | 60.9 | 87.3 | 50.1 | 259x3 | UniFormer-B (IN-1K + Kinetics400) | 2021-09-29 |
| Temporally-Adaptive Models for Efficient Video Understanding | ✓ | 60.7 | | | | TAdaConvNeXtV2-B | 2023-08-10 |
| Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | ✓ | 58.3 | | | | TPS | 2022-07-27 |
| Multi-scale Motion-Aware Module for Video Action Recognition | | 57.9 | | | | MSMA (8+16frames) | 2023-02-19 |
| UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ | 57.6 | 84.9 | 21.4 | 41.8x3 | UniFormer-B (IN-1K + Kinetics600) | 2021-09-29 |
| Stand-Alone Inter-Frame Attention in Video Models | ✓ | 57.3 | | | | SIFA | 2022-06-14 |
| EAN: Event Adaptive Network for Enhanced Action Recognition | ✓ | 57.2 | 83.9 | | | EAN ResNet50 (single clip, center crop, 8+16 ensemble, with sparse Transformer) | 2021-07-22 |
| Motion-driven Visual Tempo Learning for Video-based Action Recognition | ✓ | 57.2 | | | | TCM (Ensemble) | 2022-02-24 |
| Busy-Quiet Video Disentangling for Video Classification | ✓ | 57.1 | 84.2 | | | BQNEn (ImageNet + K400 pretrained) | 2021-03-29 |
| TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ | 56.8 | 84.1 | | | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ | 56.6 | 84.4 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 2021-02-14 |
| CT-Net: Channel Tensorization Network for Video Classification | ✓ | 56.6 | | | | CT-Net Ensemble (R50, 8+12+16+24) | 2021-06-03 |
| Action Recognition With Motion Diversification and Dynamic Selection | | 56.6 | | | | MoDS (8+16frames) | 2022-07-15 |
| MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing | | 56.5 | | | | MLP-3D | 2022-06-13 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | 56.1 | 82.8 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 2021-11-02 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ | 55.8 | 83.9 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 2021-02-14 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | 55.5 | 82.6 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
| PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | ✓ | 55.3 | 82.8 | | | PAN ResNet101 (RGB only, no Flow) | 2020-08-08 |
| Gate-Shift Networks for Video Action Recognition | ✓ | 55.16 | | | | GSM Ensemble InceptionV3 (ImageNet pretrained) | 2019-12-01 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ | 55.1 | | | | MSNet-R50En (ensemble) | 2020-07-20 |
| AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding | | 55.0 | | | | AE-Net (8+16frames) | 2022-07-21 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 54.59 | 82.30 | 5.8M | 20.9x6 | VoV3D-L (32frames, Kinetics pretrained, single) | 2020-12-01 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ | 54.4 | 83.8 | | | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 2020-07-20 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ | 54.3 | 82.9 | | | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 2021-02-14 |
| Region-based Non-local Operation for Video Classification | ✓ | 54.1 | 82.2 | | | RNL+TSM Ensemble(R50+R101, ImageNet pretrained) | 2020-07-17 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | 54.0 | 81.1 | | | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
| MVFNet: Multi-View Fusion Network for Efficient Video Recognition | ✓ | 54.0 | | | | MVFNet-R50EN | 2020-12-13 |
| Spatial-Temporal Pyramid Graph Reasoning for Action Recognition | | 53.5 | | | | STPG (8+16frames) | 2022-08-09 |
| Action recognition with spatial-temporal discriminative filter banks | | 53.4 | | | | GB + DF + LB (ResNet152, ImageNet pretrained) | 2019-08-20 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ | 53.3 | | | | ip-CSN-152 (IG-65M pretraining) | 2019-04-04 |
| MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ | 53.0 | | | | MARS+RGB+Flow (64 frames, Kinetics pretrained) | 2019-06-01 |
| Region-based Non-local Operation for Video Classification | ✓ | 52.7 | 81.5 | | | RNL+TSM Ensemble(ResNet50, ImageNet pretrained) | 2020-07-17 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 52.68 | 80.43 | 3.3M | 11.5x6 | VoV3D-M (32frames, Kinetics pretrained, single) | 2020-12-01 |
| Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention | | 52.6 | 81.3 | | | TSM+W3 (16 frames, ResNet50) | 2020-04-02 |
| Action Keypoint Network for Efficient Video Recognition | | 52.5 | | | | AK-Net | 2022-01-17 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ | 52.1 | 82.3 | | | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ | 52.1 | | | | ir-CSN-152 (IG-65M pretraining) | 2019-04-04 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | 51.9 | 79.6 | | | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
| Gate-Shift Networks for Video Action Recognition | ✓ | 51.68 | | | | GSM InceptionV3 (16 frames, ImageNet pretrained) | 2019-12-01 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ | 51.6 | | | | R(2+1)D-152 (IG-65M pretraining) | 2019-04-04 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ | 50.9 | 80.3 | | | MSNet-R50 (8 frames, ImageNet pretrained) | 2020-07-20 |
| TSM: Temporal Shift Module for Efficient Video Understanding | ✓ | 50.7 | | | | TSM (RGB + Flow) | 2018-11-20 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 50.6 | 78.7 | 5.8M | 20.9x6 | VoV3D-L (32frames, from scratch, single) | 2020-12-01 |
| Moments in Time Dataset: one million videos for event understanding | ✓ | 50 | | | | ResNet50 I3D (Moments pretrained) | 2018-01-09 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 49.8 | 78.0 | 3.3M | 11.5x6 | VoV3D-M (32frames, from scratch, single) | 2020-12-01 |
| TSM: Temporal Shift Module for Efficient Video Understanding | ✓ | 49.7 | 78.5 | | | TSMEn | 2018-11-20 |
| Temporal Reasoning Graph for Activity Recognition | | 49.7 | | | | TRG (Inception-V3) | 2019-08-27 |
| Temporal Reasoning Graph for Activity Recognition | | 49.5 | 86.1 | | | TRG (ResNet-50) | 2019-08-27 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 49.5 | 78.0 | 5.8M | 9.3x6 | VoV3D-L (16frames, from scratch, single) | 2020-12-01 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ | 49.3 | | | | ir-CSN-152 | 2019-04-04 |
| Recurrent Space-time Graph Neural Networks | ✓ | 49.2 | | | | RSTG (Kinetics pretrained) | 2019-04-11 |
| Moments in Time Dataset: one million videos for event understanding | ✓ | 48.6 | | | | ResNet50 I3D (Kinetics pretrained) | 2018-01-09 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ | 48.4 | | | | ir-CSN-101 | 2019-04-04 |
| Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ | 48.2 | 78.7 | | | S3D-G (ImageNet pretrained) | 2017-12-13 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 48.1 | 76.9 | 3.3M | 5.7x6 | VoV3D-M (16frames, from scratch, single) | 2020-12-01 |
| Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ | 47.3 | 78.1 | | | S3D | 2017-12-13 |
| TSM: Temporal Shift Module for Efficient Video Understanding | ✓ | 47.2 | 77.1 | | | TSM | 2018-11-20 |
| ECO: Efficient Convolutional Network for Online Video Understanding | ✓ | 46.4 | | | | ECO-Net (ImageNet pretrained) | 2018-04-24 |
| ECO: Efficient Convolutional Network for Online Video Understanding | ✓ | 46.4 | | | | ECO-Net | 2018-04-24 |
| Videos as Space-Time Region Graphs | | 46.1 | | | | NL I3D + GCN | 2018-06-05 |
| Non-local Neural Networks | ✓ | 44.4 | | | | NL I3D | 2017-11-21 |
| Motion Feature Network: Fixed Motion Filter for Action Recognition | | 43.9 | | | | Motion Feature Net | 2018-07-26 |
| Temporal Relational Reasoning in Videos | ✓ | 42.01 | | | | 2-Stream TRN | 2017-11-22 |
| Hierarchical Feature Aggregation Networks for Video Action Recognition | | 41.97 | | | | HF-TSN (ImageNet pretraining) | 2019-05-29 |
| MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ | 40.4 | | | | MARS+RGB+Flow (16 frames, Kinetics pretrained) | 2019-06-01 |
| Temporal Relational Reasoning in Videos | ✓ | 34.4 | | | | M-TRN | 2017-11-22 |