
action-recognition-in-videos-on-something

Action Recognition
Results over time
Leaderboard
| Paper | Code | Top-1 Accuracy | Top-5 Accuracy | Parameters (M) | GFLOPs | Model Name | Release Date |
|---|---|---|---|---|---|---|---|
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ | 77.3 | 95.7 | 633 | 1192x6 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 2022-12-08 |
| DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification | ✓ | 77.2 | 96.3 | | | DejaVid | 2025-01-01 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 77.2 | | | | InternVideo | 2022-12-06 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 77.1 | | | | InternVideo2-1B | 2024-03-22 |
| VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ | 77.0 | 95.9 | 1013 | 2544x6 | VideoMAE V2-g | 2023-03-29 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ | 76.7 | 95.5 | 305 | 597x6 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 2022-12-08 |
| Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ | 76.5 | | | | Hiera-L (no extra data) | 2023-06-01 |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ | 76.1 | 95.2 | | | TubeViT-L | 2022-12-06 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ | 75.4 | 95.2 | 305 | 1436x3 | VideoMAE (no extra data, ViT-L, 32x2) | 2022-03-23 |
| Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ | 75.2 | 94.0 | | | Side4Video (EVA ViT-E/14) | 2023-11-27 |
| Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ | 75.0 | 95.0 | 218 | 2828x3 | MaskFeat (Kinetics600 pretrain, MViT-L) | 2021-12-16 |
| MAR: Masked Autoencoders for Efficient Action Recognition | ✓ | 74.7 | 94.9 | 311 | 276x6 | MAR (50% mask, ViT-L, 16x4) | 2022-07-24 |
| What Can Simple Arithmetic Operations Do for Temporal Modeling? | ✓ | 74.6 | 94.4 | | | ATM | 2023-07-18 |
| The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ | 74.4 | | | | MAWS (ViT-L) | 2023-03-23 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ | 74.3 | 94.6 | 305 | 597x6 | VideoMAE (no extra data, ViT-L, 16frame) | 2022-03-23 |
| MAR: Masked Autoencoders for Efficient Action Recognition | ✓ | 73.8 | 94.4 | 311 | 131x6 | MAR (75% mask, ViT-L, 16x4) | 2022-07-24 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ | 73.7 | 94.0 | 87 | 180x6 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 2022-12-08 |
| ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | ✓ | 73.7 | | | | ViC-MAE (ViT-L) | 2023-03-21 |
| Temporally-Adaptive Models for Efficient Video Understanding | ✓ | 73.6 | | | | TAdaFormer-L/14 | 2023-08-10 |
| TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning | ✓ | 73.4 | 93.8 | | | TDS-CLIP-ViT-L/14(8frames) | 2024-08-20 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | 73.3 | 94.1 | 213.1 | | MViTv2-L (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
| Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 73.3 | 94.0 | 87 | 180x6 | AMD(ViT-B/16) | 2023-11-06 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ | 73.0 | 94.5 | | 5154 | UniFormerV2-L | 2022-09-22 |
| ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | ✓ | 72.3 | 93.9 | | 8248 | ST-Adapter (ViT-L, CLIP) | 2022-06-27 |
| ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | ✓ | 72.2 | 93.0 | | | ZeroI2V ViT-L/14 | 2023-10-02 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | 72.1 | | | 225x3 | MViT-B (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
| CAST: Cross-Attention in Space and Time for Video Action Recognition | ✓ | 71.6 | | | | CAST(ViT-B/16) | 2023-11-30 |
| Learning Correlation Structures for Vision Transformers | | 71.5 | | | | StructVit-B-4-1 | 2024-04-05 |
| Omnivore: A Single Model for Many Visual Modalities | ✓ | 71.4 | 93.5 | | | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) | 2022-01-20 |
| BEVT: BERT Pretraining of Video Transformers | ✓ | 71.4 | - | 89 | 321x3 | BEVT (IN-1K + Kinetics400 pretrain) | 2021-12-02 |
| UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ | 71.2 | 92.8 | 50.1 | 259x3 | UniFormer-B (IN-1K + Kinetics400 pretrain) | 2021-09-29 |
| Temporally-Adaptive Models for Efficient Video Understanding | ✓ | 71.1 | | | | TAdaConvNeXtV2-B | 2023-08-10 |
| MAR: Masked Autoencoders for Efficient Action Recognition | ✓ | 71.0 | 92.8 | 94 | 86x6 | MAR (50% mask, ViT-B, 16x4) | 2022-07-24 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ | 70.9 | 92.8 | 22 | 57x6 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | 2022-12-08 |
| Co-training Transformer with Videos and Images Improves Action Recognition | | 70.9 | 92.5 | | | CoVeR(JFT-3B) | 2021-12-14 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ | 70.8 | 92.4 | 87 | 180x6 | VideoMAE (no extra data, ViT-B, 16frame) | 2022-03-23 |
| Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 70.2 | 92.5 | 22 | 57x6 | AMD(ViT-S/16) | 2023-11-06 |
| Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ | 70.2 | 91.8 | | | ILA (ViT-L/14) | 2023-04-20 |
| MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning | ✓ | 70.1 | 92.8 | 68.5 | 197x3 | MorphMLP-B (IN-1K) | 2021-11-24 |
| Co-training Transformer with Videos and Images Improves Action Recognition | | 69.8 | 91.9 | | | CoVeR(JFT-300M) | 2021-12-14 |
| Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | ✓ | 69.8 | | | | TPS | 2022-07-27 |
| Stand-Alone Inter-Frame Attention in Video Models | ✓ | 69.8 | | | | SIFA | 2022-06-14 |
| Video Swin Transformer | ✓ | 69.6 | 92.7 | 89 | 321x3 | Swin-B (IN-21K + Kinetics400 pretrain) | 2021-06-24 |
| TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ | 69.6 | 92.2 | | 198x3 | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
| MAR: Masked Autoencoders for Efficient Action Recognition | ✓ | 69.5 | 91.9 | 94 | 41x6 | MAR (75% mask, ViT-B, 16x4) | 2022-07-24 |
| Object-Region Video Transformers | ✓ | 69.5 | 91.5 | N/A | N/A | ORViT Mformer-L (ORViT blocks) | 2021-10-13 |
| UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ | 69.4 | 92.1 | 21.4 | 41.8x3 | UniFormer-S (IN-1K + Kinetics600 pretrain) | 2021-09-29 |
| Mutual Modality Learning for Video Action Classification | ✓ | 69.02 | 92.70 | | | MML (ensemble) | 2020-11-04 |
| Multiscale Vision Transformers | ✓ | 68.7 | 91.5 | 53.2 | 236x3 | MViT-B-24, 32x3 | 2021-04-22 |
| Multiview Transformers for Video Recognition | ✓ | 68.5 | 90.4 | | | MTV-B | 2022-01-12 |
| MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing | | 68.5 | | | | MLP-3D | 2022-06-13 |
| TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ | 68.2 | 91.6 | | 198x1 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
| Multi-scale Motion-Aware Module for Video Action Recognition | | 68.2 | | | | MSMA (8+16frames) | 2023-02-19 |
| Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ | 68.1 | 91.2 | N/A | 1181x3 | Mformer-L | 2021-06-09 |
| VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | ✓ | 68.1 | | | | VIMPAC | 2021-06-21 |
| Object-Region Video Transformers | ✓ | 67.9 | 90.5 | N/A | N/A | ORViT Mformer (ORViT blocks) | 2021-10-13 |
| Multiscale Vision Transformers | ✓ | 67.8 | 91.3 | 36.6 | 170x3 | MViT-B, 32x3 (Kinetics600 pretrain) | 2021-04-22 |
| Group Contextualization for Video Recognition | ✓ | 67.8 | 91.2 | 27.4 | 110.1 | GC-TDN Ensemble (R50, 8+16) | 2022-03-18 |
| CT-Net: Channel Tensorization Network for Video Classification | ✓ | 67.8 | 91.1 | 83.8 | 280 | CT-Net Ensemble (R50, 8+12+16+24) | 2021-06-03 |
| Motion-driven Visual Tempo Learning for Video-based Action Recognition | ✓ | 67.8 | | | | TCM (Ensemble) | 2022-02-24 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ | 67.7 | 91.1 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 2021-02-14 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | 67.7 | 91.1 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 2021-11-02 |
| Global Temporal Difference Network for Action Recognition | | 67.6 | | | | GTDNet | 2022-11-23 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ | 67.4 | 91 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 2021-02-14 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 67.35 | 90.50 | 5.8 | 20.9x6 | VoV3D-L (32frames, Kinetics pretrained, single) | 2020-12-01 |
| SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition | | 67.3 | 91 | | | PLAR | 2023-05-21 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | 67.3 | 90.8 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
| Space-time Mixing Attention for Video Transformer | ✓ | 67.2 | 90.8 | N/A | 850x1 | X-Vit (x16) | 2021-06-10 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ | 67.2 | 89.8 | | | TAda2D-En (ResNet-50, 8+16 frames) | 2021-10-12 |
| Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ | 67.1 | 90.6 | N/A | 958.8x3 | Mformer-HR | 2021-06-09 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ | 67.1 | 90.4 | | | TAdaConvNeXt-T | 2021-10-12 |
| Action Recognition With Motion Diversification and Dynamic Selection | | 67.1 | | | | MoDS (8+16frames) | 2022-07-15 |
| Spatial-Temporal Pyramid Graph Reasoning for Action Recognition | | 67.0 | | | | STPG (8+16frames) | 2022-08-09 |
| Mutual Modality Learning for Video Action Classification | ✓ | 66.83 | 91.30 | | | MML (single) | 2020-11-04 |
| Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ | 66.8 | 90.3 | | | ILA (ViT-B/16) | 2023-04-20 |
| TSM: Temporal Shift Module for Efficient Video Understanding | ✓ | 66.6 | 91.3 | | | TSM (RGB + Flow) | 2018-11-20 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ | 66.6 | 90.6 | | | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 2020-07-20 |
| PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | ✓ | 66.5 | 90.6 | | | PAN ResNet101 (RGB only, no Flow) | 2020-08-08 |
| Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention | | 66.5 | 90.4 | | | TSM+W3 (16 frames, RGB ResNet-50) | 2020-04-02 |
| Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ | 66.5 | 90.1 | | | Mformer | 2021-06-09 |
| MVFNet: Multi-View Fusion Network for Efficient Video Recognition | ✓ | 66.3 | | | | MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-13 |
| Multiscale Vision Transformers | ✓ | 66.2 | 90.2 | | | MViT-B, 16x4 | 2021-04-22 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | 66 | 89.8 | | | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 65.8 | 89.5 | 5.8 | 20.9x6 | VoV3D-L (32frames, from scratch, single) | 2020-12-01 |
| Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition | ✓ | 65.7 | 89.8 | | 18.3 | E3D-L | 2023-03-05 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ | 65.7 | 89.8 | | | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 2021-02-14 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ | 65.6 | 89.2 | | | TAda2D (ResNet-50, 16 frames) | 2021-10-12 |
| ViViT: A Video Vision Transformer | ✓ | 65.4 | 89.8 | | | ViViT-L/16x2 Fact. encoder | 2021-03-29 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 65.24 | 89.48 | 3.3 | 11.5x6 | VoV3D-M (32frames, Kinetics pretrained, single) | 2020-12-01 |
| More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation | ✓ | 65.2 | | | | bLVNet | 2019-12-02 |
| DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition | ✓ | 64.94 | 87.9 | | | DirecFormer | 2022-03-19 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | 64.8 | 89.1 | | | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ | 64.7 | 89.4 | | | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
| Action Keypoint Network for Efficient Video Recognition | | 64.3 | | | | AK-Net | 2022-01-17 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 64.2 | 88.8 | 3.3 | 11.5x6 | VoV3D-M (32frames, from scratch, single) | 2020-12-01 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 64.1 | 88.6 | 5.8 | 9.3x6 | VoV3D-L (16frames, from scratch, single) | 2020-12-01 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ | 64.0 | 88.0 | | | TAda2D (ResNet-50, 8 frames) | 2021-10-12 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 63.5 | 89.0 | 4.8 | 10.3x1 | MoViNet-A2 | 2021-03-21 |
| Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ | 63.2 | 88.2 | 3.3 | 5.7x6 | VoV3D-M (16frames, from scratch, single) | 2020-12-01 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ | 63 | 88.4 | | | MSNet-R50 (8 frames, ImageNet pretrained) | 2020-07-20 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 62.7 | 89.0 | 4.6 | 6.0x1 | MoViNet-A1 | 2021-03-21 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 62.5 | 86.2 | | | OmniVL | 2022-09-15 |
| Is Space-Time Attention All You Need for Video Understanding? | ✓ | 62.5 | | | | TimeSformer-HR | 2021-02-09 |
| Is Space-Time Attention All You Need for Video Understanding? | ✓ | 62.3 | | | | TimeSformer-L | 2021-02-09 |
| Temporal Reasoning Graph for Activity Recognition | | 62.2 | 90.3 | | | TRG (ResNet-50) | 2019-08-27 |
| Temporal Pyramid Network for Action Recognition | ✓ | 62.0 | | | | TPN (TSM-50) | 2020-04-07 |
| A Multigrid Method for Efficiently Training Video Models | ✓ | 61.7 | | | | Multigrid | 2019-12-02 |
| SlowFast Networks for Video Recognition | ✓ | 61.7 | | | | SlowFast | 2018-12-10 |
| Temporal Reasoning Graph for Activity Recognition | | 61.3 | 91.4 | | | TRG (Inception-V3) | 2019-08-27 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 61.3 | 88.2 | 3.1 | 2.7x1 | MoViNet-A0 | 2021-03-21 |
| Cooperative Cross-Stream Network for Discriminative Action Representation | | 61.2 | 89.3 | | | CCS + two-stream + TRN | 2019-08-27 |
| VidTr: Video Transformer Without Convolutions | | 60.2 | | | | VidTr-L | 2021-04-23 |
| Is Space-Time Attention All You Need for Video Understanding? | ✓ | 59.5 | | | | TimeSformer | 2021-02-09 |
| Self-supervised Video Transformer | ✓ | 59.2 | | | | SVT | 2021-12-02 |
| Few-Shot Video Classification via Temporal Alignment | | 52.3 | | | | TAM (5-shot) | 2019-06-27 |
| The "something something" video database for learning and evaluating visual common sense | ✓ | 51.33 | 80.46 | | | model3D_1 with left-right augmentation and fps jitter | 2017-06-13 |
| Attention Distillation for Learning Video Representations | | 49.9 | 79.1 | | | Prob-Distill | 2019-04-05 |
| Comparative Analysis of CNN-based Spatiotemporal Reasoning in Videos | ✓ | 47.73 | | | | STM + TRNMultiscale | 2019-09-11 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | | | 112213113321 | | InternVideo2-6B | 2024-03-22 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | | 93.4 | 51.1 | | MViTv2-B (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ | | 91.1 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 2021-11-02 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | | | 5.3 | 23.7x1 | MoViNet-A3 | 2021-03-21 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | | | | 2828x3 | MViT-L (IN-21K + Kinetics400 pretrain) | 2021-12-02 |