OpenCodePapers

action-classification-on-kinetics-400

Video Action Classification
Leaderboard
| Paper | Code | Acc@1 | Acc@5 | FLOPs (G) x views | Clip acc@1 | Parameters (M) | Clip acc@5 | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|---|
| OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 93.6 | | | | | | OmniVec2 | 2024-01-01 |
| Enhancing Video Transformers for Action Understanding with VLM-aided Training | | 93.4 | | | | | | FTP-UniFormerV2-L/14 | 2024-03-24 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 92.1 | | | | | | InternVideo2-6B | 2024-03-22 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 91.6 | | | | | | InternVideo2-1B | 2024-03-22 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 91.1 | | | | | | InternVideo | 2022-12-06 |
| OmniVec: Learning robust representations with cross modal sharing | | 91.1 | | | | | | OmniVec | 2023-11-07 |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 90.9 | 98.9 | 176400x4x3 | | 632 | | TubeViT-H (ImageNet-1k) | 2022-12-06 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 90.6 | 98.7 | 1434×3×4 | | 304 | | Unmasked Teacher (ViT-L) | 2023-03-28 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 90.6 | 98.7 | | | | | UMT-L (ViT-L/16) | 2023-03-28 |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 90.2 | 98.6 | 95300x4x3 | | 307 | | TubeVit-L (ImageNet-1k) | 2022-12-06 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 90.0 | 98.4 | 75300x3x2 | | 354 | | UniFormerV2-L (ViT-L, 336) | 2022-09-22 |
| VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 90.0 | 98.4 | | | | | VideoMAE V2-g (64x266x266) | 2023-03-29 |
| Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 90.0 | | 440x3x4 | | 97 | | FluxViT-B | 2025-03-18 |
| Multiview Transformers for Video Recognition | ✓ Link | 89.9 | 98.3 | 735700x4x3 | | | | MTV-H (WTS 60M) | 2022-01-12 |
| Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 89.9 | | | | | | TAdaFormer-L/14 | 2023-08-10 |
| EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 89.7 | | | | | | EVA | 2022-11-14 |
| AM Flow: Adapters for Temporal Processing in Action Recognition | | 89.6 | | | | | | AM/12 ViT-B Dinov2 | 2024-11-04 |
| What Can Simple Arithmetic Operations Do for Temporal Modeling? | ✓ Link | 89.4 | 98.3 | | | | | ATM | 2023-07-18 |
| DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification | ✓ Link | 89.1 | 98.2 | | | | | DejaVid | 2025-01-01 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 88.9 | | | | | | CoCa (finetuned) | 2022-05-04 |
| Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | ✓ Link | 88.7 | 98.4 | | | | | BIKE (CLIP ViT-L/14) | 2022-12-31 |
| Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 88.7 | 97.8 | | | | | ILA (ViT-L/14) | 2023-04-20 |
| Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 88.6 | 98.2 | | | | | Side4Video (EVA, ViT-E/14) | 2023-11-27 |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 88.6 | 97.6 | 8700x3x4 | | 86 | | TubeVit-B (ImageNet-1k) | 2022-12-06 |
| VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 88.5 | 98.1 | | | | | VideoMAE V2-g | 2023-03-29 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 88.1 | 97.8 | | | | | ONE-PEACE | 2023-05-18 |
| Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 88.0 | | 154x3x4 | | 24 | | FluxViT-S | 2025-03-18 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 88.0 | | | | | | CoCa (frozen) | 2022-05-04 |
| Scaling Vision Transformers to 22 Billion Parameters | ✓ Link | 88.0 | | | | | | ViT-22B | 2023-02-10 |
| Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | ✓ Link | 87.8 | 97.6 | | | | | Text4Vis (CLIP ViT-L/14) | 2022-07-04 |
| Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 87.8 | | | | | | Hiera-H (no extra data) | 2023-06-01 |
| Frozen CLIP Models are Efficient Video Learners | ✓ Link | 87.7 | 97.8 | | | | | EVL (CLIP ViT-L/14@336px, frozen, 32 frames) | 2022-08-06 |
| Dual-path Adaptation from Image to Video Transformers | ✓ Link | 87.7 | 97.8 | | | | | DualPath w/ ViT-L/14 | 2023-03-17 |
| Expanding Language-Image Pretrained Models for General Video Recognition | ✓ Link | 87.7 | 97.4 | | | | | X-CLIP(ViT-L/14, CLIP) | 2022-08-04 |
| AIM: Adapting Image Models for Efficient Video Action Recognition | ✓ Link | 87.5 | 97.7 | | | | | AIM (CLIP ViT-L/14, 32x224) | 2023-02-06 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 87.4 | 97.6 | | | | | VideoMAE (no extra data, ViT-H, 32x320x320) | 2022-03-23 |
| ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | ✓ Link | 87.2 | 97.6 | | | | | ST-Adapter (ViT-L, CLIP) | 2022-06-27 |
| ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | ✓ Link | 87.2 | 97.6 | | | | | ZeroI2V ViT-L/14 | 2023-10-02 |
| Co-training Transformer with Videos and Images Improves Action Recognition | | 87.2 | 97.5 | | | | | CoVeR (JFT-3B) | 2021-12-14 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 87.2 | 97.4 | | | | | MVD (K400 pretrain, ViT-H, 16x224x224) | 2022-12-08 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 87.1 | 97.7 | | | | | mPLUG-2 | 2023-02-01 |
| Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 87.0 | 97.4 | | | | | MaskFeat (K600, MViT-L) | 2021-12-16 |
| VicTR: Video-conditioned Text Representations for Activity Recognition | | 87.0 | | | | | | VicTR (ViT-L/14) | 2023-04-05 |
| Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 86.8 | | | | | | Video-SwinV2-G (ImageNet-22k and external 70M pretrain) | 2021-11-18 |
| Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 86.7 | 97.3 | | | | | MaskFeat (no extra data, MViT-L) | 2021-12-16 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 86.6 | 97.1 | | | | | VideoMAE (no extra data, ViT-H) | 2022-03-23 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 86.4 | 97.0 | | | | | MVD (K400 pretrain, ViT-L, 16x224x224) | 2022-12-08 |
| Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 86.4 | | | | | | TAdaConvNeXtV2-B | 2023-08-10 |
| Co-training Transformer with Videos and Images Improves Action Recognition | | 86.3 | 97.2 | | | | | CoVeR (JFT-300M) | 2021-12-14 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 86.1 | 97.3 | | | | | VideoMAE (no extra data, ViT-L, 32x320x320) | 2022-03-23 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 86.1 | 97.0 | | | | | MViTv2-L (ImageNet-21k pretrain) | 2021-12-02 |
| Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 85.7 | 97.2 | | | | | ILA (ViT-B/16) | 2023-04-20 |
| Dual-path Adaptation from Image to Video Transformers | ✓ Link | 85.4 | 97.1 | | | | | DualPath w/ ViT-B/16 | 2023-03-17 |
| TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ Link | 85.4 | | | | | | TokenLearner 16at18 (L/10) | 2021-06-21 |
| MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 85.3 | 96.3 | | | | | MAR (50% mask, ViT-L, 16x4) | 2022-07-24 |
| CAST: Cross-Attention in Space and Time for Video Action Recognition | ✓ Link | 85.3 | | | | | | CAST(ViT-B/16) | 2023-11-30 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 85.2 | 96.8 | | | | | VideoMAE (no extra data, ViT-L, 16x4) | 2022-03-23 |
| ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | ✓ Link | 85.1 | | | | | | ViC-MAE (ViT-L) | 2023-03-21 |
| VideoMamba: State Space Model for Efficient Video Understanding | ✓ Link | 85.0 | | | | | | VideoMamba-M800 | 2024-03-11 |
| Video Swin Transformer | ✓ Link | 84.9 | 96.7 | | | | | Swin-L (384x384, ImageNet-21k pretrain) | 2021-06-24 |
| ViViT: A Video Vision Transformer | ✓ Link | 84.9 | 95.8 | | | | | ViViT-H/16x2 (JFT) | 2021-03-29 |
| Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 84.1 | 96.1 | | | | | OMNIVORE (Swin-L) | 2022-01-20 |
| Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 84.0 | 96.2 | | | | | OMNIVORE (Swin-B) | 2022-01-20 |
| MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 83.9 | 96.0 | | | | | MAR (75% mask, ViT-L, 16x4) | 2022-07-24 |
| ActionCLIP: A New Paradigm for Video Action Recognition | ✓ Link | 83.8 | 97.1 | | | | | ActionCLIP (CLIP-pretrained) | 2021-09-17 |
| Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 83.6 | | | | | | OmniSource irCSN-152 (IG-Kinetics-65M pretrain) | 2020-03-29 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 83.4 | 95.8 | | | | | MVD (K400 pretrain, ViT-B, 16x224x224) | 2022-12-08 |
| Learning Correlation Structures for Vision Transformers | | 83.4 | | | | | | StructViT-B-4-1 | 2024-04-05 |
| Video Swin Transformer | ✓ Link | 83.1 | 95.9 | | | | | Swin-L (ImageNet-21k pretrain) | 2021-06-24 |
| Stand-Alone Inter-Frame Attention in Video Models | ✓ Link | 83.1 | | | | | | SIFA | 2022-06-14 |
| UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 82.9 | 94.5 | 259x4 | | | | UniFormer-B (ImageNet-1K) | 2021-09-29 |
| Large-scale weakly-supervised pre-training for video action recognition | ✓ Link | 82.8 | | | | | | irCSN-152 (IG-Kinetics-65M pretrain) | 2019-05-02 |
| DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition | ✓ Link | 82.75 | 94.86 | | | | | DirecFormer | 2022-03-19 |
| Video Swin Transformer | ✓ Link | 82.7 | 95.5 | | | | | Swin-B (ImageNet-21k pretrain) | 2021-06-24 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 82.6 | | | | | | ir-CSN-152 (IG-65M pretraining) | 2019-04-04 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 82.5 | 95.3 | | | | | ip-CSN-152 (IG-65M pretraining) | 2019-04-04 |
| Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | ✓ Link | 82.5 | | | | | | TPS | 2022-07-27 |
| Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 82.4 | 95.8 | | | | | ILA (ViT-B/32) | 2023-04-20 |
| Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 82.2 | 95.3 | 180x15 | | 87 | | AMD(ViT-B/16) | 2023-11-06 |
| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ Link | 82.1 | 95.5 | | | | | VATT-Large | 2021-04-22 |
| AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders | ✓ Link | 81.7 | 95.2 | | | | | AdaMAE | 2022-11-16 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 81.5 | 95.1 | | | | | VideoMAE (no extra data, ViT-B, 16x4) | 2022-03-23 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 81.5 | | 386x1 | | | | MoViNet-A6 | 2021-03-21 |
| MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing | | 81.4 | | | | | | MLP-3D | 2022-06-13 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 81.3 | 95.1 | | | | | R[2+1]D-152 (IG-65M pretraining) | 2019-04-04 |
| Learning Spatio-Temporal Representation with Local and Global Diffusion | | 81.2 | 95.2 | | | | | LGD-3D Two-stream (ResNet-101) | 2019-06-13 |
| Multiscale Vision Transformers | ✓ Link | 81.2 | 95.1 | | | | | MViT-B, 64x3 | 2021-04-22 |
| Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 81.1 | 95.2 | | | | | Motionformer-HR | 2021-06-09 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 81.0 | 94.8 | | | | | MVD (K400 pretrain, ViT-S, 16x224x224) | 2022-12-08 |
| MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 81.0 | 94.4 | | | | | MAR (50% mask, ViT-B, 16x4) | 2022-07-24 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 80.9 | 94.9 | 281x1 | | | | MoViNet-A5 | 2021-03-21 |
| Attention Bottlenecks for Multimodal Fusion | ✓ Link | 80.8 | 94.6 | | | | | MBT (AV) | 2021-06-30 |
| Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 80.7 | 94.7 | 7140x3 | | 121.4 | | TimeSformer-L | 2021-02-09 |
| Video Swin Transformer | ✓ Link | 80.6 | 94.6 | | | | | Swin-B (ImageNet-1k pretrain) | 2021-06-24 |
| Video Swin Transformer | ✓ Link | 80.6 | 94.5 | | | | | Swin-S (ImageNet-1k pretrain) | 2021-06-24 |
| VidTr: Video Transformer Without Convolutions | | 80.5 | 94.6 | | | | | En-VidTr-L | 2021-04-23 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 80.5 | 94.5 | 105x1 | | | | MoViNet-A4 | 2021-03-21 |
| Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 80.5 | 94.4 | | | | | OmniSource SlowOnly R101 8x8(ImageNet pretrain) | 2020-03-29 |
| An Image is Worth 16x16 Words, What is a Video Worth? | ✓ Link | 80.5 | | 1040x1 | | | | STAM (64 Frames) | 2021-03-25 |
| X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 80.4 | 94.6 | | | | | X3D-XXL | 2020-04-09 |
| Revisiting 3D ResNets for Video Recognition | ✓ Link | 80.4 | 94.4 | | | | | R3D-RS-200 | 2021-09-03 |
| Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 80.4 | 94.4 | | | | | OmniSource SlowOnly R101 8x8 (Scratch) | 2020-03-29 |
| Multiscale Vision Transformers | ✓ Link | 80.2 | 94.4 | | | | | MViT-B, 32x3 | 2021-04-22 |
| Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 80.1 | 94.5 | 57X15 | | 22 | | AMD(ViT-S/16) | 2023-11-06 |
| SlowFast Networks for Video Recognition | ✓ Link | 79.8 | | | | | | SlowFast 16x8 (ResNet-101 + NL) | 2018-12-10 |
| CT-Net: Channel Tensorization Network for Video Classification | ✓ Link | 79.8 | | | | | | CT-Net Ensemble | 2021-06-03 |
| Video Transformer Network | ✓ Link | 79.8 | | | | | | ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 2021-02-01 |
| Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 79.7 | 94.4 | | | | | TimeSformer-HR | 2021-02-09 |
| VidTr: Video Transformer Without Convolutions | | 79.7 | 94.2 | | | | | En-VidTr-M | 2021-04-23 |
| Learning Spatio-Temporal Representation with Local and Global Diffusion | | 79.4 | 94.4 | | | | | LGD-3D RGB (ResNet-101) | 2019-06-13 |
| TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ Link | 79.4 | 94.4 | | | | | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
| VidTr: Video Transformer Without Convolutions | | 79.4 | 94 | | | | | En-VidTr-S | 2021-04-23 |
| MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 79.4 | 93.7 | | | | | MAR (75% mask, ViT-B, 16x4) | 2022-07-24 |
| An Image is Worth 16x16 Words, What is a Video Worth? | ✓ Link | 79.3 | | 270x1 | | | | STAM (16 Frames) | 2021-03-25 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 79.2 | 93.8 | | | | | ip-CSN-152 (Sports-1M pretraining) | 2019-04-04 |
| Video Modeling with Correlation Networks | | 79.2 | | | | | | CorrNet | 2019-06-07 |
| OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 79.1 | 94.5 | | | | | OmniVL | 2022-09-15 |
| X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 79.1 | 93.9 | | | | | X3D-XL | 2020-04-09 |
| MVFNet: Multi-View Fusion Network for Efficient Video Recognition | ✓ Link | 79.1 | 93.8 | | | | | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 2020-12-13 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 79.1 | 93.7 | | | | | TAdaConvNeXt-T | 2021-10-12 |
| SlowFast Networks for Video Recognition | ✓ Link | 78.9 | 93.5 | | | | | SlowFast 16x8 (ResNet-101) | 2018-12-10 |
| What Makes Training Multi-Modal Classification Networks Hard? | ✓ Link | 78.9 | | | | | | G-Blend (Sports-1M pretrain) | 2019-05-29 |
| Video Swin Transformer | ✓ Link | 78.8 | 93.6 | | | | | Swin-T (ImageNet-1k pretrain) | 2021-06-24 |
| Action recognition with spatial-temporal discriminative filter banks | | 78.8 | | | | | | GB + DF + LB (ResNet 152, ImageNet pretrained) | 2019-08-20 |
| Video Transformer Network | ✓ Link | 78.6 | 93.7 | | | | | ViT-B-VTN (3 layers, ImageNet pretrain) | 2021-02-01 |
| Multiscale Vision Transformers | ✓ Link | 78.4 | 93.5 | | | | | MViT-B, 16x4 | 2021-04-22 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 78.2 | 93.8 | 56.9x1 | | | | MoViNet-A3 | 2021-03-21 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 78.2 | 93.5 | | | | | TAda2D-En (ResNet-50, 8+16 frames) | 2021-10-12 |
| Self-supervised Video Transformer | ✓ Link | 78.1 | | | | | | SVT | 2021-12-02 |
| Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 78 | 93.7 | | | | | TimeSformer | 2021-02-09 |
| SlowFast Networks for Video Recognition | ✓ Link | 77.9 | 93.2 | | | | | SlowFast 8x8 (ResNet-101) | 2018-12-10 |
| Representation Flow for Action Recognition | ✓ Link | 77.9 | | | | | | RepFlow-50 ([2+1]D CNN, FcF, Non-local block) | 2018-10-02 |
| Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 77.8 | 92.8 | | | | | ip-CSN-152 | 2019-04-04 |
| Non-local Neural Networks | ✓ Link | 77.7 | 93.3 | | | | | I3D + NL | 2017-11-21 |
| What Makes Training Multi-Modal Classification Networks Hard? | ✓ Link | 77.7 | | | | | | G-Blend | 2019-05-29 |
| Large Scale Holistic Video Understanding | ✓ Link | 77.6 | | | | | | HATNet (32 frames) | 2019-04-25 |
| X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 77.5 | 92.9 | | | | | X3D-L | 2020-04-09 |
| Collaborative Spatiotemporal Feature Learning for Video Action Recognition | ✓ Link | 77.5 | | | | | | CoST ResNet-101 (ImageNet pretrain) | 2019-06-01 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 77.4 | 93.1 | | | | | TAda2D (ResNet-50, 16 frames) | 2021-10-12 |
| Evolving Space-Time Neural Architectures for Videos | | 77.4 | | | | | | EvaNet | 2018-11-26 |
| Region-based Non-local Operation for Video Classification | ✓ Link | 77.4 | | | | | | RNL+TSM Ensemble(ResNet50, 8 + 16 frames) | 2020-07-17 |
| VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | ✓ Link | 77.4 | | | | | | VIMPAC | 2021-06-21 |
| Busy-Quiet Video Disentangling for Video Classification | ✓ Link | 77.3 | 93.2 | | | | | BQN (ResNet-50) | 2021-03-29 |
| Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 77.2 | 93 | | | | | S3D-G (RGB+Flow, ImageNet pretrained) | 2017-12-13 |
| SlowFast Networks for Video Recognition | ✓ Link | 77 | 92.6 | | | | | SlowFast 8x8 (ResNet-50) | 2018-12-10 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 76.7 | 92.6 | | | | | TAda2D (ResNet-50, 8 frames) | 2021-10-12 |
| D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 76.5 | | | | | | D3D+S3D-G (RGB + RGB) | 2018-12-19 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 76.4 | | | | | | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
| Global Textual Relation Embedding for Relational Understanding | ✓ Link | 76.1 | | | | | | GloRe | 2019-06-03 |
| X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 76 | 92.3 | | | | | X3D-M | 2020-04-09 |
| Multiscale Vision Transformers | ✓ Link | 76 | 92.1 | | | | | MViT-S | 2021-04-22 |
| Two-Stream Video Classification with Cross-Modality Attention | | 75.98 | | | | | | CMA iter1 (16 frames) | 2019-08-01 |
| D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 75.9 | | | | | | D3D (RGB) | 2018-12-19 |
| Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution | ✓ Link | 75.7 | | | | | | Oct-I3D + NL | 2019-04-10 |
| SlowFast Networks for Video Recognition | ✓ Link | 75.6 | 92.1 | | | | | SlowFast 4x16 (ResNet-50) | 2018-12-10 |
| A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 75.4 | 91.9 | | | | | R[2+1]D-Flow (Sports-1M pretrain) | 2017-11-30 |
| FASTER Recurrent Networks for Efficient Video Classification | | 75.1 | | | | | | FASTER32 | 2019-06-10 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 75.0 | 92.3 | 10.3x1 | | | | MoViNet-A2 | 2021-03-21 |
| MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 74.9 | | | | | | MARS+RGB+Flow (64 frames) | 2019-06-01 |
| Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 74.7 | 93.4 | | | | | S3D-G (RGB, ImageNet pretrained) | 2017-12-13 |
| TSM: Temporal Shift Module for Efficient Video Understanding | ✓ Link | 74.7 | | | | | | TSM | 2018-11-20 |
| $A^2$-Nets: Double Attention Networks | | 74.6 | 91.5 | | | | | A2 Net | 2018-10-27 |
| A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 74.3 | 91.4 | | | | | R[2+1]D-RGB (Sports-1M pretrain) | 2017-11-30 |
| Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | ✓ Link | 73.9 | 91.1 | | | | | TSN | 2016-08-02 |
| A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 73.9 | 90.9 | | | | | R[2+1]D-Two-Stream | 2017-11-30 |
| ConvNet Architecture Search for Spatiotemporal Feature Learning | ✓ Link | 73.9 | | | | | | TSN | 2017-08-16 |
| STM: SpatioTemporal and Motion Encoding for Action Recognition | | 73.7 | | | | | | STM (ResNet-50) | 2019-08-07 |
| More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation | ✓ Link | 73.5 | 91.2 | | | | | bLVNet Fan et al. (2019) | 2019-12-02 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 73.05 | | 6.90x1 | | 32.45 | | Co Slow_64 | 2021-05-31 |
| Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification | | 73.0 | 90.9 | | | | | Inception-ResNet | 2017-08-12 |
| Multi-Fiber Networks for Video Recognition | | 72.8 | 90.4 | | | | | MFNet | 2018-07-30 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 72.7 | 91.2 | 6.0x1 | | | | MoViNet-A1 | 2021-03-21 |
| Appearance-and-Relation Networks for Video Classification | ✓ Link | 72.4 | 90.4 | | | | | ARTNet | 2017-11-24 |
| Learning Spatio-Temporal Representation with Local and Global Diffusion | | 72.3 | 90.9 | | | | | LGD-3D Flow (ResNet-101) | 2019-06-13 |
| A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 72 | 90 | | | | | R[2+1]D | 2017-11-30 |
| A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 72 | 90 | | | | | R[2+1]D-RGB | 2017-11-30 |
| FASTER Recurrent Networks for Efficient Video Classification | | 71.7 | | | | | | FASTER16 w/o sp | 2019-06-10 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 71.61 | | 1.25x1 | | 6.15 | | Co X3D-L_64 | 2021-05-31 |
| Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 71.1 | 89.3 | | | | | I3D | 2017-05-22 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 71.03 | | 0.33x1 | | 3.79 | | Co X3D-M_64 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 69.29 | | 19.17x1 | | 6.15 | | X3D-L | 2021-05-31 |
| MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 68.9 | | | | | | MARS+RGB+Flow (16 frames) | 2019-06-01 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 68.45 | | 66.25x1 | | 66.25 | | SlowFast-8×8-R50 | 2021-05-31 |
| Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 68 | 87.6 | | | | | S3D-G (Flow, ImageNet pretrained) | 2017-12-13 |
| A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 67.5 | 87.2 | | | | | R[2+1]D-Flow | 2017-11-30 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.42 | | 54.87x1 | | 32.45 | | Slow-8x8-R50 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.33 | | 0.17x1 | | 3.79 | | Co X3D-S_64 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.24 | | 4.97x1 | | 3.79 | | X3D-M | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.06 | | 36.46x1 | | 34.48 | | SlowFast-4×16-R50 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 65.90 | | 6.90x1 | | 32.45 | | Co Slow_8 | 2021-05-31 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 65.8 | 87.4 | 2.7x1 | | | | MoViNet-A0 | 2021-03-21 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 64.71 | | 2.06x1 | | 3.79 | | X3D-S | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 63.98 | | 28.61x1 | | 28.04 | | I3D-R50 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 63.03 | | 1.25x1 | | 6.15 | | Co X3D-L_16 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 62.80 | | 0.33x1 | | 3.79 | | Co X3D-M_16 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 60.18 | | 0.17x1 | | 3.79 | | Co X3D-S_13 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.58 | | 5.68x1 | | 28.04 | | Co I3D_8 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.52 | | 40.71x1 | | 31.51 | | R(2+1)D-18_16 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.37 | | 0.64x1 | | 3.79 | | X3D-XS | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 56.86 | | 5.68x1 | | 28.04 | | Co I3D_64 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 53.52 | | 20.35x1 | | 31.51 | | R(2+1)D-18_8 | 2021-05-31 |
| Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 53.40 | | 4.71x1 | | 12.80 | | RCU_8 | 2021-05-31 |
| ViViT: A Video Vision Transformer | ✓ Link | | 94.7 | | | | | ViViT-L/16x2 320 | 2021-03-29 |
| Video Transformer Network | ✓ Link | | 94.2 | | | | | ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 2021-02-01 |
| SlowFast Networks for Video Recognition | ✓ Link | | 93.9 | | | | | SlowFast 16x8 (ResNet-101 + NL) | 2018-12-10 |
| Video Transformer Network | ✓ Link | | 93.4 | | | | | ViT-B-VTN (1 layer, ImageNet pretrain) | 2021-02-01 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | | 225x5 | | | | MViT-B (train from scratch) | 2021-12-02 |
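
The "FLOPs (G) x views" column packs several factors into one string, for example `1434×3×4` or `57X15`, and uses `x`, `X`, and `×` interchangeably as the separator. The following is a minimal sketch for turning such a cell into a single number. It assumes the first factor is GFLOPs per view and the remaining factors are test-time view counts, which the page itself does not state; `total_gflops` is a hypothetical helper name, not part of any library.

```python
import re


def total_gflops(cell: str) -> float:
    """Multiply the factors of a 'FLOPs (G) x views' cell, e.g. '1434×3×4'.

    Assumption: first factor = GFLOPs per view, the rest = view counts.
    """
    # Accept the three separators that appear in the leaderboard: x, X, ×.
    factors = [float(part) for part in re.split(r"[xX×]", cell.strip())]
    total = 1.0
    for factor in factors:
        total *= factor
    return total


print(total_gflops("1434×3×4"))  # 17208.0
print(total_gflops("57X15"))     # 855.0
```

Under that assumption the product gives a rough total inference cost per video, which makes rows with different view counts easier to compare.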