OpenCodePapers

Action Classification on Kinetics-600

Leaderboard
| Paper | Code | Top-1 Accuracy (%) | Top-5 Accuracy (%) | GFLOPs | Model Name | Release Date |
|---|---|---|---|---|---|---|
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 91.9 | | | InternVideo2-6B | 2024-03-22 |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ | 91.8 | 98.9 | | TubeVit-H | 2022-12-06 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 91.6 | | | InternVideo2-1B | 2024-03-22 |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ | 91.5 | 98.7 | | TubeVit-L | 2022-12-06 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 91.3 | | | InternVideo-T | 2022-12-06 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | | 91.1 | 97.1 | | 🍷MerlotReserve-Large (+Audio) | 2022-01-07 |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ | 90.9 | 97.3 | | TubeVit-B | 2022-12-06 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ | 90.5 | 98.8 | | UMT-L (ViT-L/16) | 2023-03-28 |
| Multiview Transformers for Video Recognition | ✓ | 90.3 | 98.5 | | MTV-H (WTS 60M) | 2022-01-12 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ | 90.1 | 98.5 | | UniFormerV2-L | 2022-09-22 |
| VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ | 89.9 | 98.5 | | VideoMAE V2-g (64x266x266) | 2023-03-29 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 89.8 | 98.3 | | mPLUG-2 | 2023-02-01 |
| EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ | 89.8 | | | EVA | 2022-11-14 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | | 89.7 | 96.6 | | 🍷MerlotReserve-Base (+Audio) | 2022-01-07 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | | 89.4 | 96.3 | | 🍷MerlotReserve-Large (no Audio) | 2022-01-07 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ | 89.4 | | | CoCa (finetuned) | 2022-05-04 |
| VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ | 88.8 | 98.2 | | VideoMAE V2-g | 2023-03-29 |
| Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ | 88.8 | | | Hiera-H (no extra data) | 2023-06-01 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ | 88.5 | | | CoCa (frozen) | 2022-05-04 |
| Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ | 88.3 | 98.0 | | MaskFeat (no extra data, MViT-L) | 2021-12-16 |
| Expanding Language-Image Pretrained Models for General Video Recognition | ✓ | 88.3 | 97.7 | | X-CLIP(ViT-L/14, CLIP) | 2022-08-04 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | | 88.1 | 95.8 | | 🍷MerlotReserve-Base (no Audio) | 2022-01-07 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | 87.9 | 97.9 | | MViTv2-L (ImageNet-21k pretrain) | 2021-12-02 |
| Co-training Transformer with Videos and Images Improves Action Recognition | | 87.9 | 97.8 | | CoVeR (JFT-3B) | 2021-12-14 |
| Florence: A New Foundation Model for Computer Vision | ✓ | 87.8 | 97.9 | | Florence (curated FLD-900M pretrain) | 2021-11-22 |
| Co-training Transformer with Videos and Images Improves Action Recognition | | 86.8 | 97.3 | | CoVeR (JFT-300M) | 2021-12-14 |
| TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ | 86.3 | 97.0 | | TokenLearner 16at18 w. Fuser (L/10) | 2021-06-21 |
| Video Swin Transformer | ✓ | 86.1 | 97.3 | | Swin-L (384x384, ImageNet-21k pretrain) | 2021-06-24 |
| ViViT: A Video Vision Transformer | ✓ | 85.8 | 96.5 | | ViViT-H/16x2 (JFT) | 2021-03-29 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | 85.5 | | | MViTv2-L (train from scratch) | 2021-12-02 |
| UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ | 84.8 | 96.7 | 259x4 | UniFormer-B (ImageNet-1K) | 2021-09-29 |
| Space-time Mixing Attention for Video Transformer | ✓ | 84.5 | 96.3 | | XViT (x16) | 2021-06-10 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 84.3 | 96.4 | 281x1 | MoViNet-A5 (AutoAugment) | 2021-03-21 |
| ViViT: A Video Vision Transformer | ✓ | 84.3 | 95.6 | | ViViT-L/16x2 | 2021-03-29 |
| Video Swin Transformer | ✓ | 84.0 | 96.5 | | Swin-B (ImageNet-21k pretrain) | 2021-06-24 |
| Multiscale Vision Transformers | ✓ | 83.8 | 96.3 | | MViT-B-24, 32x3 | 2021-04-22 |
| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ | 83.6 | 96.6 | | VATT-Large | 2021-04-22 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 83.5 | 96.5 | 386x1 | MoViNet-A6 | 2021-03-21 |
| Multiscale Vision Transformers | ✓ | 83.4 | 96.3 | | MViT-B, 32x3 | 2021-04-22 |
| Learning Spatio-Temporal Representation with Local and Global Diffusion | | 83.1 | 96.2 | | LGD-3D Two-stream | 2019-06-13 |
| Revisiting 3D ResNets for Video Recognition | ✓ | 83.1 | | | R3D-RS-200 | 2021-09-03 |
| ViViT: A Video Vision Transformer | ✓ | 83.0 | 95.7 | | ViViT-L/16x2 (320x320) | 2021-03-29 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 82.7 | 95.7 | 281x1 | MoViNet-A5 | 2021-03-21 |
| Multiscale Vision Transformers | ✓ | 82.1 | 95.7 | | MViT-B, 16x4 | 2021-04-22 |
| PERF-Net: Pose Empowered RGB-Flow Net | | 82.0 | 95.7 | | PERF-Net (distilled ResNet50-G) | 2020-09-28 |
| SlowFast Networks for Video Recognition | ✓ | 81.8 | 95.1 | | SlowFast 16x8 (ResNet-101 + NL) | 2018-12-10 |
| Learning Spatio-Temporal Representation with Local and Global Diffusion | | 81.5 | 95.6 | | LGD-3D RGB | 2019-06-13 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 81.2 | 94.9 | 105x1 | MoViNet-A4 | 2021-03-21 |
| SlowFast Networks for Video Recognition | ✓ | 81.1 | 95.1 | | SlowFast 16x8 (ResNet-101) | 2018-12-10 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 80.8 | 80.8 | 56.9x1 | MoViNet-A3 | 2021-03-21 |
| SlowFast Networks for Video Recognition | ✓ | 80.4 | 94.8 | | SlowFast 8x8 (ResNet-101) | 2018-12-10 |
| SlowFast Networks for Video Recognition | ✓ | 79.9 | 94.5 | | SlowFast 8x8 (ResNet-50) | 2018-12-10 |
| D3D: Distilled 3D Networks for Video Action Recognition | ✓ | 79.1 | | | D3D+S3D-G | 2018-12-19 |
| SlowFast Networks for Video Recognition | ✓ | 78.8 | 94 | | SlowFast 4x16 (ResNet-50) | 2018-12-10 |
| Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ | 78.6 | | | S3D-G (RGB+Flow) | 2017-12-13 |
| D3D: Distilled 3D Networks for Video Action Recognition | ✓ | 77.9 | | | D3D | 2018-12-19 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 77.5 | 93.4 | 10.3x1 | MoViNet-A2 | 2021-03-21 |
| Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ | 76.6 | | | S3D-G (RGB) | 2017-12-13 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 76.0 | 92.6 | 6.0x1 | MoViNet-A1 | 2021-03-21 |
| Learning Spatio-Temporal Representation with Local and Global Diffusion | | 75 | 92.4 | | LGD-3D Flow | 2019-06-13 |
| A Short Note about Kinetics-600 | ✓ | 73.6 | | | I3D (RGB) | 2018-08-03 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ | 71.5 | 90.4 | 2.7x1 | MoViNet-A0 | 2021-03-21 |
| Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ | 69.7 | | | S3D-G (Flow) | 2017-12-13 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | | 97.2 | | MViTv2-B (train from scratch) | 2021-12-02 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | | | 206x5 | MViT-L (train from scratch) | 2021-12-02 |
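
The leaderboard ranks models by Top-1 and Top-5 accuracy. As a reminder of what those two columns measure, here is a minimal sketch of top-k accuracy in plain Python. The function name `top_k_accuracy` and the toy scores are illustrative assumptions; this is not the evaluation code used by any of the listed papers, which may also average predictions over several clips or crops before scoring.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 1,
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example: 3 clips, 4 classes (made-up scores, not leaderboard data).
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],  # highest: class 1
    [0.5, 0.1, 0.3, 0.1],  # highest: class 0, second: class 2
    [0.3, 0.1, 0.1, 0.5],  # highest: class 3, second: class 0
]
toy_labels = [1, 2, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"top-1: {top1:.1f}%  top-2: {top2:.1f}%")
```

In the toy data only the first clip is classified correctly at k=1, while all three true labels fall within the two highest-scoring classes. For the leaderboard's Top-5 column the same function would be called with `k=5` over the 600 Kinetics classes.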