OpenCodePapers

Action Classification on Kinetics-700

Tasks: Video · Action Classification
Leaderboard
| Paper | Code | Top-1 Accuracy | Top-5 Accuracy | Model | Release Date |
|---|---|---|---|---|---|
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 85.9 | | InternVideo2-6B | 2024-03-22 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 85.4 | | InternVideo2-1B | 2024-03-22 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 84.0 | | InternVideo-T | 2022-12-06 |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 83.8 | 96.6 | TubeViT-L | 2022-12-06 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 83.6 | 96.7 | UMT-L (ViT-L/16) | 2023-03-28 |
| Multiview Transformers for Video Recognition | ✓ Link | 83.4 | 96.2 | MTV-H (WTS 60M) | 2022-01-12 |
| EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 82.9 | | EVA | 2022-11-14 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 82.7 | 96.2 | UniFormerV2-L | 2022-09-22 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 82.7 | | CoCa (finetuned) | 2022-05-04 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 81.1 | | CoCa (frozen) | 2022-05-04 |
| Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 81.1 | | Hiera-H (no extra data) | 2023-06-01 |
| Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 80.4 | 95.7 | MaskFeat (no extra data, MViT-L) | 2021-12-16 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 80.4 | 94.9 | mPLUG-2 | 2023-02-01 |
| AIM: Adapting Image Models for Efficient Video Action Recognition | ✓ Link | 80.4 | | AIM (CLIP ViT-L/14, 32x224) | 2023-02-06 |
| Co-training Transformer with Videos and Images Improves Action Recognition | | 79.8 | 94.9 | CoVeR (JFT-3B) | 2021-12-14 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 79.4 | 94.9 | MViTv2-L (ImageNet-21k pretrain) | 2021-12-02 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 79.4 | | MoViNet-A6 | 2021-12-02 |
| Co-training Transformer with Videos and Images Improves Action Recognition | | 78.5 | 94.2 | CoVeR (JFT-300M) | 2021-12-14 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 76.6 | 93.2 | MViTv2-B | 2021-12-02 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 72.3 | | MoViNet-A6 | 2021-03-21 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 71.7 | | MoViNet-A5 | 2021-03-21 |
| VidTr: Video Transformer Without Convolutions | | 70.8 | 89.4 | En-VidTr-L | 2021-04-23 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 70.7 | | MoViNet-A4 | 2021-03-21 |
| VidTr: Video Transformer Without Convolutions | | 70.2 | 89.0 | VidTr-L | 2021-04-23 |
| VidTr: Video Transformer Without Convolutions | | 69.5 | 88.3 | VidTr-M | 2021-04-23 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 68.0 | | MoViNet-A3 | 2021-03-21 |
| VidTr: Video Transformer Without Convolutions | | 67.3 | 87.7 | VidTr-S | 2021-04-23 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 66.7 | | MoViNet-A2 | 2021-03-21 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 63.5 | | MoViNet-A1 | 2021-03-21 |
| MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 58.5 | | MoViNet-A0 | 2021-03-21 |
| Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 56.46 | 76.82 | SRTG r3d-101 | 2020-06-15 |
| Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 54.17 | 74.62 | SRTG r(2+1)d-50 | 2020-06-15 |
| Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 53.52 | 74.17 | SRTG r3d-50 | 2020-06-15 |
| Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 51.9 | | SEER (RegNet10B) | 2022-02-16 |
| Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 49.43 | 73.23 | SRTG r(2+1)d-34 | 2020-06-15 |
| Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 49.15 | 72.68 | SRTG r3d-34 | 2020-06-15 |
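The two metrics above can be sketched as follows. Top-k accuracy is the fraction of clips whose ground-truth class appears among the model's k highest-scoring classes; Top-1 and Top-5 are the k=1 and k=5 cases. This is a minimal illustration with made-up logits and labels, not scores from any model in the table.

```python
import numpy as np

def top_k_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # Sort class scores per sample, descending, and keep the top-k indices.
    topk = np.argsort(logits, axis=1)[:, ::-1][:, :k]
    # A sample counts as correct if its label appears anywhere in its top-k.
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy batch: 4 clips, 5 classes (illustrative values only).
logits = np.array([
    [0.10, 0.70, 0.10, 0.05, 0.05],  # argmax = class 1
    [0.60, 0.10, 0.10, 0.10, 0.10],  # argmax = class 0
    [0.20, 0.20, 0.50, 0.05, 0.05],  # argmax = class 2
    [0.30, 0.40, 0.10, 0.10, 0.10],  # argmax = class 1
])
labels = np.array([1, 0, 3, 0])

print(top_k_accuracy(logits, labels, k=1))  # 0.5 (clips 0 and 1 correct)
print(top_k_accuracy(logits, labels, k=2))  # 0.75 (clip 3's label 0 is 2nd-ranked)
```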