Paper | Code | Top-1 Acc (%) | Top-5 Acc (%) | GFLOPs x views | | Params (M) | | Model | Date
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 93.6 | | | | | | OmniVec2 | 2024-01-01
Enhancing Video Transformers for Action Understanding with VLM-aided Training | | 93.4 | | | | | | FTP-UniFormerV2-L/14 | 2024-03-24 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 92.1 | | | | | | InternVideo2-6B | 2024-03-22 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 91.6 | | | | | | InternVideo2-1B | 2024-03-22 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 91.1 | | | | | | InternVideo | 2022-12-06 |
OmniVec: Learning robust representations with cross modal sharing | | 91.1 | | | | | | OmniVec | 2023-11-07 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 90.9 | 98.9 | 176400x4x3 | | 632 | | TubeViT-H (ImageNet-1k) | 2022-12-06 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 90.6 | 98.7 | 1434x3x4 | | 304 | | UMT-L (ViT-L/16) | 2023-03-28
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 90.2 | 98.6 | 95300x4x3 | | 307 | | TubeViT-L (ImageNet-1k) | 2022-12-06
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 90.0 | 98.4 | 75300x3x2 | | 354 | | UniFormerV2-L (ViT-L, 336) | 2022-09-22 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 90.0 | 98.4 | | | | | VideoMAE V2-g (64x266x266) | 2023-03-29 |
Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 90.0 | | 440x3x4 | | 97 | | FluxViT-B | 2025-03-18 |
Multiview Transformers for Video Recognition | ✓ Link | 89.9 | 98.3 | 735700x4x3 | | | | MTV-H (WTS 60M) | 2022-01-12 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 89.9 | | | | | | TAdaFormer-L/14 | 2023-08-10 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 89.7 | | | | | | EVA | 2022-11-14 |
AM Flow: Adapters for Temporal Processing in Action Recognition | | 89.6 | | | | | | AM/12 ViT-B Dinov2 | 2024-11-04 |
What Can Simple Arithmetic Operations Do for Temporal Modeling? | ✓ Link | 89.4 | 98.3 | | | | | ATM | 2023-07-18 |
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification | ✓ Link | 89.1 | 98.2 | | | | | DejaVid | 2025-01-01 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 88.9 | | | | | | CoCa (finetuned) | 2022-05-04 |
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | ✓ Link | 88.7 | 98.4 | | | | | BIKE (CLIP ViT-L/14) | 2022-12-31 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 88.7 | 97.8 | | | | | ILA (ViT-L/14) | 2023-04-20 |
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 88.6 | 98.2 | | | | | Side4Video (EVA, ViT-E/14) | 2023-11-27 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 88.6 | 97.6 | 8700x3x4 | | 86 | | TubeViT-B (ImageNet-1k) | 2022-12-06
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 88.5 | 98.1 | | | | | VideoMAE V2-g | 2023-03-29 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 88.1 | 97.8 | | | | | ONE-PEACE | 2023-05-18 |
Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 88.0 | | 154x3x4 | | 24 | | FluxViT-S | 2025-03-18 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 88.0 | | | | | | CoCa (frozen) | 2022-05-04 |
Scaling Vision Transformers to 22 Billion Parameters | ✓ Link | 88.0 | | | | | | ViT-22B | 2023-02-10 |
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | ✓ Link | 87.8 | 97.6 | | | | | Text4Vis (CLIP ViT-L/14) | 2022-07-04 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 87.8 | | | | | | Hiera-H (no extra data) | 2023-06-01 |
Frozen CLIP Models are Efficient Video Learners | ✓ Link | 87.7 | 97.8 | | | | | EVL (CLIP ViT-L/14@336px, frozen, 32 frames) | 2022-08-06 |
Dual-path Adaptation from Image to Video Transformers | ✓ Link | 87.7 | 97.8 | | | | | DualPath w/ ViT-L/14 | 2023-03-17 |
Expanding Language-Image Pretrained Models for General Video Recognition | ✓ Link | 87.7 | 97.4 | | | | | X-CLIP (ViT-L/14, CLIP) | 2022-08-04
AIM: Adapting Image Models for Efficient Video Action Recognition | ✓ Link | 87.5 | 97.7 | | | | | AIM (CLIP ViT-L/14, 32x224) | 2023-02-06 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 87.4 | 97.6 | | | | | VideoMAE (no extra data, ViT-H, 32x320x320) | 2022-03-23 |
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | ✓ Link | 87.2 | 97.6 | | | | | ST-Adapter (ViT-L, CLIP) | 2022-06-27 |
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | ✓ Link | 87.2 | 97.6 | | | | | ZeroI2V ViT-L/14 | 2023-10-02 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 87.2 | 97.5 | | | | | CoVeR (JFT-3B) | 2021-12-14 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 87.2 | 97.4 | | | | | MVD (K400 pretrain, ViT-H, 16x224x224) | 2022-12-08 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 87.1 | 97.7 | | | | | mPLUG-2 | 2023-02-01 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 87.0 | 97.4 | | | | | MaskFeat (K600, MViT-L) | 2021-12-16 |
VicTR: Video-conditioned Text Representations for Activity Recognition | | 87.0 | | | | | | VicTR (ViT-L/14) | 2023-04-05 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 86.8 | | | | | | Video-SwinV2-G (ImageNet-22k and external 70M pretrain) | 2021-11-18 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 86.7 | 97.3 | | | | | MaskFeat (no extra data, MViT-L) | 2021-12-16 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 86.6 | 97.1 | | | | | VideoMAE (no extra data, ViT-H) | 2022-03-23 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 86.4 | 97.0 | | | | | MVD (K400 pretrain, ViT-L, 16x224x224) | 2022-12-08 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 86.4 | | | | | | TAdaConvNeXtV2-B | 2023-08-10 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 86.3 | 97.2 | | | | | CoVeR (JFT-300M) | 2021-12-14 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 86.1 | 97.3 | | | | | VideoMAE (no extra data, ViT-L, 32x320x320) | 2022-03-23 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 86.1 | 97.0 | | | | | MViTv2-L (ImageNet-21k pretrain) | 2021-12-02 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 85.7 | 97.2 | | | | | ILA (ViT-B/16) | 2023-04-20 |
Dual-path Adaptation from Image to Video Transformers | ✓ Link | 85.4 | 97.1 | | | | | DualPath w/ ViT-B/16 | 2023-03-17 |
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ Link | 85.4 | | | | | | TokenLearner 16at18 (L/10) | 2021-06-21 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 85.3 | 96.3 | | | | | MAR (50% mask, ViT-L, 16x4) | 2022-07-24 |
CAST: Cross-Attention in Space and Time for Video Action Recognition | ✓ Link | 85.3 | | | | | | CAST (ViT-B/16) | 2023-11-30
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 85.2 | 96.8 | | | | | VideoMAE (no extra data, ViT-L, 16x4) | 2022-03-23 |
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | ✓ Link | 85.1 | | | | | | ViC-MAE (ViT-L) | 2023-03-21 |
VideoMamba: State Space Model for Efficient Video Understanding | ✓ Link | 85.0 | | | | | | VideoMamba-M800 | 2024-03-11 |
Video Swin Transformer | ✓ Link | 84.9 | 96.7 | | | | | Swin-L (384x384, ImageNet-21k pretrain) | 2021-06-24 |
ViViT: A Video Vision Transformer | ✓ Link | 84.9 | 95.8 | | | | | ViViT-H/16x2 (JFT) | 2021-03-29 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 84.1 | 96.1 | | | | | OMNIVORE (Swin-L) | 2022-01-20 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 84.0 | 96.2 | | | | | OMNIVORE (Swin-B) | 2022-01-20 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 83.9 | 96.0 | | | | | MAR (75% mask, ViT-L, 16x4) | 2022-07-24 |
ActionCLIP: A New Paradigm for Video Action Recognition | ✓ Link | 83.8 | 97.1 | | | | | ActionCLIP (CLIP-pretrained) | 2021-09-17 |
Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 83.6 | | | | | | OmniSource irCSN-152 (IG-Kinetics-65M pretrain) | 2020-03-29 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 83.4 | 95.8 | | | | | MVD (K400 pretrain, ViT-B, 16x224x224) | 2022-12-08 |
Learning Correlation Structures for Vision Transformers | | 83.4 | | | | | | StructViT-B-4-1 | 2024-04-05 |
Video Swin Transformer | ✓ Link | 83.1 | 95.9 | | | | | Swin-L (ImageNet-21k pretrain) | 2021-06-24 |
Stand-Alone Inter-Frame Attention in Video Models | ✓ Link | 83.1 | | | | | | SIFA | 2022-06-14 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 82.9 | 94.5 | 259x4 | | | | UniFormer-B (ImageNet-1K) | 2021-09-29 |
Large-scale weakly-supervised pre-training for video action recognition | ✓ Link | 82.8 | | | | | | irCSN-152 (IG-Kinetics-65M pretrain) | 2019-05-02 |
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition | ✓ Link | 82.75 | 94.86 | | | | | DirecFormer | 2022-03-19 |
Video Swin Transformer | ✓ Link | 82.7 | 95.5 | | | | | Swin-B (ImageNet-21k pretrain) | 2021-06-24 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 82.6 | | | | | | ir-CSN-152 (IG-65M pretraining) | 2019-04-04 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 82.5 | 95.3 | | | | | ip-CSN-152 (IG-65M pretraining) | 2019-04-04 |
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | ✓ Link | 82.5 | | | | | | TPS | 2022-07-27 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 82.4 | 95.8 | | | | | ILA (ViT-B/32) | 2023-04-20 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 82.2 | 95.3 | 180x15 | | 87 | | AMD (ViT-B/16) | 2023-11-06
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ Link | 82.1 | 95.5 | | | | | VATT-Large | 2021-04-22 |
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders | ✓ Link | 81.7 | 95.2 | | | | | AdaMAE | 2022-11-16 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 81.5 | 95.1 | | | | | VideoMAE (no extra data, ViT-B, 16x4) | 2022-03-23 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 81.5 | | 386x1 | | | | MoViNet-A6 | 2021-03-21 |
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing | | 81.4 | | | | | | MLP-3D | 2022-06-13 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 81.3 | 95.1 | | | | | R[2+1]D-152 (IG-65M pretraining) | 2019-04-04 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 81.2 | 95.2 | | | | | LGD-3D Two-stream (ResNet-101) | 2019-06-13 |
Multiscale Vision Transformers | ✓ Link | 81.2 | 95.1 | | | | | MViT-B, 64x3 | 2021-04-22 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 81.1 | 95.2 | | | | | Motionformer-HR | 2021-06-09 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 81.0 | 94.8 | | | | | MVD (K400 pretrain, ViT-S, 16x224x224) | 2022-12-08 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 81.0 | 94.4 | | | | | MAR (50% mask, ViT-B, 16x4) | 2022-07-24 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 80.9 | 94.9 | 281x1 | | | | MoViNet-A5 | 2021-03-21 |
Attention Bottlenecks for Multimodal Fusion | ✓ Link | 80.8 | 94.6 | | | | | MBT (AV) | 2021-06-30 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 80.7 | 94.7 | 7140x3 | | 121.4 | | TimeSformer-L | 2021-02-09 |
Video Swin Transformer | ✓ Link | 80.6 | 94.6 | | | | | Swin-B (ImageNet-1k pretrain) | 2021-06-24 |
Video Swin Transformer | ✓ Link | 80.6 | 94.5 | | | | | Swin-S (ImageNet-1k pretrain) | 2021-06-24 |
VidTr: Video Transformer Without Convolutions | | 80.5 | 94.6 | | | | | En-VidTr-L | 2021-04-23 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 80.5 | 94.5 | 105x1 | | | | MoViNet-A4 | 2021-03-21 |
Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 80.5 | 94.4 | | | | | OmniSource SlowOnly R101 8x8(ImageNet pretrain) | 2020-03-29 |
An Image is Worth 16x16 Words, What is a Video Worth? | ✓ Link | 80.5 | | 1040x1 | | | | STAM (64 Frames) | 2021-03-25 |
X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 80.4 | 94.6 | | | | | X3D-XXL | 2020-04-09 |
Revisiting 3D ResNets for Video Recognition | ✓ Link | 80.4 | 94.4 | | | | | R3D-RS-200 | 2021-09-03 |
Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 80.4 | 94.4 | | | | | OmniSource SlowOnly R101 8x8 (Scratch) | 2020-03-29 |
Multiscale Vision Transformers | ✓ Link | 80.2 | 94.4 | | | | | MViT-B, 32x3 | 2021-04-22 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 80.1 | 94.5 | 57x15 | | 22 | | AMD (ViT-S/16) | 2023-11-06
SlowFast Networks for Video Recognition | ✓ Link | 79.8 | 93.9 | | | | | SlowFast 16x8 (ResNet-101 + NL) | 2018-12-10
CT-Net: Channel Tensorization Network for Video Classification | ✓ Link | 79.8 | | | | | | CT-Net Ensemble | 2021-06-03 |
Video Transformer Network | ✓ Link | 79.8 | 94.2 | | | | | ViT-B-VTN+ (ImageNet-21K pretrain) | 2021-02-01
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 79.7 | 94.4 | | | | | TimeSformer-HR | 2021-02-09 |
VidTr: Video Transformer Without Convolutions | | 79.7 | 94.2 | | | | | En-VidTr-M | 2021-04-23 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 79.4 | 94.4 | | | | | LGD-3D RGB (ResNet-101) | 2019-06-13 |
TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ Link | 79.4 | 94.4 | | | | | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
VidTr: Video Transformer Without Convolutions | | 79.4 | 94.0 | | | | | En-VidTr-S | 2021-04-23
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 79.4 | 93.7 | | | | | MAR (75% mask, ViT-B, 16x4) | 2022-07-24 |
An Image is Worth 16x16 Words, What is a Video Worth? | ✓ Link | 79.3 | | 270x1 | | | | STAM (16 Frames) | 2021-03-25 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 79.2 | 93.8 | | | | | ip-CSN-152 (Sports-1M pretraining) | 2019-04-04 |
Video Modeling with Correlation Networks | | 79.2 | | | | | | CorrNet | 2019-06-07 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 79.1 | 94.5 | | | | | OmniVL | 2022-09-15
X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 79.1 | 93.9 | | | | | X3D-XL | 2020-04-09 |
MVFNet: Multi-View Fusion Network for Efficient Video Recognition | ✓ Link | 79.1 | 93.8 | | | | | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 2020-12-13 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 79.1 | 93.7 | | | | | TAdaConvNeXt-T | 2021-10-12 |
SlowFast Networks for Video Recognition | ✓ Link | 78.9 | 93.5 | | | | | SlowFast 16x8 (ResNet-101) | 2018-12-10 |
What Makes Training Multi-Modal Classification Networks Hard? | ✓ Link | 78.9 | | | | | | G-Blend (Sports-1M pretrain) | 2019-05-29 |
Video Swin Transformer | ✓ Link | 78.8 | 93.6 | | | | | Swin-T (ImageNet-1k pretrain) | 2021-06-24 |
Action recognition with spatial-temporal discriminative filter banks | | 78.8 | | | | | | GB + DF + LB (ResNet 152, ImageNet pretrained) | 2019-08-20 |
Video Transformer Network | ✓ Link | 78.6 | 93.7 | | | | | ViT-B-VTN (3 layers, ImageNet pretrain) | 2021-02-01 |
Multiscale Vision Transformers | ✓ Link | 78.4 | 93.5 | | | | | MViT-B, 16x4 | 2021-04-22 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 78.2 | 93.8 | 56.9x1 | | | | MoViNet-A3 | 2021-03-21 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 78.2 | 93.5 | | | | | TAda2D-En (ResNet-50, 8+16 frames) | 2021-10-12 |
Self-supervised Video Transformer | ✓ Link | 78.1 | | | | | | SVT | 2021-12-02 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 78.0 | 93.7 | | | | | TimeSformer | 2021-02-09
SlowFast Networks for Video Recognition | ✓ Link | 77.9 | 93.2 | | | | | SlowFast 8x8 (ResNet-101) | 2018-12-10 |
Representation Flow for Action Recognition | ✓ Link | 77.9 | | | | | | RepFlow-50 ([2+1]D CNN, FcF, Non-local block) | 2018-10-02 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 77.8 | 92.8 | | | | | ip-CSN-152 | 2019-04-04 |
Non-local Neural Networks | ✓ Link | 77.7 | 93.3 | | | | | I3D + NL | 2017-11-21 |
What Makes Training Multi-Modal Classification Networks Hard? | ✓ Link | 77.7 | | | | | | G-Blend | 2019-05-29 |
Large Scale Holistic Video Understanding | ✓ Link | 77.6 | | | | | | HATNet (32 frames) | 2019-04-25 |
X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 77.5 | 92.9 | | | | | X3D-L | 2020-04-09 |
Collaborative Spatiotemporal Feature Learning for Video Action Recognition | ✓ Link | 77.5 | | | | | | CoST ResNet-101 (ImageNet pretrain) | 2019-06-01 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 77.4 | 93.1 | | | | | TAda2D (ResNet-50, 16 frames) | 2021-10-12 |
Evolving Space-Time Neural Architectures for Videos | | 77.4 | | | | | | EvaNet | 2018-11-26 |
Region-based Non-local Operation for Video Classification | ✓ Link | 77.4 | | | | | | RNL+TSM Ensemble (ResNet-50, 8+16 frames) | 2020-07-17
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | ✓ Link | 77.4 | | | | | | VIMPAC | 2021-06-21 |
Busy-Quiet Video Disentangling for Video Classification | ✓ Link | 77.3 | 93.2 | | | | | BQN (ResNet-50) | 2021-03-29 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 77.2 | 93.0 | | | | | S3D-G (RGB+Flow, ImageNet pretrained) | 2017-12-13
SlowFast Networks for Video Recognition | ✓ Link | 77.0 | 92.6 | | | | | SlowFast 8x8 (ResNet-50) | 2018-12-10
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 76.7 | 92.6 | | | | | TAda2D (ResNet-50, 8 frames) | 2021-10-12 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 76.5 | | | | | | D3D+S3D-G (RGB + RGB) | 2018-12-19 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 76.4 | | | | | | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
Global Textual Relation Embedding for Relational Understanding | ✓ Link | 76.1 | | | | | | GloRe | 2019-06-03 |
X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 76.0 | 92.3 | | | | | X3D-M | 2020-04-09
Multiscale Vision Transformers | ✓ Link | 76.0 | 92.1 | | | | | MViT-S | 2021-04-22
Two-Stream Video Classification with Cross-Modality Attention | | 75.98 | | | | | | CMA iter1 (16 frames) | 2019-08-01 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 75.9 | | | | | | D3D (RGB) | 2018-12-19 |
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution | ✓ Link | 75.7 | | | | | | Oct-I3D + NL | 2019-04-10 |
SlowFast Networks for Video Recognition | ✓ Link | 75.6 | 92.1 | | | | | SlowFast 4x16 (ResNet-50) | 2018-12-10 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 75.4 | 91.9 | | | | | R[2+1]D-Flow (Sports-1M pretrain) | 2017-11-30 |
FASTER Recurrent Networks for Efficient Video Classification | | 75.1 | | | | | | FASTER32 | 2019-06-10 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 75.0 | 92.3 | 10.3x1 | | | | MoViNet-A2 | 2021-03-21 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 74.9 | | | | | | MARS+RGB+Flow (64 frames) | 2019-06-01 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 74.7 | 93.4 | | | | | S3D-G (RGB, ImageNet pretrained) | 2017-12-13 |
TSM: Temporal Shift Module for Efficient Video Understanding | ✓ Link | 74.7 | | | | | | TSM | 2018-11-20 |
$A^2$-Nets: Double Attention Networks | | 74.6 | 91.5 | | | | | A2 Net | 2018-10-27 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 74.3 | 91.4 | | | | | R[2+1]D-RGB (Sports-1M pretrain) | 2017-11-30 |
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | ✓ Link | 73.9 | 91.1 | | | | | TSN | 2016-08-02 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 73.9 | 90.9 | | | | | R[2+1]D-Two-Stream | 2017-11-30 |
ConvNet Architecture Search for Spatiotemporal Feature Learning | ✓ Link | 73.9 | | | | | | TSN | 2017-08-16 |
STM: SpatioTemporal and Motion Encoding for Action Recognition | | 73.7 | | | | | | STM (ResNet-50) | 2019-08-07 |
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation | ✓ Link | 73.5 | 91.2 | | | | | bLVNet-TAM | 2019-12-02
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 73.05 | | 6.90x1 | | 32.45 | | Co Slow_64 | 2021-05-31 |
Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification | | 73.0 | 90.9 | | | | | Inception-ResNet | 2017-08-12 |
Multi-Fiber Networks for Video Recognition | | 72.8 | 90.4 | | | | | MFNet | 2018-07-30 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 72.7 | 91.2 | 6.0x1 | | | | MoViNet-A1 | 2021-03-21 |
Appearance-and-Relation Networks for Video Classification | ✓ Link | 72.4 | 90.4 | | | | | ARTNet | 2017-11-24 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 72.3 | 90.9 | | | | | LGD-3D Flow (ResNet-101) | 2019-06-13 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 72.0 | 90.0 | | | | | R[2+1]D-RGB | 2017-11-30
FASTER Recurrent Networks for Efficient Video Classification | | 71.7 | | | | | | FASTER16 w/o sp | 2019-06-10 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 71.61 | | 1.25x1 | | 6.15 | | Co X3D-L_64 | 2021-05-31 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 71.1 | 89.3 | | | | | I3D | 2017-05-22 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 71.03 | | 0.33x1 | | 3.79 | | Co X3D-M_64 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 69.29 | | 19.17x1 | | 6.15 | | X3D-L | 2021-05-31 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 68.9 | | | | | | MARS+RGB+Flow (16 frames) | 2019-06-01 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 68.45 | | 66.25x1 | | 66.25 | | SlowFast-8×8-R50 | 2021-05-31 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 68.0 | 87.6 | | | | | S3D-G (Flow, ImageNet pretrained) | 2017-12-13
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 67.5 | 87.2 | | | | | R[2+1]D-Flow | 2017-11-30 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.42 | | 54.87x1 | | 32.45 | | Slow-8x8-R50 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.33 | | 0.17x1 | | 3.79 | | Co X3D-S_64 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.24 | | 4.97x1 | | 3.79 | | X3D-M | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.06 | | 36.46x1 | | 34.48 | | SlowFast-4×16-R50 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 65.90 | | 6.90x1 | | 32.45 | | Co Slow_8 | 2021-05-31 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 65.8 | 87.4 | 2.7x1 | | | | MoViNet-A0 | 2021-03-21 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 64.71 | | 2.06x1 | | 3.79 | | X3D-S | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 63.98 | | 28.61x1 | | 28.04 | | I3D-R50 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 63.03 | | 1.25x1 | | 6.15 | | Co X3D-L_16 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 62.80 | | 0.33x1 | | 3.79 | | Co X3D-M_16 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 60.18 | | 0.17x1 | | 3.79 | | Co X3D-S_13 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.58 | | 5.68x1 | | 28.04 | | Co I3D_8 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.52 | | 40.71x1 | | 31.51 | | R(2+1)D-18_16 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.37 | | 0.64x1 | | 3.79 | | X3D-XS | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 56.86 | | 5.68x1 | | 28.04 | | Co I3D_64 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 53.52 | | 20.35x1 | | 31.51 | | R(2+1)D-18_8 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 53.40 | | 4.71x1 | | 12.80 | | RCU_8 | 2021-05-31 |
ViViT: A Video Vision Transformer | ✓ Link | | 94.7 | | | | | ViViT-L/16x2 320 | 2021-03-29 |
Video Transformer Network | ✓ Link | | 93.4 | | | | | ViT-B-VTN (1 layer, ImageNet pretrain) | 2021-02-01 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | | 225x5 | | | | MViT-B (train from scratch) | 2021-12-02 |
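The rows above are plain pipe-separated records, so they can be loaded, filtered, and re-ranked with a few lines of code. The following is a minimal sketch under stated assumptions: it relies on the column order given in the header row (Paper, Code, Top-1, Top-5, GFLOPs x views, an unlabeled blank column, Params (M), another blank column, Model, Date) and on a hypothetical input file named leaderboard.txt; it is not tied to any particular leaderboard export tool.

```python
# Minimal sketch for parsing the pipe-separated leaderboard rows in this section.
# Assumes the column order shown in the header row and a hypothetical file
# "leaderboard.txt" containing the rows; blank or non-numeric cells become None.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    paper: str
    top1: Optional[float]      # Top-1 accuracy (%), None if not reported
    top5: Optional[float]      # Top-5 accuracy (%), None if not reported
    flops: str                 # e.g. "7140x3": GFLOPs with number of views, kept as text
    params_m: Optional[float]  # parameter count in millions, None if not reported
    model: str
    date: str


def _num(s: str) -> Optional[float]:
    """Parse a numeric cell; blank or non-numeric cells (e.g. the header) become None."""
    try:
        return float(s)
    except ValueError:
        return None


def parse_row(line: str) -> Entry:
    cols = [c.strip() for c in line.split("|")]
    return Entry(
        paper=cols[0],
        top1=_num(cols[2]),
        top5=_num(cols[3]),
        flops=cols[4],
        params_m=_num(cols[6]),
        model=cols[8],
        date=cols[9],
    )


if __name__ == "__main__":
    with open("leaderboard.txt") as f:  # hypothetical file name
        entries = [parse_row(line) for line in f if line.count("|") == 9]

    # Rank by Top-1 accuracy; rows without a Top-1 score (including the header) sort last.
    ranked = sorted(
        entries,
        key=lambda e: e.top1 if e.top1 is not None else -1.0,
        reverse=True,
    )
    for e in ranked[:5]:
        print(f"{e.top1}\t{e.model}\t({e.date})")
```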