Paper | Code | Top-1 (%) | Top-5 (%) | Params (M) | GFLOPs x views | Model | Date |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 77.3 | 95.7 | 633 | 1192x6 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 2022-12-08 |
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification | ✓ Link | 77.2 | 96.3 | | | DejaVid | 2025-01-01 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 77.2 | | | | InternVideo | 2022-12-06 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 77.1 | | | | InternVideo2-1B | 2024-03-22 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 77.0 | 95.9 | 1013 | 2544x6 | VideoMAE V2-g | 2023-03-29 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 76.7 | 95.5 | 305 | 597x6 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 2022-12-08 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 76.5 | | | | Hiera-L (no extra data) | 2023-06-01 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 76.1 | 95.2 | | | TubeViT-L | 2022-12-06 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 75.4 | 95.2 | 305 | 1436x3 | VideoMAE (no extra data, ViT-L, 32x2) | 2022-03-23 |
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 75.2 | 94.0 | | | Side4Video (EVA ViT-E/14) | 2023-11-27 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 75.0 | 95.0 | 218 | 2828x3 | MaskFeat (Kinetics600 pretrain, MViT-L) | 2021-12-16 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 74.7 | 94.9 | 311 | 276x6 | MAR (50% mask, ViT-L, 16x4) | 2022-07-24 |
What Can Simple Arithmetic Operations Do for Temporal Modeling? | ✓ Link | 74.6 | 94.4 | | | ATM | 2023-07-18 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 74.4 | | | | MAWS (ViT-L) | 2023-03-23 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 74.3 | 94.6 | 305 | 597x6 | VideoMAE (no extra data, ViT-L, 16 frame) | 2022-03-23 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 73.8 | 94.4 | 311 | 131x6 | MAR (75% mask, ViT-L, 16x4) | 2022-07-24 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 73.7 | 94.0 | 87 | 180x6 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 2022-12-08 |
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | ✓ Link | 73.7 | | | | ViC-MAE (ViT-L) | 2023-03-21 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 73.6 | | | | TAdaFormer-L/14 | 2023-08-10 |
TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning | ✓ Link | 73.4 | 93.8 | | | TDS-CLIP-ViT-L/14 (8 frames) | 2024-08-20 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 73.3 | 94.1 | 213.1 | | MViTv2-L (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 73.3 | 94.0 | 87 | 180x6 | AMD (ViT-B/16) | 2023-11-06 |
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 73.0 | 94.5 | | 5154 | UniFormerV2-L | 2022-09-22 |
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | ✓ Link | 72.3 | 93.9 | | 8248 | ST-Adapter (ViT-L, CLIP) | 2022-06-27 |
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | ✓ Link | 72.2 | 93.0 | | | ZeroI2V ViT-L/14 | 2023-10-02 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 72.1 | | | 225x3 | MViT-B (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
CAST: Cross-Attention in Space and Time for Video Action Recognition | ✓ Link | 71.6 | | | | CAST (ViT-B/16) | 2023-11-30 |
Learning Correlation Structures for Vision Transformers | | 71.5 | | | | StructVit-B-4-1 | 2024-04-05 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 71.4 | 93.5 | | | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) | 2022-01-20 |
BEVT: BERT Pretraining of Video Transformers | ✓ Link | 71.4 | | 89 | 321x3 | BEVT (IN-1K + Kinetics400 pretrain) | 2021-12-02 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 71.2 | 92.8 | 50.1 | 259x3 | UniFormer-B (IN-1K + Kinetics400 pretrain) | 2021-09-29 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 71.1 | | | | TAdaConvNeXtV2-B | 2023-08-10 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 71.0 | 92.8 | 94 | 86x6 | MAR (50% mask, ViT-B, 16x4) | 2022-07-24 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 70.9 | 92.8 | 22 | 57x6 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | 2022-12-08 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 70.9 | 92.5 | | | CoVeR (JFT-3B) | 2021-12-14 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 70.8 | 92.4 | 87 | 180x6 | VideoMAE (no extra data, ViT-B, 16 frame) | 2022-03-23 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 70.2 | 92.5 | 22 | 57x6 | AMD (ViT-S/16) | 2023-11-06 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 70.2 | 91.8 | | | ILA (ViT-L/14) | 2023-04-20 |
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning | ✓ Link | 70.1 | 92.8 | 68.5 | 197x3 | MorphMLP-B (IN-1K) | 2021-11-24 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 69.8 | 91.9 | | | CoVeR (JFT-300M) | 2021-12-14 |
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | ✓ Link | 69.8 | | | | TPS | 2022-07-27 |
Stand-Alone Inter-Frame Attention in Video Models | ✓ Link | 69.8 | | | | SIFA | 2022-06-14 |
Video Swin Transformer | ✓ Link | 69.6 | 92.7 | 89 | 321x3 | Swin-B (IN-21K + Kinetics400 pretrain) | 2021-06-24 |
TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ Link | 69.6 | 92.2 | | 198x3 | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 69.5 | 91.9 | 94 | 41x6 | MAR (75% mask, ViT-B, 16x4) | 2022-07-24 |
Object-Region Video Transformers | ✓ Link | 69.5 | 91.5 | N/A | N/A | ORViT Mformer-L (ORViT blocks) | 2021-10-13 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 69.4 | 92.1 | 21.4 | 41.8x3 | UniFormer-S (IN-1K + Kinetics600 pretrain) | 2021-09-29 |
Mutual Modality Learning for Video Action Classification | ✓ Link | 69.02 | 92.70 | | | MML (ensemble) | 2020-11-04 |
Multiscale Vision Transformers | ✓ Link | 68.7 | 91.5 | 53.2 | 236x3 | MViT-B-24, 32x3 | 2021-04-22 |
Multiview Transformers for Video Recognition | ✓ Link | 68.5 | 90.4 | | | MTV-B | 2022-01-12 |
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing | | 68.5 | | | | MLP-3D | 2022-06-13 |
TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ Link | 68.2 | 91.6 | | 198x1 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
Multi-scale Motion-Aware Module for Video Action Recognition | | 68.2 | | | | MSMA (8+16 frames) | 2023-02-19 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 68.1 | 91.2 | N/A | 1181x3 | Mformer-L | 2021-06-09 |
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | ✓ Link | 68.1 | | | | VIMPAC | 2021-06-21 |
Object-Region Video Transformers | ✓ Link | 67.9 | 90.5 | N/A | N/A | ORViT Mformer (ORViT blocks) | 2021-10-13 |
Multiscale Vision Transformers | ✓ Link | 67.8 | 91.3 | 36.6 | 170x3 | MViT-B, 32x3(Kinetics600 pretrain) | 2021-04-22 |
Group Contextualization for Video Recognition | ✓ Link | 67.8 | 91.2 | 27.4 | 110.1 | GC-TDN Ensemble (R50, 8+16) | 2022-03-18 |
CT-Net: Channel Tensorization Network for Video Classification | ✓ Link | 67.8 | 91.1 | 83.8 | 280 | CT-Net Ensemble (R50, 8+12+16+24) | 2021-06-03 |
Motion-driven Visual Tempo Learning for Video-based Action Recognition | ✓ Link | 67.8 | | | | TCM (Ensemble) | 2022-02-24 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 67.7 | 91.1 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 2021-02-14 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 67.7 | 91.1 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 2021-11-02 |
Global Temporal Difference Network for Action Recognition | | 67.6 | | | | GTDNet | 2022-11-23 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 67.4 | 91.0 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 2021-02-14 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 67.35 | 90.50 | 5.8 | 20.9x6 | VoV3D-L (32 frames, Kinetics pretrained, single) | 2020-12-01 |
SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition | | 67.3 | 91.0 | | | PLAR | 2023-05-21 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 67.3 | 90.8 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
Space-time Mixing Attention for Video Transformer | ✓ Link | 67.2 | 90.8 | N/A | 850x1 | X-Vit (x16) | 2021-06-10 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 67.2 | 89.8 | | | TAda2D-En (ResNet-50, 8+16 frames) | 2021-10-12 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 67.1 | 90.6 | N/A | 958.8x3 | Mformer-HR | 2021-06-09 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 67.1 | 90.4 | | | TAdaConvNeXt-T | 2021-10-12 |
Action Recognition With Motion Diversification and Dynamic Selection | | 67.1 | | | | MoDS (8+16 frames) | 2022-07-15 |
Spatial-Temporal Pyramid Graph Reasoning for Action Recognition | | 67.0 | | | | STPG (8+16 frames) | 2022-08-09 |
Mutual Modality Learning for Video Action Classification | ✓ Link | 66.83 | 91.30 | | | MML (single) | 2020-11-04 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 66.8 | 90.3 | | | ILA (ViT-B/16) | 2023-04-20 |
TSM: Temporal Shift Module for Efficient Video Understanding | ✓ Link | 66.6 | 91.3 | | | TSM (RGB + Flow) | 2018-11-20 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 66.6 | 90.6 | | | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 2020-07-20 |
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | ✓ Link | 66.5 | 90.6 | | | PAN ResNet101 (RGB only, no Flow) | 2020-08-08 |
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention | | 66.5 | 90.4 | | | TSM+W3 (16 frames, RGB ResNet-50) | 2020-04-02 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 66.5 | 90.1 | | | Mformer | 2021-06-09 |
MVFNet: Multi-View Fusion Network for Efficient Video Recognition | ✓ Link | 66.3 | | | | MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-13 |
Multiscale Vision Transformers | ✓ Link | 66.2 | 90.2 | | | MViT-B, 16x4 | 2021-04-22 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 66.0 | 89.8 | | | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 65.8 | 89.5 | 5.8 | 20.9x6 | VoV3D-L (32 frames, from scratch, single) | 2020-12-01 |
Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition | ✓ Link | 65.7 | 89.8 | | 18.3 | E3D-L | 2023-03-05 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 65.7 | 89.8 | | | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 2021-02-14 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 65.6 | 89.2 | | | TAda2D (ResNet-50, 16 frames) | 2021-10-12 |
ViViT: A Video Vision Transformer | ✓ Link | 65.4 | 89.8 | | | ViViT-L/16x2 Fact. encoder | 2021-03-29 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 65.24 | 89.48 | 3.3 | 11.5x6 | VoV3D-M (32 frames, Kinetics pretrained, single) | 2020-12-01 |
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation | ✓ Link | 65.2 | | | | bLVNet | 2019-12-02 |
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition | ✓ Link | 64.94 | 87.9 | | | DirecFormer | 2022-03-19 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 64.8 | 89.1 | | | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 64.7 | 89.4 | | | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
Action Keypoint Network for Efficient Video Recognition | | 64.3 | | | | AK-Net | 2022-01-17 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 64.2 | 88.8 | 3.3 | 11.5x6 | VoV3D-M (32 frames, from scratch, single) | 2020-12-01 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 64.1 | 88.6 | 5.8 | 9.3x6 | VoV3D-L (16 frames, from scratch, single) | 2020-12-01 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 64.0 | 88.0 | | | TAda2D (ResNet-50, 8 frames) | 2021-10-12 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 63.5 | 89.0 | 4.8 | 10.3x1 | MoViNet-A2 | 2021-03-21 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 63.2 | 88.2 | 3.3 | 5.7x6 | VoV3D-M (16 frames, from scratch, single) | 2020-12-01 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 63.0 | 88.4 | | | MSNet-R50 (8 frames, ImageNet pretrained) | 2020-07-20 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 62.7 | 89.0 | 4.6 | 6.0x1 | MoViNet-A1 | 2021-03-21 |
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 62.5 | 86.2 | | | OmniVL | 2022-09-15 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 62.5 | | | | TimeSformer-HR | 2021-02-09 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 62.3 | | | | TimeSformer-L | 2021-02-09 |
Temporal Reasoning Graph for Activity Recognition | | 62.2 | 90.3 | | | TRG (ResNet-50) | 2019-08-27 |
Temporal Pyramid Network for Action Recognition | ✓ Link | 62.0 | | | | TPN (TSM-50) | 2020-04-07 |
A Multigrid Method for Efficiently Training Video Models | ✓ Link | 61.7 | | | | Multigrid | 2019-12-02 |
SlowFast Networks for Video Recognition | ✓ Link | 61.7 | | | | SlowFast | 2018-12-10 |
Temporal Reasoning Graph for Activity Recognition | | 61.3 | 91.4 | | | TRG (Inception-V3) | 2019-08-27 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 61.3 | 88.2 | 3.1 | 2.7x1 | MoViNet-A0 | 2021-03-21 |
Cooperative Cross-Stream Network for Discriminative Action Representation | | 61.2 | 89.3 | | | CCS + two-stream + TRN | 2019-08-27 |
VidTr: Video Transformer Without Convolutions | | 60.2 | | | | VidTr-L | 2021-04-23 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 59.5 | | | | TimeSformer | 2021-02-09 |
Self-supervised Video Transformer | ✓ Link | 59.2 | | | | SVT | 2021-12-02 |
Few-Shot Video Classification via Temporal Alignment | | 52.3 | | | | TAM (5-shot) | 2019-06-27 |
The "something something" video database for learning and evaluating visual common sense | ✓ Link | 51.33 | 80.46 | | | model3D_1 with left-right augmentation and fps jitter | 2017-06-13 |
Attention Distillation for Learning Video Representations | | 49.9 | 79.1 | | | Prob-Distill | 2019-04-05 |
Comparative Analysis of CNN-based Spatiotemporal Reasoning in Videos | ✓ Link | 47.73 | | | | STM + TRNMultiscale | 2019-09-11 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | | | 2131 | 13321 | InternVideo2-6B | 2024-03-22 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | 93.4 | 51.1 | | MViTv2-B (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | | 91.1 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 2021-11-02 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | | | 5.3 | 23.7x1 | MoViNet-A3 | 2021-03-21 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | | | 2828x3 | MViT-L (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
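For working with the table programmatically, here is a minimal Python sketch that parses these pipe-delimited rows into records and ranks them by Top-1 accuracy. The column names (`paper`, `top1`, etc.) are assumptions for illustration, not labels given by the source; rows with a missing Top-1 score are simply dropped from the ranking.

```python
# Minimal sketch: parse pipe-delimited leaderboard rows into records
# and rank them by Top-1 accuracy. Column names are assumed, not part
# of the original table.
COLUMNS = ["paper", "code", "top1", "top5", "params_m", "flops", "model", "date"]

def parse_rows(text):
    """Turn each 'a | b | c |' line into a dict keyed by COLUMNS."""
    records = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if parts and parts[-1] == "":  # rows end with a trailing '|'
            parts.pop()
        rec = dict(zip(COLUMNS, parts))
        # Missing metrics are empty strings; convert Top-1 to float when present.
        rec["top1"] = float(rec["top1"]) if rec.get("top1") else None
        records.append(rec)
    return records

def rank_by_top1(records):
    """Sort scored entries best-first; unscored entries are excluded."""
    scored = [r for r in records if r["top1"] is not None]
    return sorted(scored, key=lambda r: r["top1"], reverse=True)

if __name__ == "__main__":
    sample = (
        "Hiera | ✓ Link | 76.5 | | | | Hiera-L (no extra data) | 2023-06-01 |\n"
        "VideoMAE V2 | ✓ Link | 77.0 | 95.9 | 1013 | 2544x6 | VideoMAE V2-g | 2023-03-29 |"
    )
    for rec in rank_by_top1(parse_rows(sample)):
        print(rec["model"], rec["top1"])
```

The same split-and-zip approach extends to the other numeric columns (Top-5, params, FLOPs) if needed, though FLOPs entries like `2544x6` would need the `x views` suffix parsed separately.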