InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 91.9 | | | InternVideo2-6B | 2024-03-22 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 91.8 | 98.9 | | TubeVit-H | 2022-12-06 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 91.6 | | | InternVideo2-1B | 2024-03-22 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 91.5 | 98.7 | | TubeVit-L | 2022-12-06 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 91.3 | | | InternVideo-T | 2022-12-06 |
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | | 91.1 | 97.1 | | 🍷MerlotReserve-Large (+Audio) | 2022-01-07 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 90.9 | 97.3 | | TubeVit-B | 2022-12-06 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 90.5 | 98.8 | | UMT-L (ViT-L/16) | 2023-03-28 |
Multiview Transformers for Video Recognition | ✓ Link | 90.3 | 98.5 | | MTV-H (WTS 60M) | 2022-01-12 |
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 90.1 | 98.5 | | UniFormerV2-L | 2022-09-22 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 89.9 | 98.5 | | VideoMAE V2-g (64x266x266) | 2023-03-29 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 89.8 | 98.3 | | mPLUG-2 | 2023-02-01 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 89.8 | | | EVA | 2022-11-14 |
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | | 89.7 | 96.6 | | 🍷MerlotReserve-Base (+Audio) | 2022-01-07 |
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | | 89.4 | 96.3 | | 🍷MerlotReserve-Large (no Audio) | 2022-01-07 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 89.4 | | | CoCa (finetuned) | 2022-05-04 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 88.8 | 98.2 | | VideoMAE V2-g | 2023-03-29 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 88.8 | | | Hiera-H (no extra data) | 2023-06-01 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 88.5 | | | CoCa (frozen) | 2022-05-04 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 88.3 | 98.0 | | MaskFeat (no extra data, MViT-L) | 2021-12-16 |
Expanding Language-Image Pretrained Models for General Video Recognition | ✓ Link | 88.3 | 97.7 | | X-CLIP(ViT-L/14, CLIP) | 2022-08-04 |
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | | 88.1 | 95.8 | | 🍷MerlotReserve-Base (no Audio) | 2022-01-07 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 87.9 | 97.9 | | MViTv2-L (ImageNet-21k pretrain) | 2021-12-02 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 87.9 | 97.8 | | CoVeR (JFT-3B) | 2021-12-14 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 87.8 | 97.9 | | Florence (curated FLD-900M pretrain) | 2021-11-22 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 86.8 | 97.3 | | CoVeR (JFT-300M) | 2021-12-14 |
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ Link | 86.3 | 97.0 | | TokenLearner 16at18 w. Fuser (L/10) | 2021-06-21 |
Video Swin Transformer | ✓ Link | 86.1 | 97.3 | | Swin-L (384x384, ImageNet-21k pretrain) | 2021-06-24 |
ViViT: A Video Vision Transformer | ✓ Link | 85.8 | 96.5 | | ViViT-H/16x2 (JFT) | 2021-03-29 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 85.5 | | | MViTv2-L (train from scratch) | 2021-12-02 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 84.8 | 96.7 | 259x4 | UniFormer-B (ImageNet-1K) | 2021-09-29 |
Space-time Mixing Attention for Video Transformer | ✓ Link | 84.5 | 96.3 | | XViT (x16) | 2021-06-10 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 84.3 | 96.4 | 281x1 | MoViNet-A5 (AutoAugment) | 2021-03-21 |
ViViT: A Video Vision Transformer | ✓ Link | 84.3 | 95.6 | | ViViT-L/16x2 | 2021-03-29 |
Video Swin Transformer | ✓ Link | 84.0 | 96.5 | | Swin-B (ImageNet-21k pretrain) | 2021-06-24 |
Multiscale Vision Transformers | ✓ Link | 83.8 | 96.3 | | MViT-B-24, 32x3 | 2021-04-22 |
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ Link | 83.6 | 96.6 | | VATT-Large | 2021-04-22 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 83.5 | 96.5 | 386x1 | MoViNet-A6 | 2021-03-21 |
Multiscale Vision Transformers | ✓ Link | 83.4 | 96.3 | | MViT-B, 32x3 | 2021-04-22 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 83.1 | 96.2 | | LGD-3D Two-stream | 2019-06-13 |
Revisiting 3D ResNets for Video Recognition | ✓ Link | 83.1 | | | R3D-RS-200 | 2021-09-03 |
ViViT: A Video Vision Transformer | ✓ Link | 83.0 | 95.7 | | ViViT-L/16x2 (320x320) | 2021-03-29 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 82.7 | 95.7 | 281x1 | MoViNet-A5 | 2021-03-21 |
Multiscale Vision Transformers | ✓ Link | 82.1 | 95.7 | | MViT-B, 16x4 | 2021-04-22 |
PERF-Net: Pose Empowered RGB-Flow Net | | 82.0 | 95.7 | | PERF-Net (distilled ResNet50-G) | 2020-09-28 |
SlowFast Networks for Video Recognition | ✓ Link | 81.8 | 95.1 | | SlowFast 16x8 (ResNet-101 + NL) | 2018-12-10 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 81.5 | 95.6 | | LGD-3D RGB | 2019-06-13 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 81.2 | 94.9 | 105x1 | MoViNet-A4 | 2021-03-21 |
SlowFast Networks for Video Recognition | ✓ Link | 81.1 | 95.1 | | SlowFast 16x8 (ResNet-101) | 2018-12-10 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 80.8 | 80.8 | 56.9x1 | MoViNet-A3 | 2021-03-21 |
SlowFast Networks for Video Recognition | ✓ Link | 80.4 | 94.8 | | SlowFast 8x8 (ResNet-101) | 2018-12-10 |
SlowFast Networks for Video Recognition | ✓ Link | 79.9 | 94.5 | | SlowFast 8x8 (ResNet-50) | 2018-12-10 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 79.1 | | | D3D+S3D-G | 2018-12-19 |
SlowFast Networks for Video Recognition | ✓ Link | 78.8 | 94.0 | | SlowFast 4x16 (ResNet-50) | 2018-12-10 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 78.6 | | | S3D-G (RGB+Flow) | 2017-12-13 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 77.9 | | | D3D | 2018-12-19 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 77.5 | 93.4 | 10.3x1 | MoViNet-A2 | 2021-03-21 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 76.6 | | | S3D-G (RGB) | 2017-12-13 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 76.0 | 92.6 | 6.0x1 | MoViNet-A1 | 2021-03-21 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 75.0 | 92.4 | | LGD-3D Flow | 2019-06-13 |
A Short Note about Kinetics-600 | ✓ Link | 73.6 | | | I3D (RGB) | 2018-08-03 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 71.5 | 90.4 | 2.7x1 | MoViNet-A0 | 2021-03-21 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 69.7 | | | S3D-G (Flow) | 2017-12-13 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | 97.2 | | MViTv2-B (train from scratch) | 2021-12-02 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | | 206x5 | MViT-L (train from scratch) | 2021-12-02 |