InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 85.9 | | InternVideo2-6B | 2024-03-22 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 85.4 | | InternVideo2-1B | 2024-03-22 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 84.0 | | InternVideo-T | 2022-12-06 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 83.8 | 96.6 | TubeViT-L | 2022-12-06 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 83.6 | 96.7 | UMT-L (ViT-L/16) | 2023-03-28 |
Multiview Transformers for Video Recognition | ✓ Link | 83.4 | 96.2 | MTV-H (WTS 60M) | 2022-01-12 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 82.9 | | EVA | 2022-11-14 |
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 82.7 | 96.2 | UniFormerV2-L | 2022-09-22 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 82.7 | | CoCa (finetuned) | 2022-05-04 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 81.1 | | CoCa (frozen) | 2022-05-04 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 81.1 | | Hiera-H (no extra data) | 2023-06-01 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 80.4 | 95.7 | MaskFeat (no extra data, MViT-L) | 2021-12-16 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 80.4 | 94.9 | mPLUG-2 | 2023-02-01 |
AIM: Adapting Image Models for Efficient Video Action Recognition | ✓ Link | 80.4 | | AIM (CLIP ViT-L/14, 32x224) | 2023-02-06 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 79.8 | 94.9 | CoVeR (JFT-3B) | 2021-12-14 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 79.4 | 94.9 | MViTv2-L (ImageNet-21k pretrain) | 2021-12-02 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 79.4 | | MoViNet-A6 | 2021-12-02 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 78.5 | 94.2 | CoVeR (JFT-300M) | 2021-12-14 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 76.6 | 93.2 | MViTv2-B | 2021-12-02 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 72.3 | | MoViNet-A6 | 2021-03-21 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 71.7 | | MoViNet-A5 | 2021-03-21 |
VidTr: Video Transformer Without Convolutions | | 70.8 | 89.4 | En-VidTr-L | 2021-04-23 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 70.7 | | MoViNet-A4 | 2021-03-21 |
VidTr: Video Transformer Without Convolutions | | 70.2 | 89.0 | VidTr-L | 2021-04-23 |
VidTr: Video Transformer Without Convolutions | | 69.5 | 88.3 | VidTr-M | 2021-04-23 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 68.0 | | MoViNet-A3 | 2021-03-21 |
VidTr: Video Transformer Without Convolutions | | 67.3 | 87.7 | VidTr-S | 2021-04-23 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 66.7 | | MoViNet-A2 | 2021-03-21 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 63.5 | | MoViNet-A1 | 2021-03-21 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 58.5 | | MoViNet-A0 | 2021-03-21 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 56.46 | 76.82 | SRTG r3d-101 | 2020-06-15 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 54.17 | 74.62 | SRTG r(2+1)d-50 | 2020-06-15 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 53.52 | 74.17 | SRTG r3d-50 | 2020-06-15 |
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | ✓ Link | 51.9 | | SEER (RegNet10B) | 2022-02-16 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 49.43 | 73.23 | SRTG r(2+1)d-34 | 2020-06-15 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 49.15 | 72.68 | SRTG r3d-34 | 2020-06-15 |