Conformers are All You Need for Visual Speech Recognition | | 12.8 | LP + Conformer | 2023-02-17 |
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | ✓ Link | 19.1 | Auto-AVSR | 2023-03-25 |
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | ✓ Link | 21.5 | SyncVSR | 2024-06-18 |
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | ✓ Link | 21.5 | USR (self + semi-supervised) | 2024-11-04 |
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | ✓ Link | 22.3 | USR (self-supervised) | 2024-11-04 |
Jointly Learning Visual and Auditory Speech Representations from Raw Data | ✓ Link | 23.4 | RAVEn Large | 2022-12-12 |
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | ✓ Link | 25.4 | VSP-LLM | 2024-02-23 |
Relaxed Attention for Transformer Models | ✓ Link | 25.51 | AV-HuBERT Large + Relaxed Attention + LM | 2022-09-20 |
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | ✓ Link | 26.2 | DistillAV | 2025-02-09 |
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | ✓ Link | 26.9 | AV-HuBERT Large | 2022-01-05 |
Sub-word Level Lip Reading With Visual Attention | | 30.7 | VTP (more data) | 2021-10-14 |
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | ✓ Link | 31.2 | SyncVSR | 2024-06-18 |
Visual Speech Recognition for Multiple Languages in the Wild | ✓ Link | 31.5 | CTC/Attention (LRW+LRS2/3+AVSpeech) | 2022-02-26 |
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | ✓ Link | 33.6 | RNN-T | 2019-11-08 |
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | | 37.1 | ES³ Large | 2024-01-01 |
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | | 40.3 | ES³ Base | 2024-01-01 |
Sub-word Level Lip Reading With Visual Attention | | 40.6 | VTP | 2021-10-14 |
End-to-end Audio-visual Speech Recognition with Conformers | ✓ Link | 43.3 | Hyb + Conformer | 2021-02-12 |
Large-Scale Visual Speech Recognition | | 55.1 | CTC-V2P | 2018-07-13 |
Discriminative Multi-modality Speech Recognition | ✓ Link | 57.8 | EG-seq2seq | 2020-05-12 |
Deep Audio-Visual Speech Recognition | ✓ Link | 58.9 | TM-seq2seq | 2018-09-06 |
ASR is all you need: cross-modal distillation for lip reading | | 59.8 | CTC + KD | 2019-11-28 |
Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | | 60.1 | Conv-seq2seq | 2019-10-01 |