Paper | Code | WER (%) | Model | Date |
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | ✓ Link | 14.6 | Auto-AVSR | 2023-03-25 |
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | ✓ Link | 15.4 | USR | 2024-11-04 |
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | ✓ Link | 16.5 | SyncVSR | 2024-06-18 |
Jointly Learning Visual and Auditory Speech Representations from Raw Data | ✓ Link | 18.6 | RAVEn Large | 2022-12-12 |
Sub-word Level Lip Reading With Visual Attention | | 22.6 | VTP (more data) | 2021-10-14 |
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | | 24.6 | ES³ Large + extLM | 2024-01-01 |
Visual Speech Recognition for Multiple Languages in the Wild | ✓ Link | 25.5 | CTC/Attention (LRW+LRS2/3+AVSpeech) | 2022-02-26 |
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | | 26.7 | ES³ Large | 2024-01-01 |
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | | 28.7 | ES³ Base + extLM | 2024-01-01 |
Sub-word Level Lip Reading With Visual Attention | | 28.9 | VTP | 2021-10-14 |
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | ✓ Link | 28.9 | SyncVSR | 2024-06-18 |
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | | 29.3 | ES³ Base* + extLM | 2024-01-01 |
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | | 30.7 | ES³ Base | 2024-01-01 |
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | | 31.4 | ES³ Base* | 2024-01-01 |
Visual Speech Recognition for Multiple Languages in the Wild | ✓ Link | 32.9 | CTC/Attention | 2022-02-26 |
End-to-end Audio-visual Speech Recognition with Conformers | ✓ Link | 39.1 | Hybrid CTC / Attention | 2021-02-12 |
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | ✓ Link | 43.2 | MoCo + wav2vec (w/o extLM) | 2022-02-24 |
Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading | ✓ Link | 44.5 | Multi-head Visual-Audio Memory | 2022-04-04 |
Deep Audio-Visual Speech Recognition | ✓ Link | 48.3 | TM-seq2seq + extLM | 2018-09-06 |
Audio-visual Recognition of Overlapped speech for the LRS2 dataset | | 48.86 | LF-MMI TDNN | 2020-01-06 |
Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture | | 50 | Hybrid CTC / Attention | 2018-09-28 |
Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | | 51.7 | Conv-seq2seq | 2019-10-01 |
ASR is all you need: cross-modal distillation for lip reading | | 53.2 | CTC + KD ASR | 2019-11-28 |
Deep Audio-Visual Speech Recognition | ✓ Link | 54.7 | TM-CTC + extLM | 2018-09-06 |
Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers | ✓ Link | 65.29 | LIBS | 2019-11-26 |