Paper | Code | Test WER (%) | Model Name | Release Date |
---|---|---|---|---|
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | ✓ Link | 1.4 | Whisper-Flamingo | 2024-06-14 |
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | ✓ Link | 1.5 | CTC/Attention | 2023-03-25 |
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | ✓ Link | 2.6 | MoCo + wav2vec (w/o extLM) | 2022-02-24 |
End-to-end Audio-visual Speech Recognition with Conformers | ✓ Link | 3.7 | End2end Conformer | 2021-02-12 |
Audio-visual Recognition of Overlapped speech for the LRS2 dataset | | 5.9 | LF-MMI TDNN | 2020-01-06 |
Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture | | 7.0 | CTC/Attention | 2018-09-28 |
Deep Audio-Visual Speech Recognition | ✓ Link | 8.2 | TM-CTC | 2018-09-06 |
Deep Audio-Visual Speech Recognition | ✓ Link | 8.5 | TM-Seq2seq | 2018-09-06 |