Paper | Code | Test WER (%) | Model Name | Release Date |
---|---|---|---|---|
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | ✓ Link | 1.3 | Whisper | 2024-06-14 |
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | ✓ Link | 1.5 | CTC/Attention | 2023-03-25 |
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | ✓ Link | 2.7 | MoCo + wav2vec (w/o extLM) | 2022-02-24 |
End-to-end Audio-visual Speech Recognition with Conformers | ✓ Link | 3.9 | End2end Conformer | 2021-02-12 |
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition | ✓ Link | 6.6 | Whisper-LLaMA | 2023-10-10 |
Audio-visual Recognition of Overlapped speech for the LRS2 dataset | | 6.7 | LF-MMI TDNN | 2020-01-06 |
Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture | | 8.2 | CTC/attention | 2018-09-28 |
Deep Audio-Visual Speech Recognition | ✓ Link | 9.7 | TM-seq2seq | 2018-09-06 |
Deep Audio-Visual Speech Recognition | ✓ Link | 10.1 | TM-CTC | 2018-09-06 |