Paper | Code | Word Error Rate (WER) | Model Name | Release Date |
---|---|---|---|---|
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | ✓ Link | 0.68 | Whisper | 2024-06-14 |
Large Language Models are Strong Audio-Visual Speech Recognition Learners | ✓ Link | 0.81 | Llama-AVSR | 2024-09-18 |
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | ✓ Link | 1.3 | AV-HuBERT Large | 2022-01-05 |
Jointly Learning Visual and Auditory Speech Representations from Raw Data | ✓ Link | 1.4 | RAVEn Large | 2022-12-12 |