OpenCodePapers

Lipreading on LRS3-TED

Natural Language Transduction / Lipreading
Leaderboard
| Paper | Code | Word Error Rate (WER, %) | Model Name | Release Date |
|---|---|---|---|---|
| Conformers are All You Need for Visual Speech Recognition | - | 12.8 | LP + Conformer | 2023-02-17 |
| Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | ✓ | 19.1 | Auto-AVSR | 2023-03-25 |
| SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | ✓ | 21.5 | SyncVSR | 2024-06-18 |
| Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | ✓ | 21.5 | USR (self + semi-supervised) | 2024-11-04 |
| Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | ✓ | 22.3 | USR (self-supervised) | 2024-11-04 |
| Jointly Learning Visual and Auditory Speech Representations from Raw Data | ✓ | 23.4 | RAVEn Large | 2022-12-12 |
| Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | ✓ | 25.4 | VSP-LLM | 2024-02-23 |
| Relaxed Attention for Transformer Models | ✓ | 25.51 | AV-HuBERT Large + Relaxed Attention + LM | 2022-09-20 |
| Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | ✓ | 26.2 | DistillAV | 2025-02-09 |
| Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | ✓ | 26.9 | AV-HuBERT Large | 2022-01-05 |
| Sub-word Level Lip Reading With Visual Attention | - | 30.7 | VTP (more data) | 2021-10-14 |
| SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | ✓ | 31.2 | SyncVSR | 2024-06-18 |
| Visual Speech Recognition for Multiple Languages in the Wild | ✓ | 31.5 | CTC/Attention (LRW+LRS2/3+AVSpeech) | 2022-02-26 |
| Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | ✓ | 33.6 | RNN-T | 2019-11-08 |
| ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - | 37.1 | ES³ Large | 2024-01-01 |
| ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - | 40.3 | ES³ Base | 2024-01-01 |
| Sub-word Level Lip Reading With Visual Attention | - | 40.6 | VTP | 2021-10-14 |
| End-to-end Audio-visual Speech Recognition with Conformers | ✓ | 43.3 | Hyb + Conformer | 2021-02-12 |
| Large-Scale Visual Speech Recognition | - | 55.1 | CTC-V2P | 2018-07-13 |
| Discriminative Multi-modality Speech Recognition | ✓ | 57.8 | EG-seq2seq | 2020-05-12 |
| Deep Audio-Visual Speech Recognition | ✓ | 58.9 | TM-seq2seq | 2018-09-06 |
| ASR is all you need: cross-modal distillation for lip reading | - | 59.8 | CTC + KD | 2019-11-28 |
| Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | - | 60.1 | Conv-seq2seq | 2019-10-01 |
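
The leaderboard ranks models by word error rate (lower is better). The page does not say how the metric is computed, so the following is only a minimal sketch assuming the standard definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, reported as a percentage. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the listed papers may apply their own text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Assumption: words are obtained by plain whitespace splitting, with no
    case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution ("sat" -> "sit") and one
# deletion (the second "the") against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
wer_percent = 100 * wer  # leaderboard-style percentage
```

Under this definition WER can exceed 100% when the hypothesis contains many insertions, which is why it is treated as an error rate rather than an accuracy complement.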