OpenCodePapers

Speech Recognition on LibriSpeech test-clean

Speech Recognition
Leaderboard
| Paper | Code | Word Error Rate (WER, %) | Model Name | Release Date |
|---|---|---|---|---|
| High-precision medical speech recognition through synthetic data and semantic correction: UNITED-MEDASR | | 0.985 | United Med ASR | 2024-11-24 |
| Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models | | 1.17 | SAMBA ASR | 2025-01-06 |
| Step-Audio 2 Technical Report | | 1.17 | Step-Audio 2 | 2025-08-27 |
| Kimi-Audio Technical Report | ✓ | 1.28 | Kimi-Audio | 2025-04-25 |
| Step-Audio 2 Technical Report | ✓ | 1.33 | Step-Audio 2 mini | 2025-08-27 |
| FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | ✓ | 1.34 | FAdam | 2024-05-21 |
| Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition | ✓ | 1.4 | Conformer + Wav2vec 2.0 + SpecAugment-based Noisy Student Training with Libri-Light | 2020-10-20 |
| W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training | ✓ | 1.4 | w2v-BERT XXL | 2021-08-07 |
| Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition | | 1.46 | parakeet-rnnt-1.1b | 2023-05-08 |
| Self-training and Pre-training are Complementary for Speech Recognition | ✓ | 1.5 | Conv + Transformer + wav2vec2.0 + pseudo labeling | 2020-10-22 |
| Improved Noisy Student Training for Automatic Speech Recognition | ✓ | 1.7 | ContextNet + SpecAugment-based Noisy Student Training with Libri-Light | 2020-05-19 |
| SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network | | 1.7 | SpeechStew (1B) | 2021-04-05 |
| ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition | | 1.75 | Multistream CNN with Self-Attentive SRU (WER includes text normalization) | 2020-05-21 |
| Step-Audio 2 Technical Report | | 1.75 | GPT-4o Transcribe | 2025-08-27 |
| Multi-Head State Space Model for Speech Recognition | | 1.76 | Stateformer | 2023-05-21 |
| wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | ✓ | 1.8 | wav2vec 2.0 with Libri-Light | 2020-06-20 |
| HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | ✓ | 1.8 | HuBERT with Libri-Light | 2021-06-14 |
| WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | ✓ | 1.8 | WavLM Large | 2021-10-26 |
| E-Branchformer: Branchformer with Enhanced merging for speech recognition | ✓ | 1.81 | E-Branchformer (L) + Internal Language Model Estimation | 2022-09-30 |
| CR-CTC: Consistency regularization on CTC for improved speech recognition | ✓ | 1.88 | Zipformer+pruned transducer w/ CR-CTC (no external language model) | 2024-10-07 |
| ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ | 1.9 | ContextNet(L) | 2020-05-07 |
| Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ | 1.9 | Conformer(L) | 2020-05-16 |
| Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation | | 1.9 | Transformer+Time reduction+Self Knowledge distillation | 2021-03-17 |
| ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ | 2 | ContextNet(M) | 2020-05-07 |
| Improving RNN Transducer Based ASR with Auxiliary Tasks | ✓ | 2.0 | Transformer Transducer | 2020-11-05 |
| Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ | 2 | Conformer(M) | 2020-05-16 |
| SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network | | 2.0 | SpeechStew (100M) | 2021-04-05 |
| Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | ✓ | 2.0 | Qwen-Audio | 2023-11-14 |
| Zipformer: A faster and better encoder for automatic speech recognition | ✓ | 2.00 | Zipformer+pruned transducer (no external language model) | 2023-10-17 |
| CR-CTC: Consistency regularization on CTC for improved speech recognition | ✓ | 2.02 | Zipformer+CR-CTC (no external language model) | 2024-10-07 |
| End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | ✓ | 2.03 | Conv + Transformer AM + Pseudo-Labeling (ConvLM with Transformer Rescoring) | 2019-11-19 |
| Iterative Pseudo-Labeling for Speech Recognition | ✓ | 2.10 | Conv + Transformer AM + Iterative Pseudo-Labeling (n-gram LM + Transformer Rescoring) | 2020-05-19 |
| Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces | | 2.10 | CTC + Transformer LM rescoring | 2020-05-19 |
| Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ | 2.1 | Conformer(S) | 2020-05-16 |
| Graph Convolutions Enrich the Self-Attention in Transformers! | ✓ | 2.11 | Branchformer + GFSA | 2023-12-07 |
| State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions | ✓ | 2.20 | Multi-Stream Self-Attention With Dilated 1D Convolutions | 2019-10-01 |
| Librispeech Transducer Model with Internal Language Model Prior Correction | ✓ | 2.23 | LSTM Transducer | 2021-04-07 |
| Transformer-based Acoustic Modeling for Hybrid Speech Recognition | | 2.26 | Hybrid + Transformer LM rescoring | 2019-10-22 |
| RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation | ✓ | 2.3 | Hybrid model with Transformer rescoring | 2019-05-08 |
| ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ | 2.3 | ContextNet(S) | 2020-05-07 |
| End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | ✓ | 2.31 | Conv + Transformer AM (ConvLM with Transformer Rescoring) (LS only) | 2019-11-19 |
| Squeezeformer: An Efficient Transformer for Automatic Speech Recognition | ✓ | 2.47 | Squeezeformer (L) | 2022-06-02 |
| SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | ✓ | 2.5 | LAS + SpecAugment | 2019-04-18 |
| A Comparative Study on Transformer vs RNN in Speech Applications | ✓ | 2.6 | Transformer | 2019-09-13 |
| QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions | ✓ | 2.69 | QuartzNet15x5 | 2019-10-22 |
| SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | ✓ | 2.7 | LAS (no LM) | 2019-04-18 |
| Self-training and Pre-training are Complementary for Speech Recognition | ✓ | 2.7 | wav2vec_wav2letter | 2020-10-22 |
| Espresso: A Fast End-to-end Neural Speech Recognition Toolkit | ✓ | 2.8 | Espresso | 2019-09-18 |
| Jasper: An End-to-End Convolutional Neural Acoustic Model | ✓ | 2.84 | Jasper DR 10x5 (+ Time/Freq Masks) | 2019-04-05 |
| Step-Audio 2 Technical Report | | 2.92 | Doubao LLM ASR | 2025-08-27 |
| Step-Audio 2 Technical Report | | 2.93 | Qwen Omni | 2025-08-27 |
| Jasper: An End-to-End Convolutional Neural Acoustic Model | ✓ | 2.95 | Jasper DR 10x5 | 2019-04-05 |
| Neural Network Language Modeling with Letter-based Features and Importance Sampling | | 3.06 | tdnn + chain + rnnlm rescoring | 2018-04-15 |
| Fully Convolutional Speech Recognition | | 3.26 | Convolutional Speech Recognition | 2018-12-17 |
| MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets | ✓ | 3.4 | MT4SSL | 2022-11-14 |
| On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition | ✓ | 3.60 | Model Unit Exploration | 2019-02-05 |
| Improved training of end-to-end attention models for speech recognition | ✓ | 3.82 | Seq-to-seq attention | 2018-05-08 |
| CRF-based Single-stage Acoustic Modeling with CTC Topology | ✓ | 4.09 | CTC-CRF 4gram-LM | 2019-04-16 |
| | | 4.3 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations | |
| Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions | | 4.4 | Centaurus (30 M) | 2025-01-22 |
| | | 4.8 | HMM-TDNN + iVectors | |
| Letter-Based Speech Recognition with Gated ConvNets | ✓ | 4.8 | Gated ConvNets | 2017-12-22 |
| Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | ✓ | 5.33 | Deep Speech 2 | 2015-12-08 |
| Improving End-to-End Speech Recognition with Policy Learning | | 5.42 | CTC + policy learning | 2017-12-19 |
| | | 5.5 | HMM-DNN + pNorm* | |
| The PyTorch-Kaldi Speech Recognition Toolkit | ✓ | 6.2 | Li-GRU | 2018-11-19 |
| Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces | ✓ | 6.4 | Snips | 2018-05-25 |
| Semi-Supervised Speech Recognition via Local Prior Matching | ✓ | 7.19 | Local Prior Matching (Large Model) | 2020-02-24 |
| | | 8.0 | HMM-(SAT)GMM | |
| Amortized Neural Networks for Low-Latency Speech Recognition | | 8.6 | AmNet | 2021-08-03 |
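Every system on this leaderboard is ranked by word error rate (WER). WER is the word-level edit distance between a system's transcript and the reference, (substitutions + deletions + insertions) / number of reference words, usually reported as a percentage. The sketch below illustrates that definition. The function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization. Published scores may apply normalization; one entry above explicitly notes that its WER includes it. For that reason, this sketch is not guaranteed to reproduce any listed number.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length * 100.

    Hypothetical helper for illustration. Assumes whitespace tokenization and
    applies no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Rolling row of the Levenshtein DP table: before processing reference
    # word i, prev[j] is the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost == 0
            )
        prev = curr

    return 100.0 * prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion ("the") over 6 reference words.
    print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.2f}")  # 33.33
```

Because insertions are counted against the reference length, WER is not capped at 100%. For example, a one-word reference transcribed as three wrong words scores 300%.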