Speech Recognition on LibriSpeech test-other

Speech Recognition

Leaderboard

Paper	Code	Word Error Rate (WER, %)	Model Name	Release Date
Kimi-Audio Technical Report	✓ Link	2.42	Kimi-Audio	2025-04-25
Step-Audio 2 Technical Report		2.42	Step-Audio 2	2025-08-27
Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models		2.48	SAMBA ASR	2025-01-06
FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information	✓ Link	2.49	FAdam	2024-05-21
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training	✓ Link	2.5	w2v-BERT XXL	2021-08-07
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition	✓ Link	2.6	Conformer + Wav2vec 2.0 + SpecAugment-based Noisy Student Training with Libri-Light	2020-10-20
Step-Audio 2 Technical Report	✓ Link	2.86	Step-Audio 2 mini	2025-08-27
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units	✓ Link	2.9	HuBERT with Libri-Light	2021-06-14
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations	✓ Link	3.0	wav2vec 2.0 with Libri-Light	2020-06-20
Self-training and Pre-training are Complementary for Speech Recognition	✓ Link	3.1	Conv + Transformer + wav2vec2.0 + pseudo labeling	2020-10-22
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing	✓ Link	3.2	WavLM Large	2021-10-26
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network		3.3	SpeechStew (1B)	2021-04-05
Improved Noisy Student Training for Automatic Speech Recognition	✓ Link	3.4	ContextNet + SpecAugment-based Noisy Student Training with Libri-Light	2020-05-19
E-Branchformer: Branchformer with Enhanced merging for speech recognition	✓ Link	3.65	E-Branchformer (L) + Internal Language Model Estimation	2022-09-30
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language	✓ Link	3.7	data2vec	2022-02-07
Iterative Pseudo-Labeling for Speech Recognition	✓ Link	3.83	Conv + Transformer AM + Iterative Pseudo-Labeling (n-gram LM + Transformer Rescoring)	2020-05-19
Conformer: Convolution-augmented Transformer for Speech Recognition	✓ Link	3.9	Conformer(L)	2020-05-16
CR-CTC: Consistency regularization on CTC for improved speech recognition	✓ Link	3.95	Zipformer+pruned transducer w/ CR-CTC (no external language model)	2024-10-07
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network		4.0	SpeechStew (100M)	2021-04-05
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context	✓ Link	4.1	ContextNet(L)	2020-05-07
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations	✓ Link	4.1	wav2vec 2.0	2020-06-20
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures	✓ Link	4.11	Conv + Transformer AM (ConvLM with Transformer Rescoring)	2019-11-19
Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces		4.20	CTC + Transformer LM rescoring	2020-05-19
Improving RNN Transducer Based ASR with Auxiliary Tasks	✓ Link	4.20	Transformer Transducer	2020-11-05
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models	✓ Link	4.2	Qwen-Audio	2023-11-14
Step-Audio 2 Technical Report		4.23	GPT-4o Transcribe	2025-08-27
Conformer: Convolution-augmented Transformer for Speech Recognition	✓ Link	4.3	Conformer(M)	2020-05-16
CR-CTC: Consistency regularization on CTC for improved speech recognition	✓ Link	4.35	Zipformer+CR-CTC (no external language model)	2024-10-07
Zipformer: A faster and better encoder for automatic speech recognition	✓ Link	4.38	Zipformer+pruned transducer (no external language model)	2023-10-17
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition		4.46	Multistream CNN with Self-Attentive SRU	2020-05-21
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context	✓ Link	4.5	ContextNet(M)	2020-05-07
Transformer-based Acoustic Modeling for Hybrid Speech Recognition		4.85	hybrid + Transformer LM rescoring	2019-10-22
Graph Convolutions Enrich the Self-Attention in Transformers!	✓ Link	4.94	Branchformer + GFSA	2023-12-07
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation	✓ Link	5.0	Hybrid model with Transformer rescoring	2019-05-08
Conformer: Convolution-augmented Transformer for Speech Recognition	✓ Link	5.0	Conformer(S)	2020-05-16
Step-Audio 2 Technical Report		5.07	Qwen Omni	2025-08-27
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures	✓ Link	5.18	Conv + Transformer AM (ConvLM with Transformer Rescoring) (LS only)	2019-11-19
Step-Audio 2 Technical Report		5.32	Doubao LLM ASR	2025-08-27
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context	✓ Link	5.5	ContextNet(S)	2020-05-07
Librispeech Transducer Model with Internal Language Model Prior Correction	✓ Link	5.6	LSTM Transducer	2021-04-07
A Comparative Study on Transformer vs RNN in Speech Applications	✓ Link	5.7	Transformer	2019-09-13
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition	✓ Link	5.8	LAS + SpecAugment	2019-04-18
State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions	✓ Link	5.80	Multi-Stream Self-Attention With Dilated 1D Convolutions	2019-10-01
Squeezeformer: An Efficient Transformer for Automatic Speech Recognition	✓ Link	5.97	Squeezeformer (L)	2022-06-02
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition	✓ Link	6.5	LAS (no LM)	2019-04-18
Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition	✓ Link	6.85	Conformer with Relaxed Attention	2021-07-02
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions	✓ Link	7.25	QuartzNet15x5	2019-10-22
Neural Network Language Modeling with Letter-based Features and Importance Sampling		7.63	tdnn + chain + rnnlm rescoring	2018-04-15
Jasper: An End-to-End Convolutional Neural Acoustic Model	✓ Link	7.84	Jasper DR 10x5 (+ Time/Freq Masks)	2019-04-05
Espresso: A Fast End-to-end Neural Speech Recognition Toolkit	✓ Link	8.7	Espresso	2019-09-18
Jasper: An End-to-End Convolutional Neural Acoustic Model	✓ Link	8.79	Jasper DR 10x5	2019-04-05
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets	✓ Link	9.6	MT4SSL	2022-11-14
Fully Convolutional Speech Recognition		10.47	Convolutional Speech Recognition	2018-12-17
CRF-based Single-stage Acoustic Modeling with CTC Topology	✓ Link	10.65	CTC-CRF 4gram-LM	2019-04-16
		12.5	TDNN + pNorm + speed up/down speech
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin	✓ Link	13.25	Deep Speech 2	2015-12-08
Semi-Supervised Speech Recognition via Local Prior Matching	✓ Link	15.28	Local Prior Matching (Large Model, ConvLM LM)	2020-02-24
Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces	✓ Link	16.5	Snips	2018-05-25
Semi-Supervised Speech Recognition via Local Prior Matching	✓ Link	20.84	Local Prior Matching (Large Model)	2020-02-24
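
The WER column above is the word-level edit distance between a system transcript and the reference, divided by the number of reference words and expressed as a percentage. The following is a minimal sketch of that definition, not the scoring script used by any of the listed papers; the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (S + D + I) / N, where N is the reference word count."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Corpus-level figures are usually obtained by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance scores. Text normalization can also differ between papers, so small gaps between rows may not be strictly comparable.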

OpenCodePapers
