Paper | Code | WER (%) | Model | Date
--- | --- | --- | --- | ---
High-precision medical speech recognition through synthetic data and semantic correction: UNITED-MEDASR | | 0.985 | United Med ASR | 2024-11-24 |
Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models | | 1.17 | SAMBA ASR | 2025-01-06 |
Step-Audio 2 Technical Report | | 1.17 | Step-Audio 2 | 2025-08-27 |
Kimi-Audio Technical Report | ✓ Link | 1.28 | Kimi-Audio | 2025-04-25 |
Step-Audio 2 Technical Report | ✓ Link | 1.33 | Step-Audio 2 mini | 2025-08-27 |
FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | ✓ Link | 1.34 | FAdam | 2024-05-21 |
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition | ✓ Link | 1.4 | Conformer + Wav2vec 2.0 + SpecAugment-based Noisy Student Training with Libri-Light | 2020-10-20 |
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training | ✓ Link | 1.4 | w2v-BERT XXL | 2021-08-07 |
Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition | | 1.46 | parakeet-rnnt-1.1b | 2023-05-08 |
Self-training and Pre-training are Complementary for Speech Recognition | ✓ Link | 1.5 | Conv + Transformer + wav2vec2.0 + pseudo labeling | 2020-10-22 |
Improved Noisy Student Training for Automatic Speech Recognition | ✓ Link | 1.7 | ContextNet + SpecAugment-based Noisy Student Training with Libri-Light | 2020-05-19 |
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network | | 1.7 | SpeechStew (1B) | 2021-04-05 |
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition | | 1.75 | Multistream CNN with Self-Attentive SRU (WER includes text normalization) | 2020-05-21 |
Step-Audio 2 Technical Report | | 1.75 | GPT-4o Transcribe | 2025-08-27 |
Multi-Head State Space Model for Speech Recognition | | 1.76 | Stateformer | 2023-05-21 |
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | ✓ Link | 1.8 | wav2vec 2.0 with Libri-Light | 2020-06-20 |
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | ✓ Link | 1.8 | HuBERT with Libri-Light | 2021-06-14 |
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | ✓ Link | 1.8 | WavLM Large | 2021-10-26 |
E-Branchformer: Branchformer with Enhanced merging for speech recognition | ✓ Link | 1.81 | E-Branchformer (L) + Internal Language Model Estimation | 2022-09-30 |
CR-CTC: Consistency regularization on CTC for improved speech recognition | ✓ Link | 1.88 | Zipformer+pruned transducer w/ CR-CTC (no external language model) | 2024-10-07 |
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ Link | 1.9 | ContextNet(L) | 2020-05-07 |
Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ Link | 1.9 | Conformer(L) | 2020-05-16 |
Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation | | 1.9 | Transformer+Time reduction+Self Knowledge distillation | 2021-03-17 |
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ Link | 2 | ContextNet(M) | 2020-05-07 |
Improving RNN Transducer Based ASR with Auxiliary Tasks | ✓ Link | 2.0 | Transformer Transducer | 2020-11-05 |
Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ Link | 2 | Conformer(M) | 2020-05-16 |
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network | | 2.0 | SpeechStew (100M) | 2021-04-05 |
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | ✓ Link | 2.0 | Qwen-Audio | 2023-11-14 |
Zipformer: A faster and better encoder for automatic speech recognition | ✓ Link | 2.00 | Zipformer+pruned transducer (no external language model) | 2023-10-17 |
CR-CTC: Consistency regularization on CTC for improved speech recognition | ✓ Link | 2.02 | Zipformer+CR-CTC (no external language model) | 2024-10-07 |
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | ✓ Link | 2.03 | Conv + Transformer AM + Pseudo-Labeling (ConvLM with Transformer Rescoring) | 2019-11-19 |
Iterative Pseudo-Labeling for Speech Recognition | ✓ Link | 2.10 | Conv + Transformer AM + Iterative Pseudo-Labeling (n-gram LM + Transformer Rescoring) | 2020-05-19 |
Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces | | 2.10 | CTC + Transformer LM rescoring | 2020-05-19 |
Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ Link | 2.1 | Conformer(S) | 2020-05-16 |
Graph Convolutions Enrich the Self-Attention in Transformers! | ✓ Link | 2.11 | Branchformer + GFSA | 2023-12-07 |
State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions | ✓ Link | 2.20 | Multi-Stream Self-Attention With Dilated 1D Convolutions | 2019-10-01 |
Librispeech Transducer Model with Internal Language Model Prior Correction | ✓ Link | 2.23 | LSTM Transducer | 2021-04-07 |
Transformer-based Acoustic Modeling for Hybrid Speech Recognition | | 2.26 | Hybrid + Transformer LM rescoring | 2019-10-22 |
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation | ✓ Link | 2.3 | Hybrid model with Transformer rescoring | 2019-05-08 |
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ Link | 2.3 | ContextNet(S) | 2020-05-07 |
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | ✓ Link | 2.31 | Conv + Transformer AM (ConvLM with Transformer Rescoring) (LS only) | 2019-11-19 |
Squeezeformer: An Efficient Transformer for Automatic Speech Recognition | ✓ Link | 2.47 | Squeezeformer (L) | 2022-06-02 |
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | ✓ Link | 2.5 | LAS + SpecAugment | 2019-04-18 |
A Comparative Study on Transformer vs RNN in Speech Applications | ✓ Link | 2.6 | Transformer | 2019-09-13 |
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions | ✓ Link | 2.69 | QuartzNet15x5 | 2019-10-22 |
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | ✓ Link | 2.7 | LAS (no LM) | 2019-04-18 |
Self-training and Pre-training are Complementary for Speech Recognition | ✓ Link | 2.7 | wav2vec_wav2letter | 2020-10-22 |
Espresso: A Fast End-to-end Neural Speech Recognition Toolkit | ✓ Link | 2.8 | Espresso | 2019-09-18 |
Jasper: An End-to-End Convolutional Neural Acoustic Model | ✓ Link | 2.84 | Jasper DR 10x5 (+ Time/Freq Masks) | 2019-04-05 |
Step-Audio 2 Technical Report | | 2.92 | Doubao LLM ASR | 2025-08-27 |
Step-Audio 2 Technical Report | | 2.93 | Qwen Omni | 2025-08-27 |
Jasper: An End-to-End Convolutional Neural Acoustic Model | ✓ Link | 2.95 | Jasper DR 10x5 | 2019-04-05 |
Neural Network Language Modeling with Letter-based Features and Importance Sampling | | 3.06 | tdnn + chain + rnnlm rescoring | 2018-04-15 |
Fully Convolutional Speech Recognition | | 3.26 | Convolutional Speech Recognition | 2018-12-17 |
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets | ✓ Link | 3.4 | MT4SSL | 2022-11-14 |
On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition | ✓ Link | 3.60 | Model Unit Exploration | 2019-02-05 |
Improved training of end-to-end attention models for speech recognition | ✓ Link | 3.82 | Seq-to-seq attention | 2018-05-08 |
CRF-based Single-stage Acoustic Modeling with CTC Topology | ✓ Link | 4.09 | CTC-CRF 4gram-LM | 2019-04-16 |
(no paper linked) | | 4.3 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations | |
Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions | | 4.4 | Centaurus (30 M) | 2025-01-22 |
(no paper linked) | | 4.8 | HMM-TDNN + iVectors | |
Letter-Based Speech Recognition with Gated ConvNets | ✓ Link | 4.8 | Gated ConvNets | 2017-12-22 |
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | ✓ Link | 5.33 | Deep Speech 2 | 2015-12-08 |
Improving End-to-End Speech Recognition with Policy Learning | | 5.42 | CTC + policy learning | 2017-12-19 |
(no paper linked) | | 5.5 | HMM-DNN + pNorm* | |
The PyTorch-Kaldi Speech Recognition Toolkit | ✓ Link | 6.2 | Li-GRU | 2018-11-19 |
Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces | ✓ Link | 6.4 | Snips | 2018-05-25 |
Semi-Supervised Speech Recognition via Local Prior Matching | ✓ Link | 7.19 | Local Prior Matching (Large Model) | 2020-02-24 |
(no paper linked) | | 8.0 | HMM-(SAT)GMM | |
Amortized Neural Networks for Low-Latency Speech Recognition | | 8.6 | AmNet | 2021-08-03 |
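
The WER column above reports word error rate as a percentage (lower is better). For reference, the sketch below shows the standard way WER is computed: word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. The function name and example transcripts are illustrative only and are not taken from any of the cited papers; note that published numbers may additionally depend on text normalization of the transcripts, as flagged for the ASAPP-ASR entry.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit (Levenshtein) distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[-1][-1] / max(len(ref), 1)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat on the hat"
    # 1 deletion ("sat") + 1 substitution ("mat" -> "hat") over 6 reference words ≈ 33.3% WER
    print(f"WER = {100 * word_error_rate(ref, hyp):.2f}%")
```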