Kimi-Audio Technical Report | ✓ Link | 2.42 | Kimi-Audio | 2025-04-25 |
Step-Audio 2 Technical Report | | 2.42 | Step-Audio 2 | 2025-08-27 |
Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models | | 2.48 | SAMBA ASR | 2025-01-06 |
FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | ✓ Link | 2.49 | FAdam | 2024-05-21 |
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training | ✓ Link | 2.5 | w2v-BERT XXL | 2021-08-07 |
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition | ✓ Link | 2.6 | Conformer + Wav2vec 2.0 + SpecAugment-based Noisy Student Training with Libri-Light | 2020-10-20 |
Step-Audio 2 Technical Report | ✓ Link | 2.86 | Step-Audio 2 mini | 2025-08-27 |
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | ✓ Link | 2.9 | HuBERT with Libri-Light | 2021-06-14 |
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | ✓ Link | 3.0 | wav2vec 2.0 with Libri-Light | 2020-06-20 |
Self-training and Pre-training are Complementary for Speech Recognition | ✓ Link | 3.1 | Conv + Transformer + wav2vec2.0 + pseudo labeling | 2020-10-22 |
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | ✓ Link | 3.2 | WavLM Large | 2021-10-26 |
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network | | 3.3 | SpeechStew (1B) | 2021-04-05 |
Improved Noisy Student Training for Automatic Speech Recognition | ✓ Link | 3.4 | ContextNet + SpecAugment-based Noisy Student Training with Libri-Light | 2020-05-19 |
E-Branchformer: Branchformer with Enhanced merging for speech recognition | ✓ Link | 3.65 | E-Branchformer (L) + Internal Language Model Estimation | 2022-09-30 |
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | ✓ Link | 3.7 | data2vec | 2022-02-07 |
Iterative Pseudo-Labeling for Speech Recognition | ✓ Link | 3.83 | Conv + Transformer AM + Iterative Pseudo-Labeling (n-gram LM + Transformer Rescoring) | 2020-05-19 |
Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ Link | 3.9 | Conformer(L) | 2020-05-16 |
CR-CTC: Consistency regularization on CTC for improved speech recognition | ✓ Link | 3.95 | Zipformer+pruned transducer w/ CR-CTC (no external language model) | 2024-10-07 |
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network | | 4.0 | SpeechStew (100M) | 2021-04-05 |
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | ✓ Link | 4.1 | wav2vec 2.0 | 2020-06-20 |
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ Link | 4.1 | ContextNet(L) | 2020-05-07 |
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | ✓ Link | 4.11 | Conv + Transformer AM (ConvLM with Transformer Rescoring) | 2019-11-19 |
Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces | | 4.20 | CTC + Transformer LM rescoring | 2020-05-19 |
Improving RNN Transducer Based ASR with Auxiliary Tasks | ✓ Link | 4.20 | Transformer Transducer | 2020-11-05 |
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | ✓ Link | 4.2 | Qwen-Audio | 2023-11-14 |
Step-Audio 2 Technical Report | | 4.23 | GPT-4o Transcribe | 2025-08-27 |
Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ Link | 4.3 | Conformer(M) | 2020-05-16 |
CR-CTC: Consistency regularization on CTC for improved speech recognition | ✓ Link | 4.35 | Zipformer+CR-CTC (no external language model) | 2024-10-07 |
Zipformer: A faster and better encoder for automatic speech recognition | ✓ Link | 4.38 | Zipformer+pruned transducer (no external language model) | 2023-10-17 |
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition | | 4.46 | Multistream CNN with Self-Attentive SRU | 2020-05-21 |
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ Link | 4.5 | ContextNet(M) | 2020-05-07 |
Transformer-based Acoustic Modeling for Hybrid Speech Recognition | | 4.85 | hybrid + Transformer LM rescoring | 2019-10-22 |
Graph Convolutions Enrich the Self-Attention in Transformers! | ✓ Link | 4.94 | Branchformer + GFSA | 2023-12-07 |
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation | ✓ Link | 5.0 | Hybrid model with Transformer rescoring | 2019-05-08 |
Conformer: Convolution-augmented Transformer for Speech Recognition | ✓ Link | 5.0 | Conformer(S) | 2020-05-16 |
Step-Audio 2 Technical Report | | 5.07 | Qwen Omni | 2025-08-27 |
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | ✓ Link | 5.18 | Conv + Transformer AM (ConvLM with Transformer Rescoring) (LS only) | 2019-11-19 |
Step-Audio 2 Technical Report | | 5.32 | Doubao LLM ASR | 2025-08-27 |
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | ✓ Link | 5.5 | ContextNet(S) | 2020-05-07 |
Librispeech Transducer Model with Internal Language Model Prior Correction | ✓ Link | 5.6 | LSTM Transducer | 2021-04-07 |
A Comparative Study on Transformer vs RNN in Speech Applications | ✓ Link | 5.7 | Transformer | 2019-09-13 |
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | ✓ Link | 5.8 | LAS + SpecAugment | 2019-04-18 |
State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions | ✓ Link | 5.80 | Multi-Stream Self-Attention With Dilated 1D Convolutions | 2019-10-01 |
Squeezeformer: An Efficient Transformer for Automatic Speech Recognition | ✓ Link | 5.97 | Squeezeformer (L) | 2022-06-02 |
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | ✓ Link | 6.5 | LAS (no LM) | 2019-04-18 |
Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition | ✓ Link | 6.85 | Conformer with Relaxed Attention | 2021-07-02 |
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions | ✓ Link | 7.25 | QuartzNet15x5 | 2019-10-22 |
Neural Network Language Modeling with Letter-based Features and Importance Sampling | | 7.63 | tdnn + chain + rnnlm rescoring | 2018-04-15 |
Jasper: An End-to-End Convolutional Neural Acoustic Model | ✓ Link | 7.84 | Jasper DR 10x5 (+ Time/Freq Masks) | 2019-04-05 |
Espresso: A Fast End-to-end Neural Speech Recognition Toolkit | ✓ Link | 8.7 | Espresso | 2019-09-18 |
Jasper: An End-to-End Convolutional Neural Acoustic Model | ✓ Link | 8.79 | Jasper DR 10x5 | 2019-04-05 |
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets | ✓ Link | 9.6 | MT4SSL | 2022-11-14 |
Fully Convolutional Speech Recognition | | 10.47 | Convolutional Speech Recognition | 2018-12-17 |
CRF-based Single-stage Acoustic Modeling with CTC Topology | ✓ Link | 10.65 | CTC-CRF 4gram-LM | 2019-04-16 |
 | | 12.5 | TDNN + pNorm + speed up/down speech | |
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | ✓ Link | 13.25 | Deep Speech 2 | 2015-12-08 |
Semi-Supervised Speech Recognition via Local Prior Matching | ✓ Link | 15.28 | Local Prior Matching (Large Model, ConvLM LM) | 2020-02-24 |
Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces | ✓ Link | 16.5 | Snips | 2018-05-25 |
Semi-Supervised Speech Recognition via Local Prior Matching | ✓ Link | 20.84 | Local Prior Matching (Large Model) | 2020-02-24 |