Paper | Code | BLEU | SacreBLEU | Hardware Burden | Operations per Network Pass | Model | Date |
Very Deep Transformers for Neural Machine Translation | ✓ Link | 46.4 | 44.4 | | | Transformer+BT (ADMIN init) | 2020-08-18 |
Understanding Back-Translation at Scale | ✓ Link | 45.6 | 43.8 | 180G | | Noisy back-translation | 2018-08-28 |
Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information | ✓ Link | 44.3 | 41.7 | | | mRASP+Fine-Tune | 2020-10-07 |
R-Drop: Regularized Dropout for Neural Networks | ✓ Link | 43.95 | | | | Transformer + R-Drop | 2021-06-28 |
Very Deep Transformers for Neural Machine Translation | ✓ Link | 43.8 | 41.8 | | | Transformer (ADMIN init) | 2020-08-18 |
Understanding the Difficulty of Training Transformers | ✓ Link | 43.8 | | | | Admin | 2020-04-17 |
Incorporating BERT into Neural Machine Translation | ✓ Link | 43.78 | | | | BERT-fused NMT | 2020-02-17 |
MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning | ✓ Link | 43.5 | | | | MUSE (Parallel Multi-Scale Attention) | 2019-11-17 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 43.4 | | | | T5 | 2019-10-23 |
Joint Source-Target Self Attention with Locality Constraints | ✓ Link | 43.3 | | | | Local Joint Self-attention | 2019-05-16 |
Depth Growing for Neural Machine Translation | ✓ Link | 43.27 | | 24G | | Depth Growing | 2019-07-03 |
Scaling Neural Machine Translation | ✓ Link | 43.2 | | 55G | | Transformer Big | 2018-06-01 |
Pay Less Attention with Lightweight and Dynamic Convolutions | ✓ Link | 43.2 | | | | DynamicConv | 2019-01-29 |
Time-aware Large Kernel Convolutions | ✓ Link | 43.2 | | | | TaLK Convolutions | 2020-02-08 |
Pay Less Attention with Lightweight and Dynamic Convolutions | ✓ Link | 43.1 | | | | LightConv | 2019-01-29 |
Learning to Encode Position for Transformer with Continuous Dynamical Model | ✓ Link | 42.7 | | | | FLOATER-large | 2020-03-13 |
OmniNet: Omnidirectional Representations from Transformers | ✓ Link | 42.6 | | | | OmniNetP | 2021-03-01 |
Fast and Simple Mixture of Softmaxes with BPE and Hybrid-LightRNN for Language Generation | ✓ Link | 42.1 | | | | Transformer Big + MoS | 2018-09-25 |
Finetuning Pretrained Transformers into RNNs | ✓ Link | 42.1 | | | | T2R + Pretrain | 2021-03-24 |
Synthesizer: Rethinking Self-Attention in Transformer Models | ✓ Link | 41.85 | | | | Synthesizer (Random + Vanilla) | 2020-05-02 |
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing | ✓ Link | 41.8 | | | | Hardware Aware Transformer | 2020-05-28 |
Self-Attention with Relative Position Representations | ✓ Link | 41.5 | | | | Transformer (big) + Relative Position Representations | 2018-03-06 |
Deliberation Networks: Sequence Generation Beyond One-Pass Decoding | | 41.5 | | | | Stack 4-layer RNNSearch + Dual Learning + Deliberation Network | 2017-12-01 |
Weighted Transformer Network for Machine Translation | ✓ Link | 41.4 | | | | Weighted Transformer (large) | 2017-11-06 |
Convolutional Sequence to Sequence Learning | ✓ Link | 41.3 | | | | ConvS2S (ensemble) | 2017-05-08 |
The Evolved Transformer | ✓ Link | 41.3 | | | | Evolved Transformer Big | 2019-01-30 |
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation | ✓ Link | 41.0 | | 132G | 2.81G | RNMT+ | 2018-04-26 |
Attention Is All You Need | ✓ Link | 41.0 | | 23G | 2.3G | Transformer Big | 2017-06-12 |
The Evolved Transformer | ✓ Link | 40.6 | | | | Evolved Transformer Base | 2019-01-30 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 40.6 | | | | ResMLP-12 | 2021-05-07 |
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ✓ Link | 40.56 | | 142G | | MoE | 2017-01-23 |
Memory-Efficient Adaptive Optimization | ✓ Link | 40.5 | | | | Transformer | 2019-01-30 |
Convolutional Sequence to Sequence Learning | ✓ Link | 40.46 | | 143G | | ConvS2S | 2017-05-08 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 40.3 | | | | ResMLP-6 | 2021-05-07 |
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks | ✓ Link | 40 | | | | TransformerBase + AutoDropout | 2021-01-05 |
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation | ✓ Link | 39.9 | | 279G | | GNMT+RL | 2016-09-26 |
Lite Transformer with Long-Short Range Attention | ✓ Link | 39.6 | | | | Lite Transformer | 2020-04-24 |
Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation | ✓ Link | 39.2 | | 119G | | Deep-Att + PosUnk | 2016-06-14 |
Random Feature Attention | | 39.2 | | | | RFA-Gate-arccos | 2021-03-03 |
Attention Is All You Need | ✓ Link | 38.1 | | 23G | 0.33G | Transformer Base | 2017-06-12 |
Addressing the Rare Word Problem in Neural Machine Translation | ✓ Link | 37.5 | | | | LSTM6 + PosUnk | 2014-10-30 |
| | 37 | | | | PBMT | |
Sequence to Sequence Learning with Neural Networks | ✓ Link | 36.5 | | | | SMT+LSTM5 | 2014-09-10 |
Neural Machine Translation by Jointly Learning to Align and Translate | ✓ Link | 36.2 | | | | RNN-search50* | 2014-09-01 |
Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation | ✓ Link | 35.9 | | | | Deep-Att | 2016-06-14 |
A Convolutional Encoder Model for Neural Machine Translation | ✓ Link | 35.7 | | | | Deep Convolutional Encoder; single-layer decoder | 2016-11-07 |
Sequence to Sequence Learning with Neural Networks | ✓ Link | 34.8 | | | | LSTM | 2014-09-10 |
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation | ✓ Link | 34.54 | | | | CSLM + RNN + WP | 2014-06-03 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 33.9 | | | | FLAN 137B (zero-shot) | 2021-09-03 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 33.8 | | | | FLAN 137B (few-shot, k=9) | 2021-09-03 |
Recurrent Neural Network Regularization | ✓ Link | 29.03 | | | | Regularized LSTM | 2014-09-08 |
Phrase-Based & Neural Unsupervised Machine Translation | ✓ Link | 28.11 | | | | Unsupervised PBSMT | 2018-04-20 |
Phrase-Based & Neural Unsupervised Machine Translation | ✓ Link | 27.6 | | | | PBSMT + NMT | 2018-04-20 |
Can Active Memory Replace Attention? | ✓ Link | 26.4 | | | | GRU+Attention | 2016-10-27 |
Unsupervised Statistical Machine Translation | ✓ Link | 26.22 | | | | SMT + iterative backtranslation (unsupervised) | 2018-09-04 |
Phrase-Based & Neural Unsupervised Machine Translation | ✓ Link | 25.14 | | | | Unsupervised NMT + Transformer | 2018-04-20 |
Unsupervised Neural Machine Translation | ✓ Link | 14.36 | | | | Unsupervised attentional encoder-decoder + BPE | 2017-10-30 |
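
The BLEU and SacreBLEU columns above are reported by each paper under its own evaluation settings (test set, tokenization, casing), so scores are not always directly comparable. As a point of reference only, the sketch below shows how a corpus-level SacreBLEU score is typically computed with the `sacrebleu` library; the toy hypothesis and reference sentences are made up for illustration and do not reproduce any number in the table.

```python
# Minimal sketch: corpus-level SacreBLEU with the `sacrebleu` library.
# The toy sentences below are illustrative only; the papers above evaluate
# on the full WMT'14 English-French newstest2014 set, with settings that
# vary from paper to paper.
import sacrebleu

hypotheses = [
    "the cat sat on the mat .",
    "there is a book on the table .",
]
references = [
    "the cat sat on the mat .",
    "a book is on the table .",
]

# sacrebleu takes a list of hypotheses and a list of reference streams
# (here a single reference per sentence).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```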