Paper | Code | BLEU score | SacreBLEU | Params | Hardware Burden | Operations per network pass | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
Lessons on Parameter Sharing across Layers in Transformers | ✓ Link | 35.14 | 33.54 | | | | Transformer Cycle (Rev) | 2021-04-13 |
Understanding Back-Translation at Scale | ✓ Link | 35.0 | 33.8 | | 146G | | Noisy back-translation | 2018-08-28 |
Rethinking Perturbations in Encoder-Decoders for Fast Training | ✓ Link | 33.89 | 32.35 | | | | Transformer+Rep(Uni) | 2021-04-05 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 32.1 | | 11110M | | | T5-11B | 2019-10-23 |
BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation | ✓ Link | 31.26 | | | | | BiBERT | 2021-09-09 |
R-Drop: Regularized Dropout for Neural Networks | ✓ Link | 30.91 | | | 49G | | Transformer + R-Drop | 2021-06-28 |
Bi-SimCut: A Simple Strategy for Boosting Neural Machine Translation | ✓ Link | 30.78 | | | | | Bi-SimCut | 2022-06-06 |
Incorporating BERT into Neural Machine Translation | ✓ Link | 30.75 | | | | | BERT-fused NMT | 2020-02-17 |
Data Diversification: A Simple Strategy For Neural Machine Translation | ✓ Link | 30.7 | | | | | Data Diversification - Transformer | 2019-11-05 |
Bi-SimCut: A Simple Strategy for Boosting Neural Machine Translation | ✓ Link | 30.56 | | | | | SimCut | 2022-06-06 |
Mask Attention Networks: Rethinking and Strengthen Transformer | ✓ Link | 30.4 | | 215M | | | Mask Attention Network (big) | 2021-03-25 |
Very Deep Transformers for Neural Machine Translation | ✓ Link | 30.1 | 29.5 | 256M | | | Transformer (ADMIN init) | 2020-08-18 |
PowerNorm: Rethinking Batch Normalization in Transformers | ✓ Link | 30.1 | | | | | PowerNorm (Transformer) | 2020-03-17 |
Depth Growing for Neural Machine Translation | ✓ Link | 30.07 | | | 24G | | Depth Growing | 2019-07-03 |
MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning | ✓ Link | 29.9 | | | | | MUSE(Parallel Multi-scale Attention) | 2019-11-17 |
The Evolved Transformer | ✓ Link | 29.8 | 29.2 | 218M | | | Evolved Transformer Big | 2019-01-30 |
OmniNet: Omnidirectional Representations from Transformers | ✓ Link | 29.8 | | | | | OmniNetP | 2021-03-01 |
Pay Less Attention with Lightweight and Dynamic Convolutions | ✓ Link | 29.7 | | 213M | | | DynamicConv | 2019-01-29 |
Joint Source-Target Self Attention with Locality Constraints | ✓ Link | 29.7 | | | | | Local Joint Self-attention | 2019-05-16 |
Time-aware Large Kernel Convolutions | ✓ Link | 29.6 | | 209M | | | TaLK Convolutions | 2020-02-08 |
Fast and Simple Mixture of Softmaxes with BPE and Hybrid-LightRNN for Language Generation | ✓ Link | 29.6 | | | | | Transformer Big + MoS | 2018-09-25 |
AdvAug: Robust Adversarial Augmentation for Neural Machine Translation | | 29.57 | | | | | AdvAug (aut+adv) | 2020-06-21 |
PartialFormer: Modeling Part Instead of Whole for Machine Translation | ✓ Link | 29.56 | | 68M | | | PartialFormer | 2023-10-23 |
Improving Neural Language Modeling via Adversarial Training | ✓ Link | 29.52 | | | | | Transformer Big + adversarial MLE | 2019-06-10 |
Scaling Neural Machine Translation | ✓ Link | 29.3 | | 210M | 9G | | Transformer Big | 2018-06-01 |
Subformer: A Parameter Reduced Transformer | | 29.3 | | | | | Subformer-xlarge | 2021-01-01 |
Synchronous Bidirectional Neural Machine Translation | ✓ Link | 29.21 | | | | | SB-NMT | 2019-05-13 |
Self-Attention with Relative Position Representations | ✓ Link | 29.2 | | | | | Transformer (big) + Relative Position Representations | 2018-03-06 |
Learning to Encode Position for Transformer with Continuous Dynamical Model | ✓ Link | 29.2 | | | | | FLOATER-large | 2020-03-13 |
Modeling Localness for Self-Attention Networks | | 29.2 | | | | | Local Transformer | 2018-10-24 |
FRAGE: Frequency-Agnostic Word Representation | ✓ Link | 29.11 | | | | | Transformer Big with FRAGE | 2018-09-18 |
Mask Attention Networks: Rethinking and Strengthen Transformer | ✓ Link | 29.1 | | 63M | | | Mask Attention Network (base) | 2021-03-25 |
Mega: Moving Average Equipped Gated Attention | ✓ Link | 29.01 | 27.96 | 67M | | | Mega | 2022-09-21 |
Neural Machine Translation with Adequacy-Oriented Learning | | 28.99 | | | | | adequacy-oriented NMT | 2018-11-21 |
Pay Less Attention with Lightweight and Dynamic Convolutions | ✓ Link | 28.9 | | 202M | | | LightConv | 2019-01-29 |
Weighted Transformer Network for Machine Translation | ✓ Link | 28.9 | | | | | Weighted Transformer (large) | 2017-11-06 |
Universal Transformers | ✓ Link | 28.9 | | | | | universal transformer base | 2018-07-10 |
KERMIT: Generative Insertion-Based Modeling for Sequences | | 28.7 | | | | | KERMIT | 2019-06-04 |
Finetuning Pretrained Transformers into RNNs | ✓ Link | 28.7 | | | | | T2R + Pretrain | 2021-03-24 |
AdvAug: Robust Adversarial Augmentation for Neural Machine Translation | | 28.58 | | | | | AdvAug (aut) | 2020-06-21 |
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation | ✓ Link | 28.5 | | | 44G | 2.81G | RNMT+ | 2018-04-26 |
Synthesizer: Rethinking Self-Attention in Transformer Models | ✓ Link | 28.47 | | | | | Synthesizer (Random + Vanilla) | 2020-05-02 |
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing | ✓ Link | 28.4 | | 48M | | | Hardware Aware Transformer | 2020-05-28 |
Attention Is All You Need | ✓ Link | 28.4 | | | 871G | 2300000000.0G | Transformer Big | 2017-06-12 |
Simple Recurrent Units for Highly Parallelizable Recurrence | ✓ Link | 28.4 | | | 34G | | Transformer + SRU | 2017-09-08 |
The Evolved Transformer | ✓ Link | 28.4 | | | 2488G | | Evolved Transformer Base | 2019-01-30 |
Random Feature Attention | | 28.2 | | | | | Rfa-Gate-arccos | 2021-03-03 |
Deep Residual Output Layers for Neural Language Generation | ✓ Link | 28.1 | | | | | Transformer-DRILL Base | 2019-05-14 |
AdvAug: Robust Adversarial Augmentation for Neural Machine Translation | | 28.08 | | | | | AdvAug (mixup) | 2020-06-21 |
Incorporating a Local Translation Mechanism into Non-autoregressive Translation | ✓ Link | 27.35 | | | | | CMLM+LAT+4 iterations | 2020-11-12 |
Attention Is All You Need | ✓ Link | 27.3 | | | | 330000000.0G | Transformer Base | 2017-06-12 |
Levenshtein Transformer | ✓ Link | 27.27 | | | | | Levenshtein Transformer (distillation) | 2019-05-27 |
Non-autoregressive Translation with Disentangled Context Transformer | ✓ Link | 27.06 | | | | | DisCo + Mask-Predict (non-autoregressive) | |
Adaptively Sparse Transformers | ✓ Link | 26.93 | | | | | Adaptively Sparse Transformer (alpha-entmax) | 2019-08-30 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 26.8 | | | | | ResMLP-12 | 2021-05-07 |
Non-Autoregressive Translation by Learning Target Categorical Codes | ✓ Link | 26.6 | | | | | CNAT | 2021-03-21 |
Lite Transformer with Long-Short Range Attention | ✓ Link | 26.5 | | 17.3M | | | Lite Transformer | 2020-04-24 |
Convolutional Sequence to Sequence Learning | ✓ Link | 26.4 | | | 54G | | ConvS2S (ensemble) | 2017-05-08 |
ResMLP: Feedforward networks for image classification with data-efficient training | ✓ Link | 26.4 | | | | | ResMLP-6 | 2021-05-07 |
Accelerating Neural Transformer via an Average Attention Network | ✓ Link | 26.31 | | | | | Average Attention Network | 2018-05-02 |
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation | ✓ Link | 26.3 | | | | | GNMT+RL | 2016-09-26 |
Depthwise Separable Convolutions for Neural Machine Translation | ✓ Link | 26.1 | | | | | SliceNet | 2017-06-09 |
Accelerating Neural Transformer via an Average Attention Network | ✓ Link | 26.05 | | | | | Average Attention Network (w/o FFN) | 2018-05-02 |
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ✓ Link | 26.03 | | | 24G | | MoE | 2017-01-23 |
Accelerating Neural Transformer via an Average Attention Network | ✓ Link | 25.91 | | | | | Average Attention Network (w/o gate) | 2018-05-02 |
Adaptively Sparse Transformers | ✓ Link | 25.89 | | | | | Adaptively Sparse Transformer (1.5-entmax) | 2019-08-30 |
Dense Information Flow for Neural Machine Translation | ✓ Link | 25.52 | | | | | DenseNMT | 2018-06-03 |
Glancing Transformer for Non-Autoregressive Neural Machine Translation | ✓ Link | 25.21 | | | | | GLAT | 2020-08-18 |
Incorporating a Local Translation Mechanism into Non-autoregressive Translation | ✓ Link | 25.20 | | | | | CMLM+LAT+1 iterations | 2020-11-12 |
Convolutional Sequence to Sequence Learning | ✓ Link | 25.16 | | | 72G | | ConvS2S | 2017-05-08 |
Neural Machine Translation in Linear Time | ✓ Link | 23.75 | | | | | ByteNet | 2016-10-31 |
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | ✓ Link | 23.64 | | | | | FlowSeq-large (NPD n = 30) | 2019-09-05 |
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | ✓ Link | 23.14 | | | | | FlowSeq-large (NPD n = 15) | 2019-09-05 |
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | ✓ Link | 22.94 | | | | | FlowSeq-large (IWD n = 15) | 2019-09-05 |
Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement | ✓ Link | 21.54 | | | | | Denoising autoencoders (non-autoregressive) | 2018-02-19 |
Effective Approaches to Attention-based Neural Machine Translation | ✓ Link | 20.9 | | | | | RNN Enc-Dec Att | 2015-08-17 |
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | ✓ Link | 20.85 | | | | | FlowSeq-large | 2019-09-05 |
 | | 20.7 | | | | | PBMT | |
Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation | ✓ Link | 20.7 | | | 119G | | Deep-Att | 2016-06-14 |
Edinburgh's Syntax-Based Systems at WMT 2015 | | 20.7 | | | | | Phrase Based MT | 2015-09-01 |
Phrase-Based & Neural Unsupervised Machine Translation | ✓ Link | 20.23 | | | | | PBSMT + NMT | 2018-04-20 |
Non-Autoregressive Neural Machine Translation | ✓ Link | 19.17 | | | | | NAT +FT + NPD | 2017-11-07 |
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow | ✓ Link | 18.55 | | | | | FlowSeq-base | 2019-09-05 |
Sequence-Level Knowledge Distillation | ✓ Link | 18.5 | | | | | Seq-KD + Seq-Inter + Word-KD | 2016-06-25 |
Phrase-Based & Neural Unsupervised Machine Translation | ✓ Link | 17.94 | | | | | Unsupervised PBSMT | 2018-04-20 |
Neural Semantic Encoders | ✓ Link | 17.9 | | | | | NSE-NSE | 2016-07-14 |
Phrase-Based & Neural Unsupervised Machine Translation | ✓ Link | 17.16 | | | | | Unsupervised NMT + Transformer | 2018-04-20 |
Unsupervised Statistical Machine Translation | ✓ Link | 14.08 | | | | | SMT + iterative backtranslation (unsupervised) | 2018-09-04 |
Effective Approaches to Attention-based Neural Machine Translation | ✓ Link | 14.0 | | | | | Reverse RNN Enc-Dec | 2015-08-17 |
Effective Approaches to Attention-based Neural Machine Translation | ✓ Link | 11.3 | | | | | RNN Enc-Dec | 2015-08-17 |
Multi-branch Attentive Transformer | ✓ Link | | 29.9 | | | | MAT | 2020-06-18 |
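
The table reports both BLEU and SacreBLEU columns because plain BLEU varies with tokenization and evaluation scripts, while SacreBLEU fixes these choices and makes scores comparable across papers. The snippet below is a minimal sketch, not taken from any of the listed papers, of how a corpus-level SacreBLEU score of the kind shown above is typically computed with the `sacrebleu` package; the sentences are placeholders, not WMT 2014 data.

```python
# Minimal sketch: corpus-level SacreBLEU with the `sacrebleu` package.
# The hypotheses and references below are illustrative placeholders.
import sacrebleu

hypotheses = [
    "The cat sits on the mat .",
    "There is a dog in the garden .",
]
# One reference stream, aligned with the hypotheses (more streams can be added).
references = [[
    "The cat is sitting on the mat .",
    "A dog is in the garden .",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"SacreBLEU: {bleu.score:.2f}")  # single corpus-level score, as in the table
```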