Paper | Code | BLEU | Params | Model | Date |
--- | --- | --- | --- | --- | --- |
Integrating Pre-trained Language Model into Neural Machine Translation | | 40.43 | | PiNMT | 2023-10-30 |
BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation | ✓ Link | 38.61 | 73.8M | BiBERT | 2021-09-09 |
Bi-SimCut: A Simple Strategy for Boosting Neural Machine Translation | ✓ Link | 38.37 | | Bi-SimCut | 2022-06-06 |
Relaxed Attention for Transformer Models | ✓ Link | 37.96 | 24.1M | Cutoff + Relaxed Attention + LM | 2022-09-20 |
Deterministic Reversible Data Augmentation for Neural Machine Translation | ✓ Link | 37.95 | | DRDA | 2024-06-04 |
R-Drop: Regularized Dropout for Neural Networks | ✓ Link | 37.90 | | Transformer + R-Drop + Cutoff | 2021-06-28 |
Bi-SimCut: A Simple Strategy for Boosting Neural Machine Translation | ✓ Link | 37.81 | | SimCut | 2022-06-06 |
Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | ✓ Link | 37.78 | | Cutoff+Knee | 2020-03-09 |
A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation | ✓ Link | 37.6 | | Cutoff | 2020-09-29 |
CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation | ✓ Link | 37.53 | | CipherDAug | 2022-04-01 |
R-Drop: Regularized Dropout for Neural Networks | ✓ Link | 37.25 | | Transformer + R-Drop | 2021-06-28 |
Data Diversification: A Simple Strategy For Neural Machine Translation | ✓ Link | 37.2 | | Data Diversification | 2019-11-05 |
UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost | | 36.88 | | UniDrop | 2021-04-11 |
Sequence Generation with Mixed Representations | ✓ Link | 36.41 | | MixedRepresentations | 2020-07-11 |
Mask Attention Networks: Rethinking and Strengthen Transformer | ✓ Link | 36.3 | 37M | Mask Attention Network (small) | 2021-03-25 |
MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning | ✓ Link | 36.3 | | MUSE (Parallel Multi-Scale Attention) | 2019-11-17 |
Rethinking Perturbations in Encoder-Decoders for Fast Training | ✓ Link | 36.22 | 37M | Transformer+Rep(Sim)+WDrop | 2021-04-05 |
Multi-branch Attentive Transformer | ✓ Link | 36.22 | | MAT | 2020-06-18 |
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks | ✓ Link | 35.8 | | TransformerBase + AutoDropout | 2021-01-05 |
Joint Source-Target Self Attention with Locality Constraints | ✓ Link | 35.7 | | Local Joint Self-attention | 2019-05-16 |
Time-aware Large Kernel Convolutions | ✓ Link | 35.5 | | TaLK Convolutions | 2020-02-08 |
Autoregressive Knowledge Distillation through Imitation Learning | ✓ Link | 35.4 | | ImitKD + Full | 2020-09-15 |
DeLighT: Deep and Light-weight Transformer | ✓ Link | 35.3 | | DeLighT | 2020-08-03 |
Pay Less Attention with Lightweight and Dynamic Convolutions | ✓ Link | 35.2 | | DynamicConv | 2019-01-29 |
Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks | | 35.1385 | | Transformer | 2022-05-15 |
Pay Less Attention with Lightweight and Dynamic Convolutions | ✓ Link | 34.8 | | LightConv | 2019-01-29 |
Attention Is All You Need | ✓ Link | 34.44 | | Transformer | 2017-06-12 |
Random Feature Attention | | 34.4 | | Rfa-Gate-arccos | 2021-03-03 |
Latent Alignment and Variational Attention | ✓ Link | 33.1 | | Variational Attention | 2018-07-10 |
Classical Structured Prediction Losses for Sequence to Sequence Learning | ✓ Link | 32.84 | | Minimum Risk Training [Edunov2017] | 2017-11-14 |
Non-Autoregressive Translation by Learning Target Categorical Codes | ✓ Link | 31.15 | | CNAT | 2021-03-21 |
Towards Neural Phrase-based Machine Translation | ✓ Link | 30.08 | | Neural PBMT + LM [Huang2018] | 2017-06-17 |
Tag-less Back-Translation | | 28.83 | | Back-Translation Finetuning | 2019-12-22 |
An Actor-Critic Algorithm for Sequence Prediction | ✓ Link | 28.53 | | Actor-Critic [Bahdanau2017] | 2016-07-24 |
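
The numbers above are corpus-level BLEU scores. As a minimal sketch of how such a score is computed (the file names are illustrative, and sacrebleu is only one choice: many of the listed papers report tokenized BLEU from multi-bleu.perl, whose values are not directly comparable to detokenized sacrebleu numbers):

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (illustrative only; the
# papers in the table may use tokenized multi-bleu.perl, which can differ).
import sacrebleu

# Hypothetical files: one detokenized sentence per line, aligned by line number.
with open("hypotheses.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# corpus_bleu takes the system outputs and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")  # comparable in form to the BLEU column above
```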