Paper | Code | Accuracy | Model | Date |
--- | --- | --- | --- | --- |
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | ✓ Link | 99.2% | ALBERT | 2019-09-26 |
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding | | 99.2% | StructBERT (RoBERTa ensemble) | 2019-08-13 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 99.2% | ALICE | 2019-11-08 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 99.2% | MT-DNN-SMART | 2019-11-08 |
RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | 98.9% | RoBERTa (ensemble) | 2019-07-26 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 96.7% | T5-11B | 2019-10-23 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 96.3% | T5-3B | 2019-10-23 |
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing | ✓ Link | 96% | DeBERTaV3large | 2021-11-18 |
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | | 95.4% | ELECTRA | 2020-03-23 |
DeBERTa: Decoding-enhanced BERT with Disentangled Attention | ✓ Link | 95.3% | DeBERTa (large) | 2020-06-05 |
XLNet: Generalized Autoregressive Pretraining for Language Understanding | ✓ Link | 94.9% | XLNet (single model) | 2019-06-19 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 94.8% | T5-Large 770M | 2019-10-23 |
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | ✓ Link | 94.7% | RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned) | 2022-08-15 |
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | ✓ Link | 94.6% | ERNIE 2.0 Large | 2019-07-29 |
A Statistical Framework for Low-bitwidth Training of Deep Neural Networks | ✓ Link | 94.5% | PSQ (Chen et al., 2020) | 2020-10-27 |
Entailment as Few-Shot Learner | ✓ Link | 94.5% | RoBERTa-large 355M + Entailment as Few-shot Learner | 2021-04-29 |
SpanBERT: Improving Pre-training by Representing and Predicting Spans | ✓ Link | 94.3% | SpanBERT | 2019-07-24 |
TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding | | 94.08% | TRANS-BLSTM | 2020-03-16 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 93.7% | T5-Base | 2019-10-23 |
Adversarial Self-Attention for Language Understanding | ✓ Link | 93.6% | ASA + RoBERTa | 2022-06-25 |
CLEAR: Contrastive Learning for Sentence Representation | | 93.4% | MLM + subs + del-span | 2020-12-31 |
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT | | 93.0% | Q-BERT (Shen et al., 2020) | 2019-09-12 |
Q8BERT: Quantized 8Bit BERT | ✓ Link | 93.0% | Q8BERT (Zafrir et al., 2019) | 2019-10-14 |
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | ✓ Link | 92.9% | ERNIE 2.0 Base | 2019-07-29 |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | ✓ Link | 92.7% | BERT-LARGE | 2018-10-11 |
Big Bird: Transformers for Longer Sequences | ✓ Link | 92.2% | BigBird | 2020-07-28 |
RealFormer: Transformer Likes Residual Attention | ✓ Link | 91.89% | RealFormer | 2020-12-21 |
Adversarial Self-Attention for Language Understanding | ✓ Link | 91.4% | ASA + BERT-base | 2022-06-25 |
ERNIE: Enhanced Language Representation with Informative Entities | ✓ Link | 91.3% | ERNIE | 2019-05-17 |
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | ✓ Link | 91.1% | data2vec | 2022-02-07 |
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization | ✓ Link | 91.0% | Charformer-Tall | 2021-06-23 |
How to Train BERT with an Academic Budget | ✓ Link | 90.6% | 24hBERT | 2021-04-15 |
SenseBERT: Driving Some Sense into BERT | | 90.6% | SenseBERT-base 110M | 2019-08-15 |
TinyBERT: Distilling BERT for Natural Language Understanding | ✓ Link | 90.4% | TinyBERT-6 67M | 2019-09-23 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 90.3% | T5-Small | 2019-10-23 |
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | ✓ Link | 90.2% | DistilBERT 66M | 2019-10-02 |
SqueezeBERT: What can computer vision teach NLP about efficient neural networks? | ✓ Link | 90.1% | SqueezeBERT | 2020-06-19 |
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention | ✓ Link | 88.7% | Nyströmformer | 2021-02-07 |
TinyBERT: Distilling BERT for Natural Language Understanding | ✓ Link | 87.7% | TinyBERT-4 14.5M | 2019-09-23 |
FNet: Mixing Tokens with Fourier Transforms | ✓ Link | 85% | FNet-Large | 2021-05-09 |
LM-CPPF: Paraphrasing-Guided Data Augmentation for Contrastive Prompt-Based Few-Shot Fine-Tuning | ✓ Link | 70.2% | LM-CPPF RoBERTa-base | 2023-05-29 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | SMART-BERT | 2019-11-08 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | SMART-RoBERTa | 2019-11-08 |
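The entries above are fine-tuned encoder checkpoints evaluated on a sentence-classification benchmark. As a minimal sketch of how such a result is typically reproduced (not the procedure of any cited paper), the following fine-tunes one of the listed backbones with the Hugging Face `transformers` Trainer. The checkpoint (`roberta-large`), the dataset (`glue`/`sst2`), and the hyperparameters are illustrative assumptions, not values taken from the table.

```python
# Illustrative sketch: fine-tuning a listed encoder on a GLUE-style binary
# classification task. Checkpoint, task, and hyperparameters are assumptions.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_name = "roberta-large"            # any backbone from the table with a public checkpoint
dataset = load_dataset("glue", "sst2")  # assumed task; swap in the benchmark being reported

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # SST-2 has a single "sentence" field; pair tasks would pass two text columns.
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def compute_metrics(eval_pred):
    # Accuracy, to match the metric reported in the table.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

args = TrainingArguments(
    output_dir="finetune-out",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())  # dev-set accuracy; leaderboard numbers are typically test-set scores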