Paper | Paper Link | Score | Model | Date
--- | --- | --- | --- | ---
Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | | 96% | Vega v2 6B (KD-based prompt transfer) | 2022-12-04
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 95.7% | PaLM 540B (fine-tuned) | 2022-04-05 |
Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | | 94.1% | Turing NLR v5 XXL 5.4B (fine-tuned) | 2022-12-04 |
ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 93.5% | ST-MoE-32B 269B (fine-tuned) | 2022-02-17 |
DeBERTa: Decoding-enhanced BERT with Disentangled Attention | ✓ Link | 93.2% | DeBERTa-1.5B | 2020-06-05 |
Muppet: Massive Multi-task Representations with Pre-Finetuning | ✓ Link | 92.8% | MUPPET RoBERTa Large | 2021-01-26
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing | ✓ Link | 92.7% | DeBERTaV3-large | 2021-11-18
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 92.5% | T5-XXL 11B | 2019-11-08 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 92.5% | T5-XXL 11B (fine-tuned) | 2019-10-23 |
ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 92.1% | ST-MoE-L 4.1B (fine-tuned) | 2022-02-17 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 92.1% | UL2 20B (fine-tuned) | 2022-05-10 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 92.0% | SMART-RoBERTa | 2019-11-08
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 91.7% | FLAN 137B (prompt-tuned) | 2021-09-03 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 91.1% | T5-XL 3B | 2019-10-23 |
Entailment as Few-Shot Learner | ✓ Link | 90.5% | RoBERTa-large 355M + Entailment as Few-shot Learner | 2021-04-29 |
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | ✓ Link | 89.2% | ALBERT | 2019-09-26 |
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding | | 88.7% | Adv-RoBERTa ensemble | 2019-08-13 |
RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | 88.2% | RoBERTa | 2019-07-26 |
RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | 88.2% | RoBERTa (ensemble) | 2019-07-26 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 87.4% | T5-Large 738M | 2023-04-27 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 87.2% | T5-Large 770M | 2019-10-23 |
Entailment as Few-Shot Learner | ✓ Link | 87.2% | RoBERTa-large 355M + EFL + UCA | 2021-04-29 |
A Statistical Framework for Low-bitwidth Training of Deep Neural Networks | ✓ Link | 86.8% | PSQ (Chen et al., 2020) | 2020-10-27
XLNet: Generalized Autoregressive Pretraining for Language Understanding | ✓ Link | 85.9% | XLNet (single model) | 2019-06-19 |
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | ✓ Link | 85.4% | RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned) | 2022-08-15 |
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | ✓ Link | 84.8% | OPT-IML 175B | 2022-12-22 |
Q8BERT: Quantized 8Bit BERT | ✓ Link | 84.8% | Q8BERT (Zafrir et al., 2019) | 2019-10-14
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT | | 84.7% | Q-BERT (Shen et al., 2020) | 2019-09-12
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 84.5% | FLAN 137B (8-shot) | 2021-09-03 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 84.1% | FLAN 137B (0-shot) | 2021-09-03 |
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | ✓ Link | 83.8% | OPT-IML 30B | 2022-12-22 |
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | | 83.6% | ELECTRA | 
PaLM 2 Technical Report | ✓ Link | 81.9% | PaLM 2-M (1-shot) | 2023-05-17 |
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning | ✓ Link | 80.8% | T0-3B (CoT fine-tuned) | 2023-05-23 |
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | ✓ Link | 80.2% | ERNIE 2.0 Large | 2019-07-29 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 80.1% | T5-Base 220M | 2019-10-23 |
CLEAR: Contrastive Learning for Sentence Representation | | 79.8% | MLM+ del-span | 2020-12-31 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 79.6% | PaLM 540B (5-shot) | 2022-04-05 |
PaLM 2 Technical Report | ✓ Link | 79.3% | PaLM 2-L (1-shot) | 2023-05-17 |
SpanBERT: Improving Pre-training by Representing and Predicting Spans | ✓ Link | 79.0% | SpanBERT | 2019-07-24 |
PaLM 2 Technical Report | ✓ Link | 78.7% | PaLM 2-S (1-shot) | 2023-05-17 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 78.7% | PaLM 540B (1-shot) | 2022-04-05 |
Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 75.1% | Neo-6B (QA + WS) | 2022-10-05 |
Big Bird: Transformers for Longer Sequences | ✓ Link | 75.0% | BigBird | 2020-07-28 |
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | ✓ Link | 74.8% | ERNIE 2.0 Base | 2019-07-29 |
Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models | | 74.0% | KiC-770M | 2022-10-28
RealFormer: Transformer Likes Residual Attention | ✓ Link | 73.7% | RealFormer | 2020-12-21 |
SqueezeBERT: What can computer vision teach NLP about efficient neural networks? | ✓ Link | 73.2% | SqueezeBERT | 2020-06-19 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 72.9% | PaLM 540B (0-shot) | 2022-04-05 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 71.2% | SMART-BERT | 2019-11-08 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 71.2% | SMART | 2019-11-08 |
Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners | ✓ Link | 71.05% | Flipped-3B | 2022-10-06
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | ✓ Link | 70.1% | BERT-large 340M | 2018-10-11 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 69.9% | T5-Small | 2019-10-23 |
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | ✓ Link | 69.9% | data2vec | 2022-02-07 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 69.3% | BloombergGPT 50B (1-shot) | 2023-03-30
FNet: Mixing Tokens with Fourier Transforms | ✓ Link | 69% | FNet-Large | 2021-05-09 |
Language Models are Few-Shot Learners | ✓ Link | 69% | GPT-3 175B (few-shot, k=32) | 2020-05-28 |
ERNIE: Enhanced Language Representation with Informative Entities | ✓ Link | 68.8% | ERNIE | 2019-05-17 |
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model | ✓ Link | 68.6% | AlexaTM 20B | 2022-08-02 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 67.9% | LaMini-GPT 1.5B | 2023-04-27 |
SenseBERT: Driving Some Sense into BERT | | 67.5% | SenseBERT-base 110M | 2019-08-15 |
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | ✓ Link | 66.8% | OPT-IML 1.3B | 2022-12-22 |
TinyBERT: Distilling BERT for Natural Language Understanding | ✓ Link | 66% | TinyBERT-6 67M | 2019-09-23 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 65% | LaMini-F-T5 783M | 2023-04-27 |
Exploring the Benefits of Training Expert Language Models over Instruction Tuning | ✓ Link | 64.01% | RoE-3B | 2023-02-07
Not all layers are equally as important: Every Layer Counts BERT | | 63% | ELC-BERT-base 98M (zero init) | 2023-11-03
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | ✓ Link | 62.9% | DistilBERT 66M | 2019-10-02 |
TinyBERT: Distilling BERT for Natural Language Understanding | ✓ Link | 62.9% | TinyBERT-4 14.5M | 2019-09-23 |
Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 61.7% | Neo-6B (QA) | 2022-10-05 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 60.7% | UL2 20B (0-shot) | 2022-05-10 |
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | ✓ Link | 60.3% | OPT 175B | 2022-12-22 |
N-Grammer: Augmenting Transformers with latent n-grams | ✓ Link | 59.2% | N-Grammer 343M | 2022-07-13 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 59.2% | Hybrid H3 125M (0-shot, logit scoring) | 2022-12-28 |
Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 58.8% | Neo-6B (few-shot) | 2022-10-05 |
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | ✓ Link | 58.1% | OPT 30B | 2022-12-22 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 58.1% | Hybrid H3 125M (3-shot, logit scoring) | 2022-12-28 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 58.1% | Hybrid H3 125M (3-shot, rank classification) | 2022-12-28 |
How to Train BERT with an Academic Budget | ✓ Link | 57.7% | 24hBERT | 2021-04-15 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 57.4% | BLOOM 176B (1-shot) | 2023-03-30 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 57% | LaMini-T5 738M | 2023-04-27 |
Not all layers are equally as important: Every Layer Counts BERT | | 55.4% | ELC-BERT-small 24M | 2023-11-03
BloombergGPT: A Large Language Model for Finance | ✓ Link | 54.9% | OPT 66B (1-shot) | 2023-03-30 |
Not all layers are equally as important: Every Layer Counts BERT | | 54.7% | LTG-BERT-base 98M | 2023-11-03
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | ✓ Link | 54.2% | OPT 1.3B | 2022-12-22 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 53.8% | GPT-NeoX 20B (1-shot) | 2023-03-30 |
Not all layers are equally as important: Every Layer Counts BERT | | 53.7% | LTG-BERT-small 24M | 2023-11-03
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 53.1% | H3 125M (0-shot, rank classification) | 2022-12-28 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 52.3% | GPT-2-XL 1.5B | 2023-04-27 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 52.3% | H3 125M (3-shot, rank classification) | 2022-12-28 |
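
Scores in the table are percentages; for SuperGLUE classification tasks these are typically plain accuracy over the evaluation set. The sketch below shows how such a percentage is computed, assuming accuracy is the metric; the helper function and the example predictions are illustrative placeholders, not outputs of any model listed above.

```python
# Minimal sketch: how a percentage score like those in the table is typically
# computed for a SuperGLUE classification task, assuming the metric is plain
# accuracy. The example predictions and labels are placeholders, not real
# model outputs.
from typing import Sequence


def accuracy_percent(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return classification accuracy as a percentage in [0, 100]."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Placeholder example: 3 of 4 predictions match the references -> 75.0
print(accuracy_percent([1, 0, 1, 1], [1, 0, 0, 1]))
```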