| Paper | Code | Accuracy | Model | Date |
| --- | --- | --- | --- | --- |
| PaLM: Scaling Language Modeling with Pathways | ✓ Link | 100 | PaLM 540B (fine-tuned) | 2022-04-05 |
| Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | | 99.4 | Vega v2 6B (KD-based prompt transfer) | 2022-12-04 |
| ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 99.2 | ST-MoE-32B 269B (fine-tuned) | 2022-02-17 |
| UL2: Unifying Language Learning Paradigms | ✓ Link | 99 | UL2 20B (fine-tuned) | 2022-05-10 |
| DeBERTa: Decoding-enhanced BERT with Disentangled Attention | ✓ Link | 98.4 | DeBERTa-Ensemble | 2020-06-05 |
| Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | | 98.2 | Turing NLR v5 XXL 5.4B (fine-tuned) | 2022-12-04 |
| DeBERTa: Decoding-enhanced BERT with Disentangled Attention | ✓ Link | 96.8 | DeBERTa-1.5B | 2020-06-05 |
| PaLM 2 Technical Report | ✓ Link | 96.0 | PaLM 2-L (1-shot) | 2023-05-17 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 94.8 | T5-XXL 11B (fine-tuned) | 2019-10-23 |
| Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 94 | FLAN 137B (prompt-tuned) | 2021-09-03 |
| Language Models are Few-Shot Learners | ✓ Link | 92 | GPT-3 175B (few-shot, k=32) | 2020-05-28 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 92 | T5-XL 3B (fine-tuned) | 2019-10-23 |
| Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 91 | FLAN 137B (zero-shot) | 2021-09-03 |
| ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 91 | ST-MoE-L 4.1B (fine-tuned) | 2022-02-17 |
| Language Models are Few-Shot Learners | ✓ Link | 91 | GPT-3 175B (0-shot) | 2020-05-28 |
| The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning | ✓ Link | 90.9 | T0-3B (CoT fine-tuned) | 2023-05-23 |
| WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 90.6 | RoBERTa-Winogrande-ft 355M (fine-tuned) | 2019-07-24 |
| PaLM 2 Technical Report | ✓ Link | 90.0 | PaLM 2-M (1-shot) | 2023-05-17 |
| Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners | ✓ Link | 89.88 | Flipped-3B | 2022-10-06 |
| PaLM 2 Technical Report | ✓ Link | 89.0 | PaLM 2-S (1-shot) | 2023-05-17 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | 88 | GPT-NeoX (one-shot) | 2023-03-30 |
| Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 87 | FLAN 137B (few-shot, k=16) | 2021-09-03 |
| Language Models are Few-Shot Learners | ✓ Link | 87 | GPT-3 175B (1-shot) | 2020-05-28 |
| WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 86.4 | RoBERTa-ft 355M (fine-tuned) | 2019-07-24 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | 86 | BloombergGPT (one-shot) | 2023-03-30 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | 86 | OPT 66B (one-shot) | 2023-03-30 |
| Language Models are Few-Shot Learners | ✓ Link | 86 | GPT-3 13B (few-shot, k=32) | 2020-05-28 |
| Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models | | 85.30 | KiC-770M | 2022-10-28 |
| UL2: Unifying Language Learning Paradigms | ✓ Link | 85 | UL2 20B (0-shot) | 2022-05-10 |
| WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 84.4 | RoBERTa-Winogrande 355M (fine-tuned) | 2019-07-24 |
| Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 84.0 | Neo-6B (QA + WS) | 2022-10-05 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | 84 | BLOOM 176B (one-shot) | 2023-03-30 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 83.4 | T5-Large 770M (fine-tuned) | 2019-10-23 |
| SocialIQA: Commonsense Reasoning about Social Interactions | ✓ Link | 83.4 | BERT-SocialIQA 340M | 2019-04-22 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 81 | Hybrid H3 2.7B (0-shot, logit scoring) | 2022-12-28 |
| SocialIQA: Commonsense Reasoning about Social Interactions | ✓ Link | 80.8 | BERT-large 340M | 2019-04-22 |
| Exploring the Benefits of Training Expert Language Models over Instruction Tuning | ✓ Link | 79.25 | RoE-3B | 2023-02-07 |
| Efficient Language Modeling with Sparse all-MLP | | 79 | sMLP – deterministic 9.4B (0-shot) | 2022-03-14 |
| KELM: Knowledge Enhanced Pre-Trained Language Representations with Message Passing on Hierarchical Relational Graphs | ✓ Link | 78.0 | KELM (fine-tuned BERT-large-based single model) | 2021-09-09 |
| AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model | ✓ Link | 78.0 | AlexaTM 20B | 2022-08-02 |
| Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 77.0 | Neo-6B (few-shot) | 2022-10-05 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 77 | Hybrid H3 2.7B (3-shot, logit scoring) | 2022-12-28 |
| WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 76.4 | Causal Strength w/ multi-word predicates (presumably on WinoGrande?) | 2019-07-24 |
| Efficient Language Modeling with Sparse all-MLP | | 76 | Gshard 9B | 2022-03-14 |
| Efficient Language Modeling with Sparse all-MLP | | 75 | Switch Transformer 9B | 2022-03-14 |
| Language Models are Few-Shot Learners | ✓ Link | 73.0 | GPT-3 Large 760M (0-shot) | 2020-05-28 |
| Handling Multiword Expressions in Causality Estimation | | 71.2 | Causal Strength Computation w/ multi-word predicates (on ClueWeb12) | 2017-01-01 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 71.2 | T5-Base 220M (fine-tuned) | 2019-10-23 |
| Handling Multiword Expressions in Causality Estimation | | 70.2 | Causal Strength Computation (on Causal Net) | 2017-01-01 |
| Handling Multiword Expressions in Causality Estimation | | 69.9 | Causal Strength Computation (on ClueWeb12) | 2017-01-01 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 67 | Hybrid H3 125M (0-shot, logit scoring) | 2022-12-28 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 67 | Hybrid H3 125M (0-shot, rank classification) | 2022-12-28 |
| WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 65.4 | Pointwise Mutual Information (on 10M stories) | 2019-07-24 |
| Efficient Language Modeling with Sparse all-MLP | | 64 | HASH Layers 10B (0-shot) | 2022-03-14 |
| Efficient Language Modeling with Sparse all-MLP | | 63 | Base Layers 10B (0-shot) | 2022-03-14 |
| N-Grammer: Augmenting Transformers with latent n-grams | ✓ Link | 60.0 | N-Grammer 343M | 2022-07-13 |
| Handling Multiword Expressions in Causality Estimation | | 58.8 | Pointwise Mutual Information (on Project Gutenberg) | 2017-01-01 |
| Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 58.2 | Neo-6B (QA) | 2022-10-05 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 51 | H3 125M (0-shot, rank classification) | 2022-12-28 |
| Handling Multiword Expressions in Causality Estimation | | 50 | Random chance baseline | 2017-01-01 |