Paper | Link | Accuracy | Model | Date
--- | --- | --- | --- | ---
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 100.0 | PaLM 540B (fine-tuned) | 2022-04-05
Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | | 98.6 | Vega v2 6B (KD-based prompt transfer) | 2022-12-04 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 98.1 | UL2 20B (fine-tuned) | 2022-05-10 |
Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | | 97.3 | Turing NLR v5 XXL 5.4B (fine-tuned) | 2022-12-04 |
ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 96.6 | ST-MoE-32B 269B (fine-tuned) | 2022-02-17 |
DeBERTa: Decoding-enhanced BERT with Disentangled Attention | ✓ Link | 95.9 | DeBERTa-1.5B | 2020-06-05 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 93.8 | T5-XXL 11B (fine-tuned) | 2019-10-23 |
ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 93.3 | ST-MoE-L 4.1B (fine-tuned) | 2022-02-17 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 90.1 | RoBERTa-WinoGrande 355M | 2019-07-24 |
Scaling Instruction-Finetuned Language Models | ✓ Link | 89.82 | Flan-T5 XXL (zero-shot) | 2022-10-20 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 89.5 | PaLM 540B (5-shot) | 2022-04-05 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 89.1 | PaLM 540B (0-shot) | 2022-04-05 |
PaLM 2 Technical Report | ✓ Link | 88.1 | PaLM 2-M (1-shot) | 2023-05-17 |
PaLM 2 Technical Report | ✓ Link | 86.9 | PaLM 2-L (1-shot) | 2023-05-17 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 86.5 | FLAN 137B (prompt-tuned) | 2021-09-03 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 86.3 | PaLM 540B (1-shot) | 2022-04-05 |
TTTTTackling WinoGrande Schemas | | 84.6 | TTTTT 3B (fine-tuned) | 2020-03-18 |
PaLM 2 Technical Report | ✓ Link | 84.6 | PaLM 2-S (1-shot) | 2023-05-17 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 83.1 | RoBERTa-DPR 355M | 2019-07-24 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 80.8 | FLAN 137B (zero-shot) | 2021-09-03 |
Language Models are Few-Shot Learners | ✓ Link | 80.1 | GPT-3 175B (few-shot) | 2020-05-28 |
Generative Data Augmentation for Commonsense Reasoning | ✓ Link | 80.0 | RoBERTa-large + G-DAug-Inf | 2020-04-24 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 79.9 | UL2 20B (0-shot) | 2022-05-10 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 78.8 | ALBERT-xxlarge 235M | 2021-04-16 |
Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 77.9 | Neo-6B (QA + WS) | 2022-10-05 |
A Hybrid Neural Network Model for Commonsense Reasoning | ✓ Link | 75.1 | HNN | 2019-07-27 |
Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 74.7 | Neo-6B (QA) | 2022-10-05 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 73.9 | RoBERTa-large 354M | 2021-04-16 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 73.3 | GPT-2-XL 1.5B | 2023-04-27 |
A Surprisingly Robust Trick for Winograd Schema Challenge | ✓ Link | 72.5 | BERTwiki 340M (fine-tuned on WSCR) | 2019-05-15 |
SocialIQA: Commonsense Reasoning about Social Interactions | ✓ Link | 72.5 | BERT-SocialIQA 340M | 2019-04-22 |
A Surprisingly Robust Trick for Winograd Schema Challenge | ✓ Link | 71.4 | BERT-large 340M (fine-tuned on WSCR) | 2019-05-15 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 70.7 | GPT-2-XL 1.5B | 2019-02-14 |
A Surprisingly Robust Trick for Winograd Schema Challenge | ✓ Link | 70.3 | BERTwiki 340M (fine-tuned on half of WSCR) | 2019-05-15 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 69.6 | LaMini-GPT 1.5B | 2023-04-27 |
How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG | ✓ Link | 69.2 | GPT-2 Medium 774M (partial scoring) | 2018-11-05 |
N-Grammer: Augmenting Transformers with latent n-grams | ✓ Link | 68.3 | N-Grammer 343M | 2022-07-13 |
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model | ✓ Link | 68.3 | AlexaTM 20B | 2022-08-02 |
SocialIQA: Commonsense Reasoning about Social Interactions | ✓ Link | 67.0 | BERT-large 340M | 2019-04-22 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 66.7 | T5-Large 738M | 2023-04-27 |
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning | ✓ Link | 66.0 | T0-3B (CoT fine-tuned) | 2023-05-23 |
Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models | | 65.40 | KiC-770M | 2022-10-28 |
How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG | ✓ Link | 64.5 | GPT-2 Medium 774M (full scoring) | 2018-11-05 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 64.1 | LaMini-F-T5 783M | 2023-04-27 |
A Simple Method for Commonsense Reasoning | ✓ Link | 63.7 | Ensemble of 14 LMs | 2018-06-07 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 63.5 | H3 125M (3-shot, rank classification) | 2022-12-28 |
Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | | 63.0 | DSSM | 2019-04-03 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 63.0 | RoBERTa-base 125M | 2021-04-16 |
A Simple Method for Commonsense Reasoning | ✓ Link | 62.6 | Word-level CNN+LSTM (partial scoring) | 2018-06-07 |
Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | | 62.4 | UDSSM-II (ensemble) | 2019-04-03 |
A Surprisingly Robust Trick for Winograd Schema Challenge | ✓ Link | 62.3 | BERT-base 110M (fine-tuned on WSCR) | 2019-05-15 |
Exploring the Benefits of Training Expert Language Models over Instruction Tuning | ✓ Link | 62.21 | RoE-3B | 2023-02-07 |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | ✓ Link | 62.0 | BERT-large 340M | 2018-10-11 |
How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG | ✓ Link | 61.5 | GPT-2 Small 117M (partial scoring) | 2018-11-05 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 61.5 | H3 125M (0-shot, rank classification) | 2022-12-28 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 61.4 | BERT-large 340M | 2021-04-16 |
Attention Is (not) All You Need for Commonsense Reasoning | ✓ Link | 60.3 | BERT-base 110M + MAS | 2019-05-31 |
On Generalization in Coreference Resolution | ✓ Link | 60.1 | longdoc S (OntoNotes + PreCo + LitBank) | 2021-09-20 |
On Generalization in Coreference Resolution | ✓ Link | 59.4 | longdoc S (OntoNotes + PreCo + LitBank + 30k pseudo-singletons) | 2021-09-20 |
Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | | 59.2 | UDSSM-II | 2019-04-03 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 59.0 | LaMini-T5 738M | 2023-04-27 |
Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners | ✓ Link | 58.37 | Flipped-3B | 2022-10-06 |
Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge | | 58.3 | KEE+NKAM (WSC2016 winner) | 2016-11-13 |
A Simple Method for Commonsense Reasoning | ✓ Link | 57.9 | Char-level CNN+LSTM (partial scoring) | 2018-06-07 |
Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | | 57.1 | UDSSM-I (ensemble) | 2019-04-03 |
A Knowledge Hunting Framework for Common Sense Reasoning | | 57.1 | Knowledge Hunter | 2018-10-02 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 57.1 | WKH | 2019-07-24 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 56.5 | BERT-base 110M | 2021-04-16 |
How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG | ✓ Link | 55.7 | GPT-2 Small 117M (full scoring) | 2018-11-05 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 55.4 | ALBERT-base 11M | 2021-04-16 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 54.8 | Pythia 12B (0-shot) | 2023-04-03 |
Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | | 54.5 | UDSSM-I | 2019-04-03 |
Attention Is All You Need | ✓ Link | 54.1 | Subword-level Transformer LM | 2017-06-12 |
Attention Is (not) All You Need for Commonsense Reasoning | ✓ Link | 52.8 | USSM + Supervised DeepNet + KB | 2019-05-31 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 52.8 | KEE+NKAM on WinoGrande | 2019-07-24 |
Attention Is (not) All You Need for Commonsense Reasoning | ✓ Link | 52.0 | USSM + KB | 2019-05-31 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 50.0 | Random chance baseline | 2021-04-16 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 43.3 | Hybrid H3 125M (3-shot, logit scoring) | 2022-12-28 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 38.5 | Pythia 2.8B (0-shot) | 2023-04-03 |
Ask Me Anything: A simple strategy for prompting language models | ✓ Link | 36.5 | Neo-6B (few-shot) | 2022-10-05 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 36.5 | Pythia 6.9B (0-shot) | 2023-04-03 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 36.5 | Pythia 12B (5-shot) | 2023-04-03 |
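Several rows above qualify results with a scoring method: "full scoring" and "partial scoring" come from A Simple Method for Commonsense Reasoning (Trinh & Le, 2018), where the pronoun is replaced by each candidate antecedent and a language model scores the resulting sentences. Below is a minimal sketch of both heuristics using an off-the-shelf GPT-2 from Hugging Face transformers; the model size and the example schema are illustrative only, not the exact configurations used by any entry in this table.

```python
# Sketch of full vs. partial LM scoring for a Winograd schema
# (after Trinh & Le, 2018). Assumes the `transformers` library.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text: str, score_from: int = 0) -> float:
    """Sum of token log-probabilities, optionally skipping the first
    `score_from` predicted tokens (used for partial scoring)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position i predict token i+1, so shift targets by one.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    per_token = logprobs[torch.arange(targets.numel()), targets]
    return per_token[score_from:].sum().item()

# Illustrative schema: the pronoun slot is filled by each candidate.
prefix = "The trophy doesn't fit into the suitcase because"
suffix = " is too large."
candidates = [" the trophy", " the suitcase"]

for method in ("full", "partial"):
    scores = []
    for cand in candidates:
        context_len = len(tokenizer(prefix + cand).input_ids)
        # Full scoring scores the whole sentence; partial scoring
        # conditions on prefix + candidate and scores only the suffix.
        start = 0 if method == "full" else context_len - 1
        scores.append(sentence_logprob(prefix + cand + suffix, start))
    print(method, "scoring picks:", candidates[scores.index(max(scores))].strip())
```

The "rank classification" and "logit scoring" annotations on the H3 rows refer to the same general idea: each answer option is scored by the model and the highest-scoring option is taken as the prediction.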