Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | ✓ Link | 92.54 | GPT-4o (HPT) | 2024-06-18 |
Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention | ✓ Link | 91.2 | DeBERTaV3-large+KEAR | 2021-12-06 |
PaLM 2 Technical Report | ✓ Link | 90.4 | PaLM 2 (few-shot, CoT, SC) | 2023-05-17 |
Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention | ✓ Link | 89.4 | KEAR | 2021-12-06 |
Fusing Context Into Knowledge Graph for Commonsense Question Answering | ✓ Link | 83.3 | DEKCOR | 2020-12-09 |
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark | ✓ Link | 79.3 | Unicorn 11B (fine-tuned) | 2021-03-24 |
Muppet: Massive Multi-task Representations with Pre-Finetuning | ✓ Link | 79.2 | MUPPET RoBERTa Large | 2021-01-26 |
UnifiedQA: Crossing Format Boundaries With a Single QA System | ✓ Link | 79.1 | UnifiedQA 11B (fine-tuned) | 2020-05-02 |
Deep Bidirectional Language-Knowledge Graph Pretraining | ✓ Link | 78.2 | DRAGON | 2022-10-17 |
UnifiedQA: Crossing Format Boundaries With a Single QA System | ✓ Link | 78.1 | T5-XXL 11B (fine-tuned) | 2020-05-02 |
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | ✓ Link | 76.5 | ALBERT (ensemble), Lan et al. (2020) | 2019-09-26 |
UnifiedQA: Crossing Format Boundaries With a Single QA System | ✓ Link | 76.2 | UnifiedQA 11B (zero-shot) | 2020-05-02 |
QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering | ✓ Link | 76.1 | QA-GNN | 2021-04-13 |
Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering | ✓ Link | 75.3 | XLNet+GraphReason | 2019-09-09 |
GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering | | 73.5 | GrapeQA: PEGA | 2023-03-22 |
Towards Generalizable Neuro-Symbolic Systems for Commonsense Question Answering | | 73.2 | RoBERTa+HyKAS, Ma et al. (2019) | 2019-10-30 |
Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention | ✓ Link | 73.0 | GPT-3 Direct Finetuned | 2021-12-06 |
STaR: Bootstrapping Reasoning With Reasoning | ✓ Link | 72.3 | STaR (on GPT-J) | 2022-03-28 |
RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | 72.1 | RoBERTa-Large 355M | 2019-07-26 |
STaR: Bootstrapping Reasoning With Reasoning | ✓ Link | 68.8 | STaR without Rationalization (on GPT-J) | 2022-03-28 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 66.4 | OPT 66B (1-shot) | 2023-03-30 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 65.5 | BloombergGPT 50B (1-shot) | 2023-03-30 |
Explain Yourself! Leveraging Language Models for Commonsense Reasoning | ✓ Link | 64.7 | CAGE-reasoning | 2019-06-06 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 64.2 | BLOOM 176B (1-shot) | 2023-03-30 |
UnifiedQA: Crossing Format Boundaries With a Single QA System | ✓ Link | 64.0 | UnifiedQA 440M (fine-tuned) | 2020-05-02 |
UnifiedQA: Crossing Format Boundaries With a Single QA System | ✓ Link | 62.5 | BART-large 440M (fine-tuned) | 2020-05-02 |
Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models | | 62.2 | BERT_CSlarge | 2019-08-19 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 60.4 | GPT-NeoX 20B (1-shot) | 2023-03-30 |
STaR: Bootstrapping Reasoning With Reasoning | ✓ Link | 60.0 | GPT-J Direct Finetuned | 2022-03-28 |
KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning | ✓ Link | 58.9 | KagNet | 2019-09-04 |
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | ✓ Link | 55.9 | BERT-LARGE | 2018-11-02 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 55.7 | UL2 20B (chain-of-thought + self-consistency) | 2022-05-10 |
STaR: Bootstrapping Reasoning With Reasoning | ✓ Link | 55.6 | Few-shot CoT LaMDA 137B | 2022-03-28 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 51.4 | UL2 20B (chain-of-thought) | 2022-05-10 |
STaR: Bootstrapping Reasoning With Reasoning | ✓ Link | 36.6 | Few-shot CoT GPT-J | 2022-03-28 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 34.2 | UL2 20B (zero-shot) | 2022-05-10 |
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | ✓ Link | 28.6 | Chain of thought ASDiv | 2022-01-28 |
STaR: Bootstrapping Reasoning With Reasoning | ✓ Link | 20.9 | Few-shot Direct GPT-J | 2022-03-28 |