Paper | Link | Accuracy (%) | Params (B) | Model | Date |
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | ✓ Link | 97.72 | | Claude 3.5 Sonnet (HPT) | 2024-06-18 |
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | ✓ Link | 97.1 | | DUP prompting on GPT-4 | 2024-04-23 |
Qwen2 Technical Report | ✓ Link | 96.7 | 72 | Qwen2-Math-72B-Instruct (greedy) | 2024-07-15 |
N/A | | 96.4 | 7 | SFT-Mistral-7B (MetaMath, OVM, Smart Ensemble) | |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 96.0 | | OpenMath2-Llama3.1-70B (majority@256) | 2024-10-02 |
N/A | | 95.2 | 75 | Jiutian Large Model | |
N/A | | 95.1 | 7 | DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 95 | | Claude 3 Opus (0-shot chain-of-thought) | 2024-03-04 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 94.9 | | OpenMath2-Llama3.1-70B | 2024-10-02 |
Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | ✓ Link | 94.8 | | GPT-4 (Teaching-Inspired) | 2024-10-10 |
N/A | | 94.13 | 7 | SFT-Mistral-7B (MetaMath + OVM + Ensemble) | |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 94.1 | | OpenMath2-Llama3.1-8B (majority@256) | 2024-10-02 |
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | ✓ Link | 94.0 | | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | 2024-06-26 |
N/A | | 93.2 | 7 | DAMOMath-7B (MetaMath, OVM, Ensemble) | |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 92.3 | | Claude 3 Sonnet (0-shot chain-of-thought) | 2024-03-04 |
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | ✓ Link | 92 | 70 | AlphaLLM (with MCTS) | 2024-04-18 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 91.7 | | OpenMath2-Llama3.1-8B | 2024-10-02 |
PaLM 2 Technical Report | ✓ Link | 91.0 | | PaLM 2 (few-shot, k=8, SC) | 2023-05-17 |
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | ✓ Link | 90.91 | | GaC (Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 2024-06-18 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 90.8 | 70 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 2024-02-15 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 90.4 | 70 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 90.1 | 70 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 2024-02-15 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 89.6 | 70 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ Link | 89.1 | 7 | Shepherd + Mistral-7B (SFT on MetaMath + PRM RL + PRM rerank, k=256) | 2023-12-14 |
N/A | | 89.0 | 13 | Llama SFT (MetaMath, ToRA, Ensemble) | |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 89 | 62 | Minerva 62B (maj5@100) | 2022-06-29 |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 88.9 | | Claude 3 Haiku (0-shot chain-of-thought) | 2024-03-04 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 88.3 | 70 | ToRA-70B (SC, k=50) | 2023-09-29 |
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | ✓ Link | 88.2 | 7 | DeepSeekMath-RL-7B | 2024-02-05 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 88.2 | 7 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 88.0 | 34 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 2024-02-15 |
Model Card and Evaluations for Claude Models | | 88 | | Claude 2 (0-shot chain-of-thought) | 2023-07-11 |
N/A | | 87.41 | 4 | Shivaay-4B (8-shot chain-of-thought) | |
Solving math word problems with process- and outcome-based feedback | | 87.3 | 70 | DeepMind 70B Model (SFT+ORM-RL, ORM reranking) | 2022-11-25 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 87.2 | 7 | MMOS-DeepSeekMath-7B (0-shot, k=50) | 2024-02-23 |
Solving math word problems with process- and outcome-based feedback | | 87.1 | 70 | DeepMind 70B Model (SFT+PRM-RL, PRM reranking) | 2022-11-25 |
Sparks of Artificial General Intelligence: Early experiments with GPT-4 | ✓ Link | 87.1 | | GPT-4 | 2023-03-22 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 86.9 | 7 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 2024-02-15 |
Orca-Math: Unlocking the potential of SLMs in Grade School Math | | 86.8 | 7 | Orca-Math 7B (fine-tuned) | 2024-02-16 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 86.8 | 7 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 86.8 | 13 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 2024-02-15 |
Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 86.5 | | Gemini Pro (maj1@32) | 2023-12-19 |
N/A | | 85.5 | | Codex (Self-Evaluation Guided Decoding, PAL, multiple reasoning chains, 9-shot gen, 5-shot eval) | |
Model Card and Evaluations for Claude Models | | 85.2 | | Claude 1.3 (0-shot chain-of-thought) | 2023-07-11 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 85.1 | 34 | ToRA-Code-34B (SC, k=50) | 2023-09-29 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 84.8 | 7 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 2024-02-15 |
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning | ✓ Link | 84.7 | 7 | OVM-Mistral-7B (verify100@1) | 2023-11-16 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 84.7 | 70 | OpenMath-Llama2-70B (w/ code) | 2024-02-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 84.6 | 70 | OpenMath-CodeLlama-70B (w/ code) | 2024-02-15 |
LEVER: Learning to Verify Language-to-Code Generation with Execution | ✓ Link | 84.5 | 175 | code-davinci-002 175B (LEVER, 8-shot) | 2023-02-16 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 84.3 | 70 | ToRA 70B | 2023-09-29 |
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ Link | 84.1 | 7 | Shepherd + Mistral-7B (SFT on MetaMath + PRM RL) | 2023-12-14 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 83.9 | 70 | MathCoder-L-70B | 2023-10-05 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 83.2 | 7 | WizardMath-7B-V1.1 | 2023-08-18 |
Making Large Language Models Better Reasoners with Step-Aware Verifier | | 83.2 | 175 | DIVERSE 175B (8-shot) | 2022-06-06 |
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning | ✓ Link | 82.6 | 7 | OVM-Mistral-7B (verify20@1) | 2023-11-16 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 82.6 | 7 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
The ART of LLM Refinement: Ask, Refine, and Trust | | 82.6 | | ChatGPT (Ask, Refine, Trust) | 2023-11-14 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 82.5 | 8 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 82.3 | 70 | MetaMath 70B | 2023-09-21 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 82.3 | 70 | MuggleMATH 70B | 2023-10-09 |
Large Language Models Can Self-Improve | | 82.1 | 540 | PaLM 540B (Self Improvement, Self Consistency) | 2022-10-20 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 81.7 | 34 | MathCoder-CL-34B | 2023-10-05 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 81.6 | 70 | WizardMath-70B-V1.0 | 2023-08-18 |
TinyGSM: achieving >80% on GSM8k with small language models | | 81.5 | 2.6 | Phi-GSM+V 1.3B+1.3B (verify48@1) | 2023-12-14 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 81.1 | 7 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 81.1 | 8 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
Model Card and Evaluations for Claude Models | | 80.9 | | Claude Instant 1.1 (0-shot chain-of-thought) | 2023-07-11 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 80.7 | 34 | ToRA-Code 34B | 2023-09-29 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 80.7 | 34 | OpenMath-CodeLlama-34B (w/ code) | 2024-02-15 |
PaLM 2 Technical Report | ✓ Link | 80.7 | | PaLM 2 (few-shot, k=8, CoT) | 2023-05-17 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 80.5 | 7 | MMOS-DeepSeekMath-7B (0-shot) | 2024-02-23 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 80.4 | 34 | MMOS-CODE-34B (0-shot) | 2024-02-23 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 80.2 | 7 | OpenMath-Mistral-7B (w/ code) | 2024-02-15 |
N/A | | 80.2 | | Self-Evaluation Guided Decoding (Codex, PAL, single reasoning chain, 9-shot gen, 5-shot eval) | |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 78.8 | 13 | OpenMath-CodeLlama-13B (w/ code) | 2024-02-15 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 78.5 | 540 | Minerva 540B (CoT) | 2022-06-29 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 78.3 | | Camelidae-8×34B (5-shot) | 2024-01-05 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 77.8 | | Qwen2idae-16x14B (5-shot) | 2024-01-05 |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 77.7 | 7 | MetaMath-Mistral-7B | 2023-09-21 |
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ✓ Link | 77.3 | 7 | OpenChat-3.5 7B | 2023-09-20 |
Solving math word problems with process- and outcome-based feedback | | 76.5 | 70 | DeepMind 70B Model (STaR, maj1@96) | 2022-11-25 |
N/A | | 76.4 | 7 | Arithmo2-Mistral-7B | |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 75.9 | 7 | OpenMath-CodeLlama-7B (w/ code) | 2024-02-15 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 75.8 | 13 | ToRA-Code 13B | 2023-09-29 |
N/A | | 74.7 | 7 | Arithmo-Mistral-7B | |
Self-Consistency Improves Chain of Thought Reasoning in Language Models | ✓ Link | 74.4 | 540 | PaLM 540B (maj1@40, 8-shot) | 2022-03-21 |
Large Language Models Can Self-Improve | | 74.4 | 540 | PaLM 540B (Self Consistency) | 2022-10-20 |
TinyGSM: achieving >80% on GSM8k with small language models | | 74.3 | 2.7 | Phi-GSM 2.7B (fine-tuned) | 2023-12-14 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 74.1 | 7 | MathCoder-CL-13B | 2023-10-05 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 74 | 13 | MuggleMATH 13B | 2023-10-09 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 73.9 | 7 | MMOS-CODE-7B (0-shot) | 2024-02-23 |
CodeT5+: Open Code Large Language Models for Code Understanding and Generation | ✓ Link | 73.8 | 0.77 | CodeT5+ | 2023-05-13 |
CAPO: Cost-Aware Prompt Optimization | ✓ Link | 73.73 | | Llama-3.3-70B + CAPO | 2025-04-22 |
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning | ✓ Link | 73.7 | 7 | OVM-Llama2-7B (verify100@1) | 2023-11-16 |
Large Language Models Can Self-Improve | | 73.5 | 540 | PaLM 540B (Self Improvement, CoT Prompting) | 2022-10-20 |
KwaiYiiMath: Technical Report | | 73.3 | 13 | KwaiYiiMath 13B | 2023-10-11 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 72.6 | 7 | ToRA-Code 7B | 2023-09-29 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 72.6 | 13 | MathCoder-L-13B | 2023-10-05 |
N/A | | 72.3 | | DBRX Base 132B | |
N/A | | 71.9 | | Self-Evaluation Guided Decoding (Codex, CoT, single reasoning chain, 9-shot gen, 5-shot eval) | |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 71.0 | 13 | MetaMath 13B | 2023-09-21 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 69.8 | 7 | MuggleMATH 7B | 2023-10-09 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 69.7 | 65 | LLaMA 65B (maj1@k) | 2023-02-27 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 68.5 | 62 | Minerva 62B (maj1@100) | 2022-06-29 |
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models | ✓ Link | 68.01 | 175 | code-davinci-002 (Least-to-Most Prompting) | 2022-05-21 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 67.8 | 7 | MathCoder-CL-7B | 2023-10-05 |
N/A | | 66.9 | | DBRX Instruct 132B | |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 66.4 | 7 | MetaMath 7B | 2023-09-21 |
CAPO: Cost-Aware Prompt Optimization | ✓ Link | 65.07 | | Mistral-Small-24B + CAPO | 2025-04-22 |
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | ✓ Link | 64.8 | 79 | RFT 70B | 2023-08-03 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 64.2 | 7 | MathCoder-L-7B | 2023-10-05 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 63.9 | 13 | WizardMath-13B-V1.0 | 2023-08-18 |
Solving Math Word Problems via Cooperative Reasoning induced Language Models | ✓ Link | 63.2 | 12 | GPT-J (CoRe) | 2022-10-28 |
The Unreasonable Effectiveness of Eccentric Automatic Prompts | | 61 | 70 | Llama-2 70B (first 100 questions, 4-shot, auto-optimized prompting) | 2024-02-09 |
CAPO: Cost-Aware Prompt Optimization | ✓ Link | 60.2 | | Qwen2.5-32B + CAPO | 2025-04-22 |
Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning | | 59.59 | 70 | LLaMA 2 70B (CoT-Influx) | 2023-12-14 |
Orca 2: Teaching Small Language Models How to Reason | | 59.14 | 13 | Orca 2 13B | 2023-11-18 |
Transcending Scaling Laws with 0.1% Extra Compute | | 58.5 | 540 | U-PaLM | 2022-10-20 |
Large Language Models are Zero-Shot Reasoners | ✓ Link | 58.1 | 540 | PaLM 540B (few-shot CoT) | 2022-05-24 |
GPT-4 Technical Report | ✓ Link | 57.1 | | GPT-3.5 (few-shot, k=5) | 2023-03-15 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 56.8 | 8 | Minerva 8B (maj5@100) | 2022-06-29 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 56.8 | 70 | LLaMA 2 70B (one-shot) | 2023-07-18 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 56.5 | 540 | PaLM 540B (8-shot) | 2022-06-29 |
Large Language Models Can Self-Improve | | 56.5 | 540 | PaLM 540B (CoT Prompting) | 2022-10-20 |
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | ✓ Link | 55.3 | 13 | RFT 13B | 2023-08-03 |
Large Language Models are Zero-Shot Reasoners | ✓ Link | 55.0 | 175 | Finetuned GPT-3 175B + verifier | 2022-05-24 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 54.9 | 7 | WizardMath-7B-V1.0 | 2023-08-18 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 53.1 | 33 | LLaMA 33B (maj1@k) | 2023-02-27 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 52.4 | 62 | Minerva 62B (8-shot) | 2022-06-29 |
Mistral 7B | ✓ Link | 52.2 | 7 | Mistral 7B (maj@8) | 2023-10-10 |
Llemma: An Open Language Model For Mathematics | ✓ Link | 51.5 | 34 | Llemma 34B | 2023-10-16 |
Large Language Models are Zero-Shot Reasoners | ✓ Link | 51.5 | 175 | text-davinci-002 175B (zero-plus-few-shot CoT, 8 samples) | 2022-05-24 |
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | ✓ Link | 51.2 | 7 | RFT 7B | 2023-08-03 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 50.9 | 65 | LLaMA 65B | 2023-02-27 |
Orca 2: Teaching Small Language Models How to Reason | | 47.23 | 7 | Orca 2 7B | 2023-11-18 |
The Unreasonable Effectiveness of Eccentric Automatic Prompts | | 43 | 13 | Llama-2 13B (first 100 questions, 4-shot, auto-optimized prompting) | 2024-02-09 |
Large Language Models are Zero-Shot Reasoners | ✓ Link | 41.3 | 175 | text-davinci-002 175B (2-shot, CoT) | 2022-05-24 |
The Unreasonable Effectiveness of Eccentric Automatic Prompts | | 41 | 7 | Mistral 7B (first 100 questions, 4-shot, auto-optimized prompting) | 2024-02-09 |
Large Language Models are Zero-Shot Reasoners | ✓ Link | 40.7 | 175 | text-davinci-002 175B (0-shot, CoT) | 2022-05-24 |
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ Link | 37.1 | | Branch-Train-MiX 4x7B (sampling top-2 experts) | 2024-03-12 |
Llemma: An Open Language Model For Mathematics | ✓ Link | 36.4 | 7 | Llemma 7B | 2023-10-16 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 35.6 | 33 | LLaMA 33B | 2023-02-27 |
Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning | ✓ Link | 35.2 | 13 | Vicuna (SYRELM) | 2023-12-09 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 33.0 | 62 | PaLM 62B (8-shot) | 2022-06-29 |
Large Language Models Can Self-Improve | | 32.2 | 540 | PaLM 540B (Self Improvement, Standard-Prompting) | 2022-10-20 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 29.3 | 13 | LLaMA 13B (maj1@k) | 2023-02-27 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 28.4 | 8 | Minerva 8B (maj1@k, 8-shot) | 2022-06-29 |
Composing Ensembles of Pre-trained Models via Iterative Consensus | | 20.8 | 0.355 | GPT-2-Medium 355M + question-solution classifier (BS=5) | 2022-10-20 |
Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions | ✓ Link | 19.5 | 2.7 | GPT-Neo-2.7B + Self-Sampling | 2022-05-28 |
Composing Ensembles of Pre-trained Models via Iterative Consensus | | 18.3 | 0.355 | GPT-2-Medium 355M (fine-tuned, BS=5) | 2022-10-20 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 18.1 | 7 | LLaMA 7B (maj1@k) | 2023-02-27 |
Large Language Models are Zero-Shot Reasoners | ✓ Link | 17.9 | 540 | PaLM 540B (few-shot) | 2022-05-24 |
Large Language Models Can Self-Improve | | 17.9 | 540 | PaLM 540B (Standard-Prompting) | 2022-10-20 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 17.8 | 13 | LLaMA 13B | 2023-02-27 |
Composing Ensembles of Pre-trained Models via Iterative Consensus | | 16.8 | 0.355 | GPT-2-Medium 355M + question-solution classifier (BS=1) | 2022-10-20 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 16.2 | 8 | Minerva 8B (8-shot) | 2022-06-29 |
Composing Ensembles of Pre-trained Models via Iterative Consensus | | 12.2 | 0.355 | GPT-2-Medium 355M (BS=5) | 2022-10-20 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 11.0 | 7 | LLaMA 7B | 2023-02-27 |
Large Language Models are Zero-Shot Reasoners | ✓ Link | 10.4 | 175 | text-davinci-002 175B (0-shot) | 2022-05-24 |
Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions | ✓ Link | 7.5 | 0.125 | GPT-Neo 125M + Self-Sampling | 2022-05-28 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 4.4 | 20 | UL2 20B (chain-of-thought) | 2022-05-10 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 4.1 | 8 | PaLM 8B (8-shot) | 2022-06-29 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 4.1 | 20 | UL2 20B (0-shot) | 2022-05-10 |
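
Notation used throughout the table: "SC" and "maj1@k" denote self-consistency, i.e. sampling k reasoning chains per problem and keeping the plurality final answer (maj-n@k counts a problem as solved if the gold answer is among the n most-voted answers); "verifyN@1" denotes sampling N candidate solutions and returning the one a trained verifier ranks highest. Below is a minimal sketch of how a maj1@k score is computed, assuming the final answers have already been extracted from each sampled chain; `samples` and `gold` are hypothetical toy inputs, not data from any row above.

```python
from collections import Counter

def maj1_at_k(sampled_answers: list[list[str]], gold: list[str]) -> float:
    """maj1@k (self-consistency) accuracy: for each problem, vote over the
    k sampled final answers and score the plurality answer against gold."""
    correct = 0
    for answers, target in zip(sampled_answers, gold):
        vote, _ = Counter(answers).most_common(1)[0]  # plurality final answer
        correct += (vote == target)
    return correct / len(gold)

# Hypothetical toy run: 3 problems, k=5 sampled chains each.
samples = [
    ["42", "42", "41", "42", "40"],  # plurality "42" -> correct
    ["7", "8", "8", "8", "7"],       # plurality "8"  -> correct
    ["13", "13", "13", "12", "11"],  # plurality "13" -> wrong (gold is "12")
]
print(maj1_at_k(samples, gold=["42", "8", "12"]))  # 0.666...
```

Single-sample entries (e.g. "0-shot CoT", "greedy") correspond to k=1, where this reduces to plain exact-match accuracy.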