OpenCodePapers

Arithmetic Reasoning on GSM8K

Arithmetic Reasoning
[Figure: Results over time]
Leaderboard
| Paper | Code | Accuracy | Parameters (Billion) | Model Name | Release Date |
|---|---|---|---|---|---|
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | ✓ | 97.72 | | Claude 3.5 Sonnet (HPT) | 2024-06-18 |
| Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | ✓ | 97.1 | | DUP prompt upon GPT-4 | 2024-04-23 |
| Qwen2 Technical Report | ✓ | 96.7 | 72 | Qwen2-Math-72B-Instruct (greedy) | 2024-07-15 |
| | | 96.4 | 7 | SFT-Mistral-7B (Metamath, OVM, Smart Ensemble) | |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ | 96.0 | | OpenMath2-Llama3.1-70B (majority@256) | 2024-10-02 |
| | | 95.2 | 75 | Jiutian-大模型 | |
| | | 95.1 | 7 | DAMOMath-7B(MetaMath, OVM, BS, Ensemble) | |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | 95 | | Claude 3 Opus (0-shot chain-of-thought) | 2024-03-04 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ | 94.9 | | OpenMath2-Llama3.1-70B | 2024-10-02 |
| Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | ✓ | 94.8 | | GPT-4 (Teaching-Inspired) | 2024-10-10 |
| | | 94.13 | 7 | SFT-Mistral-7B (Metamath + ovm +ensemble) | |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ | 94.1 | | OpenMath2-Llama3.1-8B (majority@256) | 2024-10-02 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | ✓ | 94.0 | | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | 2024-06-26 |
| | | 93.2 | 7 | DAMOMath-7B(MetaMath, OVM, Ensemble) | |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | 92.3 | | Claude 3 Sonnet (0-shot chain-of-thought) | 2024-03-04 |
| Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | ✓ | 92 | 70 | AlphaLLM (with MCTS) | 2024-04-18 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ | 91.7 | | OpenMath2-Llama3.1-8B | 2024-10-02 |
| PaLM 2 Technical Report | ✓ | 91.0 | | PaLM 2 (few-shot, k=8, SC) | 2023-05-17 |
| Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | ✓ | 90.91 | | GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 2024-06-18 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 90.8 | 70 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 2024-02-15 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 90.4 | 70 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 90.1 | 70 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 2024-02-15 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 89.6 | 70 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
| Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ | 89.1 | 7 | Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256) | 2023-12-14 |
| | | 89.0 | 13 | Llama SFT (Metamath ToRA Ensemble) | |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 89 | 62 | Minerva 62B (maj5@100) | 2022-06-29 |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | 88.9 | | Claude 3 Haiku (0-shot chain-of-thought) | 2024-03-04 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 88.3 | 70 | ToRA-70B (SC, k=50) | 2023-09-29 |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | ✓ | 88.2 | 7 | DeepSeekMATH-RL-7B | 2024-02-05 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 88.2 | 7 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 88.0 | 34 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 2024-02-15 |
| Model Card and Evaluations for Claude Models | | 88 | | Claude 2 (0-shot chain-of-thought) | 2023-07-11 |
| | | 87.41 | 4 | Shivaay-4B (8-shot chain-of-thought) | |
| Solving math word problems with process- and outcome-based feedback | | 87.3 | 70 | DeepMind 70B Model (SFT+ORM-RL, ORM reranking) | 2022-11-25 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ | 87.2 | 7 | MMOS-DeepSeekMath-7B(0-shot,k=50) | 2024-02-23 |
| Solving math word problems with process- and outcome-based feedback | | 87.1 | 70 | DeepMind 70B Model (SFT+PRM-RL, PRM reranking) | 2022-11-25 |
| Sparks of Artificial General Intelligence: Early experiments with GPT-4 | ✓ | 87.1 | | GPT-4 | 2023-03-22 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 86.9 | 7 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 2024-02-15 |
| Orca-Math: Unlocking the potential of SLMs in Grade School Math | | 86.8 | 7 | Orca-Math 7B (fine-tuned) | 2024-02-16 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 86.8 | 7 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 86.8 | 13 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 2024-02-15 |
| Gemini: A Family of Highly Capable Multimodal Models | ✓ | 86.5 | | Gemini Pro (maj1@32) | 2023-12-19 |
| | | 85.5 | | Codex (Self-Evaluation Guided Decoding, PAL, multiple reasoning chains, 9-shot gen, 5-shot eval) | |
| Model Card and Evaluations for Claude Models | | 85.2 | | Claude 1.3 (0-shot chain-of-thought) | 2023-07-11 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 85.1 | 34 | ToRA-Code-34B (SC, k=50) | 2023-09-29 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 84.8 | 7 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 2024-02-15 |
| OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning | ✓ | 84.7 | 7 | OVM-Mistral-7B (verify100@1) | 2023-11-16 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 84.7 | 70 | OpenMath-Llama2-70B (w/ code) | 2024-02-15 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 84.6 | 70 | OpenMath-CodeLlama-70B (w/ code) | 2024-02-15 |
| LEVER: Learning to Verify Language-to-Code Generation with Execution | ✓ | 84.5 | 175 | code-davinci-002 175B (LEVER, 8-shot) | 2023-02-16 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 84.3 | 70 | ToRA 70B | 2023-09-29 |
| Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ | 84.1 | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) | 2023-12-14 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 83.9 | 70 | MathCoder-L-70B | 2023-10-05 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ | 83.2 | 7 | WizardMath-7B-V1.1 | 2023-08-18 |
| Making Large Language Models Better Reasoners with Step-Aware Verifier | | 83.2 | 175 | DIVERSE 175B (8-shot) | 2022-06-06 |
| OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning | ✓ | 82.6 | 7 | OVM-Mistral-7B (verify20@1) | 2023-11-16 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 82.6 | 7 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
| The ART of LLM Refinement: Ask, Refine, and Trust | | 82.6 | | ChatGPT (Ask, Refine, Trust) | 2023-11-14 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 82.5 | 8 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ | 82.3 | 70 | MetaMath 70B | 2023-09-21 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ | 82.3 | 70 | MuggleMATH 70B | 2023-10-09 |
| Large Language Models Can Self-Improve | | 82.1 | 540 | PaLM 540B (Self Improvement, Self Consistency) | 2022-10-20 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 81.7 | 34 | MathCoder-CL-34B | 2023-10-05 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ | 81.6 | 70 | WizardMath-70B-V1.0 | 2023-08-18 |
| TinyGSM: achieving >80% on GSM8k with small language models | | 81.5 | 2.6 | Phi-GSM+V 1.3B+1.3B (verify48@1) | 2023-12-14 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 81.1 | 7 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 81.1 | 8 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
| Model Card and Evaluations for Claude Models | | 80.9 | | Claude Instant 1.1 (0-shot chain-of-thought) | 2023-07-11 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 80.7 | 34 | ToRA-Code 34B | 2023-09-29 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 80.7 | 34 | OpenMath-CodeLlama-34B (w/ code) | 2024-02-15 |
| PaLM 2 Technical Report | ✓ | 80.7 | | PaLM 2 (few-shot, k=8, CoT) | 2023-05-17 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ | 80.5 | 7 | MMOS-DeepSeekMath-7B(0-shot) | 2024-02-23 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ | 80.4 | 34 | MMOS-CODE-34B(0-shot) | 2024-02-23 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 80.2 | 7 | OpenMath-Mistral-7B (w/ code) | 2024-02-15 |
| | | 80.2 | | Self-Evaluation Guided Decoding (Codex, PAL, single reasoning chain, 9-shot gen, 5-shot eval) | |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 78.8 | 13 | OpenMath-CodeLlama-13B (w/ code) | 2024-02-15 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 78.5 | 540 | Minerva 540B (CoT) | 2022-06-29 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ | 78.3 | | Camelidae-8×34B (5-shot) | 2024-01-05 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ | 77.8 | | Qwen2idae-16x14B (5-shot) | 2024-01-05 |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ | 77.7 | 7 | MetaMath-Mistral-7B | 2023-09-21 |
| OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ✓ | 77.3 | 7 | OpenChat-3.5 7B | 2023-09-20 |
| Solving math word problems with process- and outcome-based feedback | | 76.5 | 70 | DeepMind 70B Model (STaR, maj1@96) | 2022-11-25 |
| | | 76.4 | 7 | Arithmo2-Mistral-7B | |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 75.9 | 7 | OpenMath-CodeLlama-7B (w/ code) | 2024-02-15 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 75.8 | 13 | ToRA-Code 13B | 2023-09-29 |
| | | 74.7 | 7 | Arithmo-Mistral-7B | |
| Self-Consistency Improves Chain of Thought Reasoning in Language Models | ✓ | 74.4 | 540 | PaLM 540B maj1@40 (8-shot) | 2022-03-21 |
| Large Language Models Can Self-Improve | | 74.4 | 540 | PaLM 540B (Self Consistency) | 2022-10-20 |
| TinyGSM: achieving >80% on GSM8k with small language models | | 74.3 | 2.7 | Phi-GSM 2.7B (fine-tuned) | 2023-12-14 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 74.1 | 7 | MathCoder-CL-13B | 2023-10-05 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ | 74 | 13 | MuggleMATH 13B | 2023-10-09 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ | 73.9 | 7 | MMOS-CODE-7B(0-shot) | 2024-02-23 |
| CodeT5+: Open Code Large Language Models for Code Understanding and Generation | ✓ | 73.8 | 0.77 | CodeT5+ | 2023-05-13 |
| CAPO: Cost-Aware Prompt Optimization | ✓ | 73.73 | | Llama-3.3-70B + CAPO | 2025-04-22 |
| OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning | ✓ | 73.7 | 7 | OVM-Llama2-7B (verify100@1) | 2023-11-16 |
| Large Language Models Can Self-Improve | | 73.5 | 540 | PaLM 540B (Self Improvement, CoT Prompting) | 2022-10-20 |
| KwaiYiiMath: Technical Report | | 73.3 | 13 | KwaiYiiMath 13B | 2023-10-11 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 72.6 | 7 | ToRA-Code 7B | 2023-09-29 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 72.6 | 13 | MathCoder-L-13B | 2023-10-05 |
| | | 72.3 | | DBRX Base 132B | |
| | | 71.9 | | Self-Evaluation Guided Decoding (Codex, CoT, single reasoning chain, 9-shot gen, 5-shot eval) | |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ | 71.0 | 13 | MetaMath 13B | 2023-09-21 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ | 69.8 | 7 | MuggleMATH 7B | 2023-10-09 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 69.7 | 65 | LLaMA 65B-maj1@k | 2023-02-27 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 68.5 | 62 | Minerva 62B (maj1@100) | 2022-06-29 |
| Least-to-Most Prompting Enables Complex Reasoning in Large Language Models | ✓ | 68.01 | 175 | code-davinci-002 (Least-to-Most Prompting) | 2022-05-21 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 67.8 | 7 | MathCoder-CL-7B | 2023-10-05 |
| | | 66.9 | | DBRX Instruct 132B | |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ | 66.4 | 7 | MetaMath 7B | 2023-09-21 |
| CAPO: Cost-Aware Prompt Optimization | ✓ | 65.07 | | Mistral-Small-24B + CAPO | 2025-04-22 |
| Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | ✓ | 64.8 | 79 | RFT 70B | 2023-08-03 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 64.2 | 7 | MathCoder-L-7B | 2023-10-05 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ | 63.9 | 13 | WizardMath-13B-V1.0 | 2023-08-18 |
| Solving Math Word Problems via Cooperative Reasoning induced Language Models | ✓ | 63.2 | 12 | GPT-J (CoRe) | 2022-10-28 |
| The Unreasonable Effectiveness of Eccentric Automatic Prompts | | 61 | 70 | Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting) | 2024-02-09 |
| CAPO: Cost-Aware Prompt Optimization | ✓ | 60.2 | | Qwen2.5-32B + CAPO | 2025-04-22 |
| Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning | | 59.59 | 70 | LLaMA 2 70B (CoT-Influx) | 2023-12-14 |
| Orca 2: Teaching Small Language Models How to Reason | | 59.14 | 13 | Orca 2 13B | 2023-11-18 |
| Transcending Scaling Laws with 0.1% Extra Compute | | 58.5 | 540 | U-PaLM | 2022-10-20 |
| Large Language Models are Zero-Shot Reasoners | ✓ | 58.1 | 540 | PaLM-540B (few-Shot-cot) | 2022-05-24 |
| GPT-4 Technical Report | ✓ | 57.1 | | GPT-3.5 (few-shot, k=5) | 2023-03-15 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 56.8 | 8 | Minerva 8B (maj5@100) | 2022-06-29 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ | 56.8 | 70 | LLaMA 2 70B (on-shot) | 2023-07-18 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 56.5 | 540 | PaLM 540B (8-shot) | 2022-06-29 |
| Large Language Models Can Self-Improve | | 56.5 | 540 | PaLM 540B (CoT Prompting) | 2022-10-20 |
| Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | ✓ | 55.3 | 13 | RFT 13B | 2023-08-03 |
| Large Language Models are Zero-Shot Reasoners | ✓ | 55.0 | 175 | Finetuned GPT-3 175B + verifier | 2022-05-24 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ | 54.9 | 7 | WizardMath-7B-V1.0 | 2023-08-18 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 53.1 | 33 | LLaMA 33B-maj1@k | 2023-02-27 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 52.4 | 62 | Minerva 62B (8-shot) | 2022-06-29 |
| Mistral 7B | ✓ | 52.2 | 7 | Mistral 7B (maj@8) | 2023-10-10 |
| Llemma: An Open Language Model For Mathematics | ✓ | 51.5 | 34 | Llemma 34B | 2023-10-16 |
| Large Language Models are Zero-Shot Reasoners | ✓ | 51.5 | 175 | Text-davinci-002-175B (zero-plus-few-Shot-cot (8 samples)) | 2022-05-24 |
| Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | ✓ | 51.2 | 7 | RFT 7B | 2023-08-03 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 50.9 | 65 | LLaMA 65B | 2023-02-27 |
| Orca 2: Teaching Small Language Models How to Reason | | 47.23 | 7 | Orca 2 7B | 2023-11-18 |
| The Unreasonable Effectiveness of Eccentric Automatic Prompts | | 43 | 13 | Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting) | 2024-02-09 |
| Large Language Models are Zero-Shot Reasoners | ✓ | 41.3 | 175 | text-davinci-002 175B (2-shot, CoT) | 2022-05-24 |
| The Unreasonable Effectiveness of Eccentric Automatic Prompts | | 41 | 7 | Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting) | 2024-02-09 |
| Large Language Models are Zero-Shot Reasoners | ✓ | 40.7 | 175 | text-davinci-002 175B (0-shot, CoT) | 2022-05-24 |
| Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ | 37.1 | | Branch-Train-MiX 4x7B (sampling top-2 experts) | 2024-03-12 |
| Llemma: An Open Language Model For Mathematics | ✓ | 36.4 | 7 | Llemma 7B | 2023-10-16 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 35.6 | 33 | LLaMA 33B | 2023-02-27 |
| Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning | ✓ | 35.2 | 13 | Vicuna (SYRELM) | 2023-12-09 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 33.0 | 62 | PaLM 62B (8-shot) | 2022-06-29 |
| Large Language Models Can Self-Improve | | 32.2 | 540 | PaLM 540B (Self Improvement, Standard-Prompting) | 2022-10-20 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 29.3 | 13 | LLaMA 13B-maj1@k | 2023-02-27 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 28.4 | 8 | Minerva 8B-maj1@k (8-shot) | 2022-06-29 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 20.8 | 0.355 | GPT-2-Medium 355M + question-solution classifier (BS=5) | 2022-10-20 |
| Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions | ✓ | 19.5 | 2.7 | GPT-Neo-2.7B + Self-Sampling | 2022-05-28 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 18.3 | 0.355 | GPT-2-Medium 355M (fine-tuned, BS=5) | 2022-10-20 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 18.1 | 7 | LLaMA 7B (maj1@k) | 2023-02-27 |
| Large Language Models are Zero-Shot Reasoners | ✓ | 17.9 | 540 | PaLM 540B (few-shot) | 2022-05-24 |
| Large Language Models Can Self-Improve | | 17.9 | 540 | PaLM 540B (Standard-Prompting) | 2022-10-20 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 17.8 | 13 | LLaMA 13B | 2023-02-27 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 16.8 | 0.355 | GPT-2-Medium 355M + question-solution classifier (BS=1) | 2022-10-20 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 16.2 | 8 | Minerva 8B (8-shot) | 2022-06-29 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 12.2 | 0.355 | GPT-2-Medium 355M (BS=5) | 2022-10-20 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 11.0 | 7 | LLaMA 7B | 2023-02-27 |
| Large Language Models are Zero-Shot Reasoners | ✓ | 10.4 | 175 | Text-davinci-002-175B (0-shot) | 2022-05-24 |
| Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions | ✓ | 7.5 | 0.125 | GPT-Neo 125M + Self-Sampling | 2022-05-28 |
| UL2: Unifying Language Learning Paradigms | ✓ | 4.4 | 20 | UL2 20B (chain-of-thought) | 2022-05-10 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 4.1 | 8 | PaLM 8B (8-shot) | 2022-06-29 |
| UL2: Unifying Language Learning Paradigms | ✓ | 4.1 | 20 | UL2 20B (0-shot) | 2022-05-10 |