OpenCodePapers

Math Word Problem Solving on MATH

Mathematical Reasoning / Math Word Problem Solving
Leaderboard
| Paper | Code | Accuracy (%) | Parameters (Billions) | Model Name | Release Date |
| --- | --- | --- | --- | --- | --- |
| | | 89.7 | | Gemini 2.0 Flash Experimental | |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 88.1 | 72 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) | 2024-09-18 |
| MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | ✓ | 87.92 | 0 | GPT-4 Turbo (MACM, w/code, voting) | 2024-04-06 |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 85.9 | 72 | Qwen2.5-Math-72B-Instruct(COT,Greedy) | 2024-09-18 |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 85.2 | 7 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) | 2024-09-18 |
| Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ | 84.3 | | GPT-4-code model (CSV, w/ code, SC, k=16) | 2023-08-15 |
| Qwen2 Technical Report | ✓ | 84.0 | 72 | Qwen2-Math-72B-Instruct(greedy) | 2024-07-15 |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 83.6 | 7 | Qwen2.5-Math-7B-Instruct(COT,Greedy) | 2024-09-18 |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 79.9 | 1.5 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) | 2024-09-18 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ | 79.6 | | OpenMath2-Llama3.1-70B (majority@256) | 2024-10-02 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ | 76.1 | | OpenMath2-Llama3.1-8B (majority@256) | 2024-10-02 |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 75.8 | 1.5 | Qwen2.5-Math-1.5B-Instruct(COT,Greedy) | 2024-09-18 |
| Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ | 73.5 | | GPT-4-code model (CSV, w/ code) | 2023-08-15 |
| Cumulative Reasoning with Large Language Models | ✓ | 72.2 | | CR (GPT-4-turbo model, w/ code) | 2023-08-08 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ | 71.9 | | OpenMath2-Llama3.1-70B | 2024-10-02 |
| Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ | 71.2 | | LogicNet (with code interpreter) | 2023-08-15 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | ✓ | 70.8 | | Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code) | 2024-06-26 |
| Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ | 69.7 | | GPT-4-code model (w/ code) | 2023-08-15 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ | 67.8 | | OpenMath2-Llama3.1-8B | 2024-10-02 |
| AlphaMath Almost Zero: Process Supervision without Process | ✓ | 66.3 | | AlphaMath-7B-SBS@3 | 2024-05-06 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 64.9 | 62 | Minerva 62B (maj5@256) | 2022-06-29 |
| | | 64.5 | 7 | DAMOMath-7B | |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ | 63.7 | 7 | MMOS-DeepSeekMath-7B(0-shot,k=50) | 2024-02-23 |
| Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ | 60.8 | | GPT-4-code model (w/o code) | 2023-08-15 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 60.4 | 70 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 2024-02-15 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 60.2 | 34 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 2024-02-15 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 60.0 | 34 | ToRA-Code 34B model (w/ code, SC, k=50) | 2023-09-29 |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | ✓ | 58.8 | 7 | DeepSeekMATH-RL-7B (w/ code, greedy decoding) | 2024-02-05 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 58.3 | 70 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 2024-02-15 |
| Cumulative Reasoning with Large Language Models | ✓ | 58.0 | | CR (GPT-4 model, w/o code) | 2023-08-08 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 57.6 | 13 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 2024-02-15 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 57.2 | 7 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 2024-02-15 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 56.9 | 70 | ToRA 70B (w/ code, SC, k=50) | 2023-09-29 |
| Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models | | 56.4 | | SKiC (GPT-4 model) | 2023-08-01 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 56.1 | 70 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 55.6 | 7 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 2024-02-15 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ | 55.0 | 7 | MMOS-DeepSeekMath-7B(0-shot) | 2024-02-23 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 54.9 | 70 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
| Progressive-Hint Prompting Improves Reasoning in Large Language Models | ✓ | 53.9 | | PHP (GPT-4 model) | 2023-04-19 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 53.6 | 7 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
| Gemini: A Family of Highly Capable Multimodal Models | ✓ | 53.2 | | Gemini Ultra (4-shot) | 2023-12-19 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 52.9 | 7 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
| PAL: Program-aided Language Models | ✓ | 51.8 | | GPT-4 model (w/ code, PAL) | 2022-11-18 |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | ✓ | 51.7 | 7 | DeepSeekMATH-RL-7B (greedy decoding) | 2024-02-05 |
| Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | ✓ | 51 | | AlphaLLM (with MCTS) | 2024-04-18 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 50.8 | 34 | ToRA-Code 34B (w/ code) | 2023-09-29 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 50.7 | 70 | OpenMath-CodeLlama-70B (w/ code) | 2024-02-15 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 50.3 | | Minerva 540B (maj1@k, k=64) | 2022-06-29 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 49.7 | 70 | ToRA 70B (w/ code) | 2023-09-29 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ | 49.5 | 34 | MMOS-CODE-34B(0-shot) | 2024-02-23 |
| Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 48.8 | 7 | DeepSeekMath-7B-KPMath-Plus | 2024-03-04 |
| PaLM 2 Technical Report | ✓ | 48.8 | | PaLM 2 (few-shot, k=4, SC) | 2023-05-17 |
| Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 48.6 | 34 | Llemma-34B-KPMath-Plus | 2024-03-04 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 48.3 | 34 | OpenMath-CodeLlama-34B (w/ code) | 2024-02-15 |
| Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ | 48.1 | 67 | Shepherd + DeepSeek-67B (SFT on MetaMATH + PRM rerank, k=256) | 2023-12-14 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 48.1 | 13 | ToRA-Code 13B (w/ code) | 2023-09-29 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 47.6 | 8 | Minerva 8B (maj5@256) | 2022-06-29 |
| Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 46.8 | 7 | Mistral-7B-KPMath-Plus | 2024-03-04 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 46.6 | 8 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 46.3 | 70 | OpenMath-Llama2-70B (w/ code) | 2024-02-15 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 45.5 | 13 | OpenMath-CodeLlama-13B (w/ code) | 2024-02-15 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 45.5 | 7 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 45.3 | 8 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 45.2 | 34 | MathCoder-CL-34B | 2023-10-05 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 45.1 | 34 | MathCoder-L-34B | 2023-10-05 |
| Augmenting Math Word Problems via Iterative Question Composing | ✓ | 45.0 | 72 | MMIQC-72B | 2024-01-17 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 44.6 | 7 | ToRA-Code 7B (w/ code) | 2023-09-29 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 44.5 | 7 | OpenMath-Mistral-7B (w/ code) | 2024-02-15 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ | 44.3 | 7 | MMOS-CODE-7B(0-shot) | 2024-02-23 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ | 43.6 | 7 | OpenMath-CodeLlama-7B (w/ code) | 2024-02-15 |
| Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ | 43.5 | 7 | Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256) | 2023-12-14 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ | 43.5 | 7 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 43.4 | 62 | Minerva 62B (maj1@k, k=64) | 2022-06-29 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 43.0 | 13 | ToRA 13B (w/ code) | 2023-09-29 |
| Sparks of Artificial General Intelligence: Early experiments with GPT-4 | ✓ | 42.5 | | GPT-4 | 2023-03-22 |
| | | 41.8 | 7 | SFT-Mistral-7B | |
| Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 41 | 13 | Llama2-13B-KPMath-Plus | 2024-03-04 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ | 40.1 | 7 | ToRA 7B (w/ code) | 2023-09-29 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 35.9 | 13 | MathCoder-CL-13B | 2023-10-05 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ | 35.6 | 70 | MuggleMATH-70B | 2023-10-09 |
| PaLM 2 Technical Report | ✓ | 34.3 | | PaLM 2 (few-shot, k=4, CoT) | 2023-05-17 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 33.6 | 540 | Minerva 540B | 2022-06-29 |
| Galactica: A Large Language Model for Science | ✓ | 33.6 | 540 | Minerva 540B (5-shot) mCoT | 2022-11-16 |
| Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ | 33.0 | 7 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) | 2023-12-14 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ | 33.0 | 7 | WizardMath-7B-V1.1 | 2023-08-18 |
| Gemini: A Family of Highly Capable Multimodal Models | ✓ | 32.6 | | Gemini Pro (4-shot) | 2023-12-19 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ | 30.7 | 13 | MuggleMATH-13B | 2023-10-09 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 30.2 | 7 | MathCoder-CL-7B | 2023-10-05 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 29.9 | 13 | MathCoder-L-13B | 2023-10-05 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ | 29.9 | | Qwen2idae-16x14B (4-shot) | 2024-01-05 |
| OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ✓ | 28.9 | 7 | OpenChat-3.5-1210 7B | 2023-09-20 |
| OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ✓ | 28.6 | 7 | OpenChat-3.5 7B | 2023-09-20 |
| Mixtral of Experts | ✓ | 28.4 | | Mixtral 8x7B (maj@4) | 2024-01-08 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 27.6 | 62 | Minerva 62B (4-shot) | 2022-06-29 |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ | 26.0 | 70 | MetaMath 70B | 2023-09-21 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ | 25.8 | 7 | MuggleMATH 7B | 2023-10-09 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 25.4 | 8 | Minerva 8B (maj1@k, k=64) | 2022-06-29 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ | 23.3 | 7 | MathCoder-L-7B | 2023-10-05 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ | 22.7 | 70 | WizardMath-70B-V1.0 | 2023-08-18 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ | 22.6 | | Camelidae-8×34B (4-shot) | 2024-01-05 |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ | 22.5 | 13 | MetaMath 13B | 2023-09-21 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 20.5 | 65 | LLaMA 65B (maj1@k) | 2023-02-27 |
| Galactica: A Large Language Model for Science | ✓ | 20.4 | 120 | GAL 120B (5-shot) mCoT | 2022-11-16 |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ | 19.4 | 7 | MetaMath 7B | 2023-09-21 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 19.1 | 175 | davinci-002 175B | 2022-06-29 |
| Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ | 17.8 | | Branch-Train-MiX 4x7B (sampling top-2 experts) | 2024-03-12 |
| Galactica: A Large Language Model for Science | ✓ | 16.6 | 120 | GAL 120B <work> | 2022-11-16 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 15.2 | 33 | LLaMA 33B-maj1@k | 2023-02-27 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 14.1 | 8 | Minerva 8B | 2022-06-29 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ | 14.0 | 13 | WizardMath-13B-V1.0 | 2023-08-18 |
| Mistral 7B | ✓ | 13.1 | 7 | Mistral 7B (maj@4) | 2023-10-10 |
| Galactica: A Large Language Model for Science | ✓ | 12.7 | 30 | GAL 30B (5-shot) mCoT | 2022-11-16 |
| Mixtral of Experts | ✓ | 12.7 | 7 | Mistral 7B (maj@4) | 2024-01-08 |
| Galactica: A Large Language Model for Science | ✓ | 11.4 | 30 | GAL 30B <work> | 2022-11-16 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ | 10.7 | 7 | WizardMath-7B-V1.0 | 2023-08-18 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 10.6 | 65 | LLaMA 65B | 2023-02-27 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 8.8 | 540 | PaLM 540B | 2022-06-29 |
| Galactica: A Large Language Model for Science | ✓ | 8.8 | 540 | PaLM 540B (5-shot) mCoT | 2022-11-16 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 8.8 | 13 | LLaMA 13B-maj1@k | 2023-02-27 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 7.1 | 33 | LLaMA 33B | 2023-02-27 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 6.9 | 7 | LLaMA 7B-maj1@k | 2023-02-27 |
| Measuring Mathematical Problem Solving With the MATH Dataset | ✓ | 6.9 | 1.5 | GPT-2 (1.5B) | 2021-03-05 |
| Measuring Mathematical Problem Solving With the MATH Dataset | ✓ | 6.4 | 0.7 | GPT-2 (0.7B) | 2021-03-05 |
| Measuring Mathematical Problem Solving With the MATH Dataset | ✓ | 6.2 | 0.3 | GPT-2 (0.3B) | 2021-03-05 |
| Measuring Mathematical Problem Solving With the MATH Dataset | ✓ | 5.6 | 13 | GPT-3 13B | 2021-03-05 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 5.6 | 8 | PaLM 8B (fine-tuned) | 2022-06-29 |
| Measuring Mathematical Problem Solving With the MATH Dataset | ✓ | 5.4 | 0.1 | GPT-2 (0.1B) | 2021-03-05 |
| Measuring Mathematical Problem Solving With the MATH Dataset | ✓ | 5.2 | 175 | GPT-3-175B (few-shot) | 2021-03-05 |
| Galactica: A Large Language Model for Science | ✓ | 5.2 | 175 | GPT-3 175B (8-shot) | 2022-11-16 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 4.4 | 62 | PaLM 62B | 2022-06-29 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 3.9 | 13 | LLaMA 13B | 2023-02-27 |
| Measuring Mathematical Problem Solving With the MATH Dataset | ✓ | 3.0 | 13 | GPT-3-13B (few-shot) | 2021-03-05 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 2.9 | 7 | LLaMA 7B | 2023-02-27 |
| Measuring Mathematical Problem Solving With the MATH Dataset | ✓ | 2.9 | 2.7 | GPT-3 2.7B | 2021-03-05 |
| Solving Quantitative Reasoning Problems with Language Models | ✓ | 1.5 | 8 | PaLM 8B | 2022-06-29 |