Paper | Code | Accuracy (%) | Params (B) | Model | Date
--- | --- | --- | --- | --- | ---
| | 89.7 | | Gemini 2.0 Flash Experimental | |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 88.1 | 72 | Qwen2.5-Math-72B-Instruct (TIR, greedy) | 2024-09-18 |
MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | ✓ Link | 87.9 | | GPT-4 Turbo (MACM, w/ code, voting) | 2024-04-06 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 85.9 | 72 | Qwen2.5-Math-72B-Instruct (CoT, greedy) | 2024-09-18 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 85.2 | 7 | Qwen2.5-Math-7B-Instruct (TIR, greedy) | 2024-09-18 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 84.3 | | GPT-4-code model (CSV, w/ code, SC, k=16) | 2023-08-15 |
Qwen2 Technical Report | ✓ Link | 84.0 | 72 | Qwen2-Math-72B-Instruct (greedy) | 2024-07-15 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 83.6 | 7 | Qwen2.5-Math-7B-Instruct (CoT, greedy) | 2024-09-18 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 79.9 | 1.5 | Qwen2.5-Math-1.5B-Instruct (TIR, greedy) | 2024-09-18 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 79.6 | | OpenMath2-Llama3.1-70B (majority@256) | 2024-10-02 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 76.1 | | OpenMath2-Llama3.1-8B (majority@256) | 2024-10-02 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 75.8 | 1.5 | Qwen2.5-Math-1.5B-Instruct (CoT, greedy) | 2024-09-18 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 73.5 | | GPT-4-code model (CSV, w/ code) | 2023-08-15 |
Cumulative Reasoning with Large Language Models | ✓ Link | 72.2 | | CR (GPT-4-turbo model, w/ code) | 2023-08-08 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 71.9 | | OpenMath2-Llama3.1-70B | 2024-10-02 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 71.2 | | LogicNet (with code interpreter) | 2023-08-15 |
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | ✓ Link | 70.8 | | Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code) | 2024-06-26 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 69.7 | | GPT-4-code model (w/ code) | 2023-08-15 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 67.8 | | OpenMath2-Llama3.1-8B | 2024-10-02 |
AlphaMath Almost Zero: Process Supervision without Process | ✓ Link | 66.3 | | AlphaMath-7B-SBS@3 | 2024-05-06 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 64.9 | 62 | Minerva 62B (maj5@256) | 2022-06-29 |
| | 64.5 | 7 | DAMOMath-7B | |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 63.7 | 7 | MMOS-DeepSeekMath-7B (0-shot, k=50) | 2024-02-23 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 60.8 | | GPT-4-code model (w/o code) | 2023-08-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 60.4 | 70 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 2024-02-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 60.2 | 34 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 2024-02-15 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 60.0 | 34 | ToRA-Code 34B model (w/ code, SC, k=50) | 2023-09-29 |
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | ✓ Link | 58.8 | 7 | DeepSeekMath-RL-7B (w/ code, greedy decoding) | 2024-02-05 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 58.3 | 70 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 2024-02-15 |
Cumulative Reasoning with Large Language Models | ✓ Link | 58.0 | | CR (GPT-4 model, w/o code) | 2023-08-08 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 57.6 | 13 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 2024-02-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 57.2 | 7 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 2024-02-15 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 56.9 | 70 | ToRA 70B (w/ code, SC, k=50) | 2023-09-29 |
Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models | | 56.4 | | SKiC (GPT-4 model) | 2023-08-01 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 56.1 | 70 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 55.6 | 7 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 2024-02-15 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 55.0 | 7 | MMOS-DeepSeekMath-7B (0-shot) | 2024-02-23 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 54.9 | 70 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
Progressive-Hint Prompting Improves Reasoning in Large Language Models | ✓ Link | 53.9 | | PHP (GPT-4 model) | 2023-04-19 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 53.6 | 7 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 53.2 | | Gemini Ultra (4-shot) | 2023-12-19 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 52.9 | 7 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
PAL: Program-aided Language Models | ✓ Link | 51.8 | | GPT-4 model (w/ code, PAL) | 2022-11-18 |
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | ✓ Link | 51.7 | 7 | DeepSeekMath-RL-7B (greedy decoding) | 2024-02-05 |
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | ✓ Link | 51.0 | | AlphaLLM (with MCTS) | 2024-04-18 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 50.8 | 34 | ToRA-Code 34B (w/ code) | 2023-09-29 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 50.7 | 70 | OpenMath-CodeLlama-70B (w/ code) | 2024-02-15 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 50.3 | | Minerva 540B (maj1@k, k=64) | 2022-06-29 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 49.7 | 70 | ToRA 70B (w/ code) | 2023-09-29 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 49.5 | 34 | MMOS-CODE-34B (0-shot) | 2024-02-23 |
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 48.8 | 7 | DeepSeekMath-7B-KPMath-Plus | 2024-03-04 |
PaLM 2 Technical Report | ✓ Link | 48.8 | | PaLM 2 (few-shot, k=4, SC) | 2023-05-17 |
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 48.6 | 34 | Llemma-34B-KPMath-Plus | 2024-03-04 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 48.3 | 34 | OpenMath-CodeLlama-34B (w/ code) | 2024-02-15 |
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ Link | 48.1 | 67 | Shepherd + DeepSeek-67B (SFT on MetaMath + PRM rerank, k=256) | 2023-12-14 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 48.1 | 13 | ToRA-Code 13B (w/ code) | 2023-09-29 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 47.6 | 8 | Minerva 8B (maj5@256) | 2022-06-29 |
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 46.8 | 7 | Mistral-7B-KPMath-Plus | 2024-03-04 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 46.6 | 8 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 46.3 | 70 | OpenMath-Llama2-70B (w/ code) | 2024-02-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 45.5 | 13 | OpenMath-CodeLlama-13B (w/ code) | 2024-02-15 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 45.5 | 7 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 45.3 | 8 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 45.2 | 34 | MathCoder-CL-34B | 2023-10-05 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 45.1 | 34 | MathCoder-L-34B | 2023-10-05 |
Augmenting Math Word Problems via Iterative Question Composing | ✓ Link | 45.0 | 72 | MMIQC-72B | 2024-01-17 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 44.6 | 7 | ToRA-Code 7B (w/ code) | 2023-09-29 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 44.5 | 7 | OpenMath-Mistral-7B (w/ code) | 2024-02-15 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 44.3 | 7 | MMOS-CODE-7B (0-shot) | 2024-02-23 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 43.6 | 7 | OpenMath-CodeLlama-7B (w/ code) | 2024-02-15 |
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ Link | 43.5 | 7 | Shepherd + Mistral-7B (SFT on MetaMath + PRM RL + PRM rerank, k=256) | 2023-12-14 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 43.5 | 7 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 43.4 | 62 | Minerva 62B (maj1@k, k=64) | 2022-06-29 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 43.0 | 13 | ToRA 13B (w/ code) | 2023-09-29 |
Sparks of Artificial General Intelligence: Early experiments with GPT-4 | ✓ Link | 42.5 | | GPT-4 | 2023-03-22 |
| | 41.8 | 7 | SFT-Mistral-7B | |
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 41.0 | 13 | Llama2-13B-KPMath-Plus | 2024-03-04 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 40.1 | 7 | ToRA 7B (w/ code) | 2023-09-29 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 35.9 | 13 | MathCoder-CL-13B | 2023-10-05 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 35.6 | 70 | MuggleMATH-70B | 2023-10-09 |
PaLM 2 Technical Report | ✓ Link | 34.3 | | PaLM 2 (few-shot, k=4, CoT) | 2023-05-17 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 33.6 | 540 | Minerva 540B | 2022-06-29 |
Galactica: A Large Language Model for Science | ✓ Link | 33.6 | 540 | Minerva 540B (5-shot) mCoT | 2022-11-16 |
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ Link | 33.0 | 7 | Shepherd + Mistral-7B (SFT on MetaMath + PRM RL) | 2023-12-14 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 33.0 | 7 | WizardMath-7B-V1.1 | 2023-08-18 |
Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 32.6 | | Gemini Pro (4-shot) | 2023-12-19 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 30.7 | 13 | MuggleMATH-13B | 2023-10-09 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 30.2 | 7 | MathCoder-CL-7B | 2023-10-05 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 29.9 | 13 | MathCoder-L-13B | 2023-10-05 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 29.9 | | Qwen2idae-16x14B (4-shot) | 2024-01-05 |
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ✓ Link | 28.9 | 7 | OpenChat-3.5-1210 7B | 2023-09-20 |
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ✓ Link | 28.6 | 7 | OpenChat-3.5 7B | 2023-09-20 |
Mixtral of Experts | ✓ Link | 28.4 | | Mixtral 8x7B (maj@4) | 2024-01-08 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 27.6 | 62 | Minerva 62B (4-shot) | 2022-06-29 |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 26.0 | 70 | MetaMath 70B | 2023-09-21 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 25.8 | 7 | MuggleMATH 7B | 2023-10-09 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 25.4 | 8 | Minerva 8B (maj1@k, k=64) | 2022-06-29 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 23.3 | 7 | MathCoder-L-7B | 2023-10-05 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 22.7 | 70 | WizardMath-70B-V1.0 | 2023-08-18 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 22.6 | | Camelidae-8×34B (4-shot) | 2024-01-05 |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 22.5 | 13 | MetaMath 13B | 2023-09-21 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 20.5 | 65 | LLaMA 65B (maj1@k) | 2023-02-27 |
Galactica: A Large Language Model for Science | ✓ Link | 20.4 | 120 | GAL 120B (5-shot) mCoT | 2022-11-16 |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 19.4 | 7 | MetaMath 7B | 2023-09-21 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 19.1 | 175 | davinci-002 175B | 2022-06-29 |
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ Link | 17.8 | | Branch-Train-MiX 4x7B (sampling top-2 experts) | 2024-03-12 |
Galactica: A Large Language Model for Science | ✓ Link | 16.6 | 120 | GAL 120B <work> | 2022-11-16 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 15.2 | 33 | LLaMA 33B (maj1@k) | 2023-02-27 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 14.1 | 8 | Minerva 8B | 2022-06-29 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 14.0 | 13 | WizardMath-13B-V1.0 | 2023-08-18 |
Mistral 7B | ✓ Link | 13.1 | 7 | Mistral 7B (maj@4) | 2023-10-10 |
Galactica: A Large Language Model for Science | ✓ Link | 12.7 | 30 | GAL 30B (5-shot) mCoT | 2022-11-16 |
Mixtral of Experts | ✓ Link | 12.7 | 7 | Mistral 7B (maj@4) | 2024-01-08 |
Galactica: A Large Language Model for Science | ✓ Link | 11.4 | 30 | GAL 30B <work> | 2022-11-16 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 10.7 | 7 | WizardMath-7B-V1.0 | 2023-08-18 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 10.6 | 65 | LLaMA 65B | 2023-02-27 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 8.8 | 540 | PaLM 540B | 2022-06-29 |
Galactica: A Large Language Model for Science | ✓ Link | 8.8 | 540 | PaLM 540B (5-shot) mCoT | 2022-11-16 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 8.8 | 13 | LLaMA 13B (maj1@k) | 2023-02-27 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 7.1 | 33 | LLaMA 33B | 2023-02-27 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 6.9 | 7 | LLaMA 7B (maj1@k) | 2023-02-27 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 6.9 | 1.5 | GPT-2 (1.5B) | 2021-03-05 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 6.4 | 0.7 | GPT-2 (0.7B) | 2021-03-05 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 6.2 | 0.3 | GPT-2 (0.3B) | 2021-03-05 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 5.6 | 13 | GPT-3 13B | 2021-03-05 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 5.6 | 8 | PaLM 8B (fine-tuned) | 2022-06-29 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 5.4 | 0.1 | GPT-2 (0.1B) | 2021-03-05 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 5.2 | 175 | GPT-3 175B (few-shot) | 2021-03-05 |
Galactica: A Large Language Model for Science | ✓ Link | 5.2 | 175 | GPT-3 175B (8-shot) | 2022-11-16 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 4.4 | 62 | PaLM 62B | 2022-06-29 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 3.9 | 13 | LLaMA 13B | 2023-02-27 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 3.0 | 13 | GPT-3 13B (few-shot) | 2021-03-05 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 2.9 | 7 | LLaMA 7B | 2023-02-27 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 2.9 | 2.7 | GPT-3 2.7B | 2021-03-05 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 1.5 | 8 | PaLM 8B | 2022-06-29 |