OpenCodePapers

Code Generation on MBPP

Code Generation
Leaderboard
| Paper | Code | Accuracy | Model Name | Release Date |
| --- | --- | --- | --- | --- |
| Execution Guided Line-by-Line Code Generation | ✓ | 96.6 | EG-CFG (DeepSeek-V3-0324) | 2025-06-12 |
| QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | | 94.2 | QualityFlow (Sonnet-3.5) | 2025-01-20 |
| MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | ✓ | 93.2 | o1-mini + MapCoder (Hamming.ai) | 2024-05-18 |
| From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | ✓ | 92.4 | MGDebugger (DeepSeek-V3-0324) | 2024-10-02 |
| AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | ✓ | 91.8 | GPT-4 + AgentCoder | 2023-12-20 |
| CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | ✓ | 90.7 | CodeSim (GPT4o) | 2025-02-08 |
| | | 90.0 | Jiutian-大模型 | |
| AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | ✓ | 89.9 | GPT-3.5 Turbo (ChatGPT) + AgentCoder | 2023-12-20 |
| MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | ✓ | 89.7 | MapCoder (GPT-4o) | 2024-05-18 |
| How Does Naming Affect LLMs on Code Analysis Tasks? | | 87.5 | GPT-4 (ChatGPT Plus) | 2023-07-24 |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | 86.4 | Claude 3 Opus | 2024-03-04 |
| Planning-Driven Programming: A Large Language Model Programming Workflow | ✓ | 84.8 | LPW (GPT-4o) | 2024-11-21 |
| SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents | | 83.8±0.6 | GPT-3.5 Turbo + FlowGenScrum + Test | 2024-03-23 |
| AFlow: Automating Agentic Workflow Generation | ✓ | 83.4 | AFlow (GPT-4o-mini) | 2024-10-14 |
| How Does Naming Affect LLMs on Code Analysis Tasks? | | 83.2 | GPT-3.5 Turbo (ChatGPT) | 2023-07-24 |
| Execution Guided Line-by-Line Code Generation | ✓ | 83.2 | EG-CFG (DeepSeek Coder 1.3b Instruct) | 2025-06-12 |
| MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | ✓ | 83.1 | MapCoder (GPT-4) | 2024-05-18 |
| Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | ✓ | 82.3 | o1-mini + Language Agent Tree Search (Hamming.ai) | 2023-10-06 |
| How Does Naming Affect LLMs on Code Analysis Tasks? | | 82 | GPT-4 (Bing Chat) | 2023-07-24 |
| Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | ✓ | 81.1 | GPT-3.5 Turbo + Language Agent Tree Search | 2023-10-06 |
| From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | ✓ | 80.8 | MGDebugger (CodeQwen1.5) | 2024-10-02 |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | 80.4 | Claude 3 Haiku | 2024-03-04 |
| Teaching Large Language Models to Self-Debug | ✓ | 80.2 | GPT-4 (Self-Debugging with unit tests + trace) | 2023-04-11 |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ | 80 | GPT-4 (few-shot) | 2024-01-25 |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | 79.4 | Claude 3 Sonnet | 2024-03-04 |
| How Does Naming Affect LLMs on Code Analysis Tasks? | | 76.2 | Bard (PaLM 2/chat-bison-001) | 2023-07-24 |
| Teaching Large Language Models to Self-Debug | ✓ | 72.8 | GPT-3.5 Turbo (Self-Debugging with unit tests + trace) | 2023-04-11 |
| How Does Naming Affect LLMs on Code Analysis Tasks? | | 71.4 | Claude | 2023-07-24 |
| Teaching Large Language Models to Self-Debug | ✓ | 70.8 | code-davinci-002 175B (Self-Debugging with unit tests + trace) | 2023-04-11 |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ | 70.8 | GPT-3.5 Turbo (few-shot) | 2024-01-25 |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ | 70 | DeepSeek-Coder-Instruct 33B (few-shot) | 2024-01-25 |
| INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair | ✓ | 69.8 | GPT-3.5 Turbo + INTERVENOR | 2023-11-16 |
| LEVER: Learning to Verify Language-to-Code Generation with Execution | ✓ | 68.9 | code-davinci-002 175B + LEVER | 2023-02-16 |
| CodeT: Code Generation with Generated Tests | ✓ | 67.7 | code-davinci-002 175B + CodeT | 2022-07-21 |
| Teaching Large Language Models to Self-Debug | ✓ | 67.6 | GPT-3.5 Turbo (3-shot) | 2023-04-11 |
| Coder Reviewer Reranking for Code Generation | ✓ | 66.9 | code-davinci-002 175B + Reviewer | 2022-11-29 |
| Coder Reviewer Reranking for Code Generation | ✓ | 66.4 | code-davinci-002 175B + Coder-Reviewer | 2022-11-29 |
| StarCoder 2 and The Stack v2: The Next Generation | ✓ | 66.2 | StarCoder2-15B | 2024-02-29 |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ | 66 | DeepSeek-Coder-Base 33B (few-shot) | 2024-01-25 |
| Code Llama: Open Foundation Models for Code | ✓ | 65.5 | Code Llama - Python 70B (3-shot) | 2023-08-24 |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ | 65.4 | DeepSeek-Coder-Instruct 6.7B (few-shot) | 2024-01-25 |
| Coder Reviewer Reranking for Code Generation | ✓ | 63 | code-davinci-002 175B + MBR-Exec | 2022-11-29 |
| Code Llama: Open Foundation Models for Code | ✓ | 62.4 | Code Llama 70B (3-shot) | 2023-08-24 |
| Code Llama: Open Foundation Models for Code | ✓ | 62.2 | Code Llama - Instruct 70B (3-shot) | 2023-08-24 |
| CodeT: Code Generation with Generated Tests | ✓ | 61.9 | code-davinci-001 175B + CodeT | 2022-07-21 |
| Teaching Large Language Models to Self-Debug | ✓ | 61.4 | code-davinci-002 175B (3-shot) | 2023-04-11 |
| Code Llama: Open Foundation Models for Code | ✓ | 61.2 | Unnatural Code Llama 34B (3-shot) | 2023-08-24 |
| Mixtral of Experts | ✓ | 60.7 | Mixtral 8x7B (3-shot) | 2024-01-08 |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ | 60.6 | DeepSeek-Coder-Base 6.7B (few-shot) | 2024-01-25 |
| Natural Language to Code Translation with Execution | ✓ | 58.2 | code-davinci-001 175B + MBR-Exec | 2022-04-25 |
| Code Llama: Open Foundation Models for Code | ✓ | 57 | Code Llama - Instruct 34B (3-shot) | 2023-08-24 |
| Code Llama: Open Foundation Models for Code | ✓ | 56.2 | Code Llama - Python 34B (3-shot) | 2023-08-24 |
| CodeT: Code Generation with Generated Tests | ✓ | 55.4 | code-cushman-001 12B (CodeT) | 2022-07-21 |
| Code Llama: Open Foundation Models for Code | ✓ | 55 | Code Llama 34B (3-shot) | 2023-08-24 |
| Teaching Large Language Models to Self-Debug | ✓ | 53.2 | StarCoder 15.5B (Self-Debugging with unit tests + trace) | 2023-04-11 |
| StarCoder: may the source be with you! | ✓ | 52.7 | StarCoder 15.5B | 2023-05-09 |
| Code Llama: Open Foundation Models for Code | ✓ | 52.2 | GPT-3.5 Turbo | 2023-08-24 |
| WizardCoder: Empowering Code Large Language Models with Evol-Instruct | ✓ | 51.8 | WizardCoder 15B | 2023-06-14 |
| PaLM 2 Technical Report | ✓ | 50 | PaLM 2-S* (few-shot) | 2023-05-17 |
| CodeT: Code Generation with Generated Tests | ✓ | 49.5 | CodeGen-Mono 16B + CodeT | 2022-07-21 |
| Code Llama: Open Foundation Models for Code | ✓ | 49.4 | Code Llama - Instruct 13B (3-shot) | 2023-08-24 |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ | 49.4 | DeepSeek-Coder-Instruct 1.3B (few-shot) | 2024-01-25 |
| StarCoder: may the source be with you! | ✓ | 49 | StarCoderBase 15.5B | 2023-05-09 |
| Code Llama: Open Foundation Models for Code | ✓ | 49 | Code Llama - Python 13B (3-shot) | 2023-08-24 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ | 48.6 | Qwen2idae-16x14B (4-shot) | 2024-01-05 |
| Coder Reviewer Reranking for Code Generation | ✓ | 48.3 | code-cushman-001 12B + MBR-Exec | 2022-11-29 |
| Code Llama: Open Foundation Models for Code | ✓ | 47.6 | Code Llama - Python 7B (3-shot) | 2023-08-24 |
| Mistral 7B | ✓ | 47.5 | Mistral 7B (3-shot) | 2023-10-10 |
| Coder Reviewer Reranking for Code Generation | ✓ | 47.3 | CodeGen 16B + MBR-Exec | 2022-11-29 |
| Teaching Large Language Models to Self-Debug | ✓ | 47.2 | StarCoder 15.5B (3-shot) | 2023-04-11 |
| PaLM: Scaling Language Modeling with Pathways | ✓ | 47 | PaLM Coder 540B | 2022-04-05 |
| Code Llama: Open Foundation Models for Code | ✓ | 47 | Code Llama 13B (3-shot) | 2023-08-24 |
| Coder Reviewer Reranking for Code Generation | ✓ | 46.2 | CodeGen 16B + Coder-Reviewer | 2022-11-29 |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ | 46.2 | DeepSeek-Coder-Base 1.3B (few-shot) | 2024-01-25 |
| INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair | ✓ | 45.4 | GPT-3.5 Turbo (few-shot) | 2023-11-16 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ | 45 | Llama 2 70B (zero-shot) | 2023-07-18 |
| Code Llama: Open Foundation Models for Code | ✓ | 44.4 | Code Llama - Instruct 7B (3-shot) | 2023-08-24 |
| Coder Reviewer Reranking for Code Generation | ✓ | 44.1 | CodeGen 16B + Reviewer | 2022-11-29 |
| Textbooks Are All You Need II: phi-1.5 technical report | ✓ | 43.5 | phi-1.5-web 1.3B | 2023-09-11 |
| Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ | 42.6 | Branch-Train-Merge 4x7B (top-2) | 2024-03-12 |
| Code Llama: Open Foundation Models for Code | ✓ | 41.4 | Code Llama 7B (3-shot) | 2023-08-24 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ | 41.4 | Camelidae-8×34B (4-shot) | 2024-01-05 |
| INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair | ✓ | 39.8 | GPT-3.5 Turbo (0-shot) | 2023-11-16 |
| Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ | 39.4 | Branch-Train-MiX 4x7B (sampling top-2 experts) | 2024-03-12 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 37.7 | LLaMA 65B (0-shot) | 2023-02-27 |
| PaLM: Scaling Language Modeling with Pathways | ✓ | 36.8 | PaLM 540B | 2022-04-05 |
| StarCoder: may the source be with you! | ✓ | 35 | SantaCoder 1.1B | 2023-05-09 |
| CodeT: Code Generation with Generated Tests | ✓ | 34.4 | InCoder 6.7B + CodeT | 2022-07-21 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ | 33 | Llama 2 34B (0-shot) | 2023-07-18 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ | 30.6 | Llama 2 13B (0-shot) | 2023-07-18 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 30.2 | LLaMA 33B (0-shot) | 2023-02-27 |
| Coder Reviewer Reranking for Code Generation | ✓ | 26.7 | InCoder 6.7B + MBR-Exec | 2022-11-29 |
| Coder Reviewer Reranking for Code Generation | ✓ | 26.1 | InCoder 6.7B + Coder-Reviewer | 2022-11-29 |
| Coder Reviewer Reranking for Code Generation | ✓ | 24.4 | InCoder 6.7B + Reviewer | 2022-11-29 |
| CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | ✓ | 24.4 | CodeGeeX-13B | 2023-03-30 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 22 | LLaMA 13B (0-shot) | 2023-02-27 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ | 20.8 | Llama 2 7B (0-shot) | 2023-07-18 |
| InCoder: A Generative Model for Code Infilling and Synthesis | ✓ | 19.4 | InCoder 6.7B (0-shot) | 2022-04-12 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | 17.7 | LLaMA 7B (0-shot) | 2023-02-27 |