Execution Guided Line-by-Line Code Generation | ✓ Link | 96.6 | EG-CFG (DeepSeek-V3-0324) | 2025-06-12 |
QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | | 94.2 | QualityFlow (Sonnet-3.5) | 2025-01-20 |
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | ✓ Link | 93.2 | o1-mini + MapCoder (Hamming.ai) | 2024-05-18 |
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | ✓ Link | 92.4 | MGDebugger (DeepSeek-V3-0324) | 2024-10-02 |
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | ✓ Link | 91.8 | GPT-4 + AgentCoder | 2023-12-20 |
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | ✓ Link | 90.7 | CodeSim (GPT-4o) | 2025-02-08 |
 | | 90.0 | Jiutian-大模型 | |
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | ✓ Link | 89.9 | GPT-3.5 Turbo (ChatGPT) + AgentCoder | 2023-12-20 |
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | ✓ Link | 89.7 | MapCoder (GPT-4o) | 2024-05-18 |
How Does Naming Affect LLMs on Code Analysis Tasks? | | 87.5 | GPT-4 (ChatGPT Plus) | 2023-07-24 |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 86.4 | Claude 3 Opus | 2024-03-04 |
Planning-Driven Programming: A Large Language Model Programming Workflow | ✓ Link | 84.8 | LPW (GPT-4o) | 2024-11-21 |
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents | | 83.8±0.6 | GPT-3.5 Turbo + FlowGenScrum + Test | 2024-03-23 |
AFlow: Automating Agentic Workflow Generation | ✓ Link | 83.4 | AFlow (GPT-4o-mini) | 2024-10-14 |
How Does Naming Affect LLMs on Code Analysis Tasks? | | 83.2 | GPT-3.5 Turbo (ChatGPT) | 2023-07-24 |
Execution Guided Line-by-Line Code Generation | ✓ Link | 83.2 | EG-CFG (DeepSeek Coder 1.3b Instruct) | 2025-06-12 |
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | ✓ Link | 83.1 | MapCoder (GPT-4) | 2024-05-18 |
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | ✓ Link | 82.3 | o1-mini + Language Agent Tree Search (Hamming.ai) | 2023-10-06 |
How Does Naming Affect LLMs on Code Analysis Tasks? | | 82 | GPT-4 (Bing Chat) | 2023-07-24 |
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | ✓ Link | 81.1 | GPT-3.5 Turbo + Language Agent Tree Search | 2023-10-06 |
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | ✓ Link | 80.8 | MGDebugger (CodeQwen1.5) | 2024-10-02 |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 80.4 | Claude 3 Haiku | 2024-03-04 |
Teaching Large Language Models to Self-Debug | ✓ Link | 80.2 | GPT-4 (Self-Debugging with unit tests + trace) | 2023-04-11 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ Link | 80 | GPT-4 (few-shot) | 2024-01-25 |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 79.4 | Claude 3 Sonnet | 2024-03-04 |
How Does Naming Affect LLMs on Code Analysis Tasks? | | 76.2 | Bard (PaLM 2/chat-bison-001) | 2023-07-24 |
Teaching Large Language Models to Self-Debug | ✓ Link | 72.8 | GPT-3.5 Turbo (Self-Debugging with unit tests + trace) | 2023-04-11 |
How Does Naming Affect LLMs on Code Analysis Tasks? | | 71.4 | Claude | 2023-07-24 |
Teaching Large Language Models to Self-Debug | ✓ Link | 70.8 | code-davinci-002 175B (Self-Debugging with unit tests + trace) | 2023-04-11 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ Link | 70.8 | GPT-3.5 Turbo (few-shot) | 2024-01-25 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ Link | 70 | DeepSeek-Coder-Instruct 33B (few-shot) | 2024-01-25 |
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair | ✓ Link | 69.8 | GPT-3.5 Turbo + INTERVENOR | 2023-11-16 |
LEVER: Learning to Verify Language-to-Code Generation with Execution | ✓ Link | 68.9 | code-davinci-002 175B + LEVER | 2023-02-16 |
CodeT: Code Generation with Generated Tests | ✓ Link | 67.7 | code-davinci-002 175B + CodeT | 2022-07-21 |
Teaching Large Language Models to Self-Debug | ✓ Link | 67.6 | GPT-3.5 Turbo (3-shot) | 2023-04-11 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 66.9 | code-davinci-002 175B + Reviewer | 2022-11-29 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 66.4 | code-davinci-002 175B + Coder-Reviewer | 2022-11-29 |
StarCoder 2 and The Stack v2: The Next Generation | ✓ Link | 66.2 | StarCoder2-15B | 2024-02-29 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ Link | 66 | DeepSeek-Coder-Base 33B (few-shot) | 2024-01-25 |
Code Llama: Open Foundation Models for Code | ✓ Link | 65.5 | Code Llama - Python 70B (3-shot) | 2023-08-24 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ Link | 65.4 | DeepSeek-Coder-Instruct 6.7B (few-shot) | 2024-01-25 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 63 | code-davinci-002 175B + MBR-Exec | 2022-11-29 |
Code Llama: Open Foundation Models for Code | ✓ Link | 62.4 | Code Llama 70B (3-shot) | 2023-08-24 |
Code Llama: Open Foundation Models for Code | ✓ Link | 62.2 | Code Llama - Instruct 70B (3-shot) | 2023-08-24 |
CodeT: Code Generation with Generated Tests | ✓ Link | 61.9 | code-davinci-001 175B + CodeT | 2022-07-21 |
Teaching Large Language Models to Self-Debug | ✓ Link | 61.4 | code-davinci-002 175B (3-shot) | 2023-04-11 |
Code Llama: Open Foundation Models for Code | ✓ Link | 61.2 | Unnatural Code Llama 34B (3-shot) | 2023-08-24 |
Mixtral of Experts | ✓ Link | 60.7 | Mixtral 8x7B (3-shot) | 2024-01-08 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ Link | 60.6 | DeepSeek-Coder-Base 6.7B (few-shot) | 2024-01-25 |
Natural Language to Code Translation with Execution | ✓ Link | 58.2 | code-davinci-001 175B + MBR-Exec | 2022-04-25 |
Code Llama: Open Foundation Models for Code | ✓ Link | 57 | Code Llama - Instruct 34B (3-shot) | 2023-08-24 |
Code Llama: Open Foundation Models for Code | ✓ Link | 56.2 | Code Llama - Python 34B (3-shot) | 2023-08-24 |
CodeT: Code Generation with Generated Tests | ✓ Link | 55.4 | code-cushman-001 12B (CodeT) | 2022-07-21 |
Code Llama: Open Foundation Models for Code | ✓ Link | 55 | Code Llama 34B (3-shot) | 2023-08-24 |
Teaching Large Language Models to Self-Debug | ✓ Link | 53.2 | StarCoder 15.5B (Self-Debugging with unit tests + trace) | 2023-04-11 |
StarCoder: may the source be with you! | ✓ Link | 52.7 | StarCoder 15.5B | 2023-05-09 |
Code Llama: Open Foundation Models for Code | ✓ Link | 52.2 | GPT-3.5 Turbo | 2023-08-24 |
WizardCoder: Empowering Code Large Language Models with Evol-Instruct | ✓ Link | 51.8 | WizardCoder 15B | 2023-06-14 |
PaLM 2 Technical Report | ✓ Link | 50 | PaLM 2-S* (few-shot) | 2023-05-17 |
CodeT: Code Generation with Generated Tests | ✓ Link | 49.5 | CodeGen-Mono 16B + CodeT | 2022-07-21 |
Code Llama: Open Foundation Models for Code | ✓ Link | 49.4 | Code Llama - Instruct 13B (3-shot) | 2023-08-24 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ Link | 49.4 | DeepSeek-Coder-Instruct 1.3B (few-shot) | 2024-01-25 |
StarCoder: may the source be with you! | ✓ Link | 49 | StarCoderBase 15.5B | 2023-05-09 |
Code Llama: Open Foundation Models for Code | ✓ Link | 49 | Code Llama - Python 13B (3-shot) | 2023-08-24 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 48.6 | Qwen2idae-16x14B (4-shot) | 2024-01-05 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 48.3 | code-cushman-001 12B + MBR-Exec | 2022-11-29 |
Code Llama: Open Foundation Models for Code | ✓ Link | 47.6 | Code Llama - Python 7B (3-shot) | 2023-08-24 |
Mistral 7B | ✓ Link | 47.5 | Mistral 7B (3-shot) | 2023-10-10 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 47.3 | CodeGen 16B + MBR-Exec | 2022-11-29 |
Teaching Large Language Models to Self-Debug | ✓ Link | 47.2 | StarCoder 15.5B (3-shot) | 2023-04-11 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 47 | PaLM Coder 540B | 2022-04-05 |
Code Llama: Open Foundation Models for Code | ✓ Link | 47 | Code Llama 13B (3-shot) | 2023-08-24 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 46.2 | CodeGen 16B + Coder-Reviewer | 2022-11-29 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | ✓ Link | 46.2 | DeepSeek-Coder-Base 1.3B (few-shot) | 2024-01-25 |
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair | ✓ Link | 45.4 | GPT-3.5 Turbo (few-shot) | 2023-11-16 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 45 | Llama 2 70B (zero-shot) | 2023-07-18 |
Code Llama: Open Foundation Models for Code | ✓ Link | 44.4 | Code Llama - Instruct 7B (3-shot) | 2023-08-24 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 44.1 | CodeGen 16B + Reviewer | 2022-11-29 |
Textbooks Are All You Need II: phi-1.5 technical report | ✓ Link | 43.5 | phi-1.5-web 1.3B | 2023-09-11 |
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ Link | 42.6 | Branch-Train-Merge 4x7B (top-2) | 2024-03-12 |
Code Llama: Open Foundation Models for Code | ✓ Link | 41.4 | Code Llama 7B (3-shot) | 2023-08-24 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 41.4 | Camelidae-8×34B (4-shot) | 2024-01-05 |
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair | ✓ Link | 39.8 | GPT-3.5 Turbo (0-shot) | 2023-11-16 |
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ Link | 39.4 | Branch-Train-MiX 4x7B (sampling top-2 experts) | 2024-03-12 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 37.7 | LLaMA 65B (0-shot) | 2023-02-27 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 36.8 | PaLM 540B | 2022-04-05 |
StarCoder: may the source be with you! | ✓ Link | 35 | SantaCoder 1.1B | 2023-05-09 |
CodeT: Code Generation with Generated Tests | ✓ Link | 34.4 | InCoder 6.7B + CodeT | 2022-07-21 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 33 | Llama 2 34B (0-shot) | 2023-07-18 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 30.6 | Llama 2 13B (0-shot) | 2023-07-18 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 30.2 | LLaMA 33B (0-shot) | 2023-02-27 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 26.7 | InCoder 6.7B + MBR-Exec | 2022-11-29 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 26.1 | InCoder 6.7B + Coder-Reviewer | 2022-11-29 |
Coder Reviewer Reranking for Code Generation | ✓ Link | 24.4 | InCoder 6.7B + Reviewer | 2022-11-29 |
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | ✓ Link | 24.4 | CodeGeeX-13B | 2023-03-30 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 22 | LLaMA 13B (0-shot) | 2023-02-27 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 20.8 | Llama 2 7B (0-shot) | 2023-07-18 |
InCoder: A Generative Model for Code Infilling and Synthesis | ✓ Link | 19.4 | InCoder 6.7B (0-shot) | 2022-04-12 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 17.7 | LLaMA 7B (0-shot) | 2023-02-27 |