metrics:
- code_eval
library_name: transformers
tags:
- code
model-index:
- name: WizardCoder
  results:
  - task:
      type: text-generation
    dataset:
      type: openai_humaneval
      name: HumanEval
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.799
      verified: false
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Home Page
HF Repo • Github Repo • Twitter
[WizardLM] • [WizardCoder] • [WizardMath]
Join our Discord
News
[2024/01/04] We released WizardCoder-33B-V1.1, trained from deepseek-coder-33b-base. It is the SOTA OSS Code LLM on the EvalPlus Leaderboard, achieving 79.9 pass@1 on HumanEval, 73.2 pass@1 on HumanEval-Plus, 78.9 pass@1 on MBPP, and 66.9 pass@1 on MBPP-Plus.
[2024/01/04] WizardCoder-33B-V1.1 outperforms ChatGPT 3.5, Gemini Pro, and DeepSeek-Coder-33B-instruct on HumanEval and HumanEval-Plus pass@1.
[2024/01/04] WizardCoder-33B-V1.1 is comparable to ChatGPT 3.5 and surpasses Gemini Pro on MBPP and MBPP-Plus pass@1.
Model | Checkpoint | Paper | HumanEval | HumanEval+ | MBPP | MBPP+ | License |
---|---|---|---|---|---|---|---|
GPT-4-Turbo (Nov 2023) | - | - | 85.4 | 81.7 | 83.0 | 70.7 | - |
GPT-4 (May 2023) | - | - | 88.4 | 76.8 | - | - | - |
GPT-3.5-Turbo (Nov 2023) | - | - | 72.6 | 65.9 | 81.7 | 69.4 | - |
Gemini Pro | - | - | 63.4 | 55.5 | 72.9 | 57.9 | - |
DeepSeek-Coder-33B-instruct | - | - | 78.7 | 72.6 | 78.7 | 66.7 | - |
WizardCoder-33B-V1.1 | HF Link | [WizardCoder] | 79.9 | 73.2 | 78.9 | 66.9 | MSFTResearch |
WizardCoder-Python-34B-V1.0 | HF Link | [WizardCoder] | 73.2 | 64.6 | 73.2 | 59.9 | Llama2 |
WizardCoder-15B-V1.0 | HF Link | [WizardCoder] | 59.8 | 52.4 | -- | -- | OpenRAIL-M |
WizardCoder-Python-13B-V1.0 | HF Link | [WizardCoder] | 64.0 | -- | -- | -- | Llama2 |
WizardCoder-Python-7B-V1.0 | HF Link | [WizardCoder] | 55.5 | -- | -- | -- | Llama2 |
WizardCoder-3B-V1.0 | HF Link | [WizardCoder] | 34.8 | -- | -- | -- | OpenRAIL-M |
WizardCoder-1B-V1.0 | HF Link | [WizardCoder] | 23.8 | -- | -- | -- | OpenRAIL-M |
How to Make the Training Data?
Apply our Code Evol-Instruct method to the Code-Alpaca data.
Data Contamination Check:
Before model training, we carefully and rigorously checked all the training data, and used multiple deduplication methods to verify and prevent data leakage from the HumanEval and MBPP test sets.
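The deduplication methods used for this check are not specified here. As an illustration only, the sketch below shows one common technique, word-level n-gram overlap filtering, which drops any training sample sharing an n-gram with a benchmark prompt. The function names and the choice of `n` are assumptions for this example, not the authors' pipeline.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_training_data(samples: list, benchmark_prompts: list, n: int = 10) -> list:
    """Keep only samples that share no n-gram with any benchmark prompt.

    Hypothetical helper: a simple overlap heuristic, not the method
    actually used to build the WizardCoder training set.
    """
    benchmark_index = set()
    for prompt in benchmark_prompts:
        benchmark_index |= ngrams(prompt, n)
    return [s for s in samples if ngrams(s, n).isdisjoint(benchmark_index)]
```

A smaller `n` flags more samples (stricter filtering, more false positives); a larger `n` only catches near-verbatim copies.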
Note on system prompt usage:
Please use exactly the same system prompt as we do. We do not guarantee the accuracy of quantized versions.
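Because the prompt has to match exactly, it can help to keep the default template quoted in this section in a single constant and fill it programmatically. The sketch below does only that; `build_prompt` is a hypothetical helper name, and the template string is copied from the default version given here.

```python
# Default prompt template, copied verbatim from this model card.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(instruction: str) -> str:
    """Insert the user's instruction into the default template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


if __name__ == "__main__":
    print(build_prompt("Write a Python function that reverses a string."))
```

The resulting string is what would be passed to the tokenizer or generation backend; nothing should be appended after `### Response:`.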
Default version:
"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
How to Reproduce the Performance of WizardCoder-33B-V1.1
We provide all the code here.
We also provide all generated results.
transformers==4.36.2
vllm==0.2.5
(1) HumanEval and HumanEval-Plus
- Step 1
Code Generation (w/o accelerate)
model="WizardLM/WizardCoder-33B-V1.1"
temp=0.0
max_len=2048
pred_num=1
num_seqs_per_iter=1
output_path=preds/T${temp}_N${pred_num}_WizardCoder-33B-V1.1_Greedy_Decode
mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model
# 164 problems, 21 per GPU if GPU=8
index=0
gpu_num=8
for ((i = 0; i < $gpu_num; i++)); do
  start_index=$((i * 21))
  end_index=$(((i + 1) * 21))
  gpu=$((i))
  echo 'Running process #' ${i} 'from' $start_index 'to' $end_index 'on GPU' ${gpu}
  ((index++))
  (
    CUDA_VISIBLE_DEVICES=$gpu python humaneval_gen.py --model ${model} \
      --start_index ${start_index} --end_index ${end_index} --temperature ${temp} \
      --num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path} --greedy_decode
  ) &
  if (($index % $gpu_num == 0)); then wait; fi
done
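The loop above gives each GPU a block of 21 consecutive problem indices (21 = 164 / 8, rounded up). The same split can be sketched in Python as follows. `shard_ranges` is a hypothetical helper, not part of the repo, and unlike the shell loop it clamps the last range to the number of problems.

```python
import math


def shard_ranges(num_problems: int, num_gpus: int) -> list:
    """Split problem indices [0, num_problems) into contiguous per-GPU ranges.

    Returns a list of (start_index, end_index) pairs with end exclusive.
    """
    per_gpu = math.ceil(num_problems / num_gpus)
    ranges = []
    for i in range(num_gpus):
        start = i * per_gpu
        end = min((i + 1) * per_gpu, num_problems)  # clamp the final shard
        if start < end:
            ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    for gpu, (start, end) in enumerate(shard_ranges(164, 8)):
        print(f"GPU {gpu}: problems {start} to {end}")
```

For 164 problems on 8 GPUs this yields seven ranges of 21 problems and a final range of (147, 164).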
Code Generation (w/ vllm accelerate)
model="WizardLM/WizardCoder-33B-V1.1"
temp=0.0
max_len=2048
pred_num=1
num_seqs_per_iter=1
output_path=preds/T${temp}_N${pred_num}_WizardCoder-33B-V1.1_Greedy_Decode_vllm
mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model
CUDA_VISIBLE_DEVICES=0,1,2,3 python humaneval_gen_vllm.py --model ${model} \
--start_index 0 --end_index 164 --temperature ${temp} \
--num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path} --num_gpus 4 --overwrite
- Step 2: Get the score
Install Eval-Plus benchmark.
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
Get HumanEval and HumanEval-Plus scores.
output_path=preds/T0.0_N1_WizardCoder-33B-V1.1_Greedy_Decode
echo 'Output path: '$output_path
python process_humaneval.py --path ${output_path} --out_path ${output_path}.jsonl --add_prompt
evalplus.evaluate --dataset humaneval --samples ${output_path}.jsonl
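For readers unfamiliar with the metric, the sketch below shows the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c the number that pass all tests. With greedy decoding and one sample per problem (N=1, as in the settings above), pass@1 reduces to the fraction of problems solved. The function names are illustrative and this is not the EvalPlus implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem with n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems; `results` holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```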
(2) MBPP and MBPP-Plus
The preprocessed questions are provided in mbppplus.json.
- Step 1
Code Generation (w/o accelerate)
model="WizardLM/WizardCoder-33B-V1.1"
temp=0.0
max_len=2048
pred_num=1
num_seqs_per_iter=1
output_path=preds/MBPP_T${temp}_N${pred_num}_WizardCoder-33B-V1.1_Greedy_Decode
mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model
# 399 problems, 50 per GPU if GPU=8
index=0
gpu_num=8
for ((i = 0; i < $gpu_num; i++)); do
  start_index=$((i * 50))
  end_index=$(((i + 1) * 50))
  gpu=$((i))
  echo 'Running process #' ${i} 'from' $start_index 'to' $end_index 'on GPU' ${gpu}
  ((index++))
  (
    CUDA_VISIBLE_DEVICES=$gpu python mbppplus_gen.py --model ${model} \
      --start_index ${start_index} --end_index ${end_index} --temperature ${temp} \
      --num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path} --mbpp_path "mbppplus.json" --greedy_decode
  ) &
  if (($index % $gpu_num == 0)); then wait; fi
done
Code Generation (w/ vllm accelerate)
model="WizardLM/WizardCoder-33B-V1.1"
temp=0.0
max_len=2048
pred_num=1
num_seqs_per_iter=1
output_path=preds/MBPP_T${temp}_N${pred_num}_WizardCoder-33B-V1.1_Greedy_Decode_vllm
mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model
# start_index/end_index are not set in this script, so pass the full range (399 problems) directly
CUDA_VISIBLE_DEVICES=0,1,2,3 python mbppplus_gen_vllm.py --model ${model} \
--start_index 0 --end_index 399 --temperature ${temp} \
--num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path} --mbpp_path "mbppplus.json" --num_gpus 4
- Step 2: Get the score
Install Eval-Plus benchmark.
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
Get MBPP and MBPP-Plus scores.
output_path=preds/MBPP_T0.0_N1_WizardCoder-33B-V1.1_Greedy_Decode
echo 'Output path: '$output_path
python mbppplus_process_preds.py --path ${output_path} --out_path ${output_path}.jsonl --add_prompt
evalplus.evaluate --dataset mbpp --samples ${output_path}.jsonl
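The processing scripts above turn raw model outputs into a `.jsonl` samples file for `evalplus.evaluate`. Their internals are not shown here, so the following is only a rough sketch of that kind of post-processing, under two assumptions: the model wraps its answer in a Markdown code fence, and each output line is a JSON object with `task_id` and `solution` fields. `extract_code` and `to_jsonl_lines` are hypothetical names, not functions from the repo.

```python
import json
import re

# Build the fence marker programmatically so it does not clash with this block's own fence.
FENCE_MARK = "`" * 3
FENCE_RE = re.compile(FENCE_MARK + r"(?:python)?[ \t]*\n(.*?)" + FENCE_MARK, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block in `response`, or the whole text if none."""
    match = FENCE_RE.search(response)
    return match.group(1).strip() if match else response.strip()


def to_jsonl_lines(records: list) -> list:
    """Convert (task_id, raw_response) pairs into JSON lines (assumed sample format)."""
    return [
        json.dumps({"task_id": task_id, "solution": extract_code(response)})
        for task_id, response in records
    ]


def write_samples(records: list, out_path: str) -> None:
    """Write one JSON object per line to `out_path`."""
    with open(out_path, "w", encoding="utf-8") as f:
        for line in to_jsonl_lines(records):
            f.write(line + "\n")
```

To reproduce the reported numbers, use the provided `process_humaneval.py` and `mbppplus_process_preds.py` scripts rather than this sketch.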
Citation
Please cite the repo if you use the data, method or code in this repo.
@article{luo2023wizardcoder,
title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
author={Luo, Ziyang and Xu, Can and Zhao, Pu and Sun, Qingfeng and Geng, Xiubo and Hu, Wenxiang and Tao, Chongyang and Ma, Jing and Lin, Qingwei and Jiang, Daxin},
journal={arXiv preprint arXiv:2306.08568},
year={2023}
}

