Hi, thanks for sharing this impressive work. While evaluating Dream-org/Dream-Coder-v0-Instruct-7B on the HumanEval and MBPP benchmarks, I observed that the reproduced results do not match the performance reported in the paper.
For HumanEval, I used a command aligned with the instructions in Dream's official repository (https://github.com/DreamLM/Dream/blob/main/eval_instruct/eval.sh):
HF_ALLOW_CODE_EVAL=1 PYTHONPATH=. accelerate launch --main_process_port 12334 -m lm_eval \
--model diffllm \
--model_args pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=768,diffusion_steps=768,dtype="bfloat16",temperature=0.1,top_p=0.9,alg="entropy" \
--tasks humaneval_instruct \
--device cuda \
--batch_size 1 \
--num_fewshot 0 \
--output_path output_reproduce/humaneval \
--log_samples --confirm_run_unsafe_code \
--apply_chat_template
My reproduction yields a HumanEval score of 74.39:
diffllm (pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=768,diffusion_steps=768,dtype=bfloat16,temperature=0.1,top_p=0.9,alg=entropy), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1
| Tasks |Version| Filter |n-shot|Metric| |Value | |Stderr|
|------------------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval_instruct| 2|create_test| 0|pass@1| |0.7439|± |0.0342|
This is significantly below the 82.9 reported in the paper.
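To put the gap in context, here is a quick check of how large the difference is relative to the standard error that lm_eval printed above. It is only a rough sketch: it treats the paper's 82.9 as an exact value and uses the stderr from my single run, and the variable names are my own.

```python
# Figures copied from the lm_eval output above and from the paper.
reported = 0.829      # HumanEval pass@1 reported in the paper
reproduced = 0.7439   # pass@1 from my run
stderr = 0.0342       # standard error printed by lm_eval for my run

# Absolute gap, and the same gap expressed in units of the run's stderr.
gap = reported - reproduced
gap_in_stderr = gap / stderr

print(f"gap: {gap * 100:.2f} points ({gap_in_stderr:.2f} standard errors)")
# prints: gap: 8.51 points (2.49 standard errors)
```

So the difference is about 8.5 points, roughly 2.5 times the reported standard error, which is why I do not think it is just noise from the 164-problem test set.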
For MBPP, I used the command:
HF_ALLOW_CODE_EVAL=1 PYTHONPATH=. accelerate launch --main_process_port 12334 -m lm_eval \
--model diffllm \
--model_args pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=1024,diffusion_steps=1024,dtype="bfloat16",temperature=0.1,top_p=0.9,alg="entropy" \
--tasks mbpp_instruct \
--device cuda \
--batch_size 1 \
--num_fewshot 0 \
--output_path output_reproduce/mbpp \
--log_samples --confirm_run_unsafe_code \
--apply_chat_template
My reproduced MBPP score is 66.84, which is also lower than the 79.6 reported in the paper.
Could you clarify why these discrepancies occur or whether additional evaluation settings are required to match the results in the paper? Thank you!