Performance Gap Between Reproduction and Paper Results on HumanEval and MBPP

Hi, thanks for sharing this impressive work. While evaluating `Dream-org/Dream-Coder-v0-Instruct-7B` on the HumanEval and MBPP benchmarks, I found that my reproduced results fall short of the performance reported in the paper.

For HumanEval, I used a command based on the instructions in Dream's official repository ([https://github.com/DreamLM/Dream/blob/main/eval_instruct/eval.sh](https://github.com/DreamLM/Dream/blob/main/eval_instruct/eval.sh)):

```bash
HF_ALLOW_CODE_EVAL=1 PYTHONPATH=. accelerate launch --main_process_port 12334 -m lm_eval \
    --model diffllm \
    --model_args pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=768,diffusion_steps=768,dtype="bfloat16",temperature=0.1,top_p=0.9,alg="entropy" \
    --tasks humaneval_instruct \
    --device cuda \
    --batch_size 1 \
    --num_fewshot 0 \
    --output_path output_reproduce/humaneval \
    --log_samples --confirm_run_unsafe_code \
    --apply_chat_template
```

My reproduction yields a HumanEval score of **74.39**:

```
diffllm (pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=768,diffusion_steps=768,dtype=bfloat16,temperature=0.1,top_p=0.9,alg=entropy), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1
|      Tasks       |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval_instruct|      2|create_test|     0|pass@1|   |0.7439|±  |0.0342|
```

This is about 8.5 points below the **82.9** reported in the paper.
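
To put the HumanEval gap in context, here is a rough sketch that expresses it in units of the standard error from the table above and as an approximate number of problems. The helper name `summarize_gap` is my own, and the calculation assumes the paper's 82.9 is a pass@1 score over the same 164 HumanEval problems:

```python
N_PROBLEMS = 164  # size of the HumanEval test set


def summarize_gap(reproduced, reported, stderr, n_problems):
    """Compare two pass@1 scores given on a 0-1 scale."""
    gap = reported - reproduced
    return {
        # Gap in percentage points.
        "gap_points": gap * 100,
        # Gap relative to the standard error of the reproduced score.
        "gap_in_stderr": gap / stderr,
        # Approximate difference in the number of solved problems.
        "problems_apart": round(reported * n_problems)
        - round(reproduced * n_problems),
    }


# Reproduced score and stderr come from the lm_eval table above;
# 0.829 is the paper's reported figure (assumed comparable).
summary = summarize_gap(0.7439, 0.829, 0.0342, N_PROBLEMS)
print(summary)
```

Under these assumptions the gap is roughly 2.5 standard errors, or about 14 problems out of 164, so it seems unlikely to be sampling noise alone.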

---


For MBPP, I used the command:

```bash
HF_ALLOW_CODE_EVAL=1 PYTHONPATH=. accelerate launch --main_process_port 12334 -m lm_eval \
    --model diffllm \
    --model_args pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=1024,diffusion_steps=1024,dtype="bfloat16",temperature=0.1,top_p=0.9,alg="entropy" \
    --tasks mbpp_instruct \
    --device cuda \
    --batch_size 1 \
    --num_fewshot 0 \
    --output_path output_reproduce/mbpp \
    --log_samples --confirm_run_unsafe_code \
    --apply_chat_template
```

My reproduced MBPP score is **66.84**, about 12.8 points below the **79.6** reported in the paper.

---

Could you clarify why these discrepancies occur or whether additional evaluation settings are required to match the results in the paper? Thank you!
