Add DPO and ORPO preference data preprocessing pipeline utils #3895
igorts-git wants to merge 1 commit into
30d3c25 to b8ae239
🤖 Hi @igorts-git, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
The Pull Request introduces important utilities for DPO and ORPO preference data preprocessing, which is a key component for the upcoming Tunix-based alignment implementation. The core logic for handling 2-column and 3-column datasets is well-structured, but I identified a high-severity bug in the common prefix extraction and some opportunities for more flexible truncation strategies.
🔍 General Feedback
- Logic Bug: The common prefix extraction logic using `enumerate(zip(...))` is flawed for edge cases like identical strings or prefix strings. I have provided a more robust implementation in the inline comments.
- Truncation Strategy: The current 50/50 split for prompt/response lengths and the prefix-based truncation for prompts might lead to information loss in long-context scenarios.
- Test Coverage: The new unit tests are quite thorough, but adding the suggested edge cases for prefix extraction would make them even better.
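The PR's original loop is not shown in this conversation, so the following is only a sketch of the kind of failure the first bullet describes. `common_prefix_len_buggy` is a hypothetical reconstruction of an `enumerate(zip(...))` loop that trusts the loop index after the loop ends, and `common_prefix_len` is one possible robust variant. Both function names are illustrative and do not come from the PR.

```python
def common_prefix_len_buggy(a: str, b: str) -> int:
    """Hypothetical flawed pattern: uses the loop index after the loop."""
    i = 0
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            break
    # If no mismatch was found, i is the index of the last compared
    # character, not the prefix length, so the result is one too small.
    return i


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # No mismatch: the shorter string is entirely a prefix of the other
    # (this also covers identical strings and empty strings).
    return min(len(a), len(b))
```

With identical inputs such as `"abc"` and `"abc"`, the first function reports 2 while the second reports 3; the same off-by-one appears when one string is a strict prefix of the other.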
…ities Includes robust common prefix extraction for 2-column datasets, prompt suffix truncation, customizable max_prompt_length with validation against max_target_length, and complete integration unit test coverage.
b8ae239 to 2d7b6e0
This PR introduces necessary data preprocessing utilities for DPO and ORPO, including a new Grain transform DPOTunixPrep that handles column remapping, prefix extraction, and DPO-aware padding. The implementation is well-tested and integrated into the existing Hugging Face data pipeline.
🔍 General Feedback
- Robustness: The prefix extraction logic for 2-column datasets is a great addition for supporting popular preference datasets like Anthropic/hh-rlhf.
- Breaking Change: As noted in the description, moving DPO parameters into a nested config block is a breaking change for existing DPO configurations.
- Logic Correction: A fix is suggested for the slicing logic in `_pad` to correctly handle cases where the requested length is 0.
- Validation: Added a suggestion for non-negativity validation on `max_prompt_length` to align with project standards.
    pad_amount = max(length - x.shape[0], 0)
    if left:
        pad_width = ((pad_amount, 0),)
        x_trimmed = x[-length:]

Suggested change:

    -     x_trimmed = x[-length:]
    + if left:
    +     pad_width = ((pad_amount, 0),)
    +     x_trimmed = x[-length:] if length > 0 else x[:0]
    + else:
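The reason for the `length > 0` guard is that `-0` equals `0`, so `x[-0:]` is the same slice as `x[0:]` and returns the whole sequence instead of an empty one. The sketch below shows this with plain Python lists; NumPy and JAX arrays follow the same slicing rule. `trim_last` is a hypothetical helper name used only for illustration.

```python
def trim_last(x: list, length: int) -> list:
    """Keep the last `length` items; a length of 0 yields an empty list."""
    return x[-length:] if length > 0 else x[:0]


tokens = [11, 12, 13]

naive = tokens[-0:]           # -0 == 0, so this is tokens[0:], the full list
fixed = trim_last(tokens, 0)  # empty, as intended

last_two = trim_last(tokens, 2)
```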
    orpo_lambda: float = Field(0.1, description="Weight for preference loss in ORPO.")
    dpo_label_smoothing: float = Field(0.0, ge=0.0, le=1.0, description="Label smoothing for DPO.")
    max_prompt_length: int | None = Field(
        None,

Suggested change:

    -     None,
    + max_prompt_length: int | None = Field(
    +     None,
    +     ge=0,
    +     description="Maximum length for prompt. If None, defaults to half of max_target_length.",
    + )
Description
To simplify code review I am splitting the Tunix-based DPO implementation into smaller PRs.
This one adds the data reading and preprocessing required by DPO.
The classic DPO inputs consist of three data columns: ["prompt", "chosen_response", "rejected_response"].
However, some DPO datasets use a two-column format where the prompt is the prefix of the chosen and rejected strings.
When a 2-column dataset is used, our implementation extracts the common prefix into the "prompt" field, which is then fed into the model separately.
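A minimal sketch of the 2-column to 3-column conversion, assuming the split happens on raw strings (the PR may well operate on token IDs instead). `split_two_column` is a hypothetical name, and the input keys `"chosen"` and `"rejected"` are assumed for the example; `os.path.commonprefix` is the standard-library function that compares strings character by character.

```python
import os


def split_two_column(row: dict) -> dict:
    """Split a (chosen, rejected) pair into prompt plus two responses."""
    chosen, rejected = row["chosen"], row["rejected"]
    # Longest character-wise common prefix of the two strings.
    prompt = os.path.commonprefix([chosen, rejected])
    return {
        "prompt": prompt,
        "chosen_response": chosen[len(prompt):],
        "rejected_response": rejected[len(prompt):],
    }


row = {
    "chosen": "Human: Hi\n\nAssistant: Hello!",
    "rejected": "Human: Hi\n\nAssistant: Go away.",
}
example = split_two_column(row)
```

Note that a character-level prefix can end in the middle of a word when both responses start with the same letters, which is one reason a real pipeline might prefer to split at a token or turn boundary.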
The column names in the dataset can vary, for example ["input", "chosen", "rejected"]. Our implementation allows the user to supply the dataset column names via the `train_data_columns` and `eval_data_columns` parameters.
Tunix requires left-padded prompts and right-padded responses. Our code implements this padding (and truncation if needed) and provides Tunix with the corresponding masks.
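The padding convention can be sketched with plain Python lists as follows. `pad_or_truncate` is a hypothetical helper rather than the PR's `_pad`, the pad ID of 0 is an assumption, and the choice to keep the last tokens of an over-long prompt follows the `x[-length:]` slice quoted in the review.

```python
def pad_or_truncate(ids, length, pad_id=0, left=False):
    """Return (ids, mask), each exactly `length` items long."""
    if left:
        # Prompts: keep the last `length` tokens, pad on the left.
        kept = ids[-length:] if length > 0 else []
    else:
        # Responses: keep the first `length` tokens, pad on the right.
        kept = ids[:length]
    pad = [pad_id] * (length - len(kept))
    mask = [1] * len(kept)  # 1 marks real tokens, 0 marks padding
    if left:
        return pad + kept, [0] * len(pad) + mask
    return kept + pad, mask + [0] * len(pad)


prompt_ids, prompt_mask = pad_or_truncate([5, 6, 7], 5, left=True)
chosen_ids, chosen_mask = pad_or_truncate([8, 9], 4)
```

Left-padding the prompt and right-padding the response places the last prompt token directly before the first response token when the two are concatenated.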
NOTE: Once this PR is merged, the legacy DPO will stop working correctly. The follow-up PRs will enable Tunix-based DPO.
Tests
Added unit tests. Ran DPO/ORPO and performed logits comparison against the legacy implementation.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.