build(deps): update trl requirement from <=0.21.0 to <=1.7.0#704

Open
dependabot[bot] wants to merge 1 commit into main from dependabot/pip/trl-lte-1.7.0

dependabot[bot] commented on behalf of github on Jun 29, 2026

Updates the requirements on trl to permit the latest version.
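To make the effect of the new ceiling concrete, here is a minimal sketch of how a `<=` upper bound admits or rejects a release. The helper names `parse_release` and `allowed` are hypothetical, and the parser only handles plain dotted versions such as `1.7.0` (no pre-releases or local tags); real installers follow the full PEP 440 rules.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as "1.7.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def allowed(candidate: str, ceiling: str) -> bool:
    """Return True when `candidate` satisfies the requirement `<=ceiling`."""
    return parse_release(candidate) <= parse_release(ceiling)


# The old ceiling rejects the 1.x line; the new one admits it up to 1.7.0.
print(allowed("1.7.0", "0.21.0"))  # False
print(allowed("1.7.0", "1.7.0"))   # True
print(allowed("0.21.0", "1.7.0"))  # True
```

Because the bound only widens, environments already pinned at or below 0.21.0 still satisfy the new requirement.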

Release notes

Sourced from trl's releases.

v1.7.0

Features

SFT default loss_type is now "chunked_nll"

The flip announced in v1.6 has landed. Setting loss_type is optional, and the default now resolves to "chunked_nll" — giving every SFTTrainer run ~30% less peak VRAM on average (up to ~50% on large-vocab models) with wall-clock time neutral or slightly faster. No action needed.

The auto-resolve falls back to "nll" when use_liger_kernel=True (the two paths are incompatible). If you want the old behavior — e.g. for custom heads — pin it explicitly:

SFTConfig(loss_type="nll")

by @qgallouedec in huggingface/trl#5846
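The default resolution described in this entry can be sketched as a small decision function. This is an illustration of the stated behavior, not TRL's code: `resolve_loss_type` is a hypothetical name, and representing "unset" as `None` is an assumption.

```python
from typing import Optional


def resolve_loss_type(loss_type: Optional[str], use_liger_kernel: bool) -> str:
    """Hypothetical helper mirroring the default resolution in the release note."""
    if loss_type is not None:
        # An explicit setting always wins, e.g. loss_type="nll" for custom heads.
        return loss_type
    # Unset (assumed to be None): fall back to "nll" under the Liger kernel,
    # otherwise pick the new "chunked_nll" default.
    return "nll" if use_liger_kernel else "chunked_nll"


print(resolve_loss_type(None, use_liger_kernel=False))   # chunked_nll
print(resolve_loss_type(None, use_liger_kernel=True))    # nll
print(resolve_loss_type("nll", use_liger_kernel=False))  # nll
```

For reviewers of this bump, the practical point is that any training script that leaves `loss_type` unset changes behavior once trl 1.7.0 is installed, unless it also enables the Liger kernel.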

MoE auxiliary loss in GRPO / RLOO / AsyncGRPO

Post-training MoE models now correctly include the router load-balancing auxiliary loss, matching the model's own reference forward and SFTTrainer. Enable via model_init_kwargs:

GRPOConfig(
    ...,
    model_init_kwargs={"output_router_logits": True, "router_aux_loss_coef": 0.001},
)

Plumbed through _get_per_token_logps_and_entropies (now returns a 3-tuple including aux_loss), folded into the policy loss with grad-accum scaling matched per trainer, and logged as aux_loss. AsyncGRPO recomputes it via load_balancing_loss_func in the chunked LM-head path (same as SFT's chunked path).

by @AmineDiro in huggingface/trl#6083, plus router_aux_loss_coef config wiring by @qgallouedec in huggingface/trl#6085
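As a rough illustration of what folding a router auxiliary loss into the policy loss means, the sketch below computes a Switch-style load-balancing term for top-1 routing in plain Python and adds it with the coefficient from the example above. The function names are hypothetical, the top-1 formulation is an assumption (the release note does not state the formula), and the per-trainer gradient-accumulation scaling is deliberately left out.

```python
def load_balancing_loss(router_probs: list[list[float]]) -> float:
    """Switch-style auxiliary loss for top-1 routing (illustrative only).

    router_probs[t][e] is the router probability of expert e for token t.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # Fraction of tokens whose highest-probability expert is e.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[probs.index(max(probs))] += 1
    frac_tokens = [count / num_tokens for count in counts]

    # Mean router probability assigned to each expert.
    mean_probs = [
        sum(probs[e] for probs in router_probs) / num_tokens
        for e in range(num_experts)
    ]

    # Equals 1.0 for perfectly balanced routing and grows as routing collapses.
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_probs))


def total_loss(policy_loss: float, aux_loss: float, router_aux_loss_coef: float = 0.001) -> float:
    """Add the weighted auxiliary term to the policy loss."""
    return policy_loss + router_aux_loss_coef * aux_loss


balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced < collapsed)  # True
```

With a coefficient as small as 0.001, the auxiliary term nudges the router toward balanced expert usage without dominating the policy objective.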

New experimental GMPO trainer

Geometric-Mean Policy Optimization lands as an experimental trainer. Replaces GRPO's per-token arithmetic mean of importance ratios with a sequence-level geometric mean (mean of clipped log-ratios, then exp); clipping is one-sided by advantage sign and applied in log space. Default epsilon=0.4 per the paper.

from trl.experimental.gmpo import GMPOConfig, GMPOTrainer

trainer = GMPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GMPOConfig(epsilon=0.4),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)

by @raghulchandramouli in huggingface/trl#6078
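The objective described in this entry (mean of clipped log-ratios, then exp) can be sketched for a single sequence in plain Python. This is one reading of the release note, not the trainer's implementation: `gmpo_sequence_loss` is a hypothetical name, and the choice of which side is clipped for each advantage sign is an assumption.

```python
import math


def gmpo_sequence_loss(
    new_logps: list[float],
    old_logps: list[float],
    advantage: float,
    epsilon: float = 0.4,
) -> float:
    """Illustrative sequence-level loss using a geometric mean of token ratios."""
    log_ratios = [new - old for new, old in zip(new_logps, old_logps)]

    if advantage >= 0:
        # Assumed: positive advantages cap how far a log-ratio can rise.
        clipped = [min(delta, epsilon) for delta in log_ratios]
    else:
        # Assumed: negative advantages cap how far a log-ratio can fall.
        clipped = [max(delta, -epsilon) for delta in log_ratios]

    # Mean in log space followed by exp is the geometric mean of the ratios.
    geo_mean_ratio = math.exp(sum(clipped) / len(clipped))
    return -advantage * geo_mean_ratio


loss = gmpo_sequence_loss([-1.0, -0.1], [-1.1, -1.0], advantage=1.0)
print(round(loss, 4))
```

Working in log space means a single outlier token can shift the sequence ratio by at most `epsilon / len(tokens)` in the exponent, which is the intuition behind the geometric mean being less sensitive than an arithmetic mean of per-token ratios.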

Transformers continuous batching in GRPO / RLOO

use_transformers_paged was deprecated in v1.4; it's now replaced with proper transformers continuous batching. The old branch silently bypassed importance-sampling correction (logprobs = None); the new path captures logprobs from output.logprobs and exposes a ContinuousBatchingConfig for KV-cache tuning.
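To show why missing sampler logprobs matter, here is a minimal sketch of a per-token importance-sampling correction between the generation backend and the training forward pass. The function name, the truncation cap of 2.0, and the truncated form of the weight are all assumptions for illustration; the release note only states that the old branch bypassed the correction when `logprobs` was `None`.

```python
import math
from typing import Optional


def importance_weights(
    train_logps: list[float],
    sampler_logps: Optional[list[float]],
    cap: float = 2.0,
) -> list[float]:
    """Per-token weights correcting for sampler/trainer probability mismatch."""
    if sampler_logps is None:
        # Without sampler logprobs no ratio can be formed, so every token
        # silently gets weight 1.0 and the correction is skipped.
        return [1.0] * len(train_logps)

    # Truncated ratio pi_train / pi_sampler, computed from log-probabilities.
    return [
        min(math.exp(train - sampler), cap)
        for train, sampler in zip(train_logps, sampler_logps)
    ]


print(importance_weights([-1.0, -2.0], None))   # [1.0, 1.0]
print(importance_weights([-0.1], [-3.0]))       # [2.0]
```

When the two backends agree exactly, the weights are all 1.0 in both branches, so the bypass only changes results when generation and training probabilities diverge.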

... (truncated)

Commits
  • 06b42c7 Release: v1.7 (#6184)
  • 8182ef5 Support LFM2-VL multimodal inputs in GRPO and RLOO (#6114)
  • bd128c5 Align KTO with DPO: Add VLM multi-image test (#6163)
  • 9915f71 Align format of code examples in docstrings (#6147)
  • 6d80c97 Align KTO with DPO: Add tests (#6160)
  • 96e12e5 Keep extra columns in unpair_preference_dataset (#6161)
  • 851d020 Replace parse_version with Version (#6164)
  • e63f67e Align KTO with DPO: Support sync_ref_model (#6152)
  • 230c076 Align KTO with DPO: Move all metrics computation from log to _compute_loss (#...
  • 99e3c65 Align KTO with DPO: Align order and signature of methods (#6149)
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Updates the requirements on [trl](https://github.com/huggingface/trl) to permit the latest version.
- [Release notes](https://github.com/huggingface/trl/releases)
- [Changelog](https://github.com/huggingface/trl/blob/main/RELEASE.md)
- [Commits](huggingface/trl@v0.2.0...v1.7.0)

---
updated-dependencies:
- dependency-name: trl
  dependency-version: 1.7.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>