Skip to content

Added metrics and fixed step counter#17

Open
matthewyryang wants to merge 8 commits into
prodfrom
steps-metrics
Open

Added metrics and fixed step counter#17
matthewyryang wants to merge 8 commits into
prodfrom
steps-metrics

Conversation

@matthewyryang
Copy link
Copy Markdown

@matthewyryang matthewyryang commented May 7, 2026

Problem

Agentic-RL W&B charts were unreadable:

  • Metrics in different panels didn't share a step axis (e.g., reward/returns plotted against W&B's auto _step while reward/rewards
    plotted against rollout/step)
  • The cumulative train-step number was nonmonotonic under dynamic batching (collided / went backwards)
  • Per-domain training metrics couldn't be split (only one global pg_loss)
  • The agent/* aggregator default-zeroed missing keys, so e.g. agent/turns_sum=0 was logged even when the agent never reported turns —
    looks like real data, isn't
  • Monkey-patches in ~/rl360/src/miles360/training_metrics_patches.py (432 lines) duplicated logic that now belongs in miles

Changes

Axis alignment. Every emission of a reward/* / optimization/* / policy_shift/* / train_inference_mismatch/* metric is now co-logged with
the bare counter it's supposed to plot against — rollout_step on the rollout side, train_step on the train side. W&B picks them up via
existing define_metric registration.

Monotone train-step. Replaced rollout_id × num_steps_per_rollout + step_id with a process-global _TRAIN_STEP_COUNTER. The old formula
collapsed under dynamic batching because num_steps_per_rollout shrinks across rollouts when total tokens drop. Counter is invariant.

Panel groupings. _TRAIN_METRIC_GROUPS and _ROLLOUT_DATA_METRIC_GROUPS maps in log_utils.py route metrics into the W&B sections reward/,
optimization/, policy_shift/, train_inference_mismatch/. No emission-site changes needed when a metric is added — just update the map.

Per-domain fan-out. pg_loss/, pg_clipfrac/, ppo_kl/, train_rollout_logprob_*/ are now routed into
// so each panel shows one curve per domain.

Sparse metric aggregation. _collect_values in src/agent360/harbor/miles/generate.py no longer defaults missing keys to 0. A fall-through
loop auto-emits agent/mean for any numeric scalar agents put in agent_metrics, so terminus-2's n_episodes, summarization_count,
n_input_tokens, n_output_tokens show up automatically. Names that no agent reports (turns, tool_calls, model_query_time
*) just aren't
logged — no misleading zeros.

Cleanup. Deleted src/sitecustomize.py and src/miles360/training_metrics_patches.py (432 lines of runtime monkey-patches) — that logic is
now upstream in ~/miles. Plus dropped useless reward/{correct,incorrect}/raw_reward panels (they were always exactly 0 or 1 by
construction).

Other side effects

  • agent.yaml: generate_multi_samples: false, rollout_top_p: 0.95 — match the working baseline.
  • scale.yaml iter: max_response_len: 4096, n_samples_per_prompt: 8 — survives all-AgentError rollouts statistically, ~2× faster per turn.

Validation

Run (pd-hicache-l3 --scale iter) - 10 rollouts end-to-end, no hangs.

📊 https://wandb.ai/mbzuai-llm/harbor-miles/runs/215ha1cw

PR on the RL360 side: https://github.com/LLM360/RL360/pull/308

@matthewyryang matthewyryang marked this pull request as ready for review May 14, 2026 20:13
@matthewyryang matthewyryang requested a review from a team as a code owner May 14, 2026 20:13
@matthewyryang matthewyryang marked this pull request as draft May 14, 2026 23:01
@matthewyryang matthewyryang marked this pull request as ready for review May 14, 2026 23:03
@nightlessbaron
Copy link
Copy Markdown

nightlessbaron commented May 18, 2026

Code review

Found 1 issue:

  1. New define_metric calls are missing step_metric, so samples_seen, train_step, and rollout_step will be plotted against W&B's internal auto-_step instead of the rollout axis — directly contradicting the PR's stated goal of axis alignment. Every other metric registered in the same function uses an explicit step_metric (e.g. wandb.define_metric("perf/*", step_metric="rollout/step")).

https://github.com/LLM360/miles/blob/700da89d13803cbc399247e0348db389ca5298d8/miles/utils/wandb_utils.py#L179-L183

Fix: add step_metric="rollout/step" to all three calls (or step_metric="train/step" for train_step if it should track the training axis).

@nightlessbaron
Copy link
Copy Markdown

nightlessbaron commented May 18, 2026

Follow-up: _TRAIN_STEP_COUNTER correctness

Two concerns with the global counter replacing rollout_id * num_steps_per_rollout + step_id:

  1. Not resume-safe. The counter is module-level and resets to 0 on process restart. On checkpoint resume, train/step will dip back to 0 in W&B rather than continuing from where it left off. The old formula was deterministic from rollout_id and step_id, so it was implicitly resume-safe.

  2. Shared across roles if actor and critic run in the same process. Both call log_train_step with different role= values but the same counter. If co-located, actor logs at steps 0, 2, 4… and critic at 1, 3, 5… — so they never share an x-axis position in W&B, making actor/critic metric comparison harder. The old formula gave both the same value for the same (rollout_id, step_id) pair.

https://github.com/LLM360/miles/blob/38677e04a6cd63ce406a9ef0b7f98e2f6f54a0a/miles/backends/training_utils/log_utils.py#L513-L515

If actor and critic are always separate Ray actors (separate processes), concern 2 is a non-issue — each process has its own counter. But concern 1 applies regardless.

@DavidBellamy DavidBellamy removed their assignment May 18, 2026
@matthewyryang
Copy link
Copy Markdown
Author

Fixed resume-safe problem, and tested:

Resumed run: https://wandb.ai/mbzuai-llm/Verification-Study/runs/33qdth6b
from: https://wandb.ai/mbzuai-llm/Verification-Study/runs/f7kbhjcy

The second issue is not a problem, as actor and critic are always separate Ray actors.

Copy link
Copy Markdown
Author

@matthewyryang matthewyryang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants