
Print a stable run-ID banner to stdout on run start/resume #119

Open: asaiacai wants to merge 3 commits into main from claude/magical-lovelace-Bn0Ay

@asaiacai asaiacai commented Jun 4, 2026

What

Emit a fixed-format line to stdout when a run starts or resumes:

pluto: run LV3-12 started (external_id=dhyecrvx)

(On the resume path the verb is resumed instead of started.)

Why

This comes from the linum-v3 feedback (item #7). When a multi-node training job crashes, the operator often only has the process logs and doesn't remember the run ID. Their /resume-crashed-run skill reverse-looks up the run by grepping the trainer's stdout — today it greps W&B's banner (wandb: setting up run …) and wants the equivalent for Pluto:

m = re.search(r"pluto:\s*run\s+(LV3-\d+)", line)
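A minimal sketch of how such a consumer might scan captured stdout, assuming the regex above. The helper name `find_display_id`, the sample log lines, and the choice to keep the last match (so a later resumed banner wins) are illustrative assumptions, not part of the /resume-crashed-run skill.

```python
import re
from typing import Iterable, Optional

# Pattern quoted in the PR description; the LV3- prefix is specific to that project.
BANNER_RE = re.compile(r"pluto:\s*run\s+(LV3-\d+)")


def find_display_id(lines: Iterable[str]) -> Optional[str]:
    """Return the display ID from the last matching banner line, or None."""
    found = None
    for line in lines:
        m = BANNER_RE.search(line)
        if m:
            # Keep the most recent banner (assumption: a resume banner may follow).
            found = m.group(1)
    return found


# Hypothetical trainer output used only for illustration.
sample_stdout = [
    "loading config...",
    "pluto: run LV3-12 started (external_id=dhyecrvx)",
    "step 100 loss=2.31",
]
print(find_display_id(sample_stdout))
```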

The only run-start output we emitted was a logger.info line (pluto/op.py), which:

  1. goes to stderr (the logging StreamHandler defaults to sys.stderr), but the consumer greps stdout;
  2. prints the numeric run ID (e.g. 151299), not the LV3-12 display ID;
  3. isn't a stable, greppable format (it's prose behind the logging system, subject to log level / disable_console / notebook handler stripping).
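Point 1 can be reproduced with the standard library alone. The sketch below is not Pluto code: the logger name `banner_demo` and the message text are made up for illustration. It shows that a `logging.StreamHandler()` created without a stream writes to stderr, while `print` writes to stdout.

```python
import contextlib
import io
import logging

out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    # StreamHandler() with no argument binds to the current sys.stderr.
    handler = logging.StreamHandler()
    demo_logger = logging.getLogger("banner_demo")  # hypothetical logger name
    demo_logger.setLevel(logging.INFO)
    demo_logger.propagate = False  # keep the root logger out of the demo
    demo_logger.addHandler(handler)

    demo_logger.info("run 151299 started")  # goes through logging, lands on stderr
    print("pluto: run LV3-12 started", flush=True)  # plain print, lands on stdout

    demo_logger.removeHandler(handler)

stdout_text = out.getvalue()
stderr_text = err.getvalue()
```

A consumer that only captures stdout would see the second line and never the first, which is the gap this PR closes.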

How

No server change needed — the POST /api/runs/create (and resume) response already returns displayId (verified in pluto-server web/server/routes/runs-openapi.ts). This:

  • captures displayId from the response into settings._display_id;
  • adds Op._print_run_banner(verb), a plain print(..., flush=True) to stdout (deliberately independent of the logging system so it can't be suppressed and always lands on stdout);
  • derives external_id as the sqid slug — the last path segment of the run URL (runUrl ends in sqidEncode(run.id)).

The existing logger.info lines are left unchanged.
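The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `Op._print_run_banner`: the function names, signatures, and the example URL are hypothetical, and the path is extracted with `urllib.parse.urlparse`, which matches the approach adopted later in this PR's review follow-up.

```python
import sys
from typing import Optional
from urllib.parse import urlparse


def format_run_banner(
    display_id: Optional[str], url_view: Optional[str], verb: str
) -> Optional[str]:
    """Build the banner line, or return None when there is no display ID."""
    if not display_id:
        return None  # no displayId in the response: stay silent
    line = f"pluto: run {display_id} {verb}"
    if url_view:
        # Only look at the URL path, so a host-only URL yields no external_id.
        segments = [s for s in urlparse(url_view).path.split("/") if s]
        if segments:
            line += f" (external_id={segments[-1]})"
    return line


def print_run_banner(
    display_id: Optional[str], url_view: Optional[str], verb: str = "started"
) -> None:
    """Print the banner with plain print(), bypassing the logging system."""
    line = format_run_banner(display_id, url_view, verb)
    if line is not None:
        print(line, file=sys.stdout, flush=True)


# Hypothetical run URL; only the last path segment matters here.
print_run_banner("LV3-12", "https://pluto.trainy.ai/o/acme/projects/demo/dhyecrvx")
```

Splitting formatting from printing is a choice made for this sketch so that the string can be checked without capturing stdout.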

Tests

tests/test_run_banner.py (6 cases, all passing):

  • banner goes to stdout, not stderr;
  • started / resumed verbs;
  • output satisfies the documented consumer regex;
  • trailing-slash URL handling;
  • no displayId → silent;
  • no URL → external_id omitted.
$ python -m pytest tests/test_run_banner.py
6 passed

ruff check / ruff format --check clean on changed files; existing TestNoopRunStatus Op-construction tests still pass.

Notes / scope

This is the in-scope, client-side slice of the linum-v3 notes. The other items are either web-frontend (#1, 2, 9–14) or MCP/backend (#3, 8) and live in other repos. #5 (GPU metrics) is already emitted by the SDK (pluto/sys.py); #6 (image captions) and #4 (string-valued checkpoint/* metrics) also have client-side slices in pluto/compat/wandb.py that could be follow-ups if wanted.

https://claude.ai/code/session_01DEs8exGc8X6WqmLkjbqEi2


Generated by Claude Code


Note

Low Risk
Client-only observability output with no changes to auth, uploads, or server APIs beyond reading an existing response field.

Overview
Adds a greppable stdout banner when a run starts or resumes, so operators can recover the Pluto display ID (e.g. LV3-12) from trainer logs—similar to W&B’s run banner.

After create/resume, the client stores displayId from the API response on Settings._display_id and calls new Op._print_run_banner, which prints to stdout (not the logger): pluto: run <display_id> started|resumed with optional external_id parsed from the last URL path segment. Missing displayId prints nothing; host-only URLs omit external_id. Existing logger.info run lines are unchanged.

tests/test_run_banner.py covers stdout vs stderr, verbs, regex compatibility, URL edge cases, and silent/no-external-id behavior.

Reviewed by Cursor Bugbot for commit 43bcbbf.

Emit a fixed-format line on stdout when a run starts or resumes so
external tooling can reverse-look up a run from a training process's
stdout (e.g. when a multi-node job crashes and the operator only has the
logs):

    pluto: run LV3-12 started (external_id=dhyecrvx)

Previously the only run-start output was a logger.info line that (a) went
to stderr via the logging StreamHandler, (b) printed the numeric run ID
rather than the LV3-12 display ID, and (c) wasn't a stable greppable
format. The server's create/resume response already returns displayId,
so this captures it (settings._display_id) and prints a plain stdout
line independent of the logging system. external_id is the sqid slug
parsed from the run URL.

Adds tests/test_run_banner.py covering stdout routing, the started/
resumed verbs, the consumer regex, trailing-slash URLs, and the
no-display-id / no-url fallbacks.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a stable, greppable run banner printed to stdout when a run starts or resumes, allowing external tooling to reverse-look up a run. It retrieves a _display_id from the server response, parses an external_id from the run URL, and adds comprehensive unit tests. The review feedback correctly points out a potential bug where a host-only URL would incorrectly parse the hostname as the external_id due to string splitting, and suggests a more robust parsing method using urllib.parse.urlparse along with an additional test case to cover this scenario.
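The host-only case is easy to reproduce in isolation. In the sketch below, `last_segment_naive` is an assumption about what plain string splitting would look like (the original code is not shown on this page), and `last_segment_from_path` illustrates the `urllib.parse.urlparse` approach the review suggests. Both helper names and the `/runs/` URL are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def last_segment_naive(url: str) -> str:
    """Split the whole URL on '/', which treats the hostname as a segment."""
    return url.rstrip("/").split("/")[-1]


def last_segment_from_path(url: str) -> Optional[str]:
    """Split only the URL path, returning None when the path has no segments."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else None


host_only = "https://pluto.trainy.ai"
with_slug = "https://pluto.trainy.ai/runs/dhyecrvx/"  # hypothetical path layout

print(last_segment_naive(host_only))      # the hostname leaks through
print(last_segment_from_path(host_only))  # no path segment, so None
print(last_segment_from_path(with_slug))
```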

Comment thread pluto/op.py Outdated
Comment thread tests/test_run_banner.py
Use urllib.parse.urlparse to extract the path before taking the last
segment, so a host-only url_view (e.g. https://pluto.trainy.ai with no
run slug) omits external_id instead of falling back to the hostname.
Adds a regression test for the host-only URL case.

Addresses review feedback on PR #119.

The previous run failed only on test_e2e_metrics_logged due to a 503
from pluto-api.trainy.ai mid-run (the e2e suite hits the live server);
not related to this change. Empty commit to re-run the matrix.