Optimize pr-docs-check agentic workflow to cut token usage#18119
Optimize pr-docs-check agentic workflow to cut token usage#18119IEvangelist wants to merge 6 commits into
Conversation
The PR Documentation Check agentic workflow repeatedly hit the AWF 25M effective-token hard cap (18 of 19 recent failures over 7 days). A deterministic create_pull_request failure (\No changes to commit\) drove a retry loop (~34 inference requests in failed run 27322389644) that accumulated effective tokens until the run hard-failed. Healthy runs use only ~4-5 requests. Changes: - Add \max-runs: 12\ (AWF proxy maxRuns) as a hard backstop. Skip runs use ~4-5 requests, the runaway used ~34; 12 stops any loop far below the 25M rail while leaving headroom for the doc-writing path. - Add an anti-loop guard to Step 10: emit create_pull_request exactly once, never re-emit after a deterministic failure; on 'No changes' bail to a single notify_source_pr (skipped) instead of looping. - Defer expensive PR/review comment-thread fetches to the doc-drafting path only; the cheap skip path no longer fetches them. - Condense the prompt body ~16%: collapse Step 4's six signal tables into a compact grouped reference, trim Step 3's target.json table, de-duplicate repeated justification prose. - Cap Step 8 'browse existing documentation' breadth to the affected feature area. - Add a repeatable build.ps1 helper (gh aw compile + compute_signals tests) and regenerate pr-docs-check.lock.yml. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 18119Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 18119" |
There was a problem hiding this comment.
Pull request overview
This PR addresses a recurring failure in the pr-docs-check agentic workflow where a deterministic create_pull_request retry loop caused 18 of 19 recent failures by exceeding the 25M effective-token cap. The fix applies multiple layers of defense: a hard iteration cap, anti-loop prompt guidance, deferred expensive operations, and prompt size reduction.
Changes:
- Adds
max-runs: 12to cap agent iterations (compiled tomaxRunsin the AWF proxy config), preventing runaway loops from reaching the 25M token rail. - Restructures the prompt to defer expensive comment-thread reads to the docs-drafting path only, condenses the signal catalog (~16% smaller), caps Step 8 doc browsing breadth, and adds an explicit anti-loop guard in Step 10 that prevents re-emitting
create_pull_requestafter an error. - Adds a repeatable
build.ps1helper for recompiling the workflow and running unit tests.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
.github/workflows/pr-docs-check/build.ps1 |
New PowerShell helper script to compile the workflow via gh aw compile and run compute_signals unit tests. |
.github/workflows/pr-docs-check.md |
Prompt source changes: max-runs: 12, deferred comment reads, condensed signal catalog, anti-loop guard in Step 10, and Step 8 browsing cap. |
.github/workflows/pr-docs-check.lock.yml |
Regenerated lock file reflecting the .md changes — hash/heredoc label churn plus maxRuns: 500 → 12 for the main agent job. |
- build.ps1: prefer python3 (matching the CI pre-agent step) with a probed fallback to python/py, rejecting the Windows Store stub and Python 2. Resolves the Copilot review comment about bare \python\ being absent or Python 2 on Linux/macOS. - build.ps1: add a gh aw version check against the pin in actions-lock.json, and a -VerifyParity guard that asserts the recompiled lock matches the committed lock byte-for-byte (catches hand-edits, missing recompiles, or a wrong gh aw version). - Make the Step 10 anti-loop guard generic: treat ANY deterministic create_pull_request failure (no commits, no diff, invalid patch, base/validation error, protected-file rejection) as non-retryable, not just the literal 'No changes to commit' string. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e build.py The PowerShell helper was not idiomatic for this repo (no other .ps1 under .github/workflows; CI is Linux and uses run steps + actions/github-script) and only checked lock parity locally. Replace it with: - .github/workflows/pr-docs-check-verify.yml: a hand-authored GitHub Actions workflow that, on PRs touching the pr-docs-check sources, installs the pinned gh-aw (version single-sourced from actions-lock.json), recompiles pr-docs-check, fails if the committed lock drifted, and runs the compute_signals tests. Closes the gap where nothing enforced lock parity in CI. - .github/workflows/pr-docs-check/build.py: a cross-platform local helper (matches the sibling compute_signals.py; no pwsh dependency) mirroring the same gh-aw version check, compile, optional --verify-parity, and tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per review feedback, "scripts for repeatable bits" means offloading deterministic work inside the agentic flow (the compute_signals.py pattern), not adding compile/CI meta-tooling. This reverts the verify workflow + build.py and instead moves two deterministic agent steps to tested pre-agent scripts so the agent reads a result file instead of making GitHub tool calls and reasoning inside the token-expensive loop. - Remove pr-docs-check-verify.yml and build.py (the meta-tooling). - Add compute_pr_context.py (+ tests): reuse the PR + files payloads the signals step already fetches to write .pr-docs-check/pr.json, the curated metadata the agent reads in Step 1 (no diff patches; those are fetched on demand only on the drafting path). - Add resolve_sme.py (+ tests): port the deterministic SME-selection algorithm (Step 2) to a pre-agent step writing .pr-docs-check/sme.json from assignees + reviews; only the fuzzy CODEOWNERS hint stays in the agent, signalled via needs_codeowners_fallback. - Rewrite agent Step 1 and Step 2 to read pr.json / sme.json. - Fetch reviews once in the signals step; regenerate the lock. Tests: 46 pass (20 compute_signals + 8 compute_pr_context + 18 resolve_sme). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A backport PR ports an already-merged change onto a release/* branch and is documented against its original forward PR on the default branch, so it must never spawn its own aspire.dev docs PR. Because the workflow runs on release/* merges as well as main, merged backports currently reach the agent and could draft a redundant docs PR. Add deterministic backport detection to compute_signals.py via a new detect_backport() helper that matches any of: base branch release/*, head branch backport/*, title prefixed [release/...], a "Backport of #N" body marker, or a backport label. signals.json now carries excluded + exclusion_reasons and a non-gating is_backport signal; when excluded, the recommendation is forced to docs_optional. The agent prompt gains a Step 5 exclusion branch (checked first, overrides the ambiguity rule) so an excluded PR is definitively skipped with an "excluded -> <reason>" notify_source_pr summary. - compute_signals.py: detect_backport(), is_backport (non-gating), excluded/exclusion_reasons output, recommendation override. - test_compute_signals.py: +10 backport tests (each marker, multi-reason, excluded-overrides-gating, negative main PR, default-safe missing keys). - pr-docs-check.md: Step 4 fields table + short-circuit note, Step 5 exclusion branch, ambiguity-rule carve-out, Step 6 summary branch. - Regenerated pr-docs-check.lock.yml (body_hash only; recompile parity OK). All 56 compute_signals/compute_pr_context/resolve_sme tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The pr-docs-check agentic flow only runs the three pre-agent scripts that produce data the agent reads (compute_signals.py -> signals.json, compute_pr_context.py -> pr.json, resolve_sme.py -> sme.json). The test_*.py files were never part of the flow; the only thing that ran them was the pr-docs-check-verify.yml CI check, which was removed earlier in this PR as meta-tooling around the flow rather than scripts within it. That left test_compute_pr_context.py and test_resolve_sme.py orphaned. Remove the two test files introduced in this PR and drop the now-stale "Running the tests" section from those scripts' docstrings. test_compute_signals.py is kept: it predates this PR (shipped on main with compute_signals.py) and guards the highest-risk piece — the regex signal catalog plus the new backport detection. Its 30 tests (20 original + 10 backport) still pass. The compiled lock is unaffected (recompile parity). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| _NON_GATING_SIGNALS = frozenset({"only_test_or_build_changes", "is_backport"}) | ||
|
|
||
|
|
||
|
|
||
|
|
| _NON_GATING_SIGNALS = frozenset({"only_test_or_build_changes", "is_backport"}) | ||
|
|
||
|
|
||
|
|
||
|
|
| def resolve_sme(pr: dict, reviews: list[dict]) -> dict: | ||
| """Resolve the SME from curated PR context + raw reviews. | ||
|
|
||
| Returns a dict with: | ||
| * sme_login the chosen login (no '@'), or "" if undecided | ||
| * sme_source how it was chosen (for transparency / logs) | ||
| * needs_codeowners_fallback True only when the agent should consult | ||
| CODEOWNERS as a last-resort hint | ||
| * candidates eligible reviewers (login + latest state) | ||
| """ | ||
| author = pr.get("author") or {} | ||
| author_login = (author.get("login") or "").lower() | ||
| assignees = [a for a in (pr.get("assignees") or []) if a] | ||
| latest_reviews = _latest_review_by_reviewer(reviews) | ||
|
|
||
| # --- Step 2a: Copilot-authored PRs -------------------------------------- | ||
| if is_copilot_authored(author): | ||
| human_assignees = [a for a in assignees if not is_bot(a)] | ||
| if len(human_assignees) == 1: | ||
| return _add_candidates(_result(human_assignees[0], "copilot_originator"), latest_reviews, author_login) | ||
| if len(human_assignees) > 1: | ||
| # Prefer the assignee whose latest review is APPROVED; if still | ||
| # ambiguous, the one appearing earliest in assignees[]. | ||
| approved = [a for a in human_assignees if latest_reviews.get(a, {}).get("state") == "APPROVED"] | ||
| if approved: | ||
| return _add_candidates(_result(approved[0], "copilot_originator_approved"), latest_reviews, author_login) | ||
| return _add_candidates(_result(human_assignees[0], "copilot_originator"), latest_reviews, author_login) | ||
| # No human assignees (unusual) — fall through to the reviewer logic. | ||
|
|
||
| # --- Step 2b: human-authored PRs (or 2a fallthrough) -------------------- | ||
| eligible = { | ||
| login: info | ||
| for login, info in latest_reviews.items() | ||
| if login.lower() != author_login and not is_bot(login) | ||
| } | ||
|
|
||
| # Prefer APPROVED reviewers; among them the most recent. | ||
| approved = {login: info for login, info in eligible.items() if info["state"] == "APPROVED"} | ||
| if approved: | ||
| chosen = max(approved.items(), key=lambda kv: kv[1]["submitted_at"])[0] | ||
| return _add_candidates(_result(chosen, "approved_reviewer"), latest_reviews, author_login) | ||
|
|
||
| # Fallback A: a reviewer whose latest state is substantive but not a bare | ||
| # COMMENTED (e.g. CHANGES_REQUESTED), most recent first. | ||
| substantive = {login: info for login, info in eligible.items() if info["state"] not in ("", "COMMENTED")} | ||
| if substantive: | ||
| chosen = max(substantive.items(), key=lambda kv: kv[1]["submitted_at"])[0] | ||
| return _add_candidates(_result(chosen, "substantive_reviewer"), latest_reviews, author_login) | ||
|
|
||
| # Fallback B: no usable reviewer signal at all — let the agent consult | ||
| # CODEOWNERS as a hint. (Glob matching against changed files is fuzzy, so | ||
| # it intentionally stays in the agent rather than here.) | ||
| if not eligible: | ||
| return _add_candidates(_result("", "none", needs_codeowners_fallback=True), latest_reviews, author_login) | ||
|
|
||
| # Reviewers exist but only ever COMMENTED: per Step 2b this is not a strong | ||
| # enough signal to pick one, and CODEOWNERS is only for "no reviews at all". | ||
| # Leave the SME empty; the workflow drafts without an explicit reviewer. | ||
| return _add_candidates(_result("", "none"), latest_reviews, author_login) |
|
❓ CLI E2E Tests unknown — 114 passed, 0 failed, 2 unknown (commit View all recordings
📹 Recordings uploaded automatically from CI run #27364557363 |
Problem
The PR Documentation Check workflow (
.github/workflows/pr-docs-check.md→ compiledpr-docs-check.lock.yml) runs the Copilot CLI engine (claude-sonnet-4.6) inside the AWF firewall/proxy. The proxy enforces two caps from itsapiProxyconfig and refuses inference when either trips:maxRuns(per-run inference-request cap) — defaulted to 500maxEffectiveTokens— 25,000,000 (25M) hard capOver the last 7 days, 18 of 19 failed runs died at the
Execute GitHub Copilot CLIstep with the same error:CAPIError: 429 Maximum effective tokens exceeded, landing just over the 25M rail (observed 25.0M–25.8M), logged asAWF effective-token hard rail hit — not retrying or continuing. (The 19th failure was an unrelated transientcurl 504while installing the CLI — infra flake.) The 500-turn cap was never reached.Root cause: a deterministic retry loop
A
create_pull_requestsafe-output failure spirals into a retry loop. In failed run27322389644the safe output returned{"result":"error","error":"No changes to commit - no commits found"}7 times whilecreate_pull_requestwas referenced 80× — the agent kept re-emitting it instead of stopping, accumulating effective tokens each turn until it crossed 25M.Calibration from AWF audit data (
gh aw audit)27246693222)27322389644)Healthy runs use only ~4–5 requests; the runaway used ~34 to reach the rail.
Changes
max-runs: 12(compiles to AWF proxymaxRuns). A hard backstop that sits well above a legitimate run (the doc-writing path needs a few more requests than a skip to read the skill, browse docs, write files, and open the PR) yet far below the ~34-request runaway — so any loop is stopped long before the 25M token rail. (Replaces the default of 500, which never fired.)Generic anti-loop guard in Step 10 — emit
create_pull_requestexactly once, only after writing doc changes, and never re-emit after a failure. The guard treats any deterministiccreate_pull_requestfailure as non-retryable (no commits / no diff / empty or invalid patch / base-branch or validation error / protected-file rejection), not just the literalNo changes to commitstring. On such a failure, bail to a singlenotify_source_pr(skipped) instead of looping. This removes the loop at its behavioral source.Defer expensive comment-thread reads to the doc-drafting path only — the cheap skip path (the common case) no longer fetches the PR issue/review comment threads; they are fetched only once Step 5 confirms docs are required.
Condense the prompt body (the body is re-sent every turn, so this compounds): collapse Step 4's six verbose signal-catalog tables into a compact grouped reference (the catalog is already defined in
compute_signals.pyand surfaced insignals.jsonat runtime), trim Step 3'starget.jsontable to the fields actually used, and de-duplicate repeated justification prose. Nothing the agent reads at decision time was removed — the gate remainssignals.json'srecommendation/triggered_signals/evidence.Cap Step 8 "browse existing documentation" breadth to the affected feature area instead of broad exploration.
Offload repeatable, deterministic flow tasks to pre-agent scripts (extending the existing
compute_signals.pypattern). These steps were previously executed inside the agent loop as GitHub tool calls plus reasoning — work that is fully deterministic and whose verbose API responses were re-sent on every subsequent turn. Moving them into thepre-agent-steps:shell step (which already fetches the PR + files payloads) means the agent now just reads a small result file:compute_pr_context.pywrites.pr-docs-check/pr.json— the curated PR metadata (number, title, body, author, base ref, milestone, labels, assignees, linked issues, and changed-file list). Diff patches are deliberately excluded and fetched on demand only on the drafting path. Step 1 now reads this file instead of issuing severalget PR/list files/ linked-issue tool calls.resolve_sme.pywrites.pr-docs-check/sme.json— the subject-matter expert resolved deterministically from assignees + reviews (a faithful port of Step 2's selection algorithm). Step 2 now readssme_login/sme_source; only the genuinely fuzzy CODEOWNERS hint stays in the agent, signalled vianeeds_codeowners_fallback.This change set also removes the earlier
pr-docs-check-verify.ymlCI workflow andbuild.pyhelper: those were compile/lock-parity meta-tooling around the workflow, not the "scripts for repeatable bits within the flow" that were intended.Backstop sanity-check (does
max-runs: 12trip below 25M?)Yes, with large margin. Anchoring on the only run we have that actually hit the rail —
27322389644, which reached 25.2M effective tokens in 34 requests — the linear average is ≈ 740k effective tokens/request. Atmax-runs: 12the worst case is therefore ≈12 × 740k ≈ 8.9M effective tokens, about 36% of the 25M cap. And because per-turn cost grows as conversation context accumulates (the full transcript is re-sent each turn), the first 12 turns are individually cheaper than that run's average turn, so the true worst case is strictly below 8.9M. The trimmed prompt and the in-flow offloads lower it further still.This is also why 12 was chosen over the originally-suggested 25: at
max-runs: 25the worst case is ≈25/34of the rail-hitting run, and since later turns cost more than the average, the cumulative could climb toward ~20M+ — under the cap but with a thin margin.max-runs: 12guarantees the backstop is a floor well under the cap, not merely "a few × the norm," while still leaving the legitimate doc-writing path room to complete.Expected token impact
max-runs: 12backstop plus the generic anti-loop guard stop the retry loop after a handful of requests rather than ~34, keeping any pathological run an order of magnitude below the 25M rail (≤ ~8.9M effective worst case, see sanity-check above).Validation
gh aw compile pr-docs-check(gh-aw v0.77.5, matching.github/aw/actions-lock.json): 0 errors, 0 warnings. The committedpr-docs-check.lock.ymlis byte-identical to a fresh recompile (parity verified), and recompiling the unchanged source twice is idempotent.test_compute_signals.py(20 catalog scenarios + 10 backport cases) — viapython -m unittest discover -s .github/workflows/pr-docs-check -v.compute_pr_context.pyandresolve_sme.pycarry no unit tests: those orphaned test files were removed (nothing in the flow runs them) and both helpers are covered by the end-to-end smoke test below.compute_pr_context.py→pr.json→resolve_sme.py→sme.json).Update: exclude backport PRs
The workflow triggers on
pull_request: closedfor bothmainandrelease/*(see theon:block), so a merged backport reaches the agentand could draft a redundant aspire.dev docs PR. A backport is documented
against its original (forward) PR on the default branch, so re-documenting it
is pure duplicate noise.
Backport detection is now deterministic in
compute_signals.py(
detect_backport()), matching the/backportbot shape(
.github/workflows/backport.yml) and explicit/manual backports via anyof: base branch
release/*, head branchbackport/*, title prefixed[release/...], aBackport of #Nbody marker, or abackportlabel.signals.jsonnow carriesexcluded+exclusion_reasonsand a non-gatingis_backportsignal; whenexcludedis true therecommendationis forcedto
docs_optional. The agent prompt gains a Step 5 exclusion branch(checked first, overrides
recommendationand the ambiguity rule) so anexcluded PR is definitively skipped with an
excluded → <reason>notify_source_prsummary — it never drafts. Excluding release-base PRs issafe for the release-targeting feature, which operates on
main-base PRs viathe milestone resolver.
Tests: +10 backport cases added to
test_compute_signals.py(eachmarker, multi-reason, excluded-overrides-gating, negative main PR,
default-safe when base/head/title absent); 30/30 pass. Lock regenerated (
body_hashonly; recompile parity verified).Notes
pr-docs-check.lock.ymlis generated — regenerated here viagh aw compile, never hand-edited.max-runsis the correct top-level field in gh-aw v0.77.5 (the compiler rejectsmax-turns); it is engine-agnostic (enforced by the AWF proxy).target.json/signals.jsonpattern: deterministic work runs once before the agent and is read from a file, keeping the token-expensive agent loop focused on judgment (deciding whether docs are needed and authoring them).🤖 Generated with GitHub Copilot CLI
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com