ci: dedupe quality gates and add fast-fail pre-check#345
Conversation
runpod-Henrik
left a comment
There was a problem hiding this comment.
Henrik's AI-Powered Bug Finder — PR #345 Review
Verdict: PASS WITH NITS (with one repo-admin action required before merge)
CI-only refactor. Local SDK behavior is unchanged. The optimization claims hold up (75% compute reduction for the typical PR-fixup loop; full matrix dedupe on merge-to-main). CI on this PR proves the new pipeline works: pre-check 10s, all four quality-gates legs 4-4:30 min, build skipped per the new if: guard.
The PR description already flags the branch-protection compatibility issue, which is the real release-blocker.
1. Issue: branch-protection compatibility on release-please bot PRs
The PR description calls this out explicitly. Recapping the QA risk in user terms:
- After this PR, a release-please bot PR that touches only
CHANGELOG.md+.release-please-manifest.jsoncausesci.ymlto be skipped viapaths-ignore— workflow doesn't run at all. GitHub branch protection treatspaths-ignoreskips as "check satisfied," so the bot PR can merge cleanly. - But if branch protection on
mainrequires a check namedCI / Quality Gates(4 statuses, one per matrix leg), and the workflow never runs, those status checks are simply absent. Whether branch protection blocks "missing required check" or accepts "not run because path-ignored" depends on a setting on the rule ("Require status checks to pass before merging" with vs. without "Strict — Require branches to be up to date"). Repo admin needs to verify this onmainbefore merging.
If the rule is strict and missing checks block, the bot PR will be permanently un-mergeable.
A safer path is the PR description's option 2: add a no-op success job named to match the required check (e.g., name: Quality Gates) that fires only on release-please--* branches.
This is not a code defect — it's a deployment-coordination requirement. Flagging it as a hard gate so it doesn't get missed.
2. Issue: PR-time wheel build is dropped
build job now runs only on push: main (`if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}`). That means make build + validate-wheel.sh no longer run on PR.
User scenario: a contributor changes pyproject.toml (e.g., adjusts [tool.setuptools.package-data], adds a non-Python file that needs include_package_data). The PR's CI passes (quality-gates is pure pytest), the wheel-validation regression lands on main, and the next PR's build job — or worse, release-please's PyPI publish — is the first signal.
The PR's rationale: "a PR-time wheel build would add ~1 min for no extra signal beyond quality-gates." That's mostly true — but validate-wheel.sh is wheel-packaging-specific signal that quality-gates doesn't catch. Examples that would slip past:
- Stale
package-dataglob after a directory rename. runpod_flash/rules/AGENTS.md(just added in 1.17.0) accidentally excluded from the wheel.- A
MANIFEST.inregression.
Suggest gating the build job on the actual files that affect packaging:
```yaml
if: ${{ (github.event_name == 'push' && github.ref == 'refs/heads/main') ||
(github.event_name == 'pull_request' && contains(github.event.pull_request.changed_files, 'pyproject.toml')) }}
```
Or even simpler: keep build on PRs, but only when pyproject.toml, Makefile, or scripts/validate-wheel.sh are touched. Cheap signal preserved.
3. Question: redundant skip-guards for release-please bot PRs
ci.yml skips release-please bot PRs in two places:
- Workflow-level
paths-ignore: [CHANGELOG.md, .release-please-manifest.json]— workflow doesn't run if all changed files match. - Job-level
if: ${{ !startsWith(github.head_ref, 'release-please--') }}onpre-check.
These overlap but aren't equivalent:
- If the bot PR ever touches a non-ignored file (e.g., the bot decides to update a workflow file),
paths-ignoreno longer applies → workflow runs. - Then
pre-checkskips via theif:, andquality-gates needs: [pre-check]will also skip per defaultneedssemantics → no quality signal on a release-please PR that's broader than CHANGELOG.
Is the intent "always skip on release-please branches even if it touches code," or "skip only when the change is metadata-only"? If the former, this is consistent. If the latter, drop the if: on pre-check and rely on paths-ignore alone.
Confirming the intent now is cheap; debugging which guard skipped which check after the fact is not.
4. Question: does make dev (without uv pip install -e .) make the flash CLI entry point importable?
User scenario: a contributor clones the repo, runs make dev, then flash --version.
Before: uv sync --all-groups + uv pip install -e . — the editable install registered the flash = runpod_flash.cli.main:app entry point in the venv.
After: uv sync --all-groups alone.
Whether uv sync registers the entry point for the project depends on whether the project has [tool.uv.sources] / workspace config that includes itself. CI passes because the matrix runs make ci-quality-github which uses uv run pytest … — that activates the project context regardless.
But a contributor running flash outside uv run (i.e., just flash --version from an activated .venv) might find it missing. This is a developer-onboarding hazard, not a production user issue. Worth a sanity check: from a clean clone, make dev followed by .venv/bin/flash --version — does it print 1.17.0, or "command not found"?
If "command not found," restore the uv pip install -e . and call it explicitly redundant-but-required for entry-point registration.
5. Nit: local make quality-check UX now noisier
quality-check aliases ci-quality-github, which means local devs running it see GitHub Actions annotation markers (::group::, ::endgroup::) sprinkled in their output. Functionally fine — the markers are inert outside Actions — but it's a minor UX regression for the most common local quality-check invocation.
If keeping single-source-of-truth is the priority (it should be), at least add a brief note in the help text that markers in the output are intentional. Or have quality-check set an env var that conditionally suppresses the @echo "::group::..." lines.
Nits
Makefile:75test-coveragecomment update mentions "serial pass for state isolation" — good clarification, matches the PR description's rationale for keeping the two-pass.e2e.ymlloses theunit-testsjob entirely. That's fine because the workflow isworkflow_dispatchonly (manually triggered E2E runs), and the manual user always runsci.ymlfirst via PR. But worth a one-line comment in the workflow file saying so, otherwise a future maintainer might add it back assuming it was dropped by accident.pre-checkrunsuvx ruff …which downloads ruff into a temporary venv every time. Cached viasetup-uv'scache-dependency-glob: pyproject.toml, so warm runs are fast — but on a first-of-day cold cache, ~30s vs ~10s. Acceptable tradeoff.- Trailing newline restored in
release-please.ymlfinal line. Good.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
There was a problem hiding this comment.
Pull request overview
Removes redundant CI work that was costing ~24 min of compute per merge and ~28 min per release-please bot PR. Centralizes the quality gate as a single canonical Makefile target, fixes a silent junit overwrite, and adds a fast-fail format/lint pre-check plus PR-cancel concurrency to ci.yml. Drops the duplicated 4-version matrix from release-please.yml and the duplicated unit-tests job (plus its artifact round-trip summary) from e2e.yml.
Changes:
Makefile: drop redundant editable install indev; aliasquality-checktoci-quality-github; split pytest passes into distinctpytest-results-parallel.xml/pytest-results-serial.xml.ci.yml: add concurrency group,paths-ignorefor release-please bot files, fast-failpre-checkjob, gatebuildtopush: main, bumpsetup-uv@v2 → v5, uploadpytest-results-*.xml.release-please.yml/e2e.yml: remove duplicatedquality-gatesmatrix andunit-tests/summaryjobs; bumpsetup-uv@v2 → v5; fold summary into thee2ejob.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| Makefile | Simplifies dev, makes ci-quality-github the canonical gate, fixes junit filename collision. |
| .github/workflows/ci.yml | Adds concurrency, paths-ignore, pre-check job; conditions build on push:main; v5 setup-uv. |
| .github/workflows/release-please.yml | Removes duplicate quality-gates matrix; release-please is the entry point; v5 setup-uv. |
| .github/workflows/e2e.yml | Drops duplicate unit-tests and summary jobs; folds summary into e2e job. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…release-please bot Addresses @runpod-Henrik (PASS WITH NITS) and Copilot review on PR #345. Henrik #3 / Copilot inline ci.yml:50 — skip cascade - Previously `pre-check.if` skipped on release-please bot PRs, and `quality-gates needs: [pre-check]` cascaded to skip too, leaving any release-please PR that touched non-metadata files with zero CI signal. - Drop the workflow-level `paths-ignore` (it interacted badly with branch protection's required-check semantics anyway) and instead put explicit `if: !startsWith(github.head_ref, 'release-please--')` on pre-check, quality-gates, packaging-guard, and build. Symmetric, auditable. Henrik #1 + #6 — branch-protection compatibility (release-please bot PRs) - Add `release-please-sentinel` job with matching matrix + name "Quality Gates" so it produces the same "Quality Gates (3.10)"…"(3.13)" check names that branch protection requires, but auto-passes. Branch protection on main can now require those checks without admins having to either loosen the rule or hand-craft an external sentinel workflow. Henrik #2 — PR-time wheel build is dropped - Reintroduce build on PRs but gated by `packaging-guard` job that inspects `git diff` against the PR base. Build runs when: - pyproject.toml / Makefile / MANIFEST.in / scripts/validate-wheel.sh changed, OR - any new non-Python file was added under src/ (data files need explicit package-data inclusion or they're silently dropped). - Push to main always builds. Henrik #4 — `make dev` entry-point sanity check - Verified locally: `rm -rf .venv && make dev && .venv/bin/flash --version` prints "Runpod Flash CLI v1.16.0". `uv sync` registers the entry point for the workspace project; the dropped `uv pip install -e .` was redundant. No code change needed. Henrik #5 / Copilot Makefile:108 — misleading "plain output" comment - Rewrite the comment block above `quality-check`. The aliases do NOT produce plain output -- local invocations see GitHub Actions annotation markers (::group::, --output-format=github) which are cosmetically noisy but functionally inert. Comment now reflects this. Henrik misc nit — e2e.yml `unit-tests` job removal - Add a leading comment on the `e2e` job explaining why the duplicate unit-tests job was dropped, so a future maintainer doesn't re-add it thinking it was lost by accident. https://linear.app/runpod/issue/AE-3161
|
Addressed all review feedback in 1136d5c. @runpod-Henrik items:
Skipped (already acceptable per your review):
Both Copilot inline threads resolved. Ready for re-review. |
Addresses Copilot's second review on PR #345. Copilot ci.yml:42 — sentinel-bypass vulnerability - Previous sentinel gated only on `startsWith(head_ref, 'release-please--')`. A contributor could open a PR from any branch named with that prefix and inherit the sentinel's auto-passes for "Quality Gates (3.10..3.13)" while pre-check / quality-gates / build all skipped via the same `if:` guard. Net: CI fully bypassed, branch protection satisfied, code unreviewed. - Replace the three independent `if:` gates with a single `guard` job that inspects the actual PR diff and classifies the run as either `release-please-bypass` (branch prefix AND diff is exactly CHANGELOG.md + .release-please-manifest.json) or `full`. `release-please-sentinel` only fires when classification is bypass; pre-check / quality-gates / build fire only when classification is full. A contributor naming a branch release-please-- but changing code falls through to full CI. - Also folds the previous packaging-guard logic into the same job; build reads `should_build` from the unified guard output. Copilot ci.yml:59 — ruff version drift in pre-check - `uvx ruff …` installed the latest ruff at runtime, which can drift from the locked version (uv.lock pins ruff 0.14.14) used by `make ci-quality-github`. That produces flaky "passes locally / fails CI" (or vice versa) verdicts when ruff cuts a release between runs. - Pre-check now runs `uv sync --only-group dev --frozen` then `uv run ruff …`. Pulls the exact pinned version; warm-cache install is ~1s, cold is a few seconds. Fast-fail premise preserved; determinism gained. https://linear.app/runpod/issue/AE-3161
|
Addressed Copilot's second review in 1a16897. Sentinel-bypass vulnerability (3 dup threads at Ruff version drift in pre-check (3 dup threads at All 6 threads resolved (3 dups each). Ready for re-review. |
Eliminates redundant CI/CD work that was costing ~24 min of compute per merge and ~28 min per release-please bot PR push. Workflow changes: - ci.yml: add concurrency group (cancel in-flight PR runs), paths-ignore for release-please bot PRs, new pre-check job (uvx ruff format+check, ~10s no-install fast-fail), bump setup-uv v2 -> v5, build job only on push:main since release-please.yml builds + publishes the same SHA. - release-please.yml: remove duplicate quality-gates job. Branch protection on main already enforces ci.yml green before code lands, so the second matrix validated nothing. Saves a full 4-leg matrix per merge. - e2e.yml: drop unit-tests job (duplicated ci.yml quality-gates), fold the summary step into the e2e job to remove the third workflow definition of the same install setup. Makefile changes: - dev: drop redundant 'uv pip install -e .' (uv sync handles editable install in workspace mode). - quality-check: alias ci-quality-github so there is one canonical CI quality gate, no drift between local and CI. - ci-quality-github / test-coverage: fix the latent junit overwrite bug where both pytest invocations wrote pytest-results.xml -- now writes pytest-results-parallel.xml and pytest-results-serial.xml. The double pytest pass is retained because it provides process isolation between parallel and serial tests; collapsing to a single loadgroup run surfaced state pollution in test_load_balancer_sls_stub. Note: removing quality-gates from release-please.yml + paths-ignore for the bot's files means the release-please PR will have no required CI check. Branch protection on main will need to allow missing checks for the release-please--* branch pattern, or add a no-op sentinel job named to match the protection rule. https://linear.app/runpod/issue/AE-3161
…release-please bot Addresses @runpod-Henrik (PASS WITH NITS) and Copilot review on PR #345. Henrik #3 / Copilot inline ci.yml:50 — skip cascade - Previously `pre-check.if` skipped on release-please bot PRs, and `quality-gates needs: [pre-check]` cascaded to skip too, leaving any release-please PR that touched non-metadata files with zero CI signal. - Drop the workflow-level `paths-ignore` (it interacted badly with branch protection's required-check semantics anyway) and instead put explicit `if: !startsWith(github.head_ref, 'release-please--')` on pre-check, quality-gates, packaging-guard, and build. Symmetric, auditable. Henrik #1 + #6 — branch-protection compatibility (release-please bot PRs) - Add `release-please-sentinel` job with matching matrix + name "Quality Gates" so it produces the same "Quality Gates (3.10)"…"(3.13)" check names that branch protection requires, but auto-passes. Branch protection on main can now require those checks without admins having to either loosen the rule or hand-craft an external sentinel workflow. Henrik #2 — PR-time wheel build is dropped - Reintroduce build on PRs but gated by `packaging-guard` job that inspects `git diff` against the PR base. Build runs when: - pyproject.toml / Makefile / MANIFEST.in / scripts/validate-wheel.sh changed, OR - any new non-Python file was added under src/ (data files need explicit package-data inclusion or they're silently dropped). - Push to main always builds. Henrik #4 — `make dev` entry-point sanity check - Verified locally: `rm -rf .venv && make dev && .venv/bin/flash --version` prints "Runpod Flash CLI v1.16.0". `uv sync` registers the entry point for the workspace project; the dropped `uv pip install -e .` was redundant. No code change needed. Henrik #5 / Copilot Makefile:108 — misleading "plain output" comment - Rewrite the comment block above `quality-check`. The aliases do NOT produce plain output -- local invocations see GitHub Actions annotation markers (::group::, --output-format=github) which are cosmetically noisy but functionally inert. Comment now reflects this. Henrik misc nit — e2e.yml `unit-tests` job removal - Add a leading comment on the `e2e` job explaining why the duplicate unit-tests job was dropped, so a future maintainer doesn't re-add it thinking it was lost by accident. https://linear.app/runpod/issue/AE-3161
Addresses Copilot's second review on PR #345. Copilot ci.yml:42 — sentinel-bypass vulnerability - Previous sentinel gated only on `startsWith(head_ref, 'release-please--')`. A contributor could open a PR from any branch named with that prefix and inherit the sentinel's auto-passes for "Quality Gates (3.10..3.13)" while pre-check / quality-gates / build all skipped via the same `if:` guard. Net: CI fully bypassed, branch protection satisfied, code unreviewed. - Replace the three independent `if:` gates with a single `guard` job that inspects the actual PR diff and classifies the run as either `release-please-bypass` (branch prefix AND diff is exactly CHANGELOG.md + .release-please-manifest.json) or `full`. `release-please-sentinel` only fires when classification is bypass; pre-check / quality-gates / build fire only when classification is full. A contributor naming a branch release-please-- but changing code falls through to full CI. - Also folds the previous packaging-guard logic into the same job; build reads `should_build` from the unified guard output. Copilot ci.yml:59 — ruff version drift in pre-check - `uvx ruff …` installed the latest ruff at runtime, which can drift from the locked version (uv.lock pins ruff 0.14.14) used by `make ci-quality-github`. That produces flaky "passes locally / fails CI" (or vice versa) verdicts when ruff cuts a release between runs. - Pre-check now runs `uv sync --only-group dev --frozen` then `uv run ruff …`. Pulls the exact pinned version; warm-cache install is ~1s, cold is a few seconds. Fast-fail premise preserved; determinism gained. https://linear.app/runpod/issue/AE-3161
f748fae to
64bc5fb
Compare
Borrowed from runpod-python (ci.yml:105) -- one required check that
inspects upstream `needs.*.result` and treats "success" or "skipped" as
green. Much cleaner than the previous sentinel-with-matching-matrix-names
hack:
- Drops `release-please-sentinel` job entirely. No more duplicate
`Quality Gates` display name. No more "skipped Quality Gates" cosmetic
row alongside the real four matrix legs on every regular PR.
- For release-please bot bypass runs: guard classifies bypass ->
pre-check / quality-gates / build all skip -> `validation` aggregates
("skipped" treated as ok) -> green.
- For regular PRs: everything runs -> `validation` aggregates -> green.
- For any genuinely failed or cancelled job: `validation` exits 1 ->
branch protection blocks merge.
Future matrix changes (add 3.14, drop 3.10) or new jobs (security scans,
etc.) no longer require touching branch protection -- only the `needs:`
list and the results array in this aggregator.
Requires branch protection on `main` to be reconfigured before merge:
- Remove required checks: `CI / Quality Gates (3.10)`, `(3.11)`, `(3.12)`,
`(3.13)`
- Add required check: `CI / Validation`
https://linear.app/runpod/issue/AE-3161
|
Applied the single-required-check aggregator pattern from runpod-python's Dropped: Added: How each path resolves now:
⚠ Required before merge — branch protection on
After this one-time change, future matrix or job additions don't require any branch-protection updates. The aggregator handles it automatically. |
Closes the last remaining redundancy in the lifecycle. When release-please's release PR is merged, the resulting commit on main only changes CHANGELOG.md and .release-please-manifest.json -- the same code was already validated when the underlying feature PRs merged. Previously ci.yml ran the full 4-version matrix + build on those merge commits (~7 min compute per release). Now `guard` classifies push:main by diff: - Metadata-only diff -> release-please-bypass -> pre-check / quality-gates / build all skip, validation aggregates "skipped" as success. - Any non-metadata change (feature merges, admin direct pushes) -> full. release-please.yml's pypi-publish builds its own wheel before publishing, so skipping ci.yml's build on release commits doesn't blind us to wheel regressions. https://linear.app/runpod/issue/AE-3161
|
Applied stage-4 optimization in 1a6a0b6.
Safe to skip the build here because Final lifecycle redundancy map
Only remaining redundancy is Stage 2 — the merge commit re-validation. That one's deliberate defense-in-depth (squash/rebase changes SHAs; admin override exists). Removing it would mean trusting PR-only validation, which most shops don't. |
…or cache Addresses Copilot's third review on PR #345. Copilot ci.yml:66 + ci.yml:88 — bypass missing release-please-managed files - release-please-config.json declares release-type=python (auto-bumps pyproject.toml) and extra-files=src/runpod_flash/__init__.py. Real bot PRs touch 4 files, not 2. Previous ALLOWED regex missed pyproject.toml and __init__.py, so every real release-please bot PR was misclassified as `full`. Entire bypass optimization was broken for the actual case it was designed for. - Extend ALLOWED to all four files. Add a comment pointing back to release-please-config.json so the regex stays in sync. - Add defense-in-depth: PR classifier now also requires github.event.pull_request.user.login == 'runpod-release-please-bot[bot]' (the GitHub App account). Branch prefix + bot author + file content all required. A contributor spoofing the branch name fails the author check; a contributor with write access to the bot would already be trusted. - Push:main classifier keeps content-only check (admin can squash-merge the bot's PR by hand, so actor isn't a reliable signal on push). Copilot ci.yml:164 + ci.yml:193 + release-please.yml:61 + e2e.yml:39 — setup-uv cache keyed on pyproject.toml - uv sync is lockfile-driven; pyproject.toml-only cache key means uv.lock changes (most common dependency update) don't invalidate the cache, causing stale-deps cache hits. - Switch cache-dependency-glob to uv.lock everywhere. pre-check already used uv.lock; this aligns the rest. Copilot Makefile:107 — misleading "aliases" comment - quality-check-strict is NOT an alias for ci-quality-github -- it runs a different set (adds typecheck, plain output, no junit). Comment block rewritten to enumerate each target individually so maintainers don't conflate them. https://linear.app/runpod/issue/AE-3161
|
Addressed Copilot's third review in 97920a6. Bypass classifier was broken for the actual case it was designed for (2 threads, ci.yml:66 + ci.yml:88) — caught this verified against release-please-config.json and real bot PR #328. release-please-config has Fixed:
setup-uv cache keyed on pyproject.toml instead of uv.lock (4 threads across ci.yml:164, ci.yml:193, release-please.yml:61, e2e.yml:39) — fixed. Makefile:107 misleading "aliases" comment (1 thread) — All 7 threads resolved. |
runpod-Henrik
left a comment
There was a problem hiding this comment.
Henrik's AI-Powered Bug Finder — PR #345 Review
Delta review — since 2026-05-28
Significant rework since the prior review. The author replaced the original paths-ignore + job-level if: approach with a content-based guard job and a single Validation aggregator. CI is green across all checks (Guard, Pre-check, 4× Quality Gates, Build, Validation).
Prior findings — status
1. ✅ Resolved — branch-protection compatibility
Original concern: release-please bot PRs skipped via paths-ignore would leave required-check status absent, possibly blocking merge.
Now: Validation aggregator job runs if: always() and treats skipped upstream as success. Branch protection only needs to require CI / Validation — works for both full runs and release-please-bypass runs. Cleaner than the no-op-sentinel approach I suggested.
2. ✅ Resolved — PR-time wheel build dropped
Original concern: validate-wheel.sh not running on PRs would let packaging regressions land on main.
Now: guard.outputs.should_build is true when pyproject.toml, Makefile, MANIFEST.in, or scripts/validate-wheel.sh change, OR when a new non-Python file is added under src/. Build runs on relevant PRs; skipped on pure-Python changes. Targeted exactly at the packaging-regression risk I flagged.
3. ✅ Resolved — skip-guard redundancy
Original concern: two overlapping skip mechanisms (paths-ignore + if: !startsWith(...)) with unclear semantics.
Now: single guard job is the only classifier. Downstream pre-check, quality-gates, build gate on needs.guard.outputs.mode == 'full'. One source of truth.
4. ⏸ Still open — make dev entry-point sanity
Question wasn't blocking; CI is green which suggests uv sync registers entry points correctly in workspace mode. Worth a one-off make dev && .venv/bin/flash --version on a clean clone before merging, but not a release-blocker.
5. ⏸ Still open — local make quality-check UX
Annotation markers (::group::) still surface in local output. Minor nit, unchanged.
New findings (from commits since 2026-05-28)
6. Issue: ALLOWED regex must stay in sync with release-please-config.json by hand
The guard hardcodes the files release-please touches:
```
ALLOWED='^(CHANGELOG.md|.release-please-manifest.json|pyproject.toml|src/runpod_flash/init.py)$'
```
The comment correctly notes "Keep in sync with release-please-config.json" but there's no automated check. The asymmetric failure modes are favorable (out-of-sync → bypass doesn't fire → full CI runs → safe), so this is hardening, not a blocker. But the next person who adds an extra-files entry to release-please-config.json will spend a confused 20 minutes wondering why the bypass stopped firing on bot PRs.
Two cheap mitigations: (a) add a unit test in tests/ that reads release-please-config.json and asserts the ALLOWED regex matches; (b) read the config in the guard script and derive the regex dynamically.
7. Question: pre-check installs dev group via uv sync --only-group dev --frozen, then runs uv run ruff
uv sync --only-group dev --frozen installs dev dependencies without the project itself. Then the next steps run uv run ruff …. Default uv run behaviour syncs the project on every invocation — so the "no project install" intent of the previous step may be undone by uv run.
If the goal is to keep pre-check at ~10s, consider uv run --no-project ruff … for both ruff commands. Otherwise the cold-cache pre-check pays the full project install cost twice. Not a correctness issue — CI still passes — just possibly leaves the optimization unrealized.
8. Question: should_build heuristic doesn't trigger on modified non-Python src/ files
The guard checks --diff-filter=A for added files under src/:
```bash
git diff --name-only --diff-filter=A "origin/${BASE_REF}...HEAD" | grep -E '^src/.+' | grep -qv '.py$'
```
A PR that modifies an existing non-Python data file (e.g., a future edit to runpod_flash/rules/AGENTS.md) won't trigger should_build. The wheel will still include the file because it's already in the package-data manifest, so a modification doesn't change packaging behaviour — but the modified content won't be smoke-tested via validate-wheel.sh. Probably acceptable; worth confirming the intent.
If catching content regressions in packaged non-Python files matters, change --diff-filter=A to --diff-filter=AM (Added or Modified).
9. Nit: fetch-depth: 0 on guard checkout
Pulls full history just to run git diff HEAD~1...HEAD or git diff origin/\${BASE_REF}...HEAD. Full fetch is cheap on this repo size (no monorepo concerns) but fetch-depth: 2 would suffice for the push case and fetch-depth: 0 is needed only for the PR comparison-against-base case (and even then, only enough history to find the common ancestor). Not a real cost issue, just over-broad.
Pre-merge action still required
Branch protection on main needs to change the required status check from Quality Gates (4 contexts) to Validation (1 context). Without this rule update, merging still fails because old required-check names are absent on bypass runs. The PR's no-op-sentinel option from the original description has been replaced with the aggregator; the admin-side step is now "switch required check to Validation."
Verdict
PASS WITH NITS. The guard + Validation pattern is a clean solution for the two original issues that needed real design. New observations are minor — config-sync brittleness, pre-check optimization, modified-file heuristic — none block merge. CI proves the new pipeline works on this PR (all four quality-gates legs pass, build runs because the diff touched Makefile/.github/workflows/* which trips should_build).
Pre-merge action: branch protection rule must be updated to require Validation instead of Quality Gates. Without that admin change, the bypass path still leaves merges blocked.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
…sync; catch modified data files Henrik's delta review on PR #345. Henrik #6 — ALLOWED was hand-synced with release-please-config.json - Previous regex hardcoded the four files release-please touches today. Asymmetric failure mode is safe (out-of-sync -> bypass doesn't fire -> full CI runs) but the next person to add an `extra-files` entry to release-please-config.json would silently break the bypass for bot PRs. - Derive the file list dynamically via jq from release-please-config.json's `.packages["."]["extra-files"]`, append the three implicit files (CHANGELOG.md, manifest, pyproject.toml from release-type=python). Use `grep -Fxvf <(allowed_files)` for fixed-string exact-line matching -- removes the regex-escaping concern entirely. Henrik #7 — `uv run` undid the dev-only sync - `uv run` defaults to syncing the environment before each invocation, so the steps after `uv sync --only-group dev --frozen` would pull in the full project on each ruff call -- defeating the "no project install" premise of pre-check. - Add `--no-sync` to both ruff invocations. ruff runs from the existing dev-only venv directly; pre-check stays fast. Henrik #8 — `--diff-filter=A` missed modified non-Python files - Edits to existing packaged data files (e.g. future tweaks to runpod_flash/rules/AGENTS.md) wouldn't trigger should_build, so validate-wheel.sh wouldn't smoke-test the modified content. Packaging metadata wouldn't change, but the wheel content would. - Switch to `--diff-filter=AM` (Added or Modified). Cheap and correct. Henrik #9 (fetch-depth: 0 over-broad) intentionally not addressed -- Henrik flagged it as "not a real cost issue." Henrik #4 (make dev entry-point) re-verified locally: `rm -rf .venv && make dev && .venv/bin/flash --version` -> Runpod Flash CLI v1.16.0. https://linear.app/runpod/issue/AE-3161
|
Addressed your delta review in e467bc3. #6 — #7 — #8 — #9 — #4 — #5 — local Pre-merge admin step still required: switch the required check on |
…y table fix Copilot fourth review on PR #345. Copilot ci.yml:98 — pyproject.toml in ALLOWED could let non-version edits slip through the bypass - A push:main touching only the metadata filenames could smuggle non-version edits to pyproject.toml (deps, ruff config, scripts) since the previous check was filename-only. Push:main can't rely on actor identity (admins can squash-merge by hand), so the file-list check was the only gate. - Add `verify_version_shape` helper: enforces that any pyproject.toml change is a single `version = "..."` line, and any src/runpod_flash/__init__.py change is a single `__version__ = "..."` line. Applies to both push:main and PR bypass paths. Tested locally against simulated bot and malicious diffs. Copilot ci.yml:103 — `should_build=true` unconditional on push:main - Previous code only computed should_build content-based on PR full-mode; push:main non-bypass always built. Contradicted the documented intent and cost ~1 min per feature merge. - Factor the content heuristic into `compute_should_build`; use it for both push:main full and PR full paths. Same allow-list (pyproject.toml, Makefile, MANIFEST.in, validate-wheel.sh) plus AM-filter on non-Python files under src/. Copilot e2e.yml:129 — malformed markdown table when JUnit file missing - Python ternary `... if duration is not None else "- |"` had the wrong scope: when duration was None, the *entire* print expression became the string `"- |"`, breaking the table. - Extract `_num()` and `_dur()` helpers that placehold "-" per field; build the row in one f-string. Table stays valid when the suite didn't run. Copilot Makefile:116 (NOT changed) — claim was that aliasing quality-check to ci-quality-github drops `--cov-fail-under=65` enforcement. Verified locally: the serial pass in ci-quality-github inherits --cov-fail-under=65 from pyproject.toml's addopts (only the parallel pass overrides to 0). Local `make quality-check` still hits the 65% gate at the end of the serial pass; observed "Required test coverage of 65% reached. Total coverage: 85.96%" on this branch. Will reply on the thread, not change code. https://linear.app/runpod/issue/AE-3161
|
Addressed Copilot's fourth review in 722b38c. ci.yml:103 — ci.yml:98 — e2e.yml:129 — malformed table when JUnit missing → the Python ternary Makefile:116 — left in an open reply on that thread with verification output. The 3 of 4 threads resolved; 1 left open with rationale for re-review. |
Summary
Eliminates redundant CI/CD work across the full release lifecycle while keeping defense-in-depth for actual code changes. Linear: AE-3161.
Adopts the single-required-check aggregator pattern from runpod-python (ci.yml:105) plus a content-classifying
guardjob that decides each run's mode by inspecting the actual diff -- not just branch names.Lifecycle redundancy map (after this PR)
pull_requestpushpush: main(non-metadata diff)pull_requestonrelease-please--*push: main(metadata-only diff)What's redundant today (before this PR)
ci.ymlandrelease-please.ymlrun the full 4-version Quality Gates matrix onpush: main-- ~24 min of duplicated compute per merge. The release workflow runs after CI on the same SHA.CHANGELOG.md+.release-please-manifest.json.e2e.ymlhas its ownunit-testsjob that re-implementsci.yml's quality-gates with a driftingsetup-uvversion.make ci-quality-githubwrites both pytest passes to the same--junitxml=pytest-results.xml-- the second pass silently overwrites the first report.Makefile:devrunsuv sync --all-groupsthenuv pip install -e .-- the editable install is redundant in workspace mode.quality-checkandci-quality-githubare two Makefile targets running the same checks.concurrencygroup -- pushing a fixup commit to a PR leaves the previous run racing.make devinstall across 4 matrix legs before bailing.Changes
.github/workflows/ci.ymlAdded:
guardjob (content-based classifier) -- single source of truth for what mode each run is in. Inspects the actual diff and outputs:mode:release-please-bypass(branch prefix matches AND diff is exactly CHANGELOG + manifest) orfullshould_build: true if packaging-relevant files (pyproject.toml,Makefile,MANIFEST.in,scripts/validate-wheel.sh) changed, or new non-Python file added undersrc/For
push: main, the same classification applies: a release-please release commit (metadata-only) classifies as bypass; anything else as full.Added:
validationjob (aggregator) -- runs unconditionally (if: always()), inspectsneeds.guard.result,needs.pre-check.result,needs.quality-gates.result,needs.build.result. Passes when every upstream issuccessorskipped, fails on anything else. This is the single required check -- adding/removing future jobs or matrix legs no longer requires branch-protection updates.Other:
pre-checkjob:uv sync --only-group dev --frozen+uv run ruff(pins to lockfile, nouvxversion drift)quality-gatesmatrix: 3.10 / 3.11 / 3.12 / 3.13buildjob: gated byguard.outputs.should_buildsetup-uv@v2 → v5pytest-results-*.xml(now two files: parallel + serial).github/workflows/release-please.ymlquality-gatesjob entirely. Branch protection already enforces CI green before code lands. Re-running on the same SHA validates nothing new.release-pleasebecomes the entry point;pypi-publishruns if a release was created.setup-uv@v2 → v5..github/workflows/e2e.ymlunit-testsjob (duplicatedci.ymlquality-gates).summarystep into thee2ejob; remove the separatesummaryjob and its artifact round-trip.unit-testswas dropped, so future maintainers don't re-add it.Makefiledev: drop redundantuv pip install -e .(uv sync handles editable install in workspace mode). Verified locally:rm -rf .venv && make dev && .venv/bin/flash --versionprints correctly.quality-check: aliasci-quality-github(single source of truth).ci-quality-github: write distinctpytest-results-parallel.xmlandpytest-results-serial.xml(fixes silent junit overwrite).Why the pytest double-pass stayed
Initial draft collapsed the two pytest invocations into a single
-n auto --dist=loadgrouprun, relying onconftest.py'sxdist_group("serial")hook for isolation. Localmake quality-checkexposed test pollution:test_call_routes_to_execute_for_live_endpointfailed when sharing a worker with non-serial tests under the single-pass arrangement, but passes in isolation and in the original two-pass layout. The double pass provides real process isolation between parallel and serial tests; the silent junit overwrite is fixed instead.Expected impact
Branch protection on
mainneeds to be reconfigured:CI / Quality Gates (3.10),(3.11),(3.12),(3.13)CI / ValidationThe
Validationjob aggregates all upstream results, so this becomes a stable required-check name regardless of future matrix or job changes.Test plan
make quality-checkpasses locally with the new Makefile layoutmake devproduces a working install withflashentry point registered (nouv pip install -e .)make ci-quality-githubwrites bothpytest-results-parallel.xmlandpytest-results-serial.xmlpre-checkuses lockfile-pinned ruff (0.14.14), matchingquality-gatesguardclassification works on PR diffs (regular PR →full, spoofed release-please branch with code changes →full, real release-please metadata-only →bypass)validationaggregator visible in CI runs as the single green checkQuality Gates (X)checks, addValidationguard+validation(~7s)guard+validation(~7s)