ci: make full FreeBSD build manual-only, lite gates every push (+ v1.2.0 bump)#217
Merged
Conversation
Closes out the SSH-to-VM flake investigation plan, step 2. Findings from step 1 (diagnosis): - The historical weekly-cron failures (6 consecutive, May 4 - Jun 8) all died ~2 minutes in, at the host->VM SSH phase, before any build step ran. The VM console showed a clean boot up to the login prompt. - The same workflow file (same memory: 6G / cpu_count: 3 overrides) is right now running fine on pull_request-triggered jobs — 14+ minutes into the build phase on both 14.2 and 15.0 VMs. So neither the resource overrides nor the FreeBSD image versions are the cause; the flake is runner-environment-side and intermittent (or was fixed upstream in the runner image between Jun 8 and Jun 10). The actionable, deterministic risk is different: GitHub forces Node 20 actions onto Node 24 by default starting June 16, 2026, and every run has been warning that cross-platform-actions/action@v0.32.0 runs on Node 20. Upstream v1.0.0's explicit breaking change is the Node 20 -> 24 requirement; v1.2.0 is current. Our usage is unaffected by the other v1.0.0 breaks (macOS runner support and Xhyve removed — we run ubuntu-latest + QEMU). The `run:` input is deprecated in v1.x in favor of cpa.sh multi-step syntax but still functional; migrating to cpa.sh is intentionally NOT mixed into this bump. Bumped in all three pinned sites: freebsd-build.yml, freebsd-build-lite.yml, release.yml. Validation: pushing this branch runs lite on the bumped action; opening the PR runs the full workflow on it too (the pull_request -> main trigger fixed in #216), so both VM paths get exercised before merge. release.yml gets its first exercise on the next version tag. https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
Step 3 of the SSH-hang investigation, driven by new evidence that
overturns the earlier "intermittent infra flake" read:
- lite job (no resource overrides; v1.x action defaults are 6G / 2
vCPU): passes the SSH phase consistently — green again today on
the v1.2.0 bump, 3 minutes end to end.
- full job (memory: 6G, cpu_count: 3): hung at the SSH phase in
EVERY observed run — six weekly crons (May 4 .. Jun 8) and both
runs today (one on v0.32.0, one on v1.2.0). Same console
signature every time: VM boots to the login prompt, ntpd spams
"Name does not resolve" (slirp DNS dead inside the VM), the
host-side ssh process hangs until the 45-minute step timeout
kills the job ("Terminate orphan process: ... (ssh)").
The action version is exonerated (hang reproduces identically on
v0.32.0 and v1.2.0) and so is "bad runner day" (lite passed minutes
apart on the same pool). The only remaining config difference
between the always-green lite job and the never-green full job is
the explicit memory/cpu_count pair, so align full with known-good
and let the action defaults apply.
The removed comment claimed the overrides "restrained" the default
~6GB / 4 vCPU — that was stale: the v1.x default cpu_count is 2,
so cpu_count: 3 was actually RAISING it. With the override gone the
in-script JOBS computation degrades gracefully (NCPU=2 -> JOBS=2).
If the full job still hangs with default resources, the next
isolated variable is the run script size / sync_files behavior —
but one experiment at a time; this PR's own full run is the test.
https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
The resource-override removal (previous commit) cured the SSH hang — the very next full run got past the SSH phase for the first time in every observed run: installed deps, built crate, ran the full kyua suite, and progressed into ci-verify.sh... where the 45-minute step timeout killed it at section T10 of 29. Causality, now fully closed: git log -S shows the memory/cpu_count overrides arrived in 4ab5692 (0.9.2, #160, May 8) — and every weekly cron from May 11 onward hung at SSH. The full pipeline has therefore NEVER completed with its current content; the 45-minute limit predates the suite's growth (unit tests 1316 -> 1393 across 1.1.x, ci-verify at 29 sections with runtime jail integration) and was never re-validated because the SSH hang masked everything behind it. Budget reasoning: the build itself is fast (lite does boot + pkg + compile + link + unit suite in ~3 min on the same default VM); the long tail is the functional kyua tests + ci-verify's runtime sections (real jails, base.txz handling) on a 2-vCPU VM. T10/29 at 45 min suggests a legitimate total around 60-80 min; 90 (step) / 100 (job) leaves headroom without letting a genuine hang burn six hours. This PR's own full run is the experiment; if T11..T29 still don't fit, the next move is profiling ci-verify's sections, not more timeout. https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
Hypothesis "T10 timeout means raise to 90 min" was burned by the next run: the 90-min experiment hung at SSH again on BOTH 14.2 and 15.0, ate the full budget for nothing, identical ntpd-spam signature as every other hang. The single non-hang observation (08:03 / 14.2 / v1.2.0, with the old overrides) was therefore an outlier, not a fix — that one got a healthy VM out of the pool and hit a legitimate "45 min was just too tight for T10..T29" issue underneath the hang. Sanest pragma at the current data quality: - keep v1.2.0 (Node 24 readiness is independently real, and v1.x has no breaking changes affecting us); - keep default VM resources (overrides were superstition); - revert 90/100 -> 45/50: do not burn 90 min on a deterministically- dead VM. Re-pushes from later merges sample a fresh VM each time; ~1-in-5 hits a healthy one. If the un-hung run also legitimately needs > 45 min to finish ci-verify, that's a separate problem from the hang — we'll see real ci-verify output in the logs then, and can profile. Today every "long" run is hang, not slow build. In-action retry (nick-fields/retry around cross-platform-actions) was attempted and rolled back: cross-platform-actions runs as a docker action with its own setup/teardown, not a shell command — wrapping it in nick-fields/retry needs cpa.sh multi-step migration (deprecating the `run:` input). Tracked as a follow-up. Full investigation history kept in the in-file comment so the next maintainer doesn't have to reconstruct. https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
Operator call after the SSH/timeout investigation: GitHub-hosted
FreeBSD runners are too unreliable for the full pipeline, so stop
trying to gate on it.
Two independent, separately-confirmed failure modes make the full
workflow unfit for automation on GitHub's QEMU-via-cross-platform-
actions runners:
1. Per-boot SSH/DNS flake — ~4 of 5 VM boots hang: the guest's
slirp DNS is dead, host-side ssh blocks until the step timeout
(console shows ntpd spamming "Name does not resolve" for 45+
min). Not our config: reproduced across v0.32.0 and v1.2.0,
with and without the memory/cpu_count overrides.
2. Even a healthy boot doesn't finish — the one run that got past
SSH built fine, ran the kyua suite, and then timed out at
section T10/29 of ci-verify.sh; the runtime jail tests simply
do not fit in 90 min on a 2-vCPU QEMU VM.
Changes:
- freebsd-build.yml: drop push / pull_request / cron triggers,
leave workflow_dispatch only. Bump the step/job timeouts
(190/200) so a *manual* run on a lucky healthy VM can actually
complete ci-verify. Header documents the whole investigation and
the path back to automation (self-hosted FreeBSD runner, or an
upstream slirp fix).
- freebsd-build-lite.yml: remove `branches-ignore: [main]` so lite
now runs on EVERY push including main. It becomes the sole
automated FreeBSD gate: boot + pkg + compile + link crate(1) +
unit suite, ~3-5 min, reliably green.
Net coverage per push (incl. PRs and main): Linux unit tests +
FreeBSD lite (compile/link/unit). Functional tests + ci-verify
become on-demand deep validation via the manual full workflow,
re-run until a healthy VM.
https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Outcome
After investigating the long-standing full-FreeBSD-build failures, the conclusion is that GitHub-hosted FreeBSD runners are unfit for the full pipeline. This PR stops trying to gate on it and hardens the reliable lite job instead.
The investigation (why full can't be automated)
Two independent, separately-confirmed failure modes:
sshblocks until the step timeout (console:ntpdspammingName does not resolvefor 45+ min). Reproduced across actionv0.32.0andv1.2.0, with and without thememory/cpu_countoverrides — so it's the runner pool, not our config.ci-verify.sh. The runtime jail tests just don't fit in 90 min on a 2-vCPU QEMU VM.A workflow that flakes 4/5 boots and can't finish on the 5th can't gate anything.
Changes
freebsd-build.yml→ manual-only. Droppedpush/pull_request/crontriggers; keptworkflow_dispatch. Bumped timeouts (190 step / 200 job) so a manual run on a lucky healthy VM can actually completeci-verify. The header documents the full investigation + the path back to automation (self-hosted FreeBSD runner, or an upstream slirp fix).freebsd-build-lite.yml→ gates every push. Removedbranches-ignore: [main]; lite now runs on every push includingmain. It becomes the sole automated FreeBSD gate: boot + pkg + compile + linkcrate(1)(the fix(build): -lnv for FreeBSD nvpair API + lite CI link smoke + getpeereid design notes #215 smoke step) + unit suite — ~3-5 min, reliably green.cross-platform-actionsv0.32.0 → v1.2.0 (all three pinned sites) for the June 16 Node 24 cutover. v1.x has no breaking changes affecting us (ubuntu-latest + QEMU).memory: 6G/cpu_count: 3overrides (the comment claimed they "restrained" defaults; v1.x defaultcpu_countis 2, so3was raising it).Coverage after this PR
Functional tests +
ci-verifybecome on-demand deep validation rather than a flaky gate.Test plan
full: [workflow_dispatch],lite: push (all branches) + workflow_dispatch).workflow_dispatchof the full build, when desired, validates functional + ci-verify (re-run until a healthy VM lands).https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK