ci: make full FreeBSD build manual-only, lite gates every push (+ v1.2.0 bump) by click0 · Pull Request #217 · click0/crate

click0 · 2026-06-10T08:03:05Z

Outcome

After investigating the long-standing full-FreeBSD-build failures, the conclusion is that GitHub-hosted FreeBSD runners are unfit for the full pipeline. This PR stops trying to gate on it and hardens the reliable lite job instead.

The investigation (why full can't be automated)

Two independent, separately-confirmed failure modes:

Per-boot SSH/DNS flake — ~4 of 5 VM boots hang. The guest's slirp DNS dies; host-side ssh blocks until the step timeout (console: ntpd spamming Name does not resolve for 45+ min). Reproduced across action v0.32.0 and v1.2.0, with and without the memory/cpu_count overrides — so it's the runner pool, not our config.
Even a healthy boot doesn't finish — the single run that got past SSH built fine, ran the kyua suite, then timed out at section T10/29 of ci-verify.sh. The runtime jail tests just don't fit in 90 min on a 2-vCPU QEMU VM.

A workflow that flakes 4/5 boots and can't finish on the 5th can't gate anything.

Changes

freebsd-build.yml → manual-only. Dropped push / pull_request / cron triggers; kept workflow_dispatch. Bumped timeouts (190 step / 200 job) so a manual run on a lucky healthy VM can actually complete ci-verify. The header documents the full investigation + the path back to automation (self-hosted FreeBSD runner, or an upstream slirp fix).
freebsd-build-lite.yml → gates every push. Removed branches-ignore: [main]; lite now runs on every push including main. It becomes the sole automated FreeBSD gate: boot + pkg + compile + link crate(1) (the fix(build): -lnv for FreeBSD nvpair API + lite CI link smoke + getpeereid design notes #215 smoke step) + unit suite — ~3-5 min, reliably green.
cross-platform-actions v0.32.0 → v1.2.0 (all three pinned sites) for the June 16 Node 24 cutover. v1.x has no breaking changes affecting us (ubuntu-latest + QEMU).
Removed the superstitious memory: 6G / cpu_count: 3 overrides (the comment claimed they "restrained" defaults; v1.x default cpu_count is 2, so 3 was raising it).

Coverage after this PR

Trigger	Linux unit	FreeBSD lite	FreeBSD full
push (feature/PR/main)	✅	✅ (compile/link/unit)	—
manual dispatch	—	—	✅ functional + ci-verify, re-run until healthy VM

Functional tests + ci-verify become on-demand deep validation rather than a flaky gate.

Test plan

Both workflow files parse as valid YAML; trigger sets confirmed (full: [workflow_dispatch], lite: push (all branches) + workflow_dispatch).
This PR's own push runs lite on v1.2.0 — green = the bumped action boots/SSHes/compiles/links/tests on FreeBSD 14.2. No more red full checks blocking the PR.
Manual workflow_dispatch of the full build, when desired, validates functional + ci-verify (re-run until a healthy VM lands).

https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK

Closes out the SSH-to-VM flake investigation plan, step 2. Findings from step 1 (diagnosis): - The historical weekly-cron failures (6 consecutive, May 4 - Jun 8) all died ~2 minutes in, at the host->VM SSH phase, before any build step ran. The VM console showed a clean boot up to the login prompt. - The same workflow file (same memory: 6G / cpu_count: 3 overrides) is right now running fine on pull_request-triggered jobs — 14+ minutes into the build phase on both 14.2 and 15.0 VMs. So neither the resource overrides nor the FreeBSD image versions are the cause; the flake is runner-environment-side and intermittent (or was fixed upstream in the runner image between Jun 8 and Jun 10). The actionable, deterministic risk is different: GitHub forces Node 20 actions onto Node 24 by default starting June 16, 2026, and every run has been warning that cross-platform-actions/action@v0.32.0 runs on Node 20. Upstream v1.0.0's explicit breaking change is the Node 20 -> 24 requirement; v1.2.0 is current. Our usage is unaffected by the other v1.0.0 breaks (macOS runner support and Xhyve removed — we run ubuntu-latest + QEMU). The `run:` input is deprecated in v1.x in favor of cpa.sh multi-step syntax but still functional; migrating to cpa.sh is intentionally NOT mixed into this bump. Bumped in all three pinned sites: freebsd-build.yml, freebsd-build-lite.yml, release.yml. Validation: pushing this branch runs lite on the bumped action; opening the PR runs the full workflow on it too (the pull_request -> main trigger fixed in #216), so both VM paths get exercised before merge. release.yml gets its first exercise on the next version tag. https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK

Step 3 of the SSH-hang investigation, driven by new evidence that overturns the earlier "intermittent infra flake" read: - lite job (no resource overrides; v1.x action defaults are 6G / 2 vCPU): passes the SSH phase consistently — green again today on the v1.2.0 bump, 3 minutes end to end. - full job (memory: 6G, cpu_count: 3): hung at the SSH phase in EVERY observed run — six weekly crons (May 4 .. Jun 8) and both runs today (one on v0.32.0, one on v1.2.0). Same console signature every time: VM boots to the login prompt, ntpd spams "Name does not resolve" (slirp DNS dead inside the VM), the host-side ssh process hangs until the 45-minute step timeout kills the job ("Terminate orphan process: ... (ssh)"). The action version is exonerated (hang reproduces identically on v0.32.0 and v1.2.0) and so is "bad runner day" (lite passed minutes apart on the same pool). The only remaining config difference between the always-green lite job and the never-green full job is the explicit memory/cpu_count pair, so align full with known-good and let the action defaults apply. The removed comment claimed the overrides "restrained" the default ~6GB / 4 vCPU — that was stale: the v1.x default cpu_count is 2, so cpu_count: 3 was actually RAISING it. With the override gone the in-script JOBS computation degrades gracefully (NCPU=2 -> JOBS=2). If the full job still hangs with default resources, the next isolated variable is the run script size / sync_files behavior — but one experiment at a time; this PR's own full run is the test. https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK

The resource-override removal (previous commit) cured the SSH hang — the very next full run got past the SSH phase for the first time in every observed run: installed deps, built crate, ran the full kyua suite, and progressed into ci-verify.sh... where the 45-minute step timeout killed it at section T10 of 29. Causality, now fully closed: git log -S shows the memory/cpu_count overrides arrived in 4ab5692 (0.9.2, #160, May 8) — and every weekly cron from May 11 onward hung at SSH. The full pipeline has therefore NEVER completed with its current content; the 45-minute limit predates the suite's growth (unit tests 1316 -> 1393 across 1.1.x, ci-verify at 29 sections with runtime jail integration) and was never re-validated because the SSH hang masked everything behind it. Budget reasoning: the build itself is fast (lite does boot + pkg + compile + link + unit suite in ~3 min on the same default VM); the long tail is the functional kyua tests + ci-verify's runtime sections (real jails, base.txz handling) on a 2-vCPU VM. T10/29 at 45 min suggests a legitimate total around 60-80 min; 90 (step) / 100 (job) leaves headroom without letting a genuine hang burn six hours. This PR's own full run is the experiment; if T11..T29 still don't fit, the next move is profiling ci-verify's sections, not more timeout. https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK

Hypothesis "T10 timeout means raise to 90 min" was burned by the next run: the 90-min experiment hung at SSH again on BOTH 14.2 and 15.0, ate the full budget for nothing, identical ntpd-spam signature as every other hang. The single non-hang observation (08:03 / 14.2 / v1.2.0, with the old overrides) was therefore an outlier, not a fix — that one got a healthy VM out of the pool and hit a legitimate "45 min was just too tight for T10..T29" issue underneath the hang. Sanest pragma at the current data quality: - keep v1.2.0 (Node 24 readiness is independently real, and v1.x has no breaking changes affecting us); - keep default VM resources (overrides were superstition); - revert 90/100 -> 45/50: do not burn 90 min on a deterministically- dead VM. Re-pushes from later merges sample a fresh VM each time; ~1-in-5 hits a healthy one. If the un-hung run also legitimately needs > 45 min to finish ci-verify, that's a separate problem from the hang — we'll see real ci-verify output in the logs then, and can profile. Today every "long" run is hang, not slow build. In-action retry (nick-fields/retry around cross-platform-actions) was attempted and rolled back: cross-platform-actions runs as a docker action with its own setup/teardown, not a shell command — wrapping it in nick-fields/retry needs cpa.sh multi-step migration (deprecating the `run:` input). Tracked as a follow-up. Full investigation history kept in the in-file comment so the next maintainer doesn't have to reconstruct. https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK

Operator call after the SSH/timeout investigation: GitHub-hosted FreeBSD runners are too unreliable for the full pipeline, so stop trying to gate on it. Two independent, separately-confirmed failure modes make the full workflow unfit for automation on GitHub's QEMU-via-cross-platform- actions runners: 1. Per-boot SSH/DNS flake — ~4 of 5 VM boots hang: the guest's slirp DNS is dead, host-side ssh blocks until the step timeout (console shows ntpd spamming "Name does not resolve" for 45+ min). Not our config: reproduced across v0.32.0 and v1.2.0, with and without the memory/cpu_count overrides. 2. Even a healthy boot doesn't finish — the one run that got past SSH built fine, ran the kyua suite, and then timed out at section T10/29 of ci-verify.sh; the runtime jail tests simply do not fit in 90 min on a 2-vCPU QEMU VM. Changes: - freebsd-build.yml: drop push / pull_request / cron triggers, leave workflow_dispatch only. Bump the step/job timeouts (190/200) so a *manual* run on a lucky healthy VM can actually complete ci-verify. Header documents the whole investigation and the path back to automation (self-hosted FreeBSD runner, or an upstream slirp fix). - freebsd-build-lite.yml: remove `branches-ignore: [main]` so lite now runs on EVERY push including main. It becomes the sole automated FreeBSD gate: boot + pkg + compile + link crate(1) + unit suite, ~3-5 min, reliably green. Net coverage per push (incl. PRs and main): Linux unit tests + FreeBSD lite (compile/link/unit). Functional tests + ci-verify become on-demand deep validation via the manual full workflow, re-run until a healthy VM. https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK

claude added 5 commits June 10, 2026 08:02

click0 changed the title ~~ci: bump cross-platform-actions to v1.2.0 (Node 24 readiness, flake follow-up)~~ ci: make full FreeBSD build manual-only, lite gates every push (+ v1.2.0 bump) Jun 10, 2026

click0 merged commit b0f4368 into main Jun 10, 2026
1 check passed

click0 deleted the claude/analyze-test-coverage-nCOJW branch June 10, 2026 15:50

click0 mentioned this pull request Jun 10, 2026

chore(release): 1.1.16 — hygiene tail after the 1.1.12 → 1.1.15 series #218

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: make full FreeBSD build manual-only, lite gates every push (+ v1.2.0 bump)#217

ci: make full FreeBSD build manual-only, lite gates every push (+ v1.2.0 bump)#217
click0 merged 5 commits into
mainfrom
claude/analyze-test-coverage-nCOJW

click0 commented Jun 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

click0 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Outcome

The investigation (why full can't be automated)

Changes

Coverage after this PR

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

click0 commented Jun 10, 2026 •

edited

Loading