ci: make full FreeBSD build manual-only, lite gates every push (+ v1.2.0 bump) #217

Merged
click0 merged 5 commits into main from claude/analyze-test-coverage-nCOJW on Jun 10, 2026

@click0 click0 commented Jun 10, 2026

Outcome

After investigating the long-standing full-FreeBSD-build failures, the conclusion is that GitHub-hosted FreeBSD runners are unfit for the full pipeline. This PR stops trying to gate on it and hardens the reliable lite job instead.

The investigation (why full can't be automated)

Two independent, separately-confirmed failure modes:

  1. Per-boot SSH/DNS flake: ~4 of 5 VM boots hang. The guest's slirp DNS dies; host-side ssh blocks until the step timeout (console: ntpd spamming "Name does not resolve" for 45+ min). Reproduced across action v0.32.0 and v1.2.0, with and without the memory/cpu_count overrides, so it's the runner pool, not our config.
  2. Even a healthy boot doesn't finish — the single run that got past SSH built fine, ran the kyua suite, then timed out at section T10/29 of ci-verify.sh. The runtime jail tests just don't fit in 90 min on a 2-vCPU QEMU VM.

A workflow that flakes 4/5 boots and can't finish on the 5th can't gate anything.
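
As a rough sanity check on that claim, assume each VM boot is independent and healthy with probability p ≈ 1/5. That independence is an assumption, not something measured in the runs above. The chance of landing at least one healthy boot in n attempts, and the expected number of attempts, are then:

```latex
\[
P(\text{at least one healthy boot in } n \text{ attempts}) = 1 - (1 - p)^n,
\qquad
\mathbb{E}[\text{attempts until first healthy boot}] = \frac{1}{p}
\]
\[
p = 0.2:\quad 1 - 0.8^{5} \approx 0.67,\qquad 1 - 0.8^{10} \approx 0.89,\qquad \frac{1}{p} = 5
\]
```

Under that assumption, even five consecutive re-runs leave roughly a one-in-three chance of never reaching the build phase, and a healthy boot still has to fit the remaining ci-verify sections inside the timeout.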

Changes

  • freebsd-build.yml → manual-only. Dropped push / pull_request / cron triggers; kept workflow_dispatch. Bumped timeouts (190 step / 200 job) so a manual run on a lucky healthy VM can actually complete ci-verify. The header documents the full investigation + the path back to automation (self-hosted FreeBSD runner, or an upstream slirp fix).
  • freebsd-build-lite.yml → gates every push. Removed branches-ignore: [main]; lite now runs on every push, including main. It becomes the sole automated FreeBSD gate: boot + pkg + compile + link crate(1) + unit suite, in ~3-5 min and reliably green. The link step is the smoke step added in #215 (fix(build): -lnv for FreeBSD nvpair API + lite CI link smoke + getpeereid design notes).
  • cross-platform-actions v0.32.0 → v1.2.0 (all three pinned sites) for the June 16 Node 24 cutover. v1.x has no breaking changes affecting us (ubuntu-latest + QEMU).
  • Removed the superstitious memory: 6G / cpu_count: 3 overrides (the comment claimed they "restrained" defaults; v1.x default cpu_count is 2, so 3 was raising it).
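
The trigger changes above might look roughly like the sketch below, with the two workflow files shown as two YAML documents. This is not the repository's actual content: the `.github/workflows/` paths, the job id, the checkout pin, and the `run:` body are placeholders. Only the trigger sets, the 190/200 timeouts, the action pin, and FreeBSD 14.2 come from the description.

```yaml
# .github/workflows/freebsd-build.yml (full build): manual-only
# Path, job id, and script contents are assumptions for illustration.
on:
  workflow_dispatch:          # push / pull_request / schedule triggers dropped

jobs:
  full:                       # hypothetical job id
    runs-on: ubuntu-latest
    timeout-minutes: 200      # job-level budget
    steps:
      - uses: actions/checkout@v5   # illustrative pin
      - name: Build and verify in a FreeBSD VM
        timeout-minutes: 190  # step-level budget
        uses: cross-platform-actions/action@v1.2.0
        with:
          operating_system: freebsd
          version: '14.2'
          run: |
            # placeholder: the real build, kyua, and ci-verify commands go here
            uname -a
---
# .github/workflows/freebsd-build-lite.yml (lite): gates every push
on:
  push:                       # no branches-ignore filter, so main is included
  workflow_dispatch:
# jobs: unchanged, omitted here
```

Keeping `workflow_dispatch` on both files means either job can still be started by hand from the Actions tab.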

Coverage after this PR

| Trigger | Linux unit | FreeBSD lite | FreeBSD full |
|---|---|---|---|
| push (feature/PR/main) | ✅ | ✅ (compile/link/unit) | |
| manual dispatch | | | ✅ functional + ci-verify, re-run until healthy VM |

Functional tests + ci-verify become on-demand deep validation rather than a flaky gate.

Test plan

  • Both workflow files parse as valid YAML; trigger sets confirmed (full: [workflow_dispatch], lite: push (all branches) + workflow_dispatch).
  • This PR's own push runs lite on v1.2.0 — green = the bumped action boots/SSHes/compiles/links/tests on FreeBSD 14.2. No more red full checks blocking the PR.
  • Manual workflow_dispatch of the full build, when desired, validates functional + ci-verify (re-run until a healthy VM lands).

https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK

claude added 5 commits June 10, 2026 08:02
Closes out the SSH-to-VM flake investigation plan, step 2.

Findings from step 1 (diagnosis):
- The historical weekly-cron failures (6 consecutive, May 4 - Jun 8)
  all died ~2 minutes in, at the host->VM SSH phase, before any build
  step ran. The VM console showed a clean boot up to the login prompt.
- The same workflow file (same memory: 6G / cpu_count: 3 overrides)
  is right now running fine on pull_request-triggered jobs — 14+
  minutes into the build phase on both 14.2 and 15.0 VMs. So neither
  the resource overrides nor the FreeBSD image versions are the
  cause; the flake is runner-environment-side and intermittent (or
  was fixed upstream in the runner image between Jun 8 and Jun 10).

The actionable, deterministic risk is different: GitHub forces
Node 20 actions onto Node 24 by default starting June 16, 2026, and
every run has been warning that cross-platform-actions/action@v0.32.0
runs on Node 20. Upstream v1.0.0's explicit breaking change is the
Node 20 -> 24 requirement; v1.2.0 is current. Our usage is unaffected
by the other v1.0.0 breaks (macOS runner support and Xhyve removed —
we run ubuntu-latest + QEMU). The `run:` input is deprecated in v1.x
in favor of cpa.sh multi-step syntax but still functional; migrating
to cpa.sh is intentionally NOT mixed into this bump.

Bumped in all three pinned sites: freebsd-build.yml,
freebsd-build-lite.yml, release.yml.

Validation: pushing this branch runs lite on the bumped action;
opening the PR runs the full workflow on it too (the pull_request ->
main trigger fixed in #216), so both VM paths get exercised before
merge. release.yml gets its first exercise on the next version tag.

https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
Step 3 of the SSH-hang investigation, driven by new evidence that
overturns the earlier "intermittent infra flake" read:

- lite job (no resource overrides; v1.x action defaults are 6G / 2
  vCPU): passes the SSH phase consistently — green again today on
  the v1.2.0 bump, 3 minutes end to end.
- full job (memory: 6G, cpu_count: 3): hung at the SSH phase in
  EVERY observed run — six weekly crons (May 4 .. Jun 8) and both
  runs today (one on v0.32.0, one on v1.2.0). Same console
  signature every time: VM boots to the login prompt, ntpd spams
  "Name does not resolve" (slirp DNS dead inside the VM), the
  host-side ssh process hangs until the 45-minute step timeout
  kills the job ("Terminate orphan process: ... (ssh)").

The action version is exonerated (hang reproduces identically on
v0.32.0 and v1.2.0) and so is "bad runner day" (lite passed minutes
apart on the same pool). The only remaining config difference
between the always-green lite job and the never-green full job is
the explicit memory/cpu_count pair, so align full with known-good
and let the action defaults apply.

The removed comment claimed the overrides "restrained" the default
~6GB / 4 vCPU — that was stale: the v1.x default cpu_count is 2,
so cpu_count: 3 was actually RAISING it. With the override gone the
in-script JOBS computation degrades gracefully (NCPU=2 -> JOBS=2).
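
For illustration, the aligned step might look like the sketch below. The step name and script body are hypothetical. In particular, the JOBS formula is an assumption: the commit only states that NCPU=2 yields JOBS=2. `memory` and `cpu_count` are the action inputs named above.

```yaml
- name: Build in FreeBSD VM           # hypothetical step name
  uses: cross-platform-actions/action@v1.2.0
  with:
    operating_system: freebsd
    version: '14.2'
    # memory: 6G                      # removed: let the action default apply
    # cpu_count: 3                    # removed: was raising the v1.x default of 2
    run: |
      # Assumed shape of the in-script computation, not the real script:
      NCPU=$(sysctl -n hw.ncpu)       # 2 on a default VM
      JOBS=$NCPU
      echo "building with ${JOBS} parallel jobs"
```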

If the full job still hangs with default resources, the next
isolated variable is the run script size / sync_files behavior —
but one experiment at a time; this PR's own full run is the test.

https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
The resource-override removal (previous commit) cured the SSH hang:
the very next full run got past the SSH phase for the first time in
any observed run. It installed deps, built crate, ran the full kyua
suite, and progressed into ci-verify.sh, where the 45-minute step
timeout killed it at section T10 of 29.

Causality, now fully closed: git log -S shows the memory/cpu_count
overrides arrived in 4ab5692 (0.9.2, #160, May 8) — and every weekly
cron from May 11 onward hung at SSH. The full pipeline has therefore
NEVER completed with its current content; the 45-minute limit predates
the suite's growth (unit tests 1316 -> 1393 across 1.1.x, ci-verify at
29 sections with runtime jail integration) and was never re-validated
because the SSH hang masked everything behind it.

Budget reasoning: the build itself is fast (lite does boot + pkg +
compile + link + unit suite in ~3 min on the same default VM); the
long tail is the functional kyua tests + ci-verify's runtime sections
(real jails, base.txz handling) on a 2-vCPU VM. T10/29 at 45 min
suggests a legitimate total around 60-80 min; 90 (step) / 100 (job)
leaves headroom without letting a genuine hang burn six hours.

This PR's own full run is the experiment; if T11..T29 still don't fit,
the next move is profiling ci-verify's sections, not more timeout.

https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
The hypothesis "T10 timeout means raise to 90 min" was refuted by the
next run: the 90-min experiment hung at SSH again on BOTH 14.2 and 15.0
and ate the full budget for nothing, with the same ntpd-spam signature
as every other hang. The single non-hang observation (08:03 / 14.2 /
v1.2.0, with the old overrides) was therefore an outlier, not a fix:
that one got a healthy VM out of the pool and hit a legitimate "45 min
was just too tight for T10..T29" issue underneath the hang.

Sanest pragmatic course at the current data quality:

- keep v1.2.0 (Node 24 readiness is independently real, and v1.x has
  no breaking changes affecting us);
- keep default VM resources (overrides were superstition);
- revert 90/100 -> 45/50: do not burn 90 min on a deterministically-
  dead VM. Re-pushes from later merges sample a fresh VM each time;
  ~1-in-5 hits a healthy one.

If the un-hung run also legitimately needs > 45 min to finish ci-verify,
that's a separate problem from the hang: we'll see real ci-verify
output in the logs then, and can profile. Today every "long" run is a
hang, not a slow build.

In-action retry (nick-fields/retry around cross-platform-actions) was
attempted and rolled back: cross-platform-actions runs as a docker
action with its own setup/teardown, not a shell command — wrapping it
in nick-fields/retry needs cpa.sh multi-step migration (deprecating
the `run:` input). Tracked as a follow-up.
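
To illustrate the mismatch: nick-fields/retry re-runs a shell `command`, so there is no place to put a `uses:` step inside it. A sketch, assuming the action's `v3` tag and a hypothetical script name:

```yaml
- name: Retry a flaky shell command   # works: the retried unit is a command
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 10
    max_attempts: 3
    command: ./hypothetical-flaky-step.sh

# There is no equivalent input for retrying another action, e.g.
#   uses: cross-platform-actions/action@v1.2.0
# which is why the retry would need the VM work expressed as shell commands.
```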

Full investigation history kept in the in-file comment so the next
maintainer doesn't have to reconstruct.

https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
Operator call after the SSH/timeout investigation: GitHub-hosted
FreeBSD runners are too unreliable for the full pipeline, so stop
trying to gate on it.

Two independent, separately-confirmed failure modes make the full
workflow unfit for automation on GitHub's QEMU-via-cross-platform-
actions runners:

  1. Per-boot SSH/DNS flake — ~4 of 5 VM boots hang: the guest's
     slirp DNS is dead, host-side ssh blocks until the step timeout
     (console shows ntpd spamming "Name does not resolve" for 45+
     min). Not our config: reproduced across v0.32.0 and v1.2.0,
     with and without the memory/cpu_count overrides.
  2. Even a healthy boot doesn't finish — the one run that got past
     SSH built fine, ran the kyua suite, and then timed out at
     section T10/29 of ci-verify.sh; the runtime jail tests simply
     do not fit in 90 min on a 2-vCPU QEMU VM.

Changes:
  - freebsd-build.yml: drop push / pull_request / cron triggers,
    leave workflow_dispatch only. Bump the step/job timeouts
    (190/200) so a *manual* run on a lucky healthy VM can actually
    complete ci-verify. Header documents the whole investigation and
    the path back to automation (self-hosted FreeBSD runner, or an
    upstream slirp fix).
  - freebsd-build-lite.yml: remove `branches-ignore: [main]` so lite
    now runs on EVERY push including main. It becomes the sole
    automated FreeBSD gate: boot + pkg + compile + link crate(1) +
    unit suite, ~3-5 min, reliably green.

Net coverage per push (incl. PRs and main): Linux unit tests +
FreeBSD lite (compile/link/unit). Functional tests + ci-verify
become on-demand deep validation via the manual full workflow,
re-run until a healthy VM.

https://claude.ai/code/session_01X6t6tzVypHye5bDGLxzmZK
@click0 click0 changed the title from "ci: bump cross-platform-actions to v1.2.0 (Node 24 readiness, flake follow-up)" to "ci: make full FreeBSD build manual-only, lite gates every push (+ v1.2.0 bump)" on Jun 10, 2026
@click0 click0 merged commit b0f4368 into main Jun 10, 2026
1 check passed
@click0 click0 deleted the claude/analyze-test-coverage-nCOJW branch June 10, 2026 15:50