relops-provisioner: zero-touch bootstrap for the macOS fleet #1

Merged
rcurranmoz merged 6 commits into main from relops-provisioner
Jun 30, 2026
@rcurranmoz rcurranmoz commented Jun 30, 2026


Summary

New Cloud Run service that closes the last manual step in the RelOps macOS bootstrap. Every 5 minutes, it polls SimpleMDM assignment groups, evaluates 6 safety guards per device, and creates a SimpleMDM script-job against any device that passes all 6. The operator workflow for re-provisioning becomes: allowlist → quarantine → EACS (Erase All Content and Settings) → walk away.

  • 6 guards (default-deny, all-must-pass): kill_switch, allowlist, not_locked, mdm_state (BST preconditions), rate_limit (24h GCS-backed), tc_state (composite — fire only on no-TC-record OR currently-quarantined)
  • Three independent emergency stops: DRY_RUN env var, kill switch secret, allowlist secret
  • Internal-only Cloud Run ingress, OIDC-authed Cloud Scheduler, narrowly-scoped service accounts for run + cron
  • Quarantine becomes positive operator consent for re-provisioning (the original design treated it as veto — broke the actual workflow)
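
For readers who want a concrete picture of "default-deny, all-must-pass", here is a minimal sketch. It is not the code in provisioner/: the `Device` and `TickContext` shapes, the guard signatures, and the `evaluate` helper are all hypothetical, and only three of the six guards are shown. It assumes each guard returns a skip reason on failure and that an exception inside a guard counts as a denial.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Device:
    # Hypothetical minimal device record; the real fields are not shown in this PR.
    hostname: str
    locked: bool = False


@dataclass(frozen=True)
class TickContext:
    # Hypothetical per-tick inputs, assumed to be re-read on every tick.
    kill_switch_on: bool
    allowlist: frozenset


# A guard returns None to pass, or a skip-reason string to deny.
Guard = Callable[[Device, TickContext], Optional[str]]


def kill_switch(device: Device, ctx: TickContext) -> Optional[str]:
    return "kill_switch" if ctx.kill_switch_on else None


def allowlist(device: Device, ctx: TickContext) -> Optional[str]:
    return None if device.hostname in ctx.allowlist else "allowlist"


def not_locked(device: Device, ctx: TickContext) -> Optional[str]:
    return "not_locked" if device.locked else None


def evaluate(
    device: Device, ctx: TickContext, guards: Sequence[Guard]
) -> Tuple[bool, Optional[str]]:
    """Return (may_fire, skip_reason). Any failure or exception denies."""
    if not guards:
        # An empty guard list must never mean "fire".
        return False, "no_guards_configured"
    for guard in guards:
        try:
            reason = guard(device, ctx)
        except Exception:
            return False, f"{guard.__name__}:error"
        if reason is not None:
            return False, reason
    return True, None
```

In this sketch the first failing guard short-circuits and its name becomes the logged skip reason, which is one way to get the "correct skip reason" behaviour checked in the smoke tests.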

Validation done

  • End-to-end fire validated against m4-81 (production worker, allowlisted + quarantined) on 2026-06-30: device.fired hostname=macmini-m4-81 script_job_id=305469 followed by SimpleMDM delivering the bootstrap script
  • Smoke tests on live allowlist + kill switch (each flip → next tick → correct skip reason → restore) — all three cycles clean
  • Production-protection: same provisioner, same DRY_RUN=false, tested against m4-81 BEFORE quarantine semantics were inverted — three independent TC guards all caught its in-service state and skipped. Defense-in-depth held.

Test plan

  • terraform plan validates cleanly (13 new resources + 2 cosmetic in-place updates)
  • terraform apply succeeded; Cloud Run + Cloud Scheduler + secrets + GCS state bucket created
  • Container builds via gcloud builds submit; runs cleanly with empty secrets (lifespan is side-effect-free)
  • Dry-run validated: would_fire decisions logged, no SimpleMDM mutation
  • Live smoke tests: allowlist guard, kill switch guard
  • Live production-protection test: m4-81 quarantined → skip with three guards independently catching state
  • First real fire: m4-81 post-EACS → script_job 305469 created in SimpleMDM
  • Power-management profile gap (filed separately as relops-provisioner follow-up): hosts without a power-mgmt MDM profile sleep mid-bootstrap; SSH-wake is currently required. Addressed by pushing the profile to the assignment group.
  • First-time provisioning of fresh M4 hardware (haven't done this yet — only re-provisioning validated)
  • provisioned_at custom-attribute idempotency guard (filed as future work)

Files

  • provisioner/ — new directory, Cloud Run service (FastAPI + httpx + Pydantic, ~400 LOC across 7 modules)
  • terraform/provisioner.tf — Cloud Run service + Scheduler job + 3 secrets + GCS state bucket + Artifact Registry repo + IAM
  • terraform/main.tf — added cloudscheduler.googleapis.com and storage.googleapis.com to enabled APIs
  • terraform/mtls.tf — fixed broker client_validation_trust_config to reference project number (matches what GCP stores), eliminating a spurious plan-time drift that would have forced TLS policy replacement on every apply
  • terraform/variables.tf: added provisioner_image and provisioner_dry_run vars

🤖 Generated with Claude Code

rcurranmoz and others added 6 commits June 30, 2026 14:14
The mtls.tf server_tls_policy referenced the trust config by project
id (var.project_id). GCP's Network Security API stores that field as
the project number form internally, and it's immutable, so every
terraform plan flagged the policy for delete+recreate even though
nothing had actually changed. Each apply would have briefly knocked
out broker mTLS during the swap.

Switching the reference to data.google_project.project.number matches
what the API stores, eliminates the spurious diff, and removes the
tripwire for future operators running `terraform plan`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New service (relops-provisioner) closes the last manual step in the
zero-touch bootstrap. Cloud Scheduler ticks every 5 minutes; the
service walks configured SimpleMDM assignment groups, evaluates 7
safety guards per device, and triggers the per-group bootstrap script
on any device that passes all 7.

Safety design — default-deny across multiple independent layers:
* DRY_RUN env var (default true) — firing branch unreachable in code
* kill_switch secret — operator-flippable halt, read every tick
* allowlist secret — opt-in list of host names, fresh each tick
* not_locked custom-attribute guard — per-device opt-out
* rate_limit (24h GCS-backed) — defense against guard-logic bugs
* tc_not_alive / no_recent_task / not_quarantined — production-protection
  via Taskcluster worker-manager state
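
As an illustration of the 24h rate_limit guard listed above, the sketch below uses an in-memory dictionary as a stand-in for the GCS-backed state. The class and function names are hypothetical, and the actual object layout in the bucket is not described in this PR.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

RATE_LIMIT_WINDOW = timedelta(hours=24)


class InMemoryFireLog:
    """Stand-in for the GCS-backed state: hostname -> last fire time."""

    def __init__(self) -> None:
        self._last_fired: Dict[str, datetime] = {}

    def last_fired(self, hostname: str) -> Optional[datetime]:
        return self._last_fired.get(hostname)

    def record(self, hostname: str, when: datetime) -> None:
        self._last_fired[hostname] = when


def rate_limit_ok(log: InMemoryFireLog, hostname: str, now: datetime) -> bool:
    """Pass only if the host has not been fired on within the last 24 hours."""
    last = log.last_fired(hostname)
    if last is None:
        return True
    return now - last >= RATE_LIMIT_WINDOW


# Usage: record a fire, then check again on later ticks.
log = InMemoryFireLog()
t0 = datetime(2026, 6, 30, 14, 0, tzinfo=timezone.utc)
first_check = rate_limit_ok(log, "macmini-m4-81", t0)
log.record("macmini-m4-81", t0)
five_minutes_later = rate_limit_ok(log, "macmini-m4-81", t0 + timedelta(minutes=5))
next_day = rate_limit_ok(log, "macmini-m4-81", t0 + timedelta(hours=24))
```

Because the check depends only on the recorded timestamp, it still blocks a repeat fire if some other guard has a logic bug, which is the role the list above assigns to it.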

Flipping DRY_RUN=false requires a deliberate tfvar change + apply.
The kill switch and allowlist independently halt firing without a
redeploy, both verified via smoke tests against m4-81 before this
commit.
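
The "DRY_RUN defaults to true" behaviour can be sketched as follows. This is an assumption about how such a flag might be parsed, not the service's actual config code; `dry_run_enabled` and `decide` are hypothetical names, and `would_fire` mirrors the decision name mentioned in the test plan.

```python
import os
from typing import Mapping


def dry_run_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Default-deny: dry-run stays on unless DRY_RUN is exactly 'false'."""
    return env.get("DRY_RUN", "true").strip().lower() != "false"


def decide(passes_all_guards: bool, dry_run: bool) -> str:
    """Map guard outcome plus dry-run flag to a decision label."""
    if not passes_all_guards:
        return "skip"
    # In dry-run the firing branch is never reached; the decision is only logged.
    return "would_fire" if dry_run else "fire"
```

Treating anything other than an explicit `false` as dry-run means a typo or a missing variable fails safe.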

Infra (terraform/provisioner.tf): internal-only Cloud Run service,
Cloud Scheduler with OIDC, three Secret Manager secrets (api token,
allowlist, kill switch), GCS bucket for per-device rate-limit state,
dedicated Artifact Registry repo, and two service accounts (run +
cron) with narrowly-scoped IAM. cloudscheduler.googleapis.com and
storage.googleapis.com added to the project APIs list.

The 8th guard from the original design (no_prior_success — check
SimpleMDM script-job history) was dropped: SimpleMDM's list endpoint
exposes only aggregate per-job counts, not per-device outcomes, so
the guard wasn't implementable against the public API. Future
replacement is a `provisioned_at` custom attribute written by the
bootstrap script on completion (SimpleMDM's POST /script_jobs accepts
a custom_attribute param for exactly this).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two corrections found during the first live test against m4-81:

1) Role in target_groups was 'gecko_t_osx_1500_m4_no_sip', conflating
   SimpleMDM assignment-group membership with TC worker-pool membership.
   m4-81 is in the no-sip SimpleMDM assignment group (where the MDM
   bootstrap script targets it) but its TC registration is in the
   production releng-hardware/gecko-t-osx-1500-m4 pool. These two
   groupings are independent. Role string now points at the real TC
   pool so the production-protection guards see real state.

2) Taskcluster client was querying worker-manager, which only tracks
   provisioner-spawned workers. Bare-metal hardware lives in the queue
   API — that's where quarantine, lastRun, and expires actually
   surface. Switched to /api/queue/v1/provisioners/.../workers/<group>/<id>.

   The freshness signal had to change too: queue records have no
   'lastChecked' field. Using the existence of recentTasks as the
   freshness proxy — conservative: any recentTasks entry means the
   worker has been active in TC's rolling window and we should not
   touch it.

Validated against m4-81 (production worker, quarantined,
quarantineUntil=3026): with dry_run=False, all three TC guards
independently caught its state and skipped the device. Defense-in-depth
held end to end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two design changes informed by the live test against m4-81.

1) Collapse three TC guards into one composite tc_state.

The original tc_not_alive / no_recent_task / not_quarantined guards
treated quarantine as a hard veto: "operator pulled this worker,
don't touch." That breaks the re-provisioning workflow where
quarantine is the operator's explicit consent signal — the natural
flow is quarantine -> EACS -> auto-fire -> un-quarantine.

The composite guard inverts the quarantine semantic:

    fire eligibility = (no TC record at all)
                    OR (TC record exists AND device is currently quarantined)

i.e. quarantine becomes positive consent. The allowlist (guard #2)
remains the load-bearing per-host opt-in. To re-provision a host the
operator must both allowlist it AND quarantine it; either action
alone is insufficient.
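
The composite rule above can be sketched as a single function. This is an illustration under stated assumptions, not the code in taskcluster.py: it assumes the worker record is a dict (or None when the queue has no record) carrying an ISO 8601 `quarantineUntil` string, and the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_ts(value: str) -> datetime:
    # Assumes ISO 8601 timestamps with a trailing "Z" for UTC.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def tc_state_allows_fire(worker: Optional[dict], now: datetime) -> bool:
    """Eligible if there is no TC record, or the worker is currently quarantined."""
    if worker is None:
        return True  # no TC record at all
    until = worker.get("quarantineUntil")
    if not until:
        return False  # record exists and is not quarantined: treat as in service
    try:
        return parse_ts(until) > now
    except (ValueError, TypeError):
        return False  # unparseable or naive timestamp: default-deny
```

An expired quarantine is treated the same as no quarantine, so a host that has been returned to service is not touched even if it is still on the allowlist.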

2) Add mdm_state guard for Bootstrap Token preconditions.

Future EACS-ability requires the SimpleMDM enrollment to have the
right level of management rights: DEP-enrolled, User-Approved MDM
Enrollment, and Supervised mode. If any of these is False, the BST
escrow flow can't work and a future EACS will silently fail,
breaking the next re-provisioning cycle.

This is a necessary, not sufficient, check. SimpleMDM doesn't expose
which user holds the escrowed BST — the gotcha where BST lands on
cltbld instead of admin can still bite even with this guard passing.
The guard catches the easier misconfiguration cases (wrong enrollment
flow, missing supervision) before they cause silent breakage.

Plumbed through Device dataclass via three new fields read from the
SimpleMDM device record: dep_enrolled, is_user_approved_enrollment,
is_supervised.
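
A minimal sketch of how the three fields might feed the guard. The field names come from this commit message; the rest of the `Device` shape, the defaults, and the skip-reason format are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_MDM_FLAGS = ("dep_enrolled", "is_user_approved_enrollment", "is_supervised")


@dataclass(frozen=True)
class Device:
    hostname: str
    # Defaults are False so a record missing a flag fails the guard.
    dep_enrolled: bool = False
    is_user_approved_enrollment: bool = False
    is_supervised: bool = False


def mdm_state(device: Device) -> Optional[str]:
    """Return None to pass, or a skip reason naming the flags that are not True."""
    missing = [name for name in REQUIRED_MDM_FLAGS if getattr(device, name) is not True]
    if missing:
        return "mdm_state:" + ",".join(missing)
    return None
```

Naming the failing flags in the skip reason makes it easier to tell a wrong enrollment flow from missing supervision when reading the logs.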

Guard count: 7 -> 5 (drop three TC, add composite) -> 6 (add
mdm_state). See README for the canonical list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three small changes following the live test against m4-81:

* Remove most_recent_task_completion and _most_recent_run_time from
  taskcluster.py. Both went unused after the three TC guards collapsed
  into the composite tc_state guard.
* Remove tc_alive_threshold_minutes and tc_recent_task_threshold_hours
  from config.py for the same reason.
* Simplify get_worker to return only the field tc_state actually reads
  (quarantineUntil). The expanded shape was overkill once recentTasks
  stopped being consulted.

README is a full rewrite: one-tick ASCII diagram, guards table, the
re-provisioning runbook called out as a first-class section, a known-
limits catalog (BST-on-cltbld blind spot, power-management dependency,
no provisioned_at idempotency yet). Same technical content, less wall
of text.

Version 0.1.8 -> 0.1.9. No behavior change; cleanup-only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI's `terraform fmt -check` flagged the alignment of
client_validation_mode after my project-number fix split the previously
aligned pair. fmt-canonical now (no padding on the lone field).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rcurranmoz


Wow very good

@rcurranmoz rcurranmoz merged commit c634c4d into main Jun 30, 2026
2 checks passed