feat(tools): standalone tvc-deploy helper + manual deploy workflow#395
Conversation
Lean, fast-compiling Rust binary (own workspace; deps: qos_p256 + serde_json only, no visualsign crates) that owns the parser_app TVC deploy flow: - gen-operator-key: mint a qos_p256 operator key; seed -> 0600 file, only the public key is printed (never the seed). - deploy: assemble tvc-deploy.json (gRPC health), create the deployment, gate on the manifest pivot hash matching --expected-digest before approving, approve, poll until replicas are healthy, then set live. Retries the transient TVC states (status-not-ready, zero-healthy-replicas) and tolerates the fresh-app auto-target. Shells out to the `tvc` CLI for API actions. Proven end-to-end against a throwaway dev app (create -> digest gate -> approve -> 3/3 -> live), which was then deleted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A workflow_dispatch workflow to exercise dev/staging parser_app deploys on demand: builds the standalone tvc-deploy helper and runs create -> digest safety-gate -> approve -> poll-to-healthy -> set-live with operator/API-key secrets. Inputs (app_id, image_url, expected_digest, operator_id, qos/host/port) let an operator trigger a deploy with args. Dev/staging only; prod stays manual. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…code Drop the hand-rolled hex_encode; use P256Pair::to_master_seed_hex() and P256Public::to_hex_bytes() (qos_p256 owns these formats). No new deps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a standalone Rust helper and a manually-triggered GitHub Actions workflow to perform dev/staging TVC deployments of parser_app, including manifest digest gating and health polling, without pulling in the main src/ Rust workspace.
Changes:
- Introduces
tools/tvc-deploy/as an independent Rust crate to generate operator keys and orchestrate TVC deploy flows via thetvcCLI. - Adds a
workflow_dispatchGitHub Actions workflow to build/run the helper with input parameters and scrub the operator seed afterward. - Vendors a lockfile and minimal
.gitignoreto keep the helper self-contained and fast to build.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/tvc-deploy/src/main.rs | Implements the deploy helper (flag parsing, config generation, digest gate, approve, poll health, set-live). |
| tools/tvc-deploy/Cargo.toml | Declares the standalone crate and its minimal dependencies, with its own workspace root. |
| tools/tvc-deploy/Cargo.lock | Locks the helper’s dependency graph for reproducible builds. |
| tools/tvc-deploy/.gitignore | Ignores the helper’s local target/ directory. |
| .github/workflows/tvc-deploy.yml | Adds a manual CI workflow to install tvc, build the helper, run deploy, and scrub the seed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…gate Reorganize the deploy helper on the idiomatic shell-orchestration stack (xshell cmd! + anyhow + lexopt), dropping the hand-rolled process wrappers, String errors, HashMap flag parser, and the to_str()-lossy path() helper. Replace the fragile `approve --dry-run` manifest-hash parse with an image-derived digest gate: extract /parser_app from the image and sha256 it, asserting it equals --expected-digest before deploying (ties the digest to the actual binary). Addresses review on #395: - workflow: add minimal `permissions: contents: read`; pin `cargo install tvc --version 0.6.2 --locked` (reproducible, no drift). - deploy(): move config-write inside the cleanup closure so an early error never leaves the operator seed temp file on disk. - path() (lossy non-UTF-8 -> "") removed; xshell takes Path directly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Addressed the review feedback in
Also reorganized the helper onto the idiomatic shell-orchestration stack ( |
…label Add a label-gated pull_request trigger (mirrors the stagex pattern) so the deploy workflow can be exercised from a PR before merge: applying the `tvc-deploy-test` label runs a deploy against the dev TEST app using repo vars.TVC_TEST_* (never the live app). workflow_dispatch keeps explicit inputs; the job `if` gates to dispatch or the specific label, and a concurrency group prevents overlapping deploys. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
crates.io tvc 0.6.2 takes `deploy create <CONFIG_FILE>` positionally, but the helper (and the locally-built tvc) use `--config-file`. 0.7.0 is the published release that carries the `--config-file` flag interface; pin to it so CI matches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a smoke step to the TVC deploy workflow that parses a known Solana V0+ALT transaction through the deployed dev parser (/visualsign-dev) and asserts it renders — a regression guard for the "Cannot render V0" failure. Drives the published turnkey-client container (no Go), reconstructing the parse API key from the existing TVC_API_KEY_* secrets; abort guard skips on a transport outage. Requires the turnkey-client image on GHCR (visualsign-turnkeyclient). - scripts/smoke.sh: container-driven parse + jq assertions. - testdata/solana_v0_alt.b64: the V0+ALT fixture. - tvc-deploy.yml: Smoke step (+ packages: read) and key scrub. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- write_secret_file: explicitly chmod 0600 after open (mode() only applies on create; a pre-existing broader-perm file could leave the seed world-readable). - verify_image_digest: always remove the extracted temp binary + rm the container, even when docker cp / sha256sum fails early. - temp_path: add a per-process atomic counter so same-tick calls can't collide. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bound the job (cargo install tvc + poll-to-healthy + smoke) so a hung step can't run indefinitely. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve the approval seed as flag -> env TVC_CI_OPERATOR_SEED -> none. When none is given, omit --operator-seed so `tvc deploy approve` falls back to the logged-in org operator key (the local `tvc login` path), making local deploys work without materializing a seed file. CI still sets the env, so it continues to write and scrub a temp seed; the flag is splatted as OsString args (no lossy path conversion) and only the env-sourced temp file is cleaned up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sion Switch the post-deploy smoke from `parse` to `verify --dev-path`, asserting the tx both RENDERS (Cannot-render-V0 guard) and cryptographically VERIFIES (AWS Nitro attestation + enclave signature via .valid/.attestationValid/ .signatureValid). Also: - Failure classification defaults to a hard error (exit 2); only a recognized endpoint outage skips, so an unpullable image can't masquerade as a pass. - --turnkey-client-path falls back to a local binary or source dir (built) when the container image is unavailable. - --turnkey-client-version / VSP_SMOKE_CLIENT_VERSION pins an approved client image instead of :latest; wired through the workflow (input -> repo var). - Pass the client's verification log through by default; --quiet suppresses it. - Fix a set -e bug where the FAIL-branch summary jq could mask exit 1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| # Manual dispatch, or the `tvc-deploy-test` label on a PR (test app only). | ||
| if: >- | ||
| github.event_name == 'workflow_dispatch' || | ||
| github.event.label.name == 'tvc-deploy-test' |
There was a problem hiding this comment.
must: The pull_request: types: [labeled] trigger runs tvc-deploy deploy, which unconditionally calls tvc deploy approve ... --dangerous-skip-interactive right after create, i.e. a full create->approve->set-live with no human checkpoint, fed by TVC_CI_OPERATOR_SEED/TVC_API_KEY_* secrets, triggered by anyone able to label a same-repo PR tvc-deploy-test. This already happened for real on this PRs own branch (feat/tvc-deploy-helper): GitHub Actions run 28013160686 shows checkout, build, Deploy and Scrub all green from the PRs own head commit. Because pull_request (unlike pull_request_target) resolves both the checked-out code AND the workflow YAML from the PR head, a same-repo collaborator can edit main.rs or this workflow file in the same PR to exfiltrate the operator seed / API keys, and there is no environment: gate or required-reviewer step anywhere in the file to stop it. This directly contradicts the requesters own stated design goal in the linked Slack thread ("give a role to Github thats only allowed to initiate a deploy, it can never approve, only humans should approve") and runs ahead of the phased plan (PRS-516/519/520/521), whose Enabler ticket (PRS-521, still Backlog) is supposed to provision scoped CI keys + a Turnkey policy before any auto-approve happens from CI.
There was a problem hiding this comment.
Agreed, and this is the exact concern #404 is designed around: splitting deploy into separate initiate / approve / promote subcommands (behind a TvcOps trait, with unit tests) so CI can be scoped to initiate-only and a human/operator key does the approve. That said, #404 currently only has the main.rs split + design docs landed — the workflow-level fix (retiring this labeled-PR auto-approve path in favor of the planned Release/Promote workflows, Phase 4) hasn't shipped, and PRS-521 (scoped CI keys + Turnkey policy) is still Backlog. So this stays an open risk on the actual tvc-deploy-test label path in this PR until that phase lands. Tracking there rather than duplicating the fix here — leaving this thread open until Phase 4 ships or we explicitly decide to gate/disable the labeled-PR path in the interim.
| "$ERRFILE"; then | ||
| echo "SKIP: dev endpoint unreachable / outage — not a regression" >&2 | ||
| exit 0 | ||
| fi |
There was a problem hiding this comment.
must: The outage-vs-regression classifier greps stderr for generic terms including bare timeout, context deadline, and \bEOF\b, and treats any match as a pre-existing outage (exit 0, reported as a pass/skip). A genuinely broken deployment, e.g. an enclave that hangs on the gRPC health check or crashes mid-response, is very likely to surface as exactly context deadline exceeded or an EOF from the clients perspective. Since this step runs immediately after set-live specifically to catch a broken deploy, a deploy broken in a way that manifests as a timeout or dropped connection is misclassified as an unrelated outage and CI reports smoke PASS, defeating the checks purpose at the moment it matters most.
There was a problem hiding this comment.
Fixed in c1fb6be: the outage classifier no longer treats bare timeout/EOF/context deadline as a benign outage-skip. It now only skips on unambiguous never-connected errors (connection refused/reset, no such host, dial tcp, tls handshake, network unreachable, etc.); timeout/EOF/context-deadline on an otherwise-reachable endpoint now falls through to an explicit FAIL (exit 1) instead of SKIP, since those are exactly what a hung or crashed enclave would surface as right after set-live.
| sleep(POLL_INTERVAL); | ||
| continue; | ||
| } | ||
| bail!("set-live failed: {}", msg.trim()); |
There was a problem hiding this comment.
should: set_live lowercases combined stdout+stderr from tvc app set-live-deploy and treats a bare substring match on "already" (line 259) as success, regardless of the actual failure mode. If tvc ever emits an unrelated error containing "already" (e.g. a retry-exhaustion message), this would silently report success on a failed deploy in a fully automated, no-human-in-the-loop path. No current evidence this string appears in a real failure message today, so this is a latent risk rather than a proven break.
There was a problem hiding this comment.
Fixed in c1fb6be — set_live now requires both "already" and "live" in the combined stdout+stderr before treating it as success, so an unrelated failure that happens to contain "already" (e.g. a retry-exhaustion message) can no longer be misreported as a live deploy.
| # Avoid overlapping deploys to the same target. | ||
| concurrency: | ||
| group: tvc-deploy-${{ github.ref }} | ||
| cancel-in-progress: false |
There was a problem hiding this comment.
should: For the PR-label path, github.ref is per-PR (refs/pull/<n>/merge), so two different PRs both labeled tvc-deploy-test at the same time land in different concurrency groups and can race each others create/approve/set-live against the same vars.TVC_TEST_APP_ID. Same issue between two workflow_dispatch runs targeting the same app_id from different branches. Blast radius is limited to the dev/test app today, not prod.
| cancel-in-progress: false | |
| Key the group on the resolved app id instead, e.g. `tvc-deploy-${{ inputs.app_id || vars.TVC_TEST_APP_ID }}`. |
There was a problem hiding this comment.
Fixed in c1fb6be — took the suggested fix, concurrency group is now keyed on inputs.app_id || vars.TVC_TEST_APP_ID instead of github.ref.
| } | ||
| } | ||
| None | ||
| } |
There was a problem hiding this comment.
should: These are pure functions scraping freeform tvc CLI output (e.g. Healthy / Desired Replicas: X/Y) with zero test coverage, and the workflow only runs cargo build --release, never cargo test. If tvc ever reformats that text, deployment_health silently stops matching and poll_health spins until timeout with an "unknown" status, with nothing catching the regression before a real deploy run.
There was a problem hiding this comment.
Fixed in c1fb6be — added a #[cfg(test)] module covering deployment_health, validate_digest, and parse_after (8 unit tests), and added a cargo test step to the workflow before the build step (previously it only ran cargo build), so a regression in the freeform-output parsing now fails CI instead of silently spinning to an "unknown" status.
- smoke.sh: stop classifying timeout/EOF/context-deadline as a benign outage-skip -- those equally describe a connection reaching a hung or crashed enclave, which is exactly what this smoke check exists to catch. - set_live: require both "already" and "live" in the tvc output before treating it as success, so an unrelated failure containing "already" can't be misreported as a live deploy. - tvc-deploy.yml: key the concurrency group on the resolved app id instead of github.ref, so two PRs/dispatches targeting the same app serialize. - add unit tests for deployment_health/validate_digest/parse_after and run cargo test in CI (previously only cargo build ran). Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
shahan-khatchadourian-anchorage
left a comment
There was a problem hiding this comment.
Nice, thorough hardening across the review history here — the digest gate, seed scrubbing, and the recent set-live/concurrency/test-coverage fixes all look solid, and I independently re-verified cargo build/cargo test/cargo clippy all pass clean for tools/tvc-deploy.
Ran a second pass looking for anything not already covered in this thread. The one structural gap (no human-approval checkpoint on the label-triggered path) is already tracked as a must above and deferred to #404/PRS-521, so not re-raising it. Found a few smaller things below — none of these block merge, flagging as fast-follow candidates.
| CLIENT="$(resolve_fallback_client "$CLIENT_PATH")" || exit $? | ||
| else | ||
| CLIENT="$CONTAINER_CLIENT" | ||
| fi |
There was a problem hiding this comment.
Minor, non-blocking: this fallback still has a gap in the "harness failure vs. endpoint outage" distinction the comment above (lines 96-98) says it maintains.
If docker image inspect/docker pull both fail (e.g. GHCR unreachable from the runner) and no --turnkey-client-path is given, we fall through to CLIENT="$CONTAINER_CLIENT" here and try it anyway at line 116. docker run will attempt its own implicit pull, hit the same network failure, and emit text like dial tcp: lookup ghcr.io: ... or temporary failure in name resolution into $ERRFILE — which matches the SKIP regex at line 139, not the "harness couldn't run" path. So a CI-runner-side GHCR outage gets reported as SKIP: dev endpoint unreachable (exit 0) instead of the exit 2 harness failure this comment says it should be, and a real deploy regression underneath it wouldn't get caught.
Would be worth either treating a failed pull explicitly (bail with exit 2 before ever invoking $CLIENT) or having the SKIP/FAIL classifier distinguish "docker itself couldn't run" from "the target endpoint refused/timed out."
|
|
||
| fn deploy(sh: &Shell, flags: &HashMap<String, String>) -> Result<()> { | ||
| let app_id = req(flags, "app-id")?; | ||
| let image = req(flags, "image-url")?; |
There was a problem hiding this comment.
Minor, non-blocking: --image-url isn't validated as digest-pinned (e.g. ...@sha256:...), only --expected-digest's hex format is checked (validate_digest, line 329). The workflow's own doc comment asks callers to pass a pinned reference, but nothing here enforces it.
If a mutable tag is passed, verify_image_digest (line 211) hashes whatever that tag resolves to right now and the gate passes — but the same tag string is also written into pivotContainerImageUrl for the TVC deploy config, so if TVC ever re-resolves that tag later (redeploy, restart) it could pull a different image than the one actually verified. A cheap image.contains("@sha256:") check before the digest gate would close this without much code.
| } | ||
| let transient = msg.contains("not yet available") | ||
| || msg.contains("try again") | ||
| || msg.contains("not found") |
There was a problem hiding this comment.
Minor, non-blocking: msg.contains("not found") in the transient-retry set is fairly broad — it'd also match a permanent error like a typo'd --operator-id or --deploy-id (e.g. "operator not found in manifest set"), causing set_live to retry every 15s for the full 300s timeout before bailing with a generic message, instead of failing fast with something actionable.
Separately (very minor): the loop checks start.elapsed() < timeout before sleeping (line 270) rather than after, so on a transient message near the deadline it can overshoot SETLIVE_TIMEOUT by up to one POLL_INTERVAL (~15s) — poll_health's loop checks elapsed right before bailing and doesn't have this gap.
|
|
||
| /// Extract `/parser_app` from the image and sha256 it; it MUST equal the | ||
| /// submitted `--expected-digest`. Ties the deployed digest to the real binary. | ||
| fn verify_image_digest(sh: &Shell, image: &str, expected: &str) -> Result<()> { |
There was a problem hiding this comment.
Minor, non-blocking, efficiency nit: verify_image_digest spawns 4 subprocesses + a host temp file (docker create → docker cp → sha256sum → docker rm) to hash one file already inside the image. Since parser_app's final image stage is stagex/core-busybox (which ships sha256sum), docker run --rm --entrypoint sha256sum {image} /parser_app would get the identical guarantee (still hashing the real image content, not a rebuilt binary) in one subprocess, with --rm handling cleanup and no host temp file needed at all.
Summary
A standalone Rust helper + manual workflow to perform dev/staging
parser_appTVC deploys, plus a post-deploy smoke check.tools/tvc-deploy/— standalone helper (xshell + anyhow + lexopt; own workspace, novisualsigndeps)gen-operator-key --out <path>: mint aqos_p256operator key (seed → 0600 file, only the public key printed).deploy …: image-derived digest gate — extract/parser_appfrom the image,sha256it, and assert it equals--expected-digestbefore anything is submitted (ties the deployed digest to the actual binary) — then assembletvc-deploy.json(gRPC health),tvc deploy create→ approve → poll-to-healthy →set-live..github/workflows/tvc-deploy.yml— manual + label-gated deployworkflow_dispatch(explicit inputs) and atvc-deploy-test-labeledpull_requestpath (test app viavars.TVC_TEST_*). Pinstvc --version 0.7.0 --locked; minimal permissions.turnkey-clientcontainerparse --dev-pathagainst/visualsign-devand asserts the V0+ALT fixture renders (regression guard for "Cannot render V0"), reusingTVC_API_KEY_*; outage abort-guard. Requires the turnkey-client image on GHCR.Validated: helper proven end-to-end on a throwaway dev app and via a green labeled CI run against the test app
80b6f48f; smoke logic validated live (35,177 chars rendered).🤖 Generated with Claude Code