Skip to content

feat(tools): standalone tvc-deploy helper + manual deploy workflow#395

Merged
prasanna-anchorage merged 12 commits into
mainfrom
feat/tvc-deploy-helper
Jul 2, 2026
Merged

feat(tools): standalone tvc-deploy helper + manual deploy workflow#395
prasanna-anchorage merged 12 commits into
mainfrom
feat/tvc-deploy-helper

Conversation

@prasanna-anchorage

@prasanna-anchorage prasanna-anchorage commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

A standalone Rust helper + manual workflow to perform dev/staging parser_app TVC deploys, plus a post-deploy smoke check.

tools/tvc-deploy/ — standalone helper (xshell + anyhow + lexopt; own workspace, no visualsign deps)

  • gen-operator-key --out <path>: mint a qos_p256 operator key (seed → 0600 file, only the public key printed).
  • deploy …: image-derived digest gate — extract /parser_app from the image, sha256 it, and assert it equals --expected-digest before anything is submitted (ties the deployed digest to the actual binary) — then assemble tvc-deploy.json (gRPC health), tvc deploy create → approve → poll-to-healthy → set-live.

.github/workflows/tvc-deploy.yml — manual + label-gated deploy

  • workflow_dispatch (explicit inputs) and a tvc-deploy-test-labeled pull_request path (test app via vars.TVC_TEST_*). Pins tvc --version 0.7.0 --locked; minimal permissions.
  • Smoke step (post set-live): runs the published turnkey-client container parse --dev-path against /visualsign-dev and asserts the V0+ALT fixture renders (regression guard for "Cannot render V0"), reusing TVC_API_KEY_*; outage abort-guard. Requires the turnkey-client image on GHCR.

Validated: helper proven end-to-end on a throwaway dev app and via a green labeled CI run against the test app 80b6f48f; smoke logic validated live (35,177 chars rendered).

🤖 Generated with Claude Code

prasanna-anchorage and others added 2 commits June 23, 2026 07:17
Lean, fast-compiling Rust binary (own workspace; deps: qos_p256 + serde_json
only, no visualsign crates) that owns the parser_app TVC deploy flow:

- gen-operator-key: mint a qos_p256 operator key; seed -> 0600 file, only the
  public key is printed (never the seed).
- deploy: assemble tvc-deploy.json (gRPC health), create the deployment, gate
  on the manifest pivot hash matching --expected-digest before approving,
  approve, poll until replicas are healthy, then set live. Retries the
  transient TVC states (status-not-ready, zero-healthy-replicas) and tolerates
  the fresh-app auto-target. Shells out to the `tvc` CLI for API actions.

Proven end-to-end against a throwaway dev app (create -> digest gate -> approve
-> 3/3 -> live), which was then deleted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A workflow_dispatch workflow to exercise dev/staging parser_app deploys on
demand: builds the standalone tvc-deploy helper and runs create -> digest
safety-gate -> approve -> poll-to-healthy -> set-live with operator/API-key
secrets. Inputs (app_id, image_url, expected_digest, operator_id, qos/host/port)
let an operator trigger a deploy with args. Dev/staging only; prod stays manual.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 23, 2026 07:19
Comment thread .github/workflows/tvc-deploy.yml Fixed
…code

Drop the hand-rolled hex_encode; use P256Pair::to_master_seed_hex() and
P256Public::to_hex_bytes() (qos_p256 owns these formats). No new deps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a standalone Rust helper and a manually-triggered GitHub Actions workflow to perform dev/staging TVC deployments of parser_app, including manifest digest gating and health polling, without pulling in the main src/ Rust workspace.

Changes:

  • Introduces tools/tvc-deploy/ as an independent Rust crate to generate operator keys and orchestrate TVC deploy flows via the tvc CLI.
  • Adds a workflow_dispatch GitHub Actions workflow to build/run the helper with input parameters and scrub the operator seed afterward.
  • Vendors a lockfile and minimal .gitignore to keep the helper self-contained and fast to build.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tools/tvc-deploy/src/main.rs Implements the deploy helper (flag parsing, config generation, digest gate, approve, poll health, set-live).
tools/tvc-deploy/Cargo.toml Declares the standalone crate and its minimal dependencies, with its own workspace root.
tools/tvc-deploy/Cargo.lock Locks the helper’s dependency graph for reproducible builds.
tools/tvc-deploy/.gitignore Ignores the helper’s local target/ directory.
.github/workflows/tvc-deploy.yml Adds a manual CI workflow to install tvc, build the helper, run deploy, and scrub the seed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tools/tvc-deploy/src/main.rs Outdated
Comment thread tools/tvc-deploy/src/main.rs Outdated
Comment thread .github/workflows/tvc-deploy.yml
…gate

Reorganize the deploy helper on the idiomatic shell-orchestration stack
(xshell cmd! + anyhow + lexopt), dropping the hand-rolled process wrappers,
String errors, HashMap flag parser, and the to_str()-lossy path() helper.

Replace the fragile `approve --dry-run` manifest-hash parse with an
image-derived digest gate: extract /parser_app from the image and sha256 it,
asserting it equals --expected-digest before deploying (ties the digest to the
actual binary).

Addresses review on #395:
- workflow: add minimal `permissions: contents: read`; pin `cargo install tvc
  --version 0.6.2 --locked` (reproducible, no drift).
- deploy(): move config-write inside the cleanup closure so an early error
  never leaves the operator seed temp file on disk.
- path() (lossy non-UTF-8 -> "") removed; xshell takes Path directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@prasanna-anchorage

Copy link
Copy Markdown
Contributor Author

Addressed the review feedback in 6c4aef8d:

  • CodeQL (missing permissions): added a minimal permissions: { contents: read } to the workflow.
  • cargo install reproducibility (+ pin request): now cargo install tvc --version 0.6.2 --locked.
  • Operator-seed temp file could leak on early error: moved config assembly/write inside the cleanup closure, so the seed + config temp files are always removed regardless of where the flow fails.
  • path() lossy non-UTF-8 → empty arg: removed entirely — the helper was reorganized onto xshell (cmd!), which takes Path/PathBuf directly, so there's no to_str() fallback path anymore.

Also reorganized the helper onto the idiomatic shell-orchestration stack (xshell + anyhow + lexopt, all small/fast-compiling) and replaced the fragile approve --dry-run manifest parse with an image-derived digest gate (extract /parser_app, sha256, assert == --expected-digest) — stronger, ties the digest to the actual binary.

…label

Add a label-gated pull_request trigger (mirrors the stagex pattern) so the
deploy workflow can be exercised from a PR before merge: applying the
`tvc-deploy-test` label runs a deploy against the dev TEST app using repo
vars.TVC_TEST_* (never the live app). workflow_dispatch keeps explicit inputs;
the job `if` gates to dispatch or the specific label, and a concurrency group
prevents overlapping deploys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

Comment thread tools/tvc-deploy/src/main.rs
Comment thread tools/tvc-deploy/src/main.rs
Comment thread tools/tvc-deploy/src/main.rs
Comment thread tools/tvc-deploy/src/main.rs
@prasanna-anchorage prasanna-anchorage added the tvc-deploy-test Trigger a TVC test-app deploy from this PR label Jun 23, 2026
crates.io tvc 0.6.2 takes `deploy create <CONFIG_FILE>` positionally, but the
helper (and the locally-built tvc) use `--config-file`. 0.7.0 is the published
release that carries the `--config-file` flag interface; pin to it so CI matches.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@prasanna-anchorage prasanna-anchorage added tvc-deploy-test Trigger a TVC test-app deploy from this PR and removed tvc-deploy-test Trigger a TVC test-app deploy from this PR labels Jun 23, 2026
prasanna-anchorage and others added 3 commits June 24, 2026 20:08
Add a smoke step to the TVC deploy workflow that parses a known Solana V0+ALT
transaction through the deployed dev parser (/visualsign-dev) and asserts it
renders — a regression guard for the "Cannot render V0" failure. Drives the
published turnkey-client container (no Go), reconstructing the parse API key
from the existing TVC_API_KEY_* secrets; abort guard skips on a transport
outage. Requires the turnkey-client image on GHCR (visualsign-turnkeyclient).

- scripts/smoke.sh: container-driven parse + jq assertions.
- testdata/solana_v0_alt.b64: the V0+ALT fixture.
- tvc-deploy.yml: Smoke step (+ packages: read) and key scrub.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- write_secret_file: explicitly chmod 0600 after open (mode() only applies on
  create; a pre-existing broader-perm file could leave the seed world-readable).
- verify_image_digest: always remove the extracted temp binary + rm the
  container, even when docker cp / sha256sum fails early.
- temp_path: add a per-process atomic counter so same-tick calls can't collide.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bound the job (cargo install tvc + poll-to-healthy + smoke) so a hung step
can't run indefinitely.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@prasanna-anchorage prasanna-anchorage enabled auto-merge (squash) June 29, 2026 14:49
prasanna-anchorage and others added 2 commits June 29, 2026 20:23
Resolve the approval seed as flag -> env TVC_CI_OPERATOR_SEED -> none. When
none is given, omit --operator-seed so `tvc deploy approve` falls back to the
logged-in org operator key (the local `tvc login` path), making local deploys
work without materializing a seed file. CI still sets the env, so it continues
to write and scrub a temp seed; the flag is splatted as OsString args (no lossy
path conversion) and only the env-sourced temp file is cleaned up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sion

Switch the post-deploy smoke from `parse` to `verify --dev-path`, asserting the
tx both RENDERS (Cannot-render-V0 guard) and cryptographically VERIFIES (AWS
Nitro attestation + enclave signature via .valid/.attestationValid/
.signatureValid). Also:

- Failure classification defaults to a hard error (exit 2); only a recognized
  endpoint outage skips, so an unpullable image can't masquerade as a pass.
- --turnkey-client-path falls back to a local binary or source dir (built) when
  the container image is unavailable.
- --turnkey-client-version / VSP_SMOKE_CLIENT_VERSION pins an approved client
  image instead of :latest; wired through the workflow (input -> repo var).
- Pass the client's verification log through by default; --quiet suppresses it.
- Fix a set -e bug where the FAIL-branch summary jq could mask exit 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Manual dispatch, or the `tvc-deploy-test` label on a PR (test app only).
if: >-
github.event_name == 'workflow_dispatch' ||
github.event.label.name == 'tvc-deploy-test'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must: The pull_request: types: [labeled] trigger runs tvc-deploy deploy, which unconditionally calls tvc deploy approve ... --dangerous-skip-interactive right after create, i.e. a full create->approve->set-live with no human checkpoint, fed by TVC_CI_OPERATOR_SEED/TVC_API_KEY_* secrets, triggered by anyone able to label a same-repo PR tvc-deploy-test. This already happened for real on this PRs own branch (feat/tvc-deploy-helper): GitHub Actions run 28013160686 shows checkout, build, Deploy and Scrub all green from the PRs own head commit. Because pull_request (unlike pull_request_target) resolves both the checked-out code AND the workflow YAML from the PR head, a same-repo collaborator can edit main.rs or this workflow file in the same PR to exfiltrate the operator seed / API keys, and there is no environment: gate or required-reviewer step anywhere in the file to stop it. This directly contradicts the requesters own stated design goal in the linked Slack thread ("give a role to Github thats only allowed to initiate a deploy, it can never approve, only humans should approve") and runs ahead of the phased plan (PRS-516/519/520/521), whose Enabler ticket (PRS-521, still Backlog) is supposed to provision scoped CI keys + a Turnkey policy before any auto-approve happens from CI.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and this is the exact concern #404 is designed around: splitting deploy into separate initiate / approve / promote subcommands (behind a TvcOps trait, with unit tests) so CI can be scoped to initiate-only and a human/operator key does the approve. That said, #404 currently only has the main.rs split + design docs landed — the workflow-level fix (retiring this labeled-PR auto-approve path in favor of the planned Release/Promote workflows, Phase 4) hasn't shipped, and PRS-521 (scoped CI keys + Turnkey policy) is still Backlog. So this stays an open risk on the actual tvc-deploy-test label path in this PR until that phase lands. Tracking there rather than duplicating the fix here — leaving this thread open until Phase 4 ships or we explicitly decide to gate/disable the labeled-PR path in the interim.

Comment thread scripts/smoke.sh
"$ERRFILE"; then
echo "SKIP: dev endpoint unreachable / outage — not a regression" >&2
exit 0
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must: The outage-vs-regression classifier greps stderr for generic terms including bare timeout, context deadline, and \bEOF\b, and treats any match as a pre-existing outage (exit 0, reported as a pass/skip). A genuinely broken deployment, e.g. an enclave that hangs on the gRPC health check or crashes mid-response, is very likely to surface as exactly context deadline exceeded or an EOF from the clients perspective. Since this step runs immediately after set-live specifically to catch a broken deploy, a deploy broken in a way that manifests as a timeout or dropped connection is misclassified as an unrelated outage and CI reports smoke PASS, defeating the checks purpose at the moment it matters most.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c1fb6be: the outage classifier no longer treats bare timeout/EOF/context deadline as a benign outage-skip. It now only skips on unambiguous never-connected errors (connection refused/reset, no such host, dial tcp, tls handshake, network unreachable, etc.); timeout/EOF/context-deadline on an otherwise-reachable endpoint now falls through to an explicit FAIL (exit 1) instead of SKIP, since those are exactly what a hung or crashed enclave would surface as right after set-live.

sleep(POLL_INTERVAL);
continue;
}
bail!("set-live failed: {}", msg.trim());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should: set_live lowercases combined stdout+stderr from tvc app set-live-deploy and treats a bare substring match on "already" (line 259) as success, regardless of the actual failure mode. If tvc ever emits an unrelated error containing "already" (e.g. a retry-exhaustion message), this would silently report success on a failed deploy in a fully automated, no-human-in-the-loop path. No current evidence this string appears in a real failure message today, so this is a latent risk rather than a proven break.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c1fb6beset_live now requires both "already" and "live" in the combined stdout+stderr before treating it as success, so an unrelated failure that happens to contain "already" (e.g. a retry-exhaustion message) can no longer be misreported as a live deploy.

# Avoid overlapping deploys to the same target.
concurrency:
group: tvc-deploy-${{ github.ref }}
cancel-in-progress: false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should: For the PR-label path, github.ref is per-PR (refs/pull/<n>/merge), so two different PRs both labeled tvc-deploy-test at the same time land in different concurrency groups and can race each others create/approve/set-live against the same vars.TVC_TEST_APP_ID. Same issue between two workflow_dispatch runs targeting the same app_id from different branches. Blast radius is limited to the dev/test app today, not prod.

Suggested change
cancel-in-progress: false
Key the group on the resolved app id instead, e.g. `tvc-deploy-${{ inputs.app_id || vars.TVC_TEST_APP_ID }}`.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c1fb6be — took the suggested fix, concurrency group is now keyed on inputs.app_id || vars.TVC_TEST_APP_ID instead of github.ref.

}
}
None
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should: These are pure functions scraping freeform tvc CLI output (e.g. Healthy / Desired Replicas: X/Y) with zero test coverage, and the workflow only runs cargo build --release, never cargo test. If tvc ever reformats that text, deployment_health silently stops matching and poll_health spins until timeout with an "unknown" status, with nothing catching the regression before a real deploy run.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c1fb6be — added a #[cfg(test)] module covering deployment_health, validate_digest, and parse_after (8 unit tests), and added a cargo test step to the workflow before the build step (previously it only ran cargo build), so a regression in the freeform-output parsing now fails CI instead of silently spinning to an "unknown" status.

- smoke.sh: stop classifying timeout/EOF/context-deadline as a benign
  outage-skip -- those equally describe a connection reaching a hung or
  crashed enclave, which is exactly what this smoke check exists to catch.
- set_live: require both "already" and "live" in the tvc output before
  treating it as success, so an unrelated failure containing "already"
  can't be misreported as a live deploy.
- tvc-deploy.yml: key the concurrency group on the resolved app id instead
  of github.ref, so two PRs/dispatches targeting the same app serialize.
- add unit tests for deployment_health/validate_digest/parse_after and run
  cargo test in CI (previously only cargo build ran).

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thorough hardening across the review history here — the digest gate, seed scrubbing, and the recent set-live/concurrency/test-coverage fixes all look solid, and I independently re-verified cargo build/cargo test/cargo clippy all pass clean for tools/tvc-deploy.

Ran a second pass looking for anything not already covered in this thread. The one structural gap (no human-approval checkpoint on the label-triggered path) is already tracked as a must above and deferred to #404/PRS-521, so not re-raising it. Found a few smaller things below — none of these block merge, flagging as fast-follow candidates.

Comment thread scripts/smoke.sh
CLIENT="$(resolve_fallback_client "$CLIENT_PATH")" || exit $?
else
CLIENT="$CONTAINER_CLIENT"
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, non-blocking: this fallback still has a gap in the "harness failure vs. endpoint outage" distinction the comment above (lines 96-98) says it maintains.

If docker image inspect/docker pull both fail (e.g. GHCR unreachable from the runner) and no --turnkey-client-path is given, we fall through to CLIENT="$CONTAINER_CLIENT" here and try it anyway at line 116. docker run will attempt its own implicit pull, hit the same network failure, and emit text like dial tcp: lookup ghcr.io: ... or temporary failure in name resolution into $ERRFILE — which matches the SKIP regex at line 139, not the "harness couldn't run" path. So a CI-runner-side GHCR outage gets reported as SKIP: dev endpoint unreachable (exit 0) instead of the exit 2 harness failure this comment says it should be, and a real deploy regression underneath it wouldn't get caught.

Would be worth either treating a failed pull explicitly (bail with exit 2 before ever invoking $CLIENT) or having the SKIP/FAIL classifier distinguish "docker itself couldn't run" from "the target endpoint refused/timed out."


fn deploy(sh: &Shell, flags: &HashMap<String, String>) -> Result<()> {
let app_id = req(flags, "app-id")?;
let image = req(flags, "image-url")?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, non-blocking: --image-url isn't validated as digest-pinned (e.g. ...@sha256:...), only --expected-digest's hex format is checked (validate_digest, line 329). The workflow's own doc comment asks callers to pass a pinned reference, but nothing here enforces it.

If a mutable tag is passed, verify_image_digest (line 211) hashes whatever that tag resolves to right now and the gate passes — but the same tag string is also written into pivotContainerImageUrl for the TVC deploy config, so if TVC ever re-resolves that tag later (redeploy, restart) it could pull a different image than the one actually verified. A cheap image.contains("@sha256:") check before the digest gate would close this without much code.

}
let transient = msg.contains("not yet available")
|| msg.contains("try again")
|| msg.contains("not found")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, non-blocking: msg.contains("not found") in the transient-retry set is fairly broad — it'd also match a permanent error like a typo'd --operator-id or --deploy-id (e.g. "operator not found in manifest set"), causing set_live to retry every 15s for the full 300s timeout before bailing with a generic message, instead of failing fast with something actionable.

Separately (very minor): the loop checks start.elapsed() < timeout before sleeping (line 270) rather than after, so on a transient message near the deadline it can overshoot SETLIVE_TIMEOUT by up to one POLL_INTERVAL (~15s) — poll_health's loop checks elapsed right before bailing and doesn't have this gap.


/// Extract `/parser_app` from the image and sha256 it; it MUST equal the
/// submitted `--expected-digest`. Ties the deployed digest to the real binary.
fn verify_image_digest(sh: &Shell, image: &str, expected: &str) -> Result<()> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, non-blocking, efficiency nit: verify_image_digest spawns 4 subprocesses + a host temp file (docker createdocker cpsha256sumdocker rm) to hash one file already inside the image. Since parser_app's final image stage is stagex/core-busybox (which ships sha256sum), docker run --rm --entrypoint sha256sum {image} /parser_app would get the identical guarantee (still hashing the real image content, not a rebuilt binary) in one subprocess, with --rm handling cleanup and no host temp file needed at all.

@prasanna-anchorage prasanna-anchorage merged commit 3b52d28 into main Jul 2, 2026
9 checks passed
@prasanna-anchorage prasanna-anchorage deleted the feat/tvc-deploy-helper branch July 2, 2026 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tvc-deploy-test Trigger a TVC test-app deploy from this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants