From 103761eea9a05e681c9d0694509e51a5b45f10ea Mon Sep 17 00:00:00 2001
From: Dmitrii Creed
Date: Sat, 27 Jun 2026 15:22:08 +0400
Subject: [PATCH 1/2] docs(secrets): RFC for keyless secret backend (KMS recipients + OIDC)

Proposes pluggable recipient types whose private key never leaves a
remote authority (cloud KMS / Vault) plus a keyless CI mode via OIDC
federation, so the store private key (SIMPLE_CONTAINER_CONFIG) no longer
has to be materialized on CI runners. Cloud-agnostic and
backward-compatible: the local recipient stays the default; migration is
additive, per-repo, and reversible. Covers the v2 envelope prerequisite
(single AEAD DEK + per-recipient wrapping + AAD binding), the OIDC
trust/authorization model, format versioning with a fail-closed guard,
revocation reality (value rotation), operational concerns, and a phased
migration plan.

Signed-off-by: Dmitrii Creed
---
 docs/design/keyless-secrets/README.md | 235 ++++++++++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 docs/design/keyless-secrets/README.md

diff --git a/docs/design/keyless-secrets/README.md b/docs/design/keyless-secrets/README.md
new file mode 100644
index 00000000..49bc9f60
--- /dev/null
+++ b/docs/design/keyless-secrets/README.md
@@ -0,0 +1,235 @@
+# RFC: Keyless secret backend (KMS-wrapped recipients + OIDC federation)
+
+**Status:** Draft / request for comments
+**Area:** `pkg/api/secrets` (encryption envelope, recipients, CLI `secrets` commands)
+**Relates to:** [`docs/SECRETS-POLICY.md`](../../SECRETS-POLICY.md), [`docs/SECURITY.md`](../../SECURITY.md)
+
+## Summary
+
+Today, decrypting an SC secret store in CI requires materializing the store's
+**private key** (`SIMPLE_CONTAINER_CONFIG`) in plaintext on the runner — typically
+as a CI secret. That value is long-lived, all-powerful (decrypts the entire store),
+not rotated in practice, not audited, and not revocable after a leak.
+
+This RFC proposes adding **pluggable recipient types whose private key never leaves a
+remote authority** (a cloud KMS or Vault), and a **keyless CI mode** where the runner
+proves its identity via OIDC federation and obtains a short-lived `Decrypt` permission
+instead of holding any long-lived key. It is **cloud-agnostic** (AWS/GCP/Azure/Vault,
+or none) and **backward-compatible** (the existing local-key recipient stays the
+default; migration is additive, per-repo, and reversible).
+
+The crypto primitives are not reinvented; the change is the **key-custody model** plus
+a **v2 envelope format** required to support per-recipient key wrapping cleanly.
+
+## Motivation
+
+The SC envelope is already **multi-recipient** (`AddPublicKey`/`RemovePublicKey`,
+CLI `secrets allow`/`disallow`). The limitation is that there is exactly **one
+recipient type**: a raw asymmetric private key. Because decryption requires possessing
+that private key, the key must be placed on every machine that decrypts — including
+CI. There is no way to delegate the unwrap to an external authority.
+
+Consequences of the single long-lived key in CI:
+
+- **Exfiltration surface.** Any code that runs in the job (including a malicious
+  transitive dependency on an untrusted PR build) can read the key and decrypt the
+  whole store.
+- **No rotation in practice.** Rotating means re-keying and redistributing the key to
+  every consumer.
+- **No audit.** Use of a local private key is not logged anywhere.
+- **No revocation.** A leaked private key decrypts forever.
+
+## Goals / non-goals
+
+**Goals**
+- Allow a store to be decrypted in CI with **no long-lived secret present** on the runner.
+- Keep it **cloud-agnostic** and keep the **simple local-key path as the default**.
+- Make access **scoped, audited, and revocable**.
+- Provide an **additive, reversible, per-repo** migration.
+
+**Non-goals**
+- Reinventing cryptographic primitives.
+- Replacing the on-disk format with SOPS/age (we keep SC's format + `allow`/`disallow` UX).
+- Putting an external SaaS vault in the deploy hot path by default.
+- Moving application runtime secrets out of the store (separate concern).
+
+## Design
+
+### 1. v2 envelope (prerequisite)
+
+The current format encrypts each file independently for each recipient. To wrap keys
+per recipient cleanly — and to keep KMS calls O(1) per deploy rather than O(files) —
+introduce a **versioned v2 envelope**:
+
+- A single random **data-encryption key (DEK)** encrypts the payload once with an
+  AEAD (ChaCha20-Poly1305), per file-set (or per environment).
+- The DEK is **wrapped once per recipient**.
+- The AEAD includes **associated data (AAD)** binding the ciphertext to
+  `{format-version, path, environment, recipient-id}` so a wrapped DEK or ciphertext
+  cannot be transplanted into another context.
+- Recipient wrapping standardizes on a **vetted public-key scheme** (an X25519-based
+  KEM for asymmetric recipients) and **KMS Encrypt/Decrypt** for KMS recipients. Key
+  material and algorithm identifiers are carried in **authenticated** headers.
+
+`secrets.yaml` gains an explicit `version` and a typed `recipients[]` list.
+
+### 2. Typed recipients + `KeyProvider`
+
+```go
+type KeyProvider interface {
+	Wrap(ctx context.Context, dataKey []byte) (wrapped []byte, err error)
+	Unwrap(ctx context.Context, wrapped []byte) (dataKey []byte, err error)
+	Recipients() []RecipientRef // metadata only, no network, no decrypt
+}
+```
+
+Recipient types:
+
+| Type | Private key location | CI auth |
+|---|---|---|
+| `local` (default) | a local private key (today's `SIMPLE_CONTAINER_CONFIG`) | the key itself |
+| `aws-kms://` | AWS KMS | OIDC → STS → `kms:Decrypt` |
+| `gcp-kms://` | GCP KMS | OIDC → Workload Identity Federation → `decrypt` |
+| `azure-kv://` | Azure Key Vault | OIDC → federated credential |
+| `vault://transit/` | HashiCorp Vault | OIDC/JWT auth → transit decrypt |
+
+`DecryptAll` selects the recipient by **authenticated recipient-id** (not "first that
+succeeds") and unwraps via the matching provider. Any one recipient suffices; never all.
+
+KMS recipients reference **immutable key identifiers** (not mutable aliases) and bind a
+KMS **encryption context** matching the AAD above.
+
+### 3. Keyless CI via OIDC
+
+When a KMS recipient is configured and the platform exposes an OIDC token, SC exchanges
+the token for short-lived credentials scoped to `Decrypt` on the recipient key. On
+GitHub Actions the runner only needs `id-token: write`; no long-lived secret is stored.
+Every unwrap is recorded in the cloud audit log under the federated identity.
+
+To avoid handing the OIDC mint capability to arbitrary job steps, credential
+acquisition should happen in **one isolated step** that passes only the resulting
+short-lived credentials onward.
+
+### 4. Federated deploy credentials
+
+Independently of store decryption, the cloud credentials used by `sc deploy` can be
+sourced from the same OIDC federation (`auth: { provider: oidc }`) instead of static
+keys stored inside the secret store. This requires **first-class auth provider types**
+that consume ambient federated credentials rather than deserializing static key
+material. Static keys remain supported as a fallback.
+
+### 5. Environment-scoped recipients
+
+A recipient may be granted to a single environment. Combined with per-environment DEKs,
+an untrusted/preview context can be limited to decrypting only its own environment and
+never production. This is enforced by encrypting each environment under a distinct DEK
+and never wrapping production's DEK to a preview-reachable recipient.
+
+### 6. Decouple repository checkout from the encryption key
+
+Where the store private key is currently reused as an SSH key to clone a private
+parent/stacks repository, keyless mode needs a separate mechanism (e.g. a narrowly
+scoped GitHub App installation token, or an existing deploy key). This is **orthogonal**
+to encryption and is a **prerequisite** before the local key can be removed.
+
+## Authorization model (OIDC trust)
+
+The trust policy — not the runner — is the control. Hard-won requirements:
+
+- **Per-stack / per-environment roles**, not one shared role. A single role shared by
+  many consumers, trusted by `job_workflow_ref` alone, lets *any* repo that calls the
+  shared workflow assume it.
+- **Pin the concrete caller**: immutable repository id **and** `job_workflow_ref` (to a
+  pinned ref), plus `aud`. **No wildcards** in the subject.
+- **`ref` and `environment` subjects are mutually exclusive.** A job using a GitHub
+  Environment emits `repo:ORG/REPO:environment:NAME` with **no** `ref` component; a job
+  without one emits `...:ref:...`. You cannot pin both in one subject — choose per role.
+- **Production = protected Environment with required reviewers.** OIDC trust alone is
+  not sufficient to gate production.
+- **Never `pull_request_target` with PR-head checkout** in any workflow that can mint an
+  id-token or reach a recipient.
+- **Attribution:** set a session name carrying repo + run id; enable cloud data-access
+  logging where it is off by default (e.g. GCP KMS), so a shared role does not erase
+  per-run attribution.
+
+## Backward compatibility & format versioning
+
+- The `local` recipient + existing config remain the **default**; nothing changes for
+  current users until they opt in.
+- **Forward-compat guard (ship first):** older clients ignore unknown YAML fields and
+  rewrite only the fields they know, which would **silently drop** new recipients on the
+  next `allow`/`disallow`. A version-aware client that **fails closed** on an unknown
+  `version` must be released and rolled out **before** any store is written in v2.
+- Migration uses existing verbs: `secrets allow --kms ` (add a KMS recipient),
+  verify keyless decrypt, then `secrets disallow `. Multi-recipient means
+  both work during overlap; per-repo; reversible.
+- A `require-kms` (or equivalent) per-store setting disables the legacy path once a store
+  is fully migrated, to prevent silent downgrade back to the local key.
+- Cloud-agnostic: customers on no cloud keep the `local` recipient permanently — it is
+  not a deprecation target.
+
+## Revocation reality
+
+Removing a recipient or rotating a KMS key does **not** revoke access to ciphertext
+already committed to version control — history retains blobs decryptable by previously
+valid recipients. Therefore:
+
+- **Removing a recipient MUST be paired with rotating the underlying secret values**
+  (and the upstream credential they represent), not just dropping a wrapper.
+- The recipient list is effectively an **append-only audit surface**; changes to it
+  should be review-gated (code owners).
+
+## Operational considerations
+
+- **Availability:** decryption now depends on cloud IAM/KMS. Decrypt **all** required
+  material **up front**, before any infrastructure mutation, so a mid-deploy KMS error
+  cannot leave a half-applied stack.
+- **Throughput/cost:** one DEK per file-set ⇒ one `Decrypt` per deploy, not one per
+  file. Use bounded retries with backoff + jitter; cache the unwrapped DEK in-process
+  for the run only.
+- **Locality:** pin the KMS region to the deploy region; document cross-account key
+  policies.
+- **Dependencies:** the cloud KMS SDKs become direct dependencies (today they are
+  transitive), expanding the SCA/dependency-update surface.
+
+## Threat model (summary)
+
+**Mitigates:** leak of a long-lived master key from CI; secret theft by untrusted
+PR-build code (via scoped trust + environment-scoped recipients); lateral movement
+after a single job compromise (scope + short TTL); long-lived static cloud credentials;
+absence of audit; departed-operator access on the CI path; replay of a stolen token in
+another repo (pinned `aud`/subject).
+
+**Does not fully mitigate (residual):** malicious code inside a *legitimate*
+production job (gets short-lived scoped credentials during the run — bounded by scope,
+TTL, branch protection, audit, and, for the highest-value targets, isolated runners);
+application secrets being decrypted into the workload at deploy time; exfiltration over
+otherwise-allowed network egress; compromise of the cloud account / IAM / KMS policy
+itself; supply-chain compromise of the CLI or CI actions (mitigated by signed releases
++ digest pinning); phishing of operator cloud credentials.
+
+## Migration phases
+
+0. **Design + review** (this document).
+1. **v2 envelope + version-aware fail-closed client**, released and rolled out before
+   any v2 write. Modernize the asymmetric recipient scheme as part of v2.
+2. **`KeyProvider` + first KMS provider + OIDC acquisition**, behind a feature flag.
+3. **Canary** on one low-risk staging stack. Gate: `sc deploy` obtains cloud
+   credentials from the federated environment, not the store; measured KMS latency;
+   verified rollback runbook.
+4. **Migration UX** (`allow --kms`, `secrets doctor`/recipient listing, `require-kms`),
+   environment-scoped recipients, repo-checkout decoupling, docs.
+5. **Federated deploy credentials** (`auth: { provider: oidc }`); drop static keys from
+   the store.
+6. **Staging bake**, then **production** behind protected Environments, stack by stack.
+7. **Decommission**: remove the local recipient, **rotate secret values**, delete static
+   credentials after a zero-usage soak. Add other cloud providers (GCP/Azure/Vault).
+
+## Open questions
+
+- Per-file-set vs per-environment DEK granularity (audit/scoping vs rewrap cost).
+- How aggressively to enforce a minimum recipient-strength / forbidden-type policy.
+- Whether `require-kms` is per-store, per-environment, or both.
+- Provider parity: each KMS/Vault has distinct federation, audit, and key-versioning
+  semantics; ship one provider first behind an explicit contract, then add others with
+  per-provider review.

From f28aab7a56729327861833fc8b3b8eaec60baeb1 Mon Sep 17 00:00:00 2001
From: Dmitrii Creed
Date: Sat, 27 Jun 2026 19:53:40 +0400
Subject: [PATCH 2/2] =?UTF-8?q?docs(secrets):=20record=20decision=20?=
 =?UTF-8?q?=E2=80=94=20inline=20values=20+=20per-scope=20keys=20via=20SOPS?=
 =?UTF-8?q?,=20file-per-scope?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Folds the multi-model design review outcome into the RFC: adopt SOPS for
the inline-value crypto/format, one readable file per scope (not a
multiplexed single file — avoids the partial-MAC paradox), per-scope
isolated keys, the non-negotiables (hard-fail on missing required secret,
MAC+AAD, CODEOWNERS-gated recipients, plaintext lint,
rotate-not-just-remove, OIDC->KMS provider keys), the strict-backcompat
constraint (separate files + fail-closed version reader first), and a
minimal v1. Resolves the format/native-vs-SOPS open questions.

Signed-off-by: Dmitrii Creed
---
 docs/design/keyless-secrets/README.md | 59 +++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/docs/design/keyless-secrets/README.md b/docs/design/keyless-secrets/README.md
index 49bc9f60..3279d65e 100644
--- a/docs/design/keyless-secrets/README.md
+++ b/docs/design/keyless-secrets/README.md
@@ -21,6 +21,63 @@ default; migration is additive, per-repo, and reversible).
 The crypto primitives are not reinvented; the change is the **key-custody model** plus
 a **v2 envelope format** required to support per-recipient key wrapping cleanly.
 
+## Decision (2026-06-27): inline values + per-scope keys, via SOPS, file-per-scope
+
+After a multi-model design review, the concrete shape for the convenience-and-isolation
+feature ("encrypt individual values inline, decryptable per scope, like ansible-vault")
+is settled:
+
+- **Adopt SOPS (`getsops/sops`) as the inline-value crypto + format layer** — do **not**
+  hand-roll inline encryption, a file MAC, and per-backend KMS clients. SOPS already
+  provides inline value encryption (structure stays readable, only leaves opaque), a
+  whole-file MAC, partial decryption, and age / AWS-KMS / GCP-KMS / Azure-KV / Vault
+  recipients. (A just-fixed in-house ed25519 scheme that had zero confidentiality is the
+  decisive reason not to grow more bespoke crypto here.)
+- **One readable file per scope** (`.sc/secrets..yaml`), **not** one multiplexed
+  file holding every scope. A single multi-scope file forces a "partial-MAC paradox" — a
+  holder of only the preview key cannot recompute a whole-file MAC over prod values it
+  cannot read — and would require hand-rolled per-value crypto. File-per-scope delivers
+  per-scope isolation with SOPS's standard whole-file MAC, in weeks rather than months.
+- **A key decrypts only the scope file(s) it is a recipient of.** This replaces the
+  single all-powerful master key with per-scope, isolated, revocable keys (an age key, an
+  OIDC→KMS grant, or an isolated secret), so a preview deploy gets only the preview key
+  and cannot read prod.
+- SC owns the thin layer SOPS does not: the scope→recipient config, deploy-time decrypt
+  with **hard-fail on a missing required secret**, a plaintext-leak lint, and provider
+  key delivery.
+
+A multiplexed single-file format and native (non-SOPS) crypto were considered and
+rejected (months of work + crypto risk + the partial-MAC paradox).
+
+### Non-negotiables (must hold before merge)
+
+1. **Hard-fail on a missing required secret** at deploy — never substitute empty /
+   placeholder / ciphertext, never deploy partially.
+2. **Mandatory MAC + AAD** binding path/scope/version; encrypt-then-MAC; always-random nonce.
+3. **Scope→recipient mapping is CODEOWNERS-gated**, out of PR control — a MAC does not
+   stop re-encryption to an attacker-added recipient.
+4. **Plaintext-leak lint** (`encrypted_regex` + CI gate), shipped *with* the feature.
+5. **Rotation ≠ removing a recipient**: git history retains old ciphertext, so removing a
+   recipient requires rotating the underlying secret values.
+6. **Provider keys via OIDC→KMS unwrap**, not handing a private key to CI (otherwise it is
+   just a smaller master key); a provider failure is a hard-fail; the OIDC trust policy is
+   the real perimeter (IaC + CODEOWNERS).
+
+### Strict backward compatibility (maintainer requirement)
+
+The existing whole-file `.sc/secrets.yaml` keeps working unchanged; the feature is
+**additive and lives in separate files** old binaries never open. A **fail-closed version
+reader must ship and bake fleet-wide first** — old binaries silently drop unknown YAML
+fields on rewrite, so the new format must never live in `secrets.yaml`. A given path is
+either whole-file (mode A) or inline (mode B), never both.
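
Reviewer note (not part of the patch): as a rough illustration of the fail-closed version reader described in the paragraph above, the Go sketch below rejects any store whose declared version is newer than the running build understands. The names `maxSupportedVersion`, `ErrUnsupportedVersion`, and `checkStoreVersion` are hypothetical and not taken from the SC codebase. Treating a missing version field (decoded as 0) as the legacy format is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// maxSupportedVersion is a hypothetical constant: the highest store format
// version this build can read and rewrite without dropping fields.
const maxSupportedVersion = 1

// ErrUnsupportedVersion marks a store written by a newer client.
var ErrUnsupportedVersion = errors.New("secrets: unsupported store format version")

// checkStoreVersion fails closed: any version this build does not know is an
// error, so the caller never reaches a read-modify-write that could silently
// drop unknown fields. Assumption: 0 means "no version field" (legacy format).
func checkStoreVersion(version int) error {
	if version < 0 {
		return fmt.Errorf("%w: negative version %d", ErrUnsupportedVersion, version)
	}
	if version > maxSupportedVersion {
		return fmt.Errorf("%w: store is v%d, this build supports up to v%d",
			ErrUnsupportedVersion, version, maxSupportedVersion)
	}
	return nil
}

func main() {
	for _, v := range []int{0, 1, 2} {
		if err := checkStoreVersion(v); err != nil {
			fmt.Printf("v%d: refusing to open store: %v\n", v, err)
			continue
		}
		fmt.Printf("v%d: ok\n", v)
	}
}
```

The check is meant to run before any `allow`/`disallow` rewrite, so a version-aware client refuses to touch a file it would otherwise rewrite lossily.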
+
+### Minimal v1
+
+SOPS, **file-per-scope**, age recipients, `sc secrets set` / `edit` + transparent decrypt
+at `sc deploy` + the plaintext lint + hard-fail-on-missing. **No KMS/OIDC and no
+multi-scope multiplex file in v1** — those are v2 (providers + governance) and a possible
+later multiplex format only if practice demands it.
+
 ## Motivation
 
 The SC envelope is already **multi-recipient** (`AddPublicKey`/`RemovePublicKey`,
@@ -227,6 +284,8 @@ itself; supply-chain compromise of the CLI or CI actions (mitigated by signed re
 
 ## Open questions
 
+- **Resolved:** multiplex-single-file vs file-per-scope → **file-per-scope** (see Decision);
+  native crypto vs SOPS → **SOPS** for the inline-value layer.
 - Per-file-set vs per-environment DEK granularity (audit/scoping vs rewrap cost).
 - How aggressively to enforce a minimum recipient-strength / forbidden-type policy.
 - Whether `require-kms` is per-store, per-environment, or both.
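
Reviewer note (not part of the patch): the hard-fail-on-missing rule from the Decision section and Minimal v1 could look roughly like the Go sketch below. `resolveRequired` is a hypothetical helper, not an existing SC function. It treats an empty value as missing, and treats a value that still starts with `ENC[` (the prefix SOPS uses for encrypted leaves) as undecrypted ciphertext. Both checks are assumptions about how the rule might be enforced.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resolveRequired returns a value for every required secret name, or an error
// listing every name that is absent, empty, or still looks like ciphertext.
// It never returns a partial map, so a caller cannot deploy with a subset.
func resolveRequired(required []string, decrypted map[string]string) (map[string]string, error) {
	out := make(map[string]string, len(required))
	var missing []string
	for _, name := range required {
		v, ok := decrypted[name]
		// Heuristic (assumption): a leftover "ENC[" prefix means the value
		// was never decrypted and must not be passed through as-is.
		if !ok || v == "" || strings.HasPrefix(v, "ENC[") {
			missing = append(missing, name)
			continue
		}
		out[name] = v
	}
	if len(missing) > 0 {
		sort.Strings(missing) // stable, reviewable error text
		return nil, fmt.Errorf("missing required secrets: %s", strings.Join(missing, ", "))
	}
	return out, nil
}

func main() {
	decrypted := map[string]string{"DB_PASSWORD": "s3cr3t", "API_TOKEN": ""}
	if _, err := resolveRequired([]string{"DB_PASSWORD", "API_TOKEN"}, decrypted); err != nil {
		fmt.Println("deploy aborted:", err)
	}
}
```

Collecting all missing names before failing keeps the abort to a single actionable error, while still guaranteeing that nothing is substituted and nothing is deployed partially.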