Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
294 changes: 294 additions & 0 deletions docs/design/keyless-secrets/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,294 @@
# RFC: Keyless secret backend (KMS-wrapped recipients + OIDC federation)

**Status:** Draft / request for comments
**Area:** `pkg/api/secrets` (encryption envelope, recipients, CLI `secrets` commands)
**Relates to:** [`docs/SECRETS-POLICY.md`](../../SECRETS-POLICY.md), [`docs/SECURITY.md`](../../SECURITY.md)

## Summary

Today, decrypting an SC secret store in CI requires materializing the store's
**private key** (`SIMPLE_CONTAINER_CONFIG`) in plaintext on the runner — typically
as a CI secret. That value is long-lived, all-powerful (decrypts the entire store),
not rotated in practice, not audited, and not revocable after a leak.

This RFC proposes adding **pluggable recipient types whose private key never leaves a
remote authority** (a cloud KMS or Vault), and a **keyless CI mode** where the runner
proves its identity via OIDC federation and obtains a short-lived `Decrypt` permission
instead of holding any long-lived key. It is **cloud-agnostic** (AWS/GCP/Azure/Vault,
or none) and **backward-compatible** (the existing local-key recipient stays the
default; migration is additive, per-repo, and reversible).

The crypto primitives are not reinvented; the change is the **key-custody model** plus
a **v2 envelope format** required to support per-recipient key wrapping cleanly.

## Decision (2026-06-27): inline values + per-scope keys, via SOPS, file-per-scope

After a multi-model design review, the concrete shape for the convenience-and-isolation
feature ("encrypt individual values inline, decryptable per scope, like ansible-vault")
is settled:

- **Adopt SOPS (`getsops/sops`) as the inline-value crypto + format layer** — do **not**
hand-roll inline encryption, a file MAC, and per-backend KMS clients. SOPS already
provides inline value encryption (structure stays readable, only leaves opaque), a
whole-file MAC, partial decryption, and age / AWS-KMS / GCP-KMS / Azure-KV / Vault
recipients. (A just-fixed in-house ed25519 scheme that had zero confidentiality is the
decisive reason not to grow more bespoke crypto here.)
- **One readable file per scope** (`.sc/secrets.<scope>.yaml`), **not** one multiplexed
file holding every scope. A single multi-scope file forces a "partial-MAC paradox" — a
holder of only the preview key cannot recompute a whole-file MAC over prod values it
cannot read — and would require hand-rolled per-value crypto. File-per-scope delivers
per-scope isolation with SOPS's standard whole-file MAC, in weeks rather than months.
- **A key decrypts only the scope file(s) it is a recipient of.** This replaces the
single all-powerful master key with per-scope, isolated, revocable keys (an age key, an
OIDC→KMS grant, or an isolated secret), so a preview deploy gets only the preview key
and cannot read prod.
- SC owns the thin layer SOPS does not: the scope→recipient config, deploy-time decrypt
with **hard-fail on a missing required secret**, a plaintext-leak lint, and provider
key delivery.

A multiplexed single-file format and native (non-SOPS) crypto were considered and
rejected (months of work + crypto risk + the partial-MAC paradox).

### Non-negotiables (must hold before merge)

1. **Hard-fail on a missing required secret** at deploy — never substitute empty /
placeholder / ciphertext, never deploy partially.
2. **Mandatory MAC + AAD** binding path/scope/version; encrypt-then-MAC; always-random nonce.
3. **Scope→recipient mapping is CODEOWNERS-gated**, out of PR control — a MAC does not
stop re-encryption to an attacker-added recipient.
4. **Plaintext-leak lint** (`encrypted_regex` + CI gate), shipped *with* the feature.
5. **Rotation ≠ removing a recipient**: git history retains old ciphertext, so removing a
recipient requires rotating the underlying secret values.
6. **Provider keys via OIDC→KMS unwrap**, not handing a private key to CI (otherwise it is
just a smaller master key); a provider failure is a hard-fail; the OIDC trust policy is
the real perimeter (IaC + CODEOWNERS).

### Strict backward compatibility (maintainer requirement)

The existing whole-file `.sc/secrets.yaml` keeps working unchanged; the feature is
**additive and lives in separate files** old binaries never open. A **fail-closed version
reader must ship and bake fleet-wide first** — old binaries silently drop unknown YAML
fields on rewrite, so the new format must never live in `secrets.yaml`. A given path is
either whole-file (mode A) or inline (mode B), never both.

### Minimal v1

SOPS, **file-per-scope**, age recipients, `sc secrets set` / `edit` + transparent decrypt
at `sc deploy` + the plaintext lint + hard-fail-on-missing. **No KMS/OIDC and no
multi-scope multiplex file in v1** — those are v2 (providers + governance) and a possible
later multiplex format only if practice demands it.

## Motivation

The SC envelope is already **multi-recipient** (`AddPublicKey`/`RemovePublicKey`,
CLI `secrets allow`/`disallow`). The limitation is that there is exactly **one
recipient type**: a raw asymmetric private key. Because decryption requires possessing
that private key, the key must be placed on every machine that decrypts — including
CI. There is no way to delegate the unwrap to an external authority.

Consequences of the single long-lived key in CI:

- **Exfiltration surface.** Any code that runs in the job (including a malicious
transitive dependency on an untrusted PR build) can read the key and decrypt the
whole store.
- **No rotation in practice.** Rotating means re-keying and redistributing the key to
every consumer.
- **No audit.** Use of a local private key is not logged anywhere.
- **No revocation.** A leaked private key decrypts forever.

## Goals / non-goals

**Goals**
- Allow a store to be decrypted in CI with **no long-lived secret present** on the runner.
- Keep it **cloud-agnostic** and keep the **simple local-key path as the default**.
- Make access **scoped, audited, and revocable**.
- Provide an **additive, reversible, per-repo** migration.

**Non-goals**
- Reinventing cryptographic primitives.
- Replacing the on-disk format with SOPS/age (we keep SC's format + `allow`/`disallow` UX).
- Putting an external SaaS vault in the deploy hot path by default.
- Moving application runtime secrets out of the store (separate concern).

## Design

### 1. v2 envelope (prerequisite)

The current format encrypts each file independently for each recipient. To wrap keys
per recipient cleanly — and to keep KMS calls O(1) per deploy rather than O(files) —
introduce a **versioned v2 envelope**:

- A single random **data-encryption key (DEK)** encrypts the payload once with an
AEAD (ChaCha20-Poly1305), per file-set (or per environment).
- The DEK is **wrapped once per recipient**.
- The AEAD includes **associated data (AAD)** binding the ciphertext to
`{format-version, path, environment, recipient-id}` so a wrapped DEK or ciphertext
cannot be transplanted into another context.
- Recipient wrapping standardizes on a **vetted public-key scheme** (an X25519-based
KEM for asymmetric recipients) and **KMS Encrypt/Decrypt** for KMS recipients. Key
material and algorithm identifiers are carried in **authenticated** headers.

`secrets.yaml` gains an explicit `version` and a typed `recipients[]` list.

### 2. Typed recipients + `KeyProvider`

```go
type KeyProvider interface {
Wrap(ctx context.Context, dataKey []byte) (wrapped []byte, err error)
Unwrap(ctx context.Context, wrapped []byte) (dataKey []byte, err error)
Recipients() []RecipientRef // metadata only, no network, no decrypt
}
```

Recipient types:

| Type | Private key location | CI auth |
|---|---|---|
| `local` (default) | a local private key (today's `SIMPLE_CONTAINER_CONFIG`) | the key itself |
| `aws-kms://<key-arn>` | AWS KMS | OIDC → STS → `kms:Decrypt` |
| `gcp-kms://<resource>` | GCP KMS | OIDC → Workload Identity Federation → `decrypt` |
| `azure-kv://<key-id>` | Azure Key Vault | OIDC → federated credential |
| `vault://transit/<key>` | HashiCorp Vault | OIDC/JWT auth → transit decrypt |

`DecryptAll` selects the recipient by **authenticated recipient-id** (not "first that
succeeds") and unwraps via the matching provider. Any one recipient suffices; never all.

KMS recipients reference **immutable key identifiers** (not mutable aliases) and bind a
KMS **encryption context** matching the AAD above.

### 3. Keyless CI via OIDC

When a KMS recipient is configured and the platform exposes an OIDC token, SC exchanges
the token for short-lived credentials scoped to `Decrypt` on the recipient key. On
GitHub Actions the runner only needs `id-token: write`; no long-lived secret is stored.
Every unwrap is recorded in the cloud audit log under the federated identity.

To avoid handing the OIDC mint capability to arbitrary job steps, credential
acquisition should happen in **one isolated step** that passes only the resulting
short-lived credentials onward.

### 4. Federated deploy credentials

Independently of store decryption, the cloud credentials used by `sc deploy` can be
sourced from the same OIDC federation (`auth: { provider: oidc }`) instead of static
keys stored inside the secret store. This requires **first-class auth provider types**
that consume ambient federated credentials rather than deserializing static key
material. Static keys remain supported as a fallback.

### 5. Environment-scoped recipients

A recipient may be granted to a single environment. Combined with per-environment DEKs,
an untrusted/preview context can be limited to decrypting only its own environment and
never production. This is enforced by encrypting each environment under a distinct DEK
and never wrapping production's DEK to a preview-reachable recipient.

### 6. Decouple repository checkout from the encryption key

Where the store private key is currently reused as an SSH key to clone a private
parent/stacks repository, keyless mode needs a separate mechanism (e.g. a narrowly
scoped GitHub App installation token, or an existing deploy key). This is **orthogonal**
to encryption and is a **prerequisite** before the local key can be removed.

## Authorization model (OIDC trust)

The trust policy — not the runner — is the control. Hard-won requirements:

- **Per-stack / per-environment roles**, not one shared role. A single role shared by
many consumers, trusted by `job_workflow_ref` alone, lets *any* repo that calls the
shared workflow assume it.
- **Pin the concrete caller**: immutable repository id **and** `job_workflow_ref` (to a
pinned ref), plus `aud`. **No wildcards** in the subject.
- **`ref` and `environment` subjects are mutually exclusive.** A job using a GitHub
Environment emits `repo:ORG/REPO:environment:NAME` with **no** `ref` component; a job
without one emits `...:ref:...`. You cannot pin both in one subject — choose per role.
- **Production = protected Environment with required reviewers.** OIDC trust alone is
not sufficient to gate production.
- **Never `pull_request_target` with PR-head checkout** in any workflow that can mint an
id-token or reach a recipient.
- **Attribution:** set a session name carrying repo + run id; enable cloud data-access
logging where it is off by default (e.g. GCP KMS), so a shared role does not erase
per-run attribution.

## Backward compatibility & format versioning

- The `local` recipient + existing config remain the **default**; nothing changes for
current users until they opt in.
- **Forward-compat guard (ship first):** older clients ignore unknown YAML fields and
rewrite only the fields they know, which would **silently drop** new recipients on the
next `allow`/`disallow`. A version-aware client that **fails closed** on an unknown
`version` must be released and rolled out **before** any store is written in v2.
- Migration uses existing verbs: `secrets allow --kms <uri>` (add a KMS recipient),
verify keyless decrypt, then `secrets disallow <local-key>`. Multi-recipient means
both work during overlap; per-repo; reversible.
- A `require-kms` (or equivalent) per-store setting disables the legacy path once a store
is fully migrated, to prevent silent downgrade back to the local key.
- Cloud-agnostic: customers on no cloud keep the `local` recipient permanently — it is
not a deprecation target.

## Revocation reality

Removing a recipient or rotating a KMS key does **not** revoke access to ciphertext
already committed to version control — history retains blobs decryptable by previously
valid recipients. Therefore:

- **Removing a recipient MUST be paired with rotating the underlying secret values**
(and the upstream credential they represent), not just dropping a wrapper.
- The recipient list is effectively an **append-only audit surface**; changes to it
should be review-gated (code owners).

## Operational considerations

- **Availability:** decryption now depends on cloud IAM/KMS. Decrypt **all** required
material **up front**, before any infrastructure mutation, so a mid-deploy KMS error
cannot leave a half-applied stack.
- **Throughput/cost:** one DEK per file-set ⇒ one `Decrypt` per deploy, not one per
file. Use bounded retries with backoff + jitter; cache the unwrapped DEK in-process
for the run only.
- **Locality:** pin the KMS region to the deploy region; document cross-account key
policies.
- **Dependencies:** the cloud KMS SDKs become direct dependencies (today they are
transitive), expanding the SCA/dependency-update surface.

## Threat model (summary)

**Mitigates:** leak of a long-lived master key from CI; secret theft by untrusted
PR-build code (via scoped trust + environment-scoped recipients); lateral movement
after a single job compromise (scope + short TTL); long-lived static cloud credentials;
absence of audit; departed-operator access on the CI path; replay of a stolen token in
another repo (pinned `aud`/subject).

**Does not fully mitigate (residual):** malicious code inside a *legitimate*
production job (gets short-lived scoped credentials during the run — bounded by scope,
TTL, branch protection, audit, and, for the highest-value targets, isolated runners);
application secrets being decrypted into the workload at deploy time; exfiltration over
otherwise-allowed network egress; compromise of the cloud account / IAM / KMS policy
itself; supply-chain compromise of the CLI or CI actions (mitigated by signed releases
+ digest pinning); phishing of operator cloud credentials.

## Migration phases

0. **Design + review** (this document).
1. **v2 envelope + version-aware fail-closed client**, released and rolled out before
any v2 write. Modernize the asymmetric recipient scheme as part of v2.
2. **`KeyProvider` + first KMS provider + OIDC acquisition**, behind a feature flag.
3. **Canary** on one low-risk staging stack. Gate: `sc deploy` obtains cloud
credentials from the federated environment, not the store; measured KMS latency;
verified rollback runbook.
4. **Migration UX** (`allow --kms`, `secrets doctor`/recipient listing, `require-kms`),
environment-scoped recipients, repo-checkout decoupling, docs.
5. **Federated deploy credentials** (`auth: { provider: oidc }`); drop static keys from
the store.
6. **Staging bake**, then **production** behind protected Environments, stack by stack.
7. **Decommission**: remove the local recipient, **rotate secret values**, delete static
credentials after a zero-usage soak. Add other cloud providers (GCP/Azure/Vault).

## Open questions

- **Resolved:** multiplex-single-file vs file-per-scope → **file-per-scope** (see Decision);
native crypto vs SOPS → **SOPS** for the inline-value layer.
- Per-file-set vs per-environment DEK granularity (audit/scoping vs rewrap cost).
- How aggressively to enforce a minimum recipient-strength / forbidden-type policy.
- Whether `require-kms` is per-store, per-environment, or both.
- Provider parity: each KMS/Vault has distinct federation, audit, and key-versioning
semantics; ship one provider first behind an explicit contract, then add others with
per-provider review.
Loading