docs: add external secrets providers design RFC#2936
Conversation
|
|
||
| ## Open questions | ||
|
|
||
| - Reference syntax: `${{ secrets.x.y }}` vs `arcane+secret://x/y` — pick one and document. |
There was a problem hiding this comment.
Reference syntax is already committed but still listed as open
The "Open questions" section asks to pick between ${{ secrets.x.y }} and arcane+secret://x/y, but the entire document body — Goals, Reference syntax section, example .env snippets, injection seam steps, and the secret-refs API endpoint — has already committed to the ${{ secrets.<provider>.<KEY> }} form. An implementer starting PR #2 (Bitwarden provider) or PR #1 (resolver seam) will find conflicting signals and may not know whether this is genuinely open or accidentally left in. This open question should be struck and replaced with a record of the decision, or it should be clearly flagged as "resolved: using ${{ }}" to avoid ambiguity in follow-up PRs.
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/design/external-secrets.md
Line: 240
Comment:
**Reference syntax is already committed but still listed as open**
The "Open questions" section asks to pick between `${{ secrets.x.y }}` and `arcane+secret://x/y`, but the entire document body — Goals, Reference syntax section, example `.env` snippets, injection seam steps, and the `secret-refs` API endpoint — has already committed to the `${{ secrets.<provider>.<KEY> }}` form. An implementer starting PR #2 (Bitwarden provider) or PR #1 (resolver seam) will find conflicting signals and may not know whether this is genuinely open or accidentally left in. This open question should be struck and replaced with a record of the decision, or it should be clearly flagged as "resolved: using `${{ }}`" to avoid ambiguity in follow-up PRs.
How can I resolve this? If you propose a fix, please make it concise.| Each provider wraps a **last-known-good cache** (keyed by provider+scope+key, TTL | ||
| configurable). On a provider/network failure: | ||
|
|
||
| - `fail_stale` (default for non-critical): serve cached values, mark deploy as using | ||
| stale secrets, surface staleness in UI/events. | ||
| - `fail_closed`: abort the deploy with a clear error. |
There was a problem hiding this comment.
Last-known-good cache storage medium is unspecified, creating a potential security-requirement conflict
The resilience section describes a "last-known-good cache (keyed by provider+scope+key, TTL configurable)" that holds resolved secret values (plaintext). Security requirement #1 says those values must never be written to disk. If an implementer builds this cache against the project's existing SQLite/Postgres GORM store — a natural choice for cross-restart resilience — they would silently violate that requirement. The document needs to explicitly say the cache must be in-memory only and process-scoped, and it should acknowledge the consequence: after a process restart with an unreachable provider, fail_stale behaves identically to fail_closed because the in-process cache is empty. That's a meaningful operational trade-off that belongs in the spec before implementation begins.
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/design/external-secrets.md
Line: 117-122
Comment:
**Last-known-good cache storage medium is unspecified, creating a potential security-requirement conflict**
The resilience section describes a "last-known-good cache (keyed by provider+scope+key, TTL configurable)" that holds resolved secret values (plaintext). Security requirement #1 says those values must never be written to disk. If an implementer builds this cache against the project's existing SQLite/Postgres GORM store — a natural choice for cross-restart resilience — they would silently violate that requirement. The document needs to explicitly say the cache must be **in-memory only and process-scoped**, and it should acknowledge the consequence: after a process restart with an unreachable provider, `fail_stale` behaves identically to `fail_closed` because the in-process cache is empty. That's a meaningful operational trade-off that belongs in the spec before implementation begins.
How can I resolve this? If you propose a fix, please make it concise.| Each provider wraps a **last-known-good cache** (keyed by provider+scope+key, TTL | ||
| configurable). On a provider/network failure: | ||
|
|
||
| - `fail_stale` (default for non-critical): serve cached values, mark deploy as using | ||
| stale secrets, surface staleness in UI/events. | ||
| - `fail_closed`: abort the deploy with a clear error. |
There was a problem hiding this comment.
fail_stale default has no defined maximum staleness, allowing indefinitely-old secrets
fail_stale is marked as the default for non-critical bindings, with "TTL configurable" as the only guard. Without a hard default TTL (or an explicit warning when the cache is older than N hours/days), a deployment could proceed using credentials fetched weeks ago: a rotated token, a revoked key, or a deleted secret would silently "succeed" until the cache is invalidated. The "Open questions" section flags TTL defaults but doesn't give a proposed safe default. Before PR #1 lands, the document should name a concrete default (e.g. 24 h) and specify whether an expired-but-still-cached entry is treated as missing (→ fail_closed behaviour) or as valid until replaced.
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/design/external-secrets.md
Line: 117-122
Comment:
**`fail_stale` default has no defined maximum staleness, allowing indefinitely-old secrets**
`fail_stale` is marked as the default for non-critical bindings, with "TTL configurable" as the only guard. Without a hard default TTL (or an explicit warning when the cache is older than N hours/days), a deployment could proceed using credentials fetched weeks ago: a rotated token, a revoked key, or a deleted secret would silently "succeed" until the cache is invalidated. The "Open questions" section flags TTL defaults but doesn't give a proposed safe default. Before PR #1 lands, the document should name a concrete default (e.g. 24 h) and specify whether an expired-but-still-cached entry is treated as missing (→ `fail_closed` behaviour) or as valid until replaced.
How can I resolve this? If you propose a fix, please make it concise.Adds a draft design document proposing pluggable external secrets providers that resolve a Compose project environment from a secrets manager at deploy time, with provider credentials encrypted at rest and resolved values kept in memory only. Documents the provider interface, data model, security requirements, the API, CLI and frontend surfaces, the bitwarden and infisical backends, and a six-step incremental delivery plan. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3a0ffc7 to
f921599
Compare
Promote vault from 'later' to a planned implementation alongside bitwarden and infisical. Adds the vault provider section (KV v1/v2, token auth, sys/health HealthCheck, scope_path mapping), vault config fields in the data model, and inserts it as slice 4 in the PR delivery plan (API/Frontend/CLI shift to 5/6/7).
Checklist
What This PR Implements
This PR adds a design document (RFC) only, with no code changes. It proposes pluggable external secrets providers so a Compose project environment can be resolved at deploy time from a secrets manager (Bitwarden Secrets Manager and Infisical first, Vault, OpenBao and SOPS later) instead of from plaintext on disk.
The document defines a single provider interface with multiple backends selected by config, a provider-agnostic reference syntax embedded in env values, an injection seam in the environment loader, encryption of provider credentials at rest using the existing AES-GCM crypto helper, and a hard requirement that resolved values stay in memory and never reach disk or logs. It also lays out a six-step incremental delivery plan so the feature can land in small reviewable slices.
Opening this as an RFC first to gather maintainer feedback on the interface, the reference syntax choice, and the security model before any implementation PRs.
Fixes:
Changes Made
Added docs/design/external-secrets.md describing the architecture, data model, security requirements, API surface, the bitwarden and infisical providers, and the incremental delivery plan. No source code is touched.
Testing Done
Documentation only, so no runtime testing applies. Verified the markdown renders correctly and that internal references are consistent.
AI Tool Used (if applicable)
Claude Code (Claude Opus 4.8) was used to study existing codebase patterns and to draft this design document. The author reviewed and edited the content before submission. Per AI_POLICY.md, any follow-up implementation PRs will be verified in the development environment with human testing before they are opened.
Additional Context
This is the first item in the delivery plan and intentionally contains no code. Subsequent PRs will implement the interface, the data model and migrations, the resolver seam, and then individual providers, each as its own focused PR.
Disclaimer Greptiles Reviews use AI, make sure to check over its work.
To better help train Greptile on our codebase, if the comment is useful and valid Like the comment, if its not helpful or invalid Dislike
To have Greptile Re-Review the changes, mention
greptileai.Greptile Summary
This PR adds a documentation-only RFC (
docs/design/external-secrets.md) proposing pluggable external secrets providers for Arcane, enabling Docker Compose project environments to be resolved at deploy time from Bitwarden Secrets Manager or Infisical instead of from plaintext on disk.Providerinterface withResolve/HealthCheckmethods, a${{ secrets.<provider>.<KEY> }}reference syntax embedded in env values, and a resolver seam injected into the existingLoadEnvironmentpath after the.envmerge and before compose-go receives the map.fail_stale/fail_closedmodes per binding.Confidence Score: 3/5
Safe to merge as documentation, but two design gaps should be closed before the first implementation PR opens to avoid a security-violating cache implementation or implementer confusion over the reference syntax.
The reference syntax is used consistently throughout the document body but simultaneously listed as an unresolved open question. More critically, the last-known-good cache is described as holding plaintext resolved values with no explicit in-memory-only constraint, which directly conflicts with the hard no-plaintext-on-disk security requirement; a GORM-backed implementation would silently violate it.
docs/design/external-secrets.md — the Resilience section (cache storage medium) and the Open questions section (reference syntax decision) need updates before implementation begins.
Security Review
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "docs: add external secrets providers des..." | Re-trigger Greptile