Skip to content

docs: add external secrets providers design RFC#2936

Open
manawenuz wants to merge 2 commits into
getarcaneapp:mainfrom
manawenuz:docs/external-secrets-rfc
Open

docs: add external secrets providers design RFC#2936
manawenuz wants to merge 2 commits into
getarcaneapp:mainfrom
manawenuz:docs/external-secrets-rfc

Conversation

@manawenuz

@manawenuz manawenuz commented Jun 14, 2026

Copy link
Copy Markdown

Checklist

  • This PR is not opened from my fork's main branch

What This PR Implements

This PR adds a design document (RFC) only, with no code changes. It proposes pluggable external secrets providers so a Compose project environment can be resolved at deploy time from a secrets manager (Bitwarden Secrets Manager and Infisical first, Vault, OpenBao and SOPS later) instead of from plaintext on disk.

The document defines a single provider interface with multiple backends selected by config, a provider-agnostic reference syntax embedded in env values, an injection seam in the environment loader, encryption of provider credentials at rest using the existing AES-GCM crypto helper, and a hard requirement that resolved values stay in memory and never reach disk or logs. It also lays out a six-step incremental delivery plan so the feature can land in small reviewable slices.

Opening this as an RFC first to gather maintainer feedback on the interface, the reference syntax choice, and the security model before any implementation PRs.

Fixes:

Changes Made

Added docs/design/external-secrets.md describing the architecture, data model, security requirements, API surface, the bitwarden and infisical providers, and the incremental delivery plan. No source code is touched.

Testing Done

Documentation only, so no runtime testing applies. Verified the markdown renders correctly and that internal references are consistent.

AI Tool Used (if applicable)

Claude Code (Claude Opus 4.8) was used to study existing codebase patterns and to draft this design document. The author reviewed and edited the content before submission. Per AI_POLICY.md, any follow-up implementation PRs will be verified in the development environment with human testing before they are opened.

Additional Context

This is the first item in the delivery plan and intentionally contains no code. Subsequent PRs will implement the interface, the data model and migrations, the resolver seam, and then individual providers, each as its own focused PR.

Disclaimer Greptiles Reviews use AI, make sure to check over its work.

To better help train Greptile on our codebase, if the comment is useful and valid Like the comment, if its not helpful or invalid Dislike

To have Greptile Re-Review the changes, mention greptileai.

Greptile Summary

This PR adds a documentation-only RFC (docs/design/external-secrets.md) proposing pluggable external secrets providers for Arcane, enabling Docker Compose project environments to be resolved at deploy time from Bitwarden Secrets Manager or Infisical instead of from plaintext on disk.

  • Architecture: A single Provider interface with Resolve/HealthCheck methods, a ${{ secrets.<provider>.<KEY> }} reference syntax embedded in env values, and a resolver seam injected into the existing LoadEnvironment path after the .env merge and before compose-go receives the map.
  • Security model: Provider credentials are AES-GCM encrypted at rest; resolved values are held in memory only and must never reach disk or logs; a last-known-good cache provides resilience with fail_stale / fail_closed modes per binding.
  • Delivery plan: Six incremental PRs \u2014 interface + DB models first, then each provider, then API, frontend, and CLI.

Confidence Score: 3/5

Safe to merge as documentation, but two design gaps should be closed before the first implementation PR opens to avoid a security-violating cache implementation or implementer confusion over the reference syntax.

The reference syntax is used consistently throughout the document body but simultaneously listed as an unresolved open question. More critically, the last-known-good cache is described as holding plaintext resolved values with no explicit in-memory-only constraint, which directly conflicts with the hard no-plaintext-on-disk security requirement; a GORM-backed implementation would silently violate it.

docs/design/external-secrets.md — the Resilience section (cache storage medium) and the Open questions section (reference syntax decision) need updates before implementation begins.

Security Review

  • Cache storage gap (plaintext-at-rest risk): The resilience section specifies a last-known-good cache for resolved secret values without constraining it to in-memory storage. Security requirement chore(deps): bump dockerode from 4.0.5 to 4.0.6 #1 prohibits resolved values from being written to disk; an implementer choosing a GORM-backed persistent cache would violate this silently. The document must explicitly require the cache to be in-memory only.
  • No hardcoded secrets, injection risks, or auth/authz bypasses were identified in this documentation-only PR.

Fix All in Codex Fix All in Claude Code

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
docs/design/external-secrets.md:240
**Reference syntax is already committed but still listed as open**

The "Open questions" section asks to pick between `${{ secrets.x.y }}` and `arcane+secret://x/y`, but the entire document body — Goals, Reference syntax section, example `.env` snippets, injection seam steps, and the `secret-refs` API endpoint — has already committed to the `${{ secrets.<provider>.<KEY> }}` form. An implementer starting PR #2 (Bitwarden provider) or PR #1 (resolver seam) will find conflicting signals and may not know whether this is genuinely open or accidentally left in. This open question should be struck and replaced with a record of the decision, or it should be clearly flagged as "resolved: using `${{ }}`" to avoid ambiguity in follow-up PRs.

### Issue 2 of 3
docs/design/external-secrets.md:117-122
**Last-known-good cache storage medium is unspecified, creating a potential security-requirement conflict**

The resilience section describes a "last-known-good cache (keyed by provider+scope+key, TTL configurable)" that holds resolved secret values (plaintext). Security requirement #1 says those values must never be written to disk. If an implementer builds this cache against the project's existing SQLite/Postgres GORM store — a natural choice for cross-restart resilience — they would silently violate that requirement. The document needs to explicitly say the cache must be **in-memory only and process-scoped**, and it should acknowledge the consequence: after a process restart with an unreachable provider, `fail_stale` behaves identically to `fail_closed` because the in-process cache is empty. That's a meaningful operational trade-off that belongs in the spec before implementation begins.

### Issue 3 of 3
docs/design/external-secrets.md:117-122
**`fail_stale` default has no defined maximum staleness, allowing indefinitely-old secrets**

`fail_stale` is marked as the default for non-critical bindings, with "TTL configurable" as the only guard. Without a hard default TTL (or an explicit warning when the cache is older than N hours/days), a deployment could proceed using credentials fetched weeks ago: a rotated token, a revoked key, or a deleted secret would silently "succeed" until the cache is invalidated. The "Open questions" section flags TTL defaults but doesn't give a proposed safe default. Before PR #1 lands, the document should name a concrete default (e.g. 24 h) and specify whether an expired-but-still-cached entry is treated as missing (→ `fail_closed` behaviour) or as valid until replaced.

Reviews (1): Last reviewed commit: "docs: add external secrets providers des..." | Re-trigger Greptile

Greptile also left 3 inline comments on this PR.

@manawenuz manawenuz requested a review from a team June 14, 2026 10:25

## Open questions

- Reference syntax: `${{ secrets.x.y }}` vs `arcane+secret://x/y` — pick one and document.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Reference syntax is already committed but still listed as open

The "Open questions" section asks to pick between ${{ secrets.x.y }} and arcane+secret://x/y, but the entire document body — Goals, Reference syntax section, example .env snippets, injection seam steps, and the secret-refs API endpoint — has already committed to the ${{ secrets.<provider>.<KEY> }} form. An implementer starting PR #2 (Bitwarden provider) or PR #1 (resolver seam) will find conflicting signals and may not know whether this is genuinely open or accidentally left in. This open question should be struck and replaced with a record of the decision, or it should be clearly flagged as "resolved: using ${{ }}" to avoid ambiguity in follow-up PRs.

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/design/external-secrets.md
Line: 240

Comment:
**Reference syntax is already committed but still listed as open**

The "Open questions" section asks to pick between `${{ secrets.x.y }}` and `arcane+secret://x/y`, but the entire document body — Goals, Reference syntax section, example `.env` snippets, injection seam steps, and the `secret-refs` API endpoint — has already committed to the `${{ secrets.<provider>.<KEY> }}` form. An implementer starting PR #2 (Bitwarden provider) or PR #1 (resolver seam) will find conflicting signals and may not know whether this is genuinely open or accidentally left in. This open question should be struck and replaced with a record of the decision, or it should be clearly flagged as "resolved: using `${{ }}`" to avoid ambiguity in follow-up PRs.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex Fix in Claude Code

Comment on lines +117 to +122
Each provider wraps a **last-known-good cache** (keyed by provider+scope+key, TTL
configurable). On a provider/network failure:

- `fail_stale` (default for non-critical): serve cached values, mark deploy as using
stale secrets, surface staleness in UI/events.
- `fail_closed`: abort the deploy with a clear error.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security Last-known-good cache storage medium is unspecified, creating a potential security-requirement conflict

The resilience section describes a "last-known-good cache (keyed by provider+scope+key, TTL configurable)" that holds resolved secret values (plaintext). Security requirement #1 says those values must never be written to disk. If an implementer builds this cache against the project's existing SQLite/Postgres GORM store — a natural choice for cross-restart resilience — they would silently violate that requirement. The document needs to explicitly say the cache must be in-memory only and process-scoped, and it should acknowledge the consequence: after a process restart with an unreachable provider, fail_stale behaves identically to fail_closed because the in-process cache is empty. That's a meaningful operational trade-off that belongs in the spec before implementation begins.

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/design/external-secrets.md
Line: 117-122

Comment:
**Last-known-good cache storage medium is unspecified, creating a potential security-requirement conflict**

The resilience section describes a "last-known-good cache (keyed by provider+scope+key, TTL configurable)" that holds resolved secret values (plaintext). Security requirement #1 says those values must never be written to disk. If an implementer builds this cache against the project's existing SQLite/Postgres GORM store — a natural choice for cross-restart resilience — they would silently violate that requirement. The document needs to explicitly say the cache must be **in-memory only and process-scoped**, and it should acknowledge the consequence: after a process restart with an unreachable provider, `fail_stale` behaves identically to `fail_closed` because the in-process cache is empty. That's a meaningful operational trade-off that belongs in the spec before implementation begins.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex Fix in Claude Code

Comment on lines +117 to +122
Each provider wraps a **last-known-good cache** (keyed by provider+scope+key, TTL
configurable). On a provider/network failure:

- `fail_stale` (default for non-critical): serve cached values, mark deploy as using
stale secrets, surface staleness in UI/events.
- `fail_closed`: abort the deploy with a clear error.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 fail_stale default has no defined maximum staleness, allowing indefinitely-old secrets

fail_stale is marked as the default for non-critical bindings, with "TTL configurable" as the only guard. Without a hard default TTL (or an explicit warning when the cache is older than N hours/days), a deployment could proceed using credentials fetched weeks ago: a rotated token, a revoked key, or a deleted secret would silently "succeed" until the cache is invalidated. The "Open questions" section flags TTL defaults but doesn't give a proposed safe default. Before PR #1 lands, the document should name a concrete default (e.g. 24 h) and specify whether an expired-but-still-cached entry is treated as missing (→ fail_closed behaviour) or as valid until replaced.

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/design/external-secrets.md
Line: 117-122

Comment:
**`fail_stale` default has no defined maximum staleness, allowing indefinitely-old secrets**

`fail_stale` is marked as the default for non-critical bindings, with "TTL configurable" as the only guard. Without a hard default TTL (or an explicit warning when the cache is older than N hours/days), a deployment could proceed using credentials fetched weeks ago: a rotated token, a revoked key, or a deleted secret would silently "succeed" until the cache is invalidated. The "Open questions" section flags TTL defaults but doesn't give a proposed safe default. Before PR #1 lands, the document should name a concrete default (e.g. 24 h) and specify whether an expired-but-still-cached entry is treated as missing (→ `fail_closed` behaviour) or as valid until replaced.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex Fix in Claude Code

Adds a draft design document proposing pluggable external secrets
providers that resolve a Compose project environment from a secrets
manager at deploy time, with provider credentials encrypted at rest
and resolved values kept in memory only. Documents the provider
interface, data model, security requirements, the API, CLI and
frontend surfaces, the bitwarden and infisical backends, and a
six-step incremental delivery plan.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@manawenuz manawenuz force-pushed the docs/external-secrets-rfc branch from 3a0ffc7 to f921599 Compare June 14, 2026 10:32
Promote vault from 'later' to a planned implementation alongside bitwarden
and infisical. Adds the vault provider section (KV v1/v2, token auth, sys/health
HealthCheck, scope_path mapping), vault config fields in the data model, and
inserts it as slice 4 in the PR delivery plan (API/Frontend/CLI shift to 5/6/7).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant