docs: add external secrets providers design RFC by manawenuz · Pull Request #2936 · getarcaneapp/arcane

manawenuz · 2026-06-14T10:25:59Z

Checklist

This PR is not opened from my fork's main branch

What This PR Implements

This PR adds a design document (RFC) only, with no code changes. It proposes pluggable external secrets providers so a Compose project environment can be resolved at deploy time from a secrets manager (Bitwarden Secrets Manager and Infisical first, Vault, OpenBao and SOPS later) instead of from plaintext on disk.

The document defines a single provider interface with multiple backends selected by config, a provider-agnostic reference syntax embedded in env values, an injection seam in the environment loader, encryption of provider credentials at rest using the existing AES-GCM crypto helper, and a hard requirement that resolved values stay in memory and never reach disk or logs. It also lays out a six-step incremental delivery plan so the feature can land in small reviewable slices.

Opening this as an RFC first to gather maintainer feedback on the interface, the reference syntax choice, and the security model before any implementation PRs.

Fixes:

Changes Made

Added docs/design/external-secrets.md describing the architecture, data model, security requirements, API surface, the bitwarden and infisical providers, and the incremental delivery plan. No source code is touched.

Testing Done

Documentation only, so no runtime testing applies. Verified the markdown renders correctly and that internal references are consistent.

AI Tool Used (if applicable)

Claude Code (Claude Opus 4.8) was used to study existing codebase patterns and to draft this design document. The author reviewed and edited the content before submission. Per AI_POLICY.md, any follow-up implementation PRs will be verified in the development environment with human testing before they are opened.

Additional Context

This is the first item in the delivery plan and intentionally contains no code. Subsequent PRs will implement the interface, the data model and migrations, the resolver seam, and then individual providers, each as its own focused PR.

Disclaimer Greptiles Reviews use AI, make sure to check over its work.

To better help train Greptile on our codebase, if the comment is useful and valid Like the comment, if its not helpful or invalid Dislike

To have Greptile Re-Review the changes, mention greptileai.

Greptile Summary

This PR adds a documentation-only RFC (docs/design/external-secrets.md) proposing pluggable external secrets providers for Arcane, enabling Docker Compose project environments to be resolved at deploy time from Bitwarden Secrets Manager or Infisical instead of from plaintext on disk.

Architecture: A single Provider interface with Resolve/HealthCheck methods, a ${{ secrets.<provider>.<KEY> }} reference syntax embedded in env values, and a resolver seam injected into the existing LoadEnvironment path after the .env merge and before compose-go receives the map.
Security model: Provider credentials are AES-GCM encrypted at rest; resolved values are held in memory only and must never reach disk or logs; a last-known-good cache provides resilience with fail_stale / fail_closed modes per binding.
Delivery plan: Six incremental PRs \u2014 interface + DB models first, then each provider, then API, frontend, and CLI.

Confidence Score: 3/5

Safe to merge as documentation, but two design gaps should be closed before the first implementation PR opens to avoid a security-violating cache implementation or implementer confusion over the reference syntax.

The reference syntax is used consistently throughout the document body but simultaneously listed as an unresolved open question. More critically, the last-known-good cache is described as holding plaintext resolved values with no explicit in-memory-only constraint, which directly conflicts with the hard no-plaintext-on-disk security requirement; a GORM-backed implementation would silently violate it.

docs/design/external-secrets.md — the Resilience section (cache storage medium) and the Open questions section (reference syntax decision) need updates before implementation begins.

Security Review

Cache storage gap (plaintext-at-rest risk): The resilience section specifies a last-known-good cache for resolved secret values without constraining it to in-memory storage. Security requirement chore(deps): bump dockerode from 4.0.5 to 4.0.6 #1 prohibits resolved values from being written to disk; an implementer choosing a GORM-backed persistent cache would violate this silently. The document must explicitly require the cache to be in-memory only.
No hardcoded secrets, injection risks, or auth/authz bypasses were identified in this documentation-only PR.

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
docs/design/external-secrets.md:240
**Reference syntax is already committed but still listed as open**

The "Open questions" section asks to pick between `${{ secrets.x.y }}` and `arcane+secret://x/y`, but the entire document body — Goals, Reference syntax section, example `.env` snippets, injection seam steps, and the `secret-refs` API endpoint — has already committed to the `${{ secrets.<provider>.<KEY> }}` form. An implementer starting PR #2 (Bitwarden provider) or PR #1 (resolver seam) will find conflicting signals and may not know whether this is genuinely open or accidentally left in. This open question should be struck and replaced with a record of the decision, or it should be clearly flagged as "resolved: using `${{ }}`" to avoid ambiguity in follow-up PRs.

### Issue 2 of 3
docs/design/external-secrets.md:117-122
**Last-known-good cache storage medium is unspecified, creating a potential security-requirement conflict**

The resilience section describes a "last-known-good cache (keyed by provider+scope+key, TTL configurable)" that holds resolved secret values (plaintext). Security requirement #1 says those values must never be written to disk. If an implementer builds this cache against the project's existing SQLite/Postgres GORM store — a natural choice for cross-restart resilience — they would silently violate that requirement. The document needs to explicitly say the cache must be **in-memory only and process-scoped**, and it should acknowledge the consequence: after a process restart with an unreachable provider, `fail_stale` behaves identically to `fail_closed` because the in-process cache is empty. That's a meaningful operational trade-off that belongs in the spec before implementation begins.

### Issue 3 of 3
docs/design/external-secrets.md:117-122
**`fail_stale` default has no defined maximum staleness, allowing indefinitely-old secrets**

`fail_stale` is marked as the default for non-critical bindings, with "TTL configurable" as the only guard. Without a hard default TTL (or an explicit warning when the cache is older than N hours/days), a deployment could proceed using credentials fetched weeks ago: a rotated token, a revoked key, or a deleted secret would silently "succeed" until the cache is invalidated. The "Open questions" section flags TTL defaults but doesn't give a proposed safe default. Before PR #1 lands, the document should name a concrete default (e.g. 24 h) and specify whether an expired-but-still-cached entry is treated as missing (→ `fail_closed` behaviour) or as valid until replaced.

_{Reviews (1): Last reviewed commit: "docs: add external secrets providers des..." | Re-trigger Greptile}

Greptile also left 3 inline comments on this PR.

greptile-apps · 2026-06-14T10:28:48Z

+
+## Open questions
+
+- Reference syntax: `${{ secrets.x.y }}` vs `arcane+secret://x/y` — pick one and document.


Reference syntax is already committed but still listed as open

The "Open questions" section asks to pick between ${{ secrets.x.y }} and arcane+secret://x/y, but the entire document body — Goals, Reference syntax section, example .env snippets, injection seam steps, and the secret-refs API endpoint — has already committed to the ${{ secrets.<provider>.<KEY> }} form. An implementer starting PR #2 (Bitwarden provider) or PR #1 (resolver seam) will find conflicting signals and may not know whether this is genuinely open or accidentally left in. This open question should be struck and replaced with a record of the decision, or it should be clearly flagged as "resolved: using ${{ }}" to avoid ambiguity in follow-up PRs.

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/design/external-secrets.md Line: 240 Comment: **Reference syntax is already committed but still listed as open** The "Open questions" section asks to pick between `${{ secrets.x.y }}` and `arcane+secret://x/y`, but the entire document body — Goals, Reference syntax section, example `.env` snippets, injection seam steps, and the `secret-refs` API endpoint — has already committed to the `${{ secrets.<provider>.<KEY> }}` form. An implementer starting PR #2 (Bitwarden provider) or PR #1 (resolver seam) will find conflicting signals and may not know whether this is genuinely open or accidentally left in. This open question should be struck and replaced with a record of the decision, or it should be clearly flagged as "resolved: using `${{ }}`" to avoid ambiguity in follow-up PRs. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-14T10:28:49Z

+Each provider wraps a **last-known-good cache** (keyed by provider+scope+key, TTL
+configurable). On a provider/network failure:
+
+- `fail_stale` (default for non-critical): serve cached values, mark deploy as using
+  stale secrets, surface staleness in UI/events.
+- `fail_closed`: abort the deploy with a clear error.


Last-known-good cache storage medium is unspecified, creating a potential security-requirement conflict

The resilience section describes a "last-known-good cache (keyed by provider+scope+key, TTL configurable)" that holds resolved secret values (plaintext). Security requirement #1 says those values must never be written to disk. If an implementer builds this cache against the project's existing SQLite/Postgres GORM store — a natural choice for cross-restart resilience — they would silently violate that requirement. The document needs to explicitly say the cache must be in-memory only and process-scoped, and it should acknowledge the consequence: after a process restart with an unreachable provider, fail_stale behaves identically to fail_closed because the in-process cache is empty. That's a meaningful operational trade-off that belongs in the spec before implementation begins.

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/design/external-secrets.md Line: 117-122 Comment: **Last-known-good cache storage medium is unspecified, creating a potential security-requirement conflict** The resilience section describes a "last-known-good cache (keyed by provider+scope+key, TTL configurable)" that holds resolved secret values (plaintext). Security requirement #1 says those values must never be written to disk. If an implementer builds this cache against the project's existing SQLite/Postgres GORM store — a natural choice for cross-restart resilience — they would silently violate that requirement. The document needs to explicitly say the cache must be **in-memory only and process-scoped**, and it should acknowledge the consequence: after a process restart with an unreachable provider, `fail_stale` behaves identically to `fail_closed` because the in-process cache is empty. That's a meaningful operational trade-off that belongs in the spec before implementation begins. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-14T10:28:50Z

+Each provider wraps a **last-known-good cache** (keyed by provider+scope+key, TTL
+configurable). On a provider/network failure:
+
+- `fail_stale` (default for non-critical): serve cached values, mark deploy as using
+  stale secrets, surface staleness in UI/events.
+- `fail_closed`: abort the deploy with a clear error.


fail_stale default has no defined maximum staleness, allowing indefinitely-old secrets

fail_stale is marked as the default for non-critical bindings, with "TTL configurable" as the only guard. Without a hard default TTL (or an explicit warning when the cache is older than N hours/days), a deployment could proceed using credentials fetched weeks ago: a rotated token, a revoked key, or a deleted secret would silently "succeed" until the cache is invalidated. The "Open questions" section flags TTL defaults but doesn't give a proposed safe default. Before PR #1 lands, the document should name a concrete default (e.g. 24 h) and specify whether an expired-but-still-cached entry is treated as missing (→ fail_closed behaviour) or as valid until replaced.

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/design/external-secrets.md Line: 117-122 Comment: **`fail_stale` default has no defined maximum staleness, allowing indefinitely-old secrets** `fail_stale` is marked as the default for non-critical bindings, with "TTL configurable" as the only guard. Without a hard default TTL (or an explicit warning when the cache is older than N hours/days), a deployment could proceed using credentials fetched weeks ago: a rotated token, a revoked key, or a deleted secret would silently "succeed" until the cache is invalidated. The "Open questions" section flags TTL defaults but doesn't give a proposed safe default. Before PR #1 lands, the document should name a concrete default (e.g. 24 h) and specify whether an expired-but-still-cached entry is treated as missing (→ `fail_closed` behaviour) or as valid until replaced. How can I resolve this? If you propose a fix, please make it concise.

Adds a draft design document proposing pluggable external secrets providers that resolve a Compose project environment from a secrets manager at deploy time, with provider credentials encrypted at rest and resolved values kept in memory only. Documents the provider interface, data model, security requirements, the API, CLI and frontend surfaces, the bitwarden and infisical backends, and a six-step incremental delivery plan. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Promote vault from 'later' to a planned implementation alongside bitwarden and infisical. Adds the vault provider section (KV v1/v2, token auth, sys/health HealthCheck, scope_path mapping), vault config fields in the data model, and inserts it as slice 4 in the PR delivery plan (API/Frontend/CLI shift to 5/6/7).

manawenuz requested a review from a team June 14, 2026 10:25

graphite-app Bot assigned manawenuz Jun 14, 2026

github-actions Bot added the quality/verified label Jun 14, 2026

greptile-apps Bot reviewed Jun 14, 2026

View reviewed changes

manawenuz force-pushed the docs/external-secrets-rfc branch from 3a0ffc7 to f921599 Compare June 14, 2026 10:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

docs: add external secrets providers design RFC#2936

docs: add external secrets providers design RFC#2936
manawenuz wants to merge 2 commits into
getarcaneapp:mainfrom
manawenuz:docs/external-secrets-rfc

manawenuz commented Jun 14, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jun 14, 2026

Uh oh!

greptile-apps Bot Jun 14, 2026

Uh oh!

greptile-apps Bot Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant


		## Open questions

		- Reference syntax: `${{ secrets.x.y }}` vs `arcane+secret://x/y` — pick one and document.

Uh oh!

Conversation

manawenuz commented Jun 14, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

What This PR Implements

Changes Made

Testing Done

AI Tool Used (if applicable)

Additional Context

Greptile Summary

Confidence Score: 3/5

Security Review

Uh oh!

greptile-apps Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

manawenuz commented Jun 14, 2026 •

edited by greptile-apps Bot

Loading