From f4034d8be66e0f077c2587d39c890614338957d0 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Fri, 22 May 2026 16:01:27 +0900 Subject: [PATCH 1/2] docs/design: rename 2026_05_16 admin purge queue proposal to implemented Phase 6 of the rollout plan: all milestones in docs/design/2026_05_16_*_admin_purge_queue.md have shipped on main: Phase 2 (#771) - AdminPurgeQueue + IsDLQ/DLQSources Phase 3 (#794) - AdminPeekQueue backend Phase 4 (#797) - HTTP handler + bridge Phase 5 (#798) - SPA Messages tab + Purge button + DLQ chips Rename the file via git mv per CLAUDE.md's lifecycle rule (*_proposed_*.md to *_implemented_*.md), bump Status, and add an Implementation history table so a future reader can trace each phase back to its PR. Out-of-scope follow-ups (throttle integration, audit + Prometheus counters, principalForReadSensitive live re-check, page-size selector) are listed in the doc but are tracked separately and do not gate this rename. --- ...026_05_16_implemented_admin_purge_queue.md} | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) rename docs/design/{2026_05_16_proposed_admin_purge_queue.md => 2026_05_16_implemented_admin_purge_queue.md} (98%) diff --git a/docs/design/2026_05_16_proposed_admin_purge_queue.md b/docs/design/2026_05_16_implemented_admin_purge_queue.md similarity index 98% rename from docs/design/2026_05_16_proposed_admin_purge_queue.md rename to docs/design/2026_05_16_implemented_admin_purge_queue.md index 48cd9759..73f1352a 100644 --- a/docs/design/2026_05_16_proposed_admin_purge_queue.md +++ b/docs/design/2026_05_16_implemented_admin_purge_queue.md @@ -1,9 +1,25 @@ # Admin Queue Peek and Purge (DLQ-Aware) for the SQS Web Console -**Status:** Proposed +**Status:** Implemented **Author:** bootjp **Date:** 2026-05-16 +## Implementation history + +| Phase | PR | Landed | +|-------|-----|--------| +| 1 (this doc, proposal) | #757 | 2026-05-16 | +| 2 (`AdminPurgeQueue` + `IsDLQ`/`DLQSources`) | #771 | 2026-05-17 | +| 3 (`AdminPeekQueue` backend) | #794 | 2026-05-20 | +| 4 (HTTP handler + bridge) | #797 | 2026-05-21 | +| 5 (SPA Messages tab + Purge button + DLQ chips) | #798 | 2026-05-21 | + +Out-of-scope follow-ups (tracked separately, not gating this rename): +- Throttle integration (`bucketActionAdminPeek` + dedicated per-queue admin-peek bucket per §3.1) +- Audit logging + Prometheus counters per §3.6 +- `principalForReadSensitive` live `RoleStore` re-check (Goal 8, blocked on wider RoleStore plumbing) +- Page-size selector (20 / 50 / 100) + response-size warning + --- ## 1. Background and Motivation From d723f516c8761007927fe9734f36668f34986777 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 23 May 2026 01:41:03 +0900 Subject: [PATCH 2/2] docs/design: flag out-of-scope sections inline (r1 review) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Gemini r1 on PR #800 flagged that the Status: Implemented banner and the Out-of-scope follow-ups list at the top were not aligned with the body: §2.1 Goal 8 (live RoleStore re-check), §3.1 Throttle, §3.5 page-size selector, and §3.6 audit + Prometheus counters still read like implemented features. Add inline 'not yet implemented' markers at the entry of each deferred section, with a one-line mitigation note so a future reader (a) understands what is missing, (b) sees what the current behaviour falls back to, and (c) can cross-reference the top-of-doc follow-ups list. Body text remains unchanged below the marker so the future-implementer can pick up the design intent verbatim. Claude r1's two informational notes (skipped _partial_ lifecycle step, Status header bold-format vs README template) are noted but not actioned: the _partial_ step was a process miss during the Phase 2-5 reviews that does not change the final state, and the Status format is pre-existing across the whole docs/design/ tree (Implemented and Proposed all use bold). --- docs/design/2026_05_16_implemented_admin_purge_queue.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/design/2026_05_16_implemented_admin_purge_queue.md b/docs/design/2026_05_16_implemented_admin_purge_queue.md index 73f1352a..b5b05299 100644 --- a/docs/design/2026_05_16_implemented_admin_purge_queue.md +++ b/docs/design/2026_05_16_implemented_admin_purge_queue.md @@ -58,7 +58,7 @@ Both features work for any queue. The **UI is DLQ-aware** so the operator gets t - `DLQSources []string` — the source-queue names that point at this queue. The SPA renders these as a chip list on the detail page so the operator can confirm they understand what queue feeds the DLQ before purging. 6. **Same AWS-shaped error mapping** as the SigV4 path — purging more than once per 60 seconds returns the SQS `PurgeQueueInProgress` semantics that `tryPurgeQueueOnce` already enforces. The admin response surfaces it as a structured `429 Too Many Requests` JSON payload (`{"code":"PurgeQueueInProgress", "retry_after_seconds":N}`). 7. **Audit** — `admin.sqs.purge_queue` (subject, role, queue, generation_before, generation_after). Peek is a read and does NOT generate an audit line per call (the SPA polls; per-poll audit would drown the log) but the admin handler emits the standard request-log line with `route` / `subject` / `status_code` so the call is still traceable. -8. **Read-only role can peek but not purge.** Peek is gated by a **live `RoleStore` re-check** (not just the session-auth gate that List / Describe currently use), introduced as a new `principalForReadSensitive` helper alongside the existing `principalForWrite`. Purge stays gated by `principalForWrite` (live-role re-check), matching `AdminDeleteQueue` exactly. Codex r9 P1 flagged the security gap in the earlier draft: peek exposes full message bodies / attributes (not just metadata like List / Describe), so a session JWT that was revoked or whose role was downgraded after login could still read DLQ payloads via peek until the token's natural 1-hour TTL elapsed. The new `principalForReadSensitive` helper performs the same revocation check `principalForWrite` does, but classifies the call as a read in the audit pipeline — keeping the audit shape parallel to List / Describe while closing the confidentiality gap. List / Describe themselves remain on the session-auth-only gate because their output is metadata that is already shown on the SPA's queue list page; the divergence is intentional and is documented at the call site so a future reviewer does not "fix" the inconsistency by downgrading peek's gate. Claude r2 caught the earlier draft that implied a non-existent `principalForRead` helper; this paragraph spells out the actual gate with the security-class distinction. +8. **Read-only role can peek but not purge.** _The live `RoleStore` re-check is NOT yet implemented in the initial rollout — see "Out-of-scope follow-ups" at the top. Phase 4 shipped a `Role.AllowsRead()` gate that accepts the JWT-embedded role plus an optional RoleStore lookup; the wider live-revalidation plumbing the design calls for is blocked on a broader RoleStore refactor that affects every adapter's read path. Mitigation in absence: a revoked / downgraded key keeps peek access until the session JWT's natural 1-hour TTL expires._ Peek is gated by a **live `RoleStore` re-check** (not just the session-auth gate that List / Describe currently use), introduced as a new `principalForReadSensitive` helper alongside the existing `principalForWrite`. Purge stays gated by `principalForWrite` (live-role re-check), matching `AdminDeleteQueue` exactly. Codex r9 P1 flagged the security gap in the earlier draft: peek exposes full message bodies / attributes (not just metadata like List / Describe), so a session JWT that was revoked or whose role was downgraded after login could still read DLQ payloads via peek until the token's natural 1-hour TTL elapsed. The new `principalForReadSensitive` helper performs the same revocation check `principalForWrite` does, but classifies the call as a read in the audit pipeline — keeping the audit shape parallel to List / Describe while closing the confidentiality gap. List / Describe themselves remain on the session-auth-only gate because their output is metadata that is already shown on the SPA's queue list page; the divergence is intentional and is documented at the call site so a future reviewer does not "fix" the inconsistency by downgrading peek's gate. Claude r2 caught the earlier draft that implied a non-existent `principalForRead` helper; this paragraph spells out the actual gate with the security-class distinction. ### 2.2 Non-Goals @@ -188,7 +188,7 @@ The walk terminates when either `Partition` advances back to `StartPartition` (f Cost is `O(Limit)` round-trips against Pebble at peek time — tiny for the bounded result sets the SPA uses. The bound on `Limit` (max 100) prevents an operator script from accidentally issuing million-row peeks against the leader. -**Throttle.** Peek consults a **distinct per-queue admin-peek bucket**, *not* the per-queue `ReceiveMessage` budget. An earlier draft of this design merged the two; Claude r2 flagged that an operator paginating through a 10k-message DLQ could exhaust the budget that real consumers depend on. The separate admin-peek bucket defaults to a lower steady-rate (`adminPeekRPS = 5`, `adminPeekBurst = 20`) so a pagination loop cannot starve consumers. +**Throttle.** _Not yet implemented in the initial rollout — see "Out-of-scope follow-ups" at the top. Mitigation in absence: the Phase 3 implementation enforces a hard `Limit ≤ 100` per call and leader-only execution, which bounds the steady-state cost; per-queue throttling lands when the SPA wiring needs the rate-limit metric to have a real consumer._ Peek consults a **distinct per-queue admin-peek bucket**, *not* the per-queue `ReceiveMessage` budget. An earlier draft of this design merged the two; Claude r2 flagged that an operator paginating through a 10k-message DLQ could exhaust the budget that real consumers depend on. The separate admin-peek bucket defaults to a lower steady-rate (`adminPeekRPS = 5`, `adminPeekBurst = 20`) so a pagination loop cannot starve consumers. **Bucket key format.** The existing `bucketStore` (`adapter/sqs_throttle.go`) keys on a struct `bucketKey{queue, action, incarnation}`, not a string. The admin-peek bucket therefore uses `bucketStore.charge()` directly with `action = bucketActionAdminPeek` and the queue's current incarnation, exactly like the `SendMessage` / `ReceiveMessage` paths do. Claude r4 flagged an earlier draft that described the bucket as a free-standing string-keyed map; that would have required parallel rate-limiter infrastructure and would not have been swept by `invalidateQueue()` on queue re-creation. The `bucketStore.charge(adminPeekThrottle, queueName, bucketActionAdminPeek, meta.Incarnation, 1)` form participates in the existing incarnation reset machinery automatically. @@ -446,7 +446,7 @@ The queue detail page gains two new pieces of UI on top of the existing attribut | Body preview | `body` (already truncated by backend) | first 96 chars; "…" suffix when `body_truncated`. Row click opens detail modal. | | Size | `body_original_size` | human-readable ("1.4 kB") so operators can spot oversized messages | -Below the table: a page-size selector (20 / 50 / 100), a Refresh button, and Next / Previous controls driven by the cursor. Detail modal shows full body + every attribute + the timestamps; a "Copy as JSON" button copies the row's full record to the clipboard for manual export. +Below the table: _the page-size selector (20 / 50 / 100) is NOT yet implemented — see "Out-of-scope follow-ups" at the top. Phase 5 shipped a hard default of 20 rows; the worst-case response (20 × 256 KiB = 5 MiB) stays well under network / JSON-parse budgets even without operator-tunable sizes. Selector + size warning land in a follow-up if operators ask for larger pages._ A Refresh button, and Next / Previous controls driven by the cursor. Detail modal shows full body + every attribute + the timestamps; a "Copy as JSON" button copies the row's full record to the clipboard for manual export. **Copy as JSON payload schema.** The clipboard payload is the exact wire shape of a single `AdminPeekedMessage` entry (top-level keys: `message_id`, `body`, `body_truncated`, `body_original_size`, `sent_timestamp`, `receive_count`, `group_id`, `deduplication_id`, `attributes`) plus a wrapper `{"schema_version": 1, "queue": "", "exported_at": "", "message": { … }}`. The `schema_version` is what downstream tooling pins so a future change to the export format (e.g. multi-message JSONL bundle) does not silently break exporters. Operator workflows that pipe this into a recovery tool can rely on the schema not shifting under them. @@ -491,6 +491,8 @@ mirroring the existing `deleteQueue` / `describeQueue` shape. `peekQueue` is `si ### 3.6 Audit and observability +_Not yet implemented in the initial rollout — see "Out-of-scope follow-ups" at the top. Mitigation in absence: the admin handler still emits the standard request-log line with `route` / `subject` / `status_code` for both purge and peek calls, so an operator can correlate "who did what when" against the application logs at audit-review time. The structured `admin.sqs.purge_queue` audit line and the two Prometheus counters land alongside the SPA wiring so the metrics have a real consumer._ + New structured log line at `slog.Info` level (matches `AdminDeleteQueue`): ```