feat: memory survival, hot-path performance, and triage-first UI rewrite #103

Merged
aksOps merged 71 commits into main from feat/memory-survival-and-triage-ui on Jun 12, 2026

@aksOps aksOps commented Jun 12, 2026

Contributor

Summary

Three workstreams, 66 commits, no new data features — performance, memory survival, and a ground-up responsive UI.

A — Memory survival (fixes the production OOM restarts)

  • AnomalyStore O(N²) PRECEDED_BY edge mesh eliminated via stable per-(service,type) anomaly IDs (was 84% of live heap)
  • GOMEMLIMIT via new internal/membudget (cgroup v2 → v1 → /proc/meminfo, 75% of budget)
  • Byte-bounded ingest queue (INGEST_PIPELINE_MAX_BYTES, 128 MB SQLite default) — 429 instead of OOM
  • Budget-scaled SQLite PRAGMAs (cache = budget/32, mmap = budget/8, clamped; fail-closed kept)
  • GraphRAG bounds: 30 m trace TTL on SQLite, 500 k spans/tenant cap, 24 h idle-tenant eviction, SignalStore caps, anomaly-walk cap
  • TSDB ring: tenant|service|metric keys (cross-tenant isolation fix) + cardinality cap
  • Daily full VACUUM retired → PRAGMA optimize + incremental_vacuum (no more 10–60 min exclusive-lock 429 storms)
  • pprof on loopback (PPROF_ADDR), census gauges for every bounded structure

Soak proof (120 services, 1200 spans/s, 4 GB cgroup cap): baseline main OOMs in ~40 min (+92 MB/min); patched peaks at 705 MB with a decaying slope and a zero-growth pprof heap diff T+10→T+40.

B — Serving & hot paths

  • GetServiceMapMetrics: 500 k-row scan + per-row zstd decompress → SQL aggregation + 30 s TTL cache (dashboard 2–5 s → <100 ms)
  • Incremental GraphRAG rebuild (per-tenant high-water-mark), anomaly-scan gating on idle tenants
  • Build-time brotli/gzip precompression, immutable asset caching, index ETag/304, SPA fallback, stdlib gzip for GET /api/*, ETag on hot polled endpoints

C — Triage-first UI rewrite (C1–C7)

  • @ossrandom/design-system, cytoscape, uplot removed; React 19 + TanStack Query/Virtual + wouter + Radix primitives + cmdk + hand-rolled token CSS
  • Triage feed → root cause → blast radius → evidence loop; deterministic layered SVG flow map; Investigation Trail (shareable ?trail= URLs); ⌘K palette + keyboard chords
  • Fully responsive 360 px → ultrawide: bottom tabs / icon rail / labeled rail + docked inspector; iOS input-zoom guard
  • Budgets CI-enforced: initial JS 107.6 KB gz, CSS 2.7 KB gz, every lazy chunk ≤ 8 KB

Validation

  • 653 Go tests race-clean (28 packages); 398 UI tests (40 files), ~96.7 % line coverage on new UI code
  • golangci-lint 0 new findings; jscpd 0.35 %; npm audit clean; bundle budgets green
  • 45 min patched soak + live deployment soak — flat heap, zero drop counters

🤖 Generated with Claude Code

aksOps and others added 30 commits June 11, 2026 15:38
A 15-min SQLite soak at 120 services drove RSS to ~1.8 GB and climbing.
Heap profiling (gc=1) attributed 84% of the live heap to AnomalyStore
PRECEDED_BY edges: the 10s detector minted a NEW anomaly node every tick
per erroring service (UnixNano-suffixed ID), and correlateWithRecent then
created O(N^2) edges among them — unbounded until the 24h TTL.

- fix: stable per-(service,type) anomaly IDs so detection UPSERTS one
  evolving node instead of one-per-tick; this bounds both the node map and
  the edge mesh (AnomalyStore 272 MB -> 2.6 MB; peak RSS 1.8 GB -> 292 MB,
  now flat over the full 15 min). + regression test.
- feat: applyMemoryLimit() sets a soft GOMEMLIMIT at startup — honors an
  explicit env value, else 75% of the detected cgroup/host budget — so the
  GC paces against a ceiling instead of letting next_gc run away. Defense
  in depth; cgroup v2/v1 + /proc/meminfo detection, stdlib-only. + tests.

Validation: 3x 15-min soaks + heap profile; integrity ok, 0 drops/429s,
0 ERROR/panic, clean shutdown, goroutines/fds recover, 30k spans/120 svcs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- security(api): close cross-tenant read caused by middleware ordering —
  TenantMiddleware now passes through when auth already pinned a tenant
  (HasTenantContext), so a per-tenant key can't be escaped via X-Tenant-ID
- fix(ingest): correct token-bucket sampler math; the old cost (1/rate)
  exceeded the cap for rate<1.0 so ~100% of healthy spans were dropped
  (SQLite default 0.05 persisted almost no baseline traces)
- fix(api): clamp limit/offset on /api/logs and /api/traces (negative limit
  was passed to GORM as unlimited — heap/DB DoS)
- fix(ingest): sanitize X-Tenant-ID on the HTTP OTLP path (gRPC parity)
- fix(mcp): don't cache error tool results; enforce the response byte cap
  in resourceResult (trace_graph DB fallback was uncapped)
- fix(ui): correct ServiceSidePanel test for split design-system markup,
  mount ErrorBoundary, derive connected badge from ws.status
- chore: bump go directive to 1.25.11 to unblock the OSV-Scanner CI gate
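
The limit/offset clamp from the third bullet might look like the sketch below. `clampPage`, the default of 50 and the maximum of 1000 are assumptions for illustration; the commit does not state the real bounds or helper name.

```go
package main

import "fmt"

// clampPage bounds user-supplied paging values before they reach the ORM.
// Non-positive limits fall back to defaultLimit so a negative value can
// never be interpreted as "unlimited". All bounds here are illustrative.
func clampPage(limit, offset, defaultLimit, maxLimit int) (int, int) {
	if limit <= 0 {
		limit = defaultLimit
	}
	if limit > maxLimit {
		limit = maxLimit
	}
	if offset < 0 {
		offset = 0
	}
	return limit, offset
}

func main() {
	limit, offset := clampPage(-1, -5, 50, 1000)
	fmt.Println(limit, offset)
}
```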

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PPROF_ADDR (default 127.0.0.1:6060, empty disables) serves net/http/pprof
from its own listener so profiling never reaches the public :8080 mux.
Heap attribution is the prerequisite for proving every memory fix in the
OOM-survival series (b1983f8 was only diagnosable via heap profiles).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
otelcontext_graphrag_store_entities{entity}/_edges{store},
otelcontext_tsdb_ring_series_active, otelcontext_drain_templates_active —
sampled every 15s via len()-under-RLock census so operators can attribute
RSS growth to a specific store without a heap profile.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
~100 MB of standing buffer at the Postgres default; on SQLite the single
writer starves the workers anyway, so buffer less and let the metered
drop path engage sooner.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Both the embedded-UI server (precompressed asset negotiation, index.html
304s) and the API layer (gzip middleware, cached-payload ETags) need the
same two pure helpers; centralising them in httpconst avoids duplicating
header-parsing logic across packages.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…mbudget

Move detectMemoryBudget/readCgroupBytes/readMemTotal (cgroup v2 -> v1 ->
/proc/meminfo) out of the root-package memlimit.go into a reusable
internal/membudget package so internal/storage can budget-scale the SQLite
PRAGMA stanza without importing package main. applyMemoryLimit stays as a
thin wrapper with unchanged behavior; detection tests moved across verbatim
and the wrapper gained its own tests (operator GOMEMLIMIT honored, budget
fraction applied).
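
A rough sketch of the budget-to-limit step follows, assuming the cgroup file contents have already been read. `parseCgroupBytes` and `softLimit` are hypothetical names and are not the functions in internal/membudget. `debug.SetMemoryLimit` is the standard-library call that has the same effect as the GOMEMLIMIT environment variable.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseCgroupBytes parses the contents of a cgroup memory-limit file.
// cgroup v2 writes "max" when no limit is set; that is reported as ok=false
// so a caller can fall through to the next detection source.
func parseCgroupBytes(raw string) (int64, bool) {
	s := strings.TrimSpace(raw)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	return n, true
}

// softLimit returns 75% of the detected budget.
func softLimit(budget int64) int64 { return budget / 4 * 3 }

func main() {
	// "4294967296\n" is what a 4 GiB cgroup v2 memory.max file would contain.
	if budget, ok := parseCgroupBytes("4294967296\n"); ok {
		limit := softLimit(budget)
		debug.SetMemoryLimit(limit) // soft limit: the GC paces against it
		fmt.Println("memory limit bytes:", limit)
	}
}
```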

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds GRAPHRAG_TRACE_TTL (default 1h, SQLite default 30m via
sqliteOverrides), GRAPHRAG_MAX_SPANS_PER_TENANT (default 500000) and
GRAPHRAG_TENANT_IDLE_TTL (default 24h). The TraceStore span window is
the largest GraphRAG heap consumer at 120 services; anomaly and
investigation lookbacks are <=5min so the 30min SQLite window is safe.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the bare http.FileServer over the embedded dist with a purpose-
built spaHandler:

- /assets/* (Vite hashed filenames): Accept-Encoding negotiation against
  build-time .br/.gz siblings, Content-Type from the original extension,
  Cache-Control: public, max-age=31536000, immutable + Vary.
- index.html: Cache-Control: no-cache with a startup-computed sha256 ETag;
  If-None-Match returns 304, precompressed variants served when accepted.
- SPA fallback: unknown extensionless paths serve index.html for client-
  side routing, preserving the old spaFS dot-in-path 404 contract and
  explicitly refusing machine namespaces (/api, /v1, /ws, /mcp, /metrics).
- Source-only checkouts (stub dist) degrade to a descriptive 404.

Embed directives unchanged; siblings are emitted by the build-time
precompress script (wired separately) and picked up by all:dist.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pre-existing lint-gate failures (gofmt struct alignment, nolintlint on
seriesPerTenant which the unused linter no longer flags). No semantic
change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tion

Two bugs in the RingBuffer dashboard accelerator:

(a) the rings map was uncapped and bypassed cardinality enforcement —
    Aggregator.Ingest fed the ring before the bucket-level cap check, so
    a cardinality flood grew Go heap without bound on SQLite deploys.
(b) rings were keyed service|metric, merging points from different
    tenants into one series — a data-isolation breach.

Fixes:
- key rings by tenant|service|metric; empty tenant coerces to
  storage.DefaultTenantID so single-tenant reads and writes agree
- RingBuffer gains maxSeries (0 = unlimited, backward compatible):
  NEW series past the cap are refused (Record returns false) while
  existing series keep recording; refusals fire onSeriesRejected
  outside the lock
- new counter otelcontext_tsdb_ring_series_rejected_total in
  internal/telemetry/metrics.go surfaces refusals
- startup wiring: NewRingBuffer now takes METRIC_MAX_CARDINALITY and
  the counter's Inc as the rejection callback (main.go, one line)

QueryRecent/Record/AllKeys callers: only Aggregator.Ingest and main.go —
no external QueryRecent callers exist in the tree.
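
The two fixes (tenant-scoped keys and a cap that refuses only new series) can be sketched together as below. The `ring` type, `newRing`, `seriesKey` and the `"default"` tenant value are assumptions for illustration and do not reproduce the real RingBuffer API; per-series point retention is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultTenantID stands in for storage.DefaultTenantID; the value is assumed.
const defaultTenantID = "default"

// seriesKey scopes every series by tenant so two tenants never share a ring.
func seriesKey(tenant, service, metric string) string {
	if tenant == "" {
		tenant = defaultTenantID
	}
	return tenant + "|" + service + "|" + metric
}

type ring struct {
	mu         sync.Mutex
	series     map[string][]float64
	maxSeries  int    // 0 means unlimited
	onRejected func() // invoked outside the lock
}

func newRing(maxSeries int, onRejected func()) *ring {
	return &ring{series: make(map[string][]float64), maxSeries: maxSeries, onRejected: onRejected}
}

// Record appends a point. New series past the cap are refused, while
// series that already exist keep recording.
func (r *ring) Record(tenant, service, metric string, v float64) bool {
	key := seriesKey(tenant, service, metric)
	r.mu.Lock()
	pts, exists := r.series[key]
	if !exists && r.maxSeries > 0 && len(r.series) >= r.maxSeries {
		r.mu.Unlock()
		if r.onRejected != nil {
			r.onRejected()
		}
		return false
	}
	r.series[key] = append(pts, v)
	r.mu.Unlock()
	return true
}

func main() {
	rejected := 0
	r := newRing(2, func() { rejected++ })
	fmt.Println(r.Record("tenant-a", "api", "latency", 12.5))
	fmt.Println(r.Record("tenant-b", "api", "latency", 99.0))
	fmt.Println(r.Record("tenant-c", "api", "latency", 1.0))
	fmt.Println("rejected:", rejected)
}
```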

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New zero-dep ui/scripts/precompress.mjs (node:zlib only — air-gapped
build friendly) emits .br (quality 11) and .gz (level 9) siblings for
every dist *.{js,css,html,svg,json} >= 1KB, printing a size table.
Wired into the ui build script after vite build; the Go binary's
all:dist embed picks the siblings up and internal/ui negotiates them
via Accept-Encoding.

dist content stays uncommitted (source-only main) — the script runs at
release time via the existing build flow.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… auto_vacuum

The pure-Go SQLite driver's page cache and mmap window are Go-heap and
address-space costs, so the hardcoded 256 MB cache + 1 GB mmap pair starved
4 GB hosts. The stanza now scales both against the detected memory budget
(internal/membudget: cgroup v2 -> v1 -> /proc/meminfo): cache = budget/32
clamped to [64 MB, 256 MB], mmap = budget/8 clamped to [256 MB, 1 GB] —
a 4 GB host yields 128 MB cache + 512 MB mmap. Detection failure falls back
to the previous hardcoded values; operator overrides SQLITE_CACHE_SIZE_KB /
SQLITE_MMAP_SIZE_BYTES win unconditionally. The fail-closed Exec loop and
every other pragma are unchanged (round-trip-guarded by tests).
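
The scaling rule above reduces to two clamped divisions. The sketch below reproduces only that arithmetic; the function names are hypothetical, and the real stanza also handles operator overrides and the detection-failure fallback.

```go
package main

import "fmt"

const (
	mib = int64(1) << 20
	gib = int64(1) << 30
)

func clamp(v, lo, hi int64) int64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// sqliteCacheBytes is budget/32 clamped to [64 MB, 256 MB].
func sqliteCacheBytes(budget int64) int64 { return clamp(budget/32, 64*mib, 256*mib) }

// sqliteMmapBytes is budget/8 clamped to [256 MB, 1 GB].
func sqliteMmapBytes(budget int64) int64 { return clamp(budget/8, 256*mib, 1*gib) }

func main() {
	budget := 4 * gib
	fmt.Println(sqliteCacheBytes(budget)/mib, "MB cache")
	fmt.Println(sqliteMmapBytes(budget)/mib, "MB mmap")
}
```

For a 4 GB budget this yields the 128 MB cache and 512 MB mmap quoted in the commit message; a 1 GB budget hits both lower clamps and a 64 GB budget hits both upper clamps.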

Also sets PRAGMA auto_vacuum=INCREMENTAL best-effort (log, never abort)
BEFORE journal_mode=WAL — the WAL switch initializes the file header, after
which the stored auto_vacuum mode is frozen. Only takes effect on databases
this process creates; prepares fresh deploys for incremental_vacuum-based
retention maintenance.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TraceStore gains MaxSpans (GRAPHRAG_MAX_SPANS_PER_TENANT, default
500k): at the cap, NEW spans are skipped — UpsertSpan returns false and
processSpan records the drop via the existing recordEventDrop seam as
signal="span_capacity" on otelcontext_graphrag_events_dropped_total.
Updates to resident span IDs still apply; the graph is best-effort and
the DB stays the source of truth. main.go wires MaxSpansPerTenant plus
the GRAPHRAG_TRACE_TTL / GRAPHRAG_TENANT_IDLE_TTL duration knobs
(unparsable values fall back to package defaults, DLQ-interval style).

Also lands the tenantStores lastAccess/lastEventAt/lastRebuildMax
bookkeeping fields and the storesForTenant/tenantStoresNoTouch split
consumed by the follow-up eviction and incremental-rebuild commits.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New internal/api/compress.go: pooled gzip.Writer (sync.Pool) wrapping
the ResponseWriter for GET /api/* when the client accepts gzip. Lazy
engagement on first write — 204/304 and already-encoded responses pass
through, Content-Length is dropped, Vary: Accept-Encoding is set on
every eligible path. Flush propagates through the gzip buffer so
streaming handlers keep incremental delivery; Unwrap supports
http.ResponseController.

Wired in main.go as the innermost wrapper (directly around the mux,
before TenantMiddleware) so only handler output is compressed and
/ws*, /v1/*, /metrics*, and the MCP/SSE path are untouched — WebSocket
hijacking keeps the raw writer by construction.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The queue gated only on item count; each Batch holds unbounded
Traces/Spans/Logs slices, so 50k queued batches could pin GBs of
heap. Add byte accounting:

- (*Batch).approxBytes() — O(records) heap estimate, computed once
  at Submit and released by process() (deferred; panic path included)
- hard cap INGEST_PIPELINE_MAX_BYTES (default 512MB, SQLite default
  128MB, 1MB floor) — at the cap Submit rejects with ErrQueueFull
  even for priority batches: a 429 is recoverable, an OOM kill is not
- soft backpressure now fires on max(itemFullness, byteFullness)
- new gauge otelcontext_ingest_pipeline_queue_bytes; drop reason
  "bytes_full" on otelcontext_ingest_pipeline_dropped_total
- Stats() exposes QueueBytes / MaxBytes / RejectedBytes
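
The byte accounting described in this list can be sketched as follows. It is a simplified single-mutex model, not the pipeline's implementation: the `Batch` shape, `Queue`, `NewQueue` and `Process` are assumptions, and priority handling and soft backpressure are left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrQueueFull = errors.New("ingest queue full")

// Batch is a hypothetical stand-in holding raw record payloads.
type Batch struct {
	Payloads [][]byte
}

// approxBytes is an O(records) heap estimate.
func (b *Batch) approxBytes() int64 {
	var n int64
	for _, p := range b.Payloads {
		n += int64(len(p))
	}
	return n
}

type queued struct {
	batch *Batch
	size  int64
}

type Queue struct {
	mu       sync.Mutex
	maxBytes int64
	curBytes int64
	items    []queued
}

func NewQueue(maxBytes int64) *Queue { return &Queue{maxBytes: maxBytes} }

// Submit computes the size once and rejects at the byte cap.
func (q *Queue) Submit(b *Batch) error {
	size := b.approxBytes()
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.curBytes+size > q.maxBytes {
		return ErrQueueFull
	}
	q.curBytes += size
	q.items = append(q.items, queued{batch: b, size: size})
	return nil
}

// Process handles one batch; the deferred release also runs if handle panics.
func (q *Queue) Process(handle func(*Batch)) bool {
	q.mu.Lock()
	if len(q.items) == 0 {
		q.mu.Unlock()
		return false
	}
	it := q.items[0]
	q.items = q.items[1:]
	q.mu.Unlock()
	defer func() {
		q.mu.Lock()
		q.curBytes -= it.size
		q.mu.Unlock()
	}()
	handle(it.batch)
	return true
}

func main() {
	q := NewQueue(10) // tiny cap for illustration
	b := &Batch{Payloads: [][]byte{[]byte("123456")}}
	fmt.Println(q.Submit(b))
	fmt.Println(q.Submit(b))
}
```

An HTTP ingest handler would translate `ErrQueueFull` into a 429 response, which a client can retry.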

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…(P1.5)

Tenants with no ingest event or query for GRAPHRAG_TENANT_IDLE_TTL
(default 24h) are dropped from the coordinator map on the 60s refresh
tick, freeing all four per-tenant stores at once. storage.DefaultTenantID
is immune. Eviction is self-healing: rebuildAllTenantsFromDB re-creates
genuinely active tenants from recent spans within one tick, and any
ingest/query re-creates the slice instantly with a fresh idle window.
The DB rebuild path deliberately uses tenantStoresNoTouch so 60s
bookkeeping alone cannot keep a dormant tenant alive.

New counter otelcontext_graphrag_tenants_evicted_total in
internal/telemetry/metrics.go (shared file — promauto pattern).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The daily maintenance pass ran a full VACUUM, which holds a whole-file
exclusive lock for 10-60 minutes on multi-GB databases (admin_handlers.go
documents this for drop_fts) — ingest stalls into a 429 storm and the async
pipeline queue/RAM spiral. The SQLite branch now keeps PRAGMA optimize and
runs PRAGMA incremental_vacuum(10000) instead, releasing up to 10k freelist
pages per day without the exclusive lock. Fresh databases are provisioned
auto_vacuum=INCREMENTAL by NewDatabase; on legacy auto_vacuum=NONE files the
pragma is a harmless no-op.

incremental_vacuum is a row-returning pragma that frees one page per step —
Exec steps it once on glebarez/modernc, so the statement is queried and
drained (drainQuery) to reclaim the full batch.

Operators can restore the legacy behavior with RETENTION_FULL_VACUUM=true
(new config field, default false); POST /api/admin/vacuum remains for
on-demand full VACUUM. Postgres/MySQL maintenance is byte-identical.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…1.6)

SignalStore.Prune (modeled on TraceStore.Prune) runs on the 60s refresh
tick: MetricNodes idle >24h are dropped, the oldest-LastSeen overflow
past 2000/tenant is evicted, each removal takes its MEASURED_BY edge,
and remaining edges are swept by UpdatedAt. The metric/log-cluster
upsert paths now refresh edge UpdatedAt on every hit so the sweep never
severs a live correlation (previously edges kept their creation time
forever). LogClusters stay untouched — Drain bounds them upstream.

AnomaliesSince gains a bounded sibling AnomaliesSinceLimit;
correlateWithRecent walks at most 1000 recent anomalies per detection
so a pathological backlog cannot turn each 10s tick into an O(N) scan
plus O(N) PRECEDED_BY fan-out.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
GetServiceMapMetrics loaded up to 500k full Span rows — zstd-decompressing
the CompressedText attributes column per row in Scan() — then aggregated in
Go, costing 2-5s and ~300MB transient on dashboard loads.

Node stats now come from one portable GROUP BY aggregate (COUNT/AVG/SUM CASE
WHEN, runs on sqlite/postgres/mysql/sqlserver; duration * 1.0 keeps AVG in
floating point where AVG(bigint) truncates). Edge stats keep the in-Go
parent-resolution pass but over a six-column projection so the attributes
column is never scanned. Output proven equivalent to the old implementation
by a reference-copy equality test on a fixture covering every branch.

One intentional delta: ServiceMapNode.ErrorCount is now populated from span
status (STATUS_CODE_ERROR). The old path left it permanently 0 — a latent
bug, since buildGraphFromDB divides ErrorCount/TotalTraces for error rates.
Node stats are also no longer subject to the 500k row cap (the aggregate is
unbounded; the cap now applies only to the edge scan).

Benchmark (5000 spans, in-memory SQLite): 22.1ms/5.1MB/150k allocs vs
37.4ms/14.6MB/310k allocs before — the gap widens with row count and
attribute payload size.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mirror the 10s system-graph cache pattern from graph_handler.go (TTLCache +
X-Cache HIT/MISS header) on handleGetServiceMapMetrics with a 30s TTL. The
cache key uses the raw start/end query params, so the default rolling window
(no params) shares a single entry per tenant instead of being re-keyed on
every request timestamp; explicit windows are cached independently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ns Mono subsets

@tanstack/react-query@5, @tanstack/react-virtual@3, wouter@3, cmdk,
Radix dialog/tooltip/tabs/dropdown-menu, uplot (all MIT/ISC) and the
OFL latin woff2 subsets (Inter variable 47.1KB + JetBrains Mono 400
20.7KB = 67.8KB, under the 90KB font budget) vendored into
ui/public/fonts with their licenses. npm audit --audit-level=moderate
reports 0 vulnerabilities.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… (P2.2)

rebuildFromDBForTenant tracked nothing between ticks and re-read the
full trailing 1h of spans per tenant every 60s. Each tenant slice now
records the max start_time merged (lastRebuildMax); subsequent ticks
query start_time > max(since, HWM-5min) — the 5min overlap re-merges
late arrivals. A fresh slice (first build, post-eviction) has HWM 0 and
takes the full window, so P1.5 eviction stays self-healing. The 50k row
LIMIT is kept and now logs a warning when hit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…intenance

drainQuery must surface query errors; a cancelled context drives both SQLite
maintenance error branches (PRAGMA optimize + vacuum step) and proves the
overlap guard is released so the next tick still runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/api/system/graph already had a 10s tenant-scoped cache but re-encoded
the cached struct on every hit. It now stores the rendered JSON plus a
sha256-derived strong ETag (hashed once per cache fill) and honours
If-None-Match with a bodyless 304. /api/metrics/dashboard and
/api/stats get the same pattern via a shared cachedJSON helper:

- dashboard: key scoped by (tenant, raw query) so explicit
  start/end/service_name windows never share an entry; queries over
  256 bytes bypass the cache to bound key cardinality.
- stats: key scoped by tenant; the UI footer polls this and the
  COUNT(*) scans behind it are not free on a multi-GB SQLite file.

Steady-state polling becomes a map lookup + hash compare instead of a
SQLite query + JSON encode; clients echoing If-None-Match transfer no
body at all. Also drops the unreachable /ws,/v1,/metrics skip loop in
gzipEligible (the /api/ prefix gate subsumes it; the configurable MCP
path keeps its explicit exclusion).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…P2.3)

Each event path (processSpan/processLog/processMetric) stamps the
tenant's lastEventAt; detectAnomalies records its own start time and
skips tenants whose lastEventAt predates the previous tick — no events
means their service/metric stats cannot have changed, so re-walking
them only burned CPU and refreshed anomaly timestamps spuriously. The
first scan after startup always runs so DB-rebuilt state is examined
at least once.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sonar S1192 flagged the literal duplicated three times in the UI asset
server; the gzip middleware used it twice more. Same treatment as the
existing HeaderContentType.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ient, WS singleton

- lib/format.ts: central percent/duration/count/size formatters fixing
  the two audit bugs (error_rate 0.042 -> "4.2%", health_score 0.73 ->
  "73%") with explicit ratio/percent/auto units.
- lib/apiFetch.ts: the one fetch wrapper — AbortSignal passthrough,
  JSON parse, ApiError normalization (status 0 = network), aborts
  re-thrown untouched.
- lib/queryClient.ts: staleTime 10s (matches server graph TTL), gcTime
  5min, refetchIntervalInBackground:false, retry 2 with jittered
  exponential backoff.
- lib/wsManager.ts: module-level /ws singleton porting the
  backoff/heartbeat/visibility logic out of useWebSocket, adding +20%
  reconnect jitter, a 5000-cap log ring buffer, a 250ms-coalesced
  version counter, and useSyncExternalStore subscribe/snapshot pairs.
  Fixes a latent hook bug: the dead-connection watchdog was re-armed on
  every ping, so its deadline slid forever and never fired.

46 tests (fake timers + mock WebSocket).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…aths

Drives a real refreshLoop tick (20ms cadence) to prove trace prune,
signal prune and idle-tenant eviction fire from the loop itself, seeds
exactly rebuildRowLimit rows to exercise the limit-hit branch, and
wires a shared telemetry.Metrics (sync.Once — promauto panics on
duplicate registration) so the Prometheus increments in
evictIdleTenants and recordEventDrop are executed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
aksOps and others added 18 commits June 11, 2026 18:13
/map now renders the layered DAG (FlowMap + dagLayout): semantic-zoom
err% labels at k>=0.8, log-width edges, dash animation only on failing
edges under no-preference motion, transform-only pan/wheel/pinch with
touch-action:none scoped to the SVG, re-layout gated on the node/edge
set hash, accent-ring selection with 1-hop dimming, and full keyboard
walking (arrows/Enter/Esc/f//). Toolbar: filter with clear, status
pills, fit, legend popover, dataUpdatedAt freshness. xs defaults to a
status-grouped card list (shared ServiceRow) with a Flow toggle.

App mounts the lazy ServiceInspector (?service=) and the TrailBar
globally. Deletes ServicesView/ServiceGraph/ServiceList/
ServiceSidePanel/StatRow/useWindowHeight and the cytoscape +
cytoscape-cose-bilkent deps; the map route budget exception drops
170KB -> 35KB cap (measured 5.41KB gz vs 156.33KB before, -150.9KB).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/ now mounts TriageView: a 24h anomaly strip fed by the MCP
get_anomaly_timeline triage tool (the UI's first human-facing MCP call;
severity-colored clustered ticks, tap opens the Inspector) over the
service feed ranked critical → degraded → alerted with healthy services
collapsed (shared ServiceGroups, also the map's xs list). The global
error sparkline renders only when /api/metrics/traffic is already in
the query cache (enabled:false observer — zero new polling); useTraffic
is ported to TanStack Query so the Dashboard fills that shared key.
Empty state inlines the copyable OTLP/MCP connect endpoints (extracted
from ConnectPopover). Shell nav gains Triage as the first tab; unknown
routes now land on / instead of /map.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The C5 evidence pages need two fields the backend already stores but
never serialized:

- views.Span gains `status` (OTLP code, e.g. STATUS_CODE_ERROR) so the
  trace waterfall can color error spans --crit. The storage model has
  carried Status since ingest day one; only the view dropped it.
- GET /api/logs honors the `trace_id` query param (exact match through
  the existing storage.LogFilter.TraceID). The traces→logs cross-link
  needs a deterministic filter on every driver — body `search` only
  matches trace IDs on the LIKE fallback, not under FTS5, and carries a
  24h clamp that an indexed exact match should not inherit.

ui/src/types/api.ts mirrors both: Span.status, and the LogsResponse
envelope corrected to the actual {data, total} handler shape.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pure, exhaustively unit-tested logic for the C5 evidence pages
(TDD-first; components consume these without re-deriving anything):

- lib/waterfall.ts — time-positioned trace waterfall layout: DFS row
  order with start-time/span_id-stable sorting, depth via
  parent_span_id (orphans become roots, cycles guarded), offset/width
  as fractions of the trace extent, deterministic FNV-1a service hue.
- lib/traceRows.ts — nearest-rank percentile + visible-set p99 for the
  inline duration bars, OTLP status→tone mapping, and the /traces URL
  filter round-trip (service/status/q/trace).
- lib/logRows.ts — severity normalization for pills/badges,
  HH:mm:ss.SSS time gutter, per-flush-tick buffer filter/counts, and
  lazy attributes parsing for both producer shapes (plain object and
  the OTLP []KeyValue array the ingester marshals).
- hooks/useDebouncedValue.ts — 300ms trailing debounce for the server
  search inputs.
- lib/wsManager.ts — getLogsTotal(): monotonic appended count that
  survives ring eviction; consumers diff readings for "N new" pills.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/logs — live tail over the wsManager /ws ring buffer:
- virtualized fixed-height rows (28px / 44px on xs+coarse, overscan 8),
  severity badge, HH:mm:ss.SSS time, mono service, single-line body;
  xs collapses the gutter to a 3px severity left-border
- auto-follow pinned to bottom; scroll-up pauses follow with a
  "N new" pill that jumps back down (never yanks scroll); explicit
  Pause freezes a buffer snapshot and counts missed entries
- row click expands in place: full body, lazily-parsed attributes,
  ai_insight, "Show context" (GET /api/logs/context) and "Open
  trace" cross-link; severity pills with live per-tick counts
- search input (labelled, 300ms debounce) switches to the server
  GET /api/logs mode with offset paging and a "Back to live"
  affordance; ?trace= deep link scopes the history to one trace

/traces — virtualized table + real time-positioned waterfall:
- status badge, service, mono operation, span count, duration with an
  inline bar relative to the visible-set p99; xs two-line cards
- service (from /api/metadata/services), status and trace-ID search
  filters all in URL params; cursor-style "Load more" paging
- detail: single-SVG waterfall (offset = start − trace start, width =
  duration, depth-indented, service-hash hue at low saturation, error
  spans --crit, keyboard-walkable bars); span click reveals lazy
  attributes + correlated logs; "Open in Logs" cross-link
- master-detail ~55/45 split on lg+, full-screen push below

Both routes ship skeleton/empty/error/refetch states, 44px coarse
targets, hover:hover gating and token-only colors. Shell nav gains
Traces/Logs entries; budgets.json pins the new chunks (LogsView 25KB,
TracesView 30KB gz — measured 3.4/4.7KB).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…, connect inline

Drives the FlowMap wheel/pointer handlers (zoom transform, semantic-zoom
label hiding, drag pan, two-pointer pinch) and the inspector bottom
sheet's drag-to-snap/dismiss through fireEvent; adds the TanStack
useTraffic adapter and ConnectInline clipboard tests. setPointerCapture
joins the guarded jsdom stubs. Authored-file line coverage 98.7%, no
file below 85%.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… dagLayout

Pull the back-edge-dropping DFS out of breakCycles into a named helper and
replace bare .sort() calls with an explicit compareIds — same UTF-16
code-unit ordering, but the determinism contract is now stated in code
rather than implied by Array.prototype.sort defaults.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
parseRankedCauses/parseImpactResult/parseAnomalies centralised in
lib/mcpResults (tolerant of Go's null-marshal and Error: text payloads);
lib/triageVerbs exposes per-service query options for the Why/Impact
inspector tabs with a 5-minute cache window shared with the palette.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…erbs

root_cause_analysis and impact_analysis land in the tab registry behind a
run/cancel hook (no RPC until asked, per-service 5-min cache so tab flips
are instant, AbortSignal cancel). Why renders ranked causes with score
bars, evidence and error-chain spans deep-linking to /traces via a new
openTrace trail push; Impact renders the depth-grouped blast radius with
a Show-on-map handoff. ?tab= deep-links a registry tab for the palette.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The Inspector's Show-on-map handoff lands here: /map?impact=<service>
BFS-walks the loaded edge set client-side (no extra RPC), tints the
downstream cone --crit with depth-faded opacity, rings the root, dims
everything outside the cone, and announces the overlay in a clearable
banner. Impact mode forces the SVG rendering on xs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cmdk palette (lazy chunk, zero cost until first ⌘K) with Navigate /
Actions / Services / Utilities sections. The triage verbs are two-step:
pick the action, pick a service — root-cause and blast-radius prefetch
the MCP RPC and land on the Inspector's Why/Impact tab through the shared
query key; search-logs deep-links the new /logs?service= scoped search.
Global keys ride a pure tested state machine: ⌘/Ctrl+K everywhere, 'g
m/t/l/h' chords, '?' shortcut sheet, '/' left to page-local filters.
Palette buttons land in the pulse bar (md+) and tab-bar center (xs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…P console

The Triage home + Pulse bar replaced the dashboard; the MCP console's
function lives in the Connect popover and the palette's triage verbs.
Routes, nav items and the VIEW_PATHS indirection are gone — nav is
exactly Triage / Flow Map / Traces / Logs (the xs bottom-bar spec) and
retired paths redirect home. The Suspense fallback is a token-CSS
spinner; main.tsx drops ThemeProvider/ToastRegion (data-theme tokens are
the theme system). Removed orphans: useMcpCall/useMcpTools/useAnomalies/
useTraffic/useReady hooks, Truncate, lib/breakpoints, lib/utils, and the
unused listMcpTools surface. Deps uninstalled: @ossrandom/design-system,
uplot, clsx, both @fontsource packages (fonts are vendored in
ui/public/fonts). tsconfig lib/target now declares the ES2022 the code
already uses (the DS types were masking Array.prototype.at).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Budgets reset to post-exit actuals + ~10% headroom: initialJs 160→118 KB
(measured 107.6), initialCss 24→6 (2.7), fonts 90→75 (67.8),
lazyChunkDefault 40→10 (largest chunk 7.9); the per-route exceptions are
gone — everything fits the default. check-budgets now strips Vite's
8-char hash correctly even when it contains a dash, so future exceptions
match. ErrorBoundary drops the dead design-system var names for our
tokens (hex fallbacks are the tokens' dark values, for the
stylesheet-missing case) and its inline transitions ride --dur-2 so
reduced-motion zeroes them. Palette input signals focus on its hairline;
RouteFallback and the LogsView mode selection clear sonar S6772/S3358.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e, design-system exit

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Pin form controls to 16px on coarse pointers (iOS zooms any focused
  control below 16px; the product scale tops out at 13.5px).
- body owns surface continuity: token background with a subtle top
  luminance ramp (--bg-app), no white rubber-band overscroll, font
  smoothing, tightened Inter tracking, tap-highlight removed.
- theme-color meta synced to the active theme (browser chrome matches
  the surface), SVG favicon, token-colored thin scrollbars, ::selection,
  contained focus treatment for text fields, --shadow-pop token.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@aksOps aksOps enabled auto-merge (squash) June 12, 2026 02:23
@socket-security

socket-security Bot commented Jun 12, 2026


Comment thread ui/scripts/precompress.mjs Dismissed
Comment thread ui/scripts/precompress.mjs Dismissed
Comment thread ui/scripts/precompress.mjs Dismissed
Comment thread ui/scripts/precompress.mjs Dismissed
Comment thread internal/api/compress.go Dismissed
Comment thread internal/graphrag/builder.go Dismissed
aksOps and others added 5 commits June 12, 2026 02:27
path.Clean alone is not a traversal sanitizer; root the path first so it
can never climb above /. Reads already go through fs.FS (fs.ValidPath
rejects ..) — this is defense in depth and unblocks the SAST gate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- explicit sort comparators (S2871) in FlowMap shape key and two tests;
  the FlowMap edge comparator also never returned 0 for equal keys
- linear-time trailing-zero trim in format.ts (S5852 backtracking hotspot)
- backoff jitter now draws from Web Crypto via lib/random.ts (S2245);
  wsManager tests stub the module instead of Math.random
- range-validate GRAPHRAG_EVENT_QUEUE_SIZE 1..1e6 (CodeQL
  go/uncontrolled-allocation-size: env value flowed unchecked into
  make(chan event, n) where each event is ~0.5-2 KB)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/0+$/ still backtracks polynomially when retried from every input
position; no regex means no hotspot and strictly less work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…select

runWithTimeout's handler goroutine cancels the call context (deferred)
after sending its result, so a fast handler leaves both select cases
ready. Go's select then picks one at random, and whenever the goroutine
won the schedule the timeout branch returned -32001 about 50% of the
time, seen as instant 'exceeded 30s deadline' failures in CI's race job.
Drain the buffered result channel before declaring a timeout: the send
happens-before the completion cancel, so a queued result is always found
and only a still-running handler times out.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
config.Validate already rejects out-of-range GRAPHRAG_EVENT_QUEUE_SIZE,
but the bound is invisible to callers constructing graphrag.Config
directly — and to CodeQL's go/uncontrolled-allocation-size flow, which
loses the config-boundary check across the struct-field round trip.
Clamp nonsense sizes to the default where the buffer is allocated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@sonarqubecloud


@aksOps aksOps merged commit 5d338e0 into main Jun 12, 2026
17 checks passed
@aksOps aksOps deleted the feat/memory-survival-and-triage-ui branch June 12, 2026 03:24
2 participants