feat: memory survival, hot-path performance, and triage-first UI rewrite #103
Merged
A 15-min SQLite soak at 120 services drove RSS to ~1.8 GB and climbing. Heap profiling (gc=1) attributed 84% of the live heap to AnomalyStore PRECEDED_BY edges: the 10s detector minted a NEW anomaly node every tick per erroring service (UnixNano-suffixed ID), and correlateWithRecent then created O(N^2) edges among them, unbounded until the 24h TTL.
- fix: stable per-(service,type) anomaly IDs so detection UPSERTS one evolving node instead of one-per-tick; this bounds both the node map and the edge mesh (AnomalyStore 272 MB -> 2.6 MB; peak RSS 1.8 GB -> 292 MB, now flat over the full 15 min). + regression test.
- feat: applyMemoryLimit() sets a soft GOMEMLIMIT at startup (honors an explicit env value, else 75% of the detected cgroup/host budget) so the GC paces against a ceiling instead of letting next_gc run away. Defense in depth; cgroup v2/v1 + /proc/meminfo detection, stdlib-only. + tests.
Validation: 3x 15-min soaks + heap profile; integrity ok, 0 drops/429s, 0 ERROR/panic, clean shutdown, goroutines/fds recover, 30k spans/120 svcs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- security(api): close cross-tenant read caused by middleware ordering: TenantMiddleware now passes through when auth already pinned a tenant (HasTenantContext), so a per-tenant key can't be escaped via X-Tenant-ID
- fix(ingest): correct token-bucket sampler math; the old cost (1/rate) exceeded the cap for rate<1.0 so ~100% of healthy spans were dropped (SQLite default 0.05 persisted almost no baseline traces)
- fix(api): clamp limit/offset on /api/logs and /api/traces (negative limit was passed to GORM as unlimited, a heap/DB DoS)
- fix(ingest): sanitize X-Tenant-ID on the HTTP OTLP path (gRPC parity)
- fix(mcp): don't cache error tool results; enforce the response byte cap in resourceResult (trace_graph DB fallback was uncapped)
- fix(ui): correct ServiceSidePanel test for split design-system markup, mount ErrorBoundary, derive connected badge from ws.status
- chore: bump go directive to 1.25.11 to unblock the OSV-Scanner CI gate
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
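A rough sketch of the limit/offset clamp described above. The helper names and the default and maximum values (100 and 1000) are assumptions for illustration; the commit does not state the actual bounds.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	defaultLimit = 100  // assumed default page size, not taken from the source
	maxLimit     = 1000 // assumed upper bound, not taken from the source
)

// clampLimit parses a raw query value. Missing, non-numeric, zero, or negative
// input falls back to the default; oversized input is capped.
func clampLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultLimit
	}
	if n > maxLimit {
		return maxLimit
	}
	return n
}

// clampOffset never returns a negative offset.
func clampOffset(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	fmt.Println(clampLimit("-1"), clampLimit("50"), clampLimit("999999"), clampOffset("-20"))
}
```

Clamping at the handler boundary keeps a negative value from ever reaching the query builder, whatever that layer would do with it.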
PPROF_ADDR (default 127.0.0.1:6060, empty disables) serves net/http/pprof from its own listener so profiling never reaches the public :8080 mux. Heap attribution is the prerequisite for proving every memory fix in the OOM-survival series (b1983f8 was only diagnosable via heap profiles). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
otelcontext_graphrag_store_entities{entity}/_edges{store},
otelcontext_tsdb_ring_series_active, otelcontext_drain_templates_active —
sampled every 15s via len()-under-RLock census so operators can attribute
RSS growth to a specific store without a heap profile.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
~100 MB of standing buffer at the Postgres default; on SQLite the single writer starves the workers anyway, so buffer less and let the metered drop path engage sooner. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Both the embedded-UI server (precompressed asset negotiation, index.html 304s) and the API layer (gzip middleware, cached-payload ETags) need the same two pure helpers; centralising them in httpconst avoids duplicating header-parsing logic across packages. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…mbudget Move detectMemoryBudget/readCgroupBytes/readMemTotal (cgroup v2 -> v1 -> /proc/meminfo) out of the root-package memlimit.go into a reusable internal/membudget package so internal/storage can budget-scale the SQLite PRAGMA stanza without importing package main. applyMemoryLimit stays as a thin wrapper with unchanged behavior; detection tests moved across verbatim and the wrapper gained its own tests (operator GOMEMLIMIT honored, budget fraction applied). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds GRAPHRAG_TRACE_TTL (default 1h, SQLite default 30m via sqliteOverrides), GRAPHRAG_MAX_SPANS_PER_TENANT (default 500000) and GRAPHRAG_TENANT_IDLE_TTL (default 24h). The TraceStore span window is the largest GraphRAG heap consumer at 120 services; anomaly and investigation lookbacks are <=5min so the 30min SQLite window is safe. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the bare http.FileServer over the embedded dist with a purpose-built spaHandler:
- /assets/* (Vite hashed filenames): Accept-Encoding negotiation against build-time .br/.gz siblings, Content-Type from the original extension, Cache-Control: public, max-age=31536000, immutable + Vary.
- index.html: Cache-Control: no-cache with a startup-computed sha256 ETag; If-None-Match returns 304, precompressed variants served when accepted.
- SPA fallback: unknown extensionless paths serve index.html for client-side routing, preserving the old spaFS dot-in-path 404 contract and explicitly refusing machine namespaces (/api, /v1, /ws, /mcp, /metrics).
- Source-only checkouts (stub dist) degrade to a descriptive 404.
Embed directives unchanged; siblings are emitted by the build-time precompress script (wired separately) and picked up by all:dist.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pre-existing lint-gate failures (gofmt struct alignment, nolintlint on seriesPerTenant which the unused linter no longer flags). No semantic change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tion
Two bugs in the RingBuffer dashboard accelerator:
(a) the rings map was uncapped and bypassed cardinality enforcement —
Aggregator.Ingest fed the ring before the bucket-level cap check, so
a cardinality flood grew Go heap without bound on SQLite deploys.
(b) rings were keyed service|metric, merging points from different
tenants into one series — a data-isolation breach.
Fixes:
- key rings by tenant|service|metric; empty tenant coerces to
storage.DefaultTenantID so single-tenant reads and writes agree
- RingBuffer gains maxSeries (0 = unlimited, backward compatible):
NEW series past the cap are refused (Record returns false) while
existing series keep recording; refusals fire onSeriesRejected
outside the lock
- new counter otelcontext_tsdb_ring_series_rejected_total in
internal/telemetry/metrics.go surfaces refusals
- startup wiring: NewRingBuffer now takes METRIC_MAX_CARDINALITY and
the counter's Inc as the rejection callback (main.go, one line)
QueryRecent/Record/AllKeys callers: only Aggregator.Ingest and main.go —
no external QueryRecent callers exist in the tree.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
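The two RingBuffer fixes can be sketched together as follows. This is a simplified stand-in, not the project's implementation: the `defaultTenantID` value, the `pointsPerSeries` window, and the method signatures are assumptions. It shows the tenant-qualified key, the cap that refuses only NEW series, and the rejection callback running outside the lock.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	defaultTenantID = "default" // stand-in for storage.DefaultTenantID; real value not shown in the source
	pointsPerSeries = 4         // assumed per-series window, for illustration only
)

// ringKey scopes every series to a tenant so two tenants never share points.
func ringKey(tenant, service, metric string) string {
	if tenant == "" {
		tenant = defaultTenantID
	}
	return tenant + "|" + service + "|" + metric
}

type RingBuffer struct {
	mu        sync.Mutex
	series    map[string][]float64
	maxSeries int    // 0 = unlimited
	onReject  func() // invoked outside the lock
}

func NewRingBuffer(maxSeries int, onReject func()) *RingBuffer {
	return &RingBuffer{series: make(map[string][]float64), maxSeries: maxSeries, onReject: onReject}
}

// Record appends a point. New series past the cap are refused; existing
// series keep recording.
func (r *RingBuffer) Record(tenant, service, metric string, v float64) bool {
	key := ringKey(tenant, service, metric)
	r.mu.Lock()
	pts, exists := r.series[key]
	if !exists && r.maxSeries > 0 && len(r.series) >= r.maxSeries {
		r.mu.Unlock()
		if r.onReject != nil {
			r.onReject()
		}
		return false
	}
	pts = append(pts, v)
	if len(pts) > pointsPerSeries {
		pts = pts[len(pts)-pointsPerSeries:] // keep only the newest points
	}
	r.series[key] = pts
	r.mu.Unlock()
	return true
}

func main() {
	rejected := 0
	rb := NewRingBuffer(1, func() { rejected++ })
	fmt.Println(rb.Record("acme", "api", "latency", 1)) // first series is admitted
	fmt.Println(rb.Record("beta", "api", "latency", 1)) // new series past the cap
	fmt.Println(rb.Record("acme", "api", "latency", 2)) // existing series still records
	fmt.Println("rejected:", rejected)
}
```

Calling the callback after releasing the mutex means a slow or re-entrant metrics hook cannot stall writers.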
New zero-dep ui/scripts/precompress.mjs (node:zlib only — air-gapped
build friendly) emits .br (quality 11) and .gz (level 9) siblings for
every dist *.{js,css,html,svg,json} >= 1KB, printing a size table.
Wired into the ui build script after vite build; the Go binary's
all:dist embed picks the siblings up and internal/ui negotiates them
via Accept-Encoding.
dist content stays uncommitted (source-only main) — the script runs at
release time via the existing build flow.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… auto_vacuum The pure-Go SQLite driver's page cache and mmap window are Go-heap and address-space costs, so the hardcoded 256 MB cache + 1 GB mmap pair starved 4 GB hosts. The stanza now scales both against the detected memory budget (internal/membudget: cgroup v2 -> v1 -> /proc/meminfo): cache = budget/32 clamped to [64 MB, 256 MB], mmap = budget/8 clamped to [256 MB, 1 GB] — a 4 GB host yields 128 MB cache + 512 MB mmap. Detection failure falls back to the previous hardcoded values; operator overrides SQLITE_CACHE_SIZE_KB / SQLITE_MMAP_SIZE_BYTES win unconditionally. The fail-closed Exec loop and every other pragma are unchanged (round-trip-guarded by tests). Also sets PRAGMA auto_vacuum=INCREMENTAL best-effort (log, never abort) BEFORE journal_mode=WAL — the WAL switch initializes the file header, after which the stored auto_vacuum mode is frozen. Only takes effect on databases this process creates; prepares fresh deploys for incremental_vacuum-based retention maintenance. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
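The scaling rule above reduces to two clamped divisions. The sketch below reproduces the arithmetic only; the function name and the convention that a non-positive budget means "detection failed" are assumptions, and operator overrides are left out.

```go
package main

import "fmt"

const mb = int64(1) << 20

func clamp(v, lo, hi int64) int64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// sqliteMemory returns the page-cache and mmap sizes in bytes for a detected
// memory budget: cache = budget/32 in [64 MB, 256 MB], mmap = budget/8 in
// [256 MB, 1 GB]. A non-positive budget (assumed here to signal detection
// failure) falls back to the previous hardcoded 256 MB + 1 GB pair.
func sqliteMemory(budget int64) (cache, mmap int64) {
	if budget <= 0 {
		return 256 * mb, 1024 * mb
	}
	return clamp(budget/32, 64*mb, 256*mb), clamp(budget/8, 256*mb, 1024*mb)
}

func main() {
	for _, gb := range []int64{1, 4, 16} {
		cache, mmap := sqliteMemory(gb * 1024 * mb)
		fmt.Printf("%2d GB budget: cache=%d MB mmap=%d MB\n", gb, cache/mb, mmap/mb)
	}
}
```

A 4 GB budget gives 4096/32 = 128 MB of cache and 4096/8 = 512 MB of mmap, matching the figures in the commit message; smaller and larger hosts hit the lower and upper clamps respectively.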
TraceStore gains MaxSpans (GRAPHRAG_MAX_SPANS_PER_TENANT, default 500k): at the cap, NEW spans are skipped — UpsertSpan returns false and processSpan records the drop via the existing recordEventDrop seam as signal="span_capacity" on otelcontext_graphrag_events_dropped_total. Updates to resident span IDs still apply; the graph is best-effort and the DB stays the source of truth. main.go wires MaxSpansPerTenant plus the GRAPHRAG_TRACE_TTL / GRAPHRAG_TENANT_IDLE_TTL duration knobs (unparsable values fall back to package defaults, DLQ-interval style). Also lands the tenantStores lastAccess/lastEventAt/lastRebuildMax bookkeeping fields and the storesForTenant/tenantStoresNoTouch split consumed by the follow-up eviction and incremental-rebuild commits. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New internal/api/compress.go: pooled gzip.Writer (sync.Pool) wrapping the ResponseWriter for GET /api/* when the client accepts gzip. Lazy engagement on first write — 204/304 and already-encoded responses pass through, Content-Length is dropped, Vary: Accept-Encoding is set on every eligible path. Flush propagates through the gzip buffer so streaming handlers keep incremental delivery; Unwrap supports http.ResponseController. Wired in main.go as the innermost wrapper (directly around the mux, before TenantMiddleware) so only handler output is compressed and /ws*, /v1/*, /metrics*, and the MCP/SSE path are untouched — WebSocket hijacking keeps the raw writer by construction. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The queue gated only on item count; each Batch holds unbounded Traces/Spans/Logs slices, so 50k queued batches could pin GBs of heap. Add byte accounting:
- (*Batch).approxBytes(): O(records) heap estimate, computed once at Submit and released by process() (deferred; panic path included)
- hard cap INGEST_PIPELINE_MAX_BYTES (default 512MB, SQLite default 128MB, 1MB floor): at the cap Submit rejects with ErrQueueFull even for priority batches, because a 429 is recoverable and an OOM kill is not
- soft backpressure now fires on max(itemFullness, byteFullness)
- new gauge otelcontext_ingest_pipeline_queue_bytes; drop reason "bytes_full" on otelcontext_ingest_pipeline_dropped_total
- Stats() exposes QueueBytes / MaxBytes / RejectedBytes
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…(P1.5) Tenants with no ingest event or query for GRAPHRAG_TENANT_IDLE_TTL (default 24h) are dropped from the coordinator map on the 60s refresh tick, freeing all four per-tenant stores at once. storage.DefaultTenantID is immune. Eviction is self-healing: rebuildAllTenantsFromDB re-creates genuinely active tenants from recent spans within one tick, and any ingest/query re-creates the slice instantly with a fresh idle window. The DB rebuild path deliberately uses tenantStoresNoTouch so 60s bookkeeping alone cannot keep a dormant tenant alive. New counter otelcontext_graphrag_tenants_evicted_total in internal/telemetry/metrics.go (shared file — promauto pattern). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The daily maintenance pass ran a full VACUUM, which holds a whole-file exclusive lock for 10-60 minutes on multi-GB databases (admin_handlers.go documents this for drop_fts) — ingest stalls into a 429 storm and the async pipeline queue/RAM spiral. The SQLite branch now keeps PRAGMA optimize and runs PRAGMA incremental_vacuum(10000) instead, releasing up to 10k freelist pages per day without the exclusive lock. Fresh databases are provisioned auto_vacuum=INCREMENTAL by NewDatabase; on legacy auto_vacuum=NONE files the pragma is a harmless no-op. incremental_vacuum is a row-returning pragma that frees one page per step — Exec steps it once on glebarez/modernc, so the statement is queried and drained (drainQuery) to reclaim the full batch. Operators can restore the legacy behavior with RETENTION_FULL_VACUUM=true (new config field, default false); POST /api/admin/vacuum remains for on-demand full VACUUM. Postgres/MySQL maintenance is byte-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…1.6) SignalStore.Prune (modeled on TraceStore.Prune) runs on the 60s refresh tick: MetricNodes idle >24h are dropped, the oldest-LastSeen overflow past 2000/tenant is evicted, each removal takes its MEASURED_BY edge, and remaining edges are swept by UpdatedAt. The metric/log-cluster upsert paths now refresh edge UpdatedAt on every hit so the sweep never severs a live correlation (previously edges kept their creation time forever). LogClusters stay untouched — Drain bounds them upstream. AnomaliesSince gains a bounded sibling AnomaliesSinceLimit; correlateWithRecent walks at most 1000 recent anomalies per detection so a pathological backlog cannot turn each 10s tick into an O(N) scan plus O(N) PRECEDED_BY fan-out. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
GetServiceMapMetrics loaded up to 500k full Span rows — zstd-decompressing the CompressedText attributes column per row in Scan() — then aggregated in Go, costing 2-5s and ~300MB transient on dashboard loads. Node stats now come from one portable GROUP BY aggregate (COUNT/AVG/SUM CASE WHEN, runs on sqlite/postgres/mysql/sqlserver; duration * 1.0 keeps AVG in floating point where AVG(bigint) truncates). Edge stats keep the in-Go parent-resolution pass but over a six-column projection so the attributes column is never scanned. Output proven equivalent to the old implementation by a reference-copy equality test on a fixture covering every branch. One intentional delta: ServiceMapNode.ErrorCount is now populated from span status (STATUS_CODE_ERROR). The old path left it permanently 0 — a latent bug, since buildGraphFromDB divides ErrorCount/TotalTraces for error rates. Node stats are also no longer subject to the 500k row cap (the aggregate is unbounded; the cap now applies only to the edge scan). Benchmark (5000 spans, in-memory SQLite): 22.1ms/5.1MB/150k allocs vs 37.4ms/14.6MB/310k allocs before — the gap widens with row count and attribute payload size. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mirror the 10s system-graph cache pattern from graph_handler.go (TTLCache + X-Cache HIT/MISS header) on handleGetServiceMapMetrics with a 30s TTL. The cache key uses the raw start/end query params, so the default rolling window (no params) shares a single entry per tenant instead of being re-keyed on every request timestamp; explicit windows are cached independently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ns Mono subsets @tanstack/react-query@5, @tanstack/react-virtual@3, wouter@3, cmdk, Radix dialog/tooltip/tabs/dropdown-menu, uplot (all MIT/ISC) and the OFL latin woff2 subsets (Inter variable 47.1KB + JetBrains Mono 400 20.7KB = 67.8KB, under the 90KB font budget) vendored into ui/public/fonts with their licenses. npm audit --audit-level=moderate reports 0 vulnerabilities. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… (P2.2) rebuildFromDBForTenant tracked nothing between ticks and re-read the full trailing 1h of spans per tenant every 60s. Each tenant slice now records the max start_time merged (lastRebuildMax); subsequent ticks query start_time > max(since, HWM-5min) — the 5min overlap re-merges late arrivals. A fresh slice (first build, post-eviction) has HWM 0 and takes the full window, so P1.5 eviction stays self-healing. The 50k row LIMIT is kept and now logs a warning when hit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…intenance drainQuery must surface query errors; a cancelled context drives both SQLite maintenance error branches (PRAGMA optimize + vacuum step) and proves the overlap guard is released so the next tick still runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/api/system/graph already had a 10s tenant-scoped cache but re-encoded the cached struct on every hit. It now stores the rendered JSON plus a sha256-derived strong ETag (hashed once per cache fill) and honours If-None-Match with a bodyless 304. /api/metrics/dashboard and /api/stats get the same pattern via a shared cachedJSON helper: - dashboard: key scoped by (tenant, raw query) so explicit start/end/service_name windows never share an entry; queries over 256 bytes bypass the cache to bound key cardinality. - stats: key scoped by tenant; the UI footer polls this and the COUNT(*) scans behind it are not free on a multi-GB SQLite file. Steady-state polling becomes a map lookup + hash compare instead of a SQLite query + JSON encode; clients echoing If-None-Match transfer no body at all. Also drops the unreachable /ws,/v1,/metrics skip loop in gzipEligible (the /api/ prefix gate subsumes it; the configurable MCP path keeps its explicit exclusion). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…P2.3) Each event path (processSpan/processLog/processMetric) stamps the tenant's lastEventAt; detectAnomalies records its own start time and skips tenants whose lastEventAt predates the previous tick — no events means their service/metric stats cannot have changed, so re-walking them only burned CPU and refreshed anomaly timestamps spuriously. The first scan after startup always runs so DB-rebuilt state is examined at least once. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sonar S1192 flagged the literal duplicated three times in the UI asset server; the gzip middleware used it twice more. Same treatment as the existing HeaderContentType. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ient, WS singleton
- lib/format.ts: central percent/duration/count/size formatters fixing the two audit bugs (error_rate 0.042 -> "4.2%", health_score 0.73 -> "73%") with explicit ratio/percent/auto units.
- lib/apiFetch.ts: the one fetch wrapper: AbortSignal passthrough, JSON parse, ApiError normalization (status 0 = network), aborts re-thrown untouched.
- lib/queryClient.ts: staleTime 10s (matches server graph TTL), gcTime 5min, refetchIntervalInBackground:false, retry 2 with jittered exponential backoff.
- lib/wsManager.ts: module-level /ws singleton porting the backoff/heartbeat/visibility logic out of useWebSocket, adding +20% reconnect jitter, a 5000-cap log ring buffer, a 250ms-coalesced version counter, and useSyncExternalStore subscribe/snapshot pairs. Fixes a latent hook bug: the dead-connection watchdog was re-armed on every ping, so its deadline slid forever and never fired.
46 tests (fake timers + mock WebSocket).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…aths Drives a real refreshLoop tick (20ms cadence) to prove trace prune, signal prune and idle-tenant eviction fire from the loop itself, seeds exactly rebuildRowLimit rows to exercise the limit-hit branch, and wires a shared telemetry.Metrics (sync.Once — promauto panics on duplicate registration) so the Prometheus increments in evictIdleTenants and recordEventDrop are executed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/map now renders the layered DAG (FlowMap + dagLayout): semantic-zoom err% labels at k>=0.8, log-width edges, dash animation only on failing edges under no-preference motion, transform-only pan/wheel/pinch with touch-action:none scoped to the SVG, re-layout gated on the node/edge set hash, accent-ring selection with 1-hop dimming, and full keyboard walking (arrows/Enter/Esc/f//). Toolbar: filter with clear, status pills, fit, legend popover, dataUpdatedAt freshness. xs defaults to a status-grouped card list (shared ServiceRow) with a Flow toggle. App mounts the lazy ServiceInspector (?service=) and the TrailBar globally. Deletes ServicesView/ServiceGraph/ServiceList/ServiceSidePanel/StatRow/useWindowHeight and the cytoscape + cytoscape-cose-bilkent deps; the map route budget exception drops 170KB -> 35KB cap (measured 5.41KB gz vs 156.33KB before, -150.9KB). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/ now mounts TriageView: a 24h anomaly strip fed by the MCP get_anomaly_timeline triage tool (the UI's first human-facing MCP call; severity-colored clustered ticks, tap opens the Inspector) over the service feed ranked critical → degraded → alerted with healthy services collapsed (shared ServiceGroups, also the map's xs list). The global error sparkline renders only when /api/metrics/traffic is already in the query cache (enabled:false observer — zero new polling); useTraffic is ported to TanStack Query so the Dashboard fills that shared key. Empty state inlines the copyable OTLP/MCP connect endpoints (extracted from ConnectPopover). Shell nav gains Triage as the first tab; unknown routes now land on / instead of /map. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The C5 evidence pages need two fields the backend already stores but
never serialized:
- views.Span gains `status` (OTLP code, e.g. STATUS_CODE_ERROR) so the
trace waterfall can color error spans --crit. The storage model has
carried Status since ingest day one; only the view dropped it.
- GET /api/logs honors the `trace_id` query param (exact match through
the existing storage.LogFilter.TraceID). The traces→logs cross-link
needs a deterministic filter on every driver — body `search` only
matches trace IDs on the LIKE fallback, not under FTS5, and carries a
24h clamp that an indexed exact match should not inherit.
ui/src/types/api.ts mirrors both: Span.status, and the LogsResponse
envelope corrected to the actual {data, total} handler shape.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pure, exhaustively unit-tested logic for the C5 evidence pages (TDD-first; components consume these without re-deriving anything):
- lib/waterfall.ts: time-positioned trace waterfall layout with DFS row order (start-time/span_id-stable sorting), depth via parent_span_id (orphans become roots, cycles guarded), offset/width as fractions of the trace extent, deterministic FNV-1a service hue.
- lib/traceRows.ts: nearest-rank percentile + visible-set p99 for the inline duration bars, OTLP status→tone mapping, and the /traces URL filter round-trip (service/status/q/trace).
- lib/logRows.ts: severity normalization for pills/badges, HH:mm:ss.SSS time gutter, per-flush-tick buffer filter/counts, and lazy attributes parsing for both producer shapes (plain object and the OTLP []KeyValue array the ingester marshals).
- hooks/useDebouncedValue.ts: 300ms trailing debounce for the server search inputs.
- lib/wsManager.ts: getLogsTotal(), a monotonic appended count that survives ring eviction; consumers diff readings for "N new" pills.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/logs (live tail over the wsManager /ws ring buffer):
- virtualized fixed-height rows (28px / 44px on xs+coarse, overscan 8), severity badge, HH:mm:ss.SSS time, mono service, single-line body; xs collapses the gutter to a 3px severity left-border
- auto-follow pinned to bottom; scroll-up pauses follow with a "N new" pill that jumps back down (never yanks scroll); explicit Pause freezes a buffer snapshot and counts missed entries
- row click expands in place: full body, lazily-parsed attributes, ai_insight, "Show context" (GET /api/logs/context) and "Open trace" cross-link; severity pills with live per-tick counts
- search input (labelled, 300ms debounce) switches to the server GET /api/logs mode with offset paging and a "Back to live" affordance; ?trace= deep link scopes the history to one trace
/traces (virtualized table + real time-positioned waterfall):
- status badge, service, mono operation, span count, duration with an inline bar relative to the visible-set p99; xs two-line cards
- service (from /api/metadata/services), status and trace-ID search filters all in URL params; cursor-style "Load more" paging
- detail: single-SVG waterfall (offset = start − trace start, width = duration, depth-indented, service-hash hue at low saturation, error spans --crit, keyboard-walkable bars); span click reveals lazy attributes + correlated logs; "Open in Logs" cross-link
- master-detail ~55/45 split on lg+, full-screen push below
Both routes ship skeleton/empty/error/refetch states, 44px coarse targets, hover:hover gating and token-only colors. Shell nav gains Traces/Logs entries; budgets.json pins the new chunks (LogsView 25KB, TracesView 30KB gz; measured 3.4/4.7KB).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…, connect inline Drives the FlowMap wheel/pointer handlers (zoom transform, semantic-zoom label hiding, drag pan, two-pointer pinch) and the inspector bottom sheet's drag-to-snap/dismiss through fireEvent; adds the TanStack useTraffic adapter and ConnectInline clipboard tests. setPointerCapture joins the guarded jsdom stubs. Authored-file line coverage 98.7%, no file below 85%. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… dagLayout Pull the back-edge-dropping DFS out of breakCycles into a named helper and replace bare .sort() calls with an explicit compareIds — same UTF-16 code-unit ordering, but the determinism contract is now stated in code rather than implied by Array.prototype.sort defaults. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
parseRankedCauses/parseImpactResult/parseAnomalies centralised in lib/mcpResults (tolerant of Go's null-marshal and Error: text payloads); lib/triageVerbs exposes per-service query options for the Why/Impact inspector tabs with a 5-minute cache window shared with the palette. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…erbs root_cause_analysis and impact_analysis land in the tab registry behind a run/cancel hook (no RPC until asked, per-service 5-min cache so tab flips are instant, AbortSignal cancel). Why renders ranked causes with score bars, evidence and error-chain spans deep-linking to /traces via a new openTrace trail push; Impact renders the depth-grouped blast radius with a Show-on-map handoff. ?tab= deep-links a registry tab for the palette. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The Inspector's Show-on-map handoff lands here: /map?impact=<service> BFS-walks the loaded edge set client-side (no extra RPC), tints the downstream cone --crit with depth-faded opacity, rings the root, dims everything outside the cone, and announces the overlay in a clearable banner. Impact mode forces the SVG rendering on xs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cmdk palette (lazy chunk, zero cost until first ⌘K) with Navigate / Actions / Services / Utilities sections. The triage verbs are two-step: pick the action, pick a service — root-cause and blast-radius prefetch the MCP RPC and land on the Inspector's Why/Impact tab through the shared query key; search-logs deep-links the new /logs?service= scoped search. Global keys ride a pure tested state machine: ⌘/Ctrl+K everywhere, 'g m/t/l/h' chords, '?' shortcut sheet, '/' left to page-local filters. Palette buttons land in the pulse bar (md+) and tab-bar center (xs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…P console The Triage home + Pulse bar replaced the dashboard; the MCP console's function lives in the Connect popover and the palette's triage verbs. Routes, nav items and the VIEW_PATHS indirection are gone — nav is exactly Triage / Flow Map / Traces / Logs (the xs bottom-bar spec) and retired paths redirect home. The Suspense fallback is a token-CSS spinner; main.tsx drops ThemeProvider/ToastRegion (data-theme tokens are the theme system). Removed orphans: useMcpCall/useMcpTools/useAnomalies/ useTraffic/useReady hooks, Truncate, lib/breakpoints, lib/utils, and the unused listMcpTools surface. Deps uninstalled: @ossrandom/design-system, uplot, clsx, both @fontsource packages (fonts are vendored in ui/public/fonts). tsconfig lib/target now declares the ES2022 the code already uses (the DS types were masking Array.prototype.at). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Budgets reset to post-exit actuals + ~10% headroom: initialJs 160→118 KB (measured 107.6), initialCss 24→6 (2.7), fonts 90→75 (67.8), lazyChunkDefault 40→10 (largest chunk 7.9); the per-route exceptions are gone — everything fits the default. check-budgets now strips Vite's 8-char hash correctly even when it contains a dash, so future exceptions match. ErrorBoundary drops the dead design-system var names for our tokens (hex fallbacks are the tokens' dark values, for the stylesheet-missing case) and its inline transitions ride --dur-2 so reduced-motion zeroes them. Palette input signals focus on its hairline; RouteFallback and the LogsView mode selection clear sonar S6772/S3358. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e, design-system exit Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Pin form controls to 16px on coarse pointers (iOS zooms any focused control below 16px; the product scale tops out at 13.5px).
- body owns surface continuity: token background with a subtle top luminance ramp (--bg-app), no white rubber-band overscroll, font smoothing, tightened Inter tracking, tap-highlight removed.
- theme-color meta synced to the active theme (browser chrome matches the surface), SVG favicon, token-colored thin scrollbars, ::selection, contained focus treatment for text fields, --shadow-pop token.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
path.Clean alone is not a traversal sanitizer; root the path first so it can never climb above /. Reads already go through fs.FS (fs.ValidPath rejects ..) — this is defense in depth and unblocks the SAST gate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- explicit sort comparators (S2871) in FlowMap shape key and two tests; the FlowMap edge comparator also never returned 0 for equal keys
- linear-time trailing-zero trim in format.ts (S5852 backtracking hotspot)
- backoff jitter now draws from Web Crypto via lib/random.ts (S2245); wsManager tests stub the module instead of Math.random
- range-validate GRAPHRAG_EVENT_QUEUE_SIZE 1..1e6 (CodeQL go/uncontrolled-allocation-size: env value flowed unchecked into make(chan event, n) where each event is ~0.5-2 KB)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
/0+$/ still backtracks polynomially when retried from every input position; no regex means no hotspot and strictly less work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…select runWithTimeout's handler goroutine cancels the call context (deferred) after sending its result, so a fast handler leaves both select cases ready and Go's random pick returned -32001 ~50% of the time the goroutine won the schedule — seen as instant 'exceeded 30s deadline' failures in CI's race job. Drain the buffered result channel before declaring a timeout: the send happens-before the completion cancel, so a queued result is always found and only a still-running handler times out. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
config.Validate already rejects out-of-range GRAPHRAG_EVENT_QUEUE_SIZE, but the bound is invisible to callers constructing graphrag.Config directly — and to CodeQL's go/uncontrolled-allocation-size flow, which loses the config-boundary check across the struct-field round trip. Clamp nonsense sizes to the default where the buffer is allocated. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Summary
Three workstreams, 66 commits, no new data features — performance, memory survival, and a ground-up responsive UI.
A — Memory survival (fixes the production OOM restarts)
- GOMEMLIMIT via new internal/membudget (cgroup v2 → v1 → /proc/meminfo, 75% of budget)
- Byte-capped ingest queue (INGEST_PIPELINE_MAX_BYTES, 128 MB SQLite default): 429 instead of OOM
- RingBuffer tenant|service|metric keys (cross-tenant isolation fix) + cardinality cap
- SQLite daily maintenance: PRAGMA optimize + incremental_vacuum (no more 10–60 min exclusive-lock 429 storms)
- pprof listener (PPROF_ADDR), census gauges for every bounded structure
Soak proof (120 services, 1200 spans/s, 4 GB cgroup cap): baseline main OOMs in ~40 min (+92 MB/min); patched peaks at 705 MB with a decaying slope and a zero-growth pprof heap diff T+10→T+40.
B — Serving & hot paths
- GetServiceMapMetrics: 500 k-row scan + per-row zstd decompress → SQL aggregation + 30 s TTL cache (dashboard 2–5 s → <100 ms)
- gzip on GET /api/*, ETag on hot polled endpoints
C — Triage-first UI rewrite (C1–C7)
- @ossrandom/design-system, cytoscape, uplot removed; React 19 + TanStack Query/Virtual + wouter + Radix primitives + cmdk + hand-rolled token CSS
- … ?trail= URLs); ⌘K palette + keyboard chords
Validation
🤖 Generated with Claude Code