Merged
71 commits
d2527a3
fix(graphrag): bound anomaly-store memory blowup + GOMEMLIMIT safety net
aksOps Jun 4, 2026
2d02b19
fix: security, CI-unblock, and reliability quick-wins
aksOps Jun 4, 2026
d35986d
feat(observability): localhost pprof endpoint on dedicated listener
aksOps Jun 11, 2026
36375b4
feat(observability): in-memory store census gauges
aksOps Jun 11, 2026
fc58b62
perf(config): drop GraphRAG event queue SQLite default 100k -> 10k
aksOps Jun 11, 2026
8907413
feat(httpconst): shared Accept-Encoding and ETag negotiation helpers
aksOps Jun 11, 2026
3a079eb
refactor(membudget): extract memory-budget detection into internal/me…
aksOps Jun 11, 2026
62b63f9
feat(config): GraphRAG memory-bound env knobs with SQLite 30m trace TTL
aksOps Jun 11, 2026
8cf04db
feat(ui): precompressed asset serving, index ETag/304, SPA fallback
aksOps Jun 11, 2026
2996222
chore(tsdb): gofmt aggregator.go and drop stale nolint:unused directive
aksOps Jun 11, 2026
756e397
fix(tsdb): tenant-scope ring buffer series keys and cap new ring crea…
aksOps Jun 11, 2026
abbc508
feat(ui-build): build-time brotli/gzip precompression of dist assets
aksOps Jun 11, 2026
8a11ede
perf(sqlite): budget-scale cache_size/mmap_size PRAGMAs + incremental…
aksOps Jun 11, 2026
b061c23
feat(graphrag): per-tenant span cap in TraceStore (P1.4)
aksOps Jun 11, 2026
9f6e245
feat(api): stdlib gzip middleware for GET /api/* responses
aksOps Jun 11, 2026
c7d7c1a
feat(ingest): byte-bounded async ingest queue (P1.2)
aksOps Jun 11, 2026
4993037
feat(graphrag): evict idle tenant store slices from the refresh tick …
aksOps Jun 11, 2026
f018d8c
perf(retention): retire automatic daily full VACUUM on SQLite
aksOps Jun 11, 2026
c54aa23
perf(graphrag): bound SignalStore and the anomaly correlation walk (P…
aksOps Jun 11, 2026
83f86d4
perf(storage): aggregate service map nodes in SQL + narrow edge scan
aksOps Jun 11, 2026
17bbafc
perf(api): cache /api/metrics/service-map for 30s per tenant+window
aksOps Jun 11, 2026
99d8427
build(ui): add query/router/radix/uplot deps + vendored Inter/JetBrai…
aksOps Jun 11, 2026
5ae9fef
perf(graphrag): incremental DB rebuild via per-tenant high-water-mark…
aksOps Jun 11, 2026
b7090a3
test(retention): cover drainQuery error path and cancelled-context ma…
aksOps Jun 11, 2026
179996c
perf(api): ETag/304 + 10s TTL cache on hot polled endpoints
aksOps Jun 11, 2026
d0b337b
perf(graphrag): gate the 10s anomaly scan on tenant ingest activity (…
aksOps Jun 11, 2026
4918c2a
fix(storage): gofmt factory_test table, drop unused nolint directive
aksOps Jun 11, 2026
65a83b3
refactor(httpconst): hoist Content-Encoding header name to shared const
aksOps Jun 11, 2026
7ea6c2f
feat(ui): data-layer foundation — formatters, fetch wrapper, query cl…
aksOps Jun 11, 2026
945518d
test(graphrag): cover refresh-tick wiring, row-limit branch, metric p…
aksOps Jun 11, 2026
58a938a
style(config): gofmt alignment in driver-defaults test table
aksOps Jun 11, 2026
83eda2a
feat(ui): design tokens, system-aware theme, TanStack Query adapters
aksOps Jun 11, 2026
d971337
refactor(graphrag): extract incrementalSince to keep rebuild complexi…
aksOps Jun 11, 2026
18d97d0
feat(ui): responsive shell — pulse bar, adaptive nav, live dot, conne…
aksOps Jun 11, 2026
7e7ee91
feat(ui): gzip bundle budget gate (npm run check-budgets)
aksOps Jun 11, 2026
67c59ab
refactor(ui): drop the superseded useWebSocket hook
aksOps Jun 11, 2026
9c2bfaf
Merge branch 'worktree-wf_07cb4dcf-440-1' into worktree-oteliq-perf-pr1
aksOps Jun 11, 2026
d93b2f7
Merge branch 'worktree-wf_07cb4dcf-440-4' into worktree-oteliq-perf-pr1
aksOps Jun 11, 2026
947ac17
Merge branch 'worktree-wf_07cb4dcf-440-2' into worktree-oteliq-perf-pr1
aksOps Jun 11, 2026
d234c0f
Merge branch 'worktree-wf_07cb4dcf-440-3' into worktree-oteliq-perf-pr1
aksOps Jun 11, 2026
fa1748d
Merge branch 'worktree-wf_07cb4dcf-440-5' into worktree-oteliq-perf-pr1
aksOps Jun 11, 2026
39e0460
Merge branch 'worktree-wf_07cb4dcf-440-6' into worktree-oteliq-perf-pr1
aksOps Jun 11, 2026
1b769fb
Merge branch 'worktree-wf_07cb4dcf-440-7' into worktree-oteliq-perf-pr1
aksOps Jun 11, 2026
ddef334
fix(observability): pprof startup is best-effort; justify G108 suppre…
aksOps Jun 11, 2026
9bf578e
docs: document memory-survival series, serving changes, UI foundation
aksOps Jun 11, 2026
f82fe9f
feat(ui): investigation trail — URL-encoded breadcrumb stack lib, hoo…
aksOps Jun 11, 2026
7cca314
feat(ui): deterministic layered DAG layout lib for the flow map
aksOps Jun 11, 2026
3db5e6c
feat(ui): service inspector panel/sheet with extensible tab registry
aksOps Jun 11, 2026
020193b
feat(ui): deterministic SVG flow map replaces the cytoscape ServicesView
aksOps Jun 11, 2026
e8d49e7
feat(ui): triage home — anomaly strip, ranked feed, nav flip to /
aksOps Jun 11, 2026
dc43974
feat(api): expose span status and trace_id log filter on the wire
aksOps Jun 11, 2026
c0dcc35
feat(ui): waterfall layout math and evidence-page view models
aksOps Jun 11, 2026
45bd8e2
feat(ui): /logs live tail and /traces waterfall evidence pages (C5)
aksOps Jun 11, 2026
b845a6e
test(ui): cover map pan/zoom/pinch, sheet drag snaps, traffic adapter…
aksOps Jun 11, 2026
54404f6
Merge branch 'worktree-wf_98a2f4ef-3ec-2' into worktree-oteliq-perf-pr1
aksOps Jun 12, 2026
555bc4d
refactor(ui): extract DFS helper and explicit code-unit comparator in…
aksOps Jun 12, 2026
869f5f6
Merge branch 'worktree-wf_98a2f4ef-3ec-1' into worktree-oteliq-perf-pr1
aksOps Jun 12, 2026
a6356d3
feat(ui): MCP result parsers and triage-verb query options
aksOps Jun 12, 2026
550c6d5
feat(ui): inspector Why and Impact tabs — MCP triage verbs as human v…
aksOps Jun 12, 2026
1b3d415
feat(ui): flow-map blast-radius overlay via ?impact=
aksOps Jun 12, 2026
1562fae
feat(ui): command palette, go-to chords and shortcut sheet
aksOps Jun 12, 2026
0b78556
refactor(ui): exit @ossrandom/design-system — retire dashboard and MC…
aksOps Jun 12, 2026
0fae804
chore(ui): tighten bundle budgets and token/a11y polish
aksOps Jun 12, 2026
76b448d
Merge branch 'worktree-wf_5ff92b14-496-1' into worktree-oteliq-perf-pr1
aksOps Jun 12, 2026
c3a7c1d
docs(changelog): frontend rewrite C3-C7 — triage UI, flow map, palett…
aksOps Jun 12, 2026
0fd6f95
fix(ui): stop iOS input auto-zoom; browser-level premium polish
aksOps Jun 12, 2026
6cc5756
fix(ui): rooted-clean SPA request paths (semgrep filepath-clean-misuse)
aksOps Jun 12, 2026
986592a
fix: resolve SonarCloud quality-gate and CodeQL findings on PR #103
aksOps Jun 12, 2026
070da70
fix(ui): trim trailing zeros with an index walk, not regex (S5852)
aksOps Jun 12, 2026
dc95a47
fix(mcp): don't report spurious timeouts when a tool finishes during …
aksOps Jun 12, 2026
70ab254
fix(graphrag): clamp event channel size at the allocation site
aksOps Jun 12, 2026
15 changes: 15 additions & 0 deletions .env.example
@@ -12,6 +12,7 @@
# LOG_LEVEL=INFO # DEBUG|INFO|WARN|ERROR
# HTTP_PORT=8080 # HTTP API + OTLP HTTP + WebSocket + UI
# GRPC_PORT=4317 # OTLP gRPC
# PPROF_ADDR=127.0.0.1:6060 # net/http/pprof on a dedicated loopback listener; empty disables

# ---- Database ---------------------------------------------------------------
# DB_DRIVER=sqlite # sqlite|postgres|mysql|sqlserver
@@ -48,6 +49,8 @@
# DB_MAX_IDLE_CONNS 10 → 1
# INGEST_PIPELINE_WORKERS 8 → 2
# INGEST_PIPELINE_QUEUE_SIZE 50000 → 10000
# INGEST_PIPELINE_MAX_BYTES 536870912 → 134217728 (512MB → 128MB byte cap on queued batches;
# at the cap ingest gets 429 even for error batches)
# METRIC_MAX_CARDINALITY 10000 → 3000
# STORE_MIN_SEVERITY "" → "WARN" (INFO/DEBUG still flow to GraphRAG/Drain, just not persisted)
# SAMPLING_RATE 1.0 → 0.05 (errors and slow spans always kept)
@@ -57,6 +60,14 @@
# Override by setting the env var explicitly. See
# docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md for
# per-default rationale and the SQLite PRAGMA stanza applied at startup.
#
# The PRAGMA stanza budget-scales its two memory knobs against the detected
# memory budget (cgroup v2 → cgroup v1 → /proc/meminfo): page cache =
# budget/32 clamped to [64 MB, 256 MB], mmap window = budget/8 clamped to
# [256 MB, 1 GB]. A 4 GB host gets 128 MB cache + 512 MB mmap; detection
# failure falls back to the 256 MB / 1 GB ceilings. Explicit overrides win:
# SQLITE_CACHE_SIZE_KB=131072 # Page cache in KB (> 0). Pure-Go driver: this is Go-heap memory.
# SQLITE_MMAP_SIZE_BYTES=536870912 # mmap window in bytes (>= 0; 0 disables mmap).

# ---- Azure Entra (passwordless Postgres) ------------------------------------
# DB_AZURE_AUTH=false # Enables DefaultAzureCredential for Postgres. Requires strict TLS
@@ -120,6 +131,10 @@

# ---- Retention --------------------------------------------------------------
# HOT_RETENTION_DAYS=7 # RetentionScheduler purge cutoff. Range 1..36500. Set explicitly in prod.
# RETENTION_FULL_VACUUM=false # Daily SQLite maintenance runs PRAGMA incremental_vacuum(10000) by default.
# # true restores the legacy daily full VACUUM (exclusive lock, 10-60 min on
# # multi-GB DBs — expect a 429 storm while it holds the writer lock).
# # On-demand full VACUUM: POST /api/admin/vacuum. Ignored on non-SQLite drivers.

# ---- OTel Self-Instrumentation ----------------------------------------------
# OTEL_EXPORTER_OTLP_ENDPOINT= # When set, OtelContext exports its own spans to this OTLP gRPC endpoint.
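
The budget-scaling rule in the PRAGMA comment above (page cache = budget/32 clamped to [64 MB, 256 MB], mmap window = budget/8 clamped to [256 MB, 1 GB]) can be sketched as below. This is a minimal illustration, not the code from this PR: `sqliteMemoryKnobs` and `clamp` are hypothetical names, and treating a non-positive budget as "detection failed" is an assumption.

```go
package main

import "fmt"

const (
	mib = int64(1) << 20
	gib = int64(1) << 30
)

// clamp bounds v to the closed interval [lo, hi].
func clamp(v, lo, hi int64) int64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// sqliteMemoryKnobs (hypothetical helper) returns the page-cache size and
// mmap window in bytes for a detected memory budget. A budget <= 0 stands in
// for "detection failed" and yields the documented ceilings.
func sqliteMemoryKnobs(budgetBytes int64) (cacheBytes, mmapBytes int64) {
	if budgetBytes <= 0 {
		return 256 * mib, 1 * gib
	}
	cacheBytes = clamp(budgetBytes/32, 64*mib, 256*mib)
	mmapBytes = clamp(budgetBytes/8, 256*mib, 1*gib)
	return cacheBytes, mmapBytes
}

func main() {
	cache, mmap := sqliteMemoryKnobs(4 * gib)
	// For a 4 GB budget this prints: cache=128 MB mmap=512 MB
	fmt.Printf("cache=%d MB mmap=%d MB\n", cache/mib, mmap/mib)
}
```

The 4 GB case matches the example override values in the comment: 131072 KB is 128 MB and 536870912 bytes is 512 MB.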
91 changes: 91 additions & 0 deletions CHANGELOG.md
@@ -13,6 +13,97 @@ last published pre-release tag (`v0.2.0-beta.6`).

## [Unreleased]

### Fixed — production OOM restarts (memory-survival series)

- **AnomalyStore memory blowup** (merged from `fix/sqlite-survival-hardening`): stable
per-(service,type) anomaly IDs replace one-node-per-10s-tick; the O(N²)
PRECEDED_BY edge mesh, to which heap profiling attributed 84% of live heap,
is gone (AnomalyStore 272 MB → 2.6 MB in soak).
- **GOMEMLIMIT safety net**: startup sets a soft limit (env honored, else 75%
of the cgroup/host budget via new `internal/membudget`).
- **Byte-bounded ingest queue**: `INGEST_PIPELINE_MAX_BYTES` (512 MB; 128 MB on
SQLite) — the item-count queue could hold GBs; at the cap even error/slow
batches get 429/`RESOURCE_EXHAUSTED` (reason `bytes_full`).
- **GraphRAG bounds**: per-tenant span cap (`GRAPHRAG_MAX_SPANS_PER_TENANT`,
500k), SQLite trace TTL 1h→30m (`GRAPHRAG_TRACE_TTL`), idle-tenant store
eviction (`GRAPHRAG_TENANT_IDLE_TTL`, 24h; default tenant immune),
SignalStore metric nodes bounded (2000/tenant + 24h TTL), anomaly
correlation walk capped at 1000.
- **TSDB ring buffers**: keys are now tenant-scoped (`tenant|service|metric`
— fixes a cross-tenant data-isolation breach) and ring creation is capped at
`METRIC_MAX_CARDINALITY` (previously bypassed the cardinality check).
- **SQLite maintenance**: the automatic daily full `VACUUM` (10–60 min
exclusive lock → 429 storm → queue/RAM spiral) is replaced by
`PRAGMA optimize` + `incremental_vacuum(10000)`; restore via
`RETENTION_FULL_VACUUM=true` or `POST /api/admin/vacuum`. New DB files are
created `auto_vacuum=INCREMENTAL`.
- **Budget-scaled SQLite PRAGMAs**: page cache = budget/32 ∈ [64 MB, 256 MB],
mmap = budget/8 ∈ [256 MB, 1 GB] (4 GB host → 128 MB + 512 MB); explicit
`SQLITE_CACHE_SIZE_KB` / `SQLITE_MMAP_SIZE_BYTES` overrides win; fail-closed
stanza kept.
- Cherry-picked security/correctness quick-wins: cross-tenant read escape via
`X-Tenant-ID` closed, token-bucket sampler math fixed (rate < 1.0 dropped
~100% of healthy spans — SQLite's 0.05 default persisted almost nothing),
negative limit/offset clamps on `/api/logs` + `/api/traces`, MCP error
results no longer cached and `trace_graph` response capped.
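
For readers unfamiliar with the byte-bounded queue entry above, the idea can be sketched as follows: admission is checked against a byte total as well as an item count, so a small number of very large batches cannot hold gigabytes. This is an illustrative sketch only, not the PR's implementation; the type and method names are hypothetical, and `items_full` is an assumed reason string (only `bytes_full` appears in this changelog).

```go
package main

import (
	"fmt"
	"sync"
)

const (
	reasonItemsFull = "items_full" // assumed name, not from the changelog
	reasonBytesFull = "bytes_full" // reason string named in the changelog
)

// byteBoundedQueue (hypothetical) caps both the item count and the total
// number of queued bytes.
type byteBoundedQueue struct {
	mu       sync.Mutex
	batches  [][]byte
	bytes    int64
	maxItems int
	maxBytes int64
}

func newByteBoundedQueue(maxItems int, maxBytes int64) *byteBoundedQueue {
	return &byteBoundedQueue{maxItems: maxItems, maxBytes: maxBytes}
}

// tryEnqueue admits the batch and returns "", or returns a rejection reason
// without blocking. There is no bypass for priority batches at the byte cap.
func (q *byteBoundedQueue) tryEnqueue(batch []byte) string {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.batches) >= q.maxItems {
		return reasonItemsFull
	}
	if q.bytes+int64(len(batch)) > q.maxBytes {
		return reasonBytesFull
	}
	q.batches = append(q.batches, batch)
	q.bytes += int64(len(batch))
	return ""
}

// dequeue pops the oldest batch and releases its bytes from the budget.
func (q *byteBoundedQueue) dequeue() ([]byte, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.batches) == 0 {
		return nil, false
	}
	b := q.batches[0]
	q.batches = q.batches[1:]
	q.bytes -= int64(len(b))
	return b, true
}

func main() {
	q := newByteBoundedQueue(10000, 128<<20) // the SQLite defaults listed above
	if reason := q.tryEnqueue(make([]byte, 1<<20)); reason != "" {
		// A caller would map this to HTTP 429 or gRPC RESOURCE_EXHAUSTED.
		fmt.Println("rejected:", reason)
		return
	}
	fmt.Println("queued")
}
```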

### Added — observability

- `PPROF_ADDR` (default `127.0.0.1:6060`): `net/http/pprof` on a dedicated
loopback listener.
- Store census gauges: `otelcontext_graphrag_store_entities{entity}`,
`otelcontext_graphrag_store_edges{store}`, `otelcontext_tsdb_ring_series_active`,
`otelcontext_drain_templates_active`, `otelcontext_ingest_pipeline_queue_bytes`,
`otelcontext_graphrag_tenants_evicted_total`, `otelcontext_tsdb_ring_series_rejected_total`.

### Changed — performance

- `GetServiceMapMetrics`: node stats aggregate in SQL and the edge pass scans
a narrow projection — the per-row zstd decompression of span attributes is
gone (benchmark: 22 ms/5 MB vs 37 ms/15 MB on 5k spans; scales with row
count). Also fixes node `error_count` being permanently 0. The
`/api/metrics/service-map` response is cached 30s per tenant+window.
- GraphRAG DB rebuild is incremental via per-tenant high-water-mark (was a
full 1h-window re-read every 60s); the 10s anomaly scan skips tenants with
no new ingest events.
- HTTP serving: UI assets ship brotli/gzip precompressed with
`Cache-Control: immutable` + content hashing; `index.html` gets `no-cache`
+ ETag; SPA fallback for client-side routes; GET `/api/*` responses are
gzipped; `/api/system/graph`, `/api/metrics/dashboard`, `/api/stats` honor
`If-None-Match` → 304 with a shared 10s render cache.

### Changed — frontend rewrite complete (phases C3–C7)

- New Triage home at `/`: anomaly strip (MCP `get_anomaly_timeline`) + ranked
worst-first service feed. The cytoscape physics graph is replaced by a
deterministic layered SVG flow map (own ~300-line layout; −500 KB raw of
graph-lib payload; map route now ~5 KB gz), keyboard-walkable, with a
blast-radius overlay (`/map?impact=svc`).
- Service Inspector (docked panel / bottom sheet) with Overview, Dependencies,
**Why** (`root_cause_analysis`) and **Impact** (`impact_analysis`) tabs —
the MCP triage verbs as human-clickable actions. Investigation Trail:
URL-encoded drill-down breadcrumbs (`?trail=`), shareable and reload-safe.
- `/traces` (virtualized table + real time-positioned SVG waterfall) and
`/logs` (live tail over the bounded WS ring buffer, virtualized,
severity pills, context/trace cross-links).
- ⌘K command palette (navigate / services / triage actions / utilities),
`g m/t/l/h` chords, `?` shortcut sheet.
- `@ossrandom/design-system`, the legacy Dashboard and the MCP Console are
removed (uplot/clsx/fontsource deps too); nav is exactly
Triage / Flow Map / Traces / Logs. Bundle budgets tightened to actuals+10%:
initial JS ≤118 KB gz, initial CSS ≤6 KB gz, every lazy chunk ≤10 KB gz.

### Changed — frontend foundation (rewrite phases C1–C2)

- New data layer: TanStack Query (visibility-aware polling — hidden tabs stop
hitting SQLite), single WebSocket manager with jittered backoff and a
bounded 5k log ring buffer, `apiFetch` with AbortSignal, percent-formatting
bugs fixed centrally.
- New responsive shell: System Pulse bar (health/err/p99/DB size, 3-state
live indicator), bottom tab bar (<768px) / icon rail / labeled rail
(≥1440px), Connect popover, token CSS (`tokens.css`) with dark/light themes,
reduced-motion + contrast support. Routing via wouter; deep links served by
the SPA fallback. CI-able bundle budget gate (`npm run check-budgets`).

## [v0.2.0-beta.6] — 2026-06-05

This is the first release cut with the **source-only + build-on-tag** flow: the