OtelContext — AI Agent Instructions

Project Overview

OtelContext is a self-hosted OTLP observability platform, shipped as a single Go binary with an embedded React frontend.

Backend: Go 1.25, native net/http (no frameworks), GORM ORM, gRPC + HTTP for OTLP ingestion
Frontend: React 19 + TypeScript + TanStack Query/Virtual + wouter + Radix primitives + cmdk palette + hand-rolled token CSS (ui/src/styles/tokens.css) with CSS Modules; the flow map uses its own deterministic SVG layout code. No UI framework: @ossrandom/design-system, cytoscape, uplot all removed (rewrite completed 2026-06-12, phases C1–C7)
Ports: gRPC :4317 (OTLP), HTTP :8080 (API + HTTP OTLP + WebSocket + UI)

Strict Rules

NO Express.js/Gin/Echo — use native Go net/http
NO Tailwind CSS, NO Mantine, NO component frameworks — UI styling is the hand-rolled token sheet (ui/src/styles/tokens.css) + per-component CSS Modules; Radix primitives (unstyled) only for the a11y-hard parts (dialog/tabs/tooltip/dropdown). Token values only — no raw hex outside tokens.css.
Single-service architecture (no microservices split)
All internal DBs must be embedded (no external processes)
Relational DB (SQLite/MySQL/PostgreSQL/MSSQL) is the single source of truth
Prioritize self-hosted, open-source solutions
The internal/graph/ package is legacy — use internal/graphrag/ for all new graph work

Architecture

gRPC :4317 (OTLP Ingest) ──► Ingestion Layer ──► Storage (GORM)
HTTP :8080/v1/* (OTLP HTTP)─┘       │                    │
                                     ▼                    ▼
                               In-Memory Accel.      Relational DB
                               (TSDB Ring,           (Source of Truth,
                                GraphRAG)             7-15 day retention)
                                     │
HTTP :8080 ◄── REST API ◄───────────┘
           ◄── WebSocket (real-time)
           ◄── MCP Server (AI agents, 7-tool triage surface)
           ◄── Prometheus /metrics

Ingestion Paths

Path	Endpoint	Content Types	Notes
gRPC	`:4317`	protobuf	Traces, Logs, Metrics via OTLP gRPC
HTTP	`/v1/traces`, `/v1/logs`, `/v1/metrics`	`application/x-protobuf`, `application/json`	OTLP HTTP spec compliant, gzip support, 4MB limit. Returns `429 Too Many Requests` + `Retry-After: 1` when the async pipeline queue is full (parity with gRPC `RESOURCE_EXHAUSTED`).

Both paths delegate to the same Export() methods — zero business logic duplication. By default Export() parses the OTLP request and hands a Batch to the async ingest Pipeline (internal/ingest/pipeline.go); a worker pool persists Trace→Span→Log in order. With INGEST_ASYNC_ENABLED=false the pipeline is bypassed and Export() writes inline (legacy path).

Multi-tenancy

Tenant identity flows into the request context on every write and read:

HTTP: X-Tenant-ID header (see internal/api/tenant_middleware.go).
gRPC: x-tenant-id metadata key (see internal/ingest/otlp.go).
OTLP resource attribute: tenant.id on the resource overrides the header/metadata.

When none are present, DEFAULT_TENANT (default "default") is assigned. Every row in the relational DB carries a tenant_id column; every read method in internal/storage/ scopes by the tenant in the request context (Where("tenant_id = ?", tenant)). Retention (RetentionScheduler) is cross-tenant — it purges by age, not by tenant.
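
The precedence above can be sketched as a small pure function. This is a minimal illustration, not the code in internal/api/tenant_middleware.go or internal/ingest/otlp.go; the function name and parameters are hypothetical.

```go
package main

import "fmt"

// resolveTenant is a hypothetical helper that applies the documented
// precedence: the tenant.id resource attribute wins, then the transport-level
// value (X-Tenant-ID header or x-tenant-id gRPC metadata), then DEFAULT_TENANT.
func resolveTenant(resourceAttr, transport, defaultTenant string) string {
	if resourceAttr != "" {
		return resourceAttr
	}
	if transport != "" {
		return transport
	}
	return defaultTenant
}

func main() {
	fmt.Println(resolveTenant("", "", "default"))
	fmt.Println(resolveTenant("", "acme", "default"))
	fmt.Println(resolveTenant("globex", "acme", "default"))
}
```

Keeping the decision in one function makes it easy to apply the same rule on both the HTTP and gRPC paths.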

Storage Architecture

Layer	Package	Purpose
GraphRAG (in-memory)	`internal/graphrag/`	Layered graph: 4 typed stores, error chains, root cause analysis, anomaly detection
Time Series (in-memory)	`internal/tsdb/`	Ring buffer, sliding windows, pre-computed percentiles
Graph (in-memory, legacy)	`internal/graph/`	Simple service topology — being replaced by GraphRAG
Relational (persistent)	`internal/storage/`	GORM-based, multi-DB, single source of truth. Retention is driven by `RetentionScheduler` (hourly batched purge + daily maintenance: `PRAGMA optimize` and incremental vacuum on SQLite, ANALYZE-equivalent elsewhere). `logs.body` is plain TEXT. Log search: SQLite FTS5 (`logs_fts`, porter+unicode61, ordered by `bm25()`, AFTER INSERT/DELETE/UPDATE triggers) is the default path; `LOG_FTS_ENABLED` defaults to `true` when `DB_DRIVER=sqlite` and `false` otherwise. Operators who want the ~30% disk savings can set `LOG_FTS_ENABLED=false` and reclaim the FTS table + indexes via `POST /api/admin/drop_fts`. Postgres uses `pg_trgm` GIN on `logs.body` and `logs.service_name`. `AttributesJSON` and `AIInsight` remain `CompressedText`. The `search_logs` MCP tool and the API `/api/logs?q=…` filter are clamped to the last 24 hours to bound the LIKE-fallback worst case. The `vectordb` package (TF-IDF semantic search) was removed on 2026-05-24 alongside the `find_similar_logs` MCP tool; `data/vectordb.snapshot` is left on disk for operators to delete by hand.

GraphRAG Architecture

The internal/graphrag/ package is the core intelligence layer. It replaces the simple internal/graph/ package for advanced observability queries.

Layered Stores (each with own `sync.RWMutex`)

Store	Nodes	Edges	TTL
`ServiceStore`	ServiceNode, OperationNode	CALLS, EXPOSES	Permanent
`TraceStore`	TraceNode, SpanNode	CONTAINS, CHILD_OF	Configurable (default 1h)
`SignalStore`	LogClusterNode, MetricNode	EMITTED_BY, MEASURED_BY, LOGGED_DURING	Permanent
`AnomalyStore`	AnomalyNode	PRECEDED_BY, TRIGGERED_BY	24h

Node Types (7)

ServiceNode, OperationNode, TraceNode, SpanNode, LogClusterNode, MetricNode, AnomalyNode

Edge Types (9)

CALLS, EXPOSES, CONTAINS, CHILD_OF, EMITTED_BY, LOGGED_DURING, MEASURED_BY, PRECEDED_BY, TRIGGERED_BY

Query Functions

Function	Algorithm	Purpose
`ErrorChain(service, timeRange)`	BFS upstream via CHILD_OF + CALLS	Trace error to responsible service
`ImpactAnalysis(service, depth)`	BFS downstream via CALLS	Blast radius
`RootCauseAnalysis(service, timeRange)`	ErrorChain + anomaly correlation	Ranked probable causes with evidence
`DependencyChain(traceID)`	Tree from CONTAINS + CHILD_OF	Full trace visualization
`CorrelatedSignals(service, timeRange)`	Gather all edges	Related logs/metrics/traces
`ShortestPath(from, to)`	Dijkstra weighted by inverse call freq	Service communication path
`AnomalyTimeline(since)`	Time-sorted anomalies + PRECEDED_BY	Recent anomaly overview
`ServiceMap(depth)`	Full topology dump	Service topology + health
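
As an illustration of the BFS-style queries in the table, here is a minimal sketch of a depth-limited blast-radius walk over CALLS edges. The adjacency-map representation, the function name, and the edge direction (caller to callee) are assumptions made for the example; the real stores in internal/graphrag/ are typed and lock-protected.

```go
package main

import "fmt"

// impacted is a hypothetical helper: it walks CALLS edges breadth-first from
// start and returns every service reached within maxDepth hops, in BFS order.
func impacted(calls map[string][]string, start string, maxDepth int) []string {
	visited := map[string]bool{start: true}
	frontier := []string{start}
	var out []string
	for depth := 0; depth < maxDepth && len(frontier) > 0; depth++ {
		var next []string
		for _, svc := range frontier {
			for _, callee := range calls[svc] {
				if visited[callee] {
					continue // already counted; also guards against cycles
				}
				visited[callee] = true
				out = append(out, callee)
				next = append(next, callee)
			}
		}
		frontier = next
	}
	return out
}

func main() {
	calls := map[string][]string{
		"gateway":  {"orders", "users"},
		"orders":   {"payments"},
		"payments": {"ledger"},
	}
	fmt.Println(impacted(calls, "gateway", 2))
}
```

The visited set is what keeps the walk bounded on cyclic topologies; the depth cap bounds the work on very wide ones.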

Background Processes

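Event workers (GRAPHRAG_WORKER_COUNT, default 16) consume from a buffered channel sized by GRAPHRAG_EVENT_QUEUE_SIZE (default 100000; 10000 on SQLite); delivery is best-effort because the DB is the source of truth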
Refresh loop (60s) — rebuilds from DB, prunes expired TraceStore nodes, cleans old anomalies
Snapshot loop (15min) — persists Drain templates so cluster IDs survive restart (the graph_snapshots write side was removed on 2026-05-24; the loop name is retained for wiring stability)
Anomaly loop (10s) — detects error spikes, latency degradation, metric z-score anomalies
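
The z-score check mentioned for the anomaly loop can be sketched as follows. This is a generic population z-score under assumed names and an assumed threshold of 3; it is not taken from internal/graphrag/anomaly.go.

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations x lies from the mean of window.
// It returns 0 when the window is too short or has no variance.
func zScore(window []float64, x float64) float64 {
	if len(window) < 2 {
		return 0
	}
	var sum float64
	for _, v := range window {
		sum += v
	}
	mean := sum / float64(len(window))
	var sq float64
	for _, v := range window {
		d := v - mean
		sq += d * d
	}
	std := math.Sqrt(sq / float64(len(window)))
	if std == 0 {
		return 0
	}
	return (x - mean) / std
}

// isAnomalous flags x when its absolute z-score exceeds threshold
// (the threshold value is an assumption of this sketch).
func isAnomalous(window []float64, x, threshold float64) bool {
	return math.Abs(zScore(window, x)) > threshold
}

func main() {
	latencies := []float64{10, 12, 11, 9, 10, 11, 10, 9}
	fmt.Println(isAnomalous(latencies, 30, 3))
	fmt.Println(isAnomalous(latencies, 11, 3))
}
```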

Persistence Models (GORM)

Investigation — automated error analysis records (trigger, root cause, causal chain, evidence)
DrainTemplateRow — persisted Drain log templates (table drain_templates), loaded on startup to warm the miner

Note: GraphSnapshot (table graph_snapshots) was removed on 2026-05-24. AutoMigrate no longer creates the table on fresh deploys; existing populated tables are left in place — operators can DROP TABLE graph_snapshots; VACUUM; to reclaim disk.

Log Clustering (Drain)

Log clustering uses Drain template mining (internal/graphrag/drain.go) — a deterministic fixed-depth prefix tree with O(1) LRU via container/list. Templates are persisted to the drain_templates table and reloaded on startup so cluster IDs stay stable across restarts.

Ingestion Callbacks

TraceServer.Export() → DB persist → spanCallback → GraphRAG.OnSpanIngested()
LogsServer.Export()  → DB persist → logCallback  → GraphRAG.OnLogIngested()
MetricsServer.Export() → TSDB    → metricCallback → GraphRAG.OnMetricIngested()

MCP Server — 7-Tool Triage Surface

The MCP server (internal/mcp/) exposes a focused 7-tool triage surface via HTTP Streamable MCP (JSON-RPC 2.0 POST + SSE GET). The surface was reduced from 21 → 7 on 2026-05-24 so the platform survives 120 services on SQLite — see docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md for the full rationale.

Tool	Input	Source
`get_anomaly_timeline`	`{since?, service?}`	In-memory (instant) — triage entry point
`get_service_map`	`{depth?, service?}`	In-memory (instant) — topology + health overlay
`get_service_health`	`{service_name}`	In-memory (instant) — per-service drill-down
`root_cause_analysis`	`{service, time_range?}`	In-memory (instant) — ranked probable causes
`impact_analysis`	`{service, depth?}`	In-memory (instant) — blast radius
`trace_graph`	`{trace_id}`	In-memory + DB fallback — trace tree visualisation
`search_logs`	`{query?, severity?, service?, trace_id?, start?, end?, limit?, page?}`	DB (FTS5 default on SQLite, LIKE fallback, 24h-clamped)

Cut tools (clients now receive an unknown tool RPC error): get_system_graph, tail_logs, get_trace, search_traces, get_metrics, get_dashboard_stats, get_storage_status, find_similar_logs, get_alerts, correlated_signals, get_error_chains, get_investigations, get_investigation, get_graph_snapshot.

Cacheable surface (5s TTL via MCP_CACHE_TTL_MS): get_anomaly_timeline, get_service_map, get_service_health, root_cause_analysis, impact_analysis.
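
A short-TTL memoization layer of this kind might look like the sketch below. The type and method names are hypothetical, the clock is injected as a millisecond function so expiry is easy to exercise, and a TTL of 0 disables caching, mirroring the "set to 0 to disable" behaviour described for MCP_CACHE_TTL_MS.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresMs int64
}

// ttlCache is a hypothetical memoizer keyed by tool name plus arguments.
type ttlCache struct {
	mu    sync.Mutex
	ttlMs int64
	nowMs func() int64
	data  map[string]entry
}

func newTTLCache(ttlMs int64, nowMs func() int64) *ttlCache {
	return &ttlCache{ttlMs: ttlMs, nowMs: nowMs, data: make(map[string]entry)}
}

// getOrCompute returns a fresh cached value, or calls compute and stores the
// result. The lock is held during compute, which is acceptable only because
// the cached tools are described as cheap in-memory lookups.
func (c *ttlCache) getOrCompute(key string, compute func() string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ttlMs > 0 {
		if e, ok := c.data[key]; ok && c.nowMs() < e.expiresMs {
			return e.value
		}
	}
	v := compute()
	if c.ttlMs > 0 {
		c.data[key] = entry{value: v, expiresMs: c.nowMs() + c.ttlMs}
	}
	return v
}

func main() {
	cache := newTTLCache(5000, func() int64 { return time.Now().UnixMilli() })
	calls := 0
	handler := func() string { calls++; return "service map" }
	cache.getOrCompute("get_service_map:{}", handler)
	cache.getOrCompute("get_service_map:{}", handler)
	fmt.Println("handler invocations:", calls)
}
```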

Every error-identifying tool returns a root_cause block:

{"root_cause": {"service": "...", "operation": "...", "error_message": "...", "span_id": "...", "trace_id": "..."}}

DLQ (Dead Letter Queue)

Uses typed envelopes for all data types:

{"type": "logs|spans|traces|metrics", "data": [...]}

Legacy format (raw []storage.Log JSON) is supported for backward compatibility.
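
One way to read both formats is to peek at the first non-space byte: a top-level JSON array is treated as the legacy log payload, anything else is parsed as a typed envelope. This is a sketch under that assumption; the struct and function names are hypothetical and not taken from internal/queue/.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// envelope mirrors the typed DLQ shape shown above.
type envelope struct {
	Type string          `json:"type"`
	Data json.RawMessage `json:"data"`
}

// decode returns the payload type and its raw JSON data. A bare JSON array is
// assumed to be the legacy format and is reported as "logs".
func decode(raw []byte) (string, json.RawMessage, error) {
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) > 0 && trimmed[0] == '[' {
		return "logs", json.RawMessage(trimmed), nil
	}
	var env envelope
	if err := json.Unmarshal(trimmed, &env); err != nil {
		return "", nil, err
	}
	switch env.Type {
	case "logs", "spans", "traces", "metrics":
		return env.Type, env.Data, nil
	}
	return "", nil, fmt.Errorf("unknown DLQ envelope type %q", env.Type)
}

func main() {
	typ, data, err := decode([]byte(`{"type":"spans","data":[{"id":1}]}`))
	fmt.Println(typ, string(data), err)
	typ, data, err = decode([]byte(`[{"body":"legacy"}]`))
	fmt.Println(typ, string(data), err)
}
```

Deferring the inner decode with json.RawMessage lets the replay path pick the concrete slice type only after the envelope type is known.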

Shutdown Order

Proper LIFO ordering to prevent data loss:

gRPC GracefulStop() + HTTP Shutdown() — stop ingestion
WebSocket Hub + Event Hub + AI Service — stop real-time
TSDB + Graph + GraphRAG — stop processing
DLQ — stop replay
RetentionScheduler Stop() — halt purge/maintenance ticks
DB Close() — close database last
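
A common way to get this ordering is to register stop functions as components start and run them in reverse. The sketch below assumes that pattern; the type and the component names in main are illustrative, not the project's actual wiring.

```go
package main

import "fmt"

type step struct {
	name string
	stop func()
}

// shutdownStack is a hypothetical helper that stops components in the reverse
// of their registration (startup) order.
type shutdownStack struct {
	steps []step
}

func (s *shutdownStack) register(name string, stop func()) {
	s.steps = append(s.steps, step{name: name, stop: stop})
}

// run calls every stop function last-in, first-out and returns the names in
// the order they were stopped.
func (s *shutdownStack) run() []string {
	order := make([]string, 0, len(s.steps))
	for i := len(s.steps) - 1; i >= 0; i-- {
		s.steps[i].stop()
		order = append(order, s.steps[i].name)
	}
	return order
}

func main() {
	var s shutdownStack
	// Startup order: the database comes up first, listeners last.
	for _, n := range []string{"db", "retention", "dlq", "processing", "realtime", "ingest"} {
		name := n
		s.register(name, func() { fmt.Println("stopping", name) })
	}
	s.run()
}
```

Because the database is registered first, it is closed last, after every component that might still flush to it.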

Key Directories

internal/
  ai/           # AI service integration
  api/          # HTTP handlers, middleware, rate limiting, graph_handler
  cache/        # TTL cache with synchronized Stop()
  compress/     # Zstd compression utilities
  config/       # Environment configuration (40+ fields)
  graph/        # LEGACY in-memory service graph — use graphrag/ for new work
  graphrag/     # GraphRAG: layered graph, error chains, anomaly detection, investigations
    schema.go       # 7 node types, 9 edge types, query result types
    store.go        # 4 typed stores (Service, Trace, Signal, Anomaly)
    builder.go      # Event workers, ingestion callbacks, GraphRAG coordinator
    queries.go      # ErrorChain, ImpactAnalysis, RootCause, ShortestPath, etc.
    investigation.go # GORM Investigation model + persistence
    anomaly.go      # Z-score, error spike, latency degradation detection
    drain.go        # Log clustering via Drain template mining — pure-Go, stdlib-only, deterministic fixed-depth prefix tree
    refresh.go      # Periodic DB rebuild + pruning + Drain template persistence
  ingest/       # OTLP receivers (gRPC + HTTP), adaptive sampling
    otlp.go         # gRPC TraceServer, LogsServer, MetricsServer
    otlp_http.go    # HTTP OTLP handler (protobuf + JSON, gzip, 4MB limit)
    sampler.go      # Per-service token bucket sampler
  mcp/          # MCP server (7-tool triage surface, JSON-RPC 2.0 + SSE)
  queue/        # Dead Letter Queue (typed envelopes, bounded disk, exp backoff)
  realtime/     # WebSocket hub + event streaming
  storage/      # GORM repository, models, migrations, Close() method, SQLite PRAGMA stanza
  telemetry/    # Prometheus metrics + health (19 metrics)
  tsdb/         # Time series aggregator + ring buffer (lock-free Windows())
  ui/           # Embedded React frontend
ui/             # React frontend (Vite + token CSS Modules, no UI framework)
test/           # Microservice simulation (7 services)
docs/           # Specifications and plans

Configuration (Environment Variables)

Key settings in internal/config/config.go:

HTTP_PORT (8080), GRPC_PORT (4317), DB_DRIVER (sqlite), DB_DSN
DB_AUTOMIGRATE (true), DB_MAX_OPEN_CONNS, DB_MAX_IDLE_CONNS, DB_CONN_MAX_LIFETIME (internally capped to 30m when DB_AZURE_AUTH=true)
DB_AZURE_AUTH (false) — see Authentication below
TLS_CERT_FILE, TLS_KEY_FILE — explicit TLS (both or neither)
TLS_AUTO_SELFSIGNED (false), TLS_CACHE_DIR (./data/tls) — self-signed bootstrap, ignored if cert files set
API_KEY — Bearer token gate for /api/*, /v1/*, /mcp. Empty = auth disabled
OTEL_EXPORTER_OTLP_ENDPOINT — enables self-instrumentation (empty = off)
DEFAULT_TENANT (default) — assigned to rows ingested without explicit tenant
HOT_RETENTION_DAYS (7) — drives RetentionScheduler; range 1..36500
SAMPLING_RATE (1.0), SAMPLING_ALWAYS_ON_ERRORS (true), SAMPLING_LATENCY_THRESHOLD_MS (500)
METRIC_MAX_CARDINALITY (10000), METRIC_MAX_CARDINALITY_PER_TENANT (0 = unlimited), API_RATE_LIMIT_RPS (100). The per-tenant cap is checked first; when set, a noisy tenant cannot exhaust the global pool. Overflow is labeled by tenant via otelcontext_tsdb_cardinality_overflow_by_tenant_total{tenant_id} (__global__ sentinel when the global cap was the trigger).
MCP_ENABLED (true), MCP_PATH (/mcp)
MCP_MAX_CONCURRENT (32), MCP_CALL_TIMEOUT_MS (30000), MCP_CACHE_TTL_MS (5000) — MCP HTTP streamable robustness. Counting semaphore gates concurrent tools/call (JSON-RPC -32000 past the cap), per-call deadlines abort runaway handlers (JSON-RPC -32001), and a 5s TTL cache memoizes the cheap in-memory GraphRAG tools (get_service_map, impact_analysis, root_cause_analysis, get_anomaly_timeline, get_service_health). SSE GET sends a : keep-alive\n\n comment every 25s to keep the stream alive across reverse-proxy idle timeouts. Set any to 0 to disable.
LOG_FTS_ENABLED — when truthy (true/yes/on/1), provisions the SQLite FTS5 logs_fts virtual table + sync triggers at startup; when false, log-search uses a 24h-clamped LIKE fallback. Defaults to true when DB_DRIVER=sqlite (BM25 is dramatically faster than LIKE on the kept search_logs MCP tool) and false otherwise. Toggle off and reclaim the ~30% disk overhead via POST /api/admin/drop_fts (refused while the flag is on). The vectordb-backed semantic-search path was removed on 2026-05-24.
DLQ_MAX_FILES (1000), DLQ_MAX_DISK_MB (500), DLQ_MAX_RETRIES (10)
GRAPHRAG_WORKER_COUNT (16), GRAPHRAG_EVENT_QUEUE_SIZE (100000; 10000 on SQLite) — sized for 100–200 services; raise further if otelcontext_graphrag_events_dropped_total climbs
GRAPHRAG_TRACE_TTL (1h; 30m on SQLite), GRAPHRAG_MAX_SPANS_PER_TENANT (500000), GRAPHRAG_TENANT_IDLE_TTL (24h) — in-memory GraphRAG memory bounds. Spans past the per-tenant cap are skipped from the graph only (DB unaffected; metered as otelcontext_graphrag_events_dropped_total{signal="span_capacity"}); tenant store slices idle past the TTL are evicted (default tenant immune, self-healing via the 60s rebuild). SignalStore metrics are bounded to 2000/tenant + 24h TTL (constants).
PPROF_ADDR (127.0.0.1:6060) — net/http/pprof on a dedicated loopback listener (never the public mux); empty disables. Startup also sets a soft GOMEMLIMIT (honors the env var, else 75% of the cgroup/host budget via internal/membudget).
INGEST_MIN_SEVERITY (INFO), STORE_MIN_SEVERITY ("" = same as ingest; defaults to "WARN" when DB_DRIVER=sqlite) — two-tier log severity gate. The ingest gate runs at the OTLP receiver and drops the log entirely below the threshold (no in-memory enrichment either). The store gate runs at the persist boundary inside the async pipeline (internal/ingest/pipeline.go:process) and only skips the DB row write — the log still flows through LogCallback so GraphRAG Drain template mining and span/trace correlation see it. Use case: INGEST_MIN_SEVERITY=DEBUG STORE_MIN_SEVERITY=WARN keeps SQLite small while letting in-memory anomaly detection benefit from the verbose stream. Setting STORE_MIN_SEVERITY ≤ INGEST_MIN_SEVERITY is a no-op (logged as a warning at startup). Drops surface via Pipeline.Stats().StoreFiltered.
INGEST_ASYNC_ENABLED (true), INGEST_PIPELINE_QUEUE_SIZE (50000), INGEST_PIPELINE_WORKERS (8), INGEST_PIPELINE_MAX_BYTES (536870912 = 512 MB; 128 MB on SQLite) — async ingest pipeline (internal/ingest/pipeline.go). Hybrid backpressure: <90% accept all, 90–100% drop healthy batches (errors/slow always pass), 100% return gRPC RESOURCE_EXHAUSTED. The byte cap bounds queue memory regardless of item count — at the cap even priority batches get RESOURCE_EXHAUSTED/429 (a 429 is recoverable, an OOM kill is not); watch otelcontext_ingest_pipeline_queue_bytes and reason bytes_full. Set INGEST_ASYNC_ENABLED=false to revert to synchronous DB writes inside Export(). Drops surface as otelcontext_ingest_pipeline_dropped_total{signal,reason}.
GRPC_MAX_RECV_MB (16), GRPC_MAX_CONCURRENT_STREAMS (1000) — OTLP gRPC server caps, validated to 1..256 and 1..1_000_000
RETENTION_BATCH_SIZE (50000), RETENTION_BATCH_SLEEP_MS (1) — purge pacing; raise the sleep on busy production DBs
DB_POSTGRES_PARTITIONING (""), DB_PARTITION_LOOKAHEAD_DAYS (3) — opt-in Postgres declarative range partitioning of the logs table by day. When daily, logs is provisioned as a partitioned parent (greenfield only — refuses to start if logs already exists unpartitioned), the PartitionScheduler maintains lookahead partitions and drops expired ones via DROP TABLE, and RetentionScheduler skips the row-level DELETE for logs. Watch otelcontext_partitions_dropped_total and otelcontext_partitions_active.
APP_ENV ("development"), OTELCONTEXT_ALLOW_SQLITE_PROD (false) — SQLite is refused when APP_ENV=production unless the allow flag is set
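
The hybrid backpressure rule described for the async ingest pipeline can be sketched as a pure admission function. The names and the exact boundary handling (for example, treating exactly 90% as the start of the drop band) are assumptions; the real logic lives in internal/ingest/pipeline.go and may also account for the incoming batch's own size.

```go
package main

import "fmt"

type decision string

const (
	accept decision = "accept" // enqueue the batch
	drop   decision = "drop"   // silently shed a healthy batch
	reject decision = "reject" // RESOURCE_EXHAUSTED on gRPC, 429 on HTTP
)

// admit is a hypothetical helper. priority marks batches carrying errors or
// slow spans, which pass through the 90-100% band but not a full queue or
// an exhausted byte budget.
func admit(queued, capacity int, queuedBytes, maxBytes int64, priority bool) decision {
	if maxBytes > 0 && queuedBytes >= maxBytes {
		return reject // byte cap applies even to priority batches
	}
	if queued >= capacity {
		return reject
	}
	if queued*10 >= capacity*9 && !priority {
		return drop
	}
	return accept
}

func main() {
	const maxBytes = 512 << 20
	fmt.Println(admit(100, 50000, 0, maxBytes, false))
	fmt.Println(admit(46000, 50000, 0, maxBytes, false))
	fmt.Println(admit(46000, 50000, 0, maxBytes, true))
	fmt.Println(admit(50000, 50000, 0, maxBytes, true))
}
```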

SQLite per-driver defaults (auto-flipped when DB_DRIVER=sqlite)

So that a 100+ service deployment on SQLite survives without OOM, config.Load() overrides the defaults listed below at the end of the Load() pass, but only when the operator did not explicitly set the env var (detected via os.LookupEnv presence, not value comparison). Postgres/MSSQL/MySQL paths are untouched.

Env var	SQLite default	Postgres default	Rationale
`DB_MAX_OPEN_CONNS`	1	50	SQLite is single-writer; extra conns are wasted slots.
`DB_MAX_IDLE_CONNS`	1	10	Match open conns.
`INGEST_PIPELINE_WORKERS`	2	8	Workers all serialise through the SQLite writer lock; 2 is enough to keep the queue non-empty.
`INGEST_PIPELINE_QUEUE_SIZE`	10000	50000	Lower heap watermark; backpressure kicks in earlier so OTLP clients back off.
`INGEST_PIPELINE_MAX_BYTES`	128 MB	512 MB	Item count alone cannot bound queue memory; one batch may carry MBs of spans/logs.
`GRAPHRAG_EVENT_QUEUE_SIZE`	10000	100000	Each queued event embeds a Span/Log by value (~0.5–2 KB); buffer less, drop sooner (metered).
`GRAPHRAG_TRACE_TTL`	30m	1h	The in-memory span window is the largest legitimate GraphRAG consumer; anomaly/investigation lookbacks are ≤5min.
`METRIC_MAX_CARDINALITY`	3000	10000	Bound the in-memory TSDB series map.
`STORE_MIN_SEVERITY`	`"WARN"`	`""`	Skip INFO/DEBUG persists; in-memory GraphRAG/Drain still sees them.
`SAMPLING_RATE`	0.05	1.0	Errors and slow spans are always kept by `SAMPLING_ALWAYS_ON_ERRORS`.
`GRPC_MAX_CONCURRENT_STREAMS`	240	1000	~2 streams per service at 120 services with headroom.
`LOG_FTS_ENABLED`	`true`	n/a	FTS5 BM25 is dramatically faster than LIKE on the kept `search_logs` path.
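
The presence-based detection matters because an operator may explicitly set a value that happens to equal the generic default. A minimal sketch, assuming a hypothetical helper with an injectable lookup function (falling back to the driver default on an unparsable value is also an assumption of the sketch):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// intSetting is a hypothetical helper: an explicitly set env var always wins,
// otherwise the default depends on the driver.
func intSetting(lookup func(string) (string, bool), key, driver string, sqliteDefault, otherDefault int) int {
	if raw, ok := lookup(key); ok {
		if v, err := strconv.Atoi(raw); err == nil {
			return v
		}
	}
	if driver == "sqlite" {
		return sqliteDefault
	}
	return otherDefault
}

func main() {
	workers := intSetting(os.LookupEnv, "INGEST_PIPELINE_WORKERS", "sqlite", 2, 8)
	fmt.Println("ingest workers:", workers)
}
```

With a value comparison, `INGEST_PIPELINE_WORKERS=8` on SQLite would be indistinguishable from "unset" and silently lowered to 2; checking presence preserves the operator's choice.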

Also at SQLite startup, internal/storage/factory.go applies a fail-closed PRAGMA stanza: journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY, wal_autocheckpoint=10000, journal_size_limit=67108864 (64 MB WAL cap), busy_timeout=5000, plus budget-scaled memory knobs: page cache = budget/32 clamped to [64 MB, 256 MB] and mmap = budget/8 clamped to [256 MB, 1 GB], where the budget comes from internal/membudget (cgroup v2 → v1 → /proc/meminfo; a 4 GB host gets 128 MB cache + 512 MB mmap, detection failure falls back to the 256 MB/1 GB ceilings). Operators override with SQLITE_CACHE_SIZE_KB / SQLITE_MMAP_SIZE_BYTES. With the pure-Go driver the page cache is Go-heap memory and competes with GOMEMLIMIT. PRAGMA auto_vacuum=INCREMENTAL is attempted best-effort before the WAL switch (the WAL header freezes the stored mode; only affects newly created DB files). Any pragma failure in the fail-closed stanza aborts startup with a wrapped error — these are not optional. See docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md for per-default reasoning.
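
The budget-scaled sizing above is simple clamp arithmetic. The sketch below reproduces the stated ratios and bounds with hypothetical function names; budget detection itself (internal/membudget) is out of scope, and a non-positive budget stands in for detection failure.

```go
package main

import "fmt"

const mb = int64(1) << 20

func clamp(v, lo, hi int64) int64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// sqliteCacheBytes: budget/32 clamped to [64 MB, 256 MB]; the ceiling is used
// when the budget is unknown.
func sqliteCacheBytes(budget int64) int64 {
	if budget <= 0 {
		return 256 * mb
	}
	return clamp(budget/32, 64*mb, 256*mb)
}

// sqliteMmapBytes: budget/8 clamped to [256 MB, 1 GB]; the ceiling is used
// when the budget is unknown.
func sqliteMmapBytes(budget int64) int64 {
	if budget <= 0 {
		return 1024 * mb
	}
	return clamp(budget/8, 256*mb, 1024*mb)
}

func main() {
	budget := 4096 * mb // a 4 GB host
	fmt.Println(sqliteCacheBytes(budget)/mb, "MB cache")
	fmt.Println(sqliteMmapBytes(budget)/mb, "MB mmap")
}
```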

Authentication

API auth (platform). API_KEY gates /api/*, OTLP HTTP (/v1/*), and the MCP endpoint via Authorization: Bearer <API_KEY>. When empty, the middleware is a pass-through (dev only). Unprotected paths: /live, /ready, /metrics*, /ws*. A shared API_KEY grants access to every tenant — there is no per-tenant-key file in the current code; isolate tenants at the network/auth layer if that matters. (If an API_TENANT_KEYS_FILE override lands later, re-check internal/api/auth.go for the flag name.)
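
The gate described above reduces to two checks: whether the path is exempt, and whether the Authorization header carries the configured key. The sketch below uses hypothetical function names and a constant-time comparison, which is an assumption here and not confirmed from internal/api/auth.go.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// exempt reports whether a path is served without authentication:
// /live, /ready, /metrics*, /ws*.
func exempt(path string) bool {
	return path == "/live" || path == "/ready" ||
		strings.HasPrefix(path, "/metrics") || strings.HasPrefix(path, "/ws")
}

// authorized checks an Authorization header value against apiKey.
// An empty apiKey means auth is disabled (pass-through).
func authorized(apiKey, header string) bool {
	if apiKey == "" {
		return true
	}
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	token := header[len(prefix):]
	return subtle.ConstantTimeCompare([]byte(token), []byte(apiKey)) == 1
}

func main() {
	fmt.Println(exempt("/ready"), exempt("/api/logs"))
	fmt.Println(authorized("s3cret", "Bearer s3cret"))
	fmt.Println(authorized("s3cret", "Bearer wrong"))
}
```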

Database auth (Azure Entra). Setting DB_AZURE_AUTH=true enables Azure Entra ID (AAD) authentication for PostgreSQL. The driver uses DefaultAzureCredential, which resolves identity via the standard probe order (env vars → workload identity → managed identity → Azure CLI → developer credentials). When Azure auth is enabled, strict TLS (sslmode=require, verify-ca, or verify-full) is mandatory; weaker modes are rejected at startup. DB_CONN_MAX_LIFETIME is internally capped to 30 minutes to stay inside the token TTL.

Retention & Maintenance

The RetentionScheduler in internal/storage/ runs an hourly batched purge of data older than HOT_RETENTION_DAYS via PurgeLogsBatched, PurgeTracesBatched, and PurgeMetricBucketsBatched, plus a daily maintenance pass. On SQLite the daily pass runs PRAGMA optimize and PRAGMA incremental_vacuum(10000); other drivers keep their ANALYZE-equivalent maintenance as before. The historical full VACUUM held an exclusive whole-DB lock for 10–60 min on multi-GB files, starving ingest into a 429 storm; restore it with RETENTION_FULL_VACUUM=true or run POST /api/admin/vacuum on demand. Pre-existing DB files keep their auto_vacuum mode until a manual full VACUUM rewrites them, so incremental_vacuum no-ops harmlessly there. Purge is cross-tenant: it scopes by age, not tenant_id. HOT_RETENTION_DAYS is clamped to the range 1..36500.

Failure-mode gauges (prefix otelcontext_):

retention_consecutive_failures — reset to 0 on success; alert when > 3
retention_last_success_timestamp — Unix seconds; alert when stale relative to the hourly tick
retention_rows_purged_total, retention_purge_duration_seconds, retention_vacuum_duration_seconds — throughput and latency

Security & Supply Chain

OtelContext targets the OpenSSF Best Practices passing badge (project 12646) and ships a six-job OSS-CLI security stack, supplemented by SonarCloud SAST as a required gate (board reversal 2026-04-28). No CodeQL, no NVD-direct tooling. Cost: $0 for the OSS-CLI tier; SonarCloud is free for public repos.

OSS-CLI security stack (`.github/workflows/security.yml`)

Concern	Tool	Gate
SCA (Go modules + npm)	OSV-Scanner against `go.mod` + `ui/package-lock.json` (OSV.dev / GHSA / ecosystem feeds; not NVD)	Block merge on High/Critical
SCA (filesystem + OS) + container scan	Trivy filesystem scan; Dependabot surfaces advisories on the Security tab	Block merge on `severity: HIGH,CRITICAL`, `exit-code: 1`, `ignore-unfixed: true`
SAST	Semgrep (`p/security-audit` + `p/owasp-top-ten` + `p/golang`)	Block merge on `--severity ERROR`
Secret scan	Gitleaks (full git history)	Block merge on any finding
Duplication	jscpd, threshold 3%, `--min-tokens 100`, scoped to `internal/` + `ui/src/`, excludes tests, vendor, build artifacts, and the legacy `internal/graph/` package	Block merge above threshold
SBOM	`anchore/sbom-action` (SPDX + CycloneDX)	Surface as 90-day artifact; do not gate merge
Lint (Go)	`golangci-lint` (existing `.golangci.yml`)	Wired into `ci.yml`, not security.yml

All actions are SHA-pinned per Scorecard Pinned-Dependencies. Top-level permissions: read-all; jobs scope up only when needed (gitleaks needs full history; sbom uploads).

Required external gate: SonarCloud Code Analysis. Runs as the SonarCloud GitHub App (no in-repo workflow); listed in main branch protection's required_status_checks since 2026-04-28. Reinstated by board reversal — earlier docs that said "do not re-introduce" are superseded.

Not used (do not re-introduce without an explicit board reversal): CodeQL (GHAS-paid for non-public repos), OWASP Dependency-Check (or any NVD-direct tool — NVD has analysis-backlog and rate-limit reliability problems).

OpenSSF Scorecard (`.github/workflows/scorecard.yml`)

Schedule: push to main + Mondays 06:00 UTC + manual workflow_dispatch.
Output: SARIF → Security tab; results published to public Scorecard dashboard.
Hardening: step-security/harden-runner (egress: audit), actions/checkout with persist-credentials: false.
Baseline: to be measured after first push to main. Track via the Scorecard dashboard linked from the README badge.
Stretch target: ≥ 8.0/10. Best-effort — Scorecard does not gate merge per the board ruling. The passing Best Practices badge is the only hard supply-chain gate.

Vulnerability reporting

See SECURITY.md. Preferred channel: GitHub Security Advisories at https://github.com/RandomCodeSpace/otelcontext/security/advisories/new. Email fallback: ak.nitrr13@gmail.com with subject prefix [otelcontext security].

Signed commits & branch protection

Repo-local config helper: scripts/setup-git-signed.sh — supports ssh, openpgp, and x509 signing; honours the contributor's existing global git identity.
Branch protection on main requiring signed commits is configured at the GitHub repo level (board-admin action; not file-driven). When toggled on, every commit landing on main must verify.

Self-assessment evidence

.bestpractices.json — OpenSSF Best Practices evidence map (project 12646, level passing, six categories self-assessed). The badge level transition from in_progress → passing requires a board admin to log into bestpractices.dev with the OSS-Random identity.

Build & Run

go build -o otelcontext .        # Build
./otelcontext                     # Run (default: SQLite, ports 4317/8080)
go vet ./...                      # Lint
go test ./...                     # Test
