OtelContext — AI Agent Instructions

Project Overview

OtelContext is a self-hosted OTLP observability platform. Single Go binary with embedded React frontend.

  • Backend: Go 1.25, native net/http (no frameworks), GORM ORM, gRPC + HTTP for OTLP ingestion
  • Frontend: React 19 + TypeScript + TanStack Query/Virtual + wouter + Radix primitives + cmdk palette + hand-rolled token CSS (ui/src/styles/tokens.css) with CSS Modules; the flow map uses the project's own deterministic SVG layout code. No UI framework: @ossrandom/design-system, cytoscape, and uplot were all removed (rewrite completed 2026-06-12, phases C1–C7)
  • Ports: gRPC :4317 (OTLP), HTTP :8080 (API + HTTP OTLP + WebSocket + UI)

Strict Rules

  • NO Express.js/Gin/Echo — use native Go net/http
  • NO Tailwind CSS, NO Mantine, NO component frameworks — UI styling is the hand-rolled token sheet (ui/src/styles/tokens.css) + per-component CSS Modules; Radix primitives (unstyled) only for the a11y-hard parts (dialog/tabs/tooltip/dropdown). Token values only — no raw hex outside tokens.css.
  • Single-service architecture (no microservices split)
  • All internal DBs must be embedded (no external processes)
  • Relational DB (SQLite/MySQL/PostgreSQL/MSSQL) is the single source of truth
  • Prioritize self-hosted, open-source solutions
  • The internal/graph/ package is legacy — use internal/graphrag/ for all new graph work

Architecture

gRPC :4317 (OTLP Ingest) ──► Ingestion Layer ──► Storage (GORM)
HTTP :8080/v1/* (OTLP HTTP)─┘       │                    │
                                     ▼                    ▼
                               In-Memory Accel.      Relational DB
                               (TSDB Ring,           (Source of Truth,
                                GraphRAG)             7-15 day retention)
                                     │
HTTP :8080 ◄── REST API ◄───────────┘
           ◄── WebSocket (real-time)
           ◄── MCP Server (AI agents, 7-tool triage surface)
           ◄── Prometheus /metrics

Ingestion Paths

| Path | Endpoint | Content Types | Notes |
| --- | --- | --- | --- |
| gRPC | :4317 | protobuf | Traces, Logs, Metrics via OTLP gRPC |
| HTTP | /v1/traces, /v1/logs, /v1/metrics | application/x-protobuf, application/json | OTLP HTTP spec compliant, gzip support, 4MB limit. Returns 429 Too Many Requests + Retry-After: 1 when the async pipeline queue is full (parity with gRPC RESOURCE_EXHAUSTED). |

Both paths delegate to the same Export() methods — zero business logic duplication. By default Export() parses the OTLP request and hands a Batch to the async ingest Pipeline (internal/ingest/pipeline.go); a worker pool persists Trace→Span→Log in order. With INGEST_ASYNC_ENABLED=false the pipeline is bypassed and Export() writes inline (legacy path).
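The queue-full behaviour on the HTTP path can be sketched as a non-blocking enqueue that answers 429 with Retry-After: 1 on overflow. This is a minimal illustration, not the project's handler: batch, tryEnqueue, ingestHandler, and post are hypothetical names, and the 400 response for an unreadable body is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// batch stands in for a parsed OTLP payload (hypothetical type).
type batch []byte

// tryEnqueue performs a non-blocking send; false means the queue is full.
func tryEnqueue(q chan batch, b batch) bool {
	select {
	case q <- b:
		return true
	default:
		return false
	}
}

// ingestHandler reads at most 4 MB and rejects with 429 + Retry-After
// when the queue cannot take another batch.
func ingestHandler(q chan batch) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 4<<20))
		if err != nil {
			// Assumption: unreadable or oversized bodies are answered with 400.
			w.WriteHeader(http.StatusBadRequest)
			return
		}
		if !tryEnqueue(q, batch(body)) {
			w.Header().Set("Retry-After", "1")
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

// post sends one request through the handler and reports the status code
// and the Retry-After header.
func post(h http.Handler, payload string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/v1/traces", strings.NewReader(payload))
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	q := make(chan batch, 1) // capacity 1 so the second request overflows
	h := ingestHandler(q)
	fmt.Println(post(h, "first"))
	fmt.Println(post(h, "second"))
}
```

Nothing drains the queue in this sketch, so the second request is the one that overflows; in a real pipeline the worker pool would be consuming from the channel.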

Multi-tenancy

Tenant identity flows into the request context on every write and read:

  • HTTP: X-Tenant-ID header (see internal/api/tenant_middleware.go).
  • gRPC: x-tenant-id metadata key (see internal/ingest/otlp.go).
  • OTLP resource attribute: tenant.id on the resource overrides the header/metadata.

When none are present, DEFAULT_TENANT (default "default") is assigned. Every row in the relational DB carries a tenant_id column; every read method in internal/storage/ scopes by the tenant in the request context (Where("tenant_id = ?", tenant)). Retention (RetentionScheduler) is cross-tenant — it purges by age, not by tenant.
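The precedence described above (resource attribute, then header or metadata, then DEFAULT_TENANT) can be sketched as follows. The function names and the context key are hypothetical and are not taken from internal/api/tenant_middleware.go.

```go
package main

import (
	"context"
	"fmt"
)

// tenantKey is an unexported context key type (hypothetical).
type tenantKey struct{}

// resolveTenant picks the tenant in the documented order: the tenant.id
// resource attribute wins, then the transport value (X-Tenant-ID header or
// x-tenant-id metadata), then the configured default.
func resolveTenant(resourceAttr, transport, fallback string) string {
	if resourceAttr != "" {
		return resourceAttr
	}
	if transport != "" {
		return transport
	}
	return fallback
}

// withTenant stores the resolved tenant in the request context.
func withTenant(ctx context.Context, tenant string) context.Context {
	return context.WithValue(ctx, tenantKey{}, tenant)
}

// tenantFrom reads the tenant back, using the fallback when none was set.
func tenantFrom(ctx context.Context, fallback string) string {
	if t, ok := ctx.Value(tenantKey{}).(string); ok && t != "" {
		return t
	}
	return fallback
}

func main() {
	ctx := withTenant(context.Background(), resolveTenant("", "acme", "default"))
	fmt.Println(tenantFrom(ctx, "default"))
	fmt.Println(tenantFrom(context.Background(), "default"))
}
```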

Storage Architecture

| Layer | Package | Purpose |
| --- | --- | --- |
| GraphRAG (in-memory) | internal/graphrag/ | Layered graph: 4 typed stores, error chains, root cause analysis, anomaly detection |
| Time Series (in-memory) | internal/tsdb/ | Ring buffer, sliding windows, pre-computed percentiles |
| Graph (in-memory, legacy) | internal/graph/ | Simple service topology — being replaced by GraphRAG |
| Relational (persistent) | internal/storage/ | GORM-based, multi-DB, single source of truth. Driven by RetentionScheduler (hourly batched purge + daily VACUUM/ANALYZE). logs.body is plain TEXT. Log search: SQLite FTS5 (logs_fts, porter+unicode61, ordered by bm25(), AFTER INSERT/DELETE/UPDATE triggers) is the default path — LOG_FTS_ENABLED defaults to true when DB_DRIVER=sqlite and false otherwise. Operators who want the ~30% disk savings can set LOG_FTS_ENABLED=false and reclaim the FTS table + indexes via POST /api/admin/drop_fts. Postgres uses pg_trgm GIN on logs.body and logs.service_name. AttributesJSON and AIInsight remain CompressedText. The search_logs MCP tool and the API /api/logs?q=… filter are clamped to the last 24 hours to bound the LIKE-fallback worst case. The vectordb package (TF-IDF semantic search) was removed on 2026-05-24 alongside the find_similar_logs MCP tool — data/vectordb.snapshot is left on disk for operators to delete by hand. |

GraphRAG Architecture

The internal/graphrag/ package is the core intelligence layer. It replaces the simple internal/graph/ for advanced observability queries.

Layered Stores (each with own sync.RWMutex)

| Store | Nodes | Edges | TTL |
| --- | --- | --- | --- |
| ServiceStore | ServiceNode, OperationNode | CALLS, EXPOSES | Permanent |
| TraceStore | TraceNode, SpanNode | CONTAINS, CHILD_OF | Configurable (default 1h) |
| SignalStore | LogClusterNode, MetricNode | EMITTED_BY, MEASURED_BY, LOGGED_DURING | Permanent |
| AnomalyStore | AnomalyNode | PRECEDED_BY, TRIGGERED_BY | 24h |

Node Types (7)

ServiceNode, OperationNode, TraceNode, SpanNode, LogClusterNode, MetricNode, AnomalyNode

Edge Types (9)

CALLS, EXPOSES, CONTAINS, CHILD_OF, EMITTED_BY, LOGGED_DURING, MEASURED_BY, PRECEDED_BY, TRIGGERED_BY

Query Functions

| Function | Algorithm | Purpose |
| --- | --- | --- |
| ErrorChain(service, timeRange) | BFS upstream via CHILD_OF + CALLS | Trace error to responsible service |
| ImpactAnalysis(service, depth) | BFS downstream via CALLS | Blast radius |
| RootCauseAnalysis(service, timeRange) | ErrorChain + anomaly correlation | Ranked probable causes with evidence |
| DependencyChain(traceID) | Tree from CONTAINS + CHILD_OF | Full trace visualization |
| CorrelatedSignals(service, timeRange) | Gather all edges | Related logs/metrics/traces |
| ShortestPath(from, to) | Dijkstra weighted by inverse call freq | Service communication path |
| AnomalyTimeline(since) | Time-sorted anomalies + PRECEDED_BY | Recent anomaly overview |
| ServiceMap(depth) | Full topology dump | Service topology + health |
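As an illustration of the depth-bounded BFS that ImpactAnalysis is described as using, here is a minimal sketch over a plain adjacency map. The calls type and the impact function are hypothetical; the real stores, node types, and edge direction conventions in internal/graphrag/ may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// calls maps a service to the services it calls (CALLS edges).
// Hypothetical shape, used only for this sketch.
type calls map[string][]string

// impact returns every service reachable from start within maxDepth hops
// along CALLS edges, sorted by name. The start service is not included.
func impact(g calls, start string, maxDepth int) []string {
	visited := map[string]bool{start: true}
	frontier := []string{start}
	var out []string
	for depth := 0; depth < maxDepth && len(frontier) > 0; depth++ {
		var next []string
		for _, svc := range frontier {
			for _, callee := range g[svc] {
				if visited[callee] {
					continue // already counted; also guards against cycles
				}
				visited[callee] = true
				out = append(out, callee)
				next = append(next, callee)
			}
		}
		frontier = next
	}
	sort.Strings(out)
	return out
}

func main() {
	g := calls{
		"gateway":  {"orders", "auth"},
		"orders":   {"payments", "inventory"},
		"payments": {"ledger"},
		"auth":     {"gateway"}, // cycle back to the start
	}
	fmt.Println(impact(g, "gateway", 1))
	fmt.Println(impact(g, "gateway", 2))
}
```

The visited set keeps the traversal linear in the number of edges and makes cycles in the service graph harmless.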

Background Processes

  • 4 event workers consume from a 10,000-capacity buffered channel (best-effort; DB is source of truth)
  • Refresh loop (60s) — rebuilds from DB, prunes expired TraceStore nodes, cleans old anomalies
  • Snapshot loop (15min) — persists Drain templates so cluster IDs survive restart (the graph_snapshots write side was removed on 2026-05-24; the loop name is retained for wiring stability)
  • Anomaly loop (10s) — detects error spikes, latency degradation, metric z-score anomalies
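The z-score check mentioned for the anomaly loop can be illustrated with a small standalone function. This is a sketch under stated assumptions: the population standard deviation, the minimum of two samples, and the threshold of 3 are illustrative choices, not values taken from internal/graphrag/anomaly.go.

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations x lies from the mean of
// history. The boolean is false when the score is undefined (fewer than two
// samples, or zero variance).
func zScore(history []float64, x float64) (float64, bool) {
	if len(history) < 2 {
		return 0, false
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))
	var sq float64
	for _, v := range history {
		d := v - mean
		sq += d * d
	}
	std := math.Sqrt(sq / float64(len(history))) // population std deviation
	if std == 0 {
		return 0, false
	}
	return (x - mean) / std, true
}

// isAnomalous flags x when its absolute z-score reaches the threshold.
func isAnomalous(history []float64, x, threshold float64) bool {
	z, ok := zScore(history, x)
	return ok && math.Abs(z) >= threshold
}

func main() {
	latencies := []float64{10, 12, 11, 9, 10, 12, 11, 9}
	fmt.Println(isAnomalous(latencies, 30, 3)) // far outside the usual range
	fmt.Println(isAnomalous(latencies, 11, 3)) // within the usual range
}
```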

Persistence Models (GORM)

  • Investigation — automated error analysis records (trigger, root cause, causal chain, evidence)
  • DrainTemplateRow — persisted Drain log templates (table drain_templates), loaded on startup to warm the miner

Note: GraphSnapshot (table graph_snapshots) was removed on 2026-05-24. AutoMigrate no longer creates the table on fresh deploys; existing populated tables are left in place. Operators can run DROP TABLE graph_snapshots; VACUUM; to reclaim disk.

Log Clustering (Drain)

Log clustering uses Drain template mining (internal/graphrag/drain.go) — a deterministic fixed-depth prefix tree with O(1) LRU via container/list. Templates are persisted to the drain_templates table and reloaded on startup so cluster IDs stay stable across restarts.
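The "O(1) LRU via container/list" idea can be sketched on its own: a map gives constant-time lookup and a doubly linked list keeps recency order. This is not the code in drain.go; the lru type, its fields, and the string key and template values are hypothetical, and the capacity is assumed to be at least 1.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is what each list element carries, so eviction can find the map key.
type entry struct {
	key      string
	template string
}

// lru keeps at most capacity entries; the list front is the most recently used.
type lru struct {
	capacity int
	order    *list.List
	items    map[string]*list.Element
}

func newLRU(capacity int) *lru {
	return &lru{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the template and marks the key as recently used.
func (c *lru) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(entry).template, true
}

// Put inserts or updates a key and evicts the least recently used entry
// when the capacity is exceeded.
func (c *lru) Put(key, template string) {
	if el, ok := c.items[key]; ok {
		el.Value = entry{key: key, template: template}
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(entry{key: key, template: template})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(entry).key)
	}
}

func main() {
	c := newLRU(2)
	c.Put("a", "user <*> logged in")
	c.Put("b", "request took <*> ms")
	c.Get("a")                       // "a" becomes most recently used
	c.Put("c", "connection to <*> lost") // evicts "b"
	_, ok := c.Get("b")
	fmt.Println(ok)
}
```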

Ingestion Callbacks

TraceServer.Export() → DB persist → spanCallback → GraphRAG.OnSpanIngested()
LogsServer.Export()  → DB persist → logCallback  → GraphRAG.OnLogIngested()
MetricsServer.Export() → TSDB    → metricCallback → GraphRAG.OnMetricIngested()

MCP Server — 7-Tool Triage Surface

The MCP server (internal/mcp/) exposes a focused 7-tool triage surface via HTTP Streamable MCP (JSON-RPC 2.0 POST + SSE GET). The surface was reduced from 21 → 7 on 2026-05-24 so the platform survives 120 services on SQLite — see docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md for the full rationale.

| Tool | Input | Source |
| --- | --- | --- |
| get_anomaly_timeline | {since?, service?} | In-memory (instant) — triage entry point |
| get_service_map | {depth?, service?} | In-memory (instant) — topology + health overlay |
| get_service_health | {service_name} | In-memory (instant) — per-service drill-down |
| root_cause_analysis | {service, time_range?} | In-memory (instant) — ranked probable causes |
| impact_analysis | {service, depth?} | In-memory (instant) — blast radius |
| trace_graph | {trace_id} | In-memory + DB fallback — trace tree visualisation |
| search_logs | {query?, severity?, service?, trace_id?, start?, end?, limit?, page?} | DB (FTS5 default on SQLite, LIKE fallback, 24h-clamped) |

Cut tools (clients now receive an unknown tool RPC error): get_system_graph, tail_logs, get_trace, search_traces, get_metrics, get_dashboard_stats, get_storage_status, find_similar_logs, get_alerts, correlated_signals, get_error_chains, get_investigations, get_investigation, get_graph_snapshot.

Cacheable surface (5s TTL via MCP_CACHE_TTL_MS): get_anomaly_timeline, get_service_map, get_service_health, root_cause_analysis, impact_analysis.

Every error-identifying tool returns a root_cause block:

{"root_cause": {"service": "...", "operation": "...", "error_message": "...", "span_id": "...", "trace_id": "..."}}

DLQ (Dead Letter Queue)

Uses typed envelopes for all data types:

{"type": "logs|spans|traces|metrics", "data": [...]}

Legacy format (raw []storage.Log JSON) is supported for backward compatibility.
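A reader for the two formats might look like the sketch below: a bare JSON array is treated as the legacy log payload, anything else is decoded as a typed envelope. The envelope struct and decode function are hypothetical and only mirror the JSON shape shown above; the actual code in internal/queue/ may detect the legacy format differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// envelope mirrors the typed DLQ format; Data stays raw so the caller can
// decode it into the right slice type for the given Type.
type envelope struct {
	Type string          `json:"type"`
	Data json.RawMessage `json:"data"`
}

// decode returns the payload type and the raw items. A top-level JSON array
// is assumed to be the legacy format and is reported as "logs".
func decode(raw []byte) (string, json.RawMessage, error) {
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) > 0 && trimmed[0] == '[' {
		if !json.Valid(trimmed) {
			return "", nil, fmt.Errorf("invalid legacy payload")
		}
		return "logs", json.RawMessage(trimmed), nil
	}
	var env envelope
	if err := json.Unmarshal(trimmed, &env); err != nil {
		return "", nil, fmt.Errorf("decode envelope: %w", err)
	}
	switch env.Type {
	case "logs", "spans", "traces", "metrics":
		return env.Type, env.Data, nil
	default:
		return "", nil, fmt.Errorf("unknown envelope type %q", env.Type)
	}
}

func main() {
	kind, data, err := decode([]byte(`{"type":"spans","data":[{"id":1}]}`))
	fmt.Println(kind, string(data), err)

	kind, data, err = decode([]byte(`[{"body":"legacy log"}]`))
	fmt.Println(kind, string(data), err)
}
```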

Shutdown Order

Proper LIFO ordering to prevent data loss:

  1. gRPC GracefulStop() + HTTP Shutdown() — stop ingestion
  2. WebSocket Hub + Event Hub + AI Service — stop real-time
  3. TSDB + Graph + GraphRAG — stop processing
  4. DLQ — stop replay
  5. RetentionScheduler Stop() — halt purge/maintenance ticks
  6. DB Close() — close database last
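One common way to get this ordering is to register a stop function as each component starts and run them in reverse at shutdown. The sketch below assumes components start in the opposite order of the list above (database first, listeners last); shutdownStack and its methods are hypothetical names, not the project's wiring.

```go
package main

import (
	"errors"
	"fmt"
)

// closer pairs a component name with its stop function.
type closer struct {
	name string
	stop func() error
}

// shutdownStack records closers in startup order and runs them in reverse.
type shutdownStack struct {
	closers []closer
}

// push registers a component; call it right after the component starts.
func (s *shutdownStack) push(name string, stop func() error) {
	s.closers = append(s.closers, closer{name: name, stop: stop})
}

// run stops every component last-in, first-out. It keeps going after a
// failure and returns the execution order plus any joined errors.
func (s *shutdownStack) run() ([]string, error) {
	var order []string
	var errs []error
	for i := len(s.closers) - 1; i >= 0; i-- {
		c := s.closers[i]
		order = append(order, c.name)
		if err := c.stop(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", c.name, err))
		}
	}
	return order, errors.Join(errs...)
}

func main() {
	var s shutdownStack
	noop := func() error { return nil }
	// Assumed startup order: storage first, ingestion listeners last.
	for _, name := range []string{"db", "retention", "dlq", "processing", "realtime", "ingest"} {
		s.push(name, noop)
	}
	order, err := s.run()
	fmt.Println(order, err)
}
```

Continuing past a failed stop is a design choice in this sketch: it gives the database a chance to close even if an earlier component fails to stop cleanly.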

Key Directories

internal/
  ai/           # AI service integration
  api/          # HTTP handlers, middleware, rate limiting, graph_handler
  cache/        # TTL cache with synchronized Stop()
  compress/     # Zstd compression utilities
  config/       # Environment configuration (40+ fields)
  graph/        # LEGACY in-memory service graph — use graphrag/ for new work
  graphrag/     # GraphRAG: layered graph, error chains, anomaly detection, investigations
    schema.go       # 7 node types, 9 edge types, query result types
    store.go        # 4 typed stores (Service, Trace, Signal, Anomaly)
    builder.go      # Event workers, ingestion callbacks, GraphRAG coordinator
    queries.go      # ErrorChain, ImpactAnalysis, RootCause, ShortestPath, etc.
    investigation.go # GORM Investigation model + persistence
    anomaly.go      # Z-score, error spike, latency degradation detection
    drain.go        # Log clustering via Drain template mining — pure-Go, stdlib-only, deterministic fixed-depth prefix tree
    refresh.go      # Periodic DB rebuild + pruning + Drain template persistence
  ingest/       # OTLP receivers (gRPC + HTTP), adaptive sampling
    otlp.go         # gRPC TraceServer, LogsServer, MetricsServer
    otlp_http.go    # HTTP OTLP handler (protobuf + JSON, gzip, 4MB limit)
    sampler.go      # Per-service token bucket sampler
  mcp/          # MCP server (7-tool triage surface, JSON-RPC 2.0 + SSE)
  queue/        # Dead Letter Queue (typed envelopes, bounded disk, exp backoff)
  realtime/     # WebSocket hub + event streaming
  storage/      # GORM repository, models, migrations, Close() method, SQLite PRAGMA stanza
  telemetry/    # Prometheus metrics + health (19 metrics)
  tsdb/         # Time series aggregator + ring buffer (lock-free Windows())
  ui/           # Embedded React frontend
ui/             # React frontend (Vite + token CSS Modules, no UI framework)
test/           # Microservice simulation (7 services)
docs/           # Specifications and plans

Configuration (Environment Variables)

Key settings in internal/config/config.go:

  • HTTP_PORT (8080), GRPC_PORT (4317), DB_DRIVER (sqlite), DB_DSN
  • DB_AUTOMIGRATE (true), DB_MAX_OPEN_CONNS, DB_MAX_IDLE_CONNS, DB_CONN_MAX_LIFETIME (internally capped to 30m when DB_AZURE_AUTH=true)
  • DB_AZURE_AUTH (false) — see Authentication below
  • TLS_CERT_FILE, TLS_KEY_FILE — explicit TLS (both or neither)
  • TLS_AUTO_SELFSIGNED (false), TLS_CACHE_DIR (./data/tls) — self-signed bootstrap, ignored if cert files set
  • API_KEY — Bearer token gate for /api/*, /v1/*, /mcp. Empty = auth disabled
  • OTEL_EXPORTER_OTLP_ENDPOINT — enables self-instrumentation (empty = off)
  • DEFAULT_TENANT (default) — assigned to rows ingested without explicit tenant
  • HOT_RETENTION_DAYS (7) — drives RetentionScheduler; range 1..36500
  • SAMPLING_RATE (1.0), SAMPLING_ALWAYS_ON_ERRORS (true), SAMPLING_LATENCY_THRESHOLD_MS (500)
  • METRIC_MAX_CARDINALITY (10000), METRIC_MAX_CARDINALITY_PER_TENANT (0 = unlimited), API_RATE_LIMIT_RPS (100). The per-tenant cap is checked first; when set, a noisy tenant cannot exhaust the global pool. Overflow is labeled by tenant via otelcontext_tsdb_cardinality_overflow_by_tenant_total{tenant_id} (__global__ sentinel when the global cap was the trigger).
  • MCP_ENABLED (true), MCP_PATH (/mcp)
  • MCP_MAX_CONCURRENT (32), MCP_CALL_TIMEOUT_MS (30000), MCP_CACHE_TTL_MS (5000) — MCP HTTP streamable robustness. Counting semaphore gates concurrent tools/call (JSON-RPC -32000 past the cap), per-call deadlines abort runaway handlers (JSON-RPC -32001), and a 5s TTL cache memoizes the cheap in-memory GraphRAG tools (get_service_map, impact_analysis, root_cause_analysis, get_anomaly_timeline, get_service_health). SSE GET sends a : keep-alive\n\n comment every 25s to keep the stream alive across reverse-proxy idle timeouts. Set any to 0 to disable.
  • LOG_FTS_ENABLED — when truthy (true/yes/on/1), provisions the SQLite FTS5 logs_fts virtual table + sync triggers at startup; when false, log-search uses a 24h-clamped LIKE fallback. Defaults to true when DB_DRIVER=sqlite (BM25 is dramatically faster than LIKE on the kept search_logs MCP tool) and false otherwise. Toggle off and reclaim the ~30% disk overhead via POST /api/admin/drop_fts (refused while the flag is on). The vectordb-backed semantic-search path was removed on 2026-05-24.
  • DLQ_MAX_FILES (1000), DLQ_MAX_DISK_MB (500), DLQ_MAX_RETRIES (10)
  • GRAPHRAG_WORKER_COUNT (16), GRAPHRAG_EVENT_QUEUE_SIZE (100000; 10000 on SQLite) — sized for 100–200 services; raise further if otelcontext_graphrag_events_dropped_total climbs
  • GRAPHRAG_TRACE_TTL (1h; 30m on SQLite), GRAPHRAG_MAX_SPANS_PER_TENANT (500000), GRAPHRAG_TENANT_IDLE_TTL (24h) — in-memory GraphRAG memory bounds. Spans past the per-tenant cap are skipped from the graph only (DB unaffected; metered as otelcontext_graphrag_events_dropped_total{signal="span_capacity"}); tenant store slices idle past the TTL are evicted (default tenant immune, self-healing via the 60s rebuild). SignalStore metrics are bounded to 2000/tenant + 24h TTL (constants).
  • PPROF_ADDR (127.0.0.1:6060) — net/http/pprof on a dedicated loopback listener (never the public mux); empty disables. Startup also sets a soft GOMEMLIMIT (honors the env var, else 75% of the cgroup/host budget via internal/membudget).
  • INGEST_MIN_SEVERITY (INFO), STORE_MIN_SEVERITY ("" = same as ingest; defaults to "WARN" when DB_DRIVER=sqlite) — two-tier log severity gate. The ingest gate runs at the OTLP receiver and drops any log below the threshold entirely (no in-memory enrichment either). The store gate runs at the persist boundary inside the async pipeline (internal/ingest/pipeline.go:process) and only skips the DB row write — the log still flows through LogCallback so GraphRAG Drain template mining and span/trace correlation see it. Use case: INGEST_MIN_SEVERITY=DEBUG STORE_MIN_SEVERITY=WARN keeps SQLite small while letting in-memory anomaly detection benefit from the verbose stream. Setting STORE_MIN_SEVERITY below INGEST_MIN_SEVERITY is a no-op (logged as a warning at startup). Drops surface via Pipeline.Stats().StoreFiltered.
  • INGEST_ASYNC_ENABLED (true), INGEST_PIPELINE_QUEUE_SIZE (50000), INGEST_PIPELINE_WORKERS (8), INGEST_PIPELINE_MAX_BYTES (536870912 = 512 MB; 128 MB on SQLite) — async ingest pipeline (internal/ingest/pipeline.go). Hybrid backpressure: <90% accept all, 90–100% drop healthy batches (errors/slow always pass), 100% return gRPC RESOURCE_EXHAUSTED. The byte cap bounds queue memory regardless of item count — at the cap even priority batches get RESOURCE_EXHAUSTED/429 (a 429 is recoverable, an OOM kill is not); watch otelcontext_ingest_pipeline_queue_bytes and reason bytes_full. Set INGEST_ASYNC_ENABLED=false to revert to synchronous DB writes inside Export(). Drops surface as otelcontext_ingest_pipeline_dropped_total{signal,reason}.
  • GRPC_MAX_RECV_MB (16), GRPC_MAX_CONCURRENT_STREAMS (1000) — OTLP gRPC server caps, validated to 1..256 and 1..1_000_000
  • RETENTION_BATCH_SIZE (50000), RETENTION_BATCH_SLEEP_MS (1) — purge pacing; raise the sleep on busy production DBs
  • DB_POSTGRES_PARTITIONING (""), DB_PARTITION_LOOKAHEAD_DAYS (3) — opt-in Postgres declarative range partitioning of the logs table by day. When daily, logs is provisioned as a partitioned parent (greenfield only — refuses to start if logs already exists unpartitioned), the PartitionScheduler maintains lookahead partitions and drops expired ones via DROP TABLE, and RetentionScheduler skips the row-level DELETE for logs. Watch otelcontext_partitions_dropped_total and otelcontext_partitions_active.
  • APP_ENV ("development"), OTELCONTEXT_ALLOW_SQLITE_PROD (false) — SQLite is refused when APP_ENV=production unless the allow flag is set
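The hybrid backpressure rule described for the async ingest pipeline can be written as a small admission function. This is a sketch of the stated thresholds only; the verdict type, the admit signature, and the choice to compare the current queue occupancy (rather than occupancy plus the incoming batch) are assumptions.

```go
package main

import "fmt"

// verdict is the outcome of an admission check (hypothetical type).
type verdict int

const (
	accept      verdict = iota // enqueue the batch
	dropHealthy                // silently drop a non-priority batch
	reject                     // RESOURCE_EXHAUSTED on gRPC, 429 on HTTP
)

func (v verdict) String() string {
	return [...]string{"accept", "dropHealthy", "reject"}[v]
}

// admit applies the documented tiers:
//   - byte cap reached or queue 100% full: reject everything, priority included
//   - queue at 90% or more: drop healthy batches, let error/slow batches pass
//   - below 90%: accept everything
func admit(queued, capacity int, queuedBytes, maxBytes int64, priority bool) verdict {
	if queuedBytes >= maxBytes || queued >= capacity {
		return reject
	}
	if queued*10 >= capacity*9 && !priority {
		return dropHealthy
	}
	return accept
}

func main() {
	const maxBytes = 512 << 20
	fmt.Println(admit(100, 1000, 0, maxBytes, false))       // well below 90%
	fmt.Println(admit(950, 1000, 0, maxBytes, false))       // healthy batch at 95%
	fmt.Println(admit(950, 1000, 0, maxBytes, true))        // error/slow batch at 95%
	fmt.Println(admit(10, 1000, maxBytes, maxBytes, true))  // byte cap reached
}
```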

SQLite per-driver defaults (auto-flipped when DB_DRIVER=sqlite)

So that a 100+ service deployment on SQLite survives without OOM, config.Load() overrides the defaults listed in the table below at the end of the Load() pass, but only when the operator did not explicitly set the env var (detected via os.LookupEnv presence, not value comparison). Postgres/MSSQL/MySQL paths are untouched.

| Env var | SQLite default | Postgres default | Rationale |
| --- | --- | --- | --- |
| DB_MAX_OPEN_CONNS | 1 | 50 | SQLite is single-writer; extra conns are wasted slots. |
| DB_MAX_IDLE_CONNS | 1 | 10 | Match open conns. |
| INGEST_PIPELINE_WORKERS | 2 | 8 | Workers all serialise through the SQLite writer lock; 2 is enough to keep the queue non-empty. |
| INGEST_PIPELINE_QUEUE_SIZE | 10000 | 50000 | Lower heap watermark; backpressure kicks in earlier so OTLP clients back off. |
| INGEST_PIPELINE_MAX_BYTES | 128 MB | 512 MB | Item count alone cannot bound queue memory; one batch may carry MBs of spans/logs. |
| GRAPHRAG_EVENT_QUEUE_SIZE | 10000 | 100000 | Each queued event embeds a Span/Log by value (~0.5–2 KB); buffer less, drop sooner (metered). |
| GRAPHRAG_TRACE_TTL | 30m | 1h | The in-memory span window is the largest legitimate GraphRAG consumer; anomaly/investigation lookbacks are ≤5min. |
| METRIC_MAX_CARDINALITY | 3000 | 10000 | Bound the in-memory TSDB series map. |
| STORE_MIN_SEVERITY | "WARN" | "" | Skip INFO/DEBUG persists; in-memory GraphRAG/Drain still sees them. |
| SAMPLING_RATE | 0.05 | 1.0 | Errors and slow spans are always kept by SAMPLING_ALWAYS_ON_ERRORS. |
| GRPC_MAX_CONCURRENT_STREAMS | 240 | 1000 | ~2 streams per service at 120 services with headroom. |
| LOG_FTS_ENABLED | true | n/a | FTS5 BM25 is dramatically faster than LIKE on the kept search_logs path. |
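The presence-based override (os.LookupEnv rather than comparing values) matters because an operator may explicitly set a variable to the generic default and expect it to stick. A minimal sketch, with a hypothetical intSetting helper and an injectable lookup so the behaviour can be exercised without touching the process environment; falling back to the driver default on an unparsable value is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc has the same shape as os.LookupEnv.
type lookupFunc func(string) (string, bool)

// fromMap builds a lookupFunc backed by a map, for demonstration.
func fromMap(m map[string]string) lookupFunc {
	return func(k string) (string, bool) {
		v, ok := m[k]
		return v, ok
	}
}

// intSetting returns the operator's value when the variable is present,
// otherwise the per-driver default. Presence decides, not the value itself.
func intSetting(lookup lookupFunc, name string, generic, sqlite int, driver string) int {
	if raw, ok := lookup(name); ok {
		if v, err := strconv.Atoi(raw); err == nil {
			return v
		}
		// Assumption: an unparsable value falls through to the defaults.
	}
	if driver == "sqlite" {
		return sqlite
	}
	return generic
}

func main() {
	// Real environment: prints 2 on SQLite unless the variable is set.
	fmt.Println(intSetting(os.LookupEnv, "INGEST_PIPELINE_WORKERS", 8, 2, "sqlite"))

	// Explicitly set to the generic default: the SQLite override must not apply.
	explicit := fromMap(map[string]string{"INGEST_PIPELINE_WORKERS": "8"})
	fmt.Println(intSetting(explicit, "INGEST_PIPELINE_WORKERS", 8, 2, "sqlite"))
}
```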

Also at SQLite startup, internal/storage/factory.go applies a fail-closed PRAGMA stanza: journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY, wal_autocheckpoint=10000, journal_size_limit=67108864 (64 MB WAL cap), busy_timeout=5000, plus budget-scaled memory knobs: page cache = budget/32 clamped to [64 MB, 256 MB] and mmap = budget/8 clamped to [256 MB, 1 GB], where the budget comes from internal/membudget (cgroup v2 → v1 → /proc/meminfo; a 4 GB host gets 128 MB cache + 512 MB mmap, detection failure falls back to the 256 MB/1 GB ceilings). Operators override with SQLITE_CACHE_SIZE_KB / SQLITE_MMAP_SIZE_BYTES. With the pure-Go driver the page cache is Go-heap memory and competes with GOMEMLIMIT. PRAGMA auto_vacuum=INCREMENTAL is attempted best-effort before the WAL switch (the WAL header freezes the stored mode; only affects newly created DB files). Any pragma failure in the fail-closed stanza aborts startup with a wrapped error — these are not optional. See docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md for per-default reasoning.
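The budget-scaled memory knobs reduce to two clamped divisions, which the following sketch reproduces from the figures above. The function names are hypothetical, and treating a non-positive budget as "detection failed" is an assumption about how internal/membudget reports failure.

```go
package main

import "fmt"

const MiB int64 = 1 << 20

// clamp limits v to the inclusive range [lo, hi].
func clamp(v, lo, hi int64) int64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// sqliteMemory derives the page cache and mmap sizes in bytes from the
// memory budget: cache = budget/32 in [64 MB, 256 MB], mmap = budget/8 in
// [256 MB, 1 GB]. A non-positive budget falls back to the ceilings.
func sqliteMemory(budget int64) (cache, mmap int64) {
	if budget <= 0 {
		return 256 * MiB, 1024 * MiB
	}
	return clamp(budget/32, 64*MiB, 256*MiB), clamp(budget/8, 256*MiB, 1024*MiB)
}

func main() {
	for _, gb := range []int64{1, 4, 64} {
		cache, mmap := sqliteMemory(gb * 1024 * MiB)
		fmt.Printf("%d GB host: cache=%d MB mmap=%d MB\n", gb, cache/MiB, mmap/MiB)
	}
}
```

For a 4 GB budget this gives 4096/32 = 128 MB of page cache and 4096/8 = 512 MB of mmap, matching the example in the paragraph above; small hosts hit the lower clamps and large hosts hit the ceilings.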

Authentication

API auth (platform). API_KEY gates /api/*, OTLP HTTP (/v1/*), and the MCP endpoint via Authorization: Bearer <API_KEY>. When empty, the middleware is a pass-through (dev only). Unprotected paths: /live, /ready, /metrics*, /ws*. A shared API_KEY grants access to every tenant — there is no per-tenant-key file in the current code; isolate tenants at the network/auth layer if that matters. (If an API_TENANT_KEYS_FILE override lands later, re-check internal/api/auth.go for the flag name.)

Database auth (Azure Entra). Setting DB_AZURE_AUTH=true enables Azure Entra ID (AAD) authentication for PostgreSQL. The driver uses DefaultAzureCredential, which resolves identity via the standard probe order (env vars → workload identity → managed identity → Azure CLI → developer credentials). When Azure auth is enabled, strict TLS (sslmode=require, verify-ca, or verify-full) is mandatory; weaker modes are rejected at startup. DB_CONN_MAX_LIFETIME is internally capped to 30 minutes to stay inside the token TTL.

Retention & Maintenance

The RetentionScheduler in internal/storage/ runs an hourly batched purge of data older than HOT_RETENTION_DAYS via PurgeLogsBatched, PurgeTracesBatched, and PurgeMetricBucketsBatched, plus a daily maintenance pass. On SQLite the daily pass runs PRAGMA optimize and PRAGMA incremental_vacuum(10000); other drivers keep their ANALYZE-equivalent maintenance as before. The historical full VACUUM held an exclusive whole-DB lock for 10–60 min on multi-GB files, starving ingest into a 429 storm; restore it with RETENTION_FULL_VACUUM=true or run POST /api/admin/vacuum on demand. Pre-existing DB files keep their auto_vacuum mode until a manual full VACUUM rewrites them, so incremental_vacuum is a harmless no-op there. Purge is cross-tenant: it scopes by age, not tenant_id. HOT_RETENTION_DAYS is clamped to the range 1..36500.

Failure-mode gauges (prefix otelcontext_):

  • retention_consecutive_failures — reset to 0 on success; alert when > 3
  • retention_last_success_timestamp — Unix seconds; alert when stale relative to the hourly tick
  • retention_rows_purged_total, retention_purge_duration_seconds, retention_vacuum_duration_seconds — throughput and latency

Security & Supply Chain

OtelContext targets the OpenSSF Best Practices passing badge (project 12646) and ships a six-job OSS-CLI security stack, supplemented by SonarCloud SAST as a required gate (board reversal 2026-04-28). No CodeQL, no NVD-direct tooling. Cost: $0 for the OSS-CLI tier; SonarCloud is free for public repos.

OSS-CLI security stack (.github/workflows/security.yml)

| Concern | Tool | Gate |
| --- | --- | --- |
| SCA (Go modules + npm) | OSV-Scanner against go.mod + ui/package-lock.json (OSV.dev / GHSA / ecosystem feeds; not NVD) | Block merge on High/Critical |
| SCA (filesystem + OS) + container scan | Trivy filesystem scan; Dependabot surfaces advisories on the Security tab | Block merge on severity: HIGH,CRITICAL, exit-code: 1, ignore-unfixed: true |
| SAST | Semgrep (p/security-audit + p/owasp-top-ten + p/golang) | Block merge on --severity ERROR |
| Secret scan | Gitleaks (full git history) | Block merge on any finding |
| Duplication | jscpd, threshold 3%, --min-tokens 100, scoped to internal/ + ui/src/, excludes tests, vendor, build artifacts, and the legacy internal/graph/ package | Block merge above threshold |
| SBOM | anchore/sbom-action (SPDX + CycloneDX) | Surface as 90-day artifact; do not gate merge |
| Lint (Go) | golangci-lint (existing .golangci.yml) | Wired into ci.yml, not security.yml |

All actions are SHA-pinned per Scorecard Pinned-Dependencies. Top-level permissions: read-all; jobs scope up only when needed (gitleaks needs full history; sbom uploads).

Required external gate: SonarCloud Code Analysis. Runs as the SonarCloud GitHub App (no in-repo workflow); listed in main branch protection's required_status_checks since 2026-04-28. Reinstated by board reversal — earlier docs that said "do not re-introduce" are superseded.

Not used (do not re-introduce without an explicit board reversal): CodeQL (GHAS-paid for non-public repos), OWASP Dependency-Check (or any NVD-direct tool — NVD has analysis-backlog and rate-limit reliability problems).

OpenSSF Scorecard (.github/workflows/scorecard.yml)

  • Schedule: push to main + Mondays 06:00 UTC + manual workflow_dispatch.
  • Output: SARIF → Security tab; results published to public Scorecard dashboard.
  • Hardening: step-security/harden-runner (egress: audit), actions/checkout with persist-credentials: false.
  • Baseline: to be measured after first push to main. Track via the Scorecard dashboard linked from the README badge.
  • Stretch target: ≥ 8.0/10. Best-effort — Scorecard does not gate merge per the board ruling. The passing Best Practices badge is the only hard supply-chain gate.

Vulnerability reporting

See SECURITY.md. Preferred channel: GitHub Security Advisories at https://github.com/RandomCodeSpace/otelcontext/security/advisories/new. Email fallback: ak.nitrr13@gmail.com with subject prefix [otelcontext security].

Signed commits & branch protection

  • Repo-local config helper: scripts/setup-git-signed.sh — supports ssh, openpgp, and x509 signing; honours the contributor's existing global git identity.
  • Branch protection on main requiring signed commits is configured at the GitHub repo level (board-admin action; not file-driven). When toggled on, every commit landing on main must verify.

Self-assessment evidence

  • .bestpractices.json — OpenSSF Best Practices evidence map (project 12646, level passing, six categories self-assessed). The badge level transition from in_progress → passing requires a board admin to log into bestpractices.dev with the OSS-Random identity.

Build & Run

go build -o otelcontext .        # Build
./otelcontext                     # Run (default: SQLite, ports 4317/8080)
go vet ./...                      # Lint
go test ./...                     # Test