OtelContext is a self-hosted OTLP observability platform. Single Go binary with embedded React frontend.
- Backend: Go 1.25, native net/http (no frameworks), GORM ORM, gRPC + HTTP for OTLP ingestion
- Frontend: React 19 + TypeScript + TanStack Query/Virtual + wouter + Radix primitives + cmdk palette + hand-rolled token CSS (ui/src/styles/tokens.css) with CSS Modules; the flow map uses its own deterministic SVG layout code. No UI framework: @ossrandom/design-system, cytoscape, uplot all removed (rewrite completed 2026-06-12, phases C1–C7)
- Ports: gRPC :4317 (OTLP), HTTP :8080 (API + HTTP OTLP + WebSocket + UI)
- NO Express.js/Gin/Echo: use native Go net/http
- NO Tailwind CSS, NO Mantine, NO component frameworks. UI styling is the hand-rolled token sheet (ui/src/styles/tokens.css) + per-component CSS Modules; Radix primitives (unstyled) only for the a11y-hard parts (dialog/tabs/tooltip/dropdown). Token values only: no raw hex outside tokens.css.
- Single-service architecture (no microservices split)
- All internal DBs must be embedded (no external processes)
- Relational DB (SQLite/MySQL/PostgreSQL/MSSQL) is the single source of truth
- Prioritize self-hosted, open-source solutions
- The internal/graph/ package is legacy; use internal/graphrag/ for all new graph work
gRPC :4317 (OTLP Ingest) ──► Ingestion Layer ──► Storage (GORM)
HTTP :8080/v1/* (OTLP HTTP)─┘       │                  │
                                    ▼                  ▼
                            In-Memory Accel.     Relational DB
                            (TSDB Ring,          (Source of Truth,
                             GraphRAG)            7-15 day retention)
                                                       │
HTTP :8080 ◄── REST API ◄──────────────────────────────┘
           ◄── WebSocket (real-time)
           ◄── MCP Server (AI agents, 7-tool triage surface)
           ◄── Prometheus /metrics
| Path | Endpoint | Content Types | Notes |
|---|---|---|---|
| gRPC | :4317 | protobuf | Traces, Logs, Metrics via OTLP gRPC |
| HTTP | /v1/traces, /v1/logs, /v1/metrics | application/x-protobuf, application/json | OTLP HTTP spec compliant, gzip support, 4MB limit. Returns 429 Too Many Requests + Retry-After: 1 when the async pipeline queue is full (parity with gRPC RESOURCE_EXHAUSTED). |
Both paths delegate to the same Export() methods — zero business logic duplication. By default Export() parses the OTLP request and hands a Batch to the async ingest Pipeline (internal/ingest/pipeline.go); a worker pool persists Trace→Span→Log in order. With INGEST_ASYNC_ENABLED=false the pipeline is bypassed and Export() writes inline (legacy path).
Tenant identity flows into the request context on every write and read:
- HTTP: X-Tenant-ID header (see internal/api/tenant_middleware.go).
- gRPC: x-tenant-id metadata key (see internal/ingest/otlp.go).
- OTLP resource attribute: tenant.id on the resource overrides the header/metadata.
When none are present, DEFAULT_TENANT (default "default") is assigned. Every row in the relational DB carries a tenant_id column; every read method in internal/storage/ scopes by the tenant in the request context (Where("tenant_id = ?", tenant)). Retention (RetentionScheduler) is cross-tenant — it purges by age, not by tenant.
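The precedence described above (resource attribute, then header or metadata, then DEFAULT_TENANT) can be sketched as a single function. The name resolveTenant and its signature are hypothetical; the real lookup lives in the middleware and receiver files named above.

```go
package main

import "fmt"

// resolveTenant applies the documented precedence: the tenant.id resource
// attribute wins, then the X-Tenant-ID header or x-tenant-id metadata value,
// then the configured default tenant.
func resolveTenant(resourceAttr, headerOrMetadata, defaultTenant string) string {
	if resourceAttr != "" {
		return resourceAttr
	}
	if headerOrMetadata != "" {
		return headerOrMetadata
	}
	return defaultTenant
}

func main() {
	fmt.Println(resolveTenant("", "", "default"))           // nothing supplied
	fmt.Println(resolveTenant("", "acme", "default"))       // header or metadata only
	fmt.Println(resolveTenant("globex", "acme", "default")) // resource attribute overrides
}
```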
| Layer | Package | Purpose |
|---|---|---|
| GraphRAG (in-memory) | internal/graphrag/ | Layered graph: 4 typed stores, error chains, root cause analysis, anomaly detection |
| Time Series (in-memory) | internal/tsdb/ | Ring buffer, sliding windows, pre-computed percentiles |
| Graph (in-memory, legacy) | internal/graph/ | Simple service topology; being replaced by GraphRAG |
| Relational (persistent) | internal/storage/ | GORM-based, multi-DB, single source of truth. Driven by RetentionScheduler (hourly batched purge + daily VACUUM/ANALYZE). logs.body is plain TEXT. Log search: SQLite FTS5 (logs_fts, porter+unicode61, ordered by bm25(), AFTER INSERT/DELETE/UPDATE triggers) is the default path; LOG_FTS_ENABLED defaults to true when DB_DRIVER=sqlite and false otherwise. Operators who want the ~30% disk savings can set LOG_FTS_ENABLED=false and reclaim the FTS table + indexes via POST /api/admin/drop_fts. Postgres uses pg_trgm GIN on logs.body and logs.service_name. AttributesJSON and AIInsight remain CompressedText. The search_logs MCP tool and the API /api/logs?q=… filter are clamped to the last 24 hours to bound the LIKE-fallback worst case. The vectordb package (TF-IDF semantic search) was removed on 2026-05-24 alongside the find_similar_logs MCP tool; data/vectordb.snapshot is left on disk for operators to delete by hand. |
The internal/graphrag/ package is the core intelligence layer. It replaces the simple internal/graph/ for advanced observability queries.
| Store | Nodes | Edges | TTL |
|---|---|---|---|
| ServiceStore | ServiceNode, OperationNode | CALLS, EXPOSES | Permanent |
| TraceStore | TraceNode, SpanNode | CONTAINS, CHILD_OF | Configurable (default 1h) |
| SignalStore | LogClusterNode, MetricNode | EMITTED_BY, MEASURED_BY, LOGGED_DURING | Permanent |
| AnomalyStore | AnomalyNode | PRECEDED_BY, TRIGGERED_BY | 24h |
Node types (7): ServiceNode, OperationNode, TraceNode, SpanNode, LogClusterNode, MetricNode, AnomalyNode
Edge types (9): CALLS, EXPOSES, CONTAINS, CHILD_OF, EMITTED_BY, LOGGED_DURING, MEASURED_BY, PRECEDED_BY, TRIGGERED_BY
| Function | Algorithm | Purpose |
|---|---|---|
| ErrorChain(service, timeRange) | BFS upstream via CHILD_OF + CALLS | Trace error to responsible service |
| ImpactAnalysis(service, depth) | BFS downstream via CALLS | Blast radius |
| RootCauseAnalysis(service, timeRange) | ErrorChain + anomaly correlation | Ranked probable causes with evidence |
| DependencyChain(traceID) | Tree from CONTAINS + CHILD_OF | Full trace visualization |
| CorrelatedSignals(service, timeRange) | Gather all edges | Related logs/metrics/traces |
| ShortestPath(from, to) | Dijkstra weighted by inverse call freq | Service communication path |
| AnomalyTimeline(since) | Time-sorted anomalies + PRECEDED_BY | Recent anomaly overview |
| ServiceMap(depth) | Full topology dump | Service topology + health |
- 4 event workers consume from a 10,000-capacity buffered channel (best-effort; DB is source of truth)
- Refresh loop (60s) — rebuilds from DB, prunes expired TraceStore nodes, cleans old anomalies
- Snapshot loop (15min): persists Drain templates so cluster IDs survive restart (the graph_snapshots write side was removed on 2026-05-24; the loop name is retained for wiring stability)
- Anomaly loop (10s): detects error spikes, latency degradation, metric z-score anomalies
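The z-score check mentioned for the anomaly loop can be illustrated with a short sketch. The function names, the population standard deviation, and the threshold of 3 used in main are assumptions; the document does not state which variant or threshold internal/graphrag/anomaly.go uses.

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations value lies from the mean of
// history. It returns 0 when history is too short or has no variance.
func zScore(history []float64, value float64) float64 {
	if len(history) < 2 {
		return 0
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))
	var squares float64
	for _, v := range history {
		d := v - mean
		squares += d * d
	}
	std := math.Sqrt(squares / float64(len(history))) // population std dev
	if std == 0 {
		return 0
	}
	return (value - mean) / std
}

// isAnomalous flags a sample whose absolute z-score exceeds threshold.
func isAnomalous(history []float64, value, threshold float64) bool {
	return math.Abs(zScore(history, value)) > threshold
}

func main() {
	latencyMs := []float64{10, 12, 11, 9, 10, 12, 11, 9}
	fmt.Println(isAnomalous(latencyMs, 11, 3)) // close to the mean
	fmt.Println(isAnomalous(latencyMs, 20, 3)) // far above the mean
}
```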
- Investigation: automated error analysis records (trigger, root cause, causal chain, evidence)
- DrainTemplateRow: persisted Drain log templates (table drain_templates), loaded on startup to warm the miner
Note: GraphSnapshot (table graph_snapshots) was removed on 2026-05-24. AutoMigrate no longer creates the table on fresh deploys; existing populated tables are left in place. Operators can run DROP TABLE graph_snapshots; VACUUM; to reclaim disk.
Log clustering uses Drain template mining (internal/graphrag/drain.go) — a deterministic fixed-depth prefix tree with O(1) LRU via container/list. Templates are persisted to the drain_templates table and reloaded on startup so cluster IDs stay stable across restarts.
TraceServer.Export() → DB persist → spanCallback → GraphRAG.OnSpanIngested()
LogsServer.Export() → DB persist → logCallback → GraphRAG.OnLogIngested()
MetricsServer.Export() → TSDB → metricCallback → GraphRAG.OnMetricIngested()
The MCP server (internal/mcp/) exposes a focused 7-tool triage surface via
HTTP Streamable MCP (JSON-RPC 2.0 POST + SSE GET). The surface was reduced
from 21 → 7 on 2026-05-24 so the platform survives 120 services on SQLite —
see docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md
for the full rationale.
| Tool | Input | Source |
|---|---|---|
| get_anomaly_timeline | {since?, service?} | In-memory (instant): triage entry point |
| get_service_map | {depth?, service?} | In-memory (instant): topology + health overlay |
| get_service_health | {service_name} | In-memory (instant): per-service drill-down |
| root_cause_analysis | {service, time_range?} | In-memory (instant): ranked probable causes |
| impact_analysis | {service, depth?} | In-memory (instant): blast radius |
| trace_graph | {trace_id} | In-memory + DB fallback: trace tree visualisation |
| search_logs | {query?, severity?, service?, trace_id?, start?, end?, limit?, page?} | DB (FTS5 default on SQLite, LIKE fallback, 24h-clamped) |
Cut tools (clients now receive an unknown tool RPC error): get_system_graph,
tail_logs, get_trace, search_traces, get_metrics, get_dashboard_stats,
get_storage_status, find_similar_logs, get_alerts, correlated_signals,
get_error_chains, get_investigations, get_investigation, get_graph_snapshot.
Cacheable surface (5s TTL via MCP_CACHE_TTL_MS): get_anomaly_timeline,
get_service_map, get_service_health, root_cause_analysis, impact_analysis.
Every error-identifying tool returns a root_cause block:
{"root_cause": {"service": "...", "operation": "...", "error_message": "...", "span_id": "...", "trace_id": "..."}}
The Dead Letter Queue uses typed envelopes for all data types:
{"type": "logs|spans|traces|metrics", "data": [...]}
Legacy format (raw []storage.Log JSON) is supported for backward compatibility.
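A reader that accepts both the typed envelope and the legacy format might look like the sketch below. The envelope struct and decodeEnvelope function are hypothetical; treating any bare JSON array as legacy logs is an assumption drawn from the "raw []storage.Log JSON" description.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// envelope is the typed wrapper; Data stays raw until Type is known.
type envelope struct {
	Type string          `json:"type"`
	Data json.RawMessage `json:"data"`
}

// decodeEnvelope accepts a typed envelope object, or a bare JSON array,
// which is treated as the legacy logs format.
func decodeEnvelope(raw []byte) (envelope, error) {
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) > 0 && trimmed[0] == '[' {
		return envelope{Type: "logs", Data: json.RawMessage(trimmed)}, nil
	}
	var env envelope
	if err := json.Unmarshal(trimmed, &env); err != nil {
		return envelope{}, fmt.Errorf("decode envelope: %w", err)
	}
	switch env.Type {
	case "logs", "spans", "traces", "metrics":
		return env, nil
	default:
		return envelope{}, fmt.Errorf("unknown envelope type %q", env.Type)
	}
}

func main() {
	for _, in := range []string{
		`{"type": "spans", "data": [{"span_id": "abc"}]}`,
		`[{"body": "legacy log line"}]`,
		`{"type": "events", "data": []}`,
	} {
		env, err := decodeEnvelope([]byte(in))
		fmt.Println(env.Type, err)
	}
}
```

Keeping Data as json.RawMessage defers the second decode until the type is known, so one entry point can serve all four data types.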
Proper LIFO ordering to prevent data loss:
- gRPC GracefulStop() + HTTP Shutdown(): stop ingestion
- WebSocket Hub + Event Hub + AI Service: stop real-time
- TSDB + Graph + GraphRAG: stop processing
- DLQ: stop replay
- RetentionScheduler Stop(): halt purge/maintenance ticks
- DB Close(): close database last
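One way to get LIFO ordering is to register a stop function as each component starts and then run the list in reverse. The sketch below assumes the components start in the opposite order to the list above (database first, ingestion last); shutdownStack and its methods are hypothetical names.

```go
package main

import "fmt"

// shutdownStack records stop functions in start order and runs them in reverse.
type shutdownStack struct {
	names []string
	stops []func() error
}

// Register is called right after a component starts successfully.
func (s *shutdownStack) Register(name string, stop func() error) {
	s.names = append(s.names, name)
	s.stops = append(s.stops, stop)
}

// Shutdown runs every stop function last-in-first-out and returns the names
// in the order they were stopped. A failing stop is reported but does not
// prevent the remaining components from stopping.
func (s *shutdownStack) Shutdown() []string {
	var order []string
	for i := len(s.stops) - 1; i >= 0; i-- {
		if err := s.stops[i](); err != nil {
			fmt.Printf("stop %s: %v\n", s.names[i], err)
		}
		order = append(order, s.names[i])
	}
	return order
}

func main() {
	var s shutdownStack
	noop := func() error { return nil }
	// Assumed start order: the database comes up first and goes down last.
	for _, name := range []string{"db", "retention", "dlq", "processing", "realtime", "ingestion"} {
		s.Register(name, noop)
	}
	fmt.Println(s.Shutdown())
}
```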
internal/
  ai/               # AI service integration
  api/              # HTTP handlers, middleware, rate limiting, graph_handler
  cache/            # TTL cache with synchronized Stop()
  compress/         # Zstd compression utilities
  config/           # Environment configuration (40+ fields)
  graph/            # LEGACY in-memory service graph; use graphrag/ for new work
  graphrag/         # GraphRAG: layered graph, error chains, anomaly detection, investigations
    schema.go         # 7 node types, 9 edge types, query result types
    store.go          # 4 typed stores (Service, Trace, Signal, Anomaly)
    builder.go        # Event workers, ingestion callbacks, GraphRAG coordinator
    queries.go        # ErrorChain, ImpactAnalysis, RootCause, ShortestPath, etc.
    investigation.go  # GORM Investigation model + persistence
    anomaly.go        # Z-score, error spike, latency degradation detection
    drain.go          # Log clustering via Drain template mining: pure-Go, stdlib-only, deterministic fixed-depth prefix tree
    refresh.go        # Periodic DB rebuild + pruning + Drain template persistence
  ingest/           # OTLP receivers (gRPC + HTTP), adaptive sampling
    otlp.go           # gRPC TraceServer, LogsServer, MetricsServer
    otlp_http.go      # HTTP OTLP handler (protobuf + JSON, gzip, 4MB limit)
    sampler.go        # Per-service token bucket sampler
  mcp/              # MCP server (7-tool triage surface, JSON-RPC 2.0 + SSE)
  queue/            # Dead Letter Queue (typed envelopes, bounded disk, exp backoff)
  realtime/         # WebSocket hub + event streaming
  storage/          # GORM repository, models, migrations, Close() method, SQLite PRAGMA stanza
  telemetry/        # Prometheus metrics + health (19 metrics)
  tsdb/             # Time series aggregator + ring buffer (lock-free Windows())
  ui/               # Embedded React frontend
ui/                 # React frontend (Vite + token CSS Modules, no UI framework)
test/               # Microservice simulation (7 services)
docs/               # Specifications and plans
Key settings in internal/config/config.go:
- HTTP_PORT (8080), GRPC_PORT (4317), DB_DRIVER (sqlite), DB_DSN
- DB_AUTOMIGRATE (true), DB_MAX_OPEN_CONNS, DB_MAX_IDLE_CONNS, DB_CONN_MAX_LIFETIME (internally capped to 30m when DB_AZURE_AUTH=true)
- DB_AZURE_AUTH (false): see Authentication below
- TLS_CERT_FILE, TLS_KEY_FILE: explicit TLS (both or neither)
- TLS_AUTO_SELFSIGNED (false), TLS_CACHE_DIR (./data/tls): self-signed bootstrap, ignored if cert files set
- API_KEY: Bearer token gate for /api/*, /v1/*, /mcp. Empty = auth disabled
- OTEL_EXPORTER_OTLP_ENDPOINT: enables self-instrumentation (empty = off)
- DEFAULT_TENANT (default): assigned to rows ingested without explicit tenant
- HOT_RETENTION_DAYS (7): drives RetentionScheduler; range 1..36500
- SAMPLING_RATE (1.0), SAMPLING_ALWAYS_ON_ERRORS (true), SAMPLING_LATENCY_THRESHOLD_MS (500)
- METRIC_MAX_CARDINALITY (10000), METRIC_MAX_CARDINALITY_PER_TENANT (0 = unlimited), API_RATE_LIMIT_RPS (100). The per-tenant cap is checked first; when set, a noisy tenant cannot exhaust the global pool. Overflow is labeled by tenant via otelcontext_tsdb_cardinality_overflow_by_tenant_total{tenant_id} (__global__ sentinel when the global cap was the trigger).
- MCP_ENABLED (true), MCP_PATH (/mcp)
- MCP_MAX_CONCURRENT (32), MCP_CALL_TIMEOUT_MS (30000), MCP_CACHE_TTL_MS (5000): MCP HTTP streamable robustness. Counting semaphore gates concurrent tools/call (JSON-RPC -32000 past the cap), per-call deadlines abort runaway handlers (JSON-RPC -32001), and a 5s TTL cache memoizes the cheap in-memory GraphRAG tools (get_service_map, impact_analysis, root_cause_analysis, get_anomaly_timeline, get_service_health). SSE GET sends a : keep-alive\n\n comment every 25s to keep the stream alive across reverse-proxy idle timeouts. Set any to 0 to disable.
- LOG_FTS_ENABLED: when truthy (true/yes/on/1), provisions the SQLite FTS5 logs_fts virtual table + sync triggers at startup; when false, log-search uses a 24h-clamped LIKE fallback. Defaults to true when DB_DRIVER=sqlite (BM25 is dramatically faster than LIKE on the kept search_logs MCP tool) and false otherwise. Toggle off and reclaim the ~30% disk overhead via POST /api/admin/drop_fts (refused while the flag is on). The vectordb-backed semantic-search path was removed on 2026-05-24.
- DLQ_MAX_FILES (1000), DLQ_MAX_DISK_MB (500), DLQ_MAX_RETRIES (10)
- GRAPHRAG_WORKER_COUNT (16), GRAPHRAG_EVENT_QUEUE_SIZE (100000; 10000 on SQLite): sized for 100–200 services; raise further if otelcontext_graphrag_events_dropped_total climbs
- GRAPHRAG_TRACE_TTL (1h; 30m on SQLite), GRAPHRAG_MAX_SPANS_PER_TENANT (500000), GRAPHRAG_TENANT_IDLE_TTL (24h): in-memory GraphRAG memory bounds. Spans past the per-tenant cap are skipped from the graph only (DB unaffected; metered as otelcontext_graphrag_events_dropped_total{signal="span_capacity"}); tenant store slices idle past the TTL are evicted (default tenant immune, self-healing via the 60s rebuild). SignalStore metrics are bounded to 2000/tenant + 24h TTL (constants).
- PPROF_ADDR (127.0.0.1:6060): net/http/pprof on a dedicated loopback listener (never the public mux); empty disables. Startup also sets a soft GOMEMLIMIT (honors the env var, else 75% of the cgroup/host budget via internal/membudget).
- INGEST_MIN_SEVERITY (INFO), STORE_MIN_SEVERITY ("" = same as ingest; defaults to "WARN" when DB_DRIVER=sqlite): two-tier log severity gate. The ingest gate runs at the OTLP receiver and drops the log entirely below the threshold (no in-memory enrichment either). The store gate runs at the persist boundary inside the async pipeline (internal/ingest/pipeline.go:process) and only skips the DB row write; the log still flows through LogCallback so GraphRAG Drain template mining and span/trace correlation see it. Use case: INGEST_MIN_SEVERITY=DEBUG STORE_MIN_SEVERITY=WARN keeps SQLite small while letting in-memory anomaly detection benefit from the verbose stream. Setting STORE_MIN_SEVERITY ≤ INGEST_MIN_SEVERITY is a no-op (logged as a warning at startup). Drops surface via Pipeline.Stats().StoreFiltered.
- INGEST_ASYNC_ENABLED (true), INGEST_PIPELINE_QUEUE_SIZE (50000), INGEST_PIPELINE_WORKERS (8), INGEST_PIPELINE_MAX_BYTES (536870912 = 512 MB; 128 MB on SQLite): async ingest pipeline (internal/ingest/pipeline.go). Hybrid backpressure: <90% accept all, 90–100% drop healthy batches (errors/slow always pass), 100% return gRPC RESOURCE_EXHAUSTED. The byte cap bounds queue memory regardless of item count; at the cap even priority batches get RESOURCE_EXHAUSTED/429 (a 429 is recoverable, an OOM kill is not); watch otelcontext_ingest_pipeline_queue_bytes and reason bytes_full. Set INGEST_ASYNC_ENABLED=false to revert to synchronous DB writes inside Export(). Drops surface as otelcontext_ingest_pipeline_dropped_total{signal,reason}.
- GRPC_MAX_RECV_MB (16), GRPC_MAX_CONCURRENT_STREAMS (1000): OTLP gRPC server caps, validated to 1..256 and 1..1_000_000
- RETENTION_BATCH_SIZE (50000), RETENTION_BATCH_SLEEP_MS (1): purge pacing; raise the sleep on busy production DBs
- DB_POSTGRES_PARTITIONING (""), DB_PARTITION_LOOKAHEAD_DAYS (3): opt-in Postgres declarative range partitioning of the logs table by day. When daily, logs is provisioned as a partitioned parent (greenfield only; refuses to start if logs already exists unpartitioned), the PartitionScheduler maintains lookahead partitions and drops expired ones via DROP TABLE, and RetentionScheduler skips the row-level DELETE for logs. Watch otelcontext_partitions_dropped_total and otelcontext_partitions_active.
- APP_ENV ("development"), OTELCONTEXT_ALLOW_SQLITE_PROD (false): SQLite is refused when APP_ENV=production unless the allow flag is set
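The hybrid backpressure thresholds for the async ingest pipeline can be expressed as a small admission function. This is a sketch under assumptions: the decision type, the admit signature, and the exact boundary handling (a batch is rejected when adding it would exceed the byte cap) are illustrative and not taken from internal/ingest/pipeline.go.

```go
package main

import "fmt"

type decision int

const (
	accept          decision = iota // enqueue the batch
	dropHealthy                     // shed a non-priority batch
	rejectExhausted                 // RESOURCE_EXHAUSTED on gRPC, 429 on HTTP
)

// admit applies the documented thresholds. priority marks batches that carry
// errors or slow spans, which pass until the queue is completely full.
func admit(queueLen, queueCap int, queueBytes, batchBytes, maxBytes int64, priority bool) decision {
	// Byte cap: bounds queue memory regardless of item count, so even
	// priority batches are rejected here.
	if queueBytes+batchBytes > maxBytes {
		return rejectExhausted
	}
	// 100% full by item count.
	if queueLen >= queueCap {
		return rejectExhausted
	}
	// 90-100% full: only priority batches are accepted.
	if queueLen*10 >= queueCap*9 && !priority {
		return dropHealthy
	}
	return accept
}

func main() {
	const maxBytes = 512 << 20
	fmt.Println(admit(1000, 50000, 0, 1024, maxBytes, false) == accept)
	fmt.Println(admit(46000, 50000, 0, 1024, maxBytes, false) == dropHealthy)
	fmt.Println(admit(46000, 50000, 0, 1024, maxBytes, true) == accept)
	fmt.Println(admit(50000, 50000, 0, 1024, maxBytes, true) == rejectExhausted)
}
```

Checking the byte cap first reflects the stated priority: a rejected batch can be retried, whereas running out of memory takes the whole process down.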
So that a 100+ service deployment on SQLite survives without OOM, config.Load() overrides the defaults listed in the table below at the end of the Load() pass, but only when the operator did not explicitly set the env var (detected via os.LookupEnv presence, not value comparison). Postgres/MSSQL/MySQL paths are untouched.
| Env var | SQLite default | Postgres default | Rationale |
|---|---|---|---|
| DB_MAX_OPEN_CONNS | 1 | 50 | SQLite is single-writer; extra conns are wasted slots. |
| DB_MAX_IDLE_CONNS | 1 | 10 | Match open conns. |
| INGEST_PIPELINE_WORKERS | 2 | 8 | Workers all serialise through the SQLite writer lock; 2 is enough to keep the queue non-empty. |
| INGEST_PIPELINE_QUEUE_SIZE | 10000 | 50000 | Lower heap watermark; backpressure kicks in earlier so OTLP clients back off. |
| INGEST_PIPELINE_MAX_BYTES | 128 MB | 512 MB | Item count alone cannot bound queue memory; one batch may carry MBs of spans/logs. |
| GRAPHRAG_EVENT_QUEUE_SIZE | 10000 | 100000 | Each queued event embeds a Span/Log by value (~0.5–2 KB); buffer less, drop sooner (metered). |
| GRAPHRAG_TRACE_TTL | 30m | 1h | The in-memory span window is the largest legitimate GraphRAG consumer; anomaly/investigation lookbacks are ≤5min. |
| METRIC_MAX_CARDINALITY | 3000 | 10000 | Bound the in-memory TSDB series map. |
| STORE_MIN_SEVERITY | "WARN" | "" | Skip INFO/DEBUG persists; in-memory GraphRAG/Drain still sees them. |
| SAMPLING_RATE | 0.05 | 1.0 | Errors and slow spans are always kept by SAMPLING_ALWAYS_ON_ERRORS. |
| GRPC_MAX_CONCURRENT_STREAMS | 240 | 1000 | ~2 streams per service at 120 services with headroom. |
| LOG_FTS_ENABLED | true | n/a | FTS5 BM25 is dramatically faster than LIKE on the kept search_logs path. |
Also at SQLite startup, internal/storage/factory.go applies a fail-closed PRAGMA stanza: journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY, wal_autocheckpoint=10000, journal_size_limit=67108864 (64 MB WAL cap), busy_timeout=5000, plus budget-scaled memory knobs: page cache = budget/32 clamped to [64 MB, 256 MB] and mmap = budget/8 clamped to [256 MB, 1 GB], where the budget comes from internal/membudget (cgroup v2 → v1 → /proc/meminfo; a 4 GB host gets 128 MB cache + 512 MB mmap, detection failure falls back to the 256 MB/1 GB ceilings). Operators override with SQLITE_CACHE_SIZE_KB / SQLITE_MMAP_SIZE_BYTES. With the pure-Go driver the page cache is Go-heap memory and competes with GOMEMLIMIT. PRAGMA auto_vacuum=INCREMENTAL is attempted best-effort before the WAL switch (the WAL header freezes the stored mode; only affects newly created DB files). Any pragma failure in the fail-closed stanza aborts startup with a wrapped error — these are not optional. See docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md for per-default reasoning.
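The budget-scaled memory knobs reduce to two clamped divisions. The sketch below reproduces the arithmetic from the paragraph above; the function names are hypothetical, and a non-positive budget is used here to model detection failure.

```go
package main

import "fmt"

const mb = int64(1) << 20

// clamp bounds v to the closed interval [lo, hi].
func clamp(v, lo, hi int64) int64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// sqliteMemoryKnobs derives the page cache and mmap sizes in bytes from a
// memory budget in bytes. A non-positive budget stands in for detection
// failure and yields the ceilings.
func sqliteMemoryKnobs(budget int64) (cache, mmap int64) {
	if budget <= 0 {
		return 256 * mb, 1024 * mb
	}
	cache = clamp(budget/32, 64*mb, 256*mb)  // page cache = budget/32
	mmap = clamp(budget/8, 256*mb, 1024*mb) // mmap = budget/8
	return cache, mmap
}

func main() {
	for _, budgetMB := range []int64{1024, 4096, 16384} {
		cache, mmap := sqliteMemoryKnobs(budgetMB * mb)
		fmt.Printf("budget=%d MB cache=%d MB mmap=%d MB\n", budgetMB, cache/mb, mmap/mb)
	}
}
```

For the 4 GB host in the text this gives 4096/32 = 128 MB of cache and 4096/8 = 512 MB of mmap; smaller and larger hosts hit the floor and ceiling respectively.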
API auth (platform). API_KEY gates /api/*, OTLP HTTP (/v1/*), and the MCP endpoint via Authorization: Bearer <API_KEY>. When empty, the middleware is a pass-through (dev only). Unprotected paths: /live, /ready, /metrics*, /ws*. A shared API_KEY grants access to every tenant — there is no per-tenant-key file in the current code; isolate tenants at the network/auth layer if that matters. (If an API_TENANT_KEYS_FILE override lands later, re-check internal/api/auth.go for the flag name.)
Database auth (Azure Entra). Setting DB_AZURE_AUTH=true enables Azure Entra ID (AAD) authentication for PostgreSQL. The driver uses DefaultAzureCredential, which resolves identity via the standard probe order (env vars → workload identity → managed identity → Azure CLI → developer credentials). When Azure auth is enabled, strict TLS (sslmode=require, verify-ca, or verify-full) is mandatory; weaker modes are rejected at startup. DB_CONN_MAX_LIFETIME is internally capped to 30 minutes to stay inside the token TTL.
The RetentionScheduler in internal/storage/ runs an hourly batched purge of data older than HOT_RETENTION_DAYS via PurgeLogsBatched, PurgeTracesBatched, and PurgeMetricBucketsBatched, plus a daily maintenance pass: PRAGMA optimize and PRAGMA incremental_vacuum(10000) on SQLite (the historical full VACUUM held an exclusive whole-DB lock for 10–60 min on multi-GB files, starving ingest into a 429 storm; restore it with RETENTION_FULL_VACUUM=true or run POST /api/admin/vacuum on demand — note pre-existing DB files keep their auto_vacuum mode until a manual full VACUUM rewrites them, so incremental_vacuum no-ops harmlessly there), ANALYZE-equivalent maintenance on other drivers as before. Purge is cross-tenant — it scopes by age, not tenant_id. Valid HOT_RETENTION_DAYS is clamped to the range 1..36500.
Failure-mode gauges (prefix OtelContext_):
- retention_consecutive_failures: reset to 0 on success; alert when > 3
- retention_last_success_timestamp: Unix seconds; alert when stale relative to the hourly tick
- retention_rows_purged_total, retention_purge_duration_seconds, retention_vacuum_duration_seconds: throughput and latency
OtelContext targets the OpenSSF Best Practices passing badge (project 12646) and ships a six-job OSS-CLI security stack, supplemented by SonarCloud SAST as a required gate (board reversal 2026-04-28). No CodeQL, no NVD-direct tooling. Cost: $0 for the OSS-CLI tier; SonarCloud is free for public repos.
| Concern | Tool | Gate |
|---|---|---|
| SCA (Go modules + npm) | OSV-Scanner against go.mod + ui/package-lock.json (OSV.dev / GHSA / ecosystem feeds; not NVD) | Block merge on High/Critical |
| SCA (filesystem + OS) + container scan | Trivy filesystem scan; Dependabot surfaces advisories on the Security tab | Block merge on severity: HIGH,CRITICAL, exit-code: 1, ignore-unfixed: true |
| SAST | Semgrep (p/security-audit + p/owasp-top-ten + p/golang) | Block merge on --severity ERROR |
| Secret scan | Gitleaks (full git history) | Block merge on any finding |
| Duplication | jscpd, threshold 3%, --min-tokens 100, scoped to internal/ + ui/src/, excludes tests, vendor, build artifacts, and the legacy internal/graph/ package | Block merge above threshold |
| SBOM | anchore/sbom-action (SPDX + CycloneDX) | Surface as 90-day artifact; do not gate merge |
| Lint (Go) | golangci-lint (existing .golangci.yml) | Wired into ci.yml, not security.yml |
All actions are SHA-pinned per Scorecard Pinned-Dependencies. Top-level permissions: read-all; jobs scope up only when needed (gitleaks needs full history; sbom uploads).
Required external gate: SonarCloud Code Analysis. Runs as the SonarCloud GitHub App (no in-repo workflow); listed in main branch protection's required_status_checks since 2026-04-28. Reinstated by board reversal — earlier docs that said "do not re-introduce" are superseded.
Not used (do not re-introduce without an explicit board reversal): CodeQL (GHAS-paid for non-public repos), OWASP Dependency-Check (or any NVD-direct tool — NVD has analysis-backlog and rate-limit reliability problems).
- Schedule: push to main + Mondays 06:00 UTC + manual workflow_dispatch.
- Output: SARIF → Security tab; results published to public Scorecard dashboard.
- Hardening: step-security/harden-runner (egress: audit), actions/checkout with persist-credentials: false.
- Baseline: to be measured after first push to main. Track via the Scorecard dashboard linked from the README badge.
- Stretch target: ≥ 8.0/10. Best-effort; Scorecard does not gate merge per the board ruling. The passing Best Practices badge is the only hard supply-chain gate.
See SECURITY.md. Preferred channel: GitHub Security Advisories at https://github.com/RandomCodeSpace/otelcontext/security/advisories/new. Email fallback: ak.nitrr13@gmail.com with subject prefix [otelcontext security].
- Repo-local config helper: scripts/setup-git-signed.sh supports ssh, openpgp, and x509 signing; honours the contributor's existing global git identity.
- Branch protection on main requiring signed commits is configured at the GitHub repo level (board-admin action; not file-driven). When toggled on, every commit landing on main must verify.
- .bestpractices.json: OpenSSF Best Practices evidence map (project 12646, level passing, six categories self-assessed). The badge level transition from in_progress → passing requires a board admin to log into bestpractices.dev with the OSS-Random identity.
go build -o otelcontext . # Build
./otelcontext # Run (default: SQLite, ports 4317/8080)
go vet ./... # Lint
go test ./... # Test