Skip to content

feat(observability): direct-POST metrics + error channel, per-call fetch operations, audit CLI#179

Open
vibe-dex wants to merge 12 commits into
mainfrom
vibe-dex/investigate-plan
Open

feat(observability): direct-POST metrics + error channel, per-call fetch operations, audit CLI#179
vibe-dex wants to merge 12 commits into
mainfrom
vibe-dex/investigate-plan

Conversation

@vibe-dex

@vibe-dex vibe-dex commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Builds on the 5.0 CF-native observability pipeline. Stamps service.name and service.version (from CF_VERSION_METADATA) on every framework span and log line via the attribute floor, and sets SpanStatusCode.ERROR on framework spans when withTracing rethrows. Adds new spans across the framework — cache lookup/store decisions in both cache paths (HTML + serverFn) with profile/kind/decision labels via an extended recordCacheMetric(hit, profile, decision), the runSectionLoaders batch, the deferred section load, and every admin protocol handler (meta, decofile.read|reload, render, invoke). Logger and logRequest now stamp trace_id/span_id on every line emitted inside a withTracing scope, so logs and traces correlate in ClickStack. New public helper injectTraceContext(headers) writes a W3C traceparent from the active span for outbound fetches (consumed by createInstrumentedFetch in @decocms/apps). Ships docs/observability.md covering the architecture, span/metric reference, wrangler.jsonc shape, identity floor, ClickHouse JSONExtract query patterns, trace propagation, and sampling notes; linked from README.

Test plan

  • bun run typecheck clean
  • bun test — all 553 tests pass (11 new)
  • bun run build succeeds; new exports (injectTraceContext, CacheDecision, spanContext) appear in dist d.ts
  • bunx tsx scripts/check-tier-boundaries.ts — clean
  • Post-merge smoke on a preview deployment with version_metadata binding + ingest destinations: confirm ResourceAttributes['service.version'] and SpanAttributes['cache.profile'] land in ClickHouse; force a 5xx and verify the framework span shows STATUS_CODE_ERROR in ClickStack; filter logs by trace_id and confirm correlation.

🤖 Generated with Claude Code


Summary by cubic

Expands CF-native observability with wider span coverage, direct OTLP/HTTP metrics and 100%-error logs, safe URL logging, trace/log correlation, per-call operation names on outbound fetches, and a deco-audit-observability CLI. Also exposes Router preload reuse options and fixes fetch method/header semantics, exporter reliability, and redaction edge cases.

  • New Features

    • Error logs: direct OTLP/HTTP channel posts level: "error" to /v1/logs when DECO_OTEL_LOGS_ENDPOINT is set; rate-limited, buffered per isolate, flushed via ctx.waitUntil. Safe to lower observability.logs.head_sampling_rate.
    • Metrics: OTLP/HTTP meter posts to /v1/metrics when DECO_OTEL_METRICS_ENDPOINT is set; aggregates per isolate, flushes via ctx.waitUntil; composes with Workers Analytics Engine.
    • Spans/metrics: stamps service.name/service.version; sets error status on framework spans; new spans for cache lookup/store (HTML + serverFn), deco.section.loaders.batch, deferred section load, and admin handlers. Cache metrics carry a decision label and set deco.cache.decision/deco.cache.profile on the active span.
    • Correlation/propagation: logger/logRequest add trace_id/span_id inside withTracing. injectTraceContext(headers) writes W3C traceparent; used by createInstrumentedFetch (default on, opt out with injectTraceparent: false).
    • Outbound fetches: createInstrumentedFetch adds per-call operation names — init.operation, defaultOperation, and resolveOperation(url, method); stamps fetch.operation on the span and exposes it via onComplete.
    • URL safety: createInstrumentedFetch redacts query values on http.url and dev logs; allowlist via keepQueryKeys. Public redactUrl helper exported.
    • Router: expose preload reuse knobs (defaultPreloadStaleTime, defaultPreloadGcTime, defaultPreloadDelay, defaultStaleTime, defaultPendingMs, defaultPendingMinMs); adds deco-pdp-fast-navigation reference.
    • Audit/docs/defaults: deco-audit-observability audits wrangler.jsonc; observability.traces.head_sampling_rate defaults to 0.01. docs/observability.md documents wiring and cost model; README links it. New exports: @decocms/start/sdk/otelHttpMeter, @decocms/start/sdk/otelHttpErrorLog, injectTraceContext, CacheDecision, redactUrl.
  • Bug Fixes

    • Fetch: header handling now matches the Fetch spec — init.headers replaces Request.headers; when omitted, existing headers are preserved. Method resolution now follows init.method ?? input.method ?? "GET" so http.method and resolveOperation see the real verb.
    • Error-log exporter: on non-2xx/transport errors, restores the flushed snapshot instead of dropping records, with snapshot-first policy — evicts newest live-buffer records to honor the buffer cap; only truncates the snapshot if it alone exceeds the cap; guards attribute serialization when JSON.stringify is undefined.
    • Metrics exporter: checks overflow before creating metric entries to avoid empty OTLP envelopes.
    • URL redaction: fallback path also strips fragments (#...) to avoid leaking tokens.
    • JSONC parsing: adds trailing-comma support via shared helpers; deco-audit-observability and deco-cf-observability handle real wrangler.jsonc files cleanly.

Written for commit 999fd30. Summary will update on new commits. Review in cubic

…and trace correlation

Stamps service.name and service.version (from CF_VERSION_METADATA) on every
framework span and log via the attribute floor; sets SpanStatusCode.ERROR on
framework spans when withTracing rethrows; adds spanContext() on the Span
adapter so logger and propagation helpers can read trace IDs.

New framework spans cover cache lookup/store decisions (HTML + serverFn paths)
with profile and kind attributes, the runSectionLoaders batch, deferred section
load, and every admin protocol handler (meta, decofile read/reload, render,
invoke). recordCacheMetric now takes a CacheDecision label
(HIT|STALE-HIT|STALE-ERROR|MISS|BYPASS) and is wired at every decision point.

Logger emit and logRequest now stamp trace_id/span_id on every line emitted
inside a withTracing scope, so logs and traces correlate in ClickStack. New
injectTraceContext(headers) helper writes a W3C traceparent for outbound
fetches (consumed by createInstrumentedFetch in @decocms/apps).

Adds docs/observability.md covering architecture, span/metric reference,
wrangler.jsonc shape, identity floor, ClickHouse JSONExtract query patterns,
trace propagation, and sampling notes; README links it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vibe-dex vibe-dex requested a review from a team May 18, 2026 13:19

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 15 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread src/core/sdk/otel.test.ts
vibe-dex and others added 2 commits May 18, 2026 16:19
…ng default to 1%

Phase 1 — fix spy leak flagged by cubic review on PR #179.

The `instrumentWorker — tracer bridge` describe block in `otel.test.ts`
installs a `vi.spyOn(trace, "getTracer")` per test via
`installBridgeWithFakeOtelSpan()` but the `afterEach` only reset boot
state — it never restored the spy. That meant any subsequent test in
the same vitest worker that touched the real OTel global API would
get the mocked tracer back. Added `vi.restoreAllMocks()` to the
afterEach so each test's spies die with it. 553/553 still pass.

Phase 2 — lower documented trace `head_sampling_rate` default from 0.1
(10%) to 0.01 (1%) across docs, scaffold, codemod, and skill reference.

At fleet scale (~2.5B req/mo across 100 sites, ~20 spans/req), 10%
puts the account at meaningful risk of tripping the CF 5B-events/day
cap, after which CF auto-applies forced 1% sampling for the rest of
the day fleet-wide — turning a single viral campaign on one storefront
into a fleet-wide outage of telemetry. 1% gives 2.5x headroom and
keeps annual telemetry cost in the low hundreds of dollars.

Also added to `docs/observability.md`:
- Heavy-site override tier (0.001 / 0.1%) gated on explicit team
  sign-off, NOT a codemod default. No documented rate above 1%.
- Cost model table (CF Destinations + ingestor + AE + DO) verified
  against current public CF pricing.
- Note that the 100%-error-traces upgrade path (ingest-side
  filtering at 1.0) costs ~$44K/mo at our scale vs ~$420/mo for the
  chosen path of 1% head + direct-POST errors.

The migration codemod's `--traces-rate` default now ships 0.01 so
newly-migrated sites land on the safe number out of the box.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ingest

CF Destinations supports OTLP for logs and traces only — there is no
`observability.metrics` block. To get metrics into stats-lake the
framework now POSTs OTLP/HTTP JSON straight to deco-otel-ingest's
`/v1/metrics` endpoint, alongside (not replacing) the AE meter.

Adds `createOtlpHttpMeterAdapter` in `src/core/sdk/otelHttpMeter.ts`:

- Per-isolate, in-memory buffer for counters/gauges/histograms with
  CUMULATIVE aggregation temporality (matches the clickhouseexporter
  schema the ingestor inherits).
- Counters aggregate by (metric name, attribute fingerprint); each
  `counterInc()` bumps the cumulative value in place.
- Gauges are last-write-wins per attribute fingerprint.
- Histograms accumulate count/sum/min/max + per-bucket counts against
  explicit bounds (default `[5,10,25,50,75,100,250,500,1000]` ms).
- Re-registering a metric name as a different kind drops the second
  call and surfaces via `onError("kind-mismatch", …)` — never throws.
- Buffer cap (default 2000 datapoints) bounds isolate memory. New
  attribute fingerprints past the cap are dropped via
  `onError("overflow", …)`; existing fingerprints still update.
- `flush()` POSTs the snapshot and is throttled by `minFlushIntervalMs`
  (default 5000ms) per isolate. Concurrent callers share a single
  in-flight POST. Cooldown is bypassed when the buffer is at the cap.
- Transport failures (network, non-2xx, AbortError on timeout) never
  throw — they surface via `onError("flush", err)` so the request hot
  path stays uninstrumented.

Wires into `instrumentWorker`:

- `bootObservability` reads `env.DECO_OTEL_METRICS_ENDPOINT` (env var
  configurable via `OtelOptions.otlpMetricsEndpointEnvVar`). When set
  and `otlpMetricsEnabled !== false`, the OTLP meter is installed
  beside the AE meter via `createCompositeMeter([ae, otlp])`.
- After every `handler.fetch`, the wrapped fetch fires
  `ctx.waitUntil(otlpMeter.flush())` so the buffer drains without
  blocking the response. The throttle keeps actual POST volume
  proportional to traffic, not to request count.
- Boot breadcrumb adds `otlpMetrics: true|false` so operators can
  confirm the wire at a glance from CF Logs.

Tests:

- `otelHttpMeter.test.ts` — 10 unit tests covering cumulative sum
  aggregation, histogram bucket assignment, gauge last-write-wins,
  kind-mismatch rejection, cooldown gating, overflow protection,
  in-flight flush coalescing, non-2xx + transport failures.
- `otel.test.ts` — 4 integration tests covering the boot/flush
  wiring (no-op without env var, flush enqueued via ctx.waitUntil,
  `otlpMetricsEnabled: false` opt-out, flush still fires when the
  inner handler throws).

End-to-end verified live against the deployed deco-otel-ingest
endpoint: sum, gauge, and histogram all round-trip through the
exporter → ingestor → ClickHouse `otel_metrics_{sum,gauge,histogram}`
with correct values, attribute fingerprints, and timestamps. Smoke
harness at `scripts/smoke-otlp-meter.ts` for repeatability.

567/567 vitest pass, typecheck clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 9 files (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread src/core/sdk/otelHttpMeter.ts Outdated
…nstrumentedFetch

Two related changes that close gaps in outbound fetch instrumentation
before downstream services see our traffic:

**`redactUrl()` (new in `src/core/sdk/urlRedaction.ts`).** Replaces
every query value with `"REDACTED"` by default, with an opt-in
`keepQueryKeys` allowlist for structural params (`page`, `sort`, etc.).
Strips userinfo and the fragment, preserves host + path verbatim, and
falls back to a defensive substring-before-`?` for unparseable inputs.
Exported as `@decocms/start/sdk/urlRedaction`.

Why redact at the emit site instead of relying on the ingestor:
spans land in CF Workers Tracing the instant they're stamped — once
the unredacted URL is in the CF dashboard, we can't take it back. Same
for `console.log` lines captured by `observability.logs`. The
ingestor's existing header-redaction layer doesn't reach the URL, so
this is the only point where outbound URLs get sanitized.

**`createInstrumentedFetch` updates:**

- `http.url` span attribute now carries the redacted URL.
- The colored dev log lines (`[vtex] GET ...`) print the redacted URL.
- New `keepQueryKeys: ReadonlyArray<string>` option propagates to
  `redactUrl` for callers that want to keep specific keys readable.
- New `injectTraceparent: boolean` option (default `true`) injects the
  active span's W3C `traceparent` header onto every outbound request
  via `injectTraceContext(headers)`. No-op when no span is active, so
  CF Worker auto-instrumented spans automatically propagate when
  present and the wrapper is invisible otherwise.
- Header merging preserves caller-supplied headers when `input` is a
  `Request` — pre-existing values win, the wrapper adds traceparent
  only.

The structured `"outgoing fetch"` breadcrumb (only emitted when
`OTEL_LOG_OUTGOING_FETCH=true`) keeps using `host`/`path` derived from
the rawUrl since both fields are individually safe and the value lands
in the logger pipe where the ingestor's redaction layer applies.

Tests:

- `urlRedaction.test.ts` — 10 cases: every-value redaction, allowlist,
  empty-value preservation, userinfo stripping, fragment drop, multi-
  value query collapse, unparseable input fallback.
- `instrumentedFetch.test.ts` — 7 cases: redacted URL on span, allowlist
  works, traceparent injection on outbound request when span has a
  spanContext, opt-out via `injectTraceparent: false`, no-op when no
  span is active, caller headers preserved alongside injected
  traceparent.

584/584 vitest pass, typecheck clean, tier-boundaries audit clean.

This unblocks PR #3 in `@decocms/apps-start`: VTEX/Shopify fetch
wrappers will migrate to `createInstrumentedFetch` and inherit
traceparent propagation + URL redaction for free, instead of duplicating
the logic in each app.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 6 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread src/core/sdk/instrumentedFetch.ts Outdated
Comment thread src/core/sdk/urlRedaction.ts Outdated
Comment thread src/core/sdk/instrumentedFetch.test.ts Outdated
vibe-dex and others added 2 commits May 18, 2026 17:10
…ture

Errors are too important to leave to head sampling. Once the cost model
drops `logs.head_sampling_rate` below 1.0 — the recommended target is
0.01 once this channel ships — a sampled CF Destinations pipe loses
99% of `logger.error(...)` calls. Sites lose the one signal they care
about most.

This commit adds a separate, framework-controlled transport for
`level: "error"` records via direct POST to deco-otel-ingest `/v1/logs`
(same endpoint CF Destinations uses; same `default.otel_logs` table;
no ingestor changes required).

**`createOtlpHttpErrorLogAdapter` (new in `src/core/sdk/otelHttpErrorLog.ts`).**

- Returns `{ adapter, flush, pendingRecordCount }`. The `adapter` is a
  `LoggerAdapter` that filters out debug/info/warn upstream of the
  buffer — only errors travel this transport. The composite logger
  fans non-error calls out to `defaultLoggerAdapter` which CF
  Destinations samples normally.
- Token-bucket rate limiter prevents log storms (a pathological error
  loop blasting 10K errors/sec would be self-inflicted DoS on
  stats-lake). Default: 100 errors/min refill, burst capacity 20.
  Both knobs are options.
- Buffer cap (default 500 records) bounds isolate memory. Overflow
  surfaces via `onError("overflow", …)`; existing records still queue.
- `flush()` POSTs and is throttled by `minFlushIntervalMs` (default
  5000ms). Concurrent callers share a single in-flight POST. Cooldown
  is bypassed when the buffer is at the cap.
- Snapshot-then-reset semantics: records buffered after `flush()`
  starts the POST aren't lost — they queue for the next flush.
- Transport failures surface via `onError("flush", err)`. Never throws.
- Wire format: OTLP/HTTP JSON `ResourceLogs` matching what CF
  Destinations emits, so the existing ingestor handler accepts both
  transports unchanged. Scalar attribute kinds (string/bool/int/double)
  serialize to the right OTLP variants; structured attrs round-trip
  as JSON strings.

**`instrumentWorker` wiring:**

- `bootObservability` reads `env.DECO_OTEL_LOGS_ENDPOINT` (env var
  configurable via `OtelOptions.otlpErrorLogsEndpointEnvVar`). When
  set and `otlpErrorLogsEnabled !== false`, the error-log adapter is
  composed onto `defaultLoggerAdapter` via `createCompositeLogger`.
- After every `handler.fetch`, the wrapped fetch fires
  `ctx.waitUntil(otlpErrorLog.flush())` alongside the metrics flush.
  Both throttle themselves; calling on every request is cheap.
- Boot breadcrumb adds `otlpErrorLogs: true|false`.
- The exporter's `onError` callback uses `console.warn` directly (not
  `logger.warn`) to avoid recursing back into the logger composite.

Tests:

- `otelHttpErrorLog.test.ts` — 10 unit tests: level filter, OTLP shape,
  resource attributes, scalar serialization for all kinds, token-bucket
  rate limiting with refill, buffer overflow, flush success/non-200/
  rejection, cooldown gating with cap bypass, in-flight coalescing.
- `otel.test.ts` — 3 integration tests: error routes through direct
  POST when env is set, info/warn/debug do NOT trigger POSTs, opt-out
  via `otlpErrorLogsEnabled: false`.

End-to-end verified live against deployed deco-otel-ingest: an error
record with mixed-kind attributes (string, bool, int) round-trips
into `default.otel_logs` with correct SeverityText="error",
SeverityNumber=17, Body, LogAttributes, ServiceName, and
ScopeName="@decocms/start". Smoke harness at
`scripts/smoke-otlp-errorlog.ts`.

597/597 vitest pass, typecheck clean.

With this channel in place, the next safe step (in a follow-up PR) is
to lower `logs.head_sampling_rate` from 1.0 to 0.01, dropping the
fleet-wide CF Destinations log bill by ~99% while keeping 100% of
errors via this transport.

Co-authored-by: Cursor <cursoragent@cursor.com>
Phase 6 (deeper instrumentation): the highest-value addition over the
existing withTracing coverage is making cache decisions filterable on
traces directly, without joining to the metric tables.

`recordCacheMetric` now stamps `deco.cache.decision` and
`deco.cache.profile` on the closest enclosing span before recording the
counter. Existing call sites in workerEntry.ts (HIT / STALE-HIT /
STALE-ERROR / MISS / BYPASS) get this for free — no call-site changes.

This is intentionally narrower than the originally scoped
block.resolve / schema.compose / render.shell spans. Those would either
duplicate existing spans (resolvePage already wraps block resolution)
or fire only at boot time (composeMeta), and would explode cardinality
without proportional value. Stamping the cache decision on the span
that's already there is one attribute per request, zero new spans.

Tests: 4 new cases covering decision+profile stamp, no-span safety,
profile-omitted, and confirming the counter still fires.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 6 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread src/core/sdk/otelHttpErrorLog.ts
Comment thread src/core/sdk/otelHttpErrorLog.ts Outdated
vibe-dex and others added 2 commits May 18, 2026 17:20
…rangler.jsonc

Phase 7 (audit rules): the **detect** half of D3 ("audit is the safety
net"). Pairs 1:1 with the existing `deco-cf-observability --write` fix
— every rule has a `fix:` line in its finding pointing at the exact
codemod flag that resolves it.

Read-only auditor with 6 rules covering the fleet-scale failure modes:

  observability_missing            error
  observability_disabled           error
  traces_disabled / logs_disabled  warn
  head_sampling_rate_elevated      error   (traces > 0.01)
  logs_head_sampling_rate_low      warn    (logs < 1)
  persist_disabled_no_destination  error

CI-friendly: exit 0 on clean, 1 on findings (any severity), 2 on
unparseable wrangler.jsonc. `--json` for machine-readable output.

Wired up as a published bin: `deco-audit-observability`. Storefronts can
run it in CI; a future CF Tail Worker pre-merge gate can run it against
PR changes.

Also factored `stripJsoncComments` out of the codemod into
`scripts/lib/jsonc.ts` so both scripts share the helper.

Tests: 12 cases — canonical-passes, every rule fires, boundary value
0.01 doesn't false-fire, persist:false-with-destination is fine,
multiple findings stack correctly.

Co-authored-by: Cursor <cursoragent@cursor.com>
…loss profile

Phase 9 (docs refresh): aligns docs/observability.md with what shipped
across Phases 3, 4, 5, 6, 7. No more "will ship" or "to land" — only
present tense for code that exists.

Changes:
- Architecture diagram now shows both transport paths (CF Destinations
  for traces + info/warn logs; direct POST for metrics + error logs).
- New "Metrics: AE vs OTLP" subsection explaining the two-meter
  composite — what each pipe is for, what's lost by dropping each.
- New "Data loss profile" table covering all five signal classes:
  traces, info/warn logs, error logs, OTLP metrics, AE metrics. Each
  with explicit loss conditions (sampling, rate limiter, buffer
  overflow, isolate eviction) so operators know where to look when a
  signal goes missing.
- New "Wiring" subsection documenting the direct-POST env vars
  (DECO_OTEL_METRICS_ENDPOINT, DECO_OTEL_LOGS_ENDPOINT) — these are
  the user-facing config knobs that turn the new channels on.
- Sampling section now uses present tense for the direct-POST error
  channel, with an explicit note that errors bypass head sampling.
- Cost-model footnote moved from future-tense to past-tense for the
  shipped error channel; updated savings projection.
- "Out of scope" trimmed and made specific — calls out that direct
  POST IS in scope (deliberately) for metrics and error logs only,
  not for spans or info-logs.

Tests + typecheck still 613 passing.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 9 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread scripts/audit-observability-config.ts Outdated
Phase 8 (cubic review fixes): triage of seven findings on the last five
cubic reviews. All seven were real and now addressed. Tests up to 626
(from 613), all green.

P1 — correctness:
- instrumentedFetch.ts header semantics now follow the Fetch spec:
  when init.headers is provided, it REPLACES Request.headers; when it's
  omitted, Request.headers are preserved. Previous code unioned them,
  which silently leaked Request headers through even when callers
  intended to drop them. Two new tests pin the contract.
- otelHttpErrorLog.ts no longer permanently drops records on a failed
  POST. The snapshot is now restored to the front of the buffer on
  non-2xx and on network errors, so a subsequent flush picks them up.
  The buffer's existing maxBufferRecords cap still bounds total memory
  if the endpoint stays down — newest-tail records are dropped via
  onError("overflow", …) only when the buffer is at the cap during
  restore. Three new tests cover the restore path including the
  cap-overflow case.

P2 — defensive coverage:
- urlRedaction.ts fallback path now strips fragments too (`#…`), not
  just queries — fragments carry OAuth implicit-grant access tokens.
- otelHttpMeter.ts splits getOrCreate so the overflow check runs
  BEFORE materializing a new MetricEntry. Previously, a dropped
  datapoint left a permanent empty entry in the metrics map that
  serialized to an empty OTLP envelope each flush.
- otelHttpErrorLog.ts attrToOtlpValue guards JSON.stringify(v) ===
  undefined (functions, undefined, symbols) — previously surfaced as
  malformed { stringValue: undefined } in the OTLP payload.
- lib/jsonc.ts adds stripJsoncTrailingCommas + parseJsonc helpers; the
  audit and codemod now use them so real wrangler.jsonc files with
  trailing commas (very common) audit cleanly instead of failing with
  a parse error.

P3 — test honesty:
- instrumentedFetch.test.ts "preserves rawUrl in the structured
  outgoing-fetch log" was misnamed: the test didn't enable
  OTEL_LOG_OUTGOING_FETCH and didn't assert on the logger. Replaced
  with two tests: one that flips the env flag, captures the logger,
  and asserts the breadcrumb's host/path/method/status; another that
  confirms the breadcrumb is absent when the flag is off.

Docs: data loss profile in docs/observability.md updated to reflect
the failed-flush restore path and the actual maxBufferRecords default.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 12 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/core/sdk/otelHttpErrorLog.ts">

<violation number="1" location="src/core/sdk/otelHttpErrorLog.ts:206">
P2: When flush fails and the buffer is full, this branch drops the entire failed snapshot (older records) instead of evicting newer buffered entries. That reverses the intended “prioritize originating-error context” behavior and can lose the most relevant error logs during endpoint outages.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread src/core/sdk/otelHttpErrorLog.ts Outdated
Add per-call operation resolution to the upstream
createInstrumentedFetch helper so commerce integrations
(@decocms/apps) can produce semantic span names like
"vtex.intelligent-search.product_search" or "vtex.checkout.orderForm"
without forking the framework's outbound-fetch span shape or paying for
a second per-call child span.

API additions:

- `init.operation?: string` — per-call override. Stripped from init
  before reaching baseFetch.
- `FetchInstrumentationOptions.defaultOperation?: string` — fallback
  when init.operation is unset. Useful for single-purpose clients
  (e.g. Resend → "emails.send" for every call).
- `FetchInstrumentationOptions.resolveOperation?: (url, method) =>
  string | undefined` — URL-derived long-tail router. Returns
  undefined to opt out (span falls back to "${name}.fetch"). Designed
  to give the apps repo a two-layer model: explicit operation strings
  at hot call sites, URL-derived defaults for the long tail.

Resolution precedence: init.operation ?? defaultOperation ??
resolveOperation(url, method) ?? "fetch". The resolved operation is
stamped as `fetch.operation` on the span (queryable independent of
span name) and surfaced in the onComplete callback payload so per-app
duration histograms can label by operation.

Types:

- `InstrumentedFetchInit = RequestInit & { operation?: string }`
- `InstrumentedFetch = (input, init?: InstrumentedFetchInit) =>
  Promise<Response>`

`InstrumentedFetch` remains structurally assignable to typeof fetch
(operation is optional), so existing setter signatures
(setVtexFetch, setShopifyFetch) keep compiling unchanged. Apps repo
will switch its module-level `let _fetch: typeof fetch` to
`InstrumentedFetch` to surface the operation property type.

Why this lives in @decocms/start instead of @decocms/apps: the
framework already owns the outbound-fetch span shape (${name}.fetch).
Letting two repos define parallel span shapes would force every call
to produce two spans (outer apps-side wrapper + inner upstream
${name}.fetch), doubling per-month CF Destinations egress and
ClickHouse storage. Single-span model is materially cheaper at fleet
scale and architecturally cleaner.

Tests: 7 new in instrumentedFetch.test.ts — explicit operation,
defaultOperation, resolveOperation, undefined-returning router,
precedence (init wins over default wins over router), onComplete
payload, init-mutation safety (operation stripped before baseFetch).
633 tests total, all green; typecheck and biome clean.

Docs: docs/observability.md gains a "Per-call operation names on
outbound fetches" subsection under Trace propagation, and the Out-of-
scope note for commerce spans is rewritten to reflect the
framework/apps split (framework owns span shape, apps owns operation
strings + provider histogram).

Co-authored-by: Cursor <cursoragent@cursor.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread src/core/sdk/instrumentedFetch.ts
cubic flagged that the Phase 8 restore-on-failure path inverted the
intended priority: when buffer + snapshot exceeded the cap, the path
truncated the SNAPSHOT (older, more-likely-causal records) rather
than the buffer (newer records that arrived during the failed POST).
That contradicted the stated invariant ("preserve originating-error
context").

Fix: restore policy is now snapshot-first, evict newest:

1. snapshot + buffer ≤ cap → unshift snapshot, done.
2. Otherwise, drop newest-tail records from the live buffer until
   either the buffer is empty or the total fits.
3. Only if the snapshot ALONE still exceeds the cap do we truncate
   it — and even then, keep the OLDEST end and drop the newest tail.

The previous test "drops oldest-tail records of the snapshot"
asserted the buggy behavior and has been replaced with two tests
that pin the corrected semantics:

- snapshot (2) + buffer (2) > cap (3) → drops 1 newer record from
  buffer, snapshot fully restored; onError surfaces the count and
  the preserved-record count.
- snapshot (2) + buffer (2) > cap (2) → drops both buffer records;
  snapshot fully restored.

Data-loss-profile in docs/observability.md updated to match.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vibe-dex vibe-dex changed the title feat(observability): broaden span/metric coverage for cache, errors, trace correlation feat(observability): direct-POST metrics + error channel, per-call fetch operations, audit CLI May 18, 2026
@vibe-dex vibe-dex changed the base branch from main to next May 18, 2026 21:13
…omitted

cubic flagged that `resolveOperation` (and the `http.method` span attr)
received "GET" for any caller doing
`fetch(new Request(url, { method: "POST" }))` without an explicit
`init.method` — the wrapper's `init?.method || "GET"` ignored the
Request's own method. POST/PUT/DELETE traffic surfaced as GET on
spans and was routed by `resolveOperation(url, "GET")` even when the
real method was different.

Fix: method resolution now follows the Fetch spec —
`init.method ?? input.method ?? "GET"`. Init wins (Fetch spec), then
the Request's method survives, then we default to GET to match
`new Request(url).method`.

Tests:
- a Request with method=POST and no init → span attr http.method=POST,
  resolveOperation called with "POST", span name reflects POST.
- init.method=DELETE overrides Request.method=POST → http.method=DELETE.

636 tests total, all green.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant