
chore(workspace-engine): Enable OTEL Logs#1113

Open
jsingleton-dev wants to merge 13 commits into main from jsingleton/otel-logs

Conversation


@jsingleton-dev jsingleton-dev commented May 7, 2026

This PR enables OTEL log export. We had to switch our logging lib from charmbracelet to slog, because slog lets you register log handlers, which allows us to plug in the OTEL handler from the SDK.

slog is mostly compatible with our existing logging setup, with the exception of log.Fatal. I had to make a compromise to keep the existing behavior; we should refactor that separately.

I used the context-aware log calls where possible so we get proper trace/span correlation with logs. Verified this is working locally. Examples from the local OTEL collector:

LogRecord #0
ObservedTimestamp: 2026-05-07 17:39:22.144053 +0000 UTC
Timestamp: 2026-05-07 17:39:22.144011 +0000 UTC
SeverityText: INFO
SeverityNumber: Info(9)
--
     -> resource_id: Str(f0f4dd8a-1cb6-42f7-ae2b-322b9a0f0d52)
Trace ID: deee90d53575a82b6c3f88047cf9e172
Span ID: 9f6b9fed35b2cc40
Flags: 1
LogRecord #1
ObservedTimestamp: 2026-05-07 17:39:22.144809 +0000 UTC
Timestamp: 2026-05-07 17:39:22.14478 +0000 UTC
SeverityText: INFO
SeverityNumber: Info(9)
--
     -> resource_id: Str(ad1c6f9c-6cf9-4888-9c89-9997e9083f2d)
Trace ID: 36dbda3898332c67d9161579d3168bbf
Span ID: 9c390034d7dcc0d4
Flags: 1

Summary by CodeRabbit

  • New Features

    • Structured logging integrated with OpenTelemetry for centralized log collection and a new logs pipeline.
    • Configurable log level via environment variable.
  • Improvements

    • Widespread switch to context-aware structured logging across services for clearer, consistent observability.
    • More reliable startup/shutdown log messages and improved error reporting.


coderabbitai Bot commented May 7, 2026

Review Change Stack

Warning

Rate limit exceeded

@jsingleton-dev has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 47 minutes and 9 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 458940ef-0aff-4aab-a436-7c723a56596a

📥 Commits

Reviewing files that changed from the base of the PR and between 66e0886 and 615a479.

📒 Files selected for processing (1)
  • apps/workspace-engine/svc/controllers/desiredrelease/controller.go
📝 Walkthrough

Walkthrough

This PR migrates workspace-engine from Charmbracelet logging to Go's structured log/slog, adds tee and level slog handlers with tests, configures OTLP log export and batching, adds a LogLevel config, updates numerous call-sites to context-aware slog, and adds a local OTEL collector logs pipeline.

Changes

Structured Logging Migration to slog

Layer / File(s) Summary
Dependencies & Configuration
apps/workspace-engine/go.mod, apps/workspace-engine/pkg/config/env.go
Added OpenTelemetry logging deps (otelslog, otel/sdk/log) and exposed Config.LogLevel (LOG_LEVEL, default "INFO").
Custom Log Handlers & Tests
apps/workspace-engine/log_handler.go, apps/workspace-engine/log_handler_test.go
Adds teeHandler (fan-out + error join) and levelHandler (min-level gating) with unit tests for forwarding, Enabled union, attr/group propagation, and aggregated errors.
Logger Initialization & OTEL Setup
apps/workspace-engine/otel.go, apps/workspace-engine/main.go
Adds parseLogLevel() and initLogger() to set up OTLP log exporter, batching sdklog.LoggerProvider, OTEL slog bridge, stderr handler, and a tee combining outputs; main initializes and defers shutdown.
Database, Convert & Store
apps/workspace-engine/pkg/db/client.go, apps/workspace-engine/pkg/db/convert.go, apps/workspace-engine/pkg/store/policies/get_for_release_targets.go, apps/workspace-engine/pkg/store/resources/get_resources.go
Replaced charmbracelet/log with context-aware slog.ErrorContext/slog.InfoContext for DB init failures, JSON unmarshal errors, policy ID parse failures, and resource optimization logs.
Reconcile Workers & Events
apps/workspace-engine/pkg/reconcile/worker.go, apps/workspace-engine/pkg/reconcile/events/*
Worker claim/item error logs and enqueue event operations updated to slog.*Context with structured fields (e.g., count).
Workspace Evaluation & Getter
apps/workspace-engine/pkg/workspace/relationships/eval/evaluate.go, apps/workspace-engine/pkg/workspace/releasemanager/policy/evaluator/deploymentdependency/getter.go
CEL compile/eval warnings use slog.WarnContext; getter DB/query errors use slog.ErrorContext.
Job Agents
apps/workspace-engine/pkg/jobagents/*
ArgoCD, Argo Workflows, Terraform Cloud logging changed to slog.WarnContext/slog.ErrorContext for retries, warnings, and errors.
Metrics Providers & Evaluator
apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/*
Datadog, Prometheus, and HTTP metric providers updated to slog.ErrorContext/slog.WarnContext/slog.DebugContext; metric evaluator and reconcile logging moved to context-aware slog calls.
Controllers (Init & Reconcile)
apps/workspace-engine/svc/controllers/*
Controller constructors standardized to log init failures with slog.Error then os.Exit(1); reconcile and getter logs migrated to slog.*Context.
HTTP & Utility Services
apps/workspace-engine/svc/http/*, apps/workspace-engine/svc/pprof/*, apps/workspace-engine/svc/claimcleanup/service.go, apps/workspace-engine/svc/service.go
HTTP server startup/error logging, middleware error logging, pprof lifecycle, claim cleanup, and Runner lifecycle use slog.*Context; fatal startup errors exit after logging.
OpenTelemetry Collector Configuration
otel-collector-local.yaml
Added service.pipelines.logs to route OTLP receiver → batch processor → logging exporter for local log visibility.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • jsbroks
  • adityachoudhari26

Poem

🐰 From charm to slog I hop with glee,

I stitch handlers—tee and level—free,
OTLP pipes hum a tidy song,
Context flows clean and strong,
Logs now find their home where they should be.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: enabling OpenTelemetry logs in the workspace-engine by switching from charmbracelet to slog logging.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jsingleton/otel-logs


Comment @coderabbitai help to get the list of available commands and usage tips.

@jsingleton-dev changed the title from chore(logs): Enable OTEL Logs to chore(workspace-engine): Enable OTEL Logs on May 7, 2026
Comment thread apps/workspace-engine/pkg/db/client.go Outdated
Comment on lines +25 to +26
slog.Error("Failed to parse database config", "error", err)
os.Exit(1)
Collaborator Author

Would prefer not to have this added everywhere, but this is what our previous log.Fatal was doing under the hood. slog has no Fatal option, so this keeps existing behavior. We should refactor to return errors up the stack instead, but that can be a separate change, as this is already a large diff.

Collaborator

@mleonidas mleonidas left a comment

I know this was just adding logging, but I added some comments on issues I noticed in existing code. If you want to implement the changes I proposed, feel free; if not, I'll create some issues. Otherwise LGTM.

if err != nil {
if !errors.Is(err, pgx.ErrNoRows) {
slog.Error(
slog.ErrorContext(ctx,
Collaborator

This isn't really an error, and the log message is actually kind of confusing. I need to look more at the context here, but the DB returning no rows is more informational, unless there were expected release targets that we somehow knew we should find here but didn't?

Collaborator Author

Yeah, agreed: no rows generally isn't an error. I'll take a closer look at the flow for this one.

return
}
log.Error("claim-cleanup: failed to release expired claims", "error", err)
slog.ErrorContext(ctx, "claim-cleanup: failed to release expired claims", "error", err)
Collaborator

I know you're just adding in logging here, but I'm noticing this ctx.Err() != nil check, which is kind of awkward and has a subtle timing issue: in the rare case that CleanupExpiredClaims returns a real error while the context also happens to be canceled, we'd swallow the legitimate failure. I'd probably refactor this to

func (s *Service) run(ctx context.Context) {
	defer close(s.done)
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.tick(ctx)
		}
	}
}

func (s *Service) tick(ctx context.Context) {
    released, err := s.queue.CleanupExpiredClaims(ctx)
    if err != nil {
        if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return
		}
		slog.ErrorContext(ctx, "claim-cleanup: failed to release expired claims", "error", err)
		return
	}
	if released > 0 {
		slog.InfoContext(ctx, "claim-cleanup: released expired claims", "count", released)
    }
}

I would also argue that we probably want a context timeout, which would give us clean ergonomics for distinguishing "this is shutting down" from "we failed and have an error".

if err != nil {
log.Fatal("Failed to parse database config:", err)
slog.ErrorContext(ctx, "Failed to parse database config", "error", err)
os.Exit(1)
Collaborator

This comment is more for me than for this review: this whole file should be refactored; there are a bunch of subtle bugs here.

  1. The singleton is just overhead without any benefit. sync.Once exists to coordinate concurrent first-time initialization; we call it 13+ times in main.go sequentially, so there's no concurrency to coordinate.
  2. Errors are swallowed by os.Exit inside the library.
  3. The first caller's context is silently captured, since it runs inside Once.Do.
  4. Probably most important: GetPool only returns a *pgxpool.Pool with no error, which is a lie, because it can error; instead it just kills the process, and the nil check at the GetDB call sites is basically a data race.

There are a bunch more problems, but I think we actually want to refactor this.

i would make this package look like

package db

import (
	"context"
	"fmt"
	"time"

	"github.com/exaring/otelpgx"
	"github.com/jackc/pgx/v5/pgxpool"
)

type Config struct {
	URL             string
	MaxPoolSize     int32
	ApplicationName string
}

// NewPool creates a configured pgx pool. The caller owns it and must Close it.
func NewPool(ctx context.Context, cfg Config) (*pgxpool.Pool, error) {
	pgxCfg, err := pgxpool.ParseConfig(cfg.URL)
	if err != nil {
		return nil, fmt.Errorf("parse database config: %w", err)
	}

	pgxCfg.MaxConns = cfg.MaxPoolSize
	pgxCfg.MinConns = 1
	pgxCfg.HealthCheckPeriod = 30 * time.Second
	pgxCfg.ConnConfig.Tracer = otelpgx.NewTracer()
	pgxCfg.ConnConfig.RuntimeParams["application_name"] = cfg.ApplicationName

	pool, err := pgxpool.NewWithConfig(ctx, pgxCfg)
	if err != nil {
		return nil, fmt.Errorf("create database pool: %w", err)
	}
	return pool, nil
}

and then in main, or tests

pool, err := db.NewPool(ctx, db.Config{
    URL:             config.Global.PostgresURL,
    MaxPoolSize:     int32(config.Global.PostgresMaxPoolSize),
    ApplicationName: config.Global.PostgresApplicationName,
})
if err != nil {
    slog.Error("failed to initialize database", "error", err)
    os.Exit(1)
}
defer pool.Close()

and pass the pool to all the services. Then we can remove all the nil checks and the log lines from the services, since the pool has to be initialized at that point.

Collaborator Author

Yeah, the errors swallowed by os.Exit are in definite need of a fix. I was planning to tackle those in a separate PR. Good callouts on the other ones too.

log.Fatal("Failed to get pgx pool")
panic("failed to get pgx pool")
slog.Error("Failed to get pgx pool")
os.Exit(1)
Collaborator

see comment about db client refactor

log.Fatal("Failed to get pgx pool")
panic("failed to get pgx pool")
slog.Error("Failed to get pgx pool")
os.Exit(1)
Collaborator

see comment about db client refactor

if pgxPool == nil {
log.Fatal("Failed to get pgx pool")
panic("failed to get pgx pool")
slog.Error("Failed to get pgx pool")
Collaborator

see comment about db client refactor

if pgxPool == nil {
log.Fatal("Failed to get pgx pool")
panic("failed to get pgx pool")
slog.Error("Failed to get pgx pool")
Collaborator

see comment about db client refactor

@jsingleton-dev force-pushed the jsingleton/otel-logs branch from 452c046 to a3959e5 on May 7, 2026 19:06
@jsingleton-dev marked this pull request as ready for review on May 7, 2026 19:07
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 10

♻️ Duplicate comments (1)
apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go (1)

65-66: ⚠️ Potential issue | 🟠 Major

Same OTEL flush-before-exit risk as in other controllers

The slog.Error + os.Exit(1) pattern here is subject to the same batched-exporter flush issue noted in policyeval/controller.go. No additional context is available in this New function, so slog.Error (no context) is the right variant, but the flush concern remains.

Also applies to: 97-99

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go`
around lines 65 - 66, The code currently calls slog.Error followed immediately
by os.Exit(1) (seen in New and again around the block at 97-99), which can cause
OTEL batched exporters to drop spans; before exiting, call the OpenTelemetry
shutdown routine (e.g., otel.Shutdown(ctx) or your tracer provider's
Shutdown(ctx) with a short timeout) and wait for it to complete, then log the
error and call os.Exit(1); update the error paths in the New function and the
later exit block to perform the OTEL shutdown with a context deadline before
exiting.
🧹 Nitpick comments (1)
apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go (1)

381-382: ⚡ Quick win

slog.Warn calls in helper functions lack context — inconsistent with PR's trace-correlation goal

buildResultData, extractVectorResults, and extractMatrixResults use slog.Warn without a context, so these log records won't carry trace/span IDs. ctx is available in Measure and could be threaded through to these helpers.

✏️ Suggested refactor

Pass ctx through the call chain:

-func buildResultData(
-	statusCode int,
-	respBody []byte,
-	duration time.Duration,
-) (map[string]any, error) {
+func buildResultData(
+	ctx context.Context,
+	statusCode int,
+	respBody []byte,
+	duration time.Duration,
+) (map[string]any, error) {
-	value, results, err := extractResults(promResp)
+	value, results, err := extractResults(ctx, promResp)
	if err != nil {
-		slog.Warn("Could not extract Prometheus result values", "error", err)
+		slog.WarnContext(ctx, "Could not extract Prometheus result values", "error", err)
	}

Apply the same propagation through extractResults → extractVectorResults / extractMatrixResults.

Also applies to: 419-420, 454-455

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go`
around lines 381 - 382, The helper functions buildResultData,
extractVectorResults, extractMatrixResults (and the intermediate extractResults)
must accept a context.Context parameter and be updated where called (Measure →
extractResults → extractVectorResults/extractMatrixResults) so logs carry
trace/span IDs; replace plain slog.Warn calls inside those helpers with a
context-aware logger retrieved from ctx (e.g., logger := slog.FromContext(ctx)
or equivalent) and call logger.Warn(...) including the existing error details.
Ensure all function signatures and call sites are updated consistently to pass
ctx through the chain.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/workspace-engine/log_handler.go`:
- Around line 64-66: The Enabled method in levelHandler currently only checks
the local level (lvl >= l.level) and ignores the wrapped handler's Enabled
decision; update levelHandler.Enabled(ctx context.Context, lvl slog.Level) to
also call and respect l.handler.Enabled(ctx, lvl) (i.e., return true only if
both the local level allows the record and the wrapped handler.Enabled returns
true), ensuring you pass the incoming ctx through to l.handler.Enabled so that
context-dependent filtering in the wrapped handler is honored.

In `@apps/workspace-engine/pkg/config/env.go`:
- Line 34: The env var name OTEL_LOG_LEVEL used by the OTELLogLevel field
repurposes an OpenTelemetry-reserved variable and could conflict with SDK
behavior; rename the struct field tag and env var to a non-conflicting name
(e.g., ExporterLogLevel or SlogToOTELLevel) by updating the OTELLogLevel field's
envconfig tag from OTEL_LOG_LEVEL to the new variable (and rename the field if
you prefer for clarity), then update any code that reads/writes OTELLogLevel
(constructors, config loaders, documentation, and tests) to use the new symbol
and env var so the application log-forwarding threshold no longer collides with
the OpenTelemetry SDK setting.

In `@apps/workspace-engine/pkg/db/convert.go`:
- Line 477: The unmarshal error log call using slog.Error("failed to unmarshal
release variables", "error", err) should include the release identifier to aid
debugging; update that logging invocation (the slog.Error call in convert.go
where release variables are unmarshaled) to add the row.ID (e.g., "release_id",
row.ID) or another unique release identifier used in the surrounding code so the
error message can be correlated to the specific release/row.

In `@apps/workspace-engine/pkg/store/policies/get_for_release_targets.go`:
- Around line 130-140: The slog.InfoContext call in get_for_release_targets.go
currently logs "policy_ids" with len(policyIDs) (an int) which is misleading;
change the logged attributes so the count and the actual IDs are correctly named
and represented — e.g., log "policy_count" with len(policyIDs) and separately
log "policy_ids" with the policyIDs slice (or a stringified list of IDs) in the
slog.InfoContext invocation used around setting policies for release targets,
ensuring the attribute keys match the value types.

In
`@apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go`:
- Around line 128-129: The slog.Error followed immediately by os.Exit(1) can
cause OTEL BatchProcessor logs to be dropped; modify the startup error path in
the controller (where slog.Error is called before os.Exit) to invoke the OTEL
logs provider ForceFlush (or equivalent synchronous flush API) and wait for it
to complete or time out before calling os.Exit(1), handling any ForceFlush
error; apply the same change to the equivalent error path in the deployment plan
result controller (the other instance where slog.Error is followed by os.Exit).

In `@apps/workspace-engine/svc/controllers/desiredrelease/controller.go`:
- Around line 139-141: Replace the plain slog.Error call with slog.ErrorContext
using the existing ctx variable so logs are correlated to the request context:
locate the error path in the desired release reconcile worker creation block in
controller.go (the slog.Error("Failed to create desired release reconcile
worker", "error", err) call) and change it to call slog.ErrorContext(ctx, ...)
passing the same message and error key/value; no other behavior changes
required.

In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go`:
- Around line 160-161: The code logs the pointer address because resolved.Site
is a *string; update the slog call in the Datadog metric request block (the
slog.ErrorContext invocation) to pass the already dereferenced local variable
site (or explicitly *resolved.Site after nil check) instead of resolved.Site so
the actual site string is logged; ensure you handle the nil case if you choose
to dereference (use the existing site string variable if present).

In `@apps/workspace-engine/svc/controllers/policyeval/controller.go`:
- Around line 125-127: Replace the plain slog.Error call with slog.ErrorContext
using the existing ctx variable: locate the error handling in the policy eval
reconcile worker creation (the block that currently calls slog.Error("Failed to
create policy eval reconcile worker", "error", err) and os.Exit(1)) and change
the log invocation to slog.ErrorContext(ctx, ...) so the log includes the
context for trace/span correlation while keeping the same message and error key;
leave the os.Exit(1) behavior unchanged.

In `@apps/workspace-engine/svc/http/http.go`:
- Around line 43-45: The current handler in the goroutine that runs
s.httpServer.ListenAndServe logs the error with slog.Error then calls os.Exit(1)
which prevents OTEL's batch logs from flushing; instead remove the direct
os.Exit call and send the error to a coordinator channel (e.g., a FatalErr or
serverError channel) so the main goroutine can perform an orderly shutdown
(invoke otel provider Shutdown/ForceFlush and other cleanup) before exiting;
modify the goroutine around s.httpServer.ListenAndServe and replace os.Exit
usage with sending the error to that channel, and ensure the main service
manager selects on that channel and performs the provider shutdown and final
os.Exit.

---

Duplicate comments:
In `@apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go`:
- Around line 65-66: The code currently calls slog.Error followed immediately by
os.Exit(1) (seen in New and again around the block at 97-99), which can cause
OTEL batched exporters to drop spans; before exiting, call the OpenTelemetry
shutdown routine (e.g., otel.Shutdown(ctx) or your tracer provider's
Shutdown(ctx) with a short timeout) and wait for it to complete, then log the
error and call os.Exit(1); update the error paths in the New function and the
later exit block to perform the OTEL shutdown with a context deadline before
exiting.

---

Nitpick comments:
In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go`:
- Around line 381-382: The helper functions buildResultData,
extractVectorResults, extractMatrixResults (and the intermediate extractResults)
must accept a context.Context parameter and be updated where called (Measure →
extractResults → extractVectorResults/extractMatrixResults) so logs carry
trace/span IDs; replace plain slog.Warn calls inside those helpers with a
context-aware logger retrieved from ctx (e.g., logger := slog.FromContext(ctx)
or equivalent) and call logger.Warn(...) including the existing error details.
Ensure all function signatures and call sites are updated consistently to pass
ctx through the chain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d412833b-2030-4c55-903a-9034803be1af

📥 Commits

Reviewing files that changed from the base of the PR and between 73e359a and a3959e5.

⛔ Files ignored due to path filters (1)
  • apps/workspace-engine/go.sum is excluded by !**/*.sum
📒 Files selected for processing (47)
  • apps/workspace-engine/go.mod
  • apps/workspace-engine/log_handler.go
  • apps/workspace-engine/log_handler_test.go
  • apps/workspace-engine/main.go
  • apps/workspace-engine/otel.go
  • apps/workspace-engine/pkg/config/env.go
  • apps/workspace-engine/pkg/db/client.go
  • apps/workspace-engine/pkg/db/convert.go
  • apps/workspace-engine/pkg/jobagents/argo/application_upserter.go
  • apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
  • apps/workspace-engine/pkg/jobagents/argoworkflows/workflow_submitter.go
  • apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
  • apps/workspace-engine/pkg/reconcile/events/deploymentselector.go
  • apps/workspace-engine/pkg/reconcile/events/desiredrelease.go
  • apps/workspace-engine/pkg/reconcile/events/environmentselector.go
  • apps/workspace-engine/pkg/reconcile/events/jobeligibility.go
  • apps/workspace-engine/pkg/reconcile/events/policyeval.go
  • apps/workspace-engine/pkg/reconcile/worker.go
  • apps/workspace-engine/pkg/store/policies/get_for_release_targets.go
  • apps/workspace-engine/pkg/store/resources/get_resources.go
  • apps/workspace-engine/pkg/workspace/relationships/eval/evaluate.go
  • apps/workspace-engine/pkg/workspace/releasemanager/policy/evaluator/deploymentdependency/getter.go
  • apps/workspace-engine/svc/claimcleanup/service.go
  • apps/workspace-engine/svc/controllers/deploymentplan/controller.go
  • apps/workspace-engine/svc/controllers/deploymentplanresult/controller.go
  • apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
  • apps/workspace-engine/svc/controllers/desiredrelease/controller.go
  • apps/workspace-engine/svc/controllers/desiredrelease/reconcile.go
  • apps/workspace-engine/svc/controllers/environmentresourceselectoreval/controller.go
  • apps/workspace-engine/svc/controllers/forcedeploy/controller.go
  • apps/workspace-engine/svc/controllers/forcedeploy/reconcile.go
  • apps/workspace-engine/svc/controllers/jobdispatch/controller.go
  • apps/workspace-engine/svc/controllers/jobeligibility/controller.go
  • apps/workspace-engine/svc/controllers/jobeligibility/getters_postgres.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/metric.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/http/http.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/reconcile.go
  • apps/workspace-engine/svc/controllers/policyeval/controller.go
  • apps/workspace-engine/svc/controllers/relationshipeval/controller.go
  • apps/workspace-engine/svc/http/http.go
  • apps/workspace-engine/svc/http/server/server.go
  • apps/workspace-engine/svc/pprof/pprof.go
  • apps/workspace-engine/svc/service.go
  • otel-collector-local.yaml

Comment thread apps/workspace-engine/log_handler.go Outdated
Comment thread apps/workspace-engine/pkg/config/env.go Outdated
var raw map[string]json.RawMessage
if err := json.Unmarshal(row.Variables, &raw); err != nil {
log.Error("failed to unmarshal release variables", "error", err)
slog.Error("failed to unmarshal release variables", "error", err)
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add release ID to the unmarshal error log for debuggability.

Line 482 correctly includes "key", k, but line 477 logs the outer unmarshal failure with no identifier. Without row.ID (or at minimum the row's resource/deployment/environment IDs), correlating this log entry to a specific release in production is difficult.

🐛 Proposed fix
-			slog.Error("failed to unmarshal release variables", "error", err)
+			slog.Error("failed to unmarshal release variables", "releaseID", row.ID, "error", err)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
slog.Error("failed to unmarshal release variables", "error", err)
slog.Error("failed to unmarshal release variables", "releaseID", row.ID, "error", err)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/pkg/db/convert.go` at line 477, The unmarshal error log
call using slog.Error("failed to unmarshal release variables", "error", err)
should include the release identifier to aid debugging; update that logging
invocation (the slog.Error call in convert.go where release variables are
unmarshaled) to add the row.ID (e.g., "release_id", row.ID) or another unique
release identifier used in the surrounding code so the error message can be
correlated to the specific release/row.
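
The requested change is small; the following is a stdlib-only sketch of the pattern, where releaseRow and its fields are stand-ins for the actual convert.go types:

```go
package main

import (
	"encoding/json"
	"log/slog"
	"os"
)

// releaseRow is a stand-in for the row type in convert.go; only the
// fields needed for this sketch are included.
type releaseRow struct {
	ID        string
	Variables []byte
}

// unmarshalVariables decodes the row's variables, logging failures with
// the release identifier attached so the error can be correlated to a
// specific row in production.
func unmarshalVariables(row releaseRow) (map[string]json.RawMessage, bool) {
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(row.Variables, &raw); err != nil {
		slog.Error("failed to unmarshal release variables",
			"release_id", row.ID,
			"error", err)
		return nil, false
	}
	return raw, true
}

func main() {
	slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, nil)))
	_, ok := unmarshalVariables(releaseRow{ID: "rel-123", Variables: []byte("not json")})
	_ = ok
}
```

With the identifier attached, a single log line is enough to locate the offending row.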

Comment thread apps/workspace-engine/pkg/store/policies/get_for_release_targets.go
Comment thread apps/workspace-engine/svc/controllers/desiredrelease/controller.go Outdated
Comment on lines +92 to +93
+	slog.Error("Failed to get pgx pool")
+	os.Exit(1)
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

OTEL log records may be dropped on os.Exit(1) before the batch exporter flushes

The OTEL log exporter configured in this PR uses a batch processor. Calling os.Exit(1) immediately after slog.Error bypasses all deferred cleanup and Shutdown() calls on the SDK, so the log record is almost certainly never exported to the collector. This is true for every slog.Error + os.Exit(1) pair across the changed controllers.

Consider flushing/shutting down the OTEL provider before exiting, or at minimum accepting that these startup-failure records won't reach the collector (and ensuring the stderr output is enough for operators).

Also applies to: 125-127

Comment thread apps/workspace-engine/svc/controllers/policyeval/controller.go Outdated
Comment on lines 43 to +45
 if err := s.httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
-	log.Fatal("HTTP ListenAndServe error", "error", err)
+	slog.Error("HTTP ListenAndServe error", "error", err)
+	os.Exit(1)
Contributor


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

os.Exit(1) skips OTEL log flush — the fatal error log is likely never exported

slog.Error (line 44) enqueues the record in the BatchLogRecordProcessor, but os.Exit(1) immediately terminates the process, bypassing all deferred calls and the OTEL logger provider's Shutdown/ForceFlush. The error log that matters most (a fatal server failure) is precisely the one that will typically not reach the collector.

The previous charmbracelet log.Fatal wrote synchronously to stdout/stderr before exiting, so it was always visible. The OTEL async batch pipeline loses that guarantee.

The clean fix is to signal the main goroutine so it can drive an orderly shutdown (which includes flushing the OTEL providers) before exiting:

💡 Sketch of a channel-based approach
 type Service struct {
 	cfg        config.Config
 	pool       *pgxpool.Pool
 	httpServer *http.Server
+	fatalErr   chan error
 }

 func New(cfg config.Config, pool *pgxpool.Pool) *Service {
-	return &Service{cfg: cfg, pool: pool}
+	return &Service{cfg: cfg, pool: pool, fatalErr: make(chan error, 1)}
 }

+// FatalErr returns a channel that receives a non-nil error if the HTTP server
+// dies with an unrecoverable error. The service manager should drain this
+// channel and initiate a graceful shutdown (which flushes the OTEL pipeline).
+func (s *Service) FatalErr() <-chan error { return s.fatalErr }

 	go func() {
 		slog.Info("HTTP server listening", "address", addr)
 		if err := s.httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 			slog.Error("HTTP ListenAndServe error", "error", err)
-			os.Exit(1)
+			s.fatalErr <- err
 		}
 	}()

The main service manager then selects on FatalErr() and calls the shutdown sequence (including otelProvider.Shutdown(ctx)) before calling os.Exit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/svc/http/http.go` around lines 43 - 45, The current
handler in the goroutine that runs s.httpServer.ListenAndServe logs the error
with slog.Error then calls os.Exit(1) which prevents OTEL's batch logs from
flushing; instead remove the direct os.Exit call and send the error to a
coordinator channel (e.g., a FatalErr or serverError channel) so the main
goroutine can perform an orderly shutdown (invoke otel provider
Shutdown/ForceFlush and other cleanup) before exiting; modify the goroutine
around s.httpServer.ListenAndServe and replace os.Exit usage with sending the
error to that channel, and ensure the main service manager selects on that
channel and performs the provider shutdown and final os.Exit.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/workspace-engine/otel.go (1)

47-54: ⚡ Quick win

Resource construction is duplicated across all three init* functions.

Identical resource.New(ctx, resource.WithAttributes(semconv.ServiceNameKey.String(serviceName))) blocks appear in initTracer, initMetrics, and initLogger. A shared helper (or passing a pre-built *resource.Resource) would eliminate the duplication and guarantee all three signals share the same resource descriptor.

♻️ Proposed refactor
+func buildResource(ctx context.Context, serviceName string) (*resource.Resource, error) {
+	return resource.New(ctx,
+		resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)),
+	)
+}
+
 func initTracer() (func(), error) {
 	ctx := context.Background()
 	serviceName := config.Global.OTELServiceName
 	endpoint := stripScheme(config.Global.OTELExporterOTLPEndpoint)
-	res, err := resource.New(ctx,
-		resource.WithAttributes(
-			semconv.ServiceNameKey.String(serviceName),
-		),
-	)
+	res, err := buildResource(ctx, serviceName)
 	if err != nil {
 		return nil, fmt.Errorf("failed to create resource: %w", err)
 	}

Apply the same substitution in initMetrics and initLogger.
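
The shape of the refactor can be shown without the OTEL dependency; descriptor here is a stand-in for otel's *resource.Resource, and the init* functions for their namesakes in otel.go:

```go
package main

import "fmt"

// descriptor is a stand-in for otel's *resource.Resource.
type descriptor struct {
	serviceName string
}

// buildResource constructs the shared descriptor exactly once,
// mirroring the proposed buildResource helper.
func buildResource(serviceName string) (*descriptor, error) {
	return &descriptor{serviceName: serviceName}, nil
}

// Each signal consumes the same pre-built descriptor instead of
// constructing its own copy, so all three are guaranteed to agree.
func initTracer(res *descriptor) string  { return "tracer:" + res.serviceName }
func initMetrics(res *descriptor) string { return "metrics:" + res.serviceName }
func initLogger(res *descriptor) string  { return "logger:" + res.serviceName }

func main() {
	res, err := buildResource("workspace-engine")
	if err != nil {
		panic(err)
	}
	fmt.Println(initTracer(res), initMetrics(res), initLogger(res))
}
```

Passing a pre-built descriptor also means a future attribute (deployment environment, version) is added in one place.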

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/otel.go` around lines 47 - 54, The three functions
initTracer, initMetrics, and initLogger each call resource.New(...) with the
same resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)) block;
factor this out by creating a shared helper (e.g., buildResource(ctx,
serviceName) or newResource) that returns the *resource.Resource (or build it
once and pass it in) and update initTracer, initMetrics, and initLogger to use
that helper instead of duplicating resource.New, ensuring all signals share the
exact same resource descriptor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go`:
- Around line 204-207: The debug log is passing resolved.Site (a *string) so it
will print the pointer address; change the call to use the local string variable
(site) or dereference resolved.Site safely (e.g., use site which was set
earlier) in the slog.DebugContext invocation for "Datadog metric measurement" so
the actual site string is logged instead of the pointer; update the call that
currently references resolved.Site to reference site (or *resolved.Site after
nil-check) in the function containing slog.DebugContext.

---

Nitpick comments:
In `@apps/workspace-engine/otel.go`:
- Around line 47-54: The three functions initTracer, initMetrics, and initLogger
each call resource.New(...) with the same
resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)) block;
factor this out by creating a shared helper (e.g., buildResource(ctx,
serviceName) or newResource) that returns the *resource.Resource (or build it
once and pass it in) and update initTracer, initMetrics, and initLogger to use
that helper instead of duplicating resource.New, ensuring all signals share the
exact same resource descriptor.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bc8eb71b-8b44-4349-b394-c239904dbe74

📥 Commits

Reviewing files that changed from the base of the PR and between a3959e5 and 66e0886.

📒 Files selected for processing (14)
  • apps/workspace-engine/log_handler.go
  • apps/workspace-engine/otel.go
  • apps/workspace-engine/pkg/config/env.go
  • apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
  • apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
  • apps/workspace-engine/pkg/store/policies/get_for_release_targets.go
  • apps/workspace-engine/pkg/store/resources/get_resources.go
  • apps/workspace-engine/svc/claimcleanup/service.go
  • apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
  • apps/workspace-engine/svc/controllers/desiredrelease/controller.go
  • apps/workspace-engine/svc/controllers/forcedeploy/reconcile.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go
  • apps/workspace-engine/svc/controllers/policyeval/controller.go
  • apps/workspace-engine/svc/service.go
✅ Files skipped from review due to trivial changes (1)
  • apps/workspace-engine/svc/service.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • apps/workspace-engine/pkg/config/env.go
  • apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
  • apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
  • apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
  • apps/workspace-engine/log_handler.go
  • apps/workspace-engine/svc/claimcleanup/service.go

Comment on lines +204 to 207
slog.DebugContext(ctx, "Datadog metric measurement",
"site", resolved.Site,
"status", resp.StatusCode,
"duration", duration)
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

resolved.Site is *string — the debug log will emit the pointer address, not the site string

Line 160 was correctly fixed to use the local site string variable, but the debug log at Line 205 still passes resolved.Site directly, which logs the pointer address.

🐛 Proposed fix
-	slog.DebugContext(ctx, "Datadog metric measurement",
-		"site", resolved.Site,
-		"status", resp.StatusCode,
-		"duration", duration)
+	slog.DebugContext(ctx, "Datadog metric measurement",
+		"site", site,
+		"status", resp.StatusCode,
+		"duration", duration)
📝 Committable suggestion


Suggested change
slog.DebugContext(ctx, "Datadog metric measurement",
"site", resolved.Site,
"status", resp.StatusCode,
"duration", duration)
slog.DebugContext(ctx, "Datadog metric measurement",
"site", site,
"status", resp.StatusCode,
"duration", duration)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go`
around lines 204 - 207, The debug log is passing resolved.Site (a *string) so it
will print the pointer address; change the call to use the local string variable
(site) or dereference resolved.Site safely (e.g., use site which was set
earlier) in the slog.DebugContext invocation for "Datadog metric measurement" so
the actual site string is logged instead of the pointer; update the call that
currently references resolved.Site to reference site (or *resolved.Site after
nil-check) in the function containing slog.DebugContext.

