
chore(workspace-engine): Enable OTEL Logs#1113

Open
jsingleton-dev wants to merge 13 commits into main from jsingleton/otel-logs

Conversation


@jsingleton-dev jsingleton-dev commented May 7, 2026

This PR enables OTEL log export. We had to switch our logging lib from charmbracelet to slog, because slog lets you register log handlers, which allows us to plug in the OTEL handler from the SDK.

slog is mostly compatible with our existing logging setup, with the exception of log.Fatal. I had to make a compromise to keep the existing behavior; we should refactor that separately.

I used the context-aware log calls where possible so we get proper trace/span correlation with logs. Verified this is working locally. Examples from the local OTEL collector:

LogRecord #0
ObservedTimestamp: 2026-05-07 17:39:22.144053 +0000 UTC
Timestamp: 2026-05-07 17:39:22.144011 +0000 UTC
SeverityText: INFO
SeverityNumber: Info(9)
--
     -> resource_id: Str(f0f4dd8a-1cb6-42f7-ae2b-322b9a0f0d52)
Trace ID: deee90d53575a82b6c3f88047cf9e172
Span ID: 9f6b9fed35b2cc40
Flags: 1
LogRecord #1
ObservedTimestamp: 2026-05-07 17:39:22.144809 +0000 UTC
Timestamp: 2026-05-07 17:39:22.14478 +0000 UTC
SeverityText: INFO
SeverityNumber: Info(9)
--
     -> resource_id: Str(ad1c6f9c-6cf9-4888-9c89-9997e9083f2d)
Trace ID: 36dbda3898332c67d9161579d3168bbf
Span ID: 9c390034d7dcc0d4
Flags: 1

Summary by CodeRabbit

  • New Features

    • Structured logging integrated with OpenTelemetry for centralized log collection and a new logs pipeline.
    • Configurable log level via environment variable.
  • Improvements

    • Widespread switch to context-aware structured logging across services for clearer, consistent observability.
    • More reliable startup/shutdown log messages and improved error reporting.


coderabbitai Bot commented May 7, 2026

Review Change Stack

Warning

Rate limit exceeded

@jsingleton-dev has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 47 minutes and 9 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 458940ef-0aff-4aab-a436-7c723a56596a

📥 Commits

Reviewing files that changed from the base of the PR and between 66e0886 and 615a479.

📒 Files selected for processing (1)
  • apps/workspace-engine/svc/controllers/desiredrelease/controller.go
📝 Walkthrough

Walkthrough

This PR migrates workspace-engine from Charmbracelet logging to Go's structured log/slog, adds tee and level slog handlers with tests, configures OTLP log export and batching, adds a LogLevel config, updates numerous call-sites to context-aware slog, and adds a local OTEL collector logs pipeline.

Changes

Structured Logging Migration to slog

Layer / File(s) Summary
Dependencies & Configuration
apps/workspace-engine/go.mod, apps/workspace-engine/pkg/config/env.go
Added OpenTelemetry logging deps (otelslog, otel/sdk/log) and exposed Config.LogLevel (LOG_LEVEL, default "INFO").
Custom Log Handlers & Tests
apps/workspace-engine/log_handler.go, apps/workspace-engine/log_handler_test.go
Adds teeHandler (fan-out + error join) and levelHandler (min-level gating) with unit tests for forwarding, Enabled union, attr/group propagation, and aggregated errors.
Logger Initialization & OTEL Setup
apps/workspace-engine/otel.go, apps/workspace-engine/main.go
Adds parseLogLevel() and initLogger() to set up OTLP log exporter, batching sdklog.LoggerProvider, OTEL slog bridge, stderr handler, and a tee combining outputs; main initializes and defers shutdown.
Database, Convert & Store
apps/workspace-engine/pkg/db/client.go, apps/workspace-engine/pkg/db/convert.go, apps/workspace-engine/pkg/store/policies/get_for_release_targets.go, apps/workspace-engine/pkg/store/resources/get_resources.go
Replaced charmbracelet/log with context-aware slog.ErrorContext/slog.InfoContext for DB init failures, JSON unmarshal errors, policy ID parse failures, and resource optimization logs.
Reconcile Workers & Events
apps/workspace-engine/pkg/reconcile/worker.go, apps/workspace-engine/pkg/reconcile/events/*
Worker claim/item error logs and enqueue event operations updated to slog.*Context with structured fields (e.g., count).
Workspace Evaluation & Getter
apps/workspace-engine/pkg/workspace/relationships/eval/evaluate.go, apps/workspace-engine/pkg/workspace/releasemanager/policy/evaluator/deploymentdependency/getter.go
CEL compile/eval warnings use slog.WarnContext; getter DB/query errors use slog.ErrorContext.
Job Agents
apps/workspace-engine/pkg/jobagents/*
ArgoCD, Argo Workflows, Terraform Cloud logging changed to slog.WarnContext/slog.ErrorContext for retries, warnings, and errors.
Metrics Providers & Evaluator
apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/*
Datadog, Prometheus, and HTTP metric providers updated to slog.ErrorContext/slog.WarnContext/slog.DebugContext; metric evaluator and reconcile logging moved to context-aware slog calls.
Controllers (Init & Reconcile)
apps/workspace-engine/svc/controllers/*
Controller constructors standardized to log init failures with slog.Error then os.Exit(1); reconcile and getter logs migrated to slog.*Context.
HTTP & Utility Services
apps/workspace-engine/svc/http/*, apps/workspace-engine/svc/pprof/*, apps/workspace-engine/svc/claimcleanup/service.go, apps/workspace-engine/svc/service.go
HTTP server startup/error logging, middleware error logging, pprof lifecycle, claim cleanup, and Runner lifecycle use slog.*Context; fatal startup errors exit after logging.
OpenTelemetry Collector Configuration
otel-collector-local.yaml
Added service.pipelines.logs to route OTLP receiver → batch processor → logging exporter for local log visibility.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • jsbroks
  • adityachoudhari26

Poem

🐰 From charm to slog I hop with glee,

I stitch handlers—tee and level—free,
OTLP pipes hum a tidy song,
Context flows clean and strong,
Logs now find their home where they should be.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: enabling OpenTelemetry logs in the workspace-engine by switching from charmbracelet to slog logging.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jsingleton/otel-logs


Comment @coderabbitai help to get the list of available commands and usage tips.

@jsingleton-dev changed the title from chore(logs): Enable OTEL Logs to chore(workspace-engine): Enable OTEL Logs on May 7, 2026
Comment thread apps/workspace-engine/pkg/db/client.go Outdated
Comment on lines +25 to +26
slog.Error("Failed to parse database config", "error", err)
os.Exit(1)
Collaborator Author

Would prefer not to have this added everywhere, but this is what our previous log.Fatal was doing under the hood. slog has no Fatal option, so this keeps existing behavior. We should refactor to return errors up the stack instead, but that can be a separate change, as this is already a large diff.

Collaborator

@mleonidas mleonidas left a comment

I know this was just adding logging, but I added some comments on issues I noticed in existing code. If you want to implement the changes I proposed, feel free; if not, I'll create some issues. Otherwise LGTM.

if err != nil {
if !errors.Is(err, pgx.ErrNoRows) {
slog.Error(
slog.ErrorContext(ctx,
Collaborator

This isn't really an error, and the log message is actually kind of confusing. I need to look more at the context here, but the DB returning no rows is more informational, unless there were expected release targets that we somehow knew we should find here but didn't?

Collaborator Author

Yeah, agreed: no rows generally isn't an error. I'll take a closer look at the flow for this one.

return
}
log.Error("claim-cleanup: failed to release expired claims", "error", err)
slog.ErrorContext(ctx, "claim-cleanup: failed to release expired claims", "error", err)
Collaborator

I know you're just adding in logging here, but I'm noticing this ctx.Err() != nil check, which is kind of awkward and has a subtle timing issue: in the rare case that CleanupExpiredClaims returns a real error while the context also happens to be canceled, we'd swallow the legitimate failure. I'd probably refactor this to

func (s *Service) run(ctx context.Context) {
	defer close(s.done)
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.tick(ctx)
		}
	}
}

func (s *Service) tick(ctx context.Context) {
    released, err := s.queue.CleanupExpiredClaims(ctx)
    if err != nil {
        if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return
		}
		slog.ErrorContext(ctx, "claim-cleanup: failed to release expired claims", "error", err)
		return
	}
	if released > 0 {
		slog.InfoContext(ctx, "claim-cleanup: released expired claims", "count", released)
    }
}

I would also argue that we probably want a context timeout, which would give us clean ergonomics for distinguishing "this is shutting down" from "we failed and have an error".

if err != nil {
log.Fatal("Failed to parse database config:", err)
slog.ErrorContext(ctx, "Failed to parse database config", "error", err)
os.Exit(1)
Collaborator

This comment is more for me than for this review: this whole file should be refactored; there are a bunch of subtle bugs here.

  1. The singleton is just overhead without any benefit. sync.Once exists to coordinate concurrent first-time initialization; we call it 13+ times in main.go sequentially, so there's no concurrency to coordinate.
  2. Errors are swallowed by os.Exit inside the library.
  3. The first caller's context is silently captured, since it runs inside Once.Do.
  4. Probably most important: GetPool only returns a *pgxpool.Pool with no error, which is a lie, because it can error; instead it just kills the process, and the nil check at the GetDB call sites is basically a data race.

There are a bunch more problems, but I think we actually want to refactor this.

i would make this package look like

package db

import (
	"context"
	"fmt"
	"time"

	"github.com/exaring/otelpgx"
	"github.com/jackc/pgx/v5/pgxpool"
)

type Config struct {
	URL             string
	MaxPoolSize     int32
	ApplicationName string
}

// NewPool creates a configured pgx pool. The caller owns it and must Close it.
func NewPool(ctx context.Context, cfg Config) (*pgxpool.Pool, error) {
	pgxCfg, err := pgxpool.ParseConfig(cfg.URL)
	if err != nil {
		return nil, fmt.Errorf("parse database config: %w", err)
	}

	pgxCfg.MaxConns = cfg.MaxPoolSize
	pgxCfg.MinConns = 1
	pgxCfg.HealthCheckPeriod = 30 * time.Second
	pgxCfg.ConnConfig.Tracer = otelpgx.NewTracer()
	pgxCfg.ConnConfig.RuntimeParams["application_name"] = cfg.ApplicationName

	pool, err := pgxpool.NewWithConfig(ctx, pgxCfg)
	if err != nil {
		return nil, fmt.Errorf("create database pool: %w", err)
	}
	return pool, nil
}

and then in main, or tests

pool, err := db.NewPool(ctx, db.Config{
    URL:             config.Global.PostgresURL,
    MaxPoolSize:     int32(config.Global.PostgresMaxPoolSize),
    ApplicationName: config.Global.PostgresApplicationName,
})
if err != nil {
    slog.Error("failed to initialize database", "error", err)
    os.Exit(1)
}
defer pool.Close()

and pass the pool to all the services. Then we can remove all the nil checks and the log lines from the services, since the pool has to be initialized at that point.

Collaborator Author

Yeah, the errors swallowed by os.Exit are in definite need of a fix. I was planning to tackle those in a separate PR. Good callouts on the other ones too.

log.Fatal("Failed to get pgx pool")
panic("failed to get pgx pool")
slog.Error("Failed to get pgx pool")
os.Exit(1)
Collaborator

see comment about db client refactor

log.Fatal("Failed to get pgx pool")
panic("failed to get pgx pool")
slog.Error("Failed to get pgx pool")
os.Exit(1)
Collaborator

see comment about db client refactor

if pgxPool == nil {
log.Fatal("Failed to get pgx pool")
panic("failed to get pgx pool")
slog.Error("Failed to get pgx pool")
Collaborator

see comment about db client refactor

if pgxPool == nil {
log.Fatal("Failed to get pgx pool")
panic("failed to get pgx pool")
slog.Error("Failed to get pgx pool")
Collaborator

see comment about db client refactor

@jsingleton-dev force-pushed the jsingleton/otel-logs branch from 452c046 to a3959e5 on May 7, 2026 19:06
@jsingleton-dev marked this pull request as ready for review on May 7, 2026 19:07
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 10

♻️ Duplicate comments (1)
apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go (1)

65-66: ⚠️ Potential issue | 🟠 Major

Same OTEL flush-before-exit risk as in other controllers

The slog.Error + os.Exit(1) pattern here is subject to the same batched-exporter flush issue noted in policyeval/controller.go. No additional context is available in this New function, so slog.Error (no context) is the right variant, but the flush concern remains.

Also applies to: 97-99

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go`
around lines 65 - 66, The code currently calls slog.Error followed immediately
by os.Exit(1) (seen in New and again around the block at 97-99), which can cause
OTEL batched exporters to drop spans; before exiting, call the OpenTelemetry
shutdown routine (e.g., otel.Shutdown(ctx) or your tracer provider's
Shutdown(ctx) with a short timeout) and wait for it to complete, then log the
error and call os.Exit(1); update the error paths in the New function and the
later exit block to perform the OTEL shutdown with a context deadline before
exiting.
🧹 Nitpick comments (1)
apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go (1)

381-382: ⚡ Quick win

slog.Warn calls in helper functions lack context — inconsistent with PR's trace-correlation goal

buildResultData, extractVectorResults, and extractMatrixResults use slog.Warn without a context, so these log records won't carry trace/span IDs. ctx is available in Measure and could be threaded through to these helpers.

✏️ Suggested refactor

Pass ctx through the call chain:

-func buildResultData(
-	statusCode int,
-	respBody []byte,
-	duration time.Duration,
-) (map[string]any, error) {
+func buildResultData(
+	ctx context.Context,
+	statusCode int,
+	respBody []byte,
+	duration time.Duration,
+) (map[string]any, error) {
-	value, results, err := extractResults(promResp)
+	value, results, err := extractResults(ctx, promResp)
	if err != nil {
-		slog.Warn("Could not extract Prometheus result values", "error", err)
+		slog.WarnContext(ctx, "Could not extract Prometheus result values", "error", err)
	}

Apply the same propagation through extractResults → extractVectorResults / extractMatrixResults.

Also applies to: 419-420, 454-455

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go`
around lines 381 - 382, The helper functions buildResultData,
extractVectorResults, extractMatrixResults (and the intermediate extractResults)
must accept a context.Context parameter and be updated where called (Measure →
extractResults → extractVectorResults/extractMatrixResults) so logs carry
trace/span IDs; replace plain slog.Warn calls inside those helpers with a
context-aware logger retrieved from ctx (e.g., logger := slog.FromContext(ctx)
or equivalent) and call logger.Warn(...) including the existing error details.
Ensure all function signatures and call sites are updated consistently to pass
ctx through the chain.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/workspace-engine/log_handler.go`:
- Around line 64-66: The Enabled method in levelHandler currently only checks
the local level (lvl >= l.level) and ignores the wrapped handler's Enabled
decision; update levelHandler.Enabled(ctx context.Context, lvl slog.Level) to
also call and respect l.handler.Enabled(ctx, lvl) (i.e., return true only if
both the local level allows the record and the wrapped handler.Enabled returns
true), ensuring you pass the incoming ctx through to l.handler.Enabled so that
context-dependent filtering in the wrapped handler is honored.

In `@apps/workspace-engine/pkg/config/env.go`:
- Line 34: The env var name OTEL_LOG_LEVEL used by the OTELLogLevel field
repurposes an OpenTelemetry-reserved variable and could conflict with SDK
behavior; rename the struct field tag and env var to a non-conflicting name
(e.g., ExporterLogLevel or SlogToOTELLevel) by updating the OTELLogLevel field's
envconfig tag from OTEL_LOG_LEVEL to the new variable (and rename the field if
you prefer for clarity), then update any code that reads/writes OTELLogLevel
(constructors, config loaders, documentation, and tests) to use the new symbol
and env var so the application log-forwarding threshold no longer collides with
the OpenTelemetry SDK setting.

In `@apps/workspace-engine/pkg/db/convert.go`:
- Line 477: The unmarshal error log call using slog.Error("failed to unmarshal
release variables", "error", err) should include the release identifier to aid
debugging; update that logging invocation (the slog.Error call in convert.go
where release variables are unmarshaled) to add the row.ID (e.g., "release_id",
row.ID) or another unique release identifier used in the surrounding code so the
error message can be correlated to the specific release/row.

In `@apps/workspace-engine/pkg/store/policies/get_for_release_targets.go`:
- Around line 130-140: The slog.InfoContext call in get_for_release_targets.go
currently logs "policy_ids" with len(policyIDs) (an int) which is misleading;
change the logged attributes so the count and the actual IDs are correctly named
and represented — e.g., log "policy_count" with len(policyIDs) and separately
log "policy_ids" with the policyIDs slice (or a stringified list of IDs) in the
slog.InfoContext invocation used around setting policies for release targets,
ensuring the attribute keys match the value types.

In
`@apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go`:
- Around line 128-129: The slog.Error followed immediately by os.Exit(1) can
cause OTEL BatchProcessor logs to be dropped; modify the startup error path in
the controller (where slog.Error is called before os.Exit) to invoke the OTEL
logs provider ForceFlush (or equivalent synchronous flush API) and wait for it
to complete or time out before calling os.Exit(1), handling any ForceFlush
error; apply the same change to the equivalent error path in the deployment plan
result controller (the other instance where slog.Error is followed by os.Exit).

In `@apps/workspace-engine/svc/controllers/desiredrelease/controller.go`:
- Around line 139-141: Replace the plain slog.Error call with slog.ErrorContext
using the existing ctx variable so logs are correlated to the request context:
locate the error path in the desired release reconcile worker creation block in
controller.go (the slog.Error("Failed to create desired release reconcile
worker", "error", err) call) and change it to call slog.ErrorContext(ctx, ...)
passing the same message and error key/value; no other behavior changes
required.

In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go`:
- Around line 160-161: The code logs the pointer address because resolved.Site
is a *string; update the slog call in the Datadog metric request block (the
slog.ErrorContext invocation) to pass the already dereferenced local variable
site (or explicitly *resolved.Site after nil check) instead of resolved.Site so
the actual site string is logged; ensure you handle the nil case if you choose
to dereference (use the existing site string variable if present).

In `@apps/workspace-engine/svc/controllers/policyeval/controller.go`:
- Around line 125-127: Replace the plain slog.Error call with slog.ErrorContext
using the existing ctx variable: locate the error handling in the policy eval
reconcile worker creation (the block that currently calls slog.Error("Failed to
create policy eval reconcile worker", "error", err) and os.Exit(1)) and change
the log invocation to slog.ErrorContext(ctx, ...) so the log includes the
context for trace/span correlation while keeping the same message and error key;
leave the os.Exit(1) behavior unchanged.

In `@apps/workspace-engine/svc/http/http.go`:
- Around line 43-45: The current handler in the goroutine that runs
s.httpServer.ListenAndServe logs the error with slog.Error then calls os.Exit(1)
which prevents OTEL's batch logs from flushing; instead remove the direct
os.Exit call and send the error to a coordinator channel (e.g., a FatalErr or
serverError channel) so the main goroutine can perform an orderly shutdown
(invoke otel provider Shutdown/ForceFlush and other cleanup) before exiting;
modify the goroutine around s.httpServer.ListenAndServe and replace os.Exit
usage with sending the error to that channel, and ensure the main service
manager selects on that channel and performs the provider shutdown and final
os.Exit.

---

Duplicate comments:
In `@apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go`:
- Around line 65-66: The code currently calls slog.Error followed immediately by
os.Exit(1) (seen in New and again around the block at 97-99), which can cause
OTEL batched exporters to drop spans; before exiting, call the OpenTelemetry
shutdown routine (e.g., otel.Shutdown(ctx) or your tracer provider's
Shutdown(ctx) with a short timeout) and wait for it to complete, then log the
error and call os.Exit(1); update the error paths in the New function and the
later exit block to perform the OTEL shutdown with a context deadline before
exiting.

---

Nitpick comments:
In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go`:
- Around line 381-382: The helper functions buildResultData,
extractVectorResults, extractMatrixResults (and the intermediate extractResults)
must accept a context.Context parameter and be updated where called (Measure →
extractResults → extractVectorResults/extractMatrixResults) so logs carry
trace/span IDs; replace plain slog.Warn calls inside those helpers with a
context-aware logger retrieved from ctx (e.g., logger := slog.FromContext(ctx)
or equivalent) and call logger.Warn(...) including the existing error details.
Ensure all function signatures and call sites are updated consistently to pass
ctx through the chain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d412833b-2030-4c55-903a-9034803be1af

📥 Commits

Reviewing files that changed from the base of the PR and between 73e359a and a3959e5.

⛔ Files ignored due to path filters (1)
  • apps/workspace-engine/go.sum is excluded by !**/*.sum
📒 Files selected for processing (47)
  • apps/workspace-engine/go.mod
  • apps/workspace-engine/log_handler.go
  • apps/workspace-engine/log_handler_test.go
  • apps/workspace-engine/main.go
  • apps/workspace-engine/otel.go
  • apps/workspace-engine/pkg/config/env.go
  • apps/workspace-engine/pkg/db/client.go
  • apps/workspace-engine/pkg/db/convert.go
  • apps/workspace-engine/pkg/jobagents/argo/application_upserter.go
  • apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
  • apps/workspace-engine/pkg/jobagents/argoworkflows/workflow_submitter.go
  • apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
  • apps/workspace-engine/pkg/reconcile/events/deploymentselector.go
  • apps/workspace-engine/pkg/reconcile/events/desiredrelease.go
  • apps/workspace-engine/pkg/reconcile/events/environmentselector.go
  • apps/workspace-engine/pkg/reconcile/events/jobeligibility.go
  • apps/workspace-engine/pkg/reconcile/events/policyeval.go
  • apps/workspace-engine/pkg/reconcile/worker.go
  • apps/workspace-engine/pkg/store/policies/get_for_release_targets.go
  • apps/workspace-engine/pkg/store/resources/get_resources.go
  • apps/workspace-engine/pkg/workspace/relationships/eval/evaluate.go
  • apps/workspace-engine/pkg/workspace/releasemanager/policy/evaluator/deploymentdependency/getter.go
  • apps/workspace-engine/svc/claimcleanup/service.go
  • apps/workspace-engine/svc/controllers/deploymentplan/controller.go
  • apps/workspace-engine/svc/controllers/deploymentplanresult/controller.go
  • apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
  • apps/workspace-engine/svc/controllers/desiredrelease/controller.go
  • apps/workspace-engine/svc/controllers/desiredrelease/reconcile.go
  • apps/workspace-engine/svc/controllers/environmentresourceselectoreval/controller.go
  • apps/workspace-engine/svc/controllers/forcedeploy/controller.go
  • apps/workspace-engine/svc/controllers/forcedeploy/reconcile.go
  • apps/workspace-engine/svc/controllers/jobdispatch/controller.go
  • apps/workspace-engine/svc/controllers/jobeligibility/controller.go
  • apps/workspace-engine/svc/controllers/jobeligibility/getters_postgres.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/metric.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/http/http.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/reconcile.go
  • apps/workspace-engine/svc/controllers/policyeval/controller.go
  • apps/workspace-engine/svc/controllers/relationshipeval/controller.go
  • apps/workspace-engine/svc/http/http.go
  • apps/workspace-engine/svc/http/server/server.go
  • apps/workspace-engine/svc/pprof/pprof.go
  • apps/workspace-engine/svc/service.go
  • otel-collector-local.yaml

Comment thread apps/workspace-engine/log_handler.go Outdated
Comment thread apps/workspace-engine/pkg/config/env.go Outdated
var raw map[string]json.RawMessage
if err := json.Unmarshal(row.Variables, &raw); err != nil {
log.Error("failed to unmarshal release variables", "error", err)
slog.Error("failed to unmarshal release variables", "error", err)
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add release ID to the unmarshal error log for debuggability.

Line 482 correctly includes "key", k, but line 477 logs the outer unmarshal failure with no identifier. Without row.ID (or at minimum the row's resource/deployment/environment IDs), correlating this log entry to a specific release in production is difficult.

🐛 Proposed fix
-			slog.Error("failed to unmarshal release variables", "error", err)
+			slog.Error("failed to unmarshal release variables", "releaseID", row.ID, "error", err)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
slog.Error("failed to unmarshal release variables", "error", err)
slog.Error("failed to unmarshal release variables", "releaseID", row.ID, "error", err)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/pkg/db/convert.go` at line 477, The unmarshal error log
call using slog.Error("failed to unmarshal release variables", "error", err)
should include the release identifier to aid debugging; update that logging
invocation (the slog.Error call in convert.go where release variables are
unmarshaled) to add the row.ID (e.g., "release_id", row.ID) or another unique
release identifier used in the surrounding code so the error message can be
correlated to the specific release/row.
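
The requested change is small; the following is a stdlib-only sketch of the pattern, where releaseRow and its fields are stand-ins for the actual convert.go types:

```go
package main

import (
	"encoding/json"
	"log/slog"
	"os"
)

// releaseRow is a stand-in for the row type in convert.go; only the
// fields needed for this sketch are included.
type releaseRow struct {
	ID        string
	Variables []byte
}

// unmarshalVariables decodes the row's variables, logging failures with
// the release identifier attached so the error can be correlated to a
// specific row in production.
func unmarshalVariables(row releaseRow) (map[string]json.RawMessage, bool) {
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(row.Variables, &raw); err != nil {
		slog.Error("failed to unmarshal release variables",
			"release_id", row.ID,
			"error", err)
		return nil, false
	}
	return raw, true
}

func main() {
	slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, nil)))
	_, ok := unmarshalVariables(releaseRow{ID: "rel-123", Variables: []byte("not json")})
	_ = ok
}
```

With the identifier attached, a single log line is enough to locate the offending row.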

Comment thread apps/workspace-engine/pkg/store/policies/get_for_release_targets.go
Comment thread apps/workspace-engine/svc/controllers/desiredrelease/controller.go Outdated
Comment on lines +92 to +93
+	slog.Error("Failed to get pgx pool")
+	os.Exit(1)
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

OTEL log records may be dropped on os.Exit(1) before the batch exporter flushes

The OTEL log exporter configured in this PR uses a batch processor. Calling os.Exit(1) immediately after slog.Error bypasses all deferred cleanup and Shutdown() calls on the SDK, so the log record is almost certainly never exported to the collector. This is true for every slog.Error + os.Exit(1) pair across the changed controllers.

Consider flushing/shutting down the OTEL provider before exiting, or at minimum accepting that these startup-failure records won't reach the collector (and ensuring the stderr output is enough for operators).

Also applies to: 125-127

Comment thread apps/workspace-engine/svc/controllers/policyeval/controller.go Outdated
Comment on lines 43 to +45
 if err := s.httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
-	log.Fatal("HTTP ListenAndServe error", "error", err)
+	slog.Error("HTTP ListenAndServe error", "error", err)
+	os.Exit(1)
Contributor


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

os.Exit(1) skips OTEL log flush — the fatal error log is likely never exported

slog.Error (line 44) enqueues the record in the BatchLogRecordProcessor, but os.Exit(1) immediately terminates the process, bypassing all deferred calls and the OTEL logger provider's Shutdown/ForceFlush. The error log that matters most (a fatal server failure) is precisely the one that will typically not reach the collector.

The previous charmbracelet log.Fatal wrote synchronously to stdout/stderr before exiting, so it was always visible. The OTEL async batch pipeline loses that guarantee.

The clean fix is to signal the main goroutine so it can drive an orderly shutdown (which includes flushing the OTEL providers) before exiting:

💡 Sketch of a channel-based approach
 type Service struct {
 	cfg        config.Config
 	pool       *pgxpool.Pool
 	httpServer *http.Server
+	fatalErr   chan error
 }

 func New(cfg config.Config, pool *pgxpool.Pool) *Service {
-	return &Service{cfg: cfg, pool: pool}
+	return &Service{cfg: cfg, pool: pool, fatalErr: make(chan error, 1)}
 }

+// FatalErr returns a channel that receives a non-nil error if the HTTP server
+// dies with an unrecoverable error. The service manager should drain this
+// channel and initiate a graceful shutdown (which flushes the OTEL pipeline).
+func (s *Service) FatalErr() <-chan error { return s.fatalErr }

 	go func() {
 		slog.Info("HTTP server listening", "address", addr)
 		if err := s.httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 			slog.Error("HTTP ListenAndServe error", "error", err)
-			os.Exit(1)
+			s.fatalErr <- err
 		}
 	}()

The main service manager then selects on FatalErr() and calls the shutdown sequence (including otelProvider.Shutdown(ctx)) before calling os.Exit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/svc/http/http.go` around lines 43 - 45, The current
handler in the goroutine that runs s.httpServer.ListenAndServe logs the error
with slog.Error then calls os.Exit(1) which prevents OTEL's batch logs from
flushing; instead remove the direct os.Exit call and send the error to a
coordinator channel (e.g., a FatalErr or serverError channel) so the main
goroutine can perform an orderly shutdown (invoke otel provider
Shutdown/ForceFlush and other cleanup) before exiting; modify the goroutine
around s.httpServer.ListenAndServe and replace os.Exit usage with sending the
error to that channel, and ensure the main service manager selects on that
channel and performs the provider shutdown and final os.Exit.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/workspace-engine/otel.go (1)

47-54: ⚡ Quick win

Resource construction is duplicated across all three init* functions.

Identical resource.New(ctx, resource.WithAttributes(semconv.ServiceNameKey.String(serviceName))) blocks appear in initTracer, initMetrics, and initLogger. A shared helper (or passing a pre-built *resource.Resource) would eliminate the duplication and guarantee all three signals share the same resource descriptor.

♻️ Proposed refactor
+func buildResource(ctx context.Context, serviceName string) (*resource.Resource, error) {
+	return resource.New(ctx,
+		resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)),
+	)
+}
+
 func initTracer() (func(), error) {
 	ctx := context.Background()
 	serviceName := config.Global.OTELServiceName
 	endpoint := stripScheme(config.Global.OTELExporterOTLPEndpoint)
-	res, err := resource.New(ctx,
-		resource.WithAttributes(
-			semconv.ServiceNameKey.String(serviceName),
-		),
-	)
+	res, err := buildResource(ctx, serviceName)
 	if err != nil {
 		return nil, fmt.Errorf("failed to create resource: %w", err)
 	}

Apply the same substitution in initMetrics and initLogger.
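
The shape of the refactor can be shown without the OTEL dependency; descriptor here is a stand-in for otel's *resource.Resource, and the init* functions for their namesakes in otel.go:

```go
package main

import "fmt"

// descriptor is a stand-in for otel's *resource.Resource.
type descriptor struct {
	serviceName string
}

// buildResource constructs the shared descriptor exactly once,
// mirroring the proposed buildResource helper.
func buildResource(serviceName string) (*descriptor, error) {
	return &descriptor{serviceName: serviceName}, nil
}

// Each signal consumes the same pre-built descriptor instead of
// constructing its own copy, so all three are guaranteed to agree.
func initTracer(res *descriptor) string  { return "tracer:" + res.serviceName }
func initMetrics(res *descriptor) string { return "metrics:" + res.serviceName }
func initLogger(res *descriptor) string  { return "logger:" + res.serviceName }

func main() {
	res, err := buildResource("workspace-engine")
	if err != nil {
		panic(err)
	}
	fmt.Println(initTracer(res), initMetrics(res), initLogger(res))
}
```

Passing a pre-built descriptor also means a future attribute (deployment environment, version) is added in one place.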

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/workspace-engine/otel.go` around lines 47 - 54, The three functions
initTracer, initMetrics, and initLogger each call resource.New(...) with the
same resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)) block;
factor this out by creating a shared helper (e.g., buildResource(ctx,
serviceName) or newResource) that returns the *resource.Resource (or build it
once and pass it in) and update initTracer, initMetrics, and initLogger to use
that helper instead of duplicating resource.New, ensuring all signals share the
exact same resource descriptor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go`:
- Around line 204-207: The debug log is passing resolved.Site (a *string) so it
will print the pointer address; change the call to use the local string variable
(site) or dereference resolved.Site safely (e.g., use site which was set
earlier) in the slog.DebugContext invocation for "Datadog metric measurement" so
the actual site string is logged instead of the pointer; update the call that
currently references resolved.Site to reference site (or *resolved.Site after
nil-check) in the function containing slog.DebugContext.

---

Nitpick comments:
In `@apps/workspace-engine/otel.go`:
- Around line 47-54: The three functions initTracer, initMetrics, and initLogger
each call resource.New(...) with the same
resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)) block;
factor this out by creating a shared helper (e.g., buildResource(ctx,
serviceName) or newResource) that returns the *resource.Resource (or build it
once and pass it in) and update initTracer, initMetrics, and initLogger to use
that helper instead of duplicating resource.New, ensuring all signals share the
exact same resource descriptor.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bc8eb71b-8b44-4349-b394-c239904dbe74

📥 Commits

Reviewing files that changed from the base of the PR and between a3959e5 and 66e0886.

📒 Files selected for processing (14)
  • apps/workspace-engine/log_handler.go
  • apps/workspace-engine/otel.go
  • apps/workspace-engine/pkg/config/env.go
  • apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
  • apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
  • apps/workspace-engine/pkg/store/policies/get_for_release_targets.go
  • apps/workspace-engine/pkg/store/resources/get_resources.go
  • apps/workspace-engine/svc/claimcleanup/service.go
  • apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
  • apps/workspace-engine/svc/controllers/desiredrelease/controller.go
  • apps/workspace-engine/svc/controllers/forcedeploy/reconcile.go
  • apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go
  • apps/workspace-engine/svc/controllers/policyeval/controller.go
  • apps/workspace-engine/svc/service.go
✅ Files skipped from review due to trivial changes (1)
  • apps/workspace-engine/svc/service.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • apps/workspace-engine/pkg/config/env.go
  • apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
  • apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
  • apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
  • apps/workspace-engine/log_handler.go
  • apps/workspace-engine/svc/claimcleanup/service.go

Comment on lines +204 to 207
slog.DebugContext(ctx, "Datadog metric measurement",
"site", resolved.Site,
"status", resp.StatusCode,
"duration", duration)
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

resolved.Site is *string — the debug log will emit the pointer address, not the site string

Line 160 was correctly fixed to use the local site string variable, but the debug log at Line 205 still passes resolved.Site directly, which logs the pointer address.

🐛 Proposed fix
-	slog.DebugContext(ctx, "Datadog metric measurement",
-		"site", resolved.Site,
-		"status", resp.StatusCode,
-		"duration", duration)
+	slog.DebugContext(ctx, "Datadog metric measurement",
+		"site", site,
+		"status", resp.StatusCode,
+		"duration", duration)
📝 Committable suggestion


Suggested change
slog.DebugContext(ctx, "Datadog metric measurement",
"site", resolved.Site,
"status", resp.StatusCode,
"duration", duration)
slog.DebugContext(ctx, "Datadog metric measurement",
"site", site,
"status", resp.StatusCode,
"duration", duration)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go`
around lines 204 - 207, The debug log is passing resolved.Site (a *string) so it
will print the pointer address; change the call to use the local string variable
(site) or dereference resolved.Site safely (e.g., use site which was set
earlier) in the slog.DebugContext invocation for "Datadog metric measurement" so
the actual site string is logged instead of the pointer; update the call that
currently references resolved.Site to reference site (or *resolved.Site after
nil-check) in the function containing slog.DebugContext.

