chore(workspace-engine): Enable OTEL Logs #1113
Conversation
📝 Walkthrough
This PR migrates workspace-engine from Charmbracelet logging to Go's structured logging package, slog.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
```diff
+slog.Error("Failed to parse database config", "error", err)
+os.Exit(1)
```
Would prefer not to have this added everywhere, but this is what our previous log.Fatal was doing under the hood. slog has no Fatal option, so this keeps existing behavior. We should refactor to return errors up the stack instead, but that can be a separate change since this is already a large diff.
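The compromise described above can be captured in one place instead of being repeated at every call site. The sketch below is illustrative, not code from this PR; the `fatal` helper name and the `exit` indirection are hypothetical:

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
)

var exit = os.Exit // indirection only so the demo below can observe the exit code

// fatal is a hypothetical shim for the removed log.Fatal: slog deliberately
// has no Fatal level, so the old behavior becomes Error followed by exit(1).
func fatal(msg string, args ...any) {
	slog.Error(msg, args...)
	exit(1)
}

func main() {
	var code int
	exit = func(c int) { code = c } // don't actually kill the demo process
	fatal("Failed to parse database config", "error", "invalid DSN")
	fmt.Println(code) // 1
}
```

Centralizing the pattern would also give a single spot to add telemetry flushing later.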
mleonidas left a comment
I know this was just adding logging, but I added some comments on issues I noticed in existing code. If you want to implement the changes I proposed, feel free; if not, I'll create some issues. Otherwise LGTM.
```diff
 if err != nil {
 	if !errors.Is(err, pgx.ErrNoRows) {
-		slog.Error(
+		slog.ErrorContext(ctx,
```
this isn't really an error, and the log message is actually kind of confusing. I need to look more at the context of this, but the db returning no rows is more informational, unless there were expected release targets that we somehow knew we should find here but didn't?
Yeah, agree, no rows generally isn't an error. I'll take a closer look at the flow for this one.
```diff
 	return
 }
-log.Error("claim-cleanup: failed to release expired claims", "error", err)
+slog.ErrorContext(ctx, "claim-cleanup: failed to release expired claims", "error", err)
```
I know you're just adding in logging here, but I'm noticing this ctx.Err() != nil check, which is kind of awkward and has a subtle timing issue: in the rare case that CleanupExpiredClaims returns a real error but the context is cancelled before the check runs, we'd swallow the legitimate failure. I'd probably refactor this to:
```go
func (s *Service) run(ctx context.Context) {
	defer close(s.done)
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.tick(ctx)
		}
	}
}

func (s *Service) tick(ctx context.Context) {
	released, err := s.queue.CleanupExpiredClaims(ctx)
	if err != nil {
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return
		}
		slog.ErrorContext(ctx, "claim-cleanup: failed to release expired claims", "error", err)
		return
	}
	if released > 0 {
		slog.InfoContext(ctx, "claim-cleanup: released expired claims", "count", released)
	}
}
```

I would also make the argument that we probably want a context timeout, which would really give us the ergonomics around "this is shutting down" vs "we failed and have an error".
```diff
 if err != nil {
-	log.Fatal("Failed to parse database config:", err)
+	slog.ErrorContext(ctx, "Failed to parse database config", "error", err)
+	os.Exit(1)
```
this comment is more for me than for this review: this whole file should be refactored, there are a bunch of subtle bugs with it
- The singleton is just overhead without any benefit. sync.Once exists to coordinate concurrent first-time initialization; we call it 13+ times in main.go sequentially, so there's no concurrency to coordinate.
- Errors are swallowed by os.Exit inside the library.
- The first caller's context is silently captured, since it runs inside Once.Do.
- And probably the most important: GetPool returns a *pgxpool.Pool with no error, which is a lie, since it can error; instead it just kills the process. The GetDB call's nil check is basically a data race.

There are a bunch more problems, but I think we actually want to refactor this.
I would make this package look like:

```go
package db

import (
	"context"
	"fmt"
	"time"

	"github.com/exaring/otelpgx"
	"github.com/jackc/pgx/v5/pgxpool"
)

type Config struct {
	URL             string
	MaxPoolSize     int32
	ApplicationName string
}

// NewPool creates a configured pgx pool. The caller owns it and must Close it.
func NewPool(ctx context.Context, cfg Config) (*pgxpool.Pool, error) {
	pgxCfg, err := pgxpool.ParseConfig(cfg.URL)
	if err != nil {
		return nil, fmt.Errorf("parse database config: %w", err)
	}
	pgxCfg.MaxConns = cfg.MaxPoolSize
	pgxCfg.MinConns = 1
	pgxCfg.HealthCheckPeriod = 30 * time.Second
	pgxCfg.ConnConfig.Tracer = otelpgx.NewTracer()
	pgxCfg.ConnConfig.RuntimeParams["application_name"] = cfg.ApplicationName

	pool, err := pgxpool.NewWithConfig(ctx, pgxCfg)
	if err != nil {
		return nil, fmt.Errorf("create database pool: %w", err)
	}
	return pool, nil
}
```

and then in main, or tests:
```go
pool, err := db.NewPool(ctx, db.Config{
	URL:             config.Global.PostgresURL,
	MaxPoolSize:     int32(config.Global.PostgresMaxPoolSize),
	ApplicationName: config.Global.PostgresApplicationName,
})
if err != nil {
	slog.Error("failed to initialize database", "error", err)
	os.Exit(1)
}
defer pool.Close()
```

and pass the pool to all the services; then we can remove all the nil checks and the log lines from the services, since the pool has to be initialized at that point.
Yeah, the errors swallowed by os.Exit are in definite need of a fix. Was planning to tackle those in a separate PR. Good callouts on the other ones too.
```diff
-log.Fatal("Failed to get pgx pool")
-panic("failed to get pgx pool")
+slog.Error("Failed to get pgx pool")
+os.Exit(1)
```
see comment about db client refactor
```diff
-log.Fatal("Failed to get pgx pool")
-panic("failed to get pgx pool")
+slog.Error("Failed to get pgx pool")
+os.Exit(1)
```
see comment about db client refactor
```diff
 if pgxPool == nil {
-	log.Fatal("Failed to get pgx pool")
-	panic("failed to get pgx pool")
+	slog.Error("Failed to get pgx pool")
```
see comment about db client refactor
```diff
 if pgxPool == nil {
-	log.Fatal("Failed to get pgx pool")
-	panic("failed to get pgx pool")
+	slog.Error("Failed to get pgx pool")
```
see comment about db client refactor
Force-pushed from 452c046 to a3959e5 (Compare)
Actionable comments posted: 10
♻️ Duplicate comments (1)
apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go (1)
65-66: ⚠️ Potential issue | 🟠 Major: Same OTEL flush-before-exit risk as in other controllers
The slog.Error + os.Exit(1) pattern here is subject to the same batched-exporter flush issue noted in policyeval/controller.go. No additional context is available in this New function, so slog.Error (no context) is the right variant, but the flush concern remains.
Also applies to: 97-99
🧹 Nitpick comments (1)
apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go (1)
381-382: ⚡ Quick win: slog.Warn calls in helper functions lack context, inconsistent with the PR's trace-correlation goal
buildResultData, extractVectorResults, and extractMatrixResults use slog.Warn without a context, so these log records won't carry trace/span IDs. ctx is available in Measure and could be threaded through to these helpers.
✏️ Suggested refactor
Pass ctx through the call chain:

```diff
-func buildResultData(
-	statusCode int,
-	respBody []byte,
-	duration time.Duration,
-) (map[string]any, error) {
+func buildResultData(
+	ctx context.Context,
+	statusCode int,
+	respBody []byte,
+	duration time.Duration,
+) (map[string]any, error) {
```

```diff
-	value, results, err := extractResults(promResp)
+	value, results, err := extractResults(ctx, promResp)
 	if err != nil {
-		slog.Warn("Could not extract Prometheus result values", "error", err)
+		slog.WarnContext(ctx, "Could not extract Prometheus result values", "error", err)
 	}
```

Apply the same propagation through extractResults → extractVectorResults / extractMatrixResults.
Also applies to: 419-420, 454-455
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/workspace-engine/log_handler.go`:
- Around line 64-66: The Enabled method in levelHandler currently only checks
the local level (lvl >= l.level) and ignores the wrapped handler's Enabled
decision; update levelHandler.Enabled(ctx context.Context, lvl slog.Level) to
also call and respect l.handler.Enabled(ctx, lvl) (i.e., return true only if
both the local level allows the record and the wrapped handler.Enabled returns
true), ensuring you pass the incoming ctx through to l.handler.Enabled so that
context-dependent filtering in the wrapped handler is honored.
In `@apps/workspace-engine/pkg/config/env.go`:
- Line 34: The env var name OTEL_LOG_LEVEL used by the OTELLogLevel field
repurposes an OpenTelemetry-reserved variable and could conflict with SDK
behavior; rename the struct field tag and env var to a non-conflicting name
(e.g., ExporterLogLevel or SlogToOTELLevel) by updating the OTELLogLevel field's
envconfig tag from OTEL_LOG_LEVEL to the new variable (and rename the field if
you prefer for clarity), then update any code that reads/writes OTELLogLevel
(constructors, config loaders, documentation, and tests) to use the new symbol
and env var so the application log-forwarding threshold no longer collides with
the OpenTelemetry SDK setting.
In `@apps/workspace-engine/pkg/db/convert.go`:
- Line 477: The unmarshal error log call using slog.Error("failed to unmarshal
release variables", "error", err) should include the release identifier to aid
debugging; update that logging invocation (the slog.Error call in convert.go
where release variables are unmarshaled) to add the row.ID (e.g., "release_id",
row.ID) or another unique release identifier used in the surrounding code so the
error message can be correlated to the specific release/row.
In `@apps/workspace-engine/pkg/store/policies/get_for_release_targets.go`:
- Around line 130-140: The slog.InfoContext call in get_for_release_targets.go
currently logs "policy_ids" with len(policyIDs) (an int) which is misleading;
change the logged attributes so the count and the actual IDs are correctly named
and represented — e.g., log "policy_count" with len(policyIDs) and separately
log "policy_ids" with the policyIDs slice (or a stringified list of IDs) in the
slog.InfoContext invocation used around setting policies for release targets,
ensuring the attribute keys match the value types.
In
`@apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go`:
- Around line 128-129: The slog.Error followed immediately by os.Exit(1) can
cause OTEL BatchProcessor logs to be dropped; modify the startup error path in
the controller (where slog.Error is called before os.Exit) to invoke the OTEL
logs provider ForceFlush (or equivalent synchronous flush API) and wait for it
to complete or time out before calling os.Exit(1), handling any ForceFlush
error; apply the same change to the equivalent error path in the deployment plan
result controller (the other instance where slog.Error is followed by os.Exit).
In `@apps/workspace-engine/svc/controllers/desiredrelease/controller.go`:
- Around line 139-141: Replace the plain slog.Error call with slog.ErrorContext
using the existing ctx variable so logs are correlated to the request context:
locate the error path in the desired release reconcile worker creation block in
controller.go (the slog.Error("Failed to create desired release reconcile
worker", "error", err) call) and change it to call slog.ErrorContext(ctx, ...)
passing the same message and error key/value; no other behavior changes
required.
In
`@apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go`:
- Around line 160-161: The code logs the pointer address because resolved.Site
is a *string; update the slog call in the Datadog metric request block (the
slog.ErrorContext invocation) to pass the already dereferenced local variable
site (or explicitly *resolved.Site after nil check) instead of resolved.Site so
the actual site string is logged; ensure you handle the nil case if you choose
to dereference (use the existing site string variable if present).
In `@apps/workspace-engine/svc/controllers/policyeval/controller.go`:
- Around line 125-127: Replace the plain slog.Error call with slog.ErrorContext
using the existing ctx variable: locate the error handling in the policy eval
reconcile worker creation (the block that currently calls slog.Error("Failed to
create policy eval reconcile worker", "error", err) and os.Exit(1)) and change
the log invocation to slog.ErrorContext(ctx, ...) so the log includes the
context for trace/span correlation while keeping the same message and error key;
leave the os.Exit(1) behavior unchanged.
In `@apps/workspace-engine/svc/http/http.go`:
- Around line 43-45: The current handler in the goroutine that runs
s.httpServer.ListenAndServe logs the error with slog.Error then calls os.Exit(1)
which prevents OTEL's batch logs from flushing; instead remove the direct
os.Exit call and send the error to a coordinator channel (e.g., a FatalErr or
serverError channel) so the main goroutine can perform an orderly shutdown
(invoke otel provider Shutdown/ForceFlush and other cleanup) before exiting;
modify the goroutine around s.httpServer.ListenAndServe and replace os.Exit
usage with sending the error to that channel, and ensure the main service
manager selects on that channel and performs the provider shutdown and final
os.Exit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d412833b-2030-4c55-903a-9034803be1af
⛔ Files ignored due to path filters (1)
apps/workspace-engine/go.sum is excluded by !**/*.sum
📒 Files selected for processing (47)
- apps/workspace-engine/go.mod
- apps/workspace-engine/log_handler.go
- apps/workspace-engine/log_handler_test.go
- apps/workspace-engine/main.go
- apps/workspace-engine/otel.go
- apps/workspace-engine/pkg/config/env.go
- apps/workspace-engine/pkg/db/client.go
- apps/workspace-engine/pkg/db/convert.go
- apps/workspace-engine/pkg/jobagents/argo/application_upserter.go
- apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
- apps/workspace-engine/pkg/jobagents/argoworkflows/workflow_submitter.go
- apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
- apps/workspace-engine/pkg/reconcile/events/deploymentselector.go
- apps/workspace-engine/pkg/reconcile/events/desiredrelease.go
- apps/workspace-engine/pkg/reconcile/events/environmentselector.go
- apps/workspace-engine/pkg/reconcile/events/jobeligibility.go
- apps/workspace-engine/pkg/reconcile/events/policyeval.go
- apps/workspace-engine/pkg/reconcile/worker.go
- apps/workspace-engine/pkg/store/policies/get_for_release_targets.go
- apps/workspace-engine/pkg/store/resources/get_resources.go
- apps/workspace-engine/pkg/workspace/relationships/eval/evaluate.go
- apps/workspace-engine/pkg/workspace/releasemanager/policy/evaluator/deploymentdependency/getter.go
- apps/workspace-engine/svc/claimcleanup/service.go
- apps/workspace-engine/svc/controllers/deploymentplan/controller.go
- apps/workspace-engine/svc/controllers/deploymentplanresult/controller.go
- apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
- apps/workspace-engine/svc/controllers/desiredrelease/controller.go
- apps/workspace-engine/svc/controllers/desiredrelease/reconcile.go
- apps/workspace-engine/svc/controllers/environmentresourceselectoreval/controller.go
- apps/workspace-engine/svc/controllers/forcedeploy/controller.go
- apps/workspace-engine/svc/controllers/forcedeploy/reconcile.go
- apps/workspace-engine/svc/controllers/jobdispatch/controller.go
- apps/workspace-engine/svc/controllers/jobeligibility/controller.go
- apps/workspace-engine/svc/controllers/jobeligibility/getters_postgres.go
- apps/workspace-engine/svc/controllers/jobverificationmetric/controller.go
- apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/metric.go
- apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go
- apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/http/http.go
- apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/prometheus/prometheus.go
- apps/workspace-engine/svc/controllers/jobverificationmetric/reconcile.go
- apps/workspace-engine/svc/controllers/policyeval/controller.go
- apps/workspace-engine/svc/controllers/relationshipeval/controller.go
- apps/workspace-engine/svc/http/http.go
- apps/workspace-engine/svc/http/server/server.go
- apps/workspace-engine/svc/pprof/pprof.go
- apps/workspace-engine/svc/service.go
- otel-collector-local.yaml
```diff
 var raw map[string]json.RawMessage
 if err := json.Unmarshal(row.Variables, &raw); err != nil {
-	log.Error("failed to unmarshal release variables", "error", err)
+	slog.Error("failed to unmarshal release variables", "error", err)
```
Add release ID to the unmarshal error log for debuggability.
Line 482 correctly includes "key", k, but line 477 logs the outer unmarshal failure with no identifier. Without row.ID (or at minimum the row's resource/deployment/environment IDs), correlating this log entry to a specific release in production is difficult.
🐛 Proposed fix

```diff
-	slog.Error("failed to unmarshal release variables", "error", err)
+	slog.Error("failed to unmarshal release variables", "releaseID", row.ID, "error", err)
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
+slog.Error("Failed to get pgx pool")
+os.Exit(1)
```
OTEL log records may be dropped on os.Exit(1) before the batch exporter flushes
The OTEL log exporter configured in this PR uses a batch processor. Calling os.Exit(1) immediately after slog.Error bypasses all deferred cleanup and Shutdown() calls on the SDK, so the log record is almost certainly never exported to the collector. This is true for every slog.Error + os.Exit(1) pair across the changed controllers.
Consider flushing/shutting down the OTEL provider before exiting, or at minimum accepting that these startup-failure records won't reach the collector (and ensuring the stderr output is enough for operators).
Also applies to: 125-127
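The "flush before exiting" option above can be sketched without pulling in the OTEL SDK. The `shutdowner` interface below mirrors the `Shutdown(ctx)` signature the SDK providers expose; the helper name, the `exit` indirection, and the recorder type are all hypothetical, introduced only to make the sketch self-contained and testable:

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"time"
)

// shutdowner mirrors the Shutdown signature of OTEL SDK providers
// (e.g. a logger provider); the interface exists only for this sketch.
type shutdowner interface {
	Shutdown(ctx context.Context) error
}

var exit = os.Exit // indirection so the exit path is observable in tests

// fatalExit logs the error, gives the batch exporter a bounded window to
// flush via Shutdown, then exits non-zero. Calling os.Exit directly after
// slog.Error would drop any records still queued in the batch processor.
func fatalExit(lp shutdowner, msg string, args ...any) {
	slog.Error(msg, args...)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := lp.Shutdown(ctx); err != nil {
		fmt.Fprintln(os.Stderr, "telemetry flush failed:", err)
	}
	exit(1)
}

// flushRecorder is a stand-in provider used to demonstrate the ordering.
type flushRecorder struct{ flushed bool }

func (f *flushRecorder) Shutdown(context.Context) error { f.flushed = true; return nil }

func main() {
	var code int
	exit = func(c int) { code = c } // don't actually kill the demo
	rec := &flushRecorder{}
	fatalExit(rec, "Failed to parse database config", "error", "bad DSN")
	fmt.Println(rec.flushed, code) // true 1
}
```

The timeout matters: a hung collector should delay shutdown by at most a few seconds, not block the process forever.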
```diff
 if err := s.httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
-	log.Fatal("HTTP ListenAndServe error", "error", err)
+	slog.Error("HTTP ListenAndServe error", "error", err)
+	os.Exit(1)
```
os.Exit(1) skips OTEL log flush — the fatal error log is likely never exported
slog.Error (line 44) enqueues the record in the BatchLogRecordProcessor, but os.Exit(1) immediately terminates the process, bypassing all deferred calls and the OTEL logger provider's Shutdown/ForceFlush. The error log that matters most (a fatal server failure) is precisely the one that will typically not reach the collector.
The previous charmbracelet log.Fatal wrote synchronously to stdout/stderr before exiting, so it was always visible. The OTEL async batch pipeline loses that guarantee.
The clean fix is to signal the main goroutine so it can drive an orderly shutdown (which includes flushing the OTEL providers) before exiting:
💡 Sketch of a channel-based approach

```diff
 type Service struct {
 	cfg        config.Config
 	pool       *pgxpool.Pool
 	httpServer *http.Server
+	fatalErr   chan error
 }

 func New(cfg config.Config, pool *pgxpool.Pool) *Service {
-	return &Service{cfg: cfg, pool: pool}
+	return &Service{cfg: cfg, pool: pool, fatalErr: make(chan error, 1)}
 }

+// FatalErr returns a channel that receives a non-nil error if the HTTP server
+// dies with an unrecoverable error. The service manager should drain this
+// channel and initiate a graceful shutdown (which flushes the OTEL pipeline).
+func (s *Service) FatalErr() <-chan error { return s.fatalErr }

 go func() {
 	slog.Info("HTTP server listening", "address", addr)
 	if err := s.httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 		slog.Error("HTTP ListenAndServe error", "error", err)
-		os.Exit(1)
+		s.fatalErr <- err
 	}
 }()
```

The main service manager then selects on FatalErr() and calls the shutdown sequence (including otelProvider.Shutdown(ctx)) before calling os.Exit.
Actionable comments posted: 1
🧹 Nitpick comments (1)
apps/workspace-engine/otel.go (1)
47-54: ⚡ Quick win: Resource construction is duplicated across all three init* functions
Identical resource.New(ctx, resource.WithAttributes(semconv.ServiceNameKey.String(serviceName))) blocks appear in initTracer, initMetrics, and initLogger. A shared helper (or passing a pre-built *resource.Resource) would eliminate the duplication and guarantee all three signals share the same resource descriptor.
♻️ Proposed refactor

```diff
+func buildResource(ctx context.Context, serviceName string) (*resource.Resource, error) {
+	return resource.New(ctx,
+		resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)),
+	)
+}
+
 func initTracer() (func(), error) {
 	ctx := context.Background()
 	serviceName := config.Global.OTELServiceName
 	endpoint := stripScheme(config.Global.OTELExporterOTLPEndpoint)
-	res, err := resource.New(ctx,
-		resource.WithAttributes(
-			semconv.ServiceNameKey.String(serviceName),
-		),
-	)
+	res, err := buildResource(ctx, serviceName)
 	if err != nil {
 		return nil, fmt.Errorf("failed to create resource: %w", err)
 	}
```

Apply the same substitution in initMetrics and initLogger.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bc8eb71b-8b44-4349-b394-c239904dbe74
📒 Files selected for processing (14)
- apps/workspace-engine/log_handler.go
- apps/workspace-engine/otel.go
- apps/workspace-engine/pkg/config/env.go
- apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
- apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
- apps/workspace-engine/pkg/store/policies/get_for_release_targets.go
- apps/workspace-engine/pkg/store/resources/get_resources.go
- apps/workspace-engine/svc/claimcleanup/service.go
- apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
- apps/workspace-engine/svc/controllers/desiredrelease/controller.go
- apps/workspace-engine/svc/controllers/forcedeploy/reconcile.go
- apps/workspace-engine/svc/controllers/jobverificationmetric/metrics/provider/datadog/datadog.go
- apps/workspace-engine/svc/controllers/policyeval/controller.go
- apps/workspace-engine/svc/service.go
✅ Files skipped from review due to trivial changes (1)
- apps/workspace-engine/svc/service.go
🚧 Files skipped from review as they are similar to previous changes (6)
- apps/workspace-engine/pkg/config/env.go
- apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
- apps/workspace-engine/pkg/jobagents/terraformcloud/tfe.go
- apps/workspace-engine/pkg/jobagents/argo/argocd_plan.go
- apps/workspace-engine/log_handler.go
- apps/workspace-engine/svc/claimcleanup/service.go
```go
slog.DebugContext(ctx, "Datadog metric measurement",
	"site", resolved.Site,
	"status", resp.StatusCode,
	"duration", duration)
```
resolved.Site is *string — the debug log will emit the pointer address, not the site string
Line 160 was correctly fixed to use the local site string variable, but the debug log at Line 205 still passes resolved.Site directly, which logs the pointer address.
🐛 Proposed fix

```diff
 slog.DebugContext(ctx, "Datadog metric measurement",
-	"site", resolved.Site,
+	"site", site,
 	"status", resp.StatusCode,
 	"duration", duration)
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
This PR enables OTEL log export. We had to switch our logging lib from charmbracelet to slog, because slog allows you to register log handlers, which lets us use the OTEL handler in the SDK.
slog is mostly compatible with our existing logging setup, with the exception of log.Fatal. I had to make a compromise to keep the existing behavior, which we should refactor separately. I used the context-aware log calls where possible so we get proper trace/span correlation with logs. Verified this is working locally with examples from the local otel collector.