
Add OpenTelemetry observability stack (traces, metrics, logs) #13

Merged: vladiant merged 6 commits into main from add_observability on Apr 2, 2026

vladiant commented Apr 2, 2026

Summary

Adds full observability to the image processing service using OpenTelemetry, with a dedicated minikube namespace for the monitoring stack (Prometheus, Tempo, Loki, Grafana).

What changed

Application instrumentation (src)

  • OpenTelemetry SDK integration: distributed tracing via OTLP gRPC to Tempo, metrics via Prometheus exporter, log correlation via LoggingInstrumentor with log_hook callback.
  • MetricsMiddleware exposing RED metrics: http_requests_total, http_request_errors_total, http_request_duration_seconds, http_active_requests.
  • Image-specific metrics: image_processing_duration_seconds, image_uploads_total, images_currently_processing.
  • Auto-instrumentation for FastAPI routes and SQLAlchemy queries.
  • Trace context (trace_id, span_id, trace_flags) injected into structured JSON logs for Loki–Tempo correlation.
  • Three opt-in config settings: IMG_OTEL_ENABLED, IMG_OTEL_EXPORTER_OTLP_ENDPOINT, IMG_OTEL_SERVICE_NAME.
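
A rough sketch of how a middleware like MetricsMiddleware could collect the four HTTP metrics listed above. This is not the PR's code: it keeps counts in plain in-memory dicts instead of going through the Prometheus exporter, and the attribute names, the `(method, path)` label key, the `drive` helper, and the rule that a 5xx response or an exception before the response starts counts as an error are all assumptions made for illustration.

```python
import asyncio
import time
from collections import defaultdict


class MetricsMiddleware:
    """Pure-ASGI middleware recording RED-style metrics in plain dicts."""

    def __init__(self, app):
        self.app = app
        self.requests_total = defaultdict(int)     # (method, path) -> request count
        self.errors_total = defaultdict(int)       # (method, path) -> error count
        self.duration_seconds = defaultdict(list)  # (method, path) -> observed durations
        self.active_requests = 0                   # gauge of in-flight requests

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan / websocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        key = (scope["method"], scope["path"])
        status = {"code": 500}  # stays 500 if the app fails before responding

        async def send_wrapper(message):
            # Capture the status code when the response starts.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        self.active_requests += 1
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.active_requests -= 1
            self.requests_total[key] += 1
            self.duration_seconds[key].append(time.perf_counter() - start)
            if status["code"] >= 500:
                self.errors_total[key] += 1


async def ok_app(scope, receive, send):
    """Tiny demo ASGI app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def drive(app, path, method="GET"):
    """Hypothetical helper: send one fake HTTP request through an ASGI app."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": method, "path": path}, receive, send)
    return sent
```

Writing the middleware at the ASGI level keeps the timing around the whole downstream call, so the duration and active-request figures cover the full response, not only the handler body.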

Observability stack (observability)

  • Kubernetes manifests for Prometheus, Tempo (with service-graphs and span-metrics generators), Loki, Promtail (DaemonSet), and Grafana — all in a separate observability namespace.
  • Grafana pre-provisioned with datasources (Prometheus, Tempo, Loki) and three dashboards:
    • RED Metrics — request rate, error rate, latency percentiles, saturation gauges
    • Traces — service map, recent traces with clickable trace IDs, duration distribution
    • Logs — application logs, log volume by level, error filter
  • One-command setup.sh / teardown.sh scripts.

Build

  • Dockerfile switched to Uvicorn --factory mode so each worker initializes its own TracerProvider.
  • 8 new OpenTelemetry dependencies added to requirements.txt / pyproject.toml.
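
To illustrate why factory mode matters, here is a minimal stdlib-only sketch of an application factory that reads the three `IMG_OTEL_*` settings and performs its telemetry setup inside the factory call. The default values, the `OtelSettings` class, and the `init_telemetry` placeholder are assumptions for illustration; the actual service would create its TracerProvider at that point.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OtelSettings:
    enabled: bool
    endpoint: str
    service_name: str

    @classmethod
    def from_env(cls, env=None):
        """Read the three opt-in settings; defaults here are assumed values."""
        env = os.environ if env is None else env
        flag = env.get("IMG_OTEL_ENABLED", "false").strip().lower()
        return cls(
            enabled=flag in {"1", "true", "yes"},
            endpoint=env.get("IMG_OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
            service_name=env.get("IMG_OTEL_SERVICE_NAME", "image-service"),
        )


def init_telemetry(settings):
    """Hypothetical placeholder for per-process tracing setup.

    It only returns a marker dict tagged with the current process id.
    """
    return {
        "service": settings.service_name,
        "endpoint": settings.endpoint,
        "pid": os.getpid(),
    }


def create_app(env=None):
    """Application factory: build settings and telemetry, then return an ASGI app."""
    settings = OtelSettings.from_env(env)
    telemetry = init_telemetry(settings) if settings.enabled else None

    async def app(scope, receive, send):
        if scope["type"] != "http":
            return
        await send({"type": "http.response.start", "status": 200, "headers": []})
        await send({"type": "http.response.body", "body": b"ok"})

    # Expose the per-process state so it can be inspected in tests.
    app.telemetry = telemetry
    return app
```

With a module named, say, `app_factory.py` (a hypothetical name), this would be started as `uvicorn app_factory:create_app --factory --workers 2`. Uvicorn then calls `create_app()` inside each worker process, so the setup runs once per worker.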

Key fixes included

  • log_hook callback for correct trace_id/span_id in log records.
  • TracerProvider init order fixed (before instrument_app()).
  • Promtail static path fallback for reliable log discovery.
  • Grafana datasource UIDs hardcoded (provisioned dashboards don't resolve template variables).
  • Tempo metrics generator + Prometheus --web.enable-remote-write-receiver for service map data.
  • Grafana anonymous role set to Editor so trace ID links open Explore.
  • image_uploads_total counter now actually incremented on upload.

Testing

  • All 216 tests pass — no external services required.
  • Lint (ruff), format, and type-check (mypy --strict) all clean.
  • Manually verified in minikube: metrics scraped, traces flowing, logs correlated, all three dashboards rendering correctly.

Version

MINOR bump → 2.1.0 (new feature, no breaking changes).

Stats

29 files changed, +1954, −6

vladiant merged commit ec531d9 into main Apr 2, 2026
4 checks passed
vladiant deleted the add_observability branch April 2, 2026 07:01