vlantonov · vladiant · Apr 2, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,67 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [2.1.0] - 2026-04-02
+
+### Added
+
+- OpenTelemetry integration for distributed tracing, metrics, and log correlation.
+  Traces are exported via OTLP gRPC to Tempo; metrics are exposed via a Prometheus
+  `/metrics` endpoint; structured JSON logs include `trace_id` and `span_id` for
+  Loki–Tempo correlation.
+- RED metrics (Rate, Errors, Duration) and saturation gauges via custom
+  `MetricsMiddleware`: `http_requests_total`, `http_request_errors_total`,
+  `http_request_duration_seconds`, `http_active_requests`.
+- Image processing metrics: `image_processing_duration_seconds`,
+  `image_uploads_total`, `images_currently_processing`.
+- Auto-instrumentation for FastAPI routes and SQLAlchemy queries using
+  `opentelemetry-instrumentation-fastapi` and
+  `opentelemetry-instrumentation-sqlalchemy`.
+- Observability stack Kubernetes manifests (`minikube/observability/`):
+  Prometheus, Tempo, Loki, Promtail (DaemonSet), and Grafana with
+  pre-provisioned datasources and dashboards.
+- Three Grafana dashboards provisioned automatically:
+  - **RED Metrics** — request rate, error rate, latency percentiles, saturation
+  - **Traces** — service map, recent traces, duration distribution
+  - **Logs** — application logs, log volume by level, error log filter
+- `IMG_OTEL_ENABLED`, `IMG_OTEL_EXPORTER_OTLP_ENDPOINT`, and
+  `IMG_OTEL_SERVICE_NAME` configuration settings for opt-in observability.
+- Trace context (`trace_id`, `span_id`, `trace_flags`) injected into JSON log
+  output via `opentelemetry-instrumentation-logging`.
+- `minikube/observability/setup.sh` and `teardown.sh` scripts for one-command
+  deployment of the full observability stack.
+
+### Fixed
+
+- Trace context (`trace_id`, `span_id`) now correctly appears in log records by
+  using a `log_hook` callback in `LoggingInstrumentor` instead of relying on the
+  no-op `set_logging_format=False` mode.
+- OpenTelemetry `TracerProvider` and `MeterProvider` are now initialized in
+  `create_app()` before `FastAPIInstrumentor.instrument_app()`, ensuring spans
+  are created with the real provider instead of the no-op default.
+- Switched Dockerfile CMD to Uvicorn `--factory` mode
+  (`src.main:create_app --factory`) so each worker process initializes its own
+  `TracerProvider` and gRPC exporter, avoiding broken state from pre-fork setup.
+- Promtail log collection: added static `__path__` glob
+  (`/var/log/pods/cv-platform_image-service-*/*/*.log`) as a reliable fallback
+  alongside Kubernetes SD, with `docker: {}` pipeline stage for container log
+  format unwrapping.
+- Grafana provisioned datasources now have explicit `uid` fields (`prometheus`,
+  `tempo`, `loki`), fixing "Datasource prometheus was not found" errors in
+  Tempo's Service Map panel and dashboard cross-references.
+- Replaced `${DS_PROMETHEUS}`, `${DS_TEMPO}`, and `${DS_LOKI}` import-time
+  placeholders in all provisioned dashboard JSON files with hardcoded datasource
+  UIDs, since Grafana resolves them only on manual import, not on provisioning.
+- Enabled Tempo metrics generator with `service-graphs` and `span-metrics`
+  processors, and added `--web.enable-remote-write-receiver` to Prometheus, so
+  the Service Map panel receives `traces_service_graph_*` metrics.
+- Fixed `06-grafana-dashboards.yaml` ConfigMap which contained `PLACEHOLDER`
+  instead of actual dashboard JSON; now embeds the real dashboard definitions.
+- Changed Grafana anonymous org role from `Viewer` to `Editor` so trace ID links
+  in the Traces dashboard can open the Explore view.
+- `image_uploads_total` counter is now incremented in `UploadImageUseCase` on
+  each successful upload; previously the metric was defined but never recorded.
+
 ## [2.0.2] - 2026-04-02
 
 ### Added
@@ -249,7 +310,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Fix `type: ignore` comment on `rowcount` to use correct mypy error code `attr-defined`.
 - Add proper type annotation for `settings` parameter in retention sweep endpoint.
 
-[unreleased]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.0.2...HEAD
+[unreleased]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.1.0...HEAD
+[2.1.0]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.0.2...v2.1.0
 [2.0.2]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.0.1...v2.0.2
 [2.0.1]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.0.0...v2.0.1
 [2.0.0]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v1.3.0...v2.0.0
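
Review note: the changelog entry about trace context in log records can be illustrated with a minimal, stdlib-only sketch. It is not the project's code and does not use the OpenTelemetry API. `FakeSpanContext`, `log_hook`, `JsonFormatter`, and `render` are hypothetical names, and the hook signature is simplified for illustration. The sketch only shows the idea of copying hex-encoded identifiers onto a log record before JSON formatting.

```python
import json
import logging
from dataclasses import dataclass


@dataclass
class FakeSpanContext:
    """Hypothetical stand-in for a span context (not the OpenTelemetry class)."""
    trace_id: int
    span_id: int
    trace_flags: int = 1


def log_hook(span_context, record):
    """Copy trace identifiers onto the log record as fixed-width hex strings."""
    if span_context is None:
        return  # no active span: leave the record untouched
    record.trace_id = format(span_context.trace_id, "032x")
    record.span_id = format(span_context.span_id, "016x")
    record.trace_flags = format(span_context.trace_flags, "02x")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, including trace fields when present."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key in ("trace_id", "span_id", "trace_flags"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def render(message, span_context=None):
    """Build a record, run the hook, and return the formatted JSON line."""
    record = logging.LogRecord("app", logging.INFO, "sketch.py", 0, message, None, None)
    log_hook(span_context, record)
    return JsonFormatter().format(record)
```

Calling `render("upload stored", FakeSpanContext(trace_id=0xABC, span_id=0x1F))` yields a JSON line carrying `trace_id` and `span_id`, which is the shape a Loki derived field would need in order to link a log line to a trace.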

diff --git a/Dockerfile b/Dockerfile
@@ -27,4 +27,4 @@ USER appuser
 
 EXPOSE 8000
 
-CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
+CMD ["uvicorn", "src.main:create_app", "--factory", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
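
Review note: with `--factory`, Uvicorn treats the import target as a callable that returns the application. The sketch below shows the factory shape and the ordering described in the changelog (providers first, instrumentation second). It is stdlib-only, and `App`, `init_providers`, and `instrument` are hypothetical stand-ins rather than the project's or OpenTelemetry's actual functions.

```python
import os


class App:
    """Hypothetical stand-in for the FastAPI application object."""

    def __init__(self):
        self.events = []


def init_providers(app):
    # Placeholder for TracerProvider/MeterProvider setup (hypothetical helper).
    app.events.append(("providers", os.getpid()))


def instrument(app):
    # Placeholder for route instrumentation (hypothetical helper).
    app.events.append(("instrument", os.getpid()))


def create_app():
    """Application factory: build a fresh app and wire telemetry in order."""
    app = App()
    init_providers(app)  # providers first, so instrumentation sees them
    instrument(app)
    return app
```

Each call to `create_app()` returns a new object whose setup ran in the calling process, which is the property the changelog entry relies on.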
diff --git a/PROJECT_DESCRIPTION.md b/PROJECT_DESCRIPTION.md
@@ -120,6 +120,44 @@ A complete **Minikube demo** is included (`minikube/`) with automated setup, tea
 
 ---
 
+## Observability (OpenTelemetry)
+
+The service is fully instrumented with **OpenTelemetry** for distributed tracing, metrics, and log correlation — enabled via `IMG_OTEL_ENABLED=true`.
+
+### Tracing
+
+- Auto-instrumentation for FastAPI routes and SQLAlchemy queries via `opentelemetry-instrumentation-fastapi` and `opentelemetry-instrumentation-sqlalchemy`.
+- Traces exported via OTLP gRPC to **Grafana Tempo**, with `trace_id` and `span_id` injected into structured JSON logs for correlation.
+
+### Metrics (RED + Saturation)
+
+- **Rate**: `http_requests_total` — total HTTP requests by method, path, status.
+- **Errors**: `http_request_errors_total` — 4xx/5xx errors by status code.
+- **Duration**: `http_request_duration_seconds` — request latency histogram.
+- **Saturation**: `http_active_requests` — in-flight request gauge.
+- **Image processing**: `image_processing_duration_seconds`, `image_uploads_total`, `images_currently_processing`.
+- Metrics exposed via a Prometheus `/metrics` endpoint, scraped by **Prometheus**.
+
+### Logging
+
+- Structured JSON logs with `trace_id`, `span_id`, and `correlation_id` fields.
+- Collected by **Promtail** and shipped to **Grafana Loki** for aggregation and search.
+- Derived fields in Loki link `trace_id` to Tempo for seamless log-to-trace navigation.
+
+### Grafana Dashboards
+
+Three dashboards are provisioned automatically:
+
+| Dashboard | Description |
+|-----------|-------------|
+| **RED Metrics** | Request rate, error rate, latency percentiles, saturation gauges |
+| **Traces** | Service map, recent traces with clickable trace IDs, duration distribution |
+| **Logs** | Application logs, log volume by level, error log filter |
+
+The full observability stack (Prometheus, Tempo, Loki, Promtail, Grafana) deploys to a separate `observability` Kubernetes namespace via `minikube/observability/setup.sh`.
+
+---
+
 ## 12-Factor Configuration
 
 All settings are provided via environment variables (prefix `IMG_`) using **pydantic-settings**, ensuring type-safe, validated configuration:
@@ -145,6 +183,9 @@ All settings are provided via environment variables (prefix `IMG_`) using **pyda
 | `IMG_RATE_LIMIT_READ_MAX` | `60` | Max read requests per window per IP |
 | `IMG_RATE_LIMIT_READ_WINDOW` | `60` | Read rate limit window (seconds) |
 | `IMG_DEBUG` | `false` | Enable debug logging |
+| `IMG_OTEL_ENABLED` | `false` | Enable OpenTelemetry instrumentation |
+| `IMG_OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint for trace export |
+| `IMG_OTEL_SERVICE_NAME` | `image-processing-service` | Service name in traces and metrics |
 
 ---
 
@@ -188,6 +229,8 @@ Test tooling: pytest, pytest-asyncio (auto mode), httpx, aiosqlite (in-memory SQ
 | Multi-Stage Docker | Builder/runtime separation for minimal production images |
 | Horizontal Autoscaling | Kubernetes HPA scales pods based on CPU and memory metrics |
 | 12-Factor Config | Type-safe environment variables via pydantic-settings |
+| OpenTelemetry | Distributed tracing, metrics, and log correlation across services |
+| RED Metrics | Rate, Errors, Duration monitoring via custom middleware |
 | FastAPI DI | Routes depend on use case abstractions, not concrete implementations |
 
 ---
@@ -205,4 +248,5 @@ Test tooling: pytest, pytest-asyncio (auto mode), httpx, aiosqlite (in-memory SQ
 | **Orchestration** | Kubernetes, Minikube (local demo) |
 | **Testing** | pytest, pytest-asyncio, httpx, aiosqlite |
 | **Code Quality** | ruff (linter/formatter), mypy (type checker) |
+| **Observability** | OpenTelemetry SDK, Prometheus, Grafana, Tempo, Loki, Promtail |
 | **Migrations** | Alembic |
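
Review note: the RED and saturation metrics listed in the Observability section can be sketched as a pure-ASGI middleware. This is an illustration under stated assumptions, not the project's `MetricsMiddleware`. It uses in-memory counters instead of an OpenTelemetry meter, and `RedMetricsMiddleware`, `demo_app`, and `drive` are hypothetical names.

```python
import asyncio
import time
from collections import Counter


class RedMetricsMiddleware:
    """Record request rate, errors, duration, and in-flight requests."""

    def __init__(self, app):
        self.app = app
        self.requests_total = Counter()  # keyed by (method, path, status)
        self.errors_total = Counter()    # keyed by status code (4xx/5xx)
        self.durations = []              # seconds; a real histogram would bucket these
        self.active_requests = 0         # saturation gauge

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        status = {"code": 500}  # assume failure unless a response starts

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        self.active_requests += 1
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.active_requests -= 1
            self.durations.append(time.perf_counter() - start)
            key = (scope["method"], scope["path"], status["code"])
            self.requests_total[key] += 1
            if status["code"] >= 400:
                self.errors_total[status["code"]] += 1


async def demo_app(scope, receive, send):
    """Tiny ASGI app: 404 for /missing, 200 otherwise."""
    code = 404 if scope["path"] == "/missing" else 200
    await send({"type": "http.response.start", "status": code, "headers": []})
    await send({"type": "http.response.body", "body": b""})


async def drive(middleware, path):
    """Send one fake GET request through the middleware."""
    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        pass

    await middleware({"type": "http", "method": "GET", "path": path}, receive, send)
```

Updating the counters in a `finally` block means requests that raise are still counted, and defaulting the status to 500 records them as errors.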
diff --git a/README.md b/README.md
@@ -43,6 +43,16 @@ cd minikube && ./setup.sh    # deploy full stack
 
 See [minikube/README.md](minikube/README.md) for details.
 
+### Observability Stack (Minikube)
+
+```bash
+cd minikube/observability && ./setup.sh   # deploy Prometheus, Tempo, Loki, Grafana
+minikube service grafana --namespace=observability  # open Grafana
+./teardown.sh                             # clean up
+```
+
+See [minikube/observability/README.md](minikube/observability/README.md) for dashboards and configuration.
+
 ### Kubernetes (Production)
 
 ```bash
@@ -99,6 +109,9 @@ All settings via environment variables (prefix `IMG_`), validated by [pydantic-s
 | `IMG_RATE_LIMIT_READ_MAX` | `60` | Max read requests per window per IP |
 | `IMG_RATE_LIMIT_READ_WINDOW` | `60` | Read rate limit window (seconds) |
 | `IMG_DEBUG` | `false` | Enable debug logging |
+| `IMG_OTEL_ENABLED` | `false` | Enable OpenTelemetry instrumentation |
+| `IMG_OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint for trace export |
+| `IMG_OTEL_SERVICE_NAME` | `image-processing-service` | Service name in traces and metrics |
 
 ## Project Structure
 
@@ -109,11 +122,13 @@ src/
 ├── domain/                            # Entities & ports (zero external deps)
 ├── application/                       # Use cases & DTOs
 ├── infrastructure/                    # Adapters (PostgreSQL, Pillow, filesystem)
+│   └── observability/                 # OpenTelemetry setup, metrics, middleware
 └── presentation/                      # FastAPI routes, schemas, middleware
 
 cpp/                                   # Optional C++ resize module (pybind11)
 k8s/                                   # Kubernetes manifests (Deployment, HPA, PVC, …)
 minikube/                              # Local K8s demo scripts
+│   └── observability/                 # Prometheus, Tempo, Loki, Grafana manifests
 tests/                                 # tests across all architecture layers
 ```
 

diff --git a/minikube/01-configmap.yaml b/minikube/01-configmap.yaml
@@ -13,3 +13,6 @@ data:
   IMG_DB_MAX_OVERFLOW: "10"
   IMG_RETENTION_BATCH_SIZE: "50"
   IMG_DEBUG: "true"
+  IMG_OTEL_ENABLED: "true"
+  IMG_OTEL_EXPORTER_OTLP_ENDPOINT: "http://tempo.observability.svc.cluster.local:4317"
+  IMG_OTEL_SERVICE_NAME: "image-processing-service"
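
Review note: the three `IMG_OTEL_*` keys above map onto the opt-in settings documented in the README. The project uses pydantic-settings; the sketch below is a stdlib-only approximation of the same idea, with the defaults taken from the configuration table. `OtelSettings` and `load_otel_settings` are hypothetical names, and the accepted truthy strings are an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OtelSettings:
    """Defaults mirror the documented configuration table."""
    enabled: bool = False
    exporter_otlp_endpoint: str = "http://localhost:4317"
    service_name: str = "image-processing-service"


def load_otel_settings(env=None):
    """Read IMG_OTEL_* variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    defaults = OtelSettings()
    raw_enabled = env.get("IMG_OTEL_ENABLED", "false")
    return OtelSettings(
        # Assumed truthy spellings; the real parser may accept a different set.
        enabled=raw_enabled.strip().lower() in {"1", "true", "yes", "on"},
        exporter_otlp_endpoint=env.get(
            "IMG_OTEL_EXPORTER_OTLP_ENDPOINT", defaults.exporter_otlp_endpoint
        ),
        service_name=env.get("IMG_OTEL_SERVICE_NAME", defaults.service_name),
    )
```

With the ConfigMap values applied, `load_otel_settings()` would report `enabled=True` and the in-cluster Tempo endpoint; with no variables set, observability stays off.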
diff --git a/minikube/05-service.yaml b/minikube/05-service.yaml
@@ -8,7 +8,8 @@ spec:
   selector:
     app: image-service
   ports:
-    - port: 80
+    - name: http
+      port: 80
       targetPort: 8000
       nodePort: 30080
       protocol: TCP

diff --git a/minikube/observability/00-namespace.yaml b/minikube/observability/00-namespace.yaml
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: observability
diff --git a/minikube/observability/01-prometheus.yaml b/minikube/observability/01-prometheus.yaml
@@ -0,0 +1,113 @@
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: prometheus-config
+  namespace: observability
+data:
+  prometheus.yml: |
+    global:
+      scrape_interval: 15s
+      evaluation_interval: 15s
+
+    scrape_configs:
+      - job_name: "image-service"
+        metrics_path: /metrics
+        kubernetes_sd_configs:
+          - role: endpoints
+            namespaces:
+              names:
+                - cv-platform
+        relabel_configs:
+          - source_labels: [__meta_kubernetes_service_name]
+            action: keep
+            regex: image-service
+          - source_labels: [__meta_kubernetes_endpoint_port_name]
+            action: keep
+            regex: http
+      # Fallback static config if K8s SD is not available
+      - job_name: "image-service-static"
+        metrics_path: /metrics
+        static_configs:
+          - targets: ["image-service.cv-platform.svc.cluster.local:80"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: prometheus
+rules:
+  - apiGroups: [""]
+    resources: ["services", "endpoints", "pods"]
+    verbs: ["get", "list", "watch"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: prometheus
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: prometheus
+subjects:
+  - kind: ServiceAccount
+    name: default
+    namespace: observability
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: prometheus
+  namespace: observability
+  labels:
+    app: prometheus
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: prometheus
+  template:
+    metadata:
+      labels:
+        app: prometheus
+    spec:
+      containers:
+        - name: prometheus
+          image: prom/prometheus:v2.53.0
+          args:
+            - "--config.file=/etc/prometheus/prometheus.yml"
+            - "--storage.tsdb.path=/prometheus"
+            - "--storage.tsdb.retention.time=7d"
+            - "--web.enable-remote-write-receiver"
+          ports:
+            - containerPort: 9090
+          resources:
+            requests:
+              cpu: "100m"
+              memory: "256Mi"
+            limits:
+              cpu: "500m"
+              memory: "512Mi"
+          volumeMounts:
+            - name: config
+              mountPath: /etc/prometheus
+            - name: data
+              mountPath: /prometheus
+      volumes:
+        - name: config
+          configMap:
+            name: prometheus-config
+        - name: data
+          emptyDir: {}
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: prometheus
+  namespace: observability
+spec:
+  selector:
+    app: prometheus
+  ports:
+    - port: 9090
+      targetPort: 9090
+  type: ClusterIP
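
Review note: the two `keep` relabel rules in the `image-service` scrape job act as successive filters on discovered endpoints, and Prometheus anchors relabel regexes so the whole label value must match. The sketch below imitates that filtering in plain Python. `keep` and the sample `discovered` targets are hypothetical and only illustrate why the Service port is now named `http`.

```python
import re


def keep(targets, source_label, regex):
    """Keep targets whose label value fully matches the regex."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical discovery results for illustration only.
discovered = [
    {"__meta_kubernetes_service_name": "image-service",
     "__meta_kubernetes_endpoint_port_name": "http"},
    {"__meta_kubernetes_service_name": "image-service",
     "__meta_kubernetes_endpoint_port_name": "metrics"},
    {"__meta_kubernetes_service_name": "postgres",
     "__meta_kubernetes_endpoint_port_name": "http"},
]

# Apply the rules in the same order as the scrape config.
kept = keep(
    keep(discovered, "__meta_kubernetes_service_name", "image-service"),
    "__meta_kubernetes_endpoint_port_name",
    "http",
)
```

An endpoint with an unnamed port has no value to match, so it would be dropped by the second rule; naming the port in `minikube/05-service.yaml` avoids that.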