Merged
64 changes: 63 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -7,6 +7,67 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [2.1.0] - 2026-04-02

### Added

- OpenTelemetry integration for distributed tracing, metrics, and log correlation.
Traces are exported via OTLP gRPC to Tempo; metrics are exposed via a Prometheus
`/metrics` endpoint; structured JSON logs include `trace_id` and `span_id` for
Loki–Tempo correlation.
- RED metrics (Rate, Errors, Duration) and saturation gauges via custom
`MetricsMiddleware`: `http_requests_total`, `http_request_errors_total`,
`http_request_duration_seconds`, `http_active_requests`.
- Image processing metrics: `image_processing_duration_seconds`,
`image_uploads_total`, `images_currently_processing`.
- Auto-instrumentation for FastAPI routes and SQLAlchemy queries using
`opentelemetry-instrumentation-fastapi` and
`opentelemetry-instrumentation-sqlalchemy`.
- Observability stack Kubernetes manifests (`minikube/observability/`):
Prometheus, Tempo, Loki, Promtail (DaemonSet), and Grafana with
pre-provisioned datasources and dashboards.
- Three Grafana dashboards provisioned automatically:
  - **RED Metrics**: request rate, error rate, latency percentiles, saturation
  - **Traces**: service map, recent traces, duration distribution
  - **Logs**: application logs, log volume by level, error log filter
- `IMG_OTEL_ENABLED`, `IMG_OTEL_EXPORTER_OTLP_ENDPOINT`, and
`IMG_OTEL_SERVICE_NAME` configuration settings for opt-in observability.
- Trace context (`trace_id`, `span_id`, `trace_flags`) injected into JSON log
output via `opentelemetry-instrumentation-logging`.
- `minikube/observability/setup.sh` and `teardown.sh` scripts for one-command
deployment of the full observability stack.

### Fixed

- Trace context (`trace_id`, `span_id`) now correctly appears in log records by
using a `log_hook` callback in `LoggingInstrumentor` instead of relying on the
no-op `set_logging_format=False` mode.
- OpenTelemetry `TracerProvider` and `MeterProvider` are now initialized in
`create_app()` before `FastAPIInstrumentor.instrument_app()`, ensuring spans
are created with the real provider instead of the no-op default.
- Switched Dockerfile CMD to Uvicorn `--factory` mode
(`src.main:create_app --factory`) so each worker process initializes its own
`TracerProvider` and gRPC exporter, avoiding broken state from pre-fork setup.
- Promtail log collection: added static `__path__` glob
(`/var/log/pods/cv-platform_image-service-*/*/*.log`) as a reliable fallback
alongside Kubernetes SD, with `docker: {}` pipeline stage for container log
format unwrapping.
- Grafana provisioned datasources now have explicit `uid` fields (`prometheus`,
`tempo`, `loki`), fixing "Datasource prometheus was not found" errors in
Tempo's Service Map panel and dashboard cross-references.
- Replaced the `${DS_PROMETHEUS}`, `${DS_TEMPO}`, and `${DS_LOKI}` placeholders
in all provisioned dashboard JSON files with hardcoded datasource UIDs, since
Grafana resolves these `${DS_*}` input variables only when a dashboard is
imported interactively, not when it is loaded through file provisioning.
- Enabled Tempo metrics generator with `service-graphs` and `span-metrics`
processors, and added `--web.enable-remote-write-receiver` to Prometheus, so
the Service Map panel receives `traces_service_graph_*` metrics.
- Fixed `06-grafana-dashboards.yaml` ConfigMap which contained `PLACEHOLDER`
instead of actual dashboard JSON; now embeds the real dashboard definitions.
- Changed Grafana anonymous org role from `Viewer` to `Editor` so trace ID links
in the Traces dashboard can open the Explore view.
- `image_uploads_total` counter is now incremented in `UploadImageUseCase` on
each successful upload; previously the metric was defined but never recorded.
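
The datasource-UID fix above can be illustrated with a short sketch. It assumes a dashboard is plain JSON in which a placeholder such as `${DS_PROMETHEUS}` appears as a whole string value; the `pin_datasources` helper and the sample dashboard fragment are hypothetical and are not taken from this PR. The UIDs (`prometheus`, `tempo`, `loki`) are the ones named in the changelog entry.

```python
import json

# Placeholder -> provisioned datasource UID (UIDs as named in the changelog).
DATASOURCE_UIDS = {
    "${DS_PROMETHEUS}": "prometheus",
    "${DS_TEMPO}": "tempo",
    "${DS_LOKI}": "loki",
}


def pin_datasources(node):
    """Return a copy of a dashboard JSON tree with placeholders replaced by UIDs.

    Hypothetical helper: only string values that are exactly a placeholder
    are replaced; queries and titles are left untouched.
    """
    if isinstance(node, dict):
        return {key: pin_datasources(value) for key, value in node.items()}
    if isinstance(node, list):
        return [pin_datasources(item) for item in node]
    if isinstance(node, str):
        return DATASOURCE_UIDS.get(node, node)
    return node


# Hypothetical dashboard fragment, shaped like an exported Grafana panel.
dashboard = json.loads("""
{
  "title": "RED Metrics",
  "panels": [
    {
      "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
      "targets": [{"expr": "sum(rate(http_requests_total[5m]))"}]
    }
  ]
}
""")

pinned = pin_datasources(dashboard)
print(json.dumps(pinned["panels"][0]["datasource"]))
```

Pinning the UIDs in the JSON only works because the provisioned datasources declare the same explicit `uid` values, which is why the two fixes appear together in this release.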

## [2.0.2] - 2026-04-02

### Added
@@ -249,7 +310,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Fix `type: ignore` comment on `rowcount` to use correct mypy error code `attr-defined`.
- Add proper type annotation for `settings` parameter in retention sweep endpoint.

-[unreleased]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.0.2...HEAD
+[unreleased]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.1.0...HEAD
+[2.1.0]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.0.2...v2.1.0
[2.0.2]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.0.1...v2.0.2
[2.0.1]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v2.0.0...v2.0.1
[2.0.0]: https://github.com/vlantonov/ImageProcessingServiceDemo/compare/v1.3.0...v2.0.0
2 changes: 1 addition & 1 deletion Dockerfile
@@ -27,4 +27,4 @@ USER appuser

EXPOSE 8000

-CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
+CMD ["uvicorn", "src.main:create_app", "--factory", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
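
With `--factory`, Uvicorn treats the import string as a callable and calls it to obtain the application instead of importing a ready-made `app` object. The sketch below shows the application-factory pattern in plain Python. `FakeExporter` is a hypothetical stand-in for the per-process `TracerProvider` and gRPC exporter, and this `create_app` is not the project's implementation.

```python
import os


class FakeExporter:
    """Hypothetical stand-in for an exporter that must not be shared across processes."""

    def __init__(self):
        self.owner_pid = os.getpid()  # records which process built it


def create_app():
    """Application factory: everything stateful is built inside the call."""
    exporter = FakeExporter()

    async def app(scope, receive, send):
        # Minimal ASGI handler; a real factory would return the FastAPI app.
        await send({"type": "http.response.start", "status": 200, "headers": []})
        await send({"type": "http.response.body", "body": b"ok"})

    app.exporter = exporter  # attached only so the sketch can be inspected
    return app


# Each call builds fresh state, as each worker would when it calls the factory.
first = create_app()
second = create_app()
print(first.exporter is second.exporter)
```

Because the state is created inside the call, a process that invokes the factory owns its exporter outright, which is the property the changelog entry relies on.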
44 changes: 44 additions & 0 deletions PROJECT_DESCRIPTION.md
@@ -120,6 +120,44 @@ A complete **Minikube demo** is included (`minikube/`) with automated setup, tea

---

## Observability (OpenTelemetry)

The service is instrumented with **OpenTelemetry** for distributed tracing, metrics, and log correlation. Instrumentation is opt-in: set `IMG_OTEL_ENABLED=true` to enable it (the default is `false`).

### Tracing

- Auto-instrumentation for FastAPI routes and SQLAlchemy queries via `opentelemetry-instrumentation-fastapi` and `opentelemetry-instrumentation-sqlalchemy`.
- Traces exported via OTLP gRPC to **Grafana Tempo**, with `trace_id` and `span_id` injected into structured JSON logs for correlation.

### Metrics (RED + Saturation)

- **Rate**: `http_requests_total` — total HTTP requests by method, path, status.
- **Errors**: `http_request_errors_total` — 4xx/5xx errors by status code.
- **Duration**: `http_request_duration_seconds` — request latency histogram.
- **Saturation**: `http_active_requests` — in-flight request gauge.
- **Image processing**: `image_processing_duration_seconds`, `image_uploads_total`, `images_currently_processing`.
- Metrics are exposed at a `/metrics` endpoint in Prometheus format and scraped by **Prometheus**.
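
To make the RED and saturation signals listed above concrete, here is a minimal ASGI middleware sketch using only the standard library. It is not the project's `MetricsMiddleware`: `InMemoryRedMetrics`, `RedMiddleware`, and `demo_app` are hypothetical names, and an in-memory store stands in for a real metrics backend.

```python
import asyncio
import time
from collections import Counter


class InMemoryRedMetrics:
    """Hypothetical stand-in for a real metrics backend."""

    def __init__(self):
        self.requests_total = Counter()  # Rate: keyed by (method, path, status)
        self.errors_total = Counter()    # Errors: keyed by status code
        self.durations = []              # Duration: observed latencies in seconds
        self.active_requests = 0         # Saturation: in-flight request gauge


class RedMiddleware:
    """Wraps an ASGI app and records the four signals for each HTTP request."""

    def __init__(self, app, metrics):
        self.app = app
        self.metrics = metrics

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        status = 500  # assumed until the app actually starts a response

        async def send_wrapper(message):
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)

        self.metrics.active_requests += 1
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.metrics.active_requests -= 1
            self.metrics.durations.append(time.perf_counter() - start)
            self.metrics.requests_total[(scope["method"], scope["path"], status)] += 1
            if status >= 400:  # 4xx and 5xx both count as errors
                self.metrics.errors_total[status] += 1


async def demo_app(scope, receive, send):
    status = 404 if scope["path"] == "/missing" else 200
    await send({"type": "http.response.start", "status": status, "headers": []})
    await send({"type": "http.response.body", "body": b""})


async def _drive(app, path):
    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        pass

    await app({"type": "http", "method": "GET", "path": path}, receive, send)


metrics = InMemoryRedMetrics()
wrapped = RedMiddleware(demo_app, metrics)
asyncio.run(_drive(wrapped, "/images"))
asyncio.run(_drive(wrapped, "/missing"))
print(dict(metrics.requests_total), dict(metrics.errors_total))
```

Updating the counters in a `finally` block means a request that raises is still counted, with the assumed `500` status, and the in-flight gauge always returns to its previous value.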

### Logging

- Structured JSON logs with `trace_id`, `span_id`, and `correlation_id` fields.
- Collected by **Promtail** and shipped to **Grafana Loki** for aggregation and search.
- Derived fields in Loki link `trace_id` to Tempo for seamless log-to-trace navigation.
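
The log-to-trace correlation described above depends on every JSON log line carrying the identifiers of the active span. The following sketch shows the idea with the standard `logging` module. A `contextvars` variable is a hypothetical stand-in for the OpenTelemetry span context, and `TraceJsonFormatter` and `render` are illustrative names rather than the project's code.

```python
import contextvars
import json
import logging

# Hypothetical stand-in for the active span: a (trace_id, span_id) pair of ints.
_current_span = contextvars.ContextVar("current_span", default=None)


class TraceJsonFormatter(logging.Formatter):
    """Formats records as JSON and adds trace fields when a span is active."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        span = _current_span.get()
        if span is not None:
            trace_id, span_id = span
            # W3C Trace Context renders trace IDs as 32 and span IDs as
            # 16 lowercase hex digits.
            payload["trace_id"] = format(trace_id, "032x")
            payload["span_id"] = format(span_id, "016x")
        return json.dumps(payload)


def render(message, span=None):
    """Format one INFO record as it would look while `span` is active."""
    token = _current_span.set(span)
    try:
        record = logging.LogRecord(
            "image-service", logging.INFO, "demo.py", 0, message, None, None
        )
        return TraceJsonFormatter().format(record)
    finally:
        _current_span.reset(token)


print(render("upload stored", span=(0xABC, 0x1)))
print(render("startup complete"))
```

A Loki derived field can then extract the `trace_id` value from each line and link it to the matching trace in Tempo.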

### Grafana Dashboards

Three dashboards are provisioned automatically:

| Dashboard | Description |
|-----------|-------------|
| **RED Metrics** | Request rate, error rate, latency percentiles, saturation gauges |
| **Traces** | Service map, recent traces with clickable trace IDs, duration distribution |
| **Logs** | Application logs, log volume by level, error log filter |

The full observability stack (Prometheus, Tempo, Loki, Promtail, Grafana) deploys to a separate `observability` Kubernetes namespace via `minikube/observability/setup.sh`.

---

## 12-Factor Configuration

All settings are provided via environment variables (prefix `IMG_`) using **pydantic-settings**, ensuring type-safe, validated configuration:
@@ -145,6 +183,9 @@ All settings are provided via environment variables (prefix `IMG_`) using **pyda
| `IMG_RATE_LIMIT_READ_MAX` | `60` | Max read requests per window per IP |
| `IMG_RATE_LIMIT_READ_WINDOW` | `60` | Read rate limit window (seconds) |
| `IMG_DEBUG` | `false` | Enable debug logging |
| `IMG_OTEL_ENABLED` | `false` | Enable OpenTelemetry instrumentation |
| `IMG_OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint for trace export |
| `IMG_OTEL_SERVICE_NAME` | `image-processing-service` | Service name in traces and metrics |
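
As a rough illustration of how the three `IMG_OTEL_*` settings and their defaults resolve, the sketch below reads them from an environment mapping using only the standard library. The project itself uses pydantic-settings; `OtelSettings`, `from_env`, and the accepted boolean spellings are assumptions made for this example.

```python
import os
from dataclasses import dataclass

# Defaults as listed in the configuration table.
DEFAULT_ENDPOINT = "http://localhost:4317"
DEFAULT_SERVICE_NAME = "image-processing-service"


def _parse_bool(raw):
    # Assumed truthy spellings; pydantic-settings has its own parsing rules.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class OtelSettings:
    """Hypothetical stand-in for the OpenTelemetry part of the settings."""

    enabled: bool = False
    exporter_otlp_endpoint: str = DEFAULT_ENDPOINT
    service_name: str = DEFAULT_SERVICE_NAME

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        return cls(
            enabled=_parse_bool(env.get("IMG_OTEL_ENABLED", "false")),
            exporter_otlp_endpoint=env.get(
                "IMG_OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINT
            ),
            service_name=env.get("IMG_OTEL_SERVICE_NAME", DEFAULT_SERVICE_NAME),
        )


# Values matching the Minikube ConfigMap in this PR.
minikube = OtelSettings.from_env({
    "IMG_OTEL_ENABLED": "true",
    "IMG_OTEL_EXPORTER_OTLP_ENDPOINT": "http://tempo.observability.svc.cluster.local:4317",
})
print(minikube)
```

With no variables set, the result equals `OtelSettings()`, so observability stays off unless it is explicitly enabled.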

---

@@ -188,6 +229,8 @@ Test tooling: pytest, pytest-asyncio (auto mode), httpx, aiosqlite (in-memory SQ
| Multi-Stage Docker | Builder/runtime separation for minimal production images |
| Horizontal Autoscaling | Kubernetes HPA scales pods based on CPU and memory metrics |
| 12-Factor Config | Type-safe environment variables via pydantic-settings |
| OpenTelemetry | Distributed tracing, metrics, and log correlation across services |
| RED Metrics | Rate, Errors, Duration monitoring via custom middleware |
| FastAPI DI | Routes depend on use case abstractions, not concrete implementations |

---
@@ -205,4 +248,5 @@ Test tooling: pytest, pytest-asyncio (auto mode), httpx, aiosqlite (in-memory SQ
| **Orchestration** | Kubernetes, Minikube (local demo) |
| **Testing** | pytest, pytest-asyncio, httpx, aiosqlite |
| **Code Quality** | ruff (linter/formatter), mypy (type checker) |
| **Observability** | OpenTelemetry SDK, Prometheus, Grafana, Tempo, Loki, Promtail |
| **Migrations** | Alembic |
15 changes: 15 additions & 0 deletions README.md
@@ -43,6 +43,16 @@ cd minikube && ./setup.sh # deploy full stack

See [minikube/README.md](minikube/README.md) for details.

### Observability Stack (Minikube)

```bash
cd minikube/observability && ./setup.sh # deploy Prometheus, Tempo, Loki, Grafana
minikube service grafana --namespace=observability # open Grafana
./teardown.sh # clean up
```

See [minikube/observability/README.md](minikube/observability/README.md) for dashboards and configuration.

### Kubernetes (Production)

```bash
@@ -99,6 +109,9 @@ All settings via environment variables (prefix `IMG_`), validated by [pydantic-s
| `IMG_RATE_LIMIT_READ_MAX` | `60` | Max read requests per window per IP |
| `IMG_RATE_LIMIT_READ_WINDOW` | `60` | Read rate limit window (seconds) |
| `IMG_DEBUG` | `false` | Enable debug logging |
| `IMG_OTEL_ENABLED` | `false` | Enable OpenTelemetry instrumentation |
| `IMG_OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint for trace export |
| `IMG_OTEL_SERVICE_NAME` | `image-processing-service` | Service name in traces and metrics |

## Project Structure

@@ -109,11 +122,13 @@ src/
├── domain/ # Entities & ports (zero external deps)
├── application/ # Use cases & DTOs
├── infrastructure/ # Adapters (PostgreSQL, Pillow, filesystem)
│ └── observability/ # OpenTelemetry setup, metrics, middleware
└── presentation/ # FastAPI routes, schemas, middleware

cpp/ # Optional C++ resize module (pybind11)
k8s/ # Kubernetes manifests (Deployment, HPA, PVC, …)
minikube/ # Local K8s demo scripts
└── observability/ # Prometheus, Tempo, Loki, Grafana manifests
tests/ # tests across all architecture layers
```

3 changes: 3 additions & 0 deletions minikube/01-configmap.yaml
@@ -13,3 +13,6 @@ data:
  IMG_DB_MAX_OVERFLOW: "10"
  IMG_RETENTION_BATCH_SIZE: "50"
  IMG_DEBUG: "true"
  IMG_OTEL_ENABLED: "true"
  IMG_OTEL_EXPORTER_OTLP_ENDPOINT: "http://tempo.observability.svc.cluster.local:4317"
  IMG_OTEL_SERVICE_NAME: "image-processing-service"
3 changes: 2 additions & 1 deletion minikube/05-service.yaml
@@ -8,7 +8,8 @@ spec:
   selector:
     app: image-service
   ports:
-    - port: 80
+    - name: http
+      port: 80
       targetPort: 8000
       nodePort: 30080
       protocol: TCP
4 changes: 4 additions & 0 deletions minikube/observability/00-namespace.yaml
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: observability
113 changes: 113 additions & 0 deletions minikube/observability/01-prometheus.yaml
@@ -0,0 +1,113 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: observability
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: "image-service"
        metrics_path: /metrics
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - cv-platform
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: image-service
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            action: keep
            regex: http
      # Fallback static config if K8s SD is not available
      - job_name: "image-service-static"
        metrics_path: /metrics
        static_configs:
          - targets: ["image-service.cv-platform.svc.cluster.local:80"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: default
    namespace: observability
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: observability
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention.time=7d"
            - "--web.enable-remote-write-receiver"
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: observability
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
  type: ClusterIP
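
The two `keep` rules in the `image-service` scrape job can be read as successive filters over the discovered endpoints. The sketch below imitates that behaviour in Python. It assumes Prometheus's documented rule that relabel regexes are anchored at both ends (Prometheus uses RE2, while this sketch uses Python's `re`, which behaves the same for these literal patterns). The `keep` helper and the discovered targets are hypothetical.

```python
import re


def keep(targets, source_label, regex):
    """Keep targets whose label value fully matches the regex; drop the rest."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical endpoints discovered in the cv-platform namespace.
discovered = [
    {
        "__address__": "10.0.0.5:8000",
        "__meta_kubernetes_service_name": "image-service",
        "__meta_kubernetes_endpoint_port_name": "http",
    },
    {
        "__address__": "10.0.0.9:5432",
        "__meta_kubernetes_service_name": "image-service-db",
        "__meta_kubernetes_endpoint_port_name": "http",
    },
    {
        "__address__": "10.0.0.5:9000",
        "__meta_kubernetes_service_name": "image-service",
        "__meta_kubernetes_endpoint_port_name": "",  # unnamed port
    },
]

# Apply the rules in the same order as relabel_configs.
scraped = keep(discovered, "__meta_kubernetes_service_name", "image-service")
scraped = keep(scraped, "__meta_kubernetes_endpoint_port_name", "http")
print([t["__address__"] for t in scraped])
```

Under these rules an endpoint whose port has no name is dropped by the second filter, which is presumably why `minikube/05-service.yaml` now names the port `http`.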