Summary
On self-hosted Hatchet, the engine ('hatchet-grpc' deployment) gets stuck in a continuous error loop emitting:
ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint \"v1_runs_olap_<DATE>_completed_pkey\" (SQLSTATE 23505)" service=olap-controller
The error fires every ~2s (the polling interval). The pod saturates its CPU handling the error spam. The liveness probe (HTTP on /live, default 1s timeout, 5s period) starts failing → the pod gets killed → on restart the same row is reprocessed → the loop repeats hourly.
Setup
Symptom
$ kubectl get pods -n hatchet -l app.kubernetes.io/name=hatchet-grpc
NAME               READY   STATUS    RESTARTS   AGE
hatchet-grpc-<X>   0/1     Running   23         5d
hatchet-grpc-<Y>   0/1     Running   22         5d
hatchet-grpc-<Z>   0/1     Running   17         5d
Restart cause: Container grpc failed liveness probe, will be restarted. The liveness probe failure is a SECONDARY effect; the root cause is the olap-controller error spam saturating the CPU. See the events check after this line.
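For anyone reproducing, the restart reason above comes from the pod events and can be re-checked with, e.g.:

$ kubectl get events -n hatchet --field-selector involvedObject.name=hatchet-grpc-<X>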
RabbitMQ state
All queues 0 messages (rules out replay loop).
Root cause hypothesis
Some row in a Postgres outbox/queue/work table is fed to the olap-controller, which tries to INSERT a completion row. The target row already exists in v1_runs_olap_<DATE>_completed, so the INSERT fails with a duplicate-key error. The transaction rolls back, the source row stays in the queue, and the cycle retries forever.
Suggested fix: olap-controller's INSERT for completion records should be ON CONFLICT DO NOTHING (or UPDATE) rather than failing on duplicate.
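For illustration, a minimal sketch of the idempotent insert we have in mind. Only the table/constraint name comes from the error; the column list is an assumption, not Hatchet's actual schema:

-- Sketch only: columns (tenant_id, task_id, completed_at) are assumed,
-- not taken from Hatchet's schema. The point is the conflict clause:
-- a re-delivered completion row becomes a no-op instead of aborting
-- the whole transaction and stranding the source row in the queue.
INSERT INTO v1_runs_olap_20260413_completed (tenant_id, task_id, completed_at)
VALUES ($1, $2, $3)
ON CONFLICT ON CONSTRAINT v1_runs_olap_20260413_completed_pkey DO NOTHING;
-- or: ... DO UPDATE SET completed_at = EXCLUDED.completed_at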
Workaround (kalevala-side)
We're attempting to relax the liveness probe (5s timeout, 30s period) so the pod survives the CPU saturation without being killed. This breaks the restart cycle but doesn't fix the underlying error spam; the proper fix is upstream. A sketch of the probe settings follows.
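Concretely, something like this on the grpc container. The timeout/period values are our workaround numbers, not Hatchet defaults, and <health-port> is a placeholder for whatever port the chart wires the /live endpoint to:

# Relaxed probe so the pod rides out CPU saturation instead of being
# killed mid-error-loop. Values are our choices, not Hatchet defaults.
livenessProbe:
  httpGet:
    path: /live
    port: <health-port>   # placeholder; use the chart's actual health port
  timeoutSeconds: 5       # up from the default 1s
  periodSeconds: 30       # up from the default 5s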
References
Logs
2026-04-27T07:46:43.253Z ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint \"v1_runs_olap_20260413_completed_pkey\" (SQLSTATE 23505)" service=olap-controller
2026-04-27T07:46:45.250Z ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint \"v1_runs_olap_20260413_completed_pkey\" (SQLSTATE 23505)" service=olap-controller
[... every 2 seconds, indefinitely ...]
Happy to provide more context (full pod logs, schema dumps) if useful.