olap-controller: duplicate-key (SQLSTATE 23505) on v1_runs_olap_<date>_completed_pkey causes restart cycle #3737

@404prefrontalcortexnotfound

Description

Summary

On self-hosted Hatchet, the engine ('hatchet-grpc' deployment) gets stuck in a continuous error loop emitting:

ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint
\"v1_runs_olap_<DATE>_completed_pkey\" (SQLSTATE 23505)" service=olap-controller

The error fires every ~2s (the polling interval), and the pod saturates its CPU handling the error spam. The liveness probe (HTTP /live, default 1s timeout, 5s period) starts failing → the pod gets killed → on restart the same row is reprocessed → the loop repeats hourly.

Setup

Symptom

$ kubectl get pods -n hatchet -l app.kubernetes.io/name=hatchet-grpc
hatchet-grpc-<X>   0/1   Running   23 restarts   5d uptime
hatchet-grpc-<Y>   0/1   Running   22 restarts   5d uptime
hatchet-grpc-<Z>   0/1   Running   17 restarts   5d uptime

Restart cause: Container grpc failed liveness probe, will be restarted. The liveness probe failure is a SECONDARY effect; the root cause is olap-controller error spam saturating the CPU.

RabbitMQ state

All queues hold 0 messages, which rules out a RabbitMQ replay loop.

Root cause hypothesis

Some row in a Postgres outbox/queue/work table is fed to olap-controller, which tries to INSERT a completion row. The target row already exists in v1_runs_olap_<DATE>_completed, so the INSERT fails with a duplicate-key error. The transaction rolls back, the source row stays in the queue, and the controller retries it on every poll, forever.

Suggested fix: olap-controller's INSERT for completion records should use ON CONFLICT DO NOTHING (or ON CONFLICT ... DO UPDATE) rather than failing on a duplicate.
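
To illustrate the hypothesis and the suggested fix, here is a minimal sketch that reproduces the pattern with Python's built-in sqlite3 instead of Postgres. The table names (`pending`, `completed`) and the `process_once` function are hypothetical stand-ins; the real olap-controller queries and schema are not known to me, and the sketch assumes SQLite 3.24 or newer for the ON CONFLICT clause.

```python
import sqlite3

# Hypothetical statements; the actual olap-controller SQL is not shown in this issue.
STRICT = "INSERT INTO completed (id, status) VALUES (?, ?)"
IDEMPOTENT = STRICT + " ON CONFLICT (id) DO NOTHING"


def make_db():
    """Build an in-memory DB where the completion row already exists."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE completed (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE pending (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO completed VALUES (1, 'COMPLETED')")
    conn.execute("INSERT INTO pending VALUES (1, 'COMPLETED')")
    conn.commit()
    return conn


def process_once(conn, insert_sql):
    """One polling pass: copy pending rows to completed and clear them atomically."""
    rows = conn.execute("SELECT id, status FROM pending").fetchall()
    try:
        for row in rows:
            conn.execute(insert_sql, row)
        conn.execute("DELETE FROM pending")
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Duplicate key: the whole pass rolls back, so the pending row survives.
        conn.rollback()
        return False


def pending_count(conn):
    return conn.execute("SELECT COUNT(*) FROM pending").fetchone()[0]


if __name__ == "__main__":
    conn = make_db()
    print(process_once(conn, STRICT), pending_count(conn))      # stuck row remains
    print(process_once(conn, IDEMPOTENT), pending_count(conn))  # row drains
```

With the strict INSERT, every pass fails and the pending row never drains, which matches the error repeating on each poll. With the idempotent INSERT, the duplicate is skipped and the pending row is cleared on the first pass.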

Workaround (kalevala-side)

We're attempting to relax the liveness probe (5s timeout, 30s period) so the pod survives the CPU saturation without being killed. This stops the cycle but doesn't fix the underlying error spam. Proper fix is upstream.
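
For anyone wanting to try the same workaround, a rough sketch of the patch we have in mind follows. It builds a strategic merge patch for the container's liveness probe. The container name `grpc` comes from the restart message above; the helper name `liveness_patch` and the script name used below are ours, and we are assuming the Deployment is named hatchet-grpc in the hatchet namespace.

```python
import json


def liveness_patch(container="grpc", timeout_seconds=5, period_seconds=30):
    """Return a strategic merge patch that relaxes one container's liveness probe.

    Only timeoutSeconds and periodSeconds are set; the existing probe handler
    (HTTP GET on /live) is left as deployed.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,  # containers are merged by name
                            "livenessProbe": {
                                "timeoutSeconds": timeout_seconds,
                                "periodSeconds": period_seconds,
                            },
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(liveness_patch()))
```

Saved as, say, relax_probe.py, the output could be applied with `kubectl patch deployment hatchet-grpc -n hatchet --patch "$(python relax_probe.py)"`. If the deployment is managed by a Helm chart, the change is likely to be overwritten on the next upgrade unless the same values are set in the chart.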

References

Logs

2026-04-27T07:46:43.253Z ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint \"v1_runs_olap_20260413_completed_pkey\" (SQLSTATE 23505)" service=olap-controller
2026-04-27T07:46:45.250Z ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint \"v1_runs_olap_20260413_completed_pkey\" (SQLSTATE 23505)" service=olap-controller
[... every 2 seconds, indefinitely ...]

Happy to provide more context (full pod logs, schema dumps) if useful.
