olap-controller: duplicate-key (SQLSTATE 23505) on v1_runs_olap_<date>_completed_pkey causes restart cycle #3737

## Summary

On self-hosted Hatchet, the engine (`hatchet-grpc` deployment) gets stuck in a continuous error loop, emitting:

```
ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint
\"v1_runs_olap_<DATE>_completed_pkey\" (SQLSTATE 23505)" service=olap-controller
```

The error fires roughly every 2s (the polling interval), and the pod saturates its CPU handling the error spam. The liveness probe (`/live` over HTTP, default 1s timeout, 5s period) then starts failing and the pod is killed. On restart the same row is reprocessed, so the loop repeats hourly.

## Setup

- Hatchet engine v0.83.27 (chart 0.10.4)
- Self-hosted Postgres 16 (CNPG)
- Single tenant, low workflow volume
- Internal tenant exists and has a `controllerPartitionId`, which rules out https://github.com/hatchet-dev/hatchet/issues/3634

## Symptom

```
$ kubectl get pods -n hatchet -l app.kubernetes.io/name=hatchet-grpc
NAME               READY   STATUS    RESTARTS   AGE
hatchet-grpc-<X>   0/1     Running   23         5d
hatchet-grpc-<Y>   0/1     Running   22         5d
hatchet-grpc-<Z>   0/1     Running   17         5d
```

Restart cause: `Container grpc failed liveness probe, will be restarted`. The liveness probe failure is a secondary effect; the root cause is the olap-controller error spam saturating the CPU.

## RabbitMQ state

All queues hold 0 messages, which rules out a message replay loop.

## Root cause hypothesis

Some row in a Postgres outbox/queue/work table is fed to olap-controller, which tries to `INSERT` a completion row. The target row already exists in `v1_runs_olap_<DATE>_completed`, so the `INSERT` fails with a duplicate-key error. The transaction rolls back, the source row stays in the queue, and the controller retries it forever.

Suggested fix: olap-controller's `INSERT` for completion records should use `ON CONFLICT DO NOTHING` (or `ON CONFLICT ... DO UPDATE`) rather than failing on a duplicate.
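
To illustrate the difference, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for Postgres (both accept `ON CONFLICT DO NOTHING` on an `INSERT ... VALUES`). The table `runs_completed` and its columns are hypothetical; this is not Hatchet's actual schema or query, only a demonstration of the behaviour the hypothesis above relies on.

```python
import sqlite3

# Hypothetical stand-in for v1_runs_olap_<DATE>_completed; the real schema
# is not shown in this issue.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs_completed (run_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO runs_completed VALUES ('run-1', 'COMPLETED')")
conn.commit()  # persist the pre-existing row before the experiments below


def insert_plain(run_id: str, status: str) -> bool:
    """Plain INSERT: a duplicate key raises and the transaction rolls back."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO runs_completed VALUES (?, ?)", (run_id, status)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # a caller that simply retries would fail the same way


def insert_idempotent(run_id: str, status: str) -> bool:
    """INSERT ... ON CONFLICT DO NOTHING: a duplicate is skipped, no error."""
    with conn:
        conn.execute(
            "INSERT INTO runs_completed VALUES (?, ?) ON CONFLICT DO NOTHING",
            (run_id, status),
        )
    return True


plain_ok = insert_plain("run-1", "COMPLETED")
idempotent_ok = insert_idempotent("run-1", "COMPLETED")
row_count = conn.execute("SELECT COUNT(*) FROM runs_completed").fetchone()[0]
print(plain_ok, idempotent_ok, row_count)
```

With the idempotent variant the transaction commits, so the source row could be dequeued instead of being retried indefinitely.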

## Workaround (kalevala-side)

We're attempting to relax the liveness probe (5s timeout, 30s period) so the pod survives the CPU saturation without being killed. This should stop the restart cycle but doesn't fix the underlying error spam. The proper fix is upstream.
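
As a rough back-of-the-envelope check of how much slack the relaxed probe buys, the sketch below multiplies the probe period by the failure threshold. The `failure_threshold` of 3 is an assumption (the Kubernetes default); the chart's actual value is not stated in this issue, and the result is only an approximation of how long the pod can stay unresponsive before the kubelet restarts it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    timeout_s: int
    period_s: int
    failure_threshold: int = 3  # assumption: Kubernetes default, not from the chart

    def approx_window_s(self) -> int:
        """Approximate seconds of consecutive failures tolerated before a restart."""
        return self.period_s * self.failure_threshold


before = Probe(timeout_s=1, period_s=5)   # default probe described in the summary
after = Probe(timeout_s=5, period_s=30)   # relaxed probe from this workaround
print(before.approx_window_s(), after.approx_window_s())
```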

## References

- Related but distinct: hatchet-dev/hatchet#3634 (partition scheduler skipping when internal tenant missing)

## Logs

```
2026-04-27T07:46:43.253Z ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint \"v1_runs_olap_20260413_completed_pkey\" (SQLSTATE 23505)" service=olap-controller
2026-04-27T07:46:45.250Z ERR could not update task statuses error="ERROR: duplicate key value violates unique constraint \"v1_runs_olap_20260413_completed_pkey\" (SQLSTATE 23505)" service=olap-controller
[... every 2 seconds, indefinitely ...]
```
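
For anyone wanting to confirm the polling interval from their own logs, here is a small sketch that parses the leading timestamps and computes the gap between consecutive errors. The helper name `parse_ts` is hypothetical, and the sample lines are the two above with the message body shortened.

```python
from datetime import datetime

# The two log lines above, message body shortened for readability.
log_lines = [
    "2026-04-27T07:46:43.253Z ERR could not update task statuses",
    "2026-04-27T07:46:45.250Z ERR could not update task statuses",
]


def parse_ts(line: str) -> datetime:
    """Parse the leading timestamp (first whitespace-delimited field) of a log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


timestamps = [parse_ts(line) for line in log_lines]
# Gap in seconds between each consecutive pair of error lines.
intervals = [
    (later - earlier).total_seconds()
    for earlier, later in zip(timestamps, timestamps[1:])
]
print(intervals)
```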

Happy to provide more context (full pod logs, schema dumps) if useful.
