Skip to content

Fix proxy sigterm handling in amqp-proxy, enable CMR and override terminationGracePeriodSeconds#615

Open
lmiccini wants to merge 3 commits into
openstack-k8s-operators:mainfrom
lmiccini:fix-proxy-sigterm-handling
Open

Fix proxy sigterm handling in amqp-proxy, enable CMR and override terminationGracePeriodSeconds#615
lmiccini wants to merge 3 commits into
openstack-k8s-operators:mainfrom
lmiccini:fix-proxy-sigterm-handling

Conversation

@lmiccini

Copy link
Copy Markdown
Contributor

This PR addresses three issues/shortcomings that emerged during the rabbitmq upgrade testing:

  • Handle SIGTERM in AMQP proxy: The Python proxy sidecar only handled SIGINT, not SIGTERM. Since kubelet sends
    SIGTERM to stop containers, the proxy hung indefinitely preventing pod termination.

-> this is needed so that rolling the pods after annotating the cluster to remove the proxy would work without having to force-delete the pods.

  • Enable Continuous Membership Reconciliation (CMR): Quorum queues can end up under-replicated when clients
    declare queues while not all nodes are available (rolling restarts, parallel pod startup after data-wipe
    upgrades). RabbitMQ does not retroactively grow membership. CMR (RabbitMQ 4.x) automatically grows quorum queues to the target group size — the trigger fires within 10s of a node joining, with a 60-minute periodic safety
    net.

-> this minimizes the chances that the prestophook would wait indefinitely for all the queues to be replicated across the cluster before draining the node during stop or rolling restarts.

  • Migrate terminationGracePeriodSeconds from 604800 to 60: The old 7-day default inherited from the upstream
    cluster-operator causes pods to stay in Terminating state for days when combined with containers that don't
    handle SIGTERM. The defaulting webhook migrates existing clusters still carrying the old value; custom values
    are preserved.

self explanatory, we want to align existing clusters to the new default to avoid getting stuck indefinitely if we hit any of the situations above ^

lmiccini and others added 3 commits June 27, 2026 15:41
The Python AMQP proxy sidecar only handled KeyboardInterrupt (SIGINT),
not SIGTERM. Since kubelet sends SIGTERM to stop containers, the proxy
hung indefinitely preventing pod termination. Register a SIGTERM handler
that cancels all proxy tasks and catches the resulting CancelledError
during shutdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Quorum queues can end up under-replicated when clients declare queues
while not all RabbitMQ nodes are available (e.g. during rolling restarts
or parallel pod startup after a data-wipe upgrade). RabbitMQ does not
retroactively grow membership, so the missing nodes stay missing.

Enable CMR in the RabbitMQ 4.x config for multi-node clusters so that
quorum queues are automatically grown to the target group size. CMR's
trigger_interval (10s default) fires when a node joins, covering rolling
restarts. The periodic interval (60min default) acts as a background
safety net. auto_remove is set to false so nodes temporarily down during
restarts are not shrunk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The upstream rabbitmq-cluster-operator defaults to 604800s (7 days) for
terminationGracePeriodSeconds. Combined with containers that do not
handle SIGTERM properly, this causes pods to stay in Terminating state
for days. Add a defaulting webhook that sets the value to 60s for
existing clusters still carrying the old default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci openshift-ci Bot requested review from abays and dprince June 27, 2026 13:48
@openshift-ci

openshift-ci Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmiccini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lmiccini lmiccini changed the title Fix proxy sigterm handling Fix proxy sigterm handling in amqp-proxy, enable CMR and override terminationGracePeriodSeconds Jun 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant