Fix proxy sigterm handling in amqp-proxy, enable CMR and override terminationGracePeriodSeconds#615
Open
lmiccini wants to merge 3 commits into
Open
Conversation
The Python AMQP proxy sidecar only handled KeyboardInterrupt (SIGINT), not SIGTERM. Since kubelet sends SIGTERM to stop containers, the proxy hung indefinitely preventing pod termination. Register a SIGTERM handler that cancels all proxy tasks and catches the resulting CancelledError during shutdown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Quorum queues can end up under-replicated when clients declare queues while not all RabbitMQ nodes are available (e.g. during rolling restarts or parallel pod startup after a data-wipe upgrade). RabbitMQ does not retroactively grow membership, so the missing nodes stay missing. Enable CMR in the RabbitMQ 4.x config for multi-node clusters so that quorum queues are automatically grown to the target group size. CMR's trigger_interval (10s default) fires when a node joins, covering rolling restarts. The periodic interval (60min default) acts as a background safety net. auto_remove is set to false so nodes temporarily down during restarts are not shrunk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The upstream rabbitmq-cluster-operator defaults to 604800s (7 days) for terminationGracePeriodSeconds. Combined with containers that do not handle SIGTERM properly, this causes pods to stay in Terminating state for days. Add a defaulting webhook that sets the value to 60s for existing clusters still carrying the old default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lmiccini The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR addresses three issues/shortcomings that emerged during the rabbitmq upgrade testing:
SIGTERM to stop containers, the proxy hung indefinitely preventing pod termination.
-> this is needed so that rolling the pods after annotating the cluster to remove the proxy would work without having to force-delete the pods.
declare queues while not all nodes are available (rolling restarts, parallel pod startup after data-wipe
upgrades). RabbitMQ does not retroactively grow membership. CMR (RabbitMQ 4.x) automatically grows quorum queues to the target group size — the trigger fires within 10s of a node joining, with a 60-minute periodic safety
net.
-> this minimizes the chances that the prestophook would wait indefinitely for all the queues to be replicated across the cluster before draining the node during stop or rolling restarts.
cluster-operator causes pods to stay in Terminating state for days when combined with containers that don't
handle SIGTERM. The defaulting webhook migrates existing clusters still carrying the old value; custom values
are preserved.
self explanatory, we want to align existing clusters to the new default to avoid getting stuck indefinitely if we hit any of the situations above ^