[SAP] Implement graceful shutdown for cinder services #338
Open
hemna wants to merge 1 commit into
Three-phase graceful shutdown that allows in-flight volume and backup
operations to complete before the pod exits during Kubernetes rolling
updates. Covers both cinder-volume and cinder-backup services.
The fix: install SIG_IGN for SIGTERM/SIGINT/SIGHUP at the start of
Service.stop(). cinder-volume runs as N forked child processes (one per
backend) under oslo_service.ProcessLauncher. On SIGTERM, oslo_service's
child handler calls SignalHandler.clear() which resets all handlers to
SIG_DFL. The parent then sends a second SIGTERM via os.kill(child_pid,
SIGTERM). Without SIG_IGN, that second signal hits the default handler
and the child terminates immediately, even though it is still inside
pool.waitall() waiting for in-flight RPC handlers to finish.
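
A minimal sketch of the signal part of the fix, assuming a POSIX host.
DrainingService and ignore_shutdown_signals are hypothetical names used
for illustration only; they are not the identifiers in this change, and
the real Service.stop() does considerably more.

```python
import signal

# The three signals named in the description above.
SHUTDOWN_SIGNALS = (signal.SIGTERM, signal.SIGINT, signal.SIGHUP)


def ignore_shutdown_signals():
    """Install SIG_IGN for each shutdown signal; return the old handlers."""
    previous = {}
    for signum in SHUTDOWN_SIGNALS:
        # signal.signal() returns the handler that was installed before.
        previous[signum] = signal.signal(signum, signal.SIG_IGN)
    return previous


class DrainingService:
    """Hypothetical stand-in for the service class (not cinder's Service)."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        # First step of stop(): a repeated SIGTERM from the parent must not
        # kill the process while it is still draining in-flight work.
        ignore_shutdown_signals()
        self.stopped = True
```

signal.signal() may only be called from the main thread, so the sketch
assumes stop() runs there.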
How it works:
Phase 1: Skip consumer cancel. We do NOT call rpcserver.stop() because
it triggers an eventlet socket race ("simultaneous read on fileno")
that disrupts outbound HTTP/RPC connections used by in-flight ops.
Broker-side deregistration happens automatically at process exit.
During the gap, the service heartbeat stops (the scheduler reroutes
within service_down_time) and @reject_if_draining rejects late arrivals.
Phase 2: Block in pool.waitall() until all in-flight RPC handler
greenthreads complete their operations.
Phase 3: Stop coordination, call super().stop(), and clean up the
threadpool executor. The process then exits cleanly.
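
The ordering of the three phases can be sketched as follows. This is an
illustration under stated assumptions, not the code in this change:
FakePool is a thread-based stand-in for the eventlet GreenPool, and
SketchService, draining and events are hypothetical names.

```python
import threading


class FakePool:
    """Thread-based stand-in for the GreenPool (assumption for the sketch)."""

    def __init__(self):
        self._threads = []

    def spawn(self, func, *args):
        thread = threading.Thread(target=func, args=args)
        thread.start()
        self._threads.append(thread)

    def waitall(self):
        # Block until every spawned handler has returned.
        for thread in self._threads:
            thread.join()


class SketchService:
    """Hypothetical service showing only the order of the phases."""

    def __init__(self):
        self.pool = FakePool()
        self.draining = False
        self.events = []

    def stop(self):
        # Phase 1: mark the service as draining. The RPC server is
        # deliberately not stopped here.
        self.draining = True
        self.events.append("draining")
        # Phase 2: wait for in-flight handlers to finish.
        self.pool.waitall()
        self.events.append("drained")
        # Phase 3: teardown (coordination, parent stop(), executor cleanup).
        self.events.append("cleaned up")
```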
Supporting mechanisms:
- Worker entry heartbeat (set_workers decorator): touches worker DB
entries every 10s during operations, preventing new pod init_host
do_cleanup from resetting in-flight volumes to error. Uses short
interruptible sleeps (0.1s polling) for fast shutdown response.
- do_cleanup freshness check: skips worker entries updated within
service_down_time, only cleans up truly stale entries.
- Backup restore heartbeat: touches backup.updated_at every 10s,
preventing new backup pod from triggering BackupRestoreCancel.
- Backup _detach_device no-reraise: if detach fails during shutdown,
log and continue. Data integrity preserved.
- reject_if_draining decorator: unconditionally rejects new RPC calls
on a draining service so scheduler routes to healthy backends.
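
Three of the mechanisms above can be sketched in a few lines each. The
names ServiceDraining, heartbeat_loop and is_stale, the draining
attribute, and the use of threading.Event are assumptions made for this
illustration; the actual change runs under eventlet and may be
structured differently.

```python
import functools
import threading
from datetime import datetime, timedelta, timezone


class ServiceDraining(Exception):
    """Hypothetical exception for calls that arrive during drain."""


def reject_if_draining(func):
    """Reject the call when the service instance is marked as draining."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if getattr(self, "draining", False):
            raise ServiceDraining(func.__name__)
        return func(self, *args, **kwargs)
    return wrapper


def heartbeat_loop(touch, stop_event, interval=10.0, poll=0.1):
    """Call touch() every `interval` seconds until stop_event is set.

    Waiting in `poll`-sized slices keeps shutdown latency close to `poll`
    instead of up to a full `interval`.
    """
    while not stop_event.is_set():
        touch()
        waited = 0.0
        # Event.wait() returns True as soon as the event is set.
        while waited < interval and not stop_event.wait(poll):
            waited += poll


def is_stale(updated_at, service_down_time, now=None):
    """True if an entry was last touched more than service_down_time ago."""
    now = now or datetime.now(timezone.utc)
    return now - updated_at > timedelta(seconds=service_down_time)
```

With this shape, a cleanup pass would skip any worker entry for which
is_stale() returns False, since a live heartbeat is still touching it.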
Test adaptations:
- Base test classes (BaseVolumeTestCase, BaseBackupTest) drain the
GreenPool in addCleanup to prevent greenthread leaks.
- test_volume_cleanup sets service_down_time=0 and runs threadpool
tasks synchronously (mock_object) to avoid race conditions.
- test_admin_actions tearDown explicitly stops fake RPC servers to
deregister stale endpoints from oslo.messaging's fake transport.
- test_service assertions updated to reflect intentional skip of
rpcserver.stop() in graceful shutdown, and signal flag reset.
- test_backup_messages updated for _detach_device no-reraise behavior.
Requires (separate changes):
- dumb-init --single-child on container commands
- terminationGracePeriodSeconds: 900 on pod spec
Change-Id: Icdd28affc73fd34491b656a68410dce8e46264d4
Cherry-pick of PR #314 (2023.1-m3) adapted for the 2025.1-m3 codebase.
See PR #314 for full description, test results, and design details.
Conflict resolutions for 2025.1-m3:
- _cleanup_one_snapshot method (upstream fix 7f81985)
- image_snap=None parameter on reimage() (upstream fix 25a2970)
Both are legitimate upstream differences between the 2023.1 and 2025.1
releases.