[SAP] Implement graceful shutdown for cinder services by hemna · Pull Request #338 · sapcc/cinder

hemna · 2026-05-28T15:13:37Z

Cherry-pick of PR #314 (2023.1-m3) adapted for the 2025.1-m3 codebase.

See PR #314 for full description, test results, and design details.

Conflict resolutions for 2025.1-m3:

- Preserved _cleanup_one_snapshot method (upstream fix 7f81985)
- Preserved image_snap=None parameter on reimage() (upstream fix 25a2970)

Both are legitimate upstream differences between 2023.1 and 2025.1 releases.

Three-phase graceful shutdown that allows in-flight volume and backup operations to complete before the pod exits during Kubernetes rolling updates. Covers both cinder-volume and cinder-backup services.

The fix: install SIG_IGN for SIGTERM/SIGINT/SIGHUP at the start of Service.stop().

cinder-volume runs as N forked child processes (one per backend) under oslo_service.ProcessLauncher. On SIGTERM, oslo_service's child handler calls SignalHandler.clear() which resets all handlers to SIG_DFL. The parent then sends a second SIGTERM via os.kill(child_pid, SIGTERM). Without SIG_IGN, the child terminates immediately, even though it's still inside pool.waitall() waiting for in-flight RPC handlers to finish.

How it works:

Phase 1: Skip consumer cancel. We do NOT call rpcserver.stop() because it triggers an eventlet socket race ("simultaneous read on fileno") that disrupts outbound HTTP/RPC connections used by in-flight ops. Broker-side deregistration happens automatically at process exit. During the gap, heartbeat stops (scheduler reroutes within service_down_time) and @reject_if_draining rejects late arrivals.

Phase 2: Block in pool.waitall() until all in-flight RPC handler greenthreads complete their operations.

Phase 3: Stop coordination, call super().stop(), cleanup threadpool executor. Process exits cleanly.

Supporting mechanisms:

- Worker entry heartbeat (set_workers decorator): touches worker DB entries every 10s during operations, preventing new pod init_host do_cleanup from resetting in-flight volumes to error. Uses short interruptible sleeps (0.1s polling) for fast shutdown response.
- do_cleanup freshness check: skips worker entries updated within service_down_time, only cleans up truly stale entries.
- Backup restore heartbeat: touches backup.updated_at every 10s, preventing new backup pod from triggering BackupRestoreCancel.
- Backup _detach_device no-reraise: if detach fails during shutdown, log and continue. Data integrity preserved.
- reject_if_draining decorator: unconditionally rejects new RPC calls on a draining service so scheduler routes to healthy backends.

Test adaptations:

- Base test classes (BaseVolumeTestCase, BaseBackupTest) drain the GreenPool in addCleanup to prevent greenthread leaks.
- test_volume_cleanup sets service_down_time=0 and runs threadpool tasks synchronously (mock_object) to avoid race conditions.
- test_admin_actions tearDown explicitly stops fake RPC servers to deregister stale endpoints from oslo.messaging's fake transport.
- test_service assertions updated to reflect intentional skip of rpcserver.stop() in graceful shutdown, and signal flag reset.
- test_backup_messages updated for _detach_device no-reraise behavior.

Requires (separate changes):

- dumb-init --single-child on container commands
- terminationGracePeriodSeconds: 900 on pod spec

Change-Id: Icdd28affc73fd34491b656a68410dce8e46264d4
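For readers who want a concrete picture of two of the mechanisms described above, here is a minimal sketch of ignoring termination signals at the start of stop() and rejecting late calls with a draining decorator. It is not the code from this PR: SketchService, ServiceDraining, the draining attribute, and the create_volume body are hypothetical names chosen for illustration, and the real decorator may use a different flag and exception type.

```python
import functools
import signal


class ServiceDraining(Exception):
    """Hypothetical error raised when a draining service gets a new call."""


def reject_if_draining(func):
    """Reject calls once the owning object has set its draining flag."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # 'draining' is an assumed attribute name, not taken from the PR.
        if getattr(self, "draining", False):
            raise ServiceDraining(
                f"{func.__name__} rejected: service is draining")
        return func(self, *args, **kwargs)
    return wrapper


class SketchService:
    """Stand-in for a service; not the real cinder Service class."""

    def __init__(self):
        self.draining = False

    def stop(self):
        # Ignore further termination signals so a second SIGTERM from the
        # parent cannot kill the process while in-flight work drains.
        # signal.signal() must be called from the main thread.
        for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
            signal.signal(sig, signal.SIG_IGN)
        self.draining = True
        # The real service would now block in pool.waitall() (Phase 2).

    @reject_if_draining
    def create_volume(self, name):
        return f"created {name}"
```

In this sketch, a call made before stop() succeeds and a call made afterwards raises ServiceDraining, which mirrors the intent that the scheduler routes new work to healthy backends while existing work finishes.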

hemna wants to merge 1 commit into stable/2025.1-m3 from graceful-shutdown-2025
