fix(failover): keep allocation if VM missing from postgres but on hyper#909
fix(failover): keep allocation if VM missing from postgres but on hyper#909umswmayj wants to merge 1 commit into
Conversation
…ll on hypervisor When the VM source (postgres) is missing data but VMs are still active on hypervisors (e.g. after a postgres data loss), the failover controller previously treated every allocated VM as 'no longer exists' and wiped the allocations. Empty reservations were then deleted, causing total loss of failover coverage despite the VMs still running. Cross-check hv1.HypervisorList.Status.Instances before removing a VM from allocations in reconcileRemoveInvalidVMFromReservations. If the VM is still reported active on any known hypervisor, keep the allocation and log a warning. The hypervisor-mismatch branch is intentionally left unchanged: it is already governed by the TrustHypervisorLocation flag. Adds unit tests covering: - VM missing from VM source but still on hypervisor (allocation kept) - VM source completely empty while VMs still on hypervisors (all reservations preserved -- the exact incident scenario) - Mixed: postgres-missing-but-on-HV vs. truly gone (only truly gone removed) - Hypervisor mismatch is not suppressed by the safeguard
📝 WalkthroughWalkthroughThe PR adds a safeguard to failover reservation reconciliation that prevents mass deletion of VM allocations when the VM source data is temporarily unavailable. It collects active VM presence from hypervisor operator status, uses this as a cross-check in the removal logic, and includes comprehensive test coverage for both preservation and removal scenarios. ChangesFailover Reservation Safeguard
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/scheduling/reservations/failover/controller.go (1)
443-446: ⚡ Quick winReduce log volume for safeguarded allocations.
This branch can execute on every reconcile while the VM source is degraded;
Infoper VM can create sustained log noise. PreferV(1)(or aggregated logging) here.Proposed change
- logger.Info("keeping VM allocation despite missing from VM source: still active on hypervisor", + logger.V(1).Info("keeping VM allocation despite missing from VM source: still active on hypervisor", "vmUUID", vmUUID, "reservation", res.Name, "allocatedHypervisor", allocatedHypervisor, "hypervisor", hv)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/failover/controller.go` around lines 443 - 446, The logger.Info call that emits per-VM messages when keeping allocations during a degraded VM source (the block updating updatedAllocations[vmUUID] and logging "keeping VM allocation...") is too noisy; change it to a lower verbosity level (e.g. logger.V(1).Info) or aggregate/summarize so it doesn't log every reconcile for each VM (target the logger.Info -> logger.V(1).Info for the "keeping VM allocation despite missing from VM source..." message referencing vmUUID, allocatedHypervisor, hv and the updatedAllocations update).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/scheduling/reservations/failover/controller.go`:
- Around line 443-446: The logger.Info call that emits per-VM messages when
keeping allocations during a degraded VM source (the block updating
updatedAllocations[vmUUID] and logging "keeping VM allocation...") is too noisy;
change it to a lower verbosity level (e.g. logger.V(1).Info) or
aggregate/summarize so it doesn't log every reconcile for each VM (target the
logger.Info -> logger.V(1).Info for the "keeping VM allocation despite missing
from VM source..." message referencing vmUUID, allocatedHypervisor, hv and the
updatedAllocations update).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1662335b-4272-4b70-adfd-ef4565dfbd51
📒 Files selected for processing (2)
internal/scheduling/reservations/failover/controller.gointernal/scheduling/reservations/failover/controller_test.go
Test Coverage ReportTest Coverage 📊: 69.5% |
When the VM source (postgres) is missing data but VMs are still active on hypervisors (e.g. after a postgres data loss), the failover controller previously treated every allocated VM as 'no longer exists' and wiped the allocations. Empty reservations were then deleted, causing total loss of failover coverage despite the VMs still running.
Cross-check hv1.HypervisorList.Status.Instances before removing a VM from allocations in reconcileRemoveInvalidVMFromReservations. If the VM is still reported active on any known hypervisor, keep the allocation and log a warning. The hypervisor-mismatch branch is intentionally left unchanged: it is already governed by the TrustHypervisorLocation flag.
Adds unit tests covering: