test: wrap vfio-pci runCommandOnConfigDaemon calls in Eventually (CNF-24222) #1237
sshnaidm wants to merge 1 commit
…-24222) After creating a vfio-pci SriovNetworkNodePolicy, the single worker node reboots/reconfigures its hardware, causing a transient EOF error when the API server tries to exec into the sriov-network-config-daemon pod via the kubelet (error dialing backend: EOF, HTTP 500). Wrap both runCommandOnConfigDaemon calls in the vfio-pci policy test in Eventually blocks with a 2-minute timeout and 5-second retry interval so the test can retry on transient connection failures after the node reboot, rather than failing immediately. This is the same race condition pattern as fixed in CNF-24178.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: sshnaidm.
@sshnaidm: all tests passed!
Problem
In the vfio-pci node policy test, after creating a SriovNetworkNodePolicy with the vfio-pci driver, the single worker node reboots or reconfigures its hardware. The test then calls runCommandOnConfigDaemon to validate PF info on the host, but because the node has just rebooted, the connection from the API server to the kubelet and pod has not fully recovered, and the exec fails with "error dialing backend: EOF" (HTTP 500).
This is a race condition (same pattern as CNF-24178): the test attempts to exec into the config daemon pod before the apiserver-to-kubelet connections are re-established after the SRIOV-induced node disruption.
Fix
Wrap both runCommandOnConfigDaemon calls in the vfio-pci test in Eventually blocks with a 2-minute timeout and 5-second retry interval. This allows the test to retry on transient "error dialing backend: EOF" errors rather than failing immediately.
The pattern matches the existing fix in test_sriov_operator.go:1530-1538, which already uses this approach for similar post-reboot checks.
Related