docs: update Virt*Down runbooks for self-diagnosing alerts#383
Conversation
|
Warning Review limit reached
More reviews will be available in 45 minutes and 32 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR updates four KubeVirt component runbooks (VirtAPIDown, VirtControllerDown, VirtHandlerDown, VirtOperatorDown) to incorporate v1.6+ self-diagnosing alert behavior. Each runbook now documents how diagnostic labels ( Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Acting as a senior engineer: the changes are documentation-only and focused; check label/command accuracy and file line references. Nice, clear updates. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
docs/runbooks/VirtControllerDown.md (1)
87-92: ⚖️ Poor tradeoffVerify mitigation items map to alert root causes.
The mitigation section includes "API server overloaded" and "Networking issues" as potential causes. These should be verifiable from the alert's PromQL logic or documented failure modes.
As per coding guidelines, "Mitigation must be concrete/actionable and map to root causes revealed by the alert logic."
Please verify whether the VirtControllerDown alert definition or its associated monitoring logic accounts for API server load or networking issues as observable failure modes, or if these are general operational considerations beyond the alert's scope.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/runbooks/VirtControllerDown.md` around lines 87 - 92, Check the VirtControllerDown alert and its PromQL/recording rules (search for the alert name "VirtControllerDown" and associated rule files) to confirm whether the query references API server load metrics (e.g., apiserver_request_duration_seconds, apiserver_server_current_inflight_requests) or networking metrics (e.g., kube_network_??, node_network_* or CNI/plugin metrics); if those metrics are not present, remove or reword the "API server overloaded" and "Networking issues" items from the Mitigation section to indicate they are general operational considerations rather than actionable items derived from the alert logic, or alternatively update the alert's rule to include concrete, verifiable selectors/thresholds for those failure modes and document the exact PromQL expressions used so each mitigation maps directly to observable signals.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/runbooks/VirtAPIDown.md`:
- Around line 7-15: The runbook incorrectly asserts that the VirtAPIDown alert
emits dynamic pod/reason labels in KubeVirt v1.6+; update
docs/runbooks/VirtAPIDown.md to remove or reword the “self-diagnosing” claim and
any logic that branches on a `reason` label so guidance reflects the actual
alert in kubevirt/kubevirt/pkg/monitoring/rules/alerts/virt-api.go (VirtAPIDown
defined as `cluster:kubevirt_virt_api_pods_running:count == 0` with static
labels only). Specifically, edit the ## Meaning section (currently lines ~7–15)
to drop the v1.6+ label claim, change the ## Diagnosis step 1 (around lines
~25–29) to rely on pod/deployment/log checks rather than a `reason` label, and
update the ## Mitigation steps (around lines ~72–84) to remove any fallback
logic that expects a `reason` label. Ensure the runbook's troubleshooting steps
are driven by inspecting pods, deployments, and API logs instead of alert
labels.
In `@docs/runbooks/VirtControllerDown.md`:
- Around line 7-15: The runbook wrongly claims VirtControllerDown self-diagnoses
with pod/reason labels since KubeVirt's VirtControllerDown alert (defined in
kubevirt/kubevirt/pkg/monitoring/rules/alerts/virt-controller.go and its
prom-rule CI fixture) does not inject `pod`/`reason` labels; update
docs/runbooks/VirtControllerDown.md to remove the v1.6+ self-diagnosis paragraph
(lines 7–15), and adjust Diagnosis step 1 and the Mitigation section to stop
branching on a non-existent `reason` label and instead describe the actual
behavior (alert has For: 10m plus static labels `severity` and
`operator_health_impact` and only a `summary` annotation).
In `@docs/runbooks/VirtHandlerDown.md`:
- Around line 11-19: The runbook incorrectly says VirtHandlerDown is
"self-diagnosing" with pod/node/reason labels; update the document to state that
the alert (cluster:kubevirt_virt_handler_pods_running:count == 0) uses only
static labels (severity, operatorHealthImpact) and therefore cannot include
pod/node/reason because its recording rule (the kube_pod_status_phase-based
recording) drops those dimensions by not using by(...). Remove/replace the
self-diagnosing wording and add a short note explaining how to make those labels
available (either change the alert to template labels from a vector with
by(pod,node,reason) or modify the recording rule to preserve dimensions) so
readers understand why the labels are absent and what a correct fix would
entail.
In `@docs/runbooks/VirtOperatorDown.md`:
- Around line 5-6: Update the VirtOperatorDown runbook to match the actual alert
rule: change the ## Meaning text to explicitly reference the alert expression
cluster:kubevirt_virt_operator_pods_running:count == 0 for 10m and remove any
branching logic that depends on non-existent alert labels (reason, pod) in the
## Diagnosis and the v1.6+ self-diagnosing section; instead document only the
labels the rule sets (severity and operatorHealthImpact) and diagnosis steps
based on pod count/cluster state metrics. Remove or soften the claim "The
virt-operator deployment has a default replica of 2 pods" unless you cite the
deployment source—either add a reference for the replica default or rephrase to
a non-assertive note about typical deployments. Ensure the ## Mitigation section
no longer says "Based on the reason label" and instead provides actions tied to
detecting zero running pods and checking operator deployment/replica status.
---
Nitpick comments:
In `@docs/runbooks/VirtControllerDown.md`:
- Around line 87-92: Check the VirtControllerDown alert and its PromQL/recording
rules (search for the alert name "VirtControllerDown" and associated rule files)
to confirm whether the query references API server load metrics (e.g.,
apiserver_request_duration_seconds, apiserver_server_current_inflight_requests)
or networking metrics (e.g., kube_network_??, node_network_* or CNI/plugin
metrics); if those metrics are not present, remove or reword the "API server
overloaded" and "Networking issues" items from the Mitigation section to
indicate they are general operational considerations rather than actionable
items derived from the alert logic, or alternatively update the alert's rule to
include concrete, verifiable selectors/thresholds for those failure modes and
document the exact PromQL expressions used so each mitigation maps directly to
observable signals.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a36f8bfa-6095-49a9-91aa-5ed28493fad0
📒 Files selected for processing (4)
docs/runbooks/VirtAPIDown.mddocs/runbooks/VirtControllerDown.mddocs/runbooks/VirtHandlerDown.mddocs/runbooks/VirtOperatorDown.md
| Starting with KubeVirt v1.6, this alert is self-diagnosing: when a | ||
| container waiting reason is available (for example, `CrashLoopBackOff` | ||
| or `ErrImagePull`), the alert labels include `pod` and `reason` and | ||
| the description states the root cause. When no waiting reason exists | ||
| (pods are entirely absent), a fallback branch fires without those | ||
| labels. | ||
|
|
||
| In older KubeVirt versions the alert fires without diagnostic labels. | ||
| The diagnosis steps below cover both cases. |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify VirtAPIDown alert definition for self-diagnosing behavior
# Find the alert definition file
fd -t f 'virt-api.go' pkg/monitoring/rules/alerts/
# Extract the alert rule
ast-grep --pattern $'Alert($$$) {
$$$
For($$$)
$$$
}'
# Also search for the alert in YAML/rule files
rg -n -A 20 'alert:.*VirtAPIDown' --type=yaml
# Search for label references (pod, reason) in the alert definition
rg -n -C 5 'reason.*label|pod.*label' pkg/monitoring/rules/alerts/virt-api.goRepository: kubevirt/monitoring
Length of output: 259
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate virt-api alert rule source =="
fd -t f 'virt-api.go' . || true
echo "== Search for VirtAPIDown in repo =="
rg -n --hidden --no-ignore -S "VirtAPIDown" || true
# Pick the most likely file and dump the relevant portion
FILE="$(rg -n --hidden --no-ignore -S --files-with-matches "VirtAPIDown" | head -n 1 || true)"
if [[ -n "${FILE}" ]]; then
echo "== First match file: $FILE =="
rg -n -C 5 "VirtAPIDown" "$FILE" || true
fi
echo "== Check for pod/reason label templating in virt-api alert rule file =="
# If we have a virt-api.go file, inspect it; otherwise fall back to any matches.
VIRTAPI_FILE="$(fd -t f 'virt-api.go' . | head -n 1 || true)"
if [[ -n "${VIRTAPI_FILE}" ]]; then
echo "== Inspect: $VIRTAPI_FILE =="
rg -n -C 3 "pod|reason" "$VIRTAPI_FILE" || true
rg -n -C 3 "Labels|Annotations|label" "$VIRTAPI_FILE" || true
else
rg -n -C 3 "pod|reason" -S --files-with-matches "virt-api" . || true
fiRepository: kubevirt/kubevirt
Length of output: 1971
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== VirtAPIDown rule definition (virt-api.go) =="
FILE="pkg/monitoring/rules/alerts/virt-api.go"
if [[ -f "$FILE" ]]; then
echo "Using $FILE"
rg -n "VirtAPIDown" "$FILE"
echo
# Show the surrounding code for the alert block
rg -n -C 15 "Alert: \"VirtAPIDown\"" "$FILE" || true
echo
# Check whether pod/reason labels are referenced in the alert rule
rg -n -C 2 "\b(pod|reason)\b" "$FILE" || true
else
echo "File not found: $FILE" >&2
fiRepository: kubevirt/kubevirt
Length of output: 1164
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="docs/runbooks/VirtAPIDown.md"
echo "== File stats =="
wc -l "$FILE"
echo
echo "== Head (runbook text + Meaning) =="
sed -n '1,120p' "$FILE" | nl -ba
echo
echo "== Where runbook mentions pod/reason =="
rg -n "pod|reason|CrashLoopBackOff|ErrImagePull" "$FILE" || true
echo
echo "== Diagnosis/Mitigation sections (relevant range) =="
# Print the Diagnosis and Mitigation sections (small scoped read)
# If line numbers differ, this still gives content for review.
awk 'BEGIN{p=0}
/^## Diagnosis/{p=1}
/^## Mitigation/{p=2}
p==1{print}
p==2{print}
' "$FILE" | nl -baRepository: kubevirt/monitoring
Length of output: 211
🏁 Script executed:
#!/bin/bash
set -euo pipefail
RULE_FILE="pkg/monitoring/rules/alerts/virt-api.go"
echo "== virt-api rule file stats =="
wc -l "$RULE_FILE" || true
echo
echo "== Print VirtAPIDown block fully =="
# Show the specific alert block surrounding VirtAPIDown
rg -n -C 2 'Alert:\s*"VirtAPIDown"' "$RULE_FILE" || true
# Show a bit more context to ensure no dynamic labels exist elsewhere in the block/function
rg -n -C 20 'Alert:\s*"VirtAPIDown"' "$RULE_FILE" || true
echo
echo "== Search for pod/reason/diagnostic strings in kubevirt/kubevirt =="
rg -n --hidden --no-ignore -S "self-diagnos|waiting reason|CrashLoopBackOff|ErrImagePull" . || true
echo
echo "== Search for pod/reason label templating in virt-api alert rules =="
rg -n --hidden --no-ignore -S '\b(pod|reason)\b' "$RULE_FILE" || true
echo
echo "== Search other occurrences of VirtAPIDown in kubevirt/kubevirt =="
rg -n --hidden --no-ignore -S "VirtAPIDown" . || trueRepository: kubevirt/kubevirt
Length of output: 11294
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="docs/runbooks/VirtAPIDown.md"
echo "== Search for key terms in runbook =="
rg -n "self-diagnosing|pod`|pod |reason|CrashLoopBackOff|ErrImagePull|fallback" "$FILE" || true
echo
echo "== Show around top section (Meaning) =="
sed -n '1,60p' "$FILE"
echo
echo "== Show Diagnosis section =="
sed -n '/^## Diagnosis/,/^## Mitigation/p' "$FILE"
echo
echo "== Show Mitigation section =="
sed -n '/^## Mitigation/,$p' "$FILE"Repository: kubevirt/monitoring
Length of output: 188
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="docs/runbooks/VirtAPIDown.md"
echo "== grep for self-diagnosing / pod / reason keywords =="
rg -n "self-diagnosing|pod|reason|CrashLoopBackOff|ErrImagePull|fallback" "$FILE" || true
echo
echo "== Top section (H1 + Meaning) =="
sed -n '1,60p' "$FILE"
echo
echo "== Diagnosis section =="
sed -n '/^## Diagnosis/,$p' "$FILE" | sed -n '1,60p'Repository: kubevirt/monitoring
Length of output: 4832
Fix VirtAPIDown runbook: pod/reason “self-diagnosing” labels aren’t emitted by the alert
kubevirt/kubevirt/pkg/monitoring/rules/alerts/virt-api.godefinesVirtAPIDownascluster:kubevirt_virt_api_pods_running:count == 0withfor: 10m, and it only sets static labels (severity=critical,operatorHealthImpact=critical) plus a summary annotation—there are no dynamicpod/reasonlabels or fallback branch produced by the alert.- Update
docs/runbooks/VirtAPIDown.mdto remove/adjust thev1.6+“self-diagnosing” claim and any logic that branches on areasonlabel (e.g.,## Meaninglines 7-15,## Diagnosisstep 1 around lines 25-29, and## Mitigationaround lines 72-84) so guidance is driven by the actual pod/deployment/log checks in the runbook.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/runbooks/VirtAPIDown.md` around lines 7 - 15, The runbook incorrectly
asserts that the VirtAPIDown alert emits dynamic pod/reason labels in KubeVirt
v1.6+; update docs/runbooks/VirtAPIDown.md to remove or reword the
“self-diagnosing” claim and any logic that branches on a `reason` label so
guidance reflects the actual alert in
kubevirt/kubevirt/pkg/monitoring/rules/alerts/virt-api.go (VirtAPIDown defined
as `cluster:kubevirt_virt_api_pods_running:count == 0` with static labels only).
Specifically, edit the ## Meaning section (currently lines ~7–15) to drop the
v1.6+ label claim, change the ## Diagnosis step 1 (around lines ~25–29) to rely
on pod/deployment/log checks rather than a `reason` label, and update the ##
Mitigation steps (around lines ~72–84) to remove any fallback logic that expects
a `reason` label. Ensure the runbook's troubleshooting steps are driven by
inspecting pods, deployments, and API logs instead of alert labels.
e788272 to
4bb069a
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (4)
docs/runbooks/VirtHandlerDown.md (1)
11-19:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Alert cannot emit
pod/node/reasonlabels claimed herePast verification confirmed that
VirtHandlerDownis defined ascluster:kubevirt_virt_handler_pods_running:count == 0with only static labels. The recording rule usescount(kube_pod_status_phase{pod=~'virt-handler-.*', ...} == 1)withoutby(...), which drops pod/node dimensions. Thereforenode,pod, andreasonlabels cannot be present on the firing alert. Lines 30-36 and 75-88 also branch on labels that won't exist.Remove the self-diagnosing claim and restructure to not depend on these non-existent labels.
As per coding guidelines: "every claim must be verified against the actual alert definition in kubevirt org repos (alert expr/for/severity/annotations are the source of truth)".
#!/bin/bash # Verify VirtHandlerDown alert definition and recording rule in kubevirt/kubevirt echo "=== Searching for VirtHandlerDown alert definition ===" rg -n 'Alert:\s*"VirtHandlerDown"' --glob '*.go' -A 30 echo "" echo "=== Searching for recording rule ===" rg -n 'cluster:kubevirt_virt_handler_pods_running:count' --glob '*.go' -B 5 -A 10🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/runbooks/VirtHandlerDown.md` around lines 11 - 19, The doc incorrectly claims the VirtHandlerDown alert emits pod/node/reason labels; update the runbook to remove the "self-diagnosing" claim and any diagnostic steps that rely on those labels, and instead present diagnosis paths that work when the recording rule cluster:kubevirt_virt_handler_pods_running:count (which is derived from kube_pod_status_phase without a by(...) clause) has stripped pod/node dimensions — replace label-based branches (the sections referencing pod/node/reason) with concrete, label-agnostic troubleshooting commands (e.g., kubectl get pods -l app=virt-handler --all-namespaces, kubectl describe pod, check node status, examine virt-handler logs) and include a verification note showing how to confirm the alert and recording rule in the kubevirt repo (search for Alert "VirtHandlerDown" and cluster:kubevirt_virt_handler_pods_running:count) so the runbook matches the actual alert definition.docs/runbooks/VirtAPIDown.md (1)
7-15:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Alert does not emit
pod/reasonlabels claimed herePast verification confirmed that
VirtAPIDownis defined ascluster:kubevirt_virt_api_pods_running:count == 0with only static labels (severity,operatorHealthImpact) and a summary annotation. The alert does not inject dynamicpodorreasonlabels. Lines 25-29 and 72-84 also branch on areasonlabel that won't exist.Remove the v1.6+ self-diagnosing claim and restructure diagnosis/mitigation to not depend on these non-existent labels.
As per coding guidelines: "every claim must be verified against the actual alert definition in kubevirt org repos (alert expr/for/severity/annotations are the source of truth)".
#!/bin/bash # Verify VirtAPIDown alert definition in kubevirt/kubevirt echo "=== Searching for VirtAPIDown alert definition ===" rg -n 'Alert:\s*"VirtAPIDown"' --glob '*.go' -A 30🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/runbooks/VirtAPIDown.md` around lines 7 - 15, The runbook incorrectly claims VirtAPIDown emits dynamic pod/reason labels in v1.6+ and branches on a non-existent reason label; update the document to remove the "v1.6+ self-diagnosing" claim, eliminate or rewrite the conditional steps that depend on the pod and reason labels (references: VirtAPIDown, labels `pod` and `reason`, branches at lines 25-29 and 72-84), and restructure diagnosis/mitigation to cover both cases: (A) alert solely from the static rule cluster:kubevirt_virt_api_pods_running:count == 0 (use generic cluster-wide / API reachability checks), and (B) when external tooling provides pod/reason info; also add an explicit instruction to verify the actual alert definition in the kubevirt repo (expression/for/severity/annotations) before making label-dependent claims.docs/runbooks/VirtControllerDown.md (1)
7-15:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Alert does not emit
pod/reasonlabels claimed herePast verification confirmed that
VirtControllerDownhasFor: 10mand only static labels (severity,operator_health_impact) plus asummaryannotation. The prom-rule CI fixture also expects nopod/reasonlabels. Lines 25-31 and 75-92 incorrectly branch on areasonlabel that the alert doesn't provide.Remove the v1.6+ self-diagnosing paragraph and adjust diagnosis/mitigation to match actual alert behavior.
As per coding guidelines: "every claim must be verified against the actual alert definition in kubevirt org repos (alert expr/for/severity/annotations are the source of truth)".
#!/bin/bash # Verify VirtControllerDown alert definition in kubevirt/kubevirt echo "=== Searching for VirtControllerDown alert definition ===" rg -n 'Alert:\s*"VirtControllerDown"' --glob '*.go' -A 30🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/runbooks/VirtControllerDown.md` around lines 7 - 15, Remove the incorrect v1.6+ self-diagnosing paragraph and any diagnostic branches that assume the alert emits pod/reason labels; update the VirtControllerDown doc so it reflects the actual alert definition (it has For: 10m and only static labels severity and operator_health_impact plus a summary annotation) by deleting the text that branches on a reason label (references to the `reason`/`pod`-based logic around the previous lines 25-31 and 75-92) and rewrite diagnosis/mitigation steps to operate without those labels, and add a note to verify these claims against the VirtControllerDown alert in the kubevirt repo (use the alert name "VirtControllerDown" as the source of truth).docs/runbooks/VirtOperatorDown.md (1)
20-28:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Alert does not emit
pod/reasonlabels claimed herePast verification confirmed that
VirtOperatorDownonly setsseverityandoperatorHealthImpactlabels, driven purely by the "running pods count" expression. There is no source forreason/podalert labels. Lines 43-49 and 93-106 incorrectly branch on areasonlabel that doesn't exist.Remove the v1.6+ self-diagnosing paragraph and ensure diagnosis/mitigation steps are based on actual alert behavior (pod count metrics, not dynamic labels).
As per coding guidelines: "every claim must be verified against the actual alert definition in kubevirt org repos (alert expr/for/severity/annotations are the source of truth)".
#!/bin/bash # Verify VirtOperatorDown alert definition in kubevirt/kubevirt echo "=== Searching for VirtOperatorDown alert definition ===" rg -n 'Alert:\s*"VirtOperatorDown"' --glob '*.go' -A 30🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/runbooks/VirtOperatorDown.md` around lines 20 - 28, Remove the inaccurate "v1.6+ self-diagnosing" paragraph and any diagnostic branches that assume non-existent `pod`/`reason` labels in the VirtOperatorDown runbook; instead, update the diagnosis and mitigation steps to rely solely on the alert's actual signals (the running-pods-count metric and the alert labels that exist such as `severity` and `operatorHealthImpact`) and delete or rewrite the sections that branch on `reason`/`pod` (references to those labels and the described behavior). Ensure the doc explicitly states that `VirtOperatorDown` is driven by pod count, remove guidance that tells users to read `reason`/`pod` from the alert, and add a note to verify all claims against the `VirtOperatorDown` alert definition in the kubevirt repo (use the alert's expr/for/severity/annotations as source of truth).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/runbooks/VirtOperatorDown.md`:
- Line 18: The claim "virt-operator deployment has a default replica of 2 pods"
in VirtOperatorDown.md is unsupported by the alert/metric; either remove or
soften that sentence and instead cite the authoritative source (e.g., add a link
to the virt-operator deployment/manifest or the specific kubevirt repo file that
shows replicas: 2) or explicitly state "may be configured to X" and reference
the alert definition; search for the deployment YAML (e.g., files containing
"name: virt-operator" and "replicas:") to find the exact manifest, then update
VirtOperatorDown.md to include the exact file/URL and line or drop the numeric
claim so the runbook only references verified alert/metric facts.
---
Duplicate comments:
In `@docs/runbooks/VirtAPIDown.md`:
- Around line 7-15: The runbook incorrectly claims VirtAPIDown emits dynamic
pod/reason labels in v1.6+ and branches on a non-existent reason label; update
the document to remove the "v1.6+ self-diagnosing" claim, eliminate or rewrite
the conditional steps that depend on the pod and reason labels (references:
VirtAPIDown, labels `pod` and `reason`, branches at lines 25-29 and 72-84), and
restructure diagnosis/mitigation to cover both cases: (A) alert solely from the
static rule cluster:kubevirt_virt_api_pods_running:count == 0 (use generic
cluster-wide / API reachability checks), and (B) when external tooling provides
pod/reason info; also add an explicit instruction to verify the actual alert
definition in the kubevirt repo (expression/for/severity/annotations) before
making label-dependent claims.
In `@docs/runbooks/VirtControllerDown.md`:
- Around line 7-15: Remove the incorrect v1.6+ self-diagnosing paragraph and any
diagnostic branches that assume the alert emits pod/reason labels; update the
VirtControllerDown doc so it reflects the actual alert definition (it has For:
10m and only static labels severity and operator_health_impact plus a summary
annotation) by deleting the text that branches on a reason label (references to
the `reason`/`pod`-based logic around the previous lines 25-31 and 75-92) and
rewrite diagnosis/mitigation steps to operate without those labels, and add a
note to verify these claims against the VirtControllerDown alert in the kubevirt
repo (use the alert name "VirtControllerDown" as the source of truth).
In `@docs/runbooks/VirtHandlerDown.md`:
- Around line 11-19: The doc incorrectly claims the VirtHandlerDown alert emits
pod/node/reason labels; update the runbook to remove the "self-diagnosing" claim
and any diagnostic steps that rely on those labels, and instead present
diagnosis paths that work when the recording rule
cluster:kubevirt_virt_handler_pods_running:count (which is derived from
kube_pod_status_phase without a by(...) clause) has stripped pod/node dimensions
— replace label-based branches (the sections referencing pod/node/reason) with
concrete, label-agnostic troubleshooting commands (e.g., kubectl get pods -l
app=virt-handler --all-namespaces, kubectl describe pod, check node status,
examine virt-handler logs) and include a verification note showing how to
confirm the alert and recording rule in the kubevirt repo (search for Alert
"VirtHandlerDown" and cluster:kubevirt_virt_handler_pods_running:count) so the
runbook matches the actual alert definition.
In `@docs/runbooks/VirtOperatorDown.md`:
- Around line 20-28: Remove the inaccurate "v1.6+ self-diagnosing" paragraph and
any diagnostic branches that assume non-existent `pod`/`reason` labels in the
VirtOperatorDown runbook; instead, update the diagnosis and mitigation steps to
rely solely on the alert's actual signals (the running-pods-count metric and the
alert labels that exist such as `severity` and `operatorHealthImpact`) and
delete or rewrite the sections that branch on `reason`/`pod` (references to
those labels and the described behavior). Ensure the doc explicitly states that
`VirtOperatorDown` is driven by pod count, remove guidance that tells users to
read `reason`/`pod` from the alert, and add a note to verify all claims against
the `VirtOperatorDown` alert definition in the kubevirt repo (use the alert's
expr/for/severity/annotations as source of truth).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 8e3ff539-94f0-40c0-80f1-ffa2154e6fcc
📒 Files selected for processing (4)
docs/runbooks/VirtAPIDown.mddocs/runbooks/VirtControllerDown.mddocs/runbooks/VirtHandlerDown.mddocs/runbooks/VirtOperatorDown.md
4bb069a to
3c4c465
Compare
Update VirtAPIDown, VirtControllerDown, VirtHandlerDown, and VirtOperatorDown runbooks to reflect the self-diagnosing alert changes introduced in kubevirt/kubevirt#17985. In newer versions, these alerts include pod and reason labels when a container waiting reason is available (e.g. CrashLoopBackOff, ErrImagePull), allowing immediate diagnosis from the alert itself. Each runbook now: - Explains the new self-diagnosing behavior in the Meaning section - Notes that older versions fire without diagnostic labels - Adds "Check the alert labels" as the first diagnosis step - Includes log-checking steps using label selectors - Provides structured mitigation per root cause (CrashLoopBackOff, ErrImagePull, pods absent, node issues) Signed-off-by: Shirly Radco <sradco@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
3c4c465 to
ecd4575
Compare
|
/lgtm |
Summary
Update VirtAPIDown, VirtControllerDown, VirtHandlerDown, and
VirtOperatorDown runbooks to reflect the self-diagnosing alert
changes introduced in kubevirt/kubevirt#17985.
In the newer versions, these alerts include
podandreasonlabels when a container waiting reason is available (e.g.
CrashLoopBackOff,ErrImagePull), allowing immediate diagnosisfrom the alert itself.
Changes per runbook
Each runbook now:
that older KubeVirt versions fire without diagnostic labels
so users with vthe updated versions can skip manual commands when the alert
already tells them the root cause; adds log-checking steps
using label selectors
CrashLoopBackOff,ErrImagePull/ImagePullBackOff, pods absent, node issues)Backward compatibility
The runbooks are linked from older KubeVirt versions that do not
include the self-diagnosing labels. Each runbook explicitly notes
that diagnostic labels are available "In newer versions"
and that "older KubeVirt versions fire without diagnostic labels."
The manual diagnosis steps remain for users on older versions.
Related PRs
Signed-off-by: Shirly Radco sradco@redhat.com
Co-authored-by: Cursor cursoragent@cursor.com
Made with Cursor