Skip to content

docs: update Virt*Down runbooks for self-diagnosing alerts#383

Open
sradco wants to merge 1 commit into
kubevirt:mainfrom
sradco:update-virt-down-runbooks
Open

docs: update Virt*Down runbooks for self-diagnosing alerts#383
sradco wants to merge 1 commit into
kubevirt:mainfrom
sradco:update-virt-down-runbooks

Conversation

@sradco
Copy link
Copy Markdown
Collaborator

@sradco sradco commented Jun 1, 2026

Summary

Update VirtAPIDown, VirtControllerDown, VirtHandlerDown, and
VirtOperatorDown runbooks to reflect the self-diagnosing alert
changes introduced in kubevirt/kubevirt#17985.

In the newer versions, these alerts include pod and reason
labels when a container waiting reason is available (e.g.
CrashLoopBackOff, ErrImagePull), allowing immediate diagnosis
from the alert itself.

Changes per runbook

Each runbook now:

  • Meaning: explains the new self-diagnosing behavior and notes
    that older KubeVirt versions fire without diagnostic labels
  • Diagnosis: adds "Check the alert labels" as the first step,
    so users with vthe updated versions can skip manual commands when the alert
    already tells them the root cause; adds log-checking steps
    using label selectors
  • Mitigation: structured per root cause (CrashLoopBackOff,
    ErrImagePull/ImagePullBackOff, pods absent, node issues)

Backward compatibility

The runbooks are linked from older KubeVirt versions that do not
include the self-diagnosing labels. Each runbook explicitly notes
that diagnostic labels are available "In newer versions"
and that "older KubeVirt versions fire without diagnostic labels."
The manual diagnosis steps remain for users on older versions.

Related PRs

Signed-off-by: Shirly Radco sradco@redhat.com
Co-authored-by: Cursor cursoragent@cursor.com

Made with Cursor

@kubevirt-bot kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Jun 1, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Warning

Review limit reached

@sradco, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 45 minutes and 32 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d4b87403-6a0b-4b91-81f7-45f911bf7ee6

📥 Commits

Reviewing files that changed from the base of the PR and between 4bb069a and ecd4575.

📒 Files selected for processing (4)
  • docs/runbooks/VirtAPIDown.md
  • docs/runbooks/VirtControllerDown.md
  • docs/runbooks/VirtHandlerDown.md
  • docs/runbooks/VirtOperatorDown.md
📝 Walkthrough

Walkthrough

This PR updates four KubeVirt component runbooks (VirtAPIDown, VirtControllerDown, VirtHandlerDown, VirtOperatorDown) to incorporate v1.6+ self-diagnosing alert behavior. Each runbook now documents how diagnostic labels (reason, pod) and alert descriptions indicate root causes, with fallback procedures for older versions or missing labels. The diagnosis sections have been restructured into label-aware workflows featuring concrete kubectl commands, and mitigation sections replaced with condition-specific remediation steps mapped to common failure modes (crash loops, image pull failures, absent pods, node readiness issues).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Acting as a senior engineer: the changes are documentation-only and focused; check label/command accuracy and file line references. Nice, clear updates.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
docs/runbooks/VirtControllerDown.md (1)

87-92: ⚖️ Poor tradeoff

Verify mitigation items map to alert root causes.

The mitigation section includes "API server overloaded" and "Networking issues" as potential causes. These should be verifiable from the alert's PromQL logic or documented failure modes.

As per coding guidelines, "Mitigation must be concrete/actionable and map to root causes revealed by the alert logic."

Please verify whether the VirtControllerDown alert definition or its associated monitoring logic accounts for API server load or networking issues as observable failure modes, or if these are general operational considerations beyond the alert's scope.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runbooks/VirtControllerDown.md` around lines 87 - 92, Check the
VirtControllerDown alert and its PromQL/recording rules (search for the alert
name "VirtControllerDown" and associated rule files) to confirm whether the
query references API server load metrics (e.g.,
apiserver_request_duration_seconds, apiserver_server_current_inflight_requests)
or networking metrics (e.g., kube_network_??, node_network_* or CNI/plugin
metrics); if those metrics are not present, remove or reword the "API server
overloaded" and "Networking issues" items from the Mitigation section to
indicate they are general operational considerations rather than actionable
items derived from the alert logic, or alternatively update the alert's rule to
include concrete, verifiable selectors/thresholds for those failure modes and
document the exact PromQL expressions used so each mitigation maps directly to
observable signals.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/runbooks/VirtAPIDown.md`:
- Around line 7-15: The runbook incorrectly asserts that the VirtAPIDown alert
emits dynamic pod/reason labels in KubeVirt v1.6+; update
docs/runbooks/VirtAPIDown.md to remove or reword the “self-diagnosing” claim and
any logic that branches on a `reason` label so guidance reflects the actual
alert in kubevirt/kubevirt/pkg/monitoring/rules/alerts/virt-api.go (VirtAPIDown
defined as `cluster:kubevirt_virt_api_pods_running:count == 0` with static
labels only). Specifically, edit the ## Meaning section (currently lines ~7–15)
to drop the v1.6+ label claim, change the ## Diagnosis step 1 (around lines
~25–29) to rely on pod/deployment/log checks rather than a `reason` label, and
update the ## Mitigation steps (around lines ~72–84) to remove any fallback
logic that expects a `reason` label. Ensure the runbook's troubleshooting steps
are driven by inspecting pods, deployments, and API logs instead of alert
labels.

In `@docs/runbooks/VirtControllerDown.md`:
- Around line 7-15: The runbook wrongly claims VirtControllerDown self-diagnoses
with pod/reason labels since KubeVirt's VirtControllerDown alert (defined in
kubevirt/kubevirt/pkg/monitoring/rules/alerts/virt-controller.go and its
prom-rule CI fixture) does not inject `pod`/`reason` labels; update
docs/runbooks/VirtControllerDown.md to remove the v1.6+ self-diagnosis paragraph
(lines 7–15), and adjust Diagnosis step 1 and the Mitigation section to stop
branching on a non-existent `reason` label and instead describe the actual
behavior (alert has For: 10m plus static labels `severity` and
`operator_health_impact` and only a `summary` annotation).

In `@docs/runbooks/VirtHandlerDown.md`:
- Around line 11-19: The runbook incorrectly says VirtHandlerDown is
"self-diagnosing" with pod/node/reason labels; update the document to state that
the alert (cluster:kubevirt_virt_handler_pods_running:count == 0) uses only
static labels (severity, operatorHealthImpact) and therefore cannot include
pod/node/reason because its recording rule (the kube_pod_status_phase-based
recording) drops those dimensions by not using by(...). Remove/replace the
self-diagnosing wording and add a short note explaining how to make those labels
available (either change the alert to template labels from a vector with
by(pod,node,reason) or modify the recording rule to preserve dimensions) so
readers understand why the labels are absent and what a correct fix would
entail.

In `@docs/runbooks/VirtOperatorDown.md`:
- Around line 5-6: Update the VirtOperatorDown runbook to match the actual alert
rule: change the ## Meaning text to explicitly reference the alert expression
cluster:kubevirt_virt_operator_pods_running:count == 0 for 10m and remove any
branching logic that depends on non-existent alert labels (reason, pod) in the
## Diagnosis and the v1.6+ self-diagnosing section; instead document only the
labels the rule sets (severity and operatorHealthImpact) and diagnosis steps
based on pod count/cluster state metrics. Remove or soften the claim "The
virt-operator deployment has a default replica of 2 pods" unless you cite the
deployment source—either add a reference for the replica default or rephrase to
a non-assertive note about typical deployments. Ensure the ## Mitigation section
no longer says "Based on the reason label" and instead provides actions tied to
detecting zero running pods and checking operator deployment/replica status.

---

Nitpick comments:
In `@docs/runbooks/VirtControllerDown.md`:
- Around line 87-92: Check the VirtControllerDown alert and its PromQL/recording
rules (search for the alert name "VirtControllerDown" and associated rule files)
to confirm whether the query references API server load metrics (e.g.,
apiserver_request_duration_seconds, apiserver_server_current_inflight_requests)
or networking metrics (e.g., kube_network_??, node_network_* or CNI/plugin
metrics); if those metrics are not present, remove or reword the "API server
overloaded" and "Networking issues" items from the Mitigation section to
indicate they are general operational considerations rather than actionable
items derived from the alert logic, or alternatively update the alert's rule to
include concrete, verifiable selectors/thresholds for those failure modes and
document the exact PromQL expressions used so each mitigation maps directly to
observable signals.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a36f8bfa-6095-49a9-91aa-5ed28493fad0

📥 Commits

Reviewing files that changed from the base of the PR and between 29e37c1 and e788272.

📒 Files selected for processing (4)
  • docs/runbooks/VirtAPIDown.md
  • docs/runbooks/VirtControllerDown.md
  • docs/runbooks/VirtHandlerDown.md
  • docs/runbooks/VirtOperatorDown.md

Comment thread docs/runbooks/VirtAPIDown.md Outdated
Comment on lines +7 to +15
Starting with KubeVirt v1.6, this alert is self-diagnosing: when a
container waiting reason is available (for example, `CrashLoopBackOff`
or `ErrImagePull`), the alert labels include `pod` and `reason` and
the description states the root cause. When no waiting reason exists
(pods are entirely absent), a fallback branch fires without those
labels.

In older KubeVirt versions the alert fires without diagnostic labels.
The diagnosis steps below cover both cases.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify VirtAPIDown alert definition for self-diagnosing behavior

# Find the alert definition file
fd -t f 'virt-api.go' pkg/monitoring/rules/alerts/

# Extract the alert rule
ast-grep --pattern $'Alert($$$) {
  $$$
  For($$$)
  $$$
}'

# Also search for the alert in YAML/rule files
rg -n -A 20 'alert:.*VirtAPIDown' --type=yaml

# Search for label references (pod, reason) in the alert definition
rg -n -C 5 'reason.*label|pod.*label' pkg/monitoring/rules/alerts/virt-api.go

Repository: kubevirt/monitoring

Length of output: 259


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Locate virt-api alert rule source =="
fd -t f 'virt-api.go' . || true

echo "== Search for VirtAPIDown in repo =="
rg -n --hidden --no-ignore -S "VirtAPIDown" || true

# Pick the most likely file and dump the relevant portion
FILE="$(rg -n --hidden --no-ignore -S --files-with-matches "VirtAPIDown" | head -n 1 || true)"
if [[ -n "${FILE}" ]]; then
  echo "== First match file: $FILE =="
  rg -n -C 5 "VirtAPIDown" "$FILE" || true
fi

echo "== Check for pod/reason label templating in virt-api alert rule file =="
# If we have a virt-api.go file, inspect it; otherwise fall back to any matches.
VIRTAPI_FILE="$(fd -t f 'virt-api.go' . | head -n 1 || true)"
if [[ -n "${VIRTAPI_FILE}" ]]; then
  echo "== Inspect: $VIRTAPI_FILE =="
  rg -n -C 3 "pod|reason" "$VIRTAPI_FILE" || true
  rg -n -C 3 "Labels|Annotations|label" "$VIRTAPI_FILE" || true
else
  rg -n -C 3 "pod|reason" -S --files-with-matches "virt-api" . || true
fi

Repository: kubevirt/kubevirt

Length of output: 1971


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== VirtAPIDown rule definition (virt-api.go) =="
FILE="pkg/monitoring/rules/alerts/virt-api.go"
if [[ -f "$FILE" ]]; then
  echo "Using $FILE"
  rg -n "VirtAPIDown" "$FILE"
  echo
  # Show the surrounding code for the alert block
  rg -n -C 15 "Alert: \"VirtAPIDown\"" "$FILE" || true
  echo
  # Check whether pod/reason labels are referenced in the alert rule
  rg -n -C 2 "\b(pod|reason)\b" "$FILE" || true
else
  echo "File not found: $FILE" >&2
fi

Repository: kubevirt/kubevirt

Length of output: 1164


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="docs/runbooks/VirtAPIDown.md"
echo "== File stats =="
wc -l "$FILE"
echo

echo "== Head (runbook text + Meaning) =="
sed -n '1,120p' "$FILE" | nl -ba

echo
echo "== Where runbook mentions pod/reason =="
rg -n "pod|reason|CrashLoopBackOff|ErrImagePull" "$FILE" || true

echo
echo "== Diagnosis/Mitigation sections (relevant range) =="
# Print the Diagnosis and Mitigation sections (small scoped read)
# If line numbers differ, this still gives content for review.
awk 'BEGIN{p=0}
  /^## Diagnosis/{p=1}
  /^## Mitigation/{p=2}
  p==1{print}
  p==2{print}
' "$FILE" | nl -ba

Repository: kubevirt/monitoring

Length of output: 211


🏁 Script executed:

#!/bin/bash
set -euo pipefail

RULE_FILE="pkg/monitoring/rules/alerts/virt-api.go"
echo "== virt-api rule file stats =="
wc -l "$RULE_FILE" || true
echo

echo "== Print VirtAPIDown block fully =="
# Show the specific alert block surrounding VirtAPIDown
rg -n -C 2 'Alert:\s*"VirtAPIDown"' "$RULE_FILE" || true
# Show a bit more context to ensure no dynamic labels exist elsewhere in the block/function
rg -n -C 20 'Alert:\s*"VirtAPIDown"' "$RULE_FILE" || true

echo
echo "== Search for pod/reason/diagnostic strings in kubevirt/kubevirt =="
rg -n --hidden --no-ignore -S "self-diagnos|waiting reason|CrashLoopBackOff|ErrImagePull" . || true

echo
echo "== Search for pod/reason label templating in virt-api alert rules =="
rg -n --hidden --no-ignore -S '\b(pod|reason)\b' "$RULE_FILE" || true

echo
echo "== Search other occurrences of VirtAPIDown in kubevirt/kubevirt =="
rg -n --hidden --no-ignore -S "VirtAPIDown" . || true

Repository: kubevirt/kubevirt

Length of output: 11294


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="docs/runbooks/VirtAPIDown.md"
echo "== Search for key terms in runbook =="
rg -n "self-diagnosing|pod`|pod |reason|CrashLoopBackOff|ErrImagePull|fallback" "$FILE" || true
echo

echo "== Show around top section (Meaning) =="
sed -n '1,60p' "$FILE"
echo

echo "== Show Diagnosis section =="
sed -n '/^## Diagnosis/,/^## Mitigation/p' "$FILE"
echo

echo "== Show Mitigation section =="
sed -n '/^## Mitigation/,$p' "$FILE"

Repository: kubevirt/monitoring

Length of output: 188


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="docs/runbooks/VirtAPIDown.md"

echo "== grep for self-diagnosing / pod / reason keywords =="
rg -n "self-diagnosing|pod|reason|CrashLoopBackOff|ErrImagePull|fallback" "$FILE" || true

echo
echo "== Top section (H1 + Meaning) =="
sed -n '1,60p' "$FILE"

echo
echo "== Diagnosis section =="
sed -n '/^## Diagnosis/,$p' "$FILE" | sed -n '1,60p'

Repository: kubevirt/monitoring

Length of output: 4832


Fix VirtAPIDown runbook: pod/reason “self-diagnosing” labels aren’t emitted by the alert

  • kubevirt/kubevirt/pkg/monitoring/rules/alerts/virt-api.go defines VirtAPIDown as cluster:kubevirt_virt_api_pods_running:count == 0 with for: 10m, and it only sets static labels (severity=critical, operatorHealthImpact=critical) plus a summary annotation—there are no dynamic pod/reason labels or fallback branch produced by the alert.
  • Update docs/runbooks/VirtAPIDown.md to remove/adjust the v1.6+ “self-diagnosing” claim and any logic that branches on a reason label (e.g., ## Meaning lines 7-15, ## Diagnosis step 1 around lines 25-29, and ## Mitigation around lines 72-84) so guidance is driven by the actual pod/deployment/log checks in the runbook.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runbooks/VirtAPIDown.md` around lines 7 - 15, The runbook incorrectly
asserts that the VirtAPIDown alert emits dynamic pod/reason labels in KubeVirt
v1.6+; update docs/runbooks/VirtAPIDown.md to remove or reword the
“self-diagnosing” claim and any logic that branches on a `reason` label so
guidance reflects the actual alert in
kubevirt/kubevirt/pkg/monitoring/rules/alerts/virt-api.go (VirtAPIDown defined
as `cluster:kubevirt_virt_api_pods_running:count == 0` with static labels only).
Specifically, edit the ## Meaning section (currently lines ~7–15) to drop the
v1.6+ label claim, change the ## Diagnosis step 1 (around lines ~25–29) to rely
on pod/deployment/log checks rather than a `reason` label, and update the ##
Mitigation steps (around lines ~72–84) to remove any fallback logic that expects
a `reason` label. Ensure the runbook's troubleshooting steps are driven by
inspecting pods, deployments, and API logs instead of alert labels.

Comment thread docs/runbooks/VirtControllerDown.md Outdated
Comment thread docs/runbooks/VirtHandlerDown.md Outdated
Comment thread docs/runbooks/VirtOperatorDown.md Outdated
@sradco sradco force-pushed the update-virt-down-runbooks branch from e788272 to 4bb069a Compare June 2, 2026 06:37
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
docs/runbooks/VirtHandlerDown.md (1)

11-19: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Alert cannot emit pod/node/reason labels claimed here

Past verification confirmed that VirtHandlerDown is defined as cluster:kubevirt_virt_handler_pods_running:count == 0 with only static labels. The recording rule uses count(kube_pod_status_phase{pod=~'virt-handler-.*', ...} == 1) without by(...), which drops pod/node dimensions. Therefore node, pod, and reason labels cannot be present on the firing alert. Lines 30-36 and 75-88 also branch on labels that won't exist.

Remove the self-diagnosing claim and restructure to not depend on these non-existent labels.

As per coding guidelines: "every claim must be verified against the actual alert definition in kubevirt org repos (alert expr/for/severity/annotations are the source of truth)".

#!/bin/bash
# Verify VirtHandlerDown alert definition and recording rule in kubevirt/kubevirt
echo "=== Searching for VirtHandlerDown alert definition ==="
rg -n 'Alert:\s*"VirtHandlerDown"' --glob '*.go' -A 30
echo ""
echo "=== Searching for recording rule ==="
rg -n 'cluster:kubevirt_virt_handler_pods_running:count' --glob '*.go' -B 5 -A 10
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runbooks/VirtHandlerDown.md` around lines 11 - 19, The doc incorrectly
claims the VirtHandlerDown alert emits pod/node/reason labels; update the
runbook to remove the "self-diagnosing" claim and any diagnostic steps that rely
on those labels, and instead present diagnosis paths that work when the
recording rule cluster:kubevirt_virt_handler_pods_running:count (which is
derived from kube_pod_status_phase without a by(...) clause) has stripped
pod/node dimensions — replace label-based branches (the sections referencing
pod/node/reason) with concrete, label-agnostic troubleshooting commands (e.g.,
kubectl get pods -l app=virt-handler --all-namespaces, kubectl describe pod,
check node status, examine virt-handler logs) and include a verification note
showing how to confirm the alert and recording rule in the kubevirt repo (search
for Alert "VirtHandlerDown" and
cluster:kubevirt_virt_handler_pods_running:count) so the runbook matches the
actual alert definition.
docs/runbooks/VirtAPIDown.md (1)

7-15: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Alert does not emit pod/reason labels claimed here

Past verification confirmed that VirtAPIDown is defined as cluster:kubevirt_virt_api_pods_running:count == 0 with only static labels (severity, operatorHealthImpact) and a summary annotation. The alert does not inject dynamic pod or reason labels. Lines 25-29 and 72-84 also branch on a reason label that won't exist.

Remove the v1.6+ self-diagnosing claim and restructure diagnosis/mitigation to not depend on these non-existent labels.

As per coding guidelines: "every claim must be verified against the actual alert definition in kubevirt org repos (alert expr/for/severity/annotations are the source of truth)".

#!/bin/bash
# Verify VirtAPIDown alert definition in kubevirt/kubevirt
echo "=== Searching for VirtAPIDown alert definition ==="
rg -n 'Alert:\s*"VirtAPIDown"' --glob '*.go' -A 30
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runbooks/VirtAPIDown.md` around lines 7 - 15, The runbook incorrectly
claims VirtAPIDown emits dynamic pod/reason labels in v1.6+ and branches on a
non-existent reason label; update the document to remove the "v1.6+
self-diagnosing" claim, eliminate or rewrite the conditional steps that depend
on the pod and reason labels (references: VirtAPIDown, labels `pod` and
`reason`, branches at lines 25-29 and 72-84), and restructure
diagnosis/mitigation to cover both cases: (A) alert solely from the static rule
cluster:kubevirt_virt_api_pods_running:count == 0 (use generic cluster-wide /
API reachability checks), and (B) when external tooling provides pod/reason
info; also add an explicit instruction to verify the actual alert definition in
the kubevirt repo (expression/for/severity/annotations) before making
label-dependent claims.
docs/runbooks/VirtControllerDown.md (1)

7-15: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Alert does not emit pod/reason labels claimed here

Past verification confirmed that VirtControllerDown has For: 10m and only static labels (severity, operator_health_impact) plus a summary annotation. The prom-rule CI fixture also expects no pod/reason labels. Lines 25-31 and 75-92 incorrectly branch on a reason label that the alert doesn't provide.

Remove the v1.6+ self-diagnosing paragraph and adjust diagnosis/mitigation to match actual alert behavior.

As per coding guidelines: "every claim must be verified against the actual alert definition in kubevirt org repos (alert expr/for/severity/annotations are the source of truth)".

#!/bin/bash
# Verify VirtControllerDown alert definition in kubevirt/kubevirt
echo "=== Searching for VirtControllerDown alert definition ==="
rg -n 'Alert:\s*"VirtControllerDown"' --glob '*.go' -A 30
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runbooks/VirtControllerDown.md` around lines 7 - 15, Remove the
incorrect v1.6+ self-diagnosing paragraph and any diagnostic branches that
assume the alert emits pod/reason labels; update the VirtControllerDown doc so
it reflects the actual alert definition (it has For: 10m and only static labels
severity and operator_health_impact plus a summary annotation) by deleting the
text that branches on a reason label (references to the `reason`/`pod`-based
logic around the previous lines 25-31 and 75-92) and rewrite
diagnosis/mitigation steps to operate without those labels, and add a note to
verify these claims against the VirtControllerDown alert in the kubevirt repo
(use the alert name "VirtControllerDown" as the source of truth).
docs/runbooks/VirtOperatorDown.md (1)

20-28: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Alert does not emit pod/reason labels claimed here

Past verification confirmed that VirtOperatorDown only sets severity and operatorHealthImpact labels, driven purely by the "running pods count" expression. There is no source for reason/pod alert labels. Lines 43-49 and 93-106 incorrectly branch on a reason label that doesn't exist.

Remove the v1.6+ self-diagnosing paragraph and ensure diagnosis/mitigation steps are based on actual alert behavior (pod count metrics, not dynamic labels).

As per coding guidelines: "every claim must be verified against the actual alert definition in kubevirt org repos (alert expr/for/severity/annotations are the source of truth)".

#!/bin/bash
# Verify VirtOperatorDown alert definition in kubevirt/kubevirt
echo "=== Searching for VirtOperatorDown alert definition ==="
rg -n 'Alert:\s*"VirtOperatorDown"' --glob '*.go' -A 30
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runbooks/VirtOperatorDown.md` around lines 20 - 28, Remove the
inaccurate "v1.6+ self-diagnosing" paragraph and any diagnostic branches that
assume non-existent `pod`/`reason` labels in the VirtOperatorDown runbook;
instead, update the diagnosis and mitigation steps to rely solely on the alert's
actual signals (the running-pods-count metric and the alert labels that exist
such as `severity` and `operatorHealthImpact`) and delete or rewrite the
sections that branch on `reason`/`pod` (references to those labels and the
described behavior). Ensure the doc explicitly states that `VirtOperatorDown` is
driven by pod count, remove guidance that tells users to read `reason`/`pod`
from the alert, and add a note to verify all claims against the
`VirtOperatorDown` alert definition in the kubevirt repo (use the alert's
expr/for/severity/annotations as source of truth).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/runbooks/VirtOperatorDown.md`:
- Line 18: The claim "virt-operator deployment has a default replica of 2 pods"
in VirtOperatorDown.md is unsupported by the alert/metric; either remove or
soften that sentence and instead cite the authoritative source (e.g., add a link
to the virt-operator deployment/manifest or the specific kubevirt repo file that
shows replicas: 2) or explicitly state "may be configured to X" and reference
the alert definition; search for the deployment YAML (e.g., files containing
"name: virt-operator" and "replicas:") to find the exact manifest, then update
VirtOperatorDown.md to include the exact file/URL and line or drop the numeric
claim so the runbook only references verified alert/metric facts.

---

Duplicate comments:
In `@docs/runbooks/VirtAPIDown.md`:
- Around line 7-15: The runbook incorrectly claims VirtAPIDown emits dynamic
pod/reason labels in v1.6+ and branches on a non-existent reason label; update
the document to remove the "v1.6+ self-diagnosing" claim, eliminate or rewrite
the conditional steps that depend on the pod and reason labels (references:
VirtAPIDown, labels `pod` and `reason`, branches at lines 25-29 and 72-84), and
restructure diagnosis/mitigation to cover both cases: (A) alert solely from the
static rule cluster:kubevirt_virt_api_pods_running:count == 0 (use generic
cluster-wide / API reachability checks), and (B) when external tooling provides
pod/reason info; also add an explicit instruction to verify the actual alert
definition in the kubevirt repo (expression/for/severity/annotations) before
making label-dependent claims.

In `@docs/runbooks/VirtControllerDown.md`:
- Around line 7-15: Remove the incorrect v1.6+ self-diagnosing paragraph and any
diagnostic branches that assume the alert emits pod/reason labels; update the
VirtControllerDown doc so it reflects the actual alert definition (it has For:
10m and only static labels severity and operator_health_impact plus a summary
annotation) by deleting the text that branches on a reason label (references to
the `reason`/`pod`-based logic around the previous lines 25-31 and 75-92) and
rewrite diagnosis/mitigation steps to operate without those labels, and add a
note to verify these claims against the VirtControllerDown alert in the kubevirt
repo (use the alert name "VirtControllerDown" as the source of truth).

In `@docs/runbooks/VirtHandlerDown.md`:
- Around line 11-19: The doc incorrectly claims the VirtHandlerDown alert emits
pod/node/reason labels; update the runbook to remove the "self-diagnosing" claim
and any diagnostic steps that rely on those labels, and instead present
diagnosis paths that work when the recording rule
cluster:kubevirt_virt_handler_pods_running:count (which is derived from
kube_pod_status_phase without a by(...) clause) has stripped pod/node dimensions
— replace label-based branches (the sections referencing pod/node/reason) with
concrete, label-agnostic troubleshooting commands (e.g., kubectl get pods -l
app=virt-handler --all-namespaces, kubectl describe pod, check node status,
examine virt-handler logs) and include a verification note showing how to
confirm the alert and recording rule in the kubevirt repo (search for Alert
"VirtHandlerDown" and cluster:kubevirt_virt_handler_pods_running:count) so the
runbook matches the actual alert definition.

In `@docs/runbooks/VirtOperatorDown.md`:
- Around line 20-28: Remove the inaccurate "v1.6+ self-diagnosing" paragraph and
any diagnostic branches that assume non-existent `pod`/`reason` labels in the
VirtOperatorDown runbook; instead, update the diagnosis and mitigation steps to
rely solely on the alert's actual signals (the running-pods-count metric and the
alert labels that exist such as `severity` and `operatorHealthImpact`) and
delete or rewrite the sections that branch on `reason`/`pod` (references to
those labels and the described behavior). Ensure the doc explicitly states that
`VirtOperatorDown` is driven by pod count, remove guidance that tells users to
read `reason`/`pod` from the alert, and add a note to verify all claims against
the `VirtOperatorDown` alert definition in the kubevirt repo (use the alert's
expr/for/severity/annotations as source of truth).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8e3ff539-94f0-40c0-80f1-ffa2154e6fcc

📥 Commits

Reviewing files that changed from the base of the PR and between e788272 and 4bb069a.

📒 Files selected for processing (4)
  • docs/runbooks/VirtAPIDown.md
  • docs/runbooks/VirtControllerDown.md
  • docs/runbooks/VirtHandlerDown.md
  • docs/runbooks/VirtOperatorDown.md

Comment thread docs/runbooks/VirtOperatorDown.md Outdated
@sradco sradco force-pushed the update-virt-down-runbooks branch from 4bb069a to 3c4c465 Compare June 2, 2026 06:46
Update VirtAPIDown, VirtControllerDown, VirtHandlerDown, and
VirtOperatorDown runbooks to reflect the self-diagnosing alert
changes introduced in kubevirt/kubevirt#17985.

In newer versions, these alerts include pod and reason labels
when a container waiting reason is available (e.g.
CrashLoopBackOff, ErrImagePull), allowing immediate diagnosis
from the alert itself.

Each runbook now:
- Explains the new self-diagnosing behavior in the Meaning section
- Notes that older versions fire without diagnostic labels
- Adds "Check the alert labels" as the first diagnosis step
- Includes log-checking steps using label selectors
- Provides structured mitigation per root cause (CrashLoopBackOff,
  ErrImagePull, pods absent, node issues)

Signed-off-by: Shirly Radco <sradco@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@sradco sradco force-pushed the update-virt-down-runbooks branch from 3c4c465 to ecd4575 Compare June 2, 2026 06:52
@sradco
Copy link
Copy Markdown
Collaborator Author

sradco commented Jun 2, 2026

@avlitman @Ronilerr Please review this PR

@Ronilerr
Copy link
Copy Markdown
Collaborator

Ronilerr commented Jun 2, 2026

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants