feat(alerts): make nova alerts region- and value-aware #902
Inline alert rules into each bundle's templates/alerts.yaml so they can be gated on Helm values. Nova: severity of CortexNovaSchedulingDown depends on kvm.enabled, CortexNovaDoesntFindValidKVMHosts only renders when KVM is enabled, memory and reconcile-duration thresholds are configurable via .Values.alerts.thresholds. Other bundles: structural relocation only with Style-B escaping of Prometheus directives. Ironcore: empty rules removed.
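For orientation, the values that gate the Nova alerts might look like the sketch below. The key names come from the template expressions in this PR (`.Values.alerts.enabled`, `.Values.alerts.prometheus`, `.Values.alerts.thresholds.*`, `.Values.kvm.enabled`). The concrete values are assumptions: `6000` and `600` mirror the thresholds that were previously hard-coded in the rules, and the Prometheus name is a placeholder.

```yaml
# Hypothetical excerpt of helm/bundles/cortex-nova/values.yaml (values are assumptions).
alerts:
  enabled: true                     # the whole PrometheusRule is skipped when false
  prometheus: example-prometheus    # placeholder; the template fails via `required` if unset
  thresholds:
    highMemoryMiB: 6000             # mirrors the former hard-coded 6000 MiB limit
    reconcileDurationSeconds: 600   # mirrors the former hard-coded 600 s limit
kvm:
  enabled: true                     # critical severity and the KVM host alert depend on this
```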
Walkthrough
This PR moves bundle-level Prometheus alert definitions into inline PrometheusRule YAML within Helm templates for multiple bundles, adds Helm escaping guidance for embedded Prometheus templating, introduces Nova alert thresholds in values.yaml, updates a docs reference, and changes the alert-lint workflow to render and validate rendered rules.
Changes: Alert Rules Consolidation Across Bundles
Estimated code review effort: 4 (Complex), ~45 minutes
To get a useful diff, run something like: git --no-pager diff --no-index -w <(git show HEAD~1:helm/bundles/cortex-nova/alerts/nova.alerts.yaml) helm/bundles/cortex-nova/templates/alerts.yaml
diff --git a/proc/self/fd/16 b/helm/bundles/cortex-nova/templates/alerts.yaml
index 00000000..6f3fabef 100644
--- a/proc/self/fd/16
+++ b/helm/bundles/cortex-nova/templates/alerts.yaml
@@ -1,3 +1,19 @@
+# Copyright SAP SE
+# SPDX-License-Identifier: Apache-2.0
+
+# NOTE: This file is rendered by Helm. Prometheus templating directives
+# (e.g. {{ "{{" }} $labels.foo {{ "}}" }}) must be escaped using Style B:
+# replace the outer `{{` and `}}` with `{{ "{{" }}` and `{{ "}}" }}`.
+
+{{- if .Values.alerts.enabled }}
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ name: cortex-nova-alerts
+ labels:
+ type: alerting-rules
+ prometheus: {{ required ".Values.alerts.prometheus missing" .Values.alerts.prometheus | quote }}
+spec:
groups:
- name: cortex-nova-alerts
rules:
@@ -10,7 +26,7 @@ groups:
context: liveness
dashboard: cortex-status-dashboard/cortex-status-dashboard
service: cortex
- severity: critical
+ severity: {{ if .Values.kvm.enabled }}critical{{ else }}warning{{ end }}
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/down
annotations:
@@ -93,7 +109,7 @@ groups:
Thus, no immediate action is needed.
- alert: CortexNovaHighMemoryUsage
- expr: process_resident_memory_bytes{service="cortex-nova-metrics"} > 6000 * 1024 * 1024
+ expr: process_resident_memory_bytes{service="cortex-nova-metrics"} > {{ .Values.alerts.thresholds.highMemoryMiB }} * 1024 * 1024
for: 5m
labels:
context: memory
@@ -103,9 +119,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/deployment
annotations:
- summary: "`{{$labels.component}}` uses too much memory"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` uses too much memory"
description: >
- `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
+ `{{ "{{" }} $labels.component {{ "}}" }}` should not be using more than {{ .Values.alerts.thresholds.highMemoryMiB }} MiB of memory. Usually it
should use much less, so there may be a memory leak or other changes
that are causing the memory usage to increase significantly.
@@ -120,9 +136,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/deployment
annotations:
- summary: "`{{$labels.component}}` uses too much CPU"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` uses too much CPU"
description: >
- `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
+ `{{ "{{" }} $labels.component {{ "}}" }}` should not be using more than 50% of a single CPU core. Usually
it should use much less, so there may be a CPU leak or other changes
that are causing the CPU usage to increase significantly.
@@ -137,9 +153,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/database
annotations:
- summary: "`{{$labels.component}}` is trying to connect to the database too often"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` is trying to connect to the database too often"
description: >
- `{{$labels.component}}` is trying to connect to the database too often. This may happen
+ `{{ "{{" }} $labels.component {{ "}}" }}` is trying to connect to the database too often. This may happen
when the database is down or the connection parameters are misconfigured.
- alert: CortexNovaSyncNotSuccessful
@@ -153,9 +169,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "`{{$labels.component}}` Sync not successful"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` Sync not successful"
description: >
- `{{$labels.component}}` experienced an issue syncing data from the datasource `{{$labels.datasource}}`. This may
+ `{{ "{{" }} $labels.component {{ "}}" }}` experienced an issue syncing data from the datasource `{{ "{{" }} $labels.datasource {{ "}}" }}`. This may
happen when the datasource (OpenStack, Prometheus, etc.) is down or
the sync module is misconfigured. No immediate action is needed, since
the sync module will retry the sync operation and the currently synced
@@ -173,9 +189,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "`{{$labels.component}}` is not syncing any new data from `{{$labels.datasource}}`"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` is not syncing any new data from `{{ "{{" }} $labels.datasource {{ "}}" }}`"
description: >
- `{{$labels.component}}` is not syncing any objects from the datasource `{{$labels.datasource}}`. This may happen
+ `{{ "{{" }} $labels.component {{ "}}" }}` is not syncing any objects from the datasource `{{ "{{" }} $labels.datasource {{ "}}" }}`. This may happen
when the datasource (OpenStack, Prometheus, etc.) is down or the sync
module is misconfigured. No immediate action is needed, since the sync
module will retry the sync operation and the currently synced data will
@@ -193,7 +209,7 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
- summary: "Datasource `{{$labels.datasource}}` is in `{{$labels.state}}` state"
+ summary: "Datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the datasource
connectivity or configuration. It is recommended to investigate the
@@ -210,7 +226,7 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
- summary: "Knowledge `{{$labels.knowledge}}` is in `{{$labels.state}}` state"
+ summary: "Knowledge `{{ "{{" }} $labels.knowledge {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the knowledge
configuration. It is recommended to investigate the
@@ -226,7 +242,7 @@ groups:
severity: warning
support_group: workload-management
annotations:
- summary: "Some decisions are in error state for operator `{{$labels.operator}}`"
+ summary: "Some decisions are in error state for operator `{{ "{{" }} $labels.operator {{ "}}" }}`"
description: >
The cortex scheduling pipeline generated decisions that are in error state.
This may indicate issues with the decision logic or the underlying infrastructure.
@@ -243,7 +259,7 @@ groups:
severity: warning
support_group: workload-management
annotations:
- summary: "Too many decisions are in waiting state for operator `{{$labels.operator}}`"
+ summary: "Too many decisions are in waiting state for operator `{{ "{{" }} $labels.operator {{ "}}" }}`"
description: >
The cortex scheduling pipeline has a high number of decisions for which
no target host has been assigned yet.
@@ -264,7 +280,7 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
- summary: "KPI `{{$labels.kpi}}` is in `{{$labels.state}}` state"
+ summary: "KPI `{{ "{{" }} $labels.kpi {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the KPI
configuration. It is recommended to investigate the
@@ -281,12 +297,13 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
- summary: "Pipeline `{{$labels.pipeline}}` is in `{{$labels.state}}` state"
+ summary: "Pipeline `{{ "{{" }} $labels.pipeline {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the pipeline
configuration. It is recommended to investigate the
pipeline status and logs for more details.
+ {{- if .Values.kvm.enabled }}
- alert: CortexNovaDoesntFindValidKVMHosts
expr: sum by (az, hvtype) (increase(cortex_vm_faults{hvtype=~"CH|QEMU",faultmsg=~".*No valid host was found.*",faultmsg!~".*No such host.*"}[5m])) > 0
for: 5m
@@ -300,10 +317,11 @@ groups:
annotations:
summary: "Nova scheduling cannot find valid KVM hosts"
description: >
- Cortex is seeing new faulty vms in `{{$labels.az}}` where Nova scheduling
- failed to find a valid `{{$labels.hvtype}}` host. This may indicate
+ Cortex is seeing new faulty vms in `{{ "{{" }} $labels.az {{ "}}" }}` where Nova scheduling
+ failed to find a valid `{{ "{{" }} $labels.hvtype {{ "}}" }}` host. This may indicate
capacity issues, misconfigured filters, or resource constraints in the
datacenter. Investigate the affected VMs and hypervisor availability.
+ {{- end }}
- alert: CortexNovaNewDatasourcesNotReconciling
expr: count by(datasource) (cortex_datasource_seconds_until_reconcile{queued="false",domain="nova"}) > 0
@@ -316,9 +334,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "New datasource `{{$labels.datasource}}` has not reconciled"
+ summary: "New datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` has not reconciled"
description: >
- A new datasource `{{$labels.datasource}}` has been added but has not
+ A new datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` has been added but has not
completed its first reconciliation yet. This may indicate issues with
the datasource controller's workqueue overprioritizing other datasources.
@@ -335,9 +353,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "Existing datasource `{{$labels.datasource}}` is lacking behind"
+ summary: "Existing datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` is lacking behind"
description: >
- An existing datasource `{{$labels.datasource}}` has been queued for
+ An existing datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` has been queued for
reconciliation for more than 10 minutes. This may indicate issues with
the datasource controller's workqueue or that this or another datasource
is taking an unusually long time to reconcile.
@@ -365,7 +383,7 @@ groups:
- alert: CortexNovaReconcileDurationHigher10Min
expr: |
(sum by (controller) (rate(controller_runtime_reconcile_time_seconds_sum{service="cortex-nova-metrics"}[5m])))
- / (sum by (controller) (rate(controller_runtime_reconcile_time_seconds_count{service="cortex-nova-metrics"}[5m]))) > 600
+ / (sum by (controller) (rate(controller_runtime_reconcile_time_seconds_count{service="cortex-nova-metrics"}[5m]))) > {{ .Values.alerts.thresholds.reconcileDurationSeconds }}
for: 15m
labels:
context: controller-duration
@@ -375,8 +393,8 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/reconciles
annotations:
- summary: "Controller reconciliation takes longer than ({{ $value | humanizeDuration }})"
- description: "Reconcile duration higher than 10m while reconciling {{ $labels.controller }}"
+ summary: "Controller reconciliation takes longer than ({{ "{{" }} $value | humanizeDuration {{ "}}" }})"
+ description: "Reconcile duration higher than 10m while reconciling {{ "{{" }} $labels.controller {{ "}}" }}"
- alert: CortexNovaWorkqueueNotDrained
expr: |
@@ -390,9 +408,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "Controller {{ $labels.name }}'s backlog is not being drained."
+ summary: "Controller {{ "{{" }} $labels.name {{ "}}" }}'s backlog is not being drained."
description: >
- The workqueue for controller {{ $labels.name }} has a backlog that is
+ The workqueue for controller {{ "{{" }} $labels.name {{ "}}" }} has a backlog that is
not being drained. This may indicate that the controller is overwhelmed
with work or is stuck on certain resources. Check the controller logs
and the state of the resources it manages for more details.
@@ -408,9 +426,9 @@ groups:
severity: warning
support_group: workload-management
annotations:
- summary: "Controller webhook {{ $labels.webhook }} latency is high"
+ summary: "Controller webhook {{ "{{" }} $labels.webhook {{ "}}" }} latency is high"
description: >
- The latency for webhook {{ $labels.webhook }} is higher than expected (p90 > 200ms).
+ The latency for webhook {{ "{{" }} $labels.webhook {{ "}}" }} is higher than expected (p90 > 200ms).
This may indicate performance issues with the webhook server or the logic it executes.
Check the webhook server logs and monitor its resource usage for more insights.
@@ -426,9 +444,9 @@ groups:
severity: warning
support_group: workload-management
annotations:
- summary: "Controller webhook {{ $labels.webhook }} is experiencing errors"
+ summary: "Controller webhook {{ "{{" }} $labels.webhook {{ "}}" }} is experiencing errors"
description: >
- The webhook {{ $labels.webhook }} has experienced errors in the last 5 minutes.
+ The webhook {{ "{{" }} $labels.webhook {{ "}}" }} has experienced errors in the last 5 minutes.
This may indicate issues with the webhook logic, connectivity problems, or
external factors causing failures. Check the webhook server logs for error
details and investigate the affected resources.
@@ -489,7 +507,7 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/committed-resource-performance
annotations:
- summary: "Committed Resource rejection rate too high ({{ $value | humanizePercentage }})"
+ summary: "Committed Resource rejection rate too high ({{ "{{" }} $value | humanizePercentage {{ "}}" }})"
description: >
More than 30% of commitment changes have been rejected over the last 15 minutes.
This may indicate insufficient capacity to fulfill new commitments. Rejected
@@ -563,10 +581,10 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/committed-resource-capacity
annotations:
- summary: "Committed Resource capacity for {{ $labels.resource }} in {{ $labels.az }} dropped to zero"
+ summary: "Committed Resource capacity for {{ "{{" }} $labels.resource {{ "}}" }} in {{ "{{" }} $labels.az {{ "}}" }} dropped to zero"
description: >
- The reported capacity for committed resource {{ $labels.resource }} in
- availability zone {{ $labels.az }} has dropped from a positive value to zero.
+ The reported capacity for committed resource {{ "{{" }} $labels.resource {{ "}}" }} in
+ availability zone {{ "{{" }} $labels.az {{ "}}" }} has dropped from a positive value to zero.
This may mean hypervisors in that AZ are fully utilized for the corresponding
flavor group and no further committed resources can be placed there.
@@ -607,3 +625,4 @@ groups:
The committed resource quota API (Limes LIQUID integration) is returning
HTTP 5xx errors. This indicates internal problems computing or applying
quota. Limes may not be able to enforce committed resource quotas.
+{{- end }}
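To make the Style-B escaping and the value gating in the diff above concrete, here is a sketch of how three of the changed lines should render. It assumes `kvm.enabled=false` and `alerts.thresholds.highMemoryMiB=6000`; both settings are illustrative, not taken from the PR. The three keys belong to different alerts and are shown side by side only for comparison.

```yaml
# Template source, copied from the diff above:
#   severity: {{ if .Values.kvm.enabled }}critical{{ else }}warning{{ end }}
#   summary: "`{{ "{{" }} $labels.component {{ "}}" }}` uses too much memory"
#   expr: process_resident_memory_bytes{...} > {{ .Values.alerts.thresholds.highMemoryMiB }} * 1024 * 1024
#
# Rendered by Helm under the assumed values. Helm emits the quoted "{{" and "}}"
# as literal braces, so Prometheus receives its own template directive intact.
severity: warning
summary: "`{{ $labels.component }}` uses too much memory"
expr: process_resident_memory_bytes{service="cortex-nova-metrics"} > 6000 * 1024 * 1024
```

With `kvm.enabled=true`, the first line would render as `severity: critical`, and the `CortexNovaDoesntFindValidKVMHosts` rule would be included as well.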
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm/bundles/cortex-manila/templates/alerts.yaml`:
- Around line 236-251: The alert definition for CortexManilaPipelineUnready has
the wrong context label; update the labels block in the
CortexManilaPipelineUnready alert (alert name: CortexManilaPipelineUnready,
expr: cortex_pipeline_state{domain="manila",state!="ready"}) to change context:
kpis to context: pipelines so the alert is correctly categorized under pipeline
alerts.
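The requested change can be sketched as follows. Only the alert name, the expression, and the `context` label come from the review comment; the other fields of the rule are not shown in this conversation and are omitted here.

```yaml
# Sketch of the suggested fix in helm/bundles/cortex-manila/templates/alerts.yaml.
- alert: CortexManilaPipelineUnready
  expr: cortex_pipeline_state{domain="manila",state!="ready"}
  labels:
    context: pipelines   # was: kpis
    # remaining labels and annotations unchanged (omitted in this sketch)
```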
📒 Files selected for processing (12)
- docs/reservations/committed-resource-reservations.md
- helm/bundles/cortex-cinder/alerts/cinder.alerts.yaml
- helm/bundles/cortex-cinder/templates/alerts.yaml
- helm/bundles/cortex-ironcore/alerts/ironcore.alerts.yaml
- helm/bundles/cortex-ironcore/templates/alerts.yaml
- helm/bundles/cortex-manila/alerts/manila.alerts.yaml
- helm/bundles/cortex-manila/templates/alerts.yaml
- helm/bundles/cortex-nova/alerts/nova.alerts.yaml
- helm/bundles/cortex-nova/templates/alerts.yaml
- helm/bundles/cortex-nova/values.yaml
- helm/bundles/cortex-placement-shim/alerts/placement-shim.alerts.yaml
- helm/bundles/cortex-placement-shim/templates/alerts.yaml
💤 Files with no reviewable changes (6)
- helm/bundles/cortex-placement-shim/alerts/placement-shim.alerts.yaml
- helm/bundles/cortex-ironcore/alerts/ironcore.alerts.yaml
- helm/bundles/cortex-manila/alerts/manila.alerts.yaml
- helm/bundles/cortex-cinder/alerts/cinder.alerts.yaml
- helm/bundles/cortex-nova/alerts/nova.alerts.yaml
- helm/bundles/cortex-ironcore/templates/alerts.yaml
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/check-alerts.yaml (1)
18-18: Optional (low value): disable credential persistence on checkout.
This is a read-only lint job that never pushes, so setting `persist-credentials: false` avoids leaving the `GITHUB_TOKEN` in the local git config (the `artipacked` finding from zizmor).
Proposed change:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v6
+        with:
+          persist-credentials: false
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/check-alerts.yaml:
- Line 41: The workflow uses a mutable tag for the Promtool action (uses:
peimanja/promtool-github-actions@v0.0.2); replace that with the immutable commit
SHA suggested (uses:
peimanja/promtool-github-actions@741be6fd6b8ee6a1d777ea020076b70c6233b3a1 #
v0.0.2) so the action is pinned to a specific commit and cannot be changed by
retagging—update the uses reference accordingly in the workflow step that
currently references peimanja/promtool-github-actions@v0.0.2.
---
Nitpick comments:
In @.github/workflows/check-alerts.yaml:
- Line 18: Update the checkout step that uses actions/checkout@v6 to disable
credential persistence by adding persist-credentials: false to its step
configuration; locate the GitHub Actions step referencing "uses:
actions/checkout@v6" and set the persist-credentials option to false so the
GITHUB_TOKEN is not stored in the local git config for this read-only lint job.
📒 Files selected for processing (1)
.github/workflows/check-alerts.yaml
Replace mutable tag reference with immutable commit SHA so the action cannot be changed by retagging. Verified via GitHub refs API that the SHA matches the v0.0.2 tag.
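A sketch of how the two workflow suggestions could look together. The action references and the commit SHA are taken from the review comments above. The surrounding step layout is an assumption, the inputs of the promtool step are omitted because they are not shown in this conversation, and the conversation does not say whether the optional `persist-credentials` nitpick was applied.

```yaml
# Hypothetical excerpt of .github/workflows/check-alerts.yaml (step layout assumed).
steps:
  - uses: actions/checkout@v6
    with:
      persist-credentials: false   # optional nitpick: keep GITHUB_TOKEN out of the local git config
  # Pinned to an immutable commit SHA; the trailing comment records the tag it corresponds to.
  - uses: peimanja/promtool-github-actions@741be6fd6b8ee6a1d777ea020076b70c6233b3a1 # v0.0.2
```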
Test Coverage Report
Test Coverage: 69.6%