Skip to content

Add IMDSv2 migration command for ROSA Classic clusters#909

Open
givaldolins wants to merge 4 commits into
openshift:masterfrom
givaldolins:imdsv2
Open

Add IMDSv2 migration command for ROSA Classic clusters#909
givaldolins wants to merge 4 commits into
openshift:masterfrom
givaldolins:imdsv2

Conversation

@givaldolins
Copy link
Copy Markdown
Contributor

@givaldolins givaldolins commented Jun 3, 2026

Implements automated IMDSv2 migration workflow with:

  • Pre-flight health checks (ClusterOperators, CPMS, nodes, etcd)
  • Hive MachinePool patching for infra/worker nodes
  • Sequential infra machine replacement with health validation
  • ControlPlaneMachineSet update for master node rollout
  • Post-migration validation

Supports selective migration via --nodes flag (all, master, infra, workers).

This MR automates SOP

Summary by CodeRabbit

  • New Features
    • Added osdctl cluster imdsv2 command to migrate ROSA Classic cluster nodes to require IMDSv2 authentication.
    • Includes comprehensive pre-flight validation, machine pool patching, sequential infrastructure replacement, and rollout monitoring.
    • Enforces AWS-only and non-HCP cluster requirements with confirmation prompts for destructive operations.

Implements automated IMDSv2 migration workflow with:
- Pre-flight health checks (ClusterOperators, CPMS, nodes, etcd)
- Hive MachinePool patching for infra/worker nodes
- Sequential infra machine replacement with health validation
- ControlPlaneMachineSet update for master node rollout
- Post-migration validation

Supports selective migration via --nodes flag (all, master, infra, workers).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

Warning

Review limit reached

@givaldolins, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 30 minutes and 54 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0a8a332d-6531-4a59-a150-60729aed1453

📥 Commits

Reviewing files that changed from the base of the PR and between 38748b6 and 07e230f.

📒 Files selected for processing (4)
  • cmd/cluster/imdsv2.go
  • docs/README.md
  • docs/osdctl_cluster.md
  • docs/osdctl_cluster_imdsv2.md

Walkthrough

This PR introduces osdctl cluster imdsv2, a new command that orchestrates AWS ROSA Classic cluster node migration to require IMDSv2. The implementation coordinates pre-flight validation, Hive MachinePool patching, infra machine replacement, ControlPlaneMachineSet updates, and final verification across multiple Kubernetes and OCM resources.

Changes

IMDSv2 Migration Command

Layer / File(s) Summary
Command definition and input validation
cmd/cluster/cmd.go, cmd/cluster/imdsv2.go (lines 1–104)
Registers newCmdIMDSv2() subcommand and defines the command with flags (--cluster-id, --reason, --nodes), the imdsv2Options struct, and validation ensuring supported node roles and valid cluster key format.
Client initialization and orchestration
cmd/cluster/imdsv2.go (lines 105–271)
Establishes OCM and Kubernetes clients (including elevated admin client), enforces AWS-classic-only and non-HCP constraints, conditionally initializes Hive clients, and orchestrates the migration sequence across MachinePool patching, infra replacement, CPMS update, and final validation.
Pre-flight safety validation
cmd/cluster/imdsv2.go (lines 273–396)
Verifies cluster operator health via oc get clusteroperators -o json, confirms ControlPlaneMachineSet is Active with 3 ready replicas, validates 3 master nodes are Ready, checks infra node readiness when applicable, and ensures exactly 3 etcd pods are running before mutation.
Hive MachinePool patching and cleanup
cmd/cluster/imdsv2.go (lines 398–480, 631–664, 822–830)
Lists and patches MachinePools (infra/workers) to require IMDSv2 in AWS metadata, adds override annotation for platform spec changes, includes helper to resolve Hive namespace from OCM environment and cluster ID, and provides cleanup to remove the annotation post-migration.
Infra machine sequential replacement and polling
cmd/cluster/imdsv2.go (lines 482–629)
Iteratively deletes infra Machines and polls for Running replacements (excluding original names), waits for infra nodes to reach expected Ready count, re-checks cluster operator health between replacements, and implements timeout/interval-based retry logic.
ControlPlaneMachineSet migration
cmd/cluster/imdsv2.go (lines 666–755)
Unmarshals CPMS AWS provider spec, skips if already configured, prompts for user confirmation, sets metadata authentication to Required, patches via admin client, and monitors rollout until updated and ready replicas match expected counts.
Final validation and verification
cmd/cluster/imdsv2.go (lines 756–820)
Lists all Nodes and fails if any are not Ready, examines each Machine's provider spec to confirm IMDSv2 enforcement, and reports warnings when machines are not configured (e.g., customer-managed workers).

Sequence Diagram

sequenceDiagram
  participant User as User/CLI
  participant Cmd as osdctl imdsv2
  participant OCM as OCM API
  participant K8s as Kubernetes API
  participant Hive as Hive<br/>(MachinePools)
  
  User->>Cmd: cluster imdsv2 --cluster-id=... --nodes=...
  Cmd->>OCM: Fetch cluster environment and ID
  Cmd->>K8s: Pre-flight checks (operators, CPMS, nodes, etcd)
  K8s-->>Cmd: Health status
  
  rect rgba(100, 200, 150, 0.5)
  Cmd->>Hive: List MachinePools in Hive namespace
  Hive-->>Cmd: MachinePool specs
  Cmd->>Hive: Patch selected pools for IMDSv2
  end
  
  rect rgba(150, 150, 200, 0.5)
  Cmd->>K8s: Delete infra Machines (sequential)
  K8s-->>Cmd: Machines deleted
  Cmd->>K8s: Poll for replacement Machines Running
  K8s-->>Cmd: Replacements ready
  Cmd->>K8s: Wait for infra nodes Ready
  Cmd->>K8s: Verify cluster operator health
  end
  
  rect rgba(200, 150, 100, 0.5)
  Cmd->>K8s: Get ControlPlaneMachineSet
  Cmd->>Cmd: User confirmation (destructive)
  Cmd->>K8s: Patch CPMS for IMDSv2
  Cmd->>K8s: Monitor CPMS rollout
  K8s-->>Cmd: Rollout complete
  end
  
  Cmd->>K8s: Final validation (nodes Ready, machines IMDSv2)
  K8s-->>Cmd: Migration verified
  Cmd->>User: Success/warnings
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error Line 352 logs raw output from "oc get clusteroperators" command in error messages, potentially exposing internal API details, pod configurations, or other sensitive cluster data. Remove raw command output from error message at line 352; log only sanitized error details (e.g., "oc command failed: {error}" without the JSON output containing cluster internals).
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Topology-Aware Scheduling Compatibility ⚠️ Warning Code assumes exactly 3 control-plane nodes (line 305: != 3) and 3 etcd pods (line 338: != 3) without checking topology via ControlPlaneTopology. No topology-aware validation present. Add topology check via infrastructure.Status.ControlPlaneTopology to validate 3-node HA assumption before proceeding, or document that command only supports HA ROSA Classic clusters.
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding a new IMDSv2 migration command for ROSA Classic clusters, which is the primary focus of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR does not introduce any Ginkgo tests; custom check for stable test names is not applicable as only implementation code was added.
Test Structure And Quality ✅ Passed No Ginkgo test code changes included in this PR. The PR adds 830 lines of implementation code in imdsv2.go and one line in cmd.go, but no test file (imdsv2_test.go) was created or modified.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are introduced in this PR. The changes are CLI command implementations (Cobra commands) in cmd/cluster/, not test code. Check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR does not add any Ginkgo e2e tests. The changes are CLI command implementations in cmd/cluster/ using Cobra framework, not test code. The SNO compatibility check only applies to new e2e tests.
Ote Binary Stdout Contract ✅ Passed osdctl is a standalone SRE CLI tool, not an OTE test binary that communicates with openshift-tests. The OTE contract check is inapplicable to this repository.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds CLI command implementations, not Ginkgo e2e tests. Files are in cmd/cluster/ package with no _test.go suffix and contain no Ginkgo test patterns (It, Describe, Context, etc.).
No-Weak-Crypto ✅ Passed No weak cryptography found. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB usage, custom crypto, or non-constant-time secret comparisons detected in code.
Container-Privileges ✅ Passed PR adds Go CLI code with no container/K8s manifests, Dockerfiles, or privileged container configurations to assess.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from devppratik and petrkotas June 3, 2026 20:35
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: givaldolins
Once this PR has been reviewed and has the lgtm label, please assign tafhim for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/cluster/imdsv2.go (1)

348-395: ⚖️ Poor tradeoff

Consider using the Kubernetes client instead of shelling out to oc.

Shelling out to oc introduces a dependency on the CLI being installed and configured. The kubeconfig context used by oc may differ from the client initialized via backplane in init(). Consider using the o.client to fetch ClusterOperator resources directly via the config.openshift.io/v1 API group.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/cluster/imdsv2.go` around lines 348 - 395, The checkClusterOperators
function currently shells out to `oc` to list ClusterOperators; replace that
with direct API calls using the existing o.client so the same
kubeconfig/backplane client is used. Use the OpenShift config v1 ClusterOperator
type (e.g., configv1.ClusterOperatorList) or a dynamic client list against
"config.openshift.io/v1" to fetch all ClusterOperator resources with
o.client.List(ctx, &list) (or equivalent), then iterate list.Items and inspect
each item.Status.Conditions exactly as done now to build unhealthyOps, returning
errors on client.List or unmarshalling failures as appropriate; keep the same
final behavior/message and error formatting but remove the exec.CommandContext
call and JSON unmarshal logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/cluster/imdsv2.go`:
- Around line 728-754: The function monitorCPMSRollout creates a timeout context
from context.Background(), breaking cancellation propagation; change the context
creation to derive from the passed ctx (use context.WithTimeout(ctx,
imdsv2RolloutPollTimeout) and keep cancel deferred) so parent cancellations
(SIGINT or upstream cancellation) interrupt the polling loop; update the context
creation in monitorCPMSRollout where pollCtx and cancel are defined (and
continue to pass pollCtx into wait.PollUntilContextTimeout as before).

---

Nitpick comments:
In `@cmd/cluster/imdsv2.go`:
- Around line 348-395: The checkClusterOperators function currently shells out
to `oc` to list ClusterOperators; replace that with direct API calls using the
existing o.client so the same kubeconfig/backplane client is used. Use the
OpenShift config v1 ClusterOperator type (e.g., configv1.ClusterOperatorList) or
a dynamic client list against "config.openshift.io/v1" to fetch all
ClusterOperator resources with o.client.List(ctx, &list) (or equivalent), then
iterate list.Items and inspect each item.Status.Conditions exactly as done now
to build unhealthyOps, returning errors on client.List or unmarshalling failures
as appropriate; keep the same final behavior/message and error formatting but
remove the exec.CommandContext call and JSON unmarshal logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 993b8d49-1d85-43e7-9f2d-d04b1a1555f0

📥 Commits

Reviewing files that changed from the base of the PR and between 7117673 and 38748b6.

📒 Files selected for processing (2)
  • cmd/cluster/cmd.go
  • cmd/cluster/imdsv2.go

Comment thread cmd/cluster/imdsv2.go
givaldolins and others added 2 commits June 3, 2026 14:55
- Fix monitorCPMSRollout to derive timeout context from parent ctx
  instead of context.Background(), allowing proper cancellation
  propagation (SIGINT/upstream cancellation)
- Replace checkClusterOperators oc shell-out with direct API calls
  using configv1.ClusterOperatorList and existing o.client
- Register configv1 scheme in init() for ClusterOperator resources
- Remove unused os/exec import

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@givaldolins
Copy link
Copy Markdown
Contributor Author

/retest

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

@givaldolins: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant