Skip to content

OCPSTRAT-3150: Add etcd sharding by resource kind support#8705

Open
jhjaggars wants to merge 5 commits into
openshift:mainfrom
jhjaggars:etcd-sharding
Open

OCPSTRAT-3150: Add etcd sharding by resource kind support#8705
jhjaggars wants to merge 5 commits into
openshift:mainfrom
jhjaggars:etcd-sharding

Conversation

@jhjaggars

@jhjaggars jhjaggars commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Implements etcd sharding by resource kind for HyperShift, as described in openshift/enhancements#1979.

Each etcd shard is registered as an independent ControlPlaneComponent using the CPO component framework. KAS is configured with --etcd-servers-overrides to route resources to the appropriate shard.

Key features:

  • Declarative shard configuration via spec.etcd.managed.shards and spec.etcd.unmanaged.shards
  • Per-shard storage (PersistentVolume or EmptyDir), replicas (1 or 3), and scheduling
  • Feature-gated behind EtcdSharding (TechPreviewNoUpgrade)
  • Full CEL validation: resource overlap prevention, immutability, DNS format validation
  • Per-shard TLS certificates, ServiceMonitors, PDBs
  • Both managed and unmanaged etcd support

API types aligned with enhancement proposal including:

  • EtcdShardResource with +listType=map composite keys
  • Per-field CEL immutability on shard specs
  • Explicit Replicas field (enum 1|3) per shard
  • Cross-shard resource overlap CEL validation

Summary by CodeRabbit

  • New Features

    • Etcd sharding: per-resource routing, per-shard storage, scheduling, per-shard control-plane components, per-shard TLS secrets, and feature gate.
    • Kube API server: per-shard etcd endpoint overrides and startup wait for shard DNS.
    • Etcd defrag: configurable leader-election ID.
  • Chores

    • Added local Dockerfile and updated .gitignore to ignore build outputs.
    • Broadened Prometheus etcd recording rules.
    • Component asset-directory override and related defaults/scheduling refinements.
  • Tests

    • Added unit tests for shard derivation and naming logic.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 9, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

Copy link
Copy Markdown

@jhjaggars: This pull request explicitly references no jira issue.

Details

In response to this:

Implements etcd sharding by resource kind for HyperShift, as described in openshift/enhancements#1979.

Each etcd shard is registered as an independent ControlPlaneComponent using the CPO component framework. KAS is configured with --etcd-servers-overrides to route resources to the appropriate shard.

Key features:

  • Declarative shard configuration via spec.etcd.managed.shards and spec.etcd.unmanaged.shards
  • Per-shard storage (PersistentVolume or EmptyDir), replicas (1 or 3), and scheduling
  • Feature-gated behind EtcdSharding (TechPreviewNoUpgrade)
  • Full CEL validation: resource overlap prevention, immutability, DNS format validation
  • Per-shard TLS certificates, ServiceMonitors, PDBs
  • Both managed and unmanaged etcd support

API types aligned with enhancement proposal including:

  • EtcdShardResource with +listType=map composite keys
  • Per-field CEL immutability on shard specs
  • Explicit Replicas field (enum 1|3) per shard
  • Cross-shard resource overlap CEL validation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request implements EtcdSharding: a feature gate and manifest entry; new hostedcluster API types and CEL validations for managed/unmanaged shards; support package to compute effective shards and service names with tests; per-shard control-plane components (StatefulSet, Services, ServiceMonitor, PDB) and PKI helpers with controller reconciliation; KAS wiring for etcd-servers-overrides and wait-for-etcd updates; asset-directory builder and component changes; and ancillary Dockerfile, .gitignore, and Prometheus rule updates.

Sequence Diagram(s)

sequenceDiagram
  participant HostedController
  participant PKI
  participant ShardComponent
  participant EtcdStatefulSet
  participant KubeAPIServer
  HostedController->>PKI: ReconcileEtcdShardServerSecret(ns, shard)
  HostedController->>PKI: ReconcileEtcdShardPeerSecret(ns, shard)
  HostedController->>ShardComponent: NewShardComponent(shard) / apply manifests
  ShardComponent->>EtcdStatefulSet: adaptStatefulSetForShard (replicas, TLS, init/reset, storage, scheduling)
  KubeAPIServer->>EtcdStatefulSet: connect (via etcd-servers-overrides)
Loading

Suggested reviewers

  • sdminonne
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Topology-Aware Scheduling Compatibility ⚠️ Warning Shard Replicas field allows 1 or 3 without topology awareness. No validation prevents Replicas=3 on 2-node clusters; no anti-affinity prevents 3 pods pending on limited nodes. Add CEL validation to reject Replicas=3 on SingleReplica/DualReplica topologies, or require anti-affinity + appropriate PDB constraints.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All 40 test cases use stable, deterministic names with no dynamic content. No UUIDs, timestamps, or generated identifiers present. All tests use standard Go testing, not Ginkgo.
Test Structure And Quality ✅ Passed PR tests use standard Go testing, not Ginkgo; all test files show good quality with table-driven tests, clear assertion messages, and repository pattern consistency.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added in this PR. Three standard Go unit tests were added, not Ginkgo e2e tests, so the IPv6/disconnected network check does not apply.
No-Weak-Crypto ✅ Passed PR adds etcd sharding with TLS certificate generation but uses only standard Go crypto libraries and existing PKI helpers; no weak algorithms, custom crypto, or insecure comparisons detected.
Container-Privileges ✅ Passed PR does not introduce container privilege escalation: no privileged: true, hostPID/Network/IPC, SYS_ADMIN, or allowPrivilegeEscalation flags in any manifests.
No-Sensitive-Data-In-Logs ✅ Passed No explicit logging of sensitive data. EtcdServersOverrides contain internal hostnames/namespace but are not logged; only generic config errors logged.
Title check ✅ Passed The title clearly and specifically describes the main change: adding etcd sharding by resource kind support, which is the central feature implemented across the entire changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Jun 9, 2026
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jhjaggars
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 9, 2026 18:07 Inactive
@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 75.82418% with 110 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.71%. Comparing base (4755e9c) to head (0fdea90).

Files with missing lines Patch % Lines
...perator/controllers/hostedcontrolplane/pki/etcd.go 0.00% 23 Missing ⚠️
...ostedcontrolplane/hostedcontrolplane_controller.go 20.83% 17 Missing and 2 partials ⚠️
...or/controllers/hostedcontrolplane/v2/etcd/shard.go 93.90% 12 Missing and 3 partials ⚠️
...trollers/hostedcontrolplane/v2/etcd/statefulset.go 25.00% 12 Missing ⚠️
...t/controlplane-component/controlplane-component.go 31.25% 10 Missing and 1 partial ⚠️
...ontrollers/hostedcontrolplane/v2/kas/deployment.go 0.00% 9 Missing ⚠️
support/etcd/shards.go 89.28% 6 Missing and 3 partials ⚠️
...or/controllers/hostedcontrolplane/manifests/pki.go 0.00% 4 Missing ⚠️
...or/controllers/hostedcontrolplane/v2/kas/config.go 0.00% 2 Missing and 1 partial ⚠️
support/controlplane-component/builder.go 0.00% 3 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8705      +/-   ##
==========================================
+ Coverage   41.54%   41.71%   +0.16%     
==========================================
  Files         758      760       +2     
  Lines       93838    94277     +439     
==========================================
+ Hits        38986    39328     +342     
- Misses      52107    52194      +87     
- Partials     2745     2755      +10     
Files with missing lines Coverage Δ
...or/controllers/hostedcontrolplane/v2/kas/params.go 90.00% <100.00%> (+1.20%) ⬆️
hypershift-operator/featuregate/feature.go 81.81% <100.00%> (+0.86%) ⬆️
support/controlplane-component/defaults.go 65.84% <100.00%> (ø)
support/controlplane-component/status.go 72.28% <100.00%> (+0.33%) ⬆️
etcd-defrag/main.go 0.00% <0.00%> (ø)
...or/controllers/hostedcontrolplane/v2/kas/config.go 88.94% <0.00%> (-0.73%) ⬇️
support/controlplane-component/builder.go 50.00% <0.00%> (-2.00%) ⬇️
...or/controllers/hostedcontrolplane/manifests/pki.go 0.00% <0.00%> (ø)
...ontrollers/hostedcontrolplane/v2/kas/deployment.go 26.16% <0.00%> (-0.63%) ⬇️
support/etcd/shards.go 89.28% <89.28%> (ø)
... and 5 more

... and 1 file with indirect coverage changes

Flag Coverage Δ
cmd-support 35.10% <79.27%> (+0.14%) ⬆️
cpo-hostedcontrolplane 44.28% <75.07%> (+0.68%) ⬆️
cpo-other 43.17% <ø> (ø)
hypershift-operator 51.62% <100.00%> (+<0.01%) ⬆️
other 31.56% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.gitignore:
- Around line 62-64: The .gitignore entry "*.egg_build-amd64/" mismatches the
actual build directory pattern used by the Dockerfile (which uses
"_build-${TARGETARCH}/" -> "_build-amd64/"); update the ignore list so amd64
artifacts are ignored consistently by replacing or adding the pattern
"_build-amd64/" (alongside the existing "_build-arm64/" and "_build/") so all
architecture-specific build dirs match the Dockerfile copy source.

In `@Dockerfile.local`:
- Around line 1-18: The Dockerfile currently leaves the container running as
root (ENTRYPOINT ["/usr/bin/hypershift"]); add a non-root user and switch to it
with a USER instruction before ENTRYPOINT. Create/ensure a runtime user (e.g.,
adduser or groupadd/useradd or use a fixed UID like 1000), chown or adjust
permissions on the copied binaries in /usr/bin (hypershift, hcp,
hypershift-operator, karpenter-operator, control-plane-operator,
control-plane-pki-operator) so the non-root user can execute them, and then add
USER <non-root-user-or-uid> immediately before the ENTRYPOINT to comply with the
non-root requirement.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 40924115-0d68-416c-9094-7fc72f13c8ed

📥 Commits

Reviewing files that changed from the base of the PR and between 832b848 and 7db3f5e.

⛔ Files ignored due to path filters (22)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yaml is excluded by !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/EtcdSharding.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/EtcdSharding.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/etcdshardresource.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/etcdshardschedulingspec.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/managedetcdshardpersistentvolumespec.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/managedetcdshardspec.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/managedetcdshardstoragespec.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/managedetcdspec.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/unmanagedetcdshardspec.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/unmanagedetcdspec.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/payload-manifests/featuregates/featureGate-Hypershift-TechPreviewNoUpgrade.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (16)
  • .gitignore
  • Dockerfile.local
  • api/hypershift/v1beta1/featuregates/featureGate-Hypershift-TechPreviewNoUpgrade.yaml
  • api/hypershift/v1beta1/hostedcluster_types.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-apiserver/prometheus-recording-rules.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go
  • support/controlplane-component/builder.go
  • support/controlplane-component/controlplane-component.go
  • support/controlplane-component/defaults.go
  • support/controlplane-component/status.go
  • support/etcd/shards.go

Comment thread .gitignore Outdated
Comment thread Dockerfile.local Outdated
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 9, 2026 19:26 Inactive

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (1)

245-250: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Shard component registration is static and misses runtime shard spec updates.

registerComponents is only called during setup, so r.components is frozen from startup shard config. If spec.etcd.managed.shards changes later, new shard components won’t reconcile (and removed shards may continue) until CPO restarts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
around lines 245 - 250, registerComponents currently builds r.components once at
setup using hcp.Spec.Etcd.Managed.Shards, so runtime changes to
spec.etcd.managed.shards are ignored; instead, rebuild or extend the component
list at reconcile time. Update the logic so Reconcile (or the reconcile loop
that iterates r.components) computes the etcd shard components from the current
hcp.Spec.Etcd.Managed.Shards by calling etcdv2.NewShardComponent(shard) for each
shard and merging those into the component list used for that reconciliation
(rather than relying on the startup-populated r.components from
registerComponents); ensure you only mutate a local slice or replace
r.components atomically so removed shards stop being reconciled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 245-250: registerComponents currently builds r.components once at
setup using hcp.Spec.Etcd.Managed.Shards, so runtime changes to
spec.etcd.managed.shards are ignored; instead, rebuild or extend the component
list at reconcile time. Update the logic so Reconcile (or the reconcile loop
that iterates r.components) computes the etcd shard components from the current
hcp.Spec.Etcd.Managed.Shards by calling etcdv2.NewShardComponent(shard) for each
shard and merging those into the component list used for that reconciliation
(rather than relying on the startup-populated r.components from
registerComponents); ensure you only mutate a local slice or replace
r.components atomically so removed shards stop being reconciled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: df4ec306-22e9-43bd-8a0c-4914e8ed2cf6

📥 Commits

Reviewing files that changed from the base of the PR and between 7db3f5e and e0b7962.

⛔ Files ignored due to path filters (3)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • client/applyconfiguration/hypershift/v1beta1/unmanagedetcdshardspec.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/tests/hostedclusters.hypershift.openshift.io/featuregated.hostedclusters.etcdsharding.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
📒 Files selected for processing (12)
  • .gitignore
  • Dockerfile.local
  • api/hypershift/v1beta1/hostedcluster_types.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/params_test.go
  • hypershift-operator/featuregate/feature.go
  • support/etcd/shards.go
  • support/etcd/shards_test.go
💤 Files with no reviewable changes (1)
  • support/etcd/shards.go
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (3)
  • Dockerfile.local
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.go
  • api/hypershift/v1beta1/hostedcluster_types.go

@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 9, 2026 20:04 Inactive
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 9, 2026 20:27 Inactive
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 9, 2026 20:49 Inactive

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/etcd_test.go (1)

76-136: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add test coverage for non-empty leaderElectionID.

The test only covers the empty leaderElectionID case. The new conditional logic at lines 169-171 in statefulset.go that appends --leader-election-id when non-empty is untested.

💚 Suggested additional test case
{
	name:      "When called with a leaderElectionID, it should pass the leader-election-id as an arg",
	namespace: "test-namespace",
	leaderElectionID: "etcd-shard-foo-defrag",
	validate: func(g Gomega, c corev1.Container) {
		g.Expect(c.Args).To(Equal([]string{
			"etcd-defrag-controller",
			"--namespace",
			"test-namespace",
			"--leader-election-id",
			"etcd-shard-foo-defrag",
		}))
	},
},

This requires adding leaderElectionID string to the test case struct and updating the call:

-			c := buildEtcdDefragControllerContainer(tc.namespace, "")
+			c := buildEtcdDefragControllerContainer(tc.namespace, tc.leaderElectionID)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/etcd_test.go`
around lines 76 - 136, The test TestBuildEtcdDefragControllerContainer only
validates the empty leaderElectionID path; add a test case that supplies a
non-empty leaderElectionID and asserts the container args include
"--leader-election-id" and the provided ID. Update the test case struct in
TestBuildEtcdDefragControllerContainer to include leaderElectionID string, pass
that value into the call to buildEtcdDefragControllerContainer(namespace,
leaderElectionID), and add the suggested case (name: "...leaderElectionID...",
leaderElectionID: "etcd-shard-foo-defrag") that expects Args to equal
["etcd-defrag-controller","--namespace","test-namespace","--leader-election-id","etcd-shard-foo-defrag"]
so the conditional branch in statefulset.go is covered.
🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go (1)

102-114: ⚡ Quick win

Extract duplicate scheduling logic into a shared helper.

This code duplicates adaptShardScheduling in shard.go:329-341 (from the relevant code snippets). Both apply NodeSelector and Tolerations identically to a StatefulSet pod template. Extract a shared helper to avoid divergence.

♻️ Suggested refactor

Create a shared helper (e.g., in this file or a common location):

func applySchedulingToStatefulSet(sts *appsv1.StatefulSet, nodeSelector map[string]string, tolerations []corev1.Toleration) {
	if len(nodeSelector) > 0 {
		if sts.Spec.Template.Spec.NodeSelector == nil {
			sts.Spec.Template.Spec.NodeSelector = map[string]string{}
		}
		for k, v := range nodeSelector {
			sts.Spec.Template.Spec.NodeSelector[k] = v
		}
	}
	if len(tolerations) > 0 {
		sts.Spec.Template.Spec.Tolerations = append(sts.Spec.Template.Spec.Tolerations, tolerations...)
	}
}

Then use it in both places:

-	if managedEtcdSpec != nil {
-		if len(managedEtcdSpec.Scheduling.NodeSelector) > 0 {
-			if sts.Spec.Template.Spec.NodeSelector == nil {
-				sts.Spec.Template.Spec.NodeSelector = map[string]string{}
-			}
-			for k, v := range managedEtcdSpec.Scheduling.NodeSelector {
-				sts.Spec.Template.Spec.NodeSelector[k] = v
-			}
-		}
-		if len(managedEtcdSpec.Scheduling.Tolerations) > 0 {
-			sts.Spec.Template.Spec.Tolerations = append(sts.Spec.Template.Spec.Tolerations, managedEtcdSpec.Scheduling.Tolerations...)
-		}
-	}
+	if managedEtcdSpec != nil {
+		applySchedulingToStatefulSet(sts, managedEtcdSpec.Scheduling.NodeSelector, managedEtcdSpec.Scheduling.Tolerations)
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`
around lines 102 - 114, The scheduling merge logic duplicated between the
StatefulSet block and adaptShardScheduling should be factored into a single
helper; add a function named applySchedulingToStatefulSet that accepts
(*appsv1.StatefulSet, nodeSelector map[string]string, tolerations
[]corev1.Toleration) and moves the NodeSelector merge and Tolerations append
logic into it, then replace the inline block in the StatefulSet creation (the
code handling managedEtcdSpec.Scheduling.NodeSelector/Tolerations) and the calls
in adaptShardScheduling to call applySchedulingToStatefulSet(sts,
managedEtcdSpec.Scheduling.NodeSelector, managedEtcdSpec.Scheduling.Tolerations)
so both sites reuse the same implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/etcd_test.go`:
- Around line 76-136: The test TestBuildEtcdDefragControllerContainer only
validates the empty leaderElectionID path; add a test case that supplies a
non-empty leaderElectionID and asserts the container args include
"--leader-election-id" and the provided ID. Update the test case struct in
TestBuildEtcdDefragControllerContainer to include leaderElectionID string, pass
that value into the call to buildEtcdDefragControllerContainer(namespace,
leaderElectionID), and add the suggested case (name: "...leaderElectionID...",
leaderElectionID: "etcd-shard-foo-defrag") that expects Args to equal
["etcd-defrag-controller","--namespace","test-namespace","--leader-election-id","etcd-shard-foo-defrag"]
so the conditional branch in statefulset.go is covered.

---

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`:
- Around line 102-114: The scheduling merge logic duplicated between the
StatefulSet block and adaptShardScheduling should be factored into a single
helper; add a function named applySchedulingToStatefulSet that accepts
(*appsv1.StatefulSet, nodeSelector map[string]string, tolerations
[]corev1.Toleration) and moves the NodeSelector merge and Tolerations append
logic into it, then replace the inline block in the StatefulSet creation (the
code handling managedEtcdSpec.Scheduling.NodeSelector/Tolerations) and the calls
in adaptShardScheduling to call applySchedulingToStatefulSet(sts,
managedEtcdSpec.Scheduling.NodeSelector, managedEtcdSpec.Scheduling.Tolerations)
so both sites reuse the same implementation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b94f399c-88b4-420f-8838-81a9b6394891

📥 Commits

Reviewing files that changed from the base of the PR and between e27c482 and 7b4aaf1.

📒 Files selected for processing (4)
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/etcd_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go
  • etcd-defrag/main.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.go

@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 9, 2026 21:01 Inactive
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 9, 2026 21:49 Inactive
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 9, 2026 21:54 Inactive
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 10, 2026 00:03 Inactive
@jhjaggars jhjaggars changed the title NO-JIRA: Add etcd sharding by resource kind support OCPSTRAT-3150: Add etcd sharding by resource kind support Jun 10, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 10, 2026

Copy link
Copy Markdown

@jhjaggars: This pull request references OCPSTRAT-3150 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature to target either version "5.0." or "openshift-5.0.", but it targets "openshift-5.1" instead.

Details

In response to this:

Implements etcd sharding by resource kind for HyperShift, as described in openshift/enhancements#1979.

Each etcd shard is registered as an independent ControlPlaneComponent using the CPO component framework. KAS is configured with --etcd-servers-overrides to route resources to the appropriate shard.

Key features:

  • Declarative shard configuration via spec.etcd.managed.shards and spec.etcd.unmanaged.shards
  • Per-shard storage (PersistentVolume or EmptyDir), replicas (1 or 3), and scheduling
  • Feature-gated behind EtcdSharding (TechPreviewNoUpgrade)
  • Full CEL validation: resource overlap prevention, immutability, DNS format validation
  • Per-shard TLS certificates, ServiceMonitors, PDBs
  • Both managed and unmanaged etcd support

API types aligned with enhancement proposal including:

  • EtcdShardResource with +listType=map composite keys
  • Per-field CEL immutability on shard specs
  • Explicit Replicas field (enum 1|3) per shard
  • Cross-shard resource overlap CEL validation

Summary by CodeRabbit

  • New Features

  • Etcd sharding: per-resource routing, per-shard storage, scheduling, per-shard control-plane components, per-shard TLS secrets, and feature gate.

  • Kube API server: per-shard etcd endpoint overrides and startup wait for shard DNS.

  • Etcd defrag: configurable leader-election ID.

  • Chores

  • Added local Dockerfile and updated .gitignore to ignore build outputs.

  • Broadened Prometheus etcd recording rules.

  • Component asset-directory override and related defaults/scheduling refinements.

  • Tests

  • Added unit tests for shard derivation and naming logic.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 10, 2026 01:03 Inactive
@jhjaggars jhjaggars force-pushed the etcd-sharding branch 2 times, most recently from 188ab63 to 78d1b97 Compare June 10, 2026 01:17
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 10, 2026 01:22 Inactive
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8705 June 10, 2026 12:47 Inactive
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown

The "needs update" messages come from git update-index --refresh — this is the first command that runs and it detects that the on-disk EtcdSharding.yaml files were modified by make generate update and no longer match the index. The shell runs with set -e (-e {0} in the bash options), but git update-index --refresh actually exits 0 even when printing "needs update". The actual exit code 1 comes from one of the subsequent commands (git diff-index, git diff-files, or git diff --exit-code), which detects the modification.

I now have everything needed for the report.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/EtcdSharding.yaml: needs update
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/EtcdSharding.yaml: needs update
##[error]Process completed with exit code 1.

Summary

The Verify job runs make generate update followed by a dirty-tree check (git diff --exit-code). The codegen step regenerated the two feature-gated CRD manifests for EtcdSharding and the result differs from what is committed in the PR. The PR branch is 3 commits behind main — specifically, it was branched from 832b8480 (CNTRLPLANE-3022, June 9) and is missing the subsequent CRD-affecting merge 4755e9c9 (CNTRLPLANE-3250/CNTRLPLANE-430: API-driven Azure topology and private connectivity, merged June 10 10:24 UTC). The EtcdSharding.yaml files in the PR were generated against a pre-CNTRLPLANE-3250 API base and are now stale.

Root Cause

The PR branch (etcd-sharding) is 3 commits behind main. Its merge base is commit 832b8480 (CNTRLPLANE-3022, June 9 16:34 UTC), but main has since received at least one CRD-affecting merge:

  • 4755e9c9CNTRLPLANE-3250,CNTRLPLANE-430: API-driven Azure topology and private connectivity (Phase 1) — merged June 10 10:24 UTC. This PR regenerated CRD manifests (including the base HostedCluster and HostedControlPlane CRDs) by adding new Azure Swift/topology fields.

When the Verify workflow runs make generate update, it first deletes the entire zz_generated.featuregated-crd-manifests/ directory, then regenerates all feature-gated CRD variants from the current API types. The generated EtcdSharding.yaml files are full CRD manifests for HostedCluster and HostedControlPlane — they include all fields, not just the EtcdSharding-specific ones. Because the PR was generated before the Azure Swift fields were added to the base types, the regenerated CRDs now include those new fields while the committed versions do not.

The dirty-tree check (git update-index --refreshgit diff --exit-code HEAD) detects this difference and fails the job with exit code 1.

Recommendations
  1. Rebase the PR branch onto the latest main and re-run code generation:

    git fetch origin main
    git rebase origin/main
    make generate update
    git add -A
    git commit -m "chore: regenerate CRDs after rebase on main"
  2. Push the updated branch — this will automatically re-trigger the Verify workflow.

  3. No code changes are needed — the PR's functional code (API types, feature gate definitions, etcd shard logic) is not at fault. Only the generated CRD manifests need to be refreshed to incorporate upstream changes from main.

Evidence
Evidence Detail
Failing step git update-index --refresh / git diff --exit-code HEAD -- in the dirty-tree check after make generate update
Stale files api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/EtcdSharding.yaml and …/hostedcontrolplanes.hypershift.openshift.io/EtcdSharding.yaml
PR merge base 832b8480 (CNTRLPLANE-3022, June 9 16:34 UTC)
Missing upstream commit 4755e9c9 (CNTRLPLANE-3250/CNTRLPLANE-430: Azure topology, June 10 10:24 UTC) — regenerated CRDs with new Azure Swift fields
Branch status 6 commits ahead, 3 commits behind main
Codegen behavior make generate update deletes zz_generated.featuregated-crd-manifests/ then regenerates — full CRD manifests include all API fields, not just feature-specific ones
Workflow reference verify-reusable.yaml@main — uses main branch workflow definition, runs against PR branch code

jhjaggars and others added 5 commits June 10, 2026 09:44
- Add EtcdSharding feature gate (TechPreviewNoUpgrade)
- Add EtcdShardResource, ManagedEtcdShardSpec, ManagedEtcdShardStorageSpec,
  EtcdShardSchedulingSpec, and UnmanagedEtcdShardSpec types
- Add Shards field to ManagedEtcdSpec and UnmanagedEtcdSpec
- Add CEL validation for shard resource routing (built-in types only)
- Regenerate deepcopy and featuregated CRD manifests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ndor

- Regenerate apply configurations for new etcd shard types
- Update CRD manifests for CustomNoUpgrade and TechPreviewNoUpgrade
- Add envtest suite for EtcdSharding feature gate validation
- Vendor updated API types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…POv2 framework

- Register EtcdSharding feature gate
- Add WithAssetDir() to component builder for shared asset directories
- Support etcd-prefixed components in priorityClass, DefaultReplicas,
  FSGroup, and checkDependencies
- Add EffectiveShards() helper to compute shard list from HCP spec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add NewShardComponent() for shard etcd StatefulSet instances
- Add per-shard TLS certificate generation in PKI controller
- Add --etcd-servers-overrides KAS flag for shard routing
- Update wait-for-etcd to check all shard DNS entries
- Update prometheus recording rules to match all etcd shards
- Isolate etcd-defrag to skip shard members
- Register shard components conditionally in HCP reconciler

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Document EtcdShardResource, ManagedEtcdShardSpec, and related types
- Document built-in-only routing limitation for shard resources

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jhjaggars jhjaggars marked this pull request as ready for review June 10, 2026 14:18
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2026
@openshift-ci openshift-ci Bot requested review from Nirshal and enxebre June 10, 2026 14:20
@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@jhjaggars: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants