OCPSTRAT-3150: Add etcd sharding by resource kind support#8705
OCPSTRAT-3150: Add etcd sharding by resource kind support#8705jhjaggars wants to merge 5 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@jhjaggars: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request implements EtcdSharding: a feature gate and manifest entry; new hostedcluster API types and CEL validations for managed/unmanaged shards; support package to compute effective shards and service names with tests; per-shard control-plane components (StatefulSet, Services, ServiceMonitor, PDB) and PKI helpers with controller reconciliation; KAS wiring for etcd-servers-overrides and wait-for-etcd updates; asset-directory builder and component changes; and ancillary Dockerfile, .gitignore, and Prometheus rule updates. Sequence Diagram(s)sequenceDiagram
participant HostedController
participant PKI
participant ShardComponent
participant EtcdStatefulSet
participant KubeAPIServer
HostedController->>PKI: ReconcileEtcdShardServerSecret(ns, shard)
HostedController->>PKI: ReconcileEtcdShardPeerSecret(ns, shard)
HostedController->>ShardComponent: NewShardComponent(shard) / apply manifests
ShardComponent->>EtcdStatefulSet: adaptStatefulSetForShard (replicas, TLS, init/reset, storage, scheduling)
KubeAPIServer->>EtcdStatefulSet: connect (via etcd-servers-overrides)
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jhjaggars The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.gitignore:
- Around line 62-64: The .gitignore entry "*.egg_build-amd64/" mismatches the
actual build directory pattern used by the Dockerfile (which uses
"_build-${TARGETARCH}/" -> "_build-amd64/"); update the ignore list so amd64
artifacts are ignored consistently by replacing or adding the pattern
"_build-amd64/" (alongside the existing "_build-arm64/" and "_build/") so all
architecture-specific build dirs match the Dockerfile copy source.
In `@Dockerfile.local`:
- Around line 1-18: The Dockerfile currently leaves the container running as
root (ENTRYPOINT ["/usr/bin/hypershift"]); add a non-root user and switch to it
with a USER instruction before ENTRYPOINT. Create/ensure a runtime user (e.g.,
adduser or groupadd/useradd or use a fixed UID like 1000), chown or adjust
permissions on the copied binaries in /usr/bin (hypershift, hcp,
hypershift-operator, karpenter-operator, control-plane-operator,
control-plane-pki-operator) so the non-root user can execute them, and then add
USER <non-root-user-or-uid> immediately before the ENTRYPOINT to comply with the
non-root requirement.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 40924115-0d68-416c-9094-7fc72f13c8ed
⛔ Files ignored due to path filters (22)
api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!**/zz_generated*.go,!**/zz_generated*api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yamlis excluded by!**/zz_generated*api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/EtcdSharding.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/EtcdSharding.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**client/applyconfiguration/hypershift/v1beta1/etcdshardresource.gois excluded by!client/**client/applyconfiguration/hypershift/v1beta1/etcdshardschedulingspec.gois excluded by!client/**client/applyconfiguration/hypershift/v1beta1/managedetcdshardpersistentvolumespec.gois excluded by!client/**client/applyconfiguration/hypershift/v1beta1/managedetcdshardspec.gois excluded by!client/**client/applyconfiguration/hypershift/v1beta1/managedetcdshardstoragespec.gois excluded by!client/**client/applyconfiguration/hypershift/v1beta1/managedetcdspec.gois excluded by!client/**client/applyconfiguration/hypershift/v1beta1/unmanagedetcdshardspec.gois excluded by!client/**client/applyconfiguration/hypershift/v1beta1/unmanagedetcdspec.gois excluded by!client/**client/applyconfiguration/utils.gois excluded by!client/**cmd/install/assets/crds/hypershift-operator/payload-manifests/featuregates/featureGate-Hypershift-TechPreviewNoUpgrade.yamlis excluded by!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamldocs/content/reference/aggregated-docs.mdis excluded by!docs/content/reference/aggregated-docs.mddocs/content/reference/api.mdis excluded by!docs/content/reference/api.mdvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!vendor/**,!**/vendor/**,!**/zz_generated*.go,!**/zz_generated*
📒 Files selected for processing (16)
.gitignoreDockerfile.localapi/hypershift/v1beta1/featuregates/featureGate-Hypershift-TechPreviewNoUpgrade.yamlapi/hypershift/v1beta1/hostedcluster_types.gocontrol-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/etcd.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-apiserver/prometheus-recording-rules.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/config.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/params.gosupport/controlplane-component/builder.gosupport/controlplane-component/controlplane-component.gosupport/controlplane-component/defaults.gosupport/controlplane-component/status.gosupport/etcd/shards.go
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (1)
245-250:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftShard component registration is static and misses runtime shard spec updates.
registerComponentsis only called during setup, sor.componentsis frozen from startup shard config. Ifspec.etcd.managed.shardschanges later, new shard components won’t reconcile (and removed shards may continue) until CPO restarts.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go` around lines 245 - 250, registerComponents currently builds r.components once at setup using hcp.Spec.Etcd.Managed.Shards, so runtime changes to spec.etcd.managed.shards are ignored; instead, rebuild or extend the component list at reconcile time. Update the logic so Reconcile (or the reconcile loop that iterates r.components) computes the etcd shard components from the current hcp.Spec.Etcd.Managed.Shards by calling etcdv2.NewShardComponent(shard) for each shard and merging those into the component list used for that reconciliation (rather than relying on the startup-populated r.components from registerComponents); ensure you only mutate a local slice or replace r.components atomically so removed shards stop being reconciled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 245-250: registerComponents currently builds r.components once at
setup using hcp.Spec.Etcd.Managed.Shards, so runtime changes to
spec.etcd.managed.shards are ignored; instead, rebuild or extend the component
list at reconcile time. Update the logic so Reconcile (or the reconcile loop
that iterates r.components) computes the etcd shard components from the current
hcp.Spec.Etcd.Managed.Shards by calling etcdv2.NewShardComponent(shard) for each
shard and merging those into the component list used for that reconciliation
(rather than relying on the startup-populated r.components from
registerComponents); ensure you only mutate a local slice or replace
r.components atomically so removed shards stop being reconciled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: df4ec306-22e9-43bd-8a0c-4914e8ed2cf6
⛔ Files ignored due to path filters (3)
api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!**/zz_generated*.go,!**/zz_generated*client/applyconfiguration/hypershift/v1beta1/unmanagedetcdshardspec.gois excluded by!client/**cmd/install/assets/crds/hypershift-operator/tests/hostedclusters.hypershift.openshift.io/featuregated.hostedclusters.etcdsharding.testsuite.yamlis excluded by!cmd/install/assets/**/*.yaml
📒 Files selected for processing (12)
.gitignoreDockerfile.localapi/hypershift/v1beta1/hostedcluster_types.gocontrol-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/controllers/hostedcontrolplane/manifests/pki.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/params.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/params_test.gohypershift-operator/featuregate/feature.gosupport/etcd/shards.gosupport/etcd/shards_test.go
💤 Files with no reviewable changes (1)
- support/etcd/shards.go
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (3)
- Dockerfile.local
- control-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.go
- api/hypershift/v1beta1/hostedcluster_types.go
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/etcd_test.go (1)
76-136:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd test coverage for non-empty
leaderElectionID.The test only covers the empty
leaderElectionIDcase. The new conditional logic at lines 169-171 instatefulset.gothat appends--leader-election-idwhen non-empty is untested.💚 Suggested additional test case
{ name: "When called with a leaderElectionID, it should pass the leader-election-id as an arg", namespace: "test-namespace", leaderElectionID: "etcd-shard-foo-defrag", validate: func(g Gomega, c corev1.Container) { g.Expect(c.Args).To(Equal([]string{ "etcd-defrag-controller", "--namespace", "test-namespace", "--leader-election-id", "etcd-shard-foo-defrag", })) }, },This requires adding
leaderElectionID stringto the test case struct and updating the call:- c := buildEtcdDefragControllerContainer(tc.namespace, "") + c := buildEtcdDefragControllerContainer(tc.namespace, tc.leaderElectionID)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/etcd_test.go` around lines 76 - 136, The test TestBuildEtcdDefragControllerContainer only validates the empty leaderElectionID path; add a test case that supplies a non-empty leaderElectionID and asserts the container args include "--leader-election-id" and the provided ID. Update the test case struct in TestBuildEtcdDefragControllerContainer to include leaderElectionID string, pass that value into the call to buildEtcdDefragControllerContainer(namespace, leaderElectionID), and add the suggested case (name: "...leaderElectionID...", leaderElectionID: "etcd-shard-foo-defrag") that expects Args to equal ["etcd-defrag-controller","--namespace","test-namespace","--leader-election-id","etcd-shard-foo-defrag"] so the conditional branch in statefulset.go is covered.
🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go (1)
102-114: ⚡ Quick winExtract duplicate scheduling logic into a shared helper.
This code duplicates
adaptShardSchedulinginshard.go:329-341(from the relevant code snippets). Both applyNodeSelectorandTolerationsidentically to a StatefulSet pod template. Extract a shared helper to avoid divergence.♻️ Suggested refactor
Create a shared helper (e.g., in this file or a common location):
func applySchedulingToStatefulSet(sts *appsv1.StatefulSet, nodeSelector map[string]string, tolerations []corev1.Toleration) { if len(nodeSelector) > 0 { if sts.Spec.Template.Spec.NodeSelector == nil { sts.Spec.Template.Spec.NodeSelector = map[string]string{} } for k, v := range nodeSelector { sts.Spec.Template.Spec.NodeSelector[k] = v } } if len(tolerations) > 0 { sts.Spec.Template.Spec.Tolerations = append(sts.Spec.Template.Spec.Tolerations, tolerations...) } }Then use it in both places:
- if managedEtcdSpec != nil { - if len(managedEtcdSpec.Scheduling.NodeSelector) > 0 { - if sts.Spec.Template.Spec.NodeSelector == nil { - sts.Spec.Template.Spec.NodeSelector = map[string]string{} - } - for k, v := range managedEtcdSpec.Scheduling.NodeSelector { - sts.Spec.Template.Spec.NodeSelector[k] = v - } - } - if len(managedEtcdSpec.Scheduling.Tolerations) > 0 { - sts.Spec.Template.Spec.Tolerations = append(sts.Spec.Template.Spec.Tolerations, managedEtcdSpec.Scheduling.Tolerations...) - } - } + if managedEtcdSpec != nil { + applySchedulingToStatefulSet(sts, managedEtcdSpec.Scheduling.NodeSelector, managedEtcdSpec.Scheduling.Tolerations) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go` around lines 102 - 114, The scheduling merge logic duplicated between the StatefulSet block and adaptShardScheduling should be factored into a single helper; add a function named applySchedulingToStatefulSet that accepts (*appsv1.StatefulSet, nodeSelector map[string]string, tolerations []corev1.Toleration) and moves the NodeSelector merge and Tolerations append logic into it, then replace the inline block in the StatefulSet creation (the code handling managedEtcdSpec.Scheduling.NodeSelector/Tolerations) and the calls in adaptShardScheduling to call applySchedulingToStatefulSet(sts, managedEtcdSpec.Scheduling.NodeSelector, managedEtcdSpec.Scheduling.Tolerations) so both sites reuse the same implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/etcd_test.go`:
- Around line 76-136: The test TestBuildEtcdDefragControllerContainer only
validates the empty leaderElectionID path; add a test case that supplies a
non-empty leaderElectionID and asserts the container args include
"--leader-election-id" and the provided ID. Update the test case struct in
TestBuildEtcdDefragControllerContainer to include leaderElectionID string, pass
that value into the call to buildEtcdDefragControllerContainer(namespace,
leaderElectionID), and add the suggested case (name: "...leaderElectionID...",
leaderElectionID: "etcd-shard-foo-defrag") that expects Args to equal
["etcd-defrag-controller","--namespace","test-namespace","--leader-election-id","etcd-shard-foo-defrag"]
so the conditional branch in statefulset.go is covered.
---
Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`:
- Around line 102-114: The scheduling merge logic duplicated between the
StatefulSet block and adaptShardScheduling should be factored into a single
helper; add a function named applySchedulingToStatefulSet that accepts
(*appsv1.StatefulSet, nodeSelector map[string]string, tolerations
[]corev1.Toleration) and moves the NodeSelector merge and Tolerations append
logic into it, then replace the inline block in the StatefulSet creation (the
code handling managedEtcdSpec.Scheduling.NodeSelector/Tolerations) and the calls
in adaptShardScheduling to call applySchedulingToStatefulSet(sts,
managedEtcdSpec.Scheduling.NodeSelector, managedEtcdSpec.Scheduling.Tolerations)
so both sites reuse the same implementation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b94f399c-88b4-420f-8838-81a9b6394891
📒 Files selected for processing (4)
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/etcd_test.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.goetcd-defrag/main.go
🚧 Files skipped from review as they are similar to previous changes (1)
- control-plane-operator/controllers/hostedcontrolplane/v2/etcd/shard.go
|
@jhjaggars: This pull request references OCPSTRAT-3150 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature to target either version "5.0." or "openshift-5.0.", but it targets "openshift-5.1" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
188ab63 to
78d1b97
Compare
|
The "needs update" messages come from I now have everything needed for the report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Verify job runs Root CauseThe PR branch (
When the Verify workflow runs The dirty-tree check ( Recommendations
Evidence
|
- Add EtcdSharding feature gate (TechPreviewNoUpgrade) - Add EtcdShardResource, ManagedEtcdShardSpec, ManagedEtcdShardStorageSpec, EtcdShardSchedulingSpec, and UnmanagedEtcdShardSpec types - Add Shards field to ManagedEtcdSpec and UnmanagedEtcdSpec - Add CEL validation for shard resource routing (built-in types only) - Regenerate deepcopy and featuregated CRD manifests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ndor - Regenerate apply configurations for new etcd shard types - Update CRD manifests for CustomNoUpgrade and TechPreviewNoUpgrade - Add envtest suite for EtcdSharding feature gate validation - Vendor updated API types Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…POv2 framework - Register EtcdSharding feature gate - Add WithAssetDir() to component builder for shared asset directories - Support etcd-prefixed components in priorityClass, DefaultReplicas, FSGroup, and checkDependencies - Add EffectiveShards() helper to compute shard list from HCP spec Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add NewShardComponent() for shard etcd StatefulSet instances - Add per-shard TLS certificate generation in PKI controller - Add --etcd-servers-overrides KAS flag for shard routing - Update wait-for-etcd to check all shard DNS entries - Update prometheus recording rules to match all etcd shards - Isolate etcd-defrag to skip shard members - Register shard components conditionally in HCP reconciler Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Document EtcdShardResource, ManagedEtcdShardSpec, and related types - Document built-in-only routing limitation for shard resources Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@jhjaggars: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Implements etcd sharding by resource kind for HyperShift, as described in openshift/enhancements#1979.
Each etcd shard is registered as an independent ControlPlaneComponent using the CPO component framework. KAS is configured with
--etcd-servers-overridesto route resources to the appropriate shard.Key features:
spec.etcd.managed.shardsandspec.etcd.unmanaged.shardsEtcdSharding(TechPreviewNoUpgrade)API types aligned with enhancement proposal including:
EtcdShardResourcewith+listType=mapcomposite keysReplicasfield (enum 1|3) per shardSummary by CodeRabbit
New Features
Chores
Tests