
OCPBUGS-77557: propagate additionalTrustBundle to AWS control plane components #7907

Open: sdminonne wants to merge 2 commits into openshift:main from sdminonne:OCPBUGS-77557

Conversation

sdminonne commented Mar 10, 2026
Contributor

Summary

  • Add DeploymentAddAWSCABundleVolume helper that creates a combined CA bundle (system + user CAs) via an init container and sets AWS_CA_BUNDLE on the main container
  • Wire trust bundle propagation into all AWS control plane components when AdditionalTrustBundle is set on the HostedControlPlane spec:
    • aws-cloud-controller-manager
    • capi-provider
    • ingress-operator
    • karpenter
    • karpenter-operator
    • aws-node-termination-handler
    • kube-apiserver AWS KMS sidecars (aws-kms-active, aws-kms-backup) when SecretEncryption.KMS.Provider is AWS
  • Add unit tests for all components and an e2e test for aws-cloud-controller-manager

Problem

In isolated AWS environments (e.g., US-ISO regions), custom CA bundles specified via HostedCluster.Spec.AdditionalTrustBundle are not propagated to AWS control plane components. This causes TLS verification failures when these components call AWS API endpoints:

Post https://sts.us-iso-east-1.c2s.ic.gov: tls: failed to verify certificate:
x509: certificate signed by unknown authority

Why not reuse DeploymentAddTrustBundleVolume?

The existing helper mounts a ConfigMap as a directory at /etc/pki/tls/certs, which replaces the entire system CA directory. This works for in-house components (CPO, ignition-server, OAPI) whose TLS needs are tightly controlled. However, the affected components are binaries that make HTTPS calls to standard AWS service endpoints (EC2, ELB, STS, SQS, KMS). The AWS SDK's default HTTP client loads the system CA store from /etc/pki/tls/certs to verify TLS certificates. Replacing that directory with a ConfigMap containing only the custom CA would cause the binary to lose the public root CAs (e.g., Amazon Trust Services), breaking connectivity to standard AWS API endpoints.

Why AWS_CA_BUNDLE with a combined bundle?

The AWS SDK (both v1 and v2) reads AWS_CA_BUNDLE and uses it instead of the system CA bundle — it creates a new empty x509.CertPool and loads only the specified file. To avoid losing trust in standard AWS endpoints, an init container concatenates the system CAs (/etc/pki/tls/certs/ca-bundle.crt) with the user-provided CAs from additionalTrustBundle into a single combined PEM file. AWS_CA_BUNDLE points to this combined file, ensuring the AWS SDK trusts both system and custom CAs.
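The replace-versus-append distinction can be illustrated with the Go standard library alone. The following is a minimal sketch, not the SDK's or the PR's code: it builds a fresh `x509.CertPool` from a single PEM file (the behavior described above) and shows that only a concatenated bundle trusts both roots. The helper names `newSelfSignedCA`, `combineBundles`, and `trusts` are hypothetical, and the two throwaway CAs merely stand in for a system root and a private CA.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// newSelfSignedCA returns a throwaway self-signed CA, parsed and PEM-encoded.
// It only exists to give the sketch something to verify.
func newSelfSignedCA(commonName string) (*x509.Certificate, []byte) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: commonName},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert, pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

// combineBundles concatenates PEM bundles, making sure each one ends with a
// newline so adjacent PEM blocks do not run together.
func combineBundles(bundles ...[]byte) []byte {
	var out bytes.Buffer
	for _, b := range bundles {
		out.Write(b)
		if len(b) > 0 && b[len(b)-1] != '\n' {
			out.WriteByte('\n')
		}
	}
	return out.Bytes()
}

// trusts reports whether cert verifies against a new, empty pool that is
// loaded only from bundlePEM (no system roots are consulted).
func trusts(bundlePEM []byte, cert *x509.Certificate) bool {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(bundlePEM) {
		return false
	}
	_, err := cert.Verify(x509.VerifyOptions{Roots: pool})
	return err == nil
}

func main() {
	systemCert, systemPEM := newSelfSignedCA("stand-in system root")
	userCert, userPEM := newSelfSignedCA("stand-in private CA")

	fmt.Println("user-only bundle trusts system root:", trusts(userPEM, systemCert))
	combined := combineBundles(systemPEM, userPEM)
	fmt.Println("combined bundle trusts system root:", trusts(combined, systemCert))
	fmt.Println("combined bundle trusts private CA:", trusts(combined, userCert))
}
```

In the PR itself the concatenation is done by an init container with `cat`, not in Go; the sketch only shows why pointing AWS_CA_BUNDLE at the user bundle alone would not be enough.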

KAS KMS sidecars

When secret encryption uses AWS KMS (SecretEncryption.KMS.Provider == AWS), the aws-kms-active and aws-kms-backup sidecar containers in the kube-apiserver deployment also need access to the combined CA bundle. These sidecars call AWS KMS endpoints to encrypt/decrypt data encryption keys. The aws-kms-token-minter sidecar is intentionally excluded as it does not make AWS API calls.
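The selective wiring described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `Container`, `EnvVar`, and `VolumeMount` are simplified stand-ins for the Kubernetes `corev1` types, `updateContainer` is a hypothetical helper (it does not claim to match `podspec.UpdateContainer`), and the mount path and file name are made up. Only the container names, the `aws-ca-bundle` volume name, and the `AWS_CA_BUNDLE` variable come from the PR text.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types; only the fields used here.
type EnvVar struct{ Name, Value string }
type VolumeMount struct{ Name, MountPath string }
type Container struct {
	Name         string
	Env          []EnvVar
	VolumeMounts []VolumeMount
}

const (
	caVolumeName = "aws-ca-bundle"      // volume name from the PR walkthrough
	caMountPath  = "/etc/aws-ca-bundle" // hypothetical mount path
	caFileName   = "ca-bundle.crt"      // hypothetical file name
)

// updateContainer applies update to the container called name and reports
// whether such a container exists. A missing container is a no-op.
func updateContainer(name string, containers []Container, update func(*Container)) bool {
	for i := range containers {
		if containers[i].Name == name {
			update(&containers[i])
			return true
		}
	}
	return false
}

// wireCABundle mounts the combined bundle and points AWS_CA_BUNDLE at it.
func wireCABundle(c *Container) {
	c.VolumeMounts = append(c.VolumeMounts, VolumeMount{Name: caVolumeName, MountPath: caMountPath})
	c.Env = append(c.Env, EnvVar{Name: "AWS_CA_BUNDLE", Value: caMountPath + "/" + caFileName})
}

func main() {
	// No aws-kms-backup container here, to show the absent-container case.
	containers := []Container{
		{Name: "kube-apiserver"},
		{Name: "aws-kms-active"},
		{Name: "aws-kms-token-minter"},
	}
	for _, name := range []string{"aws-kms-active", "aws-kms-backup"} {
		fmt.Printf("%s wired: %v\n", name, updateContainer(name, containers, wireCABundle))
	}
}
```

Looking containers up by name keeps aws-kms-token-minter and kube-apiserver untouched, and a deployment without a backup container is simply skipped.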

Test plan

  • Unit tests verify volume, init container, mount, and env var presence when AdditionalTrustBundle is set
  • Unit tests verify no volume/env var when AdditionalTrustBundle is nil
  • Unit tests verify non-AWS platforms are unaffected (capi-provider, ingress-operator, karpenter-operator)
  • Unit tests verify aws-kms-active and aws-kms-backup get volume mount and AWS_CA_BUNDLE env var
  • Unit tests verify aws-kms-token-minter is not wired
  • Unit tests verify no KMS wiring when KMS containers are absent
  • E2E test verifies AWS_CA_BUNDLE wiring on aws-cloud-controller-manager
  • make test passes
  • make verify passes

Fixes: https://issues.redhat.com/browse/OCPBUGS-77557

🤖 Generated with Claude Code

@openshift-ci-robot


Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai Bot commented Mar 10, 2026
Contributor

Walkthrough

Adds support for wiring an AWS CA bundle into deployments when a HostedControlPlane has Spec.AdditionalTrustBundle set and the platform is AWS. A new utility, DeploymentAddAWSCABundleVolume, constructs user/system CA volumes, an init container to produce a combined bundle, mounts it into main containers, and sets AWS_CA_BUNDLE. Multiple hosted control plane components now call this utility during their deployment adaptation; tests and e2e checks were added to validate presence/absence of volumes, mounts, init containers, and the AWS_CA_BUNDLE env var.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant HCP as HostedControlPlane
participant Controller as Control Plane Operator
participant Util as support/util
participant Deployment as Kubernetes Deployment
participant KubeAPI as Kubernetes API
HCP->>Controller: Reconcile / adaptDeployment invoked
Controller->>Controller: check Platform == AWS && AdditionalTrustBundle != nil
Controller->>Util: DeploymentAddAWSCABundleVolume(trustBundleConfigMap, deployment, initImage)
Util->>Deployment: add user-ca ConfigMap volume
Util->>Deployment: add aws-ca-bundle EmptyDir + init container (concat CAs)
Util->>Deployment: add volumeMount to main container + set AWS_CA_BUNDLE env
Controller->>KubeAPI: apply updated Deployment
KubeAPI-->>Deployment: Deployment updated/applied
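The steps in the diagram can be sketched in Go as follows. This is a simplified illustration, not the actual `DeploymentAddAWSCABundleVolume`: the types are minimal stand-ins for `corev1`, `addAWSCABundle` is a hypothetical name, and the mount paths and the `ca-bundle.crt` ConfigMap key are assumptions. The volume names, the `setup-aws-ca-bundle` init container name, and the system bundle path are taken from the PR text.

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes pod types; only the fields used here.
type EnvVar struct{ Name, Value string }
type VolumeMount struct{ Name, MountPath string }
type Container struct {
	Name, Image  string
	Command      []string
	Env          []EnvVar
	VolumeMounts []VolumeMount
}
type Volume struct {
	Name      string
	ConfigMap string // backing ConfigMap name; empty means an EmptyDir in this sketch
}
type PodSpec struct {
	Volumes        []Volume
	InitContainers []Container
	Containers     []Container
}

const (
	systemBundle = "/etc/pki/tls/certs/ca-bundle.crt" // path named in the PR description
	userCADir    = "/etc/user-ca"                     // hypothetical mount path
	combinedDir  = "/etc/aws-ca-bundle"               // hypothetical mount path
	bundleFile   = "ca-bundle.crt"                    // assumed ConfigMap key and output file name
)

// addAWSCABundle adds the user CA volume, the EmptyDir for the combined
// bundle, an init container that concatenates both bundles, and then mounts
// the result into every main container with AWS_CA_BUNDLE set.
func addAWSCABundle(pod *PodSpec, trustBundleConfigMap, initImage string) {
	pod.Volumes = append(pod.Volumes,
		Volume{Name: "user-ca", ConfigMap: trustBundleConfigMap},
		Volume{Name: "aws-ca-bundle"},
	)
	combined := combinedDir + "/" + bundleFile
	script := fmt.Sprintf("cat %s %s/%s > %s", systemBundle, userCADir, bundleFile, combined)
	pod.InitContainers = append(pod.InitContainers, Container{
		Name:    "setup-aws-ca-bundle",
		Image:   initImage,
		Command: []string{"/bin/sh", "-c", script},
		VolumeMounts: []VolumeMount{
			{Name: "user-ca", MountPath: userCADir},
			{Name: "aws-ca-bundle", MountPath: combinedDir},
		},
	})
	for i := range pod.Containers {
		c := &pod.Containers[i]
		c.VolumeMounts = append(c.VolumeMounts, VolumeMount{Name: "aws-ca-bundle", MountPath: combinedDir})
		c.Env = append(c.Env, EnvVar{Name: "AWS_CA_BUNDLE", Value: combined})
	}
}

func main() {
	pod := PodSpec{Containers: []Container{{Name: "aws-cloud-controller-manager"}}}
	addAWSCABundle(&pod, "user-ca-bundle", "example.invalid/cpo:latest")
	fmt.Println(pod.InitContainers[0].Command[2])
	fmt.Println(pod.Containers[0].Env[0].Name, "=", pod.Containers[0].Env[0].Value)
}
```

The sketch omits the security context and resource requests that the commit message says the real init container sets.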

Changes

Cohort / File(s) | Summary
Utility: support/util/volumes.go | Adds DeploymentAddAWSCABundleVolume(...) to add user CA ConfigMap volume, aws-ca-bundle EmptyDir, a setup init-container to combine CAs, mount the combined bundle into containers, and set AWS_CA_BUNDLE.
CAPI Provider: control-plane-operator/controllers/hostedcontrolplane/v2/capi_provider/deployment.go and test | Calls DeploymentAddAWSCABundleVolume when platform is AWS and AdditionalTrustBundle is present; adds tests validating volumes, mounts, init container, and AWS_CA_BUNDLE.
AWS Cloud Controller Manager: control-plane-operator/controllers/hostedcontrolplane/v2/cloud_controller_manager/aws/component.go, deployment.go, tests | Registers and implements adaptDeployment to invoke DeploymentAddAWSCABundleVolume for AWS+AdditionalTrustBundle; adds tests asserting expected wiring.
Ingress Operator: control-plane-operator/controllers/hostedcontrolplane/v2/ingressoperator/deployment.go and test | Adds conditional call to DeploymentAddAWSCABundleVolume in adaptDeployment for AWS with AdditionalTrustBundle; tests added.
AWS Node Termination Handler: control-plane-operator/controllers/hostedcontrolplane/v2/awsnodeterminationhandler/deployment.go and test | Injects AWS CA bundle wiring when AdditionalTrustBundle exists; tests added to verify volumes, init container, mounts, and env var.
Karpenter & Karpenter Operator: control-plane-operator/controllers/hostedcontrolplane/v2/karpenter/deployment.go, karpenteroperator/deployment.go and tests | Adds conditional AWS CA bundle wiring in adaptDeployment for Karpenter and its operator; tests validate presence/absence of CA resources and AWS_CA_BUNDLE.
E2E test: test/e2e/nodepool_additionalTrustBundlePropagation_test.go | Adds runtime checks (AWS-only) verifying aws-cloud-controller-manager deployment has aws-ca-bundle EmptyDir, setup-aws-ca-bundle init container, and AWS_CA_BUNDLE env var present/absent across bundle add/remove scenarios.
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and concisely describes the main change: propagating additionalTrustBundle to AWS control plane components, which is the central theme of all modifications across multiple deployment files.
Description check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.


@openshift-ci openshift-ci Bot requested review from bryan-cox and enxebre March 10, 2026 15:48
@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform and removed do-not-merge/needs-area labels Mar 10, 2026
@sdminonne

Contributor Author

/assign @enxebre

enxebre commented Mar 10, 2026
Member

how about karpenter-aws and aws-node-termination-handler?

enxebre commented Mar 10, 2026
Member

"The AWS SDK (both v1 and v2) reads AWS_CA_BUNDLE and appends those CAs to the system cert pool."

That statement seems to contradict the docs at https://docs.aws.amazon.com/sdk-for-go/api/aws/session/ ("Custom CA Bundle" section):

"Path to a custom Credentials Authority (CA) bundle PEM file that the SDK will use instead of the default system's root CA bundle. Use this only if you want to replace the CA bundle the SDK uses for TLS requests."

@sdminonne sdminonne changed the title fix(cpo): propagate additionalTrustBundle to AWS control plane components fix(OCPBUGS-77557): propagate additionalTrustBundle to AWS control plane components Mar 11, 2026
@sdminonne

Contributor Author

/jira refresh

@openshift-ci-robot


@sdminonne: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sdminonne sdminonne changed the title fix(OCPBUGS-77557): propagate additionalTrustBundle to AWS control plane components OCPBUGS-77557: propagate additionalTrustBundle to AWS control plane components Mar 11, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 11, 2026
@openshift-ci-robot


@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.21.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Add DeploymentAddAWSCABundleVolume helper that mounts the user-ca-bundle ConfigMap at a non-conflicting path (/etc/pki/ca-trust/extracted/hypershift/) and sets the AWS_CA_BUNDLE environment variable
  • Wire trust bundle propagation into aws-cloud-controller-manager, capi-provider, and ingress-operator deployments when AdditionalTrustBundle is set on the HostedControlPlane spec
  • Add unit tests for all three components

Problem

In isolated AWS environments (e.g., US-ISO regions), custom CA bundles specified via HostedCluster.Spec.AdditionalTrustBundle are not propagated to three control plane components: aws-cloud-controller-manager, ingress-operator, and capi-provider. This causes TLS verification failures when these components call AWS STS endpoints:

Post https://sts.us-iso-east-1.c2s.ic.gov: tls: failed to verify certificate:
x509: certificate signed by unknown authority

Why not reuse DeploymentAddTrustBundleVolume?

The existing helper mounts a ConfigMap as a directory at /etc/pki/tls/certs, which replaces the entire system CA directory. This works for in-house components (CPO, ignition-server, OAPI) whose TLS needs are tightly controlled. However, the three affected components are third-party binaries that make HTTPS calls to standard AWS service endpoints (EC2, ELB, STS). The AWS SDK's default HTTP client loads the system CA store from /etc/pki/tls/certs to verify TLS certificates on those connections. Replacing that directory with a ConfigMap containing only the custom CA would cause the binary to lose the public root CAs (e.g., Amazon Trust Services), breaking connectivity to standard AWS API endpoints.

Why AWS_CA_BUNDLE?

The AWS SDK (both v1 and v2) reads AWS_CA_BUNDLE and appends those CAs to the system cert pool. This means standard AWS endpoints continue to work via system CAs while also trusting custom CAs needed in isolated regions.

Test plan

  • Unit tests verify volume, mount, and env var presence when AdditionalTrustBundle is set
  • Unit tests verify no volume/env var when AdditionalTrustBundle is nil
  • Unit tests verify non-AWS platforms are unaffected (capi-provider, ingress-operator)
  • make test passes
  • make verify passes

Fixes: https://issues.redhat.com/browse/OCPBUGS-77557

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added AWS CA bundle support across control plane deployments. When an additional trust bundle is configured on AWS platforms, it is now properly mounted and integrated into deployments, enabling components to use custom CA certificates.
  • Tests
    • Added test coverage for AWS CA bundle deployment configuration across multiple components.


@sdminonne

Contributor Author

/jira refresh

@openshift-ci-robot


@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.21.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@sdminonne

Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 11, 2026
@openshift-ci-robot


@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (yli2@redhat.com), skipping review request.


@openshift-ci openshift-ci Bot added the area/testing Indicates the PR includes changes for e2e testing label Mar 16, 2026
openshift-ci Bot commented Mar 16, 2026
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdminonne
Once this PR has been reviewed and has the lgtm label, please ask for approval from enxebre. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdminonne

Contributor Author

/test e2e-aws

@sdminonne

Contributor Author

@enxebre PTAL

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 17, 2026
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 20, 2026
@sdminonne

Contributor Author

@enxebre PTAL

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 1, 2026
@openshift-ci openshift-ci Bot added area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform labels May 5, 2026
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026
openshift-ci Bot commented May 5, 2026
Contributor

@sdminonne: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-aws | 3aaf940 | link | true | /test e2e-aws
ci/prow/e2e-v2-gke | 13667a3 | link | false | /test e2e-v2-gke
ci/prow/e2e-gke | 13667a3 | link | false | /test e2e-gke

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@bryan-cox

Member

@sdminonne are you still looking to take this PR forward?

@hypershift-jira-solve-ci


I now have the complete root cause analysis. Here is the final report:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

# github.com/openshift/hypershift/test/e2e [github.com/openshift/hypershift/test/e2e.test]
test/e2e/nodepool_additionalTrustBundlePropagation_test.go:186:19: undefined: util
test/e2e/nodepool_additionalTrustBundlePropagation_test.go:261:19: undefined: util
make: *** [Makefile:470: e2e] Error 1

Summary

Both CI jobs fail due to a compilation error in the new test file test/e2e/nodepool_additionalTrustBundlePropagation_test.go. The file references util.IsDeploymentReady(...) at lines 186 and 261, but util is not imported and no such identifier exists in scope. The correct call is podspec.IsDeploymentReady(...) — the podspec package is already imported on line 13 and used correctly elsewhere in the same file (line 223). This is a copy-paste error introduced when adding the new AWS CCM deployment readiness checks.

Root Cause

The PR adds two new code blocks in test/e2e/nodepool_additionalTrustBundlePropagation_test.go that check AWS Cloud Controller Manager (CCM) deployment readiness. These blocks call util.IsDeploymentReady(k.ctx, obj) at lines 186 and 261.

However, util is not a valid identifier in this file's scope:

  • There is no import of a util package.
  • The test/e2e/util/util.go package does not export an IsDeploymentReady function.
  • The function IsDeploymentReady is defined in the podspec package (github.com/openshift/hypershift/support/podspec), which is already correctly imported on line 13 of this file.
  • An existing usage on line 223 of the same file correctly calls podspec.IsDeploymentReady(k.ctx, obj).

This is a copy-paste error — the author likely copied from internal code in test/e2e/util/util.go (which calls podspec.IsDeploymentReady internally) and forgot to change the qualifier from util to podspec.

Fix required: Replace util.IsDeploymentReady with podspec.IsDeploymentReady on lines 186 and 261. No import changes needed.

Recommendations
  1. Fix the two broken references in test/e2e/nodepool_additionalTrustBundlePropagation_test.go:
    • Line 186: util.IsDeploymentReady(k.ctx, obj) → podspec.IsDeploymentReady(k.ctx, obj)
    • Line 261: util.IsDeploymentReady(k.ctx, obj) → podspec.IsDeploymentReady(k.ctx, obj)
  2. No import changes are needed; podspec is already imported on line 13.
  3. Run make vet and make e2e locally before pushing to confirm the fix compiles.

Evidence
Evidence | Detail
Prow build error | test/e2e/nodepool_additionalTrustBundlePropagation_test.go:186:19: undefined: util
Prow build error | test/e2e/nodepool_additionalTrustBundlePropagation_test.go:261:19: undefined: util
GH Actions error | vet: test/e2e/nodepool_additionalTrustBundlePropagation_test.go:186:19: undefined: util
Correct function location | podspec.IsDeploymentReady() in support/podspec/deployment.go
Existing correct usage | Line 223 of same file: podspec.IsDeploymentReady(k.ctx, obj), which works correctly
Import present | Line 13: "github.com/openshift/hypershift/support/podspec" already imported
Failed Makefile target (Prow) | Makefile:470: e2e, which builds the test-e2e binary with -tags e2e
Failed Makefile target (GHA) | Makefile:507: vet, which runs go vet across all packages

sdminonne and others added 2 commits June 10, 2026 21:26
…ents

AWS control plane components fail TLS verification when calling AWS
endpoints in isolated environments (e.g. US-ISO regions) because they
do not honor the additionalTrustBundle from the HostedCluster spec.

The AWS SDK replaces the system CA bundle when AWS_CA_BUNDLE is set,
rather than appending to it (both v1 and v2 create a new empty
x509.CertPool). To handle this, add a DeploymentAddAWSCABundleVolume
helper in support/podspec that uses an init container (CPO image) to
concatenate the system CA bundle with the user CAs from the
additionalTrustBundle ConfigMap into a combined PEM file. AWS_CA_BUNDLE
points to this combined file, ensuring the AWS SDK trusts both system
and user CAs.

The init container runs with a restricted security context
(AllowPrivilegeEscalation=false, drop ALL capabilities) and minimal
resource requests (cpu: 10m, memory: 10Mi), consistent with other
lightweight init containers in the codebase.

Wire the helper into all affected AWS components:
- aws-cloud-controller-manager
- capi-provider
- ingress-operator
- karpenter
- karpenter-operator
- aws-node-termination-handler

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rs in KAS

The aws-kms-active and aws-kms-backup sidecar containers in the
kube-apiserver deployment call AWS KMS API endpoints using the AWS SDK.
In isolated AWS environments with custom CAs, these calls fail with
x509 certificate verification errors because the sidecars do not
honor the additionalTrustBundle from the HostedCluster spec.

Reuse the DeploymentAddAWSCABundleVolume helper to set up the combined
CA bundle volumes and init container, then wire the aws-kms-active and
aws-kms-backup containers with the AWS_CA_BUNDLE env var. The
aws-kms-token-minter container is excluded because it only mints
ServiceAccount tokens via the kube-apiserver and does not call AWS
APIs directly.

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot removed the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 11, 2026
@openshift-ci-robot


@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 11, 2026
@sdminonne

Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 11, 2026
@openshift-ci-robot


@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (wewang@redhat.com), skipping review request.


},
}

cpContext := controlplanecomponent.WorkloadContext{

Member

Could you pull this block outside the loop instead of making it over and over?

})

if hcp.Spec.AdditionalTrustBundle != nil {
	podspec.DeploymentAddAWSCABundleVolume(hcp.Spec.AdditionalTrustBundle, deployment, cpContext.ReleaseImageProvider.GetImage(podspec.CPOImageName))

Member

I thought the cpContext had a copy of the hcp you could pass in. Do you know if that is true?

Member

Yeah you are actually using it here https://github.com/openshift/hypershift/pull/7907/changes#diff-db92db45ba677a590c2e7e4cd186a414d3496b6f62954e4f98218c44518e24cbR25. So maybe the HCP should be referenced from that.

//
// The initContainerImage should be a RHEL-based image that has /bin/sh and cat available
// (e.g. the control-plane-operator image).
func DeploymentAddAWSCABundleVolume(trustBundleConfigMap *corev1.LocalObjectReference, deployment *appsv1.Deployment, initContainerImage string) {

Member

Could you just pass the cpContext in and drop the number of parameters down to two?

t.Run(tc.name, func(t *testing.T) {
g := NewGomegaWithT(t)

hcp := &hyperv1.HostedControlPlane{

Member

could you move this block outside the for loop and then just change the AdditionalTrustBundle each time?
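One way to read this suggestion is sketched below: build the shared fixture once, then copy it per case and change only the trust bundle. This is a hedged illustration, not the PR's test code. The types are cut-down stand-ins for the hyperv1 API types, and `needsAWSCABundle` and `runCases` are hypothetical names. With the real API types a per-case `DeepCopy()` would likely be the safer way to copy, since they contain pointers, maps, and slices.

```go
package main

import "fmt"

// Cut-down stand-ins for the hyperv1 types used by the tests.
type LocalObjectReference struct{ Name string }
type HostedControlPlaneSpec struct {
	Platform              string
	AdditionalTrustBundle *LocalObjectReference
}
type HostedControlPlane struct {
	Name string
	Spec HostedControlPlaneSpec
}

// needsAWSCABundle mirrors the condition shown in the walkthrough diagram:
// platform is AWS and an additional trust bundle is set.
func needsAWSCABundle(hcp *HostedControlPlane) bool {
	return hcp.Spec.Platform == "AWS" && hcp.Spec.AdditionalTrustBundle != nil
}

// runCases builds the base fixture once, outside the loop, and varies only
// the AdditionalTrustBundle field for each case.
func runCases() map[string]bool {
	base := HostedControlPlane{Name: "hcp", Spec: HostedControlPlaneSpec{Platform: "AWS"}}

	cases := []struct {
		name   string
		bundle *LocalObjectReference
	}{
		{name: "with bundle", bundle: &LocalObjectReference{Name: "user-ca-bundle"}},
		{name: "without bundle", bundle: nil},
	}

	results := map[string]bool{}
	for _, tc := range cases {
		hcp := base // value copy, so one case cannot leak state into the next
		hcp.Spec.AdditionalTrustBundle = tc.bundle
		results[tc.name] = needsAWSCABundle(&hcp)
	}
	return results
}

func main() {
	for name, wired := range runCases() {
		fmt.Printf("%s: wired=%v\n", name, wired)
	}
}
```

Copying the base per case keeps the cases independent while still defining the fixture in one place.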

},
}

cpContext := controlplanecomponent.WorkloadContext{

Could this be moved outside the for loop instead of being recreated each time?

},
}

cpContext := component.WorkloadContext{

Similar comment to the other tests.

if err := applyKMSConfig(&deployment.Spec.Template.Spec, secretEncryption, newKMSImages(hcp), hcp); err != nil {
	return err
}
if secretEncryption.KMS != nil && secretEncryption.KMS.Provider == hyperv1.AWS && hcp.Spec.AdditionalTrustBundle != nil {

Why would this not be in the block on L113?

}

podspec.UpdateContainer("aws-kms-active", podSpec.Containers, wireCABundle)
podspec.UpdateContainer("aws-kms-backup", podSpec.Containers, wireCABundle)

Do we always deploy the backup container, even when no backup key is set? 🤔

If not, I think this would trip things up when the container is missing.
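
Whether a missing container is a problem depends on how the update helper treats a name it cannot find. The sketch below shows a lookup-by-name helper that is a no-op for an absent container and reports whether it found one. It is a hypothetical stand-in with simplified types; the real `podspec.UpdateContainer` may behave differently and should be checked.

```go
package main

// Stand-in types: simplified assumptions, not the Kubernetes API types.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

// updateContainer applies update to the container called name, if present,
// and reports whether a matching container was found. An absent container
// is not an error here: the slice is simply left unchanged.
func updateContainer(name string, containers []Container, update func(*Container)) bool {
	for i := range containers {
		if containers[i].Name == name {
			update(&containers[i]) // mutate in place through the slice element
			return true
		}
	}
	return false
}
```

If the helper behaves like this, calling it for `aws-kms-backup` on a pod that only has `aws-kms-active` is harmless. A unit test with a single-key KMS configuration would confirm either way.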

})
}

// DeploymentAddAWSCABundleVolume creates a combined CA bundle containing both the system CAs from

Can we move this out to its own AWS platform file? This file to date has been platform agnostic.

Is this something that could be moved over to v2 e2e rather than adding new tests to v1 e2e?

@bryan-cox bryan-cox left a comment

Thanks for the PR. The init container approach for combining system + user CAs is the right design given the AWS SDK's AWS_CA_BUNDLE replacement behavior. A few items to address:

func DeploymentAddAWSCABundleVolume(trustBundleConfigMap *corev1.LocalObjectReference, deployment *appsv1.Deployment, initContainerImage string) {
	const (
		userCAVolumeName     = "user-ca-bundle"
		combinedCAVolumeName = "aws-ca-bundle"

The volume name, mount path, and filename constants are duplicated in kas/deployment.go:applyAWSCABundleToKMSContainers. If either copy drifts, KMS sidecars silently break in isolated environments. Could you export the shared constants?

const (
    AWSCABundleVolumeName = "aws-ca-bundle"
    AWSCABundleMountPath  = "/etc/pki/ca-trust/extracted/hypershift"
    AWSCABundleFileName   = "combined-ca-bundle.pem"
)
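
With the constants exported, the value for `AWS_CA_BUNDLE` could be derived in one place instead of being spelled out in both packages. A small sketch, where `AWSCABundlePath` is a hypothetical helper name and not something in the PR:

```go
package main

import "path"

const (
	AWSCABundleVolumeName = "aws-ca-bundle"
	AWSCABundleMountPath  = "/etc/pki/ca-trust/extracted/hypershift"
	AWSCABundleFileName   = "combined-ca-bundle.pem"
)

// AWSCABundlePath returns the full path of the combined bundle inside a
// container that mounts the AWSCABundleVolumeName volume at the mount path.
// Hypothetical helper: both the init container command and the AWS_CA_BUNDLE
// env var could be built from it.
func AWSCABundlePath() string {
	return path.Join(AWSCABundleMountPath, AWSCABundleFileName)
}
```

Both the volume helper and the KMS sidecar wiring could then reference `AWSCABundlePath()`, so a change to the mount path or filename cannot leave one side pointing at a stale location.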

	return err
}
if secretEncryption.KMS != nil && secretEncryption.KMS.Provider == hyperv1.AWS && hcp.Spec.AdditionalTrustBundle != nil {
	podspec.DeploymentAddAWSCABundleVolume(hcp.Spec.AdditionalTrustBundle, deployment, cpContext.ReleaseImageProvider.GetImage(podspec.CPOImageName))

DeploymentAddAWSCABundleVolume sets AWS_CA_BUNDLE on Containers[0], which is kube-apiserver here. KAS doesn't use the AWS SDK; only the KMS sidecars do, and those are correctly wired by applyAWSCABundleToKMSContainers below. Could we split the helper so the volume/init-container setup is separate from the Containers[0] env var wiring? That way KAS only gets the volumes + init container, and the KMS sidecars get the env var via applyAWSCABundleToKMSContainers.
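
A rough sketch of what such a split could look like, using simplified stand-in types rather than the Kubernetes API types. The function names (`addCABundleVolumes`, `wireCABundle`, `wireNamed`) and the init container name are hypothetical; only the volume names, mount path, and bundle filename come from this thread.

```go
package main

// Stand-in types: simplified assumptions, not the Kubernetes API types.
type EnvVar struct{ Name, Value string }

type VolumeMount struct{ Name, MountPath string }

type Container struct {
	Name         string
	Image        string
	Env          []EnvVar
	VolumeMounts []VolumeMount
}

type PodSpec struct {
	Volumes        []string
	InitContainers []Container
	Containers     []Container
}

const (
	caVolumeName = "aws-ca-bundle"
	caMountPath  = "/etc/pki/ca-trust/extracted/hypershift"
	caFilePath   = caMountPath + "/combined-ca-bundle.pem"
)

// addCABundleVolumes adds only the pod-level pieces: the two volumes and the
// init container that builds the combined bundle. It touches no main container.
func addCABundleVolumes(spec *PodSpec, initImage string) {
	spec.Volumes = append(spec.Volumes, "user-ca-bundle", caVolumeName)
	spec.InitContainers = append(spec.InitContainers, Container{
		Name:  "setup-aws-ca-bundle", // hypothetical name
		Image: initImage,
	})
}

// wireCABundle mounts the combined bundle into one container and points
// AWS_CA_BUNDLE at it.
func wireCABundle(c *Container) {
	c.VolumeMounts = append(c.VolumeMounts, VolumeMount{Name: caVolumeName, MountPath: caMountPath})
	c.Env = append(c.Env, EnvVar{Name: "AWS_CA_BUNDLE", Value: caFilePath})
}

// wireNamed applies wireCABundle to each listed container that exists.
func wireNamed(spec *PodSpec, names ...string) {
	for i := range spec.Containers {
		for _, n := range names {
			if spec.Containers[i].Name == n {
				wireCABundle(&spec.Containers[i])
			}
		}
	}
}
```

Under this split, the KAS deployment would call the volume helper plus the named wiring for the two KMS sidecars, while single-container components could keep a thin wrapper that calls both and wires `Containers[0]`.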

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type fakeReleaseProvider struct{}

This struct is duplicated identically in 7 test files in this PR. There's already a shared support/releaseinfo/fake.FakeReleaseProvider used across 40+ tests in the codebase. Could you use that instead?


proxy.SetEnvVars(&deployment.Spec.Template.Spec.Containers[0].Env)

if cpContext.HCP.Spec.Platform.Type == hyperv1.AWSPlatform && cpContext.HCP.Spec.AdditionalTrustBundle != nil {

nit: The other files in this PR (awsnodeterminationhandler, karpenter, kas) create a local hcp := cpContext.HCP and use hcp.Spec.*. For consistency, consider doing the same here and in ingressoperator/deployment.go.

deployment.Spec.Template.Spec.InitContainers = append(deployment.Spec.Template.Spec.InitContainers, corev1.Container{
	Name:    initContainerName,
	Image:   initContainerImage,
	Command: []string{"/bin/sh", "-c",

nit: If the system CA file is missing for some reason, this crashes the init container. A defensive fallback is cheap:

cat /etc/pki/tls/certs/ca-bundle.crt /user-ca/user-ca-bundle.pem > /etc/pki/ca-trust/extracted/hypershift/combined-ca-bundle.pem 2>/dev/null || cp /user-ca/user-ca-bundle.pem /etc/pki/ca-trust/extracted/hypershift/combined-ca-bundle.pem
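
To try the fallback outside a pod, the same command can be wrapped in a function that takes the three paths as arguments. This is only a sketch: `build_ca_bundle` is a hypothetical name, and the logic is the one-liner above with the paths parameterized.

```shell
#!/bin/sh
# build_ca_bundle SYSTEM_CA USER_CA OUT
# Concatenates the system and user CA files into OUT. If cat fails (for
# example because SYSTEM_CA is missing), falls back to copying USER_CA alone.
build_ca_bundle() {
    system_ca=$1
    user_ca=$2
    out=$3
    # stdout goes to OUT; cat's error message for a missing file is discarded
    cat "$system_ca" "$user_ca" > "$out" 2>/dev/null || cp "$user_ca" "$out"
}
```

When the system file is missing, `cat` still writes the user CA to the output before exiting non-zero, and `cp` then overwrites it with the same content, so the result is the user CA alone. If the user CA is missing as well, `cp` fails and the function returns non-zero, which in the init container would still surface as a failure.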

Labels

- area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
- area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
- area/karpenter-operator: Indicates the PR includes changes related to the Karpenter operator
- area/platform/aws: PR/issue for AWS (AWSPlatform) platform
- area/platform/azure: PR/issue for Azure (AzurePlatform) platform
- area/platform/gcp: PR/issue for GCP (GCPPlatform) platform
- area/platform/kubevirt: PR/issue for KubeVirt (KubevirtPlatform) platform
- area/platform/powervs: PR/issue for PowerVS (PowerVSPlatform) platform
- area/testing: Indicates the PR includes changes for e2e testing
- jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
- jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

5 participants