OCPBUGS-77557: propagate additionalTrustBundle to AWS control plane components#7907
OCPBUGS-77557: propagate additionalTrustBundle to AWS control plane components#7907sdminonne wants to merge 2 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
📝 WalkthroughWalkthroughAdds support for wiring an AWS CA bundle into deployments when a HostedControlPlane has Spec.AdditionalTrustBundle set and the platform is AWS. A new utility, DeploymentAddAWSCABundleVolume, constructs user/system CA volumes, an init container to produce a combined bundle, mounts it into main containers, and sets AWS_CA_BUNDLE. Multiple hosted control plane components now call this utility during their deployment adaptation; tests and e2e checks were added to validate presence/absence of volumes, mounts, init containers, and the AWS_CA_BUNDLE env var. Sequence Diagram(s)mermaid Changes
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/assign @enxebre |
|
how about karpenter-aws and aws-node-termination-handler? |
That statement seems to contradict the docs https://docs.aws.amazon.com/sdk-for-go/api/aws/session/ "Path to a custom Credentials Authority (CA) bundle PEM file that the SDK will use instead of the default system's root CA bundle. Use this only if you want to replace the CA bundle the SDK uses for TLS requests." |
|
/jira refresh |
|
@sdminonne: No Jira issue is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (yli2@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
c9c61d8 to
3aaf940
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sdminonne The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test e2e-aws |
1c08219 to
1a4ffcc
Compare
|
@enxebre PTAL |
1a4ffcc to
b62e6a8
Compare
|
@enxebre PTAL |
|
@sdminonne: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@sdminonne are you still looking to take this PR forward? |
|
I now have the complete root cause analysis. Here is the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryBoth CI jobs fail due to a compilation error in the new test file Root CauseThe PR adds two new code blocks in However,
This is a copy-paste error — the author likely copied from internal code in Fix required: Replace Recommendations
Evidence
|
…ents AWS control plane components fail TLS verification when calling AWS endpoints in isolated environments (e.g. US-ISO regions) because they do not honor the additionalTrustBundle from the HostedCluster spec. The AWS SDK replaces the system CA bundle when AWS_CA_BUNDLE is set, rather than appending to it (both v1 and v2 create a new empty x509.CertPool). To handle this, add a DeploymentAddAWSCABundleVolume helper in support/podspec that uses an init container (CPO image) to concatenate the system CA bundle with the user CAs from the additionalTrustBundle ConfigMap into a combined PEM file. AWS_CA_BUNDLE points to this combined file, ensuring the AWS SDK trusts both system and user CAs. The init container runs with a restricted security context (AllowPrivilegeEscalation=false, drop ALL capabilities) and minimal resource requests (cpu: 10m, memory: 10Mi), consistent with other lightweight init containers in the codebase. Wire the helper into all affected AWS components: - aws-cloud-controller-manager - capi-provider - ingress-operator - karpenter - karpenter-operator - aws-node-termination-handler Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rs in KAS The aws-kms-active and aws-kms-backup sidecar containers in the kube-apiserver deployment call AWS KMS API endpoints using the AWS SDK. In isolated AWS environments with custom CAs, these calls fail with x509 certificate verification errors because the sidecars do not honor the additionalTrustBundle from the HostedCluster spec. Reuse the DeploymentAddAWSCABundleVolume helper to set up the combined CA bundle volumes and init container, then wire the aws-kms-active and aws-kms-backup containers with the AWS_CA_BUNDLE env var. The aws-kms-token-minter container is excluded because it only mints ServiceAccount tokens via the kube-apiserver and does not call AWS APIs directly. Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (wewang@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| }, | ||
| } | ||
|
|
||
| cpContext := controlplanecomponent.WorkloadContext{ |
There was a problem hiding this comment.
Could you pull this block outside the loop instead of making it over and over?
| }) | ||
|
|
||
| if hcp.Spec.AdditionalTrustBundle != nil { | ||
| podspec.DeploymentAddAWSCABundleVolume(hcp.Spec.AdditionalTrustBundle, deployment, cpContext.ReleaseImageProvider.GetImage(podspec.CPOImageName)) |
There was a problem hiding this comment.
I thought the cpContext had a copy of the hcp you could pass in. Do you know if that is true?
There was a problem hiding this comment.
Yeah you are actually using it here https://github.com/openshift/hypershift/pull/7907/changes#diff-db92db45ba677a590c2e7e4cd186a414d3496b6f62954e4f98218c44518e24cbR25. So maybe the HCP should be referenced from that.
| // | ||
| // The initContainerImage should be a RHEL-based image that has /bin/sh and cat available | ||
| // (e.g. the control-plane-operator image). | ||
| func DeploymentAddAWSCABundleVolume(trustBundleConfigMap *corev1.LocalObjectReference, deployment *appsv1.Deployment, initContainerImage string) { |
There was a problem hiding this comment.
Could you just pass the cpContext in and drop the number of parameters down to two?
| t.Run(tc.name, func(t *testing.T) { | ||
| g := NewGomegaWithT(t) | ||
|
|
||
| hcp := &hyperv1.HostedControlPlane{ |
There was a problem hiding this comment.
could you move this block outside the for loop and then just change the AdditionalTrustBundle each time?
| }, | ||
| } | ||
|
|
||
| cpContext := controlplanecomponent.WorkloadContext{ |
There was a problem hiding this comment.
Could this be moved outside the for loop instead of being recreated each time?
| }, | ||
| } | ||
|
|
||
| cpContext := component.WorkloadContext{ |
There was a problem hiding this comment.
similar comment as the other tests
| if err := applyKMSConfig(&deployment.Spec.Template.Spec, secretEncryption, newKMSImages(hcp), hcp); err != nil { | ||
| return err | ||
| } | ||
| if secretEncryption.KMS != nil && secretEncryption.KMS.Provider == hyperv1.AWS && hcp.Spec.AdditionalTrustBundle != nil { |
There was a problem hiding this comment.
Why would this not be in the block on L113?
| } | ||
|
|
||
| podspec.UpdateContainer("aws-kms-active", podSpec.Containers, wireCABundle) | ||
| podspec.UpdateContainer("aws-kms-backup", podSpec.Containers, wireCABundle) |
There was a problem hiding this comment.
Do we always deploy the backup container if there is not a key set? 🤔
If so, this would trip things up if there was no container I think.
| }) | ||
| } | ||
|
|
||
| // DeploymentAddAWSCABundleVolume creates a combined CA bundle containing both the system CAs from |
There was a problem hiding this comment.
Can we move this out to its own AWS platform file? This file to date has been platform agnostic.
There was a problem hiding this comment.
Is this something that could be moved over to v2 e2e rather than adding new tests to v1 e2e?
bryan-cox
left a comment
There was a problem hiding this comment.
Thanks for the PR — the init container approach for combining system + user CAs is the right design given the AWS SDK's AWS_CA_BUNDLE replacement behavior. A few items to address:
| func DeploymentAddAWSCABundleVolume(trustBundleConfigMap *corev1.LocalObjectReference, deployment *appsv1.Deployment, initContainerImage string) { | ||
| const ( | ||
| userCAVolumeName = "user-ca-bundle" | ||
| combinedCAVolumeName = "aws-ca-bundle" |
There was a problem hiding this comment.
The volume name, mount path, and filename constants are duplicated in kas/deployment.go:applyAWSCABundleToKMSContainers. If either copy drifts, KMS sidecars silently break in isolated environments. Could you export the shared constants?
const (
AWSCABundleVolumeName = "aws-ca-bundle"
AWSCABundleMountPath = "/etc/pki/ca-trust/extracted/hypershift"
AWSCABundleFileName = "combined-ca-bundle.pem"
)| return err | ||
| } | ||
| if secretEncryption.KMS != nil && secretEncryption.KMS.Provider == hyperv1.AWS && hcp.Spec.AdditionalTrustBundle != nil { | ||
| podspec.DeploymentAddAWSCABundleVolume(hcp.Spec.AdditionalTrustBundle, deployment, cpContext.ReleaseImageProvider.GetImage(podspec.CPOImageName)) |
There was a problem hiding this comment.
DeploymentAddAWSCABundleVolume sets AWS_CA_BUNDLE on Containers[0], which is kube-apiserver here. KAS doesn't use the AWS SDK — only the KMS sidecars do, and those are correctly wired by applyAWSCABundleToKMSContainers below. Could we split the helper so the volume/init-container setup is separate from the Containers[0] env var wiring? That way KAS only gets the volumes + init container, and the KMS sidecars get the env var via applyAWSCABundleToKMSContainers.
| metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
| ) | ||
|
|
||
| type fakeReleaseProvider struct{} |
There was a problem hiding this comment.
This struct is duplicated identically in 7 test files in this PR. There's already a shared support/releaseinfo/fake.FakeReleaseProvider used across 40+ tests in the codebase — could you use that instead?
|
|
||
| proxy.SetEnvVars(&deployment.Spec.Template.Spec.Containers[0].Env) | ||
|
|
||
| if cpContext.HCP.Spec.Platform.Type == hyperv1.AWSPlatform && cpContext.HCP.Spec.AdditionalTrustBundle != nil { |
There was a problem hiding this comment.
nit: The other files in this PR (awsnodeterminationhandler, karpenter, kas) create a local hcp := cpContext.HCP and use hcp.Spec.*. For consistency, consider doing the same here and in ingressoperator/deployment.go.
| deployment.Spec.Template.Spec.InitContainers = append(deployment.Spec.Template.Spec.InitContainers, corev1.Container{ | ||
| Name: initContainerName, | ||
| Image: initContainerImage, | ||
| Command: []string{"/bin/sh", "-c", |
There was a problem hiding this comment.
nit: If the system CA file is missing for some reason, this crashes the init container. A defensive fallback is cheap:
cat /etc/pki/tls/certs/ca-bundle.crt /user-ca/user-ca-bundle.pem > /etc/pki/ca-trust/extracted/hypershift/combined-ca-bundle.pem 2>/dev/null || cp /user-ca/user-ca-bundle.pem /etc/pki/ca-trust/extracted/hypershift/combined-ca-bundle.pem
Summary
DeploymentAddAWSCABundleVolumehelper that creates a combined CA bundle (system + user CAs) via an init container and setsAWS_CA_BUNDLEon the main containerAdditionalTrustBundleis set on the HostedControlPlane spec:aws-cloud-controller-managercapi-provideringress-operatorkarpenterkarpenter-operatoraws-node-termination-handlerkube-apiserverAWS KMS sidecars (aws-kms-active,aws-kms-backup) whenSecretEncryption.KMS.Provideris AWSaws-cloud-controller-managerProblem
In isolated AWS environments (e.g., US-ISO regions), custom CA bundles specified via
HostedCluster.Spec.AdditionalTrustBundleare not propagated to AWS control plane components. This causes TLS verification failures when these components call AWS API endpoints:Why not reuse
DeploymentAddTrustBundleVolume?The existing helper mounts a ConfigMap as a directory at
/etc/pki/tls/certs, which replaces the entire system CA directory. This works for in-house components (CPO, ignition-server, OAPI) whose TLS needs are tightly controlled. However, the affected components are binaries that make HTTPS calls to standard AWS service endpoints (EC2, ELB, STS, SQS, KMS). The AWS SDK's default HTTP client loads the system CA store from/etc/pki/tls/certsto verify TLS certificates. Replacing that directory with a ConfigMap containing only the custom CA would cause the binary to lose the public root CAs (e.g., Amazon Trust Services), breaking connectivity to standard AWS API endpoints.Why
AWS_CA_BUNDLEwith a combined bundle?The AWS SDK (both v1 and v2) reads
AWS_CA_BUNDLEand uses it instead of the system CA bundle — it creates a new emptyx509.CertPooland loads only the specified file. To avoid losing trust in standard AWS endpoints, an init container concatenates the system CAs (/etc/pki/tls/certs/ca-bundle.crt) with the user-provided CAs fromadditionalTrustBundleinto a single combined PEM file.AWS_CA_BUNDLEpoints to this combined file, ensuring the AWS SDK trusts both system and custom CAs.KAS KMS sidecars
When secret encryption uses AWS KMS (
SecretEncryption.KMS.Provider == AWS), theaws-kms-activeandaws-kms-backupsidecar containers in the kube-apiserver deployment also need access to the combined CA bundle. These sidecars call AWS KMS endpoints to encrypt/decrypt data encryption keys. Theaws-kms-token-mintersidecar is intentionally excluded as it does not make AWS API calls.Test plan
AdditionalTrustBundleis setAdditionalTrustBundleis nilaws-kms-activeandaws-kms-backupget volume mount andAWS_CA_BUNDLEenv varaws-kms-token-minteris not wiredAWS_CA_BUNDLEwiring onaws-cloud-controller-managermake testpassesmake verifypassesFixes: https://issues.redhat.com/browse/OCPBUGS-77557
🤖 Generated with Claude Code