diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..433ccf268 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,71 @@ +# AI Agent Instructions for cluster-openshift-controller-manager-operator + +> Also read [ARCHITECTURE.md](ARCHITECTURE.md) for design decisions and [CONTRIBUTING.md](CONTRIBUTING.md) for workflow. + +## What This Repo Is + +This is the **operator** for the openshift-controller-manager and route-controller-manager on OpenShift. It reconciles the `openshiftcontrollermanagers.operator.openshift.io/cluster` CR and manages two operand Deployments, their ConfigMaps, RBAC, network policies, and CA bundles. It reports status via the `openshift-controller-manager` ClusterOperator object. + +The **operand** (the controller-manager binary itself) lives in [openshift/openshift-controller-manager](https://github.com/openshift/openshift-controller-manager). This repo only manages how that binary is deployed and configured. + +## Repository Layout + +```text +cmd/ # Entry points + cluster-openshift-controller-manager-operator/ # Operator binary + cluster-openshift-controller-manager-operator-tests-ext/ # OTE test binary +pkg/ + operator/ # Core operator logic + starter.go # RunOperator — wires all controllers + operator.go # Main reconciler struct and work queue + sync_openshiftcontrollermanager_v311_00.go # Sync loop for both operands + configobservation/ # Config observers (builds, images, network, controllers) + usercaobservation/ # Proxy CA bundle sync controller + internalimageregistry/ # Image pull secret cleanup controller + util/consts.go # Namespace constants +bindata/assets/ # Static manifests applied by the operator +manifests/ # CVO-managed manifests (install/upgrade) +test/e2e/ # E2E tests (Ginkgo) +``` + +## Build and Test Commands + +```bash +make build # Build operator and OTE test binaries +make test # Unit tests +make verify # Lint and vet +make test-e2e # E2E tests (requires KUBECONFIG) +``` + +## Critical Rules + +1. **Network policy ordering matters.** In `starter.go`, allow-rules must appear before default-deny rules in the StaticResourceController asset list. Reversing this order causes traffic interruption during reconciliation. + +2. **Never remove the "always managed" and "not removable" flags.** The operator calls `management.SetOperatorAlwaysManaged()` and `management.SetOperatorNotRemovable()` because the openshift-controller-manager is a core platform component. + +3. **Config observers must return previous config on error.** If an observer encounters an error, it must return `previousConfig` (not empty config) to prevent flapping the operand configuration. + +4. **All informer caches must sync before controllers start.** The operator explicitly calls `WaitForCacheSync` for every informer group and returns an error if any fail. Skipping this check causes transiently incorrect operations from partially-synced caches (see OCPBUGS-81472). + +5. **Vendor is committed.** After dependency changes, run `go mod tidy && go mod vendor` and commit the vendor update separately from code changes. + +## Key Patterns + +- **library-go operator framework**: Uses `StaticResourceController`, `ConfigObserver`, `ClusterOperatorStatusController`, `ResourceSyncController`, and `LogLevelController` from `openshift/library-go`. +- **Dual-operand sync**: One sync function manages both the openshift-controller-manager and route-controller-manager Deployments. Errors and status conditions are tracked separately with the `RouteControllerManager` prefix for the route-controller-manager. +- **Capability-gated controllers**: Controllers are disabled by prepending `-` to their name in the operand ConfigMap (e.g., `-openshift.io/build`). The mapping from controllers to capabilities is in `controllerCapabilities` in `sync_openshiftcontrollermanager_v311_00.go`. +- **Input hash-driven rollouts**: ConfigMap resource versions and content hashes are embedded as annotations on the Deployment spec template, triggering a rolling update when upstream inputs change. +- **TLS security profile propagation**: The cluster-wide TLS profile from `apiservers.config.openshift.io` is observed and propagated into the operand config. TLS config changes trigger an operator restart. + +## What NOT to Do + +- Do not add new operand controllers here — controllers belong in `openshift/openshift-controller-manager`. This repo only manages deployment and configuration. +- Do not modify `manifests/` without coordinating with the release team — these are CVO-managed and affect install/upgrade. +- Do not change the rate limiter parameters in `operator.go` without understanding the impact on API server load during rapid config changes. +- Do not add bindata assets after default-deny network policies — always add allow rules first. + +## Test Suites + +- **Unit tests**: Colocated `_test.go` files in `pkg/`. Focus on config observer logic and deployment generation. +- **E2E tests**: `test/e2e/` using Ginkgo. Validate network policy enforcement and operator availability on a live cluster. +- **OTE**: The test extension binary (`cluster-openshift-controller-manager-operator-tests-ext`) is built alongside the operator and included in the production image. diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 000000000..129d57923 --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,113 @@ +# Architecture: cluster-openshift-controller-manager-operator + +## Scope + +This operator manages two operand Deployments: + +- **openshift-controller-manager** — runs OpenShift-specific controllers (builds, deployments, image triggers, service accounts, templates, unidling) +- **route-controller-manager** — runs the ingress-to-route and ingress IP controllers (split from the controller-manager but still sharing the same operator CR) + +The operator does **not** manage the upstream kube-controller-manager (that is `cluster-kube-controller-manager-operator`). + +## Namespace Map + +| Namespace | Purpose | +|-----------|---------| +| `openshift-controller-manager-operator` | Operator Deployment runs here | +| `openshift-controller-manager` | Operand: openshift-controller-manager Deployment, ConfigMaps, ServiceAccount | +| `openshift-route-controller-manager` | Operand: route-controller-manager Deployment, ConfigMaps, ServiceAccount | +| `openshift-config` | Source for user-specified global configuration (proxy CA, build defaults) | +| `openshift-config-managed` | Source for machine-managed global configuration | + +## Component Overview + +The operator binary starts a set of library-go controllers that collectively reconcile the desired state: + +``` +RunOperator() + ├── OpenShiftControllerManagerOperator — main sync loop for both operand Deployments + ├── StaticResourceController — reconciles RBAC, Namespaces, Services, NetworkPolicies from bindata/ + ├── ConfigObserver — watches cluster config and transforms it into operand configuration + ├── UserCAObservationController — watches proxy config and syncs trusted CA bundles + ├── ResourceSyncController — copies Secrets/ConfigMaps between namespaces + ├── ClusterOperatorStatusController — aggregates conditions into the ClusterOperator object + ├── LogLevelController — reconciles operator log verbosity from CR spec + └── ImagePullSecretCleanupController — cleans up internal-registry pull secrets when registry is disabled +``` + +## Reconciliation Flow + +The main sync loop (`syncOpenShiftControllerManager_v311_00_to_latest`) runs on every change to the operator CR, operand namespaces, or watched cluster config: + +1. **ConfigMaps** — merge default config with observed config and operator overrides, compute input hashes for rollout detection +2. **Client CA** — sync the kube-apiserver client CA bundle into each operand namespace +3. **Trust bundles** — ensure service-ca and global-ca ConfigMaps exist with injection annotations +4. **Deployments** — apply the operand Deployment with correct image, replica count (one per master node), log level, proxy env vars, and spec annotations +5. **Status** — set Available/Progressing/Degraded/Upgradeable conditions on the operator CR, then let the ClusterOperatorStatusController aggregate them + +## Capabilities Integration + +The operator dynamically disables operand controllers when cluster capabilities are not enabled: + +| Capability | Disabled Controllers | +|-----------|---------------------| +| `Build` | build, build-config-change, builder-sa, builder-rolebindings | +| `DeploymentConfig` | deployer, deployer-sa, deployment-config, deployer-rolebindings | +| `ImageRegistry` | service-account-pull-secrets, image-puller-rolebindings | + +When the Build capability is initially disabled, the operator polls every 5 minutes and exits (triggering a restart) if the capability becomes enabled — ensuring a full re-initialization with build controllers active. + +## Config Observation + +Config observers watch cluster-level resources and merge their state into the operand ConfigMap: + +| Observer | Watches | Configures | +|---------|---------|-----------| +| `builds` | `builds.config.openshift.io/cluster` | Build defaults and overrides (only when Build capability is enabled) | +| `images` | `images.config.openshift.io/cluster` | Internal/external registry hostnames | +| `network` | `networks.config.openshift.io/cluster` | Cluster network CIDR for egress/ingress | +| `deployimages` | Operator env vars | Deployer and builder image pull specs | +| `controllers` | `clusterversions.config.openshift.io/version` | Controller enable/disable list based on capabilities | +| `featuregates` | `featuregates.config.openshift.io/cluster` | Feature flags (e.g., `BuildCSIVolumes`) | +| `apiserver` (TLS) | `apiservers.config.openshift.io/cluster` | TLS security profile (cipher suites, min TLS version) propagated to operand | + +## CRDs and API Types + +- **Owned:** `openshiftcontrollermanagers.operator.openshift.io` — the operator's configuration CR (singleton named `cluster`) +- **Consumed:** `clusteroperators.config.openshift.io`, `clusterversions.config.openshift.io`, `proxies.config.openshift.io`, `builds.config.openshift.io`, `images.config.openshift.io`, `networks.config.openshift.io`, `featuregates.config.openshift.io` + +## Manifest and Resource Management + +- **CVO-managed** (`manifests/`): Operator Deployment, Namespace, ServiceAccount, RBAC, ServiceMonitors, ClusterOperator, NetworkPolicies — applied by the Cluster Version Operator during install/upgrade +- **Operator-applied** (`bindata/assets/`): Operand Namespaces, Deployments, ConfigMaps, RBAC, Services, NetworkPolicies — reconciled by the StaticResourceController and the main sync loop + +## Dependencies + +| Dependency | Usage | +|-----------|-------| +| `openshift/library-go` | Operator framework: static resource controller, config observer, status aggregation, resource apply, log level controller | +| `openshift/api` | CRD types (`operator/v1`, `config/v1`, `openshiftcontrolplane/v1`) | +| `openshift/client-go` | Typed clients and informers for OpenShift API groups | +| `k8s.io/client-go` | Kubernetes clients, informers, work queues | + +## Testing Strategy + +- **Unit tests** (`pkg/...`): config observer logic, deployment generation, controller enable/disable based on capabilities +- **E2E tests** (`test/e2e/`): cluster-level validation using Ginkgo — network policy enforcement, operator availability +- **OTE** (`.openshift-tests-extension/`): test extension binary included in the production image for CI/CD integration + +## Design Decisions + +1. **Two operands, one operator**: The route-controller-manager was split from the openshift-controller-manager but both share the same operator CR and operator binary. This avoids a separate operator for a small set of controllers, but means the sync loop manages two independent Deployments with separate status condition prefixes. + +2. **Capability-aware controller disable**: Rather than separate operand images per capability, the operator passes a controller enable/disable list via the operand ConfigMap. The `-controllerName` convention (prefixing with `-`) disables individual controllers without changing the binary. + +3. **Rate-limited sync loop**: The main operator uses a token bucket rate limiter (0.05 tokens/sec, burst 4) to avoid hot-looping on rapid config changes. This is in addition to the workqueue's exponential backoff. + +4. **Always managed, not removable**: The operator opts out of library-go's "unmanaged" and "removable" states via `management.SetOperatorAlwaysManaged()` and `management.SetOperatorNotRemovable()`. The openshift-controller-manager is a core platform component that must always run. + +5. **Informer cache sync before controller start**: All informer caches must be fully synced before any controller starts making decisions. A bug (OCPBUGS-81472) caused transiently incorrect operations when controllers acted on partially-synced caches. The fix adds explicit `WaitForCacheSync` checks with error returns for each informer group (operator, kube, config) before starting any controllers. + +6. **TLS security profile propagation**: The operator observes the cluster-wide TLS security profile from `apiservers.config.openshift.io/cluster` and propagates it to the operand configuration via the config observer. When TLS config changes, the operator restarts to pick up the new settings (CNTRLPLANE-2620). + +7. **Capability poll-and-restart**: When the Build capability is disabled at startup, the operator skips the build config informer (avoiding a deadlock in the ConfigObserver — OCPBUGS-22956) and polls ClusterVersion every 5 minutes. If the capability becomes enabled, the operator exits to trigger a pod restart with full re-initialization. This is intentional: capabilities can only be enabled post-install (never disabled), and the newly-enabled informers require a clean startup. The [component-selection enhancement](https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md) keeps capability awareness at the CVO layer and does not prescribe an operator-level reactive pattern. diff --git a/CLAUDE.md b/CLAUDE.md new file mode 120000 index 000000000..47dc3e3d8 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1 @@ +AGENTS.md \ No newline at end of file diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 000000000..3638b2d2d --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,71 @@ +# Contributing to cluster-openshift-controller-manager-operator + +## Development Workflow + +1. Fork the repo and clone your fork +2. Create a feature branch from `master` +3. Make your changes, add or update tests +4. Run verification locally: + ```bash + make build verify test + ``` +5. If you changed dependencies: `go mod tidy && go mod vendor`, commit separately from code changes +6. Push your branch and open a PR + +## Pull Request Guidelines + +- Keep PRs focused — one logical change per PR +- Reference JIRA tickets in PR title: `OCPBUGS-XXXXX: description` or `CNTRLPLANE-XXXXX: description` +- Include tests for new functionality +- PRs require `/lgtm` from a reviewer and `/approve` from an approver (see OWNERS) +- All PRs require `/verified` before merge (see OpenShift verification workflow) + +## Testing + +| Command | What It Runs | +|---------|-------------| +| `make build` | Builds the operator and OTE test binaries | +| `make test` | Unit tests in `pkg/...` and `cmd/...` | +| `make verify` | Linting, formatting, vet checks | +| `make test-e2e` | E2E tests against a live cluster (requires `KUBECONFIG`) | + +### E2E Tests + +E2E tests use Ginkgo and require a running OpenShift cluster. They validate network policy enforcement and operator health. + +```bash +make test-e2e +``` + +## Code Conventions + +- **Error handling**: Use `fmt.Errorf` with `%w` for wrapping. Accumulate errors in slices when syncing multiple resources, then report all at once. +- **Status conditions**: Use library-go `v1helpers.SetOperatorCondition` helpers. Prefix condition types with `RouteControllerManager` for route-controller-manager conditions to distinguish from openshift-controller-manager conditions. +- **Config observation**: Each observer returns `(observedConfig, errors)`. On error, return the previous config to avoid flapping. +- **Static resources**: RBAC and namespace manifests go in `bindata/assets/`. Network policy allow rules must be listed before default-deny rules in the StaticResourceController to avoid traffic interruption during reconciliation. + +## Areas Requiring Extra Care + +- **`bindata/assets/`**: Changes to static manifests affect what the operator reconciles. Network policy ordering matters (allow before deny). +- **`manifests/`**: CVO-managed resources. Changes here affect install/upgrade — coordinate with the release team. +- **Capability integration**: Adding or removing controllers from the capability-disable lists affects which controllers run on clusters with reduced capability sets. Test with capabilities both enabled and disabled. +- **Vendor directory**: Committed. Run `go mod tidy && go mod vendor` and commit separately from code changes. + +## CI Pipeline + +CI is managed by Prow via [ci-operator](https://docs.ci.openshift.org/docs/architecture/ci-operator/). The build root image is configured in `.ci-operator.yaml`. Required checks include unit tests, build verification, and e2e tests on a live cluster. + +## Review and Approval + +Reviews and approvals follow the [OWNERS](OWNERS) file. The primary approver groups are `control-plane-approvers` and build-related maintainers. See [OWNERS_ALIASES](OWNERS_ALIASES) for the full list of control-plane approvers. + +## Rebase Checklist + +When rebasing to a new Kubernetes release: + +- [ ] Reference the target [kubernetes release branch](https://github.com/kubernetes/kubernetes/branches) `go.mod` and `CHANGELOG` +- [ ] Bump Go version, all `k8s.io/`, `github.com/openshift/`, and other relevant dependencies +- [ ] Run `go mod vendor && go mod tidy`, commit separately +- [ ] Bump image versions (Dockerfile, `.ci-operator.yaml`) if needed +- [ ] Run `make build verify test` +- [ ] Make code changes as needed until the above pass diff --git a/README.md b/README.md index 7ef260e2e..5af5af21e 100644 --- a/README.md +++ b/README.md @@ -1,65 +1,69 @@ # cluster-openshift-controller-manager-operator -The openshift-controller-manager operator is an -[OpenShift ClusterOperator](https://github.com/openshift/enhancements/blob/master/dev-guide/operators.md#what-is-an-openshift-clusteroperator). -It installs and maintains the OpenShiftControllerManager [Custom Resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) in a cluster and can be viewed with: -``` -oc get clusteroperator openshift-controller-manager -o yaml -``` -The [Custom Resource Definition](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions) -`openshiftcontrollermanagers.operator.openshift.io` -can be viewed in a cluster with: +This operator installs and manages the [openshift-controller-manager](https://github.com/openshift/openshift-controller-manager) and the [route-controller-manager](https://github.com/openshift/route-controller-manager) on an OpenShift cluster. It is an [OpenShift ClusterOperator](https://github.com/openshift/enhancements/blob/master/dev-guide/operators.md#what-is-an-openshift-clusteroperator) that reconciles the `openshiftcontrollermanagers.operator.openshift.io` custom resource, manages operand Deployments, ConfigMaps, RBAC, and reports status via the `openshift-controller-manager` ClusterOperator object. -```console -$ oc get crd openshiftcontrollermanagers.operator.openshift.io -o yaml -``` +The operator dynamically enables and disables operand controllers based on [cluster capabilities](https://docs.openshift.com/container-platform/latest/installing/cluster-capabilities.html) (Build, DeploymentConfig, ImageRegistry), and handles proxy configuration, CA bundle injection, and feature gate observation. -## Test -Many OpenShift ClusterOperators share common build, test, deployment, and update methods. -For more information about how to build, deploy, test, update, and develop OpenShift ClusterOperators, see -[OpenShift ClusterOperator and Operand Developer Document](https://github.com/openshift/enhancements/blob/master/dev-guide/operators.md#how-do-i-buildupdateverifyrun-unit-tests) +## Quick Start -This section explains how to deploy OpenShift with your test openshift-controller-manager image: -[Testing a ClusterOperator/Operand image in a cluster](https://github.com/openshift/enhancements/blob/master/dev-guide/operators.md#how-can-i-test-changes-to-an-openshift-operatoroperandrelease-component) +### Prerequisites -## Rebase -Follow this checklist and copy into the PR: +- Go 1.25+ +- Access to an OpenShift cluster (for e2e tests) -- [ ] Select the desired [kubernetes release branch](https://github.com/kubernetes/kubernetes/branches), and use its `go.mod` and `CHANGELOG` as references for the rest of the work. -- [ ] Bump go version, all `k8s.io/`, `github.com/openshift/`, and any other relevant dependencies as needed. -- [ ] Run `go mod vendor && go mod tidy`, commit that separately from all other changes. -- [ ] Bump image versions (Dockerfile, ci...) if needed. -- [ ] Run `make build verify test`. -- [ ] Make code changes as needed until the above pass. -- [ ] Any other minor update, like documentation. +### Building -## OpenShift Tests Extension (OTE) +```bash +make build +``` -This repository is compatible with the "OpenShift Tests Extension (OTE)" framework. +### Running Tests -### Building the test binary ```bash -make build +# Unit tests +make test + +# E2E tests (requires a running OpenShift cluster) +make test-e2e ``` -### Running test suites and tests +### Running Locally + +This operator runs as a Deployment in the `openshift-controller-manager-operator` namespace. To test a custom image in a cluster, see [Testing a ClusterOperator/Operand image in a cluster](https://github.com/openshift/enhancements/blob/master/dev-guide/operators.md#how-can-i-test-changes-to-an-openshift-operatoroperandrelease-component). + +## Inspecting Cluster State + ```bash -# Run a specific test suite or test -./cluster-openshift-controller-manager-operator-tests-ext run-suite openshift/openshift-controller-manager-operator/all -./cluster-openshift-controller-manager-operator-tests-ext run-test "test-name" +# View the ClusterOperator status +oc get clusteroperator openshift-controller-manager -o yaml -# Run with JUnit output -./cluster-openshift-controller-manager-operator-tests-ext run-suite openshift/openshift-controller-manager-operator/all --junit-path=/tmp/junit-results/junit.xml -./cluster-openshift-controller-manager-operator-tests-ext run-test "test-name" --junit-path=/tmp/junit-results/junit.xml +# View the operator CR +oc get openshiftcontrollermanagers.operator.openshift.io cluster -o yaml ``` -### Listing available tests and suites +## Documentation + +- [ARCHITECTURE.md](ARCHITECTURE.md) — Design decisions and component architecture +- [CONTRIBUTING.md](CONTRIBUTING.md) — How to submit changes +- [AGENTS.md](AGENTS.md) — AI agent instructions + +## OpenShift Tests Extension (OTE) + +This repository includes an OTE-compatible test binary for CI integration. + ```bash -# List all test suites +# List available test suites ./cluster-openshift-controller-manager-operator-tests-ext list-suites -# List tests in a specific suite -./cluster-openshift-controller-manager-operator-tests-ext list-tests openshift/openshift-controller-manager-operator/all +# Run a test suite +./cluster-openshift-controller-manager-operator-tests-ext run-suite openshift/openshift-controller-manager-operator/all ``` -The test extension binary is included in the production image for CI/CD integration. \ No newline at end of file +## Related Repositories + +| Repository | Relationship | +|-----------|-------------| +| [openshift/openshift-controller-manager](https://github.com/openshift/openshift-controller-manager) | Operand — the controller-manager binary this operator deploys | +| [openshift/route-controller-manager](https://github.com/openshift/route-controller-manager) | Operand — the route-controller-manager binary this operator deploys | +| [openshift/library-go](https://github.com/openshift/library-go) | Shared operator framework (static resource controller, config observer, status controller) | +| [openshift/api](https://github.com/openshift/api) | API types including `OpenShiftControllerManager` operator config |