Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
71 changes: 71 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# AI Agent Instructions for cluster-openshift-controller-manager-operator

> Also read [ARCHITECTURE.md](ARCHITECTURE.md) for design decisions and [CONTRIBUTING.md](CONTRIBUTING.md) for workflow.

## What This Repo Is

This is the **operator** for the openshift-controller-manager and route-controller-manager on OpenShift. It reconciles the `openshiftcontrollermanagers.operator.openshift.io/cluster` CR and manages two operand Deployments, their ConfigMaps, RBAC, network policies, and CA bundles. It reports status via the `openshift-controller-manager` ClusterOperator object.

The **operand** (the controller-manager binary itself) lives in [openshift/openshift-controller-manager](https://github.com/openshift/openshift-controller-manager). This repo only manages how that binary is deployed and configured.

## Repository Layout

```text
cmd/ # Entry points
cluster-openshift-controller-manager-operator/ # Operator binary
cluster-openshift-controller-manager-operator-tests-ext/ # OTE test binary
pkg/
operator/ # Core operator logic
starter.go # RunOperator — wires all controllers
operator.go # Main reconciler struct and work queue
sync_openshiftcontrollermanager_v311_00.go # Sync loop for both operands
configobservation/ # Config observers (builds, images, network, controllers)
usercaobservation/ # Proxy CA bundle sync controller
internalimageregistry/ # Image pull secret cleanup controller
util/consts.go # Namespace constants
bindata/assets/ # Static manifests applied by the operator
manifests/ # CVO-managed manifests (install/upgrade)
test/e2e/ # E2E tests (Ginkgo)
```

## Build and Test Commands

```bash
make build # Build operator and OTE test binaries
make test # Unit tests
make verify # Lint and vet
make test-e2e # E2E tests (requires KUBECONFIG)
```

## Critical Rules

1. **Network policy ordering matters.** In `starter.go`, allow-rules must appear before default-deny rules in the StaticResourceController asset list. Reversing this order causes traffic interruption during reconciliation.

2. **Never remove the "always managed" and "not removable" flags.** The operator calls `management.SetOperatorAlwaysManaged()` and `management.SetOperatorNotRemovable()` because the openshift-controller-manager is a core platform component.

3. **Config observers must return previous config on error.** If an observer encounters an error, it must return `previousConfig` (not empty config) to prevent flapping the operand configuration.

4. **All informer caches must sync before controllers start.** The operator explicitly calls `WaitForCacheSync` for every informer group and returns an error if any fail. Skipping this check causes transiently incorrect operations from partially-synced caches (see OCPBUGS-81472).

5. **Vendor is committed.** After dependency changes, run `go mod tidy && go mod vendor` and commit the vendor update separately from code changes.

## Key Patterns

- **library-go operator framework**: Uses `StaticResourceController`, `ConfigObserver`, `ClusterOperatorStatusController`, `ResourceSyncController`, and `LogLevelController` from `openshift/library-go`.
- **Dual-operand sync**: One sync function manages both the openshift-controller-manager and route-controller-manager Deployments. Errors and status conditions are tracked separately with the `RouteControllerManager` prefix for the route-controller-manager.
- **Capability-gated controllers**: Controllers are disabled by prepending `-` to their name in the operand ConfigMap (e.g., `-openshift.io/build`). The mapping from controllers to capabilities is in `controllerCapabilities` in `sync_openshiftcontrollermanager_v311_00.go`.
- **Input hash-driven rollouts**: ConfigMap resource versions and content hashes are embedded as annotations on the Deployment spec template, triggering a rolling update when upstream inputs change.
- **TLS security profile propagation**: The cluster-wide TLS profile from `apiservers.config.openshift.io` is observed and propagated into the operand config. TLS config changes trigger an operator restart.

## What NOT to Do

- Do not add new operand controllers here — controllers belong in `openshift/openshift-controller-manager`. This repo only manages deployment and configuration.
- Do not modify `manifests/` without coordinating with the release team — these are CVO-managed and affect install/upgrade.
- Do not change the rate limiter parameters in `operator.go` without understanding the impact on API server load during rapid config changes.
- Do not add bindata assets after default-deny network policies — always add allow rules first.

## Test Suites

- **Unit tests**: Colocated `_test.go` files in `pkg/`. Focus on config observer logic and deployment generation.
- **E2E tests**: `test/e2e/` using Ginkgo. Validate network policy enforcement and operator availability on a live cluster.
- **OTE**: The test extension binary (`cluster-openshift-controller-manager-operator-tests-ext`) is built alongside the operator and included in the production image.
113 changes: 113 additions & 0 deletions ARCHITECTURE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
# Architecture: cluster-openshift-controller-manager-operator

## Scope

This operator manages two operand Deployments:

- **openshift-controller-manager** — runs OpenShift-specific controllers (builds, deployments, image triggers, service accounts, templates, unidling)
- **route-controller-manager** — runs the ingress-to-route and ingress IP controllers (split from the controller-manager but still sharing the same operator CR)

The operator does **not** manage the upstream kube-controller-manager (that is `cluster-kube-controller-manager-operator`).

## Namespace Map

| Namespace | Purpose |
|-----------|---------|
| `openshift-controller-manager-operator` | Operator Deployment runs here |
| `openshift-controller-manager` | Operand: openshift-controller-manager Deployment, ConfigMaps, ServiceAccount |
| `openshift-route-controller-manager` | Operand: route-controller-manager Deployment, ConfigMaps, ServiceAccount |
| `openshift-config` | Source for user-specified global configuration (proxy CA, build defaults) |
| `openshift-config-managed` | Source for machine-managed global configuration |

## Component Overview

The operator binary starts a set of library-go controllers that collectively reconcile the desired state:

```
RunOperator()
├── OpenShiftControllerManagerOperator — main sync loop for both operand Deployments
├── StaticResourceController — reconciles RBAC, Namespaces, Services, NetworkPolicies from bindata/
├── ConfigObserver — watches cluster config and transforms it into operand configuration
├── UserCAObservationController — watches proxy config and syncs trusted CA bundles
├── ResourceSyncController — copies Secrets/ConfigMaps between namespaces
├── ClusterOperatorStatusController — aggregates conditions into the ClusterOperator object
├── LogLevelController — reconciles operator log verbosity from CR spec
└── ImagePullSecretCleanupController — cleans up internal-registry pull secrets when registry is disabled
```
Comment on lines +26 to +36

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Label the diagram fence.

This block will trip markdownlint as written. Use an explicit language tag such as text.

Suggested fix
-```
+```text
 RunOperator()
   ├── OpenShiftControllerManagerOperator  — main sync loop for both operand Deployments
   ├── StaticResourceController            — reconciles RBAC, Namespaces, Services, NetworkPolicies from bindata/
   ...
-```
+```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
RunOperator()
├── OpenShiftControllerManagerOperator — main sync loop for both operand Deployments
├── StaticResourceController — reconciles RBAC, Namespaces, Services, NetworkPolicies from bindata/
├── ConfigObserver — watches cluster config and transforms it into operand configuration
├── UserCAObservationController — watches proxy config and syncs trusted CA bundles
├── ResourceSyncController — copies Secrets/ConfigMaps between namespaces
├── ClusterOperatorStatusController — aggregates conditions into the ClusterOperator object
├── LogLevelController — reconciles operator log verbosity from CR spec
└── ImagePullSecretCleanupController — cleans up internal-registry pull secrets when registry is disabled
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 26-26: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 26 - 36, The architecture diagram block is an
unlabeled fenced code block, which triggers markdownlint. Update the fence
around the RunOperator() tree to use an explicit language tag such as text so
the diagram remains rendered correctly. Make this change in the markdown section
containing the controller list and keep the existing content unchanged.

Source: Linters/SAST tools


## Reconciliation Flow

The main sync loop (`syncOpenShiftControllerManager_v311_00_to_latest`) runs on every change to the operator CR, operand namespaces, or watched cluster config:

1. **ConfigMaps** — merge default config with observed config and operator overrides, compute input hashes for rollout detection
2. **Client CA** — sync the kube-apiserver client CA bundle into each operand namespace
3. **Trust bundles** — ensure service-ca and global-ca ConfigMaps exist with injection annotations
4. **Deployments** — apply the operand Deployment with correct image, replica count (one per master node), log level, proxy env vars, and spec annotations
5. **Status** — set Available/Progressing/Degraded/Upgradeable conditions on the operator CR, then let the ClusterOperatorStatusController aggregate them

## Capabilities Integration

The operator dynamically disables operand controllers when cluster capabilities are not enabled:

| Capability | Disabled Controllers |
|-----------|---------------------|
| `Build` | build, build-config-change, builder-sa, builder-rolebindings |
| `DeploymentConfig` | deployer, deployer-sa, deployment-config, deployer-rolebindings |
| `ImageRegistry` | service-account-pull-secrets, image-puller-rolebindings |

When the Build capability is initially disabled, the operator polls every 5 minutes and exits (triggering a restart) if the capability becomes enabled — ensuring a full re-initialization with build controllers active.

## Config Observation

Config observers watch cluster-level resources and merge their state into the operand ConfigMap:

| Observer | Watches | Configures |
|---------|---------|-----------|
| `builds` | `builds.config.openshift.io/cluster` | Build defaults and overrides (only when Build capability is enabled) |
| `images` | `images.config.openshift.io/cluster` | Internal/external registry hostnames |
| `network` | `networks.config.openshift.io/cluster` | Cluster network CIDR for egress/ingress |
| `deployimages` | Operator env vars | Deployer and builder image pull specs |
| `controllers` | `clusterversions.config.openshift.io/version` | Controller enable/disable list based on capabilities |
| `featuregates` | `featuregates.config.openshift.io/cluster` | Feature flags (e.g., `BuildCSIVolumes`) |
| `apiserver` (TLS) | `apiservers.config.openshift.io/cluster` | TLS security profile (cipher suites, min TLS version) propagated to operand |

## CRDs and API Types

- **Owned:** `openshiftcontrollermanagers.operator.openshift.io` — the operator's configuration CR (singleton named `cluster`)
- **Consumed:** `clusteroperators.config.openshift.io`, `clusterversions.config.openshift.io`, `proxies.config.openshift.io`, `builds.config.openshift.io`, `images.config.openshift.io`, `networks.config.openshift.io`, `featuregates.config.openshift.io`

## Manifest and Resource Management

- **CVO-managed** (`manifests/`): Operator Deployment, Namespace, ServiceAccount, RBAC, ServiceMonitors, ClusterOperator, NetworkPolicies — applied by the Cluster Version Operator during install/upgrade
- **Operator-applied** (`bindata/assets/`): Operand Namespaces, Deployments, ConfigMaps, RBAC, Services, NetworkPolicies — reconciled by the StaticResourceController and the main sync loop

## Dependencies

| Dependency | Usage |
|-----------|-------|
| `openshift/library-go` | Operator framework: static resource controller, config observer, status aggregation, resource apply, log level controller |
| `openshift/api` | CRD types (`operator/v1`, `config/v1`, `openshiftcontrolplane/v1`) |
| `openshift/client-go` | Typed clients and informers for OpenShift API groups |
| `k8s.io/client-go` | Kubernetes clients, informers, work queues |

## Testing Strategy

- **Unit tests** (`pkg/...`): config observer logic, deployment generation, controller enable/disable based on capabilities
- **E2E tests** (`test/e2e/`): cluster-level validation using Ginkgo — network policy enforcement, operator availability
- **OTE** (`.openshift-tests-extension/`): test extension binary included in the production image for CI/CD integration

## Design Decisions

1. **Two operands, one operator**: The route-controller-manager was split from the openshift-controller-manager but both share the same operator CR and operator binary. This avoids a separate operator for a small set of controllers, but means the sync loop manages two independent Deployments with separate status condition prefixes.

2. **Capability-aware controller disable**: Rather than separate operand images per capability, the operator passes a controller enable/disable list via the operand ConfigMap. The `-controllerName` convention (prefixing with `-`) disables individual controllers without changing the binary.

3. **Rate-limited sync loop**: The main operator uses a token bucket rate limiter (0.05 tokens/sec, burst 4) to avoid hot-looping on rapid config changes. This is in addition to the workqueue's exponential backoff.

4. **Always managed, not removable**: The operator opts out of library-go's "unmanaged" and "removable" states via `management.SetOperatorAlwaysManaged()` and `management.SetOperatorNotRemovable()`. The openshift-controller-manager is a core platform component that must always run.

5. **Informer cache sync before controller start**: All informer caches must be fully synced before any controller starts making decisions. A bug (OCPBUGS-81472) caused transiently incorrect operations when controllers acted on partially-synced caches. The fix adds explicit `WaitForCacheSync` checks with error returns for each informer group (operator, kube, config) before starting any controllers.

6. **TLS security profile propagation**: The operator observes the cluster-wide TLS security profile from `apiservers.config.openshift.io/cluster` and propagates it to the operand configuration via the config observer. When TLS config changes, the operator restarts to pick up the new settings (CNTRLPLANE-2620).

7. **Capability poll-and-restart**: When the Build capability is disabled at startup, the operator skips the build config informer (avoiding a deadlock in the ConfigObserver — OCPBUGS-22956) and polls ClusterVersion every 5 minutes. If the capability becomes enabled, the operator exits to trigger a pod restart with full re-initialization. This is intentional: capabilities can only be enabled post-install (never disabled), and the newly-enabled informers require a clean startup. The [component-selection enhancement](https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md) keeps capability awareness at the CVO layer and does not prescribe an operator-level reactive pattern.
1 change: 1 addition & 0 deletions CLAUDE.md
71 changes: 71 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# Contributing to cluster-openshift-controller-manager-operator

## Development Workflow

1. Fork the repo and clone your fork
2. Create a feature branch from `master`
3. Make your changes, add or update tests
4. Run verification locally:
```bash
make build verify test
```
5. If you changed dependencies: `go mod tidy && go mod vendor`, commit separately from code changes
6. Push your branch and open a PR

## Pull Request Guidelines

- Keep PRs focused — one logical change per PR
- Reference JIRA tickets in PR title: `OCPBUGS-XXXXX: description` or `CNTRLPLANE-XXXXX: description`
- Include tests for new functionality
- PRs require `/lgtm` from a reviewer and `/approve` from an approver (see OWNERS)
- All PRs require `/verified` before merge (see OpenShift verification workflow)

## Testing

| Command | What It Runs |
|---------|-------------|
| `make build` | Builds the operator and OTE test binaries |
| `make test` | Unit tests in `pkg/...` and `cmd/...` |
| `make verify` | Linting, formatting, vet checks |
| `make test-e2e` | E2E tests against a live cluster (requires `KUBECONFIG`) |

### E2E Tests

E2E tests use Ginkgo and require a running OpenShift cluster. They validate network policy enforcement and operator health.

```bash
make test-e2e
```

## Code Conventions

- **Error handling**: Use `fmt.Errorf` with `%w` for wrapping. Accumulate errors in slices when syncing multiple resources, then report all at once.
- **Status conditions**: Use library-go `v1helpers.SetOperatorCondition` helpers. Prefix condition types with `RouteControllerManager` for route-controller-manager conditions to distinguish from openshift-controller-manager conditions.
- **Config observation**: Each observer returns `(observedConfig, errors)`. On error, return the previous config to avoid flapping.
- **Static resources**: RBAC and namespace manifests go in `bindata/assets/`. Network policy allow rules must be listed before default-deny rules in the StaticResourceController to avoid traffic interruption during reconciliation.

## Areas Requiring Extra Care

- **`bindata/assets/`**: Changes to static manifests affect what the operator reconciles. Network policy ordering matters (allow before deny).
- **`manifests/`**: CVO-managed resources. Changes here affect install/upgrade — coordinate with the release team.
- **Capability integration**: Adding or removing controllers from the capability-disable lists affects which controllers run on clusters with reduced capability sets. Test with capabilities both enabled and disabled.
- **Vendor directory**: Committed. Run `go mod tidy && go mod vendor` and commit separately from code changes.

## CI Pipeline

CI is managed by Prow via [ci-operator](https://docs.ci.openshift.org/docs/architecture/ci-operator/). The build root image is configured in `.ci-operator.yaml`. Required checks include unit tests, build verification, and e2e tests on a live cluster.

## Review and Approval

Reviews and approvals follow the [OWNERS](OWNERS) file. The primary approver groups are `control-plane-approvers` and build-related maintainers. See [OWNERS_ALIASES](OWNERS_ALIASES) for the full list of control-plane approvers.

## Rebase Checklist

When rebasing to a new Kubernetes release:

- [ ] Reference the target [kubernetes release branch](https://github.com/kubernetes/kubernetes/branches) `go.mod` and `CHANGELOG`
- [ ] Bump Go version, all `k8s.io/`, `github.com/openshift/`, and other relevant dependencies
- [ ] Run `go mod vendor && go mod tidy`, commit separately
- [ ] Bump image versions (Dockerfile, `.ci-operator.yaml`) if needed
- [ ] Run `make build verify test`
- [ ] Make code changes as needed until the above pass
Loading