Merged
12 changes: 9 additions & 3 deletions .github/workflows/ci.yaml
@@ -15,11 +15,17 @@ jobs:
with:
path: platform

- name: Checkout seam-core (replace dep)
- name: Checkout seam (replace dep)
uses: actions/checkout@v4
with:
repository: ontai-dev/seam-core
path: seam-core
repository: ontai-dev/seam
path: seam

- name: Checkout seam-sdk (replace dep)
uses: actions/checkout@v4
with:
repository: ontai-dev/seam-sdk
path: seam-sdk

- name: Checkout conductor (replace dep)
uses: actions/checkout@v4
44 changes: 34 additions & 10 deletions CLAUDE.md
@@ -2,28 +2,52 @@
> Read ~/ontai/CLAUDE.md first. The constraints below extend the root constitutional document.

### Schema authority

Primary: docs/platform-schema.md
CRD schema authority: ~/ontai/seam-core/docs/seam-core-schema.md (Decision G: seam-core owns InfrastructureTalosCluster and InfrastructureRunnerConfig type definitions; platform owns reconciliation behavior)
Supporting: ~/ontai/conductor/docs/conductor-schema.md (Conductor capabilities and job protocol for operational Jobs)
Supporting: ~/ontai/guardian/docs/guardian-schema.md (RBACProfile gate and enable phase order)
Supporting: ~/ontai/wrapper/docs/wrapper-schema.md (PackInstance gate for Cilium deployment)

Supporting (read before any design or implementation work):
- ~/ontai/seam/docs/seam-schema.md -- RunnerConfig and TalosCluster CRD schema (seam is the canonical module; not seam-core)
- ~/ontai/conductor/docs/conductor-schema.md -- Conductor capabilities and Job protocol for all operational Jobs
- ~/ontai/guardian/docs/guardian-schema.md -- RBACProfile gate and enable phase order
- ~/ontai/dispatcher/docs/dispatcher-schema.md -- PackInstalled gate for Cilium deployment (not wrapper)

### Invariants

INV-015 -- Deletion of TalosCluster never triggers physical cluster destruction. ClusterReset is the only path to cluster destruction.
CP-INV-001 -- The talos goclient is restricted to SeamInfrastructureClusterReconciler and SeamInfrastructureMachineReconciler only. Every other reconciler in platform has zero talos goclient access. (root INV-013)

CP-INV-001 -- The talos goclient is restricted to SeamInfrastructureClusterReconciler and SeamInfrastructureMachineReconciler only. Every other reconciler in platform has zero talos goclient access. Any other file importing the talos goclient is an invariant violation. (root INV-013)

CP-INV-002 -- All reconcilers outside the Seam Infrastructure Provider observe cluster state through CAPI Machine status conditions and Kubernetes node labels only. No direct Talos API queries outside the provider.

CP-INV-003 -- RunnerConfig is generated by the operator using the shared runner library for all operational Job CRDs. Never hand-coded. Not generated for CAPI-managed lifecycle operations.

CP-INV-004 -- platform creates tenant namespaces. It is the sole namespace creation authority. No other component creates seam-tenant-{cluster-name} namespaces.

CP-INV-006 -- TalosClusterReset requires ontai.dev/reset-approved=true annotation before any reconciliation proceeds.

CP-INV-007 -- Leader election required. Lease name: platform-leader. Lease namespace: seam-system.

CP-INV-008 -- TalosCluster owns all CAPI objects for target clusters via ownerReference. No CAPI object exists in a tenant namespace without a TalosCluster ownerReference.

CP-INV-009 -- Every TalosConfigTemplate includes cluster.network.cni.name: none and Cilium-required BPF kernel parameters. Omitting them leaves nodes permanently NotReady.
CP-INV-010 -- Kueue is not used for any operation in platform. Operational runner Jobs submit directly. Kueue governs wrapper pack-deploy Jobs exclusively.

CP-INV-010 -- Kueue is not used for any operation in platform. Operational runner Jobs submit directly. Kueue governs dispatcher pack-deploy Jobs exclusively.

CP-INV-011 -- The Seam Infrastructure Provider binary is distroless. Contains talos goclient and kube goclient only. (root INV-022)
CP-INV-012 -- platform is installed after guardian reaches operational state and its RBACProfile reaches provisioned=true.

CP-INV-012 -- platform installs after guardian reaches operational state and its RBACProfile reaches provisioned=true.

CP-INV-013 -- CiliumPending on TalosCluster is not a degraded state. It is the expected state between CAPI cluster Running and Cilium PackInstance Ready.

### Session protocol additions
Step 4a -- Read platform-design.md in this repository.
Step 4b -- Determine which category the target CRD belongs to before implementing any reconciler: CAPI-managed lifecycle (TalosCluster target path, SeamInfrastructureCluster, SeamInfrastructureMachine -- no RunnerConfig); operational runner Job CRDs (TalosBackup, TalosEtcdMaintenance, TalosPKIRotation, TalosRecovery, TalosHardeningApply, TalosNodePatch, TalosCredentialRotation, TalosClusterReset -- verify capability in conductor-schema.md first). PlatformTenant is dropped: tenant coordination is handled by InfrastructureTalosCluster (mode=import or mode=bootstrap) plus the conductor role=tenant Deployment managed by the compiler enable bundle.
Step 4c -- For any Seam Infrastructure Provider session: confirm talos goclient access is bounded to SeamInfrastructureClusterReconciler and SeamInfrastructureMachineReconciler only. Any other file importing talos goclient is a CP-INV-001 violation.

Step 4a -- Read platform-design.md in this repository before any implementation session.

Step 4b -- Determine which category the target CRD belongs to before implementing any reconciler:
- CAPI-managed lifecycle path (TalosCluster target path, SeamInfrastructureCluster, SeamInfrastructureMachine): no RunnerConfig generated. These reconcilers must not import the talos goclient (only the Seam Infrastructure Provider reconcilers may).
- Dual-path CRDs (UpgradePolicy, NodeOperation, ClusterMaintenance): check spec.capi.enabled on the owning TalosCluster. CAPI path uses native CAPI machinery. Non-CAPI path submits a Conductor executor Job via RunnerConfig. Verify the named capability in conductor-schema.md before implementing.
- Direct Conductor Job CRDs (EtcdMaintenance, NodeMaintenance, PKIRotation, ClusterReset, TalosMachineConfigBackup, TalosMachineConfigRestore, MaintenanceBundle): always submit a Conductor executor Job regardless of capi.enabled. Verify the named capability in conductor-schema.md before implementing.
- Configuration-only CRDs (HardeningProfile): no Job submission. Validates spec and sets status conditions only.
- Schedule CRDs (TalosEtcdBackupSchedule, TalosMachineConfigBackupSchedule): create child operation CRs on interval. No direct Job submission.
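
The Step 4b decision can be sketched as a lookup table plus one rule for the dual-path case. This is a minimal illustration only: the `Category` type, its constant names, and both function names are hypothetical and do not come from the platform code base. The `TalosCluster` entry covers the target-cluster path only.

```go
package main

import "fmt"

// Category is a hypothetical label for the five groups listed in Step 4b.
type Category string

const (
	CAPILifecycle Category = "capi-managed-lifecycle"
	DualPath      Category = "dual-path"
	DirectJob     Category = "direct-conductor-job"
	ConfigOnly    Category = "configuration-only"
	Schedule      Category = "schedule"
	Unknown       Category = "unknown"
)

// kindCategory mirrors the kinds named in Step 4b; nothing else is assumed.
var kindCategory = map[string]Category{
	"TalosCluster":                     CAPILifecycle, // target path only
	"SeamInfrastructureCluster":        CAPILifecycle,
	"SeamInfrastructureMachine":        CAPILifecycle,
	"UpgradePolicy":                    DualPath,
	"NodeOperation":                    DualPath,
	"ClusterMaintenance":               DualPath,
	"EtcdMaintenance":                  DirectJob,
	"NodeMaintenance":                  DirectJob,
	"PKIRotation":                      DirectJob,
	"ClusterReset":                     DirectJob,
	"TalosMachineConfigBackup":         DirectJob,
	"TalosMachineConfigRestore":        DirectJob,
	"MaintenanceBundle":                DirectJob,
	"HardeningProfile":                 ConfigOnly,
	"TalosEtcdBackupSchedule":          Schedule,
	"TalosMachineConfigBackupSchedule": Schedule,
}

// CategoryFor returns the Step 4b category for a kind, or Unknown.
func CategoryFor(kind string) Category {
	if c, ok := kindCategory[kind]; ok {
		return c
	}
	return Unknown
}

// SubmitsConductorJob reports whether a reconciler for kind submits a
// Conductor executor Job, given spec.capi.enabled on the owning TalosCluster.
func SubmitsConductorJob(kind string, capiEnabled bool) bool {
	switch CategoryFor(kind) {
	case DirectJob:
		return true // always, regardless of capi.enabled
	case DualPath:
		return !capiEnabled // the CAPI path uses native CAPI machinery
	default:
		return false // lifecycle, configuration-only, and schedule CRDs
	}
}

func main() {
	for _, kind := range []string{"UpgradePolicy", "ClusterReset", "HardeningProfile"} {
		fmt.Println(kind, CategoryFor(kind), SubmitsConductorJob(kind, true))
	}
}
```

A table like this keeps the category decision in one place, so a reconciler for a new kind cannot be written without first declaring which path it takes.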

Step 4c -- For any Seam Infrastructure Provider session: confirm talos goclient access is bounded to SeamInfrastructureClusterReconciler and SeamInfrastructureMachineReconciler only. Run a grep for talos goclient imports across all reconciler files before and after any change. Any other file importing the talos goclient is a CP-INV-001 violation and must be corrected before the session closes.
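
The grep in Step 4c could also be done with a small Go scan that parses import blocks instead of matching text. This is a sketch under stated assumptions: the import prefix `github.com/siderolabs/talos/pkg/machinery/client` is a guess at what "talos goclient" refers to, and the two allowed file names are hypothetical placeholders for wherever the two provider reconcilers actually live.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"io/fs"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// ASSUMPTION: the source only says "talos goclient"; adjust this prefix
// to the import path the repository really uses.
const talosClientPrefix = "github.com/siderolabs/talos/pkg/machinery/client"

// ASSUMPTION: hypothetical file names for the two permitted reconcilers.
var allowedFiles = map[string]bool{
	"seaminfrastructurecluster_reconciler.go": true,
	"seaminfrastructuremachine_reconciler.go": true,
}

// importsPrefix reports whether a Go source file imports any package whose
// path starts with prefix. If src is nil the file is read from disk.
func importsPrefix(filename string, src any, prefix string) (bool, error) {
	f, err := parser.ParseFile(token.NewFileSet(), filename, src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return false, err
		}
		if strings.HasPrefix(path, prefix) {
			return true, nil
		}
	}
	return false, nil
}

// findViolations walks root and returns every .go file that imports the
// prefix but whose base name is not in the allowed set.
func findViolations(root, prefix string, allowed map[string]bool) ([]string, error) {
	var violations []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if d.IsDir() || !strings.HasSuffix(path, ".go") {
			return nil
		}
		hit, err := importsPrefix(path, nil, prefix)
		if err != nil {
			return err
		}
		if hit && !allowed[filepath.Base(path)] {
			violations = append(violations, path)
		}
		return nil
	})
	return violations, err
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	violations, err := findViolations(root, talosClientPrefix, allowedFiles)
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		os.Exit(2)
	}
	for _, v := range violations {
		fmt.Println("CP-INV-001 violation:", v)
	}
	if len(violations) > 0 {
		os.Exit(1)
	}
}
```

Run before and after a change, a non-zero exit code flags a file that would need correcting before the session closes.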
176 changes: 81 additions & 95 deletions README.md
@@ -1,139 +1,125 @@
# platform

**Seam Platform operator**
**API Group:** `platform.ontai.dev` (ONT-native), `infrastructure.cluster.x-k8s.io` (CAPI)
**Image:** `registry.ontai.dev/ontai-dev/platform:<semver>`
Platform is the CAPI management plane operator and ONT-native Infrastructure Provider for Talos. It owns the complete lifecycle of Talos clusters under Seam governance and all day-2 operational CRDs.

---
## API Groups

## What this repository is
### seam.ontai.dev/v1alpha1

`platform` is the CAPI management plane operator and the ONT-native Infrastructure
Provider for Talos. It owns the complete lifecycle of Talos clusters and all tenant
coordination.
| Kind | Short | Scope | Purpose |
|------|-------|-------|---------|
| TalosCluster | tc | Namespaced | Root CR for every cluster under Seam governance |
| ClusterLog | clog | Namespaced | Accumulated day-2 operation history per cluster per revision |

---
These types are defined in `api/seam/v1alpha1/`. TalosCluster and ClusterLog live under `seam.ontai.dev`, not `platform.ontai.dev`.

### platform.ontai.dev/v1alpha1

| Kind | Short | Scope | Purpose |
|------|-------|-------|---------|
| EtcdMaintenance | em | Namespaced | Etcd backup, restore, and defrag operations |
| TalosEtcdBackupSchedule | etcdbs | Namespaced | Recurring etcd backup schedule (creates EtcdMaintenance CRs) |
| NodeMaintenance | nm | Namespaced | Node-level patch, hardening-apply, credential-rotate |
| NodeOperation | nop | Namespaced | Node scale-up, decommission, reboot |
| PKIRotation | pkir | Namespaced | Cluster PKI certificate rotation |
| ClusterReset | crst | Namespaced | Destructive factory reset (human gate required) |
| ClusterMaintenance | cmaint | Namespaced | Maintenance window gate with CAPI pause integration |
| UpgradePolicy | upgp | Namespaced | Talos OS, Kubernetes, or combined stack upgrades |
| HardeningProfile | hp | Namespaced | Reusable hardening ruleset (configuration CR, not a Job trigger) |
| MaintenanceBundle | mb | Namespaced | Pre-compiled scheduling artifact from `compiler maintenance` |
| TalosMachineConfigBackup | mcb | Namespaced | Node machine config backup to S3 |
| TalosMachineConfigBackupSchedule | mcbs | Namespaced | Recurring machine config backup schedule |
| TalosMachineConfigRestore | mcr | Namespaced | Node machine config restore from S3 |

### infrastructure.cluster.x-k8s.io (CAPI -- frozen)

| Kind | Purpose |
|------|---------|
| SeamInfrastructureCluster | Cluster-level CAPI infrastructure reference |
| SeamInfrastructureMachine | Per-node CAPI infrastructure reference |

## CRDs

### ONT-native (`platform.ontai.dev`)

| Kind | Role |
|---|---|
| `TalosCluster` | Root declaration for a Talos target cluster (CAPI composition root) |
| `TalosClusterReset` | Affirmative CR for cluster destruction with human approval gate |
| `TalosBackup` | Operational runner Job for etcd snapshot backup |
| `TalosEtcdMaintenance` | Operational runner Job for etcd defragmentation and compaction |
| `TalosPKIRotation` | Operational runner Job for PKI certificate rotation |
| `TalosRecovery` | Operational runner Job for cluster recovery from etcd snapshot |
| `TalosHardeningApply` | Operational runner Job for CIS benchmark hardening |
| `TalosNodePatch` | Operational runner Job for targeted node configuration patch |
| `TalosNodeOperation` | Operational runner Job for node cordon, drain, and reboot sequences |
| `TalosCredentialRotation` | Operational runner Job for credential rotation |
| `ClusterMaintenance` | Operational runner Job for scheduled maintenance windows |
| `UpgradePolicy` | Declared upgrade policy for a cluster or node pool |
| `HardeningProfile` | Declared hardening target profile |
| `MaintenanceBundle` | Aggregate maintenance intent record |

### CAPI Infrastructure Provider (`infrastructure.cluster.x-k8s.io`)

| Kind | Role |
|---|---|
| `SeamInfrastructureCluster` | CAPI InfrastructureCluster implementation for Talos |
| `SeamInfrastructureMachine` | CAPI InfrastructureMachine implementation for Talos nodes |
These implement the CAPI InfrastructureCluster and InfrastructureMachine contracts. Schema is frozen and out of scope for platform development.

---

## Architecture

Platform operates in three modes.
Platform operates in three modes simultaneously on the management cluster.

### CAPI target cluster lifecycle

For `spec.capi.enabled=true` TalosCluster CRs, Platform creates and owns CAPI objects (SeamInfrastructureCluster, cluster.x-k8s.io/Cluster, TalosControlPlane, MachineDeployment, TalosConfigTemplate, SeamInfrastructureMachineTemplate) in the tenant namespace via ownerReference (CP-INV-008). CAPI controllers reconcile those objects to actual cluster state through the Seam Infrastructure Provider.

The Seam Infrastructure Provider (SeamInfrastructureClusterReconciler and SeamInfrastructureMachineReconciler) is the only part of Platform that uses the talos goclient. It watches SeamInfrastructureMachine objects and delivers CABPT-rendered machineconfigs to pre-provisioned Talos nodes on port 50000.

Dual-path CRDs (UpgradePolicy, NodeOperation, ClusterMaintenance) delegate to CAPI native machinery on this path. No Conductor Job is submitted for CAPI-managed lifecycle operations.

### Direct bootstrap management cluster

**CAPI composition (target cluster lifecycle):**
`TalosCluster` is the root object. The platform reconciler creates and owns CAPI
objects (`Cluster`, `TalosControlPlane`, `MachineDeployment`, `SeamInfrastructureCluster`,
`SeamInfrastructureMachine`) as children of `TalosCluster`. The Seam Infrastructure
Provider reconcilers deliver machineconfigs to pre-provisioned nodes on port 50000
via the talos goclient.
For the management cluster TalosCluster CR (`spec.capi.enabled=false`), CAPI is not used. Management cluster bootstrap is Seam-native: the Compiler generates machineconfigs, Platform submits a bootstrap Conductor Job, and the cluster forms without CAPI intermediation.

**Direct bootstrap Job (management cluster):**
The ONT bootstrap path via conductor Jobs is used for management cluster bootstrap.
CAPI is not involved in management cluster provisioning.
All operational CRDs apply to the management cluster via direct Conductor executor Job submission regardless of `capi.enabled`.

**Operational runner Jobs (Talos operational CRDs):**
Seven CRDs (`TalosBackup`, `TalosEtcdMaintenance`, `TalosPKIRotation`, `TalosRecovery`,
`TalosHardeningApply`, `TalosNodePatch`, `TalosCredentialRotation`) submit conductor
executor Jobs directly. Kueue is not used for any platform operation.
### Operational runner Jobs

**Tenant coordination:**
Platform creates `seam-tenant-{cluster-name}` namespaces. It is the sole namespace
creation authority. Tenant coordination CRDs (`UpgradePolicy`, `HardeningProfile`,
`MaintenanceBundle`) are pure record-keeping reconcilers with no runner Jobs.
For operational CRDs (EtcdMaintenance, NodeMaintenance, PKIRotation, ClusterReset, and the non-CAPI paths of UpgradePolicy and NodeOperation), Platform generates a RunnerConfig using the shared runner library and submits a Conductor executor Job directly, without Kueue admission control (CP-INV-010).

---

## Key invariants

- The talos goclient is restricted exclusively to `SeamInfrastructureClusterReconciler`
and `SeamInfrastructureMachineReconciler`. All other reconcilers have zero talos
goclient access.
- `TalosCluster` deletion never triggers cluster destruction. `TalosClusterReset`
is the only destruction path, and requires `ontai.dev/reset-approved=true`.
- Kueue is not used for any operation in platform.
- Platform installs after guardian reaches `provisioned=true` on its `RBACProfile`.
**talos goclient restriction (CP-INV-001):** The talos goclient is restricted to SeamInfrastructureClusterReconciler and SeamInfrastructureMachineReconciler only. Every other reconciler in Platform has zero talos goclient access.

---
**TalosCluster deletion never destroys a cluster (INV-015):** Deleting a TalosCluster CR cascades to owned CAPI objects through Kubernetes garbage collection but does not factory-reset any node. ClusterReset is the only path to physical cluster destruction.

## Building
**Kueue is not used (CP-INV-010):** Platform does not use Kueue for any operation. Operational runner Jobs submit directly. Kueue governs dispatcher pack-deploy Jobs exclusively.

```sh
go build ./cmd/platform
```
**RunnerConfig is generated by the operator (CP-INV-003):** RunnerConfig is always generated by Platform using the shared runner library. It is never hand-coded and is not generated for CAPI-managed lifecycle operations.

The binary is built into a distroless container image:
**ClusterReset requires human approval (CP-INV-006):** The `ontai.dev/reset-approved=true` annotation must be present on the ClusterReset CR before any reconciliation proceeds.

```sh
docker build -t registry.ontai.dev/ontai-dev/platform:<semver> .
```
**Tenant namespaces (CP-INV-004):** Platform is the sole authority for creating `seam-tenant-{cluster-name}` namespaces.

---
**Cilium install order (CP-INV-009, CP-INV-013):** Every TalosConfigTemplate includes `cluster.network.cni.name: none` and Cilium BPF kernel parameters. CiliumPending on TalosCluster is not a degraded state; it is the expected state between CAPI cluster Running and Cilium PackInstance Ready.

## Testing
**Install gate (CP-INV-012):** Platform installs after Guardian reaches operational state and its RBACProfile reaches `provisioned=true`.

```sh
go test ./test/unit/...
```
**Leader election (CP-INV-007):** Leader election is required. Lease name: `platform-leader`. Lease namespace: `seam-system`.
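
Two of the invariants above reduce to small pure functions: the reset approval gate (CP-INV-006) and the tenant namespace name (CP-INV-004). The sketch below assumes only the annotation key and the namespace pattern stated in this README. The function and constant names are hypothetical and are not taken from the platform code.

```go
package main

import "fmt"

// Annotation key as stated for CP-INV-006.
const resetApprovedAnnotation = "ontai.dev/reset-approved"

// ResetApproved reports whether a ClusterReset CR carries the human approval
// annotation with the exact value "true". A nil map is treated as unapproved.
func ResetApproved(annotations map[string]string) bool {
	return annotations[resetApprovedAnnotation] == "true"
}

// TenantNamespace builds the seam-tenant-{cluster-name} namespace name.
func TenantNamespace(clusterName string) string {
	return "seam-tenant-" + clusterName
}

func main() {
	annotations := map[string]string{resetApprovedAnnotation: "true"}
	fmt.Println(ResetApproved(annotations)) // approved: reconciliation may proceed
	fmt.Println(ResetApproved(nil))         // no annotation: reconciler must not act
	fmt.Println(TenantNamespace("example")) // "example" is a placeholder cluster name
}
```

A reconciler would call a check like `ResetApproved` before doing any other work and return without action when it is false.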

---

## Schema and design reference
## Build and test

- `docs/platform-schema.md` - API contract, field definitions, status conditions
- `platform-design.md` - Implementation architecture and reconciler design
```
make build
make test
make e2e # requires MGMT_KUBECONFIG
make docker-build IMAGE_REGISTRY=10.20.0.1:5000/ontai-dev
make docker-push IMAGE_REGISTRY=10.20.0.1:5000/ontai-dev
```

Operator Deployments and enable bundles always reference `:dev` in lab and development environments (INV-023).

---

## Status
## Schema

Alpha. Deployed and tested on management cluster (ccs-mgmt).
Tenant cluster onboarding is not yet verified end to end.
See [docs/platform-schema.md](./docs/platform-schema.md)
for current capability and known gaps.
Primary schema reference: `docs/platform-schema.md`

CRDs are deployed and reconciling on the live management cluster.
The schema specification is published at:
https://schema.ontai.dev/v1alpha1/
Supporting references:

## Contributing
- `~/ontai/seam/docs/seam-schema.md` -- RunnerConfig and TalosCluster CRD schema
- `~/ontai/conductor/docs/conductor-schema.md` -- Conductor capabilities and Job protocol
- `~/ontai/guardian/docs/guardian-schema.md` -- RBACProfile gate and enable phase order
- `~/ontai/dispatcher/docs/dispatcher-schema.md` -- PackInstalled gate for Cilium

---

Read [CONTRIBUTING.md](./CONTRIBUTING.md) before opening a pull
request. Every new reconciliation behavior requires a written
specification and senior engineer sign-off before any code is
written.
## Issues

File issues at https://github.com/ontai-dev/platform/issues.
For security issues contact security@ontai.dev directly.
https://github.com/ontai-dev/platform/issues

---

*platform - Seam Platform Operator*
*Apache License, Version 2.0*
platform - Seam Platform Operator
Apache License, Version 2.0
6 changes: 3 additions & 3 deletions api/infrastructure/v1alpha1/lineage_conditions.go
@@ -6,14 +6,14 @@ package v1alpha1
//
// Seam Infrastructure Provider reconcilers reference these via the infrav1alpha1
// package alias; they continue to compile without modification. New code should
// prefer importing github.com/ontai-dev/seam-core/pkg/conditions directly.
// prefer importing github.com/ontai-dev/seam/pkg/conditions directly.

import "github.com/ontai-dev/seam-core/pkg/conditions"
import "github.com/ontai-dev/seam/pkg/conditions"

const (
// ConditionTypeLineageSynced is the reserved condition type for lineage
// synchronization status on every root declaration CR.
// Canonical source: github.com/ontai-dev/seam-core/pkg/conditions.
// Canonical source: github.com/ontai-dev/seam/pkg/conditions.
ConditionTypeLineageSynced = conditions.ConditionTypeLineageSynced

// ReasonLineageControllerAbsent is set when the reconciler initialises
@@ -3,7 +3,7 @@ package v1alpha1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

"github.com/ontai-dev/seam-core/pkg/lineage"
"github.com/ontai-dev/seam/pkg/lineage"
)

// Condition type constants for SeamInfrastructureCluster.
@@ -3,7 +3,7 @@ package v1alpha1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

"github.com/ontai-dev/seam-core/pkg/lineage"
"github.com/ontai-dev/seam/pkg/lineage"
)

// NodeRole defines the role of a node in a Talos cluster.