simplyblock · boddumanohar · May 15, 2026 · May 20, 2026
diff --git a/docs/maintenance-operations/node-drain-coordination.md b/docs/maintenance-operations/node-drain-coordination.md
@@ -0,0 +1,70 @@
+---
+title: "Draining Coordination of a Kubernetes Worker Node"
+description: "How the Simplyblock operator automatically protects storage availability during Kubernetes node maintenance such as cordon, drain, and rolling OS upgrades."
+weight: 10800
+---
+
+When a Kubernetes worker node is cordoned or drained, for example, during a rolling OS upgrade or node replacement,
+the Simplyblock Operator automatically coordinates the shutdown and restart of the backend storage node running on
+that worker. No manual intervention is required.
+
+Concurrency is controlled by `StorageCluster.spec.maxFaultTolerance`. It defines the maximum number of Kubernetes
+workers that can be drained at the same time. This prevents the cluster from entering a degraded state during bulk
+maintenance operations and restart cycles.
+
+## How It Works
+
+When the operator detects that a worker node has become cordoned, it executes the following sequence:
+
+1. Creates a `PodDisruptionBudget` to prevent premature pod eviction.
+2. Calls the simplyblock shutdown API for the backend storage node and waits until the node is `offline`.
+3. Relaxes the `PodDisruptionBudget` to allow pod eviction. Kubernetes can now drain the worker.
+4. Waits for the worker to return to a ready, uncordoned state.
+5. Calls the simplyblock restart API and waits until the storage nodes are `online` and cluster `rebalancing` is `false`.
+6. Marks drain coordination `complete` and removes the `PodDisruptionBudget`.
+
+!!! warning
+    If another worker is already in the drain window and `maxFaultTolerance` would be exceeded, the operator holds
+    the new worker in the `detected` phase until an in-progress drain completes to ensure that the cluster remains
+    available and connection loss is mitigated.
+
+## Drain Phases
+
+Each worker being drained progresses through the following phases, tracked in
+`StorageNode.status.drainCoordination`:
+
+| Phase             | Description                                                                   |
+|-------------------|-------------------------------------------------------------------------------|
+| `detected`        | Worker is cordoned. Waiting for a drain slot within `maxFaultTolerance`.      |
+| `shutdown_called` | Backend shutdown API has been called. Waiting for `offline`.                  |
+| `draining`        | Shutdown confirmed. `PodDisruptionBudget` relaxed. Kubernetes may evict pods. |
+| `restart_called`  | Worker is back. Backend restart API has been called. Waiting for `online`.    |
+| `complete`        | Node is back online and cluster rebalancing has finished.                     |
+| `failed`          | An unrecoverable error occurred. Manual intervention may be required.         |
+
+## Monitoring Drain State
+
+The progress of the drain coordination can be monitored using the `StorageNode` custom resource.
+
+```bash title="Inspecting drain coordination status"
+kubectl get storagenode simplyblock-node -n simplyblock \
+  -o jsonpath='{.status.drainCoordination}' | jq .
+```
+
+```bash title="Streaming live changes"
+kubectl get storagenode simplyblock-node -n simplyblock -w
+```
+
+## Configuring Fault Tolerance
+
+To control the number of workers that can be drained simultaneously, configure the `spec.maxFaultTolerance` property
+on the `StorageCluster` resource.
+
+```yaml title="Example: allow one worker in the drain window at a time"
+spec:
+  maxFaultTolerance: 1
+```
+
+A value of `1` is the safest default. The safe maximum depends on the selected erasure coding scheme and replication
+factor. It reflects the maximum number of simultaneous node outages that can be tolerated without connection loss or
+traffic interruption.
diff --git a/docs/maintenance-operations/operator-cluster-operations.md b/docs/maintenance-operations/operator-cluster-operations.md
@@ -0,0 +1,170 @@
+---
+title: "Operating Storage Clusters via Simplyblock Operator"
+description: "How to perform lifecycle operations on a Simplyblock storage cluster and its nodes using the Kubernetes operator and Custom Resource Definitions."
+weight: 10750
+---
+
+When simplyblock is deployed on OpenShift or Kubernetes, cluster and node lifecycle operations are performed by patching
+the `StorageCluster` and `StorageNode` Custom Resources rather than using the CLI directly. The operator picks up the
+changes, calls the backend API, polls for the expected terminal state, and records the result in `.status.actionStatus`.
+
+!!! info
+    For CLI-based node operations on non-Kubernetes deployments, see
+    [Stopping and Manually Restarting a Storage Node](manual-restarting-nodes.md).
+
+## StorageCluster Actions
+
+Storage cluster actions are cluster-wide operations that affect all nodes in the cluster.
+
+To trigger a storage cluster action, patch the `spec.action` property on a `StorageCluster` resource. Only one action
+can run at any given time. The operator sets `.status.actionStatus.state` to `running` while the action is in
+progress and to `success` or `failed` when it completes.
+
+### Shutdown
+
+```bash title="Shutting down the storage cluster"
+kubectl patch storagecluster simplyblock-cluster -n simplyblock \
+  --type=merge -p '{"spec": {"action": "shutdown"}}'
+```
+
+The operator calls the backend shutdown API and polls until the cluster reports `suspended`.
+
+### Start
+
+```bash title="Starting a suspended storage cluster"
+kubectl patch storagecluster simplyblock-cluster -n simplyblock \
+  --type=merge -p '{"spec": {"action": "start"}}'
+```
+
+The operator calls the backend start API and polls until the cluster reports `active`.
+
+### Restart
+
+```bash title="Restarting the storage cluster"
+kubectl patch storagecluster simplyblock-cluster -n simplyblock \
+  --type=merge -p '{"spec": {"action": "restart"}}'
+```
+
+The operator runs a shutdown, waits for `suspended`, runs start, and waits for `active`. The current sub-phase is stored
+in `.status.actionStatus.message`.
+
+### Activate and Reactivate
+
+```bash title="Activating a newly created cluster"
+kubectl patch storagecluster simplyblock-cluster -n simplyblock \
+  --type=merge -p '{"spec": {"action": "activate"}}'
+```
+
+The operator calls the backend activate API and waits until the cluster reports `active`.
+
+### Expand
+
+```bash title="Finalizing a cluster expansion"
+kubectl patch storagecluster simplyblock-cluster -n simplyblock \
+  --type=merge -p '{"spec": {"action": "expand"}}'
+```
+
+The operator calls the backend expand API and waits until the cluster returns to `active`.
+
+!!! info
+    New worker nodes must be added to the storage fabric before the expansion is finalized. For details, see
+    [Expanding a Storage Cluster](scaling/expanding-storage-cluster.md).
+
+### Node Recycle
+
+Node recycle sequentially restarts every backend storage node in the cluster. Use it after updating the storage-node
+container image or changing node configuration.
+
+```bash title="Restarting all storage nodes"
+kubectl patch storagecluster simplyblock-cluster -n simplyblock \
+  --type=merge -p '{"spec": {"action": "node-recycle"}}'
+```
+
+To also refresh the storage-node DaemonSet pod on each worker after shutdown and before restart, set
+`nodeRecycle.refreshSNodeAPI: true`. This is useful, for example, when rolling out a new container image:
+
+```bash title="Restarting all storage nodes and refreshing DaemonSet pods"
+kubectl patch storagecluster simplyblock-cluster -n simplyblock \
+  --type=merge -p '{"spec": {"action": "node-recycle", "nodeRecycle": {"refreshSNodeAPI": true}}}'
+```
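+
+When the `StorageCluster` manifest is managed declaratively, for example through GitOps, the same request can be
+written as a spec fragment. The following is a minimal sketch that mirrors the merge patch above. All other `spec`
+fields are omitted.
+
+```yaml title="Sketch: spec fragment equivalent to the merge patch above"
+spec:
+  # Sequentially restarts every backend storage node.
+  action: node-recycle
+  nodeRecycle:
+    # Also restarts the storage-node DaemonSet pod between shutdown and restart.
+    refreshSNodeAPI: true
+```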
+
+For each backend storage node, the operator performs the following steps:
+
+1. Shuts down the node and waits until it is `offline` or `in_restart`.
+2. If `refreshSNodeAPI: true`, restarts the DaemonSet pod and waits for the storage-node API to become reachable.
+3. Restarts the node and waits until it is `online`.
+4. Waits until cluster `rebalancing` is `false`.
+5. Proceeds to the next node.
+
+Progress is tracked in `.status.actionStatus` and `.status.nodeRecycleStatus`:
+
+```bash title="Watching node recycle progress"
+kubectl get storagecluster simplyblock-cluster -n simplyblock \
+  -o jsonpath='{.status.nodeRecycleStatus}' | jq .
+```
+
+## StorageNode Actions
+
+Direct operations on individual backend storage nodes are triggered by patching `spec.action` and `spec.nodeUUID`
+on the `StorageNode` resource. Both fields are required together. The CRD validation rejects an `action` without a
+`nodeUUID`.
+
+```bash title="Restarting a specific storage node"
+kubectl patch storagenode simplyblock-node -n simplyblock \
+  --type=merge -p '{
+    "spec": {
+      "action": "restart",
+      "nodeUUID": "<node-uuid>"
+    }
+  }'
+```
+
+After the action completes, `spec.action` and `spec.nodeUUID` must be cleared from the custom resource. The operator
+does not automatically clear them.
+
+### Supported Actions and Terminal States
+
+| Action     | Expected backend state after success                            |
+|------------|-----------------------------------------------------------------|
+| `shutdown` | `offline`                                                       |
+| `restart`  | `online`                                                        |
+| `suspend`  | `suspended`                                                     |
+| `resume`   | `online`                                                        |
+| `remove`   | Node no longer present. A `404` response is treated as success. |
+
+### Moving a Storage Node to a Different Worker Node (Storage Node Relocation)
+
+For a `restart` action, the following additional fields are available:
+
+| Field            | Type   | Description                                                                                                                                                                   |
+|------------------|--------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `workerNode`     | string | Kubernetes worker to restart the storage node on. The operator labels the worker and waits for the storage node API to become reachable before triggering the move operation. |
+| `reattachVolume` | bool   | Reattach volumes during restart where the backend supports it.                                                                                                                |
+| `force`          | bool   | Force the action where supported by the backend.                                                                                                                              |
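+
+As an illustration, the fields could be combined with a `restart` action as sketched below. The sketch assumes that
+the additional fields sit directly under `spec` next to `action` and `nodeUUID`. This placement is not confirmed here
+and should be verified against the installed CRD. `<target-worker>` is a placeholder for the Kubernetes worker name.
+
+```yaml title="Sketch: relocating a storage node (field placement is an assumption)"
+spec:
+  action: restart
+  nodeUUID: "<node-uuid>"
+  # Assumption: the restart-specific fields are siblings of action and nodeUUID.
+  workerNode: "<target-worker>"  # placeholder: worker to restart the storage node on
+  reattachVolume: true           # reattach volumes where the backend supports it
+  force: false                   # set to true only where a forced restart is intended
+```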
+
+## Monitoring Action Progress
+
+### Watch Cluster Action State
+
+```bash title="Getting current action status"
+kubectl get storagecluster simplyblock-cluster -n simplyblock \
+  -o jsonpath='{.status.actionStatus}' | jq .
+```
+
+```bash title="Streaming live status changes"
+kubectl get storagecluster simplyblock-cluster -n simplyblock -w
+```
+
+### Read Backend Cluster Status
+
+```bash title="Getting backend lifecycle status"
+kubectl get storagecluster simplyblock-cluster -n simplyblock \
+  -o jsonpath='{.status.status}{"\n"}'
+```
+
+### Inspect Individual Node States
+
+```bash title="Getting all storage node states"
+kubectl get storagenode simplyblock-node -n simplyblock \
+  -o jsonpath='{.status.nodes}' | jq .
+```
diff --git a/docs/maintenance-operations/scaling/expanding-storage-cluster.md b/docs/maintenance-operations/scaling/expanding-storage-cluster.md
@@ -31,6 +31,38 @@ Once all newly added nodes are healthy/ready, finalize the expansion:
 
 After the expansion is complete, the cluster returns to **ACTIVE** and resumes normal operation mode.
 
+## Adding Worker Nodes with the Kubernetes Operator
+
+When running simplyblock on Kubernetes, adding new worker nodes to the storage fabric is achieved by appending them to
+the current `StorageNode.spec.workerNodes` configuration:
+
+```bash title="Add worker nodes via the operator"
+kubectl patch storagenode simplyblock-node -n simplyblock \
+  --type=json -p '[
+    {"op":"add","path":"/spec/workerNodes/-","value":"new-node-4"},
+    {"op":"add","path":"/spec/workerNodes/-","value":"new-node-5"}
+  ]'
+```
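+
+The JSON Patch path `/spec/workerNodes/-` appends to the end of the existing array, so the array must already exist
+on the resource. As a sketch, the resulting list might look as follows. `existing-node-1` is a hypothetical entry that
+stands in for whatever workers were already configured.
+
+```yaml title="Sketch: workerNodes after the patch (existing entry is hypothetical)"
+spec:
+  workerNodes:
+    - existing-node-1  # hypothetical: previously configured worker
+    - new-node-4       # appended by the first patch operation
+    - new-node-5       # appended by the second patch operation
+```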
+
+The Simplyblock Operator automatically picks up the change, deploys the storage-node DaemonSet to the newly added
+workers, registers them with the simplyblock backend, and waits for each node to come online.
+
+The backend transitions to **IN_EXPANSION** during this process.
+
+Once the nodes are online, finalize the expansion using the `StorageCluster` action:
+
+```bash title="Finalize expansion via the operator"
+kubectl patch storagecluster simplyblock-cluster -n simplyblock \
+  --type=merge -p '{"spec": {"action": "expand"}}'
+```
+
+Progress can be monitored using the `StorageCluster` status:
+
+```bash title="Watch expansion status"
+kubectl get storagecluster simplyblock-cluster -n simplyblock \
+  -o jsonpath='{.status.status}{"\n"}' -w
+```
+
 ```plain title="Example output for finalizing cluster expansion"
 [demo@demo ~]# {{ cliname }} cluster complete-expand e2cda3fe-e9f2-42ce-bb2d-eecd10f58ccf
 2026-02-19 11:28:49,995: 139892426475328: INFO: Connecting to remote_jm_af8d10c1-6613-47a9-8ed0-ebdf1f873738