From 311815acb42b6c6e17b47e31c8df30cf7332351f Mon Sep 17 00:00:00 2001 From: Manohar Reddy Date: Fri, 15 May 2026 19:16:25 +0200 Subject: [PATCH 1/2] cluster operations: node drain and cluster expansion --- .../node-drain-coordination.md | 65 +++++++ .../operator-cluster-operations.md | 168 ++++++++++++++++++ .../scaling/expanding-storage-cluster.md | 30 ++++ 3 files changed, 263 insertions(+) create mode 100644 docs/maintenance-operations/node-drain-coordination.md create mode 100644 docs/maintenance-operations/operator-cluster-operations.md diff --git a/docs/maintenance-operations/node-drain-coordination.md b/docs/maintenance-operations/node-drain-coordination.md new file mode 100644 index 00000000..631d249b --- /dev/null +++ b/docs/maintenance-operations/node-drain-coordination.md @@ -0,0 +1,65 @@ +--- +title: "Kubernetes Node Drain Coordination" +description: "How the Simplyblock operator automatically protects storage availability during Kubernetes node maintenance such as cordon, drain, and rolling OS upgrades." +weight: 10800 +--- + +When a Kubernetes worker node is cordoned or drained — for example during a rolling OS upgrade or node replacement — +the Simplyblock operator automatically coordinates the shutdown and restart of the backend storage node running on +that worker. No manual intervention is required. + +Concurrency is controlled by `StorageCluster.spec.maxFaultTolerance`. At most that many workers may be inside the +active drain window at once, preventing the cluster from entering a degraded state during bulk maintenance. + +## How It Works + +When the operator detects that a worker node has become cordoned, it executes the following sequence: + +1. Create a PodDisruptionBudget to prevent premature pod eviction. +2. Call the Simplyblock shutdown API for the backend storage node and wait until `offline`. +3. Relax the PDB to allow pod eviction — Kubernetes can now drain the worker. +4. Wait for the worker to return to a ready, uncordoned state. +5. 
Call the Simplyblock restart API and wait until `online` and cluster `rebalancing` is `false`. +6. Mark drain coordination `complete` and remove the PDB. + +!!! warning + If another worker is already in the drain window and `maxFaultTolerance` would be exceeded, the operator holds + the new worker in the `detected` phase until an in-progress drain completes. + +## Drain Phases + +Each worker being drained progresses through the following phases, tracked in +`StorageNode.status.drainCoordination`: + +| Phase | Description | +|-------------------|-----------------------------------------------------------------------------| +| `detected` | Worker is cordoned; waiting for a drain slot within `maxFaultTolerance`. | +| `shutdown_called` | Backend shutdown API has been called; waiting for `offline`. | +| `draining` | Shutdown confirmed; PDB relaxed — Kubernetes may evict pods. | +| `restart_called` | Worker is back; backend restart API has been called; waiting for `online`. | +| `complete` | Node is back online and cluster rebalancing has finished. | +| `failed` | An unrecoverable error occurred; manual intervention may be required. | + +## Monitoring Drain State + +```bash title="Inspect drain coordination status" +kubectl get storagenode simplyblock-node -n simplyblock \ + -o jsonpath='{.status.drainCoordination}' | jq . +``` + +```bash title="Stream live changes" +kubectl get storagenode simplyblock-node -n simplyblock -w +``` + +## Configuring Fault Tolerance + +Set `spec.maxFaultTolerance` on the `StorageCluster` resource to control how many workers can be simultaneously +inside the drain window: + +```yaml title="Example: allow one worker in the drain window at a time" +spec: + maxFaultTolerance: 1 +``` + +A value of `1` is the safest default. Increase it only if your erasure coding scheme and replication factor can +tolerate multiple simultaneous node outages without data unavailability. 
diff --git a/docs/maintenance-operations/operator-cluster-operations.md b/docs/maintenance-operations/operator-cluster-operations.md new file mode 100644 index 00000000..4ba69b3a --- /dev/null +++ b/docs/maintenance-operations/operator-cluster-operations.md @@ -0,0 +1,168 @@ +--- +title: "Cluster and Node Operations via the Kubernetes Operator" +description: "How to perform lifecycle operations on a Simplyblock storage cluster and its nodes using the Kubernetes operator and Custom Resource Definitions." +weight: 10750 +--- + +When Simplyblock is deployed on Kubernetes, cluster and node lifecycle operations are performed by patching the +`StorageCluster` and `StorageNode` Custom Resources rather than using the CLI directly. The operator picks up the +change, calls the backend API, polls for the expected terminal state, and records the result in `.status.actionStatus`. + +!!! info + For CLI-based node operations on non-Kubernetes deployments, see + [Stopping and Manually Restarting a Storage Node](manual-restarting-nodes.md). + +## StorageCluster Actions + +Trigger a cluster-wide action by patching `spec.action` on the `StorageCluster` resource. Only one action runs at +a time. The operator sets `.status.actionStatus.state` to `running` while the action is in progress and to +`success` or `failed` when it completes. + +### Shutdown + +```bash title="Shut down the storage cluster" +kubectl patch storagecluster simplyblock-cluster -n simplyblock \ + --type=merge -p '{"spec": {"action": "shutdown"}}' +``` + +The operator calls the backend shutdown API and polls until the cluster reports `suspended`. + +### Start + +```bash title="Start a suspended storage cluster" +kubectl patch storagecluster simplyblock-cluster -n simplyblock \ + --type=merge -p '{"spec": {"action": "start"}}' +``` + +The operator calls the backend start API and polls until the cluster reports `active`. 
+ +### Restart + +```bash title="Restart the storage cluster" +kubectl patch storagecluster simplyblock-cluster -n simplyblock \ + --type=merge -p '{"spec": {"action": "restart"}}' +``` + +Runs shutdown → waits for `suspended` → runs start → waits for `active`. The current sub-phase is stored in +`.status.actionStatus.message`. + +### Activate + +```bash title="Activate a newly created cluster" +kubectl patch storagecluster simplyblock-cluster -n simplyblock \ + --type=merge -p '{"spec": {"action": "activate"}}' +``` + +The operator calls the backend activate API and waits until the cluster reports `active`. + +### Expand + +```bash title="Finalize a cluster expansion" +kubectl patch storagecluster simplyblock-cluster -n simplyblock \ + --type=merge -p '{"spec": {"action": "expand"}}' +``` + +The operator calls the backend expand API and waits until the cluster returns to `active`. + +!!! info + To add new worker nodes to the storage fabric first, see + [Expanding a Storage Cluster](scaling/expanding-storage-cluster.md). + +### Node Recycle + +Node recycle sequentially restarts every backend storage node in the cluster. Use it after updating the storage-node +container image or changing node configuration. + +```bash title="Recycle all storage nodes" +kubectl patch storagecluster simplyblock-cluster -n simplyblock \ + --type=merge -p '{"spec": {"action": "node-recycle"}}' +``` + +To also refresh the storage-node DaemonSet pod on each worker after shutdown and before restart — for example when +rolling out a new container image — add `nodeRecycle.refreshSNodeAPI: true`: + +```bash title="Recycle all storage nodes and refresh DaemonSet pods" +kubectl patch storagecluster simplyblock-cluster -n simplyblock \ + --type=merge -p '{"spec": {"action": "node-recycle", "nodeRecycle": {"refreshSNodeAPI": true}}}' +``` + +For each backend storage node the operator executes: + +1. Shut down the node and wait until `offline` or `in_restart`. +2. 
If `refreshSNodeAPI: true`, restart the DaemonSet pod and wait for the storage-node API to become reachable. +3. Restart the node and wait until `online`. +4. Wait until cluster `rebalancing` is `false`. +5. Proceed to the next node. + +Progress is tracked in `.status.actionStatus` and `.status.nodeRecycleStatus`: + +```bash title="Watch node recycle progress" +kubectl get storagecluster simplyblock-cluster -n simplyblock \ + -o jsonpath='{.status.nodeRecycleStatus}' | jq . +``` + +## StorageNode Actions + +Direct operations on individual backend storage nodes are triggered by patching `spec.action` and `spec.nodeUUID` +on the `StorageNode` resource. Both fields are required together — CRD validation rejects an `action` without a +`nodeUUID`. + +```bash title="Restart a specific storage node" +kubectl patch storagenode simplyblock-node -n simplyblock \ + --type=merge -p '{ + "spec": { + "action": "restart", + "nodeUUID": "" + } + }' +``` + +After the action completes, clear `spec.action` and `spec.nodeUUID` from the CR — the operator does not clear them +automatically. + +### Supported Actions and Terminal States + +| Action | Expected backend state after success | +|------------|------------------------------------------------| +| `shutdown` | `offline` | +| `restart` | `online` | +| `suspend` | `suspended` | +| `resume` | `online` | +| `remove` | node no longer present; `404` treated as success | + +### Restart with Worker Relocation + +For a `restart` action, two additional fields are available: + +| Field | Type | Description | +|------------------|------|-------------| +| `workerNode` | string | Kubernetes worker to restart the node on. The operator labels the worker and waits for the storage-node API to become reachable before triggering restart. | +| `reattachVolume` | bool | Reattach volumes during restart where the backend supports it. | +| `force` | bool | Force the action where supported by the backend. 
| + +## Monitoring Action Progress + +### Watch cluster action state + +```bash title="Get current action status" +kubectl get storagecluster simplyblock-cluster -n simplyblock \ + -o jsonpath='{.status.actionStatus}' | jq . +``` + +```bash title="Stream live status changes" +kubectl get storagecluster simplyblock-cluster -n simplyblock -w +``` + +### Read backend cluster status + +```bash title="Get backend lifecycle status" +kubectl get storagecluster simplyblock-cluster -n simplyblock \ + -o jsonpath='{.status.status}{"\n"}' +``` + +### Inspect individual node states + +```bash title="Get all storage node states" +kubectl get storagenode simplyblock-node -n simplyblock \ + -o jsonpath='{.status.nodes}' | jq . +``` diff --git a/docs/maintenance-operations/scaling/expanding-storage-cluster.md b/docs/maintenance-operations/scaling/expanding-storage-cluster.md index 08e87b6a..b3365927 100644 --- a/docs/maintenance-operations/scaling/expanding-storage-cluster.md +++ b/docs/maintenance-operations/scaling/expanding-storage-cluster.md @@ -31,6 +31,36 @@ Once all newly added nodes are healthy/ready, finalize the expansion: After the expansion is complete, the cluster returns to **ACTIVE** and resumes normal operation mode. +## Adding Worker Nodes with the Kubernetes Operator + +When running Simplyblock on Kubernetes, add new worker nodes to the storage fabric by appending them to +`StorageNode.spec.workerNodes`: + +```bash title="Add worker nodes via the operator" +kubectl patch storagenode simplyblock-node -n simplyblock \ + --type=json -p '[ + {"op":"add","path":"/spec/workerNodes/-","value":"new-node-4"}, + {"op":"add","path":"/spec/workerNodes/-","value":"new-node-5"} + ]' +``` + +The operator deploys the storage-node DaemonSet to the new workers, registers them with the Simplyblock backend, +and waits for each node to come online. The backend transitions to **IN_EXPANSION** during this process. 
+ +Once the nodes are online, finalize the expansion using the `StorageCluster` action: + +```bash title="Finalize expansion via the operator" +kubectl patch storagecluster simplyblock-cluster -n simplyblock \ + --type=merge -p '{"spec": {"action": "expand"}}' +``` + +Monitor progress: + +```bash title="Watch expansion status" +kubectl get storagecluster simplyblock-cluster -n simplyblock \ + -o jsonpath='{.status.status}{"\n"}' -w +``` + ```plain title="Example output for finalizing cluster expansion" [demo@demo ~]# {{ cliname }} cluster complete-expand e2cda3fe-e9f2-42ce-bb2d-eecd10f58ccf 2026-02-19 11:28:49,995: 139892426475328: INFO: Connecting to remote_jm_af8d10c1-6613-47a9-8ed0-ebdf1f873738 From 87b8de421aed4a1bf935b76c1d589396f5415979 Mon Sep 17 00:00:00 2001 From: "Christoph Engelbert (noctarius)" Date: Wed, 20 May 2026 18:01:49 +0200 Subject: [PATCH 2/2] Documentation cleanup --- .../node-drain-coordination.md | 57 +++++----- .../operator-cluster-operations.md | 102 +++++++++--------- .../scaling/expanding-storage-cluster.md | 12 ++- 3 files changed, 90 insertions(+), 81 deletions(-) diff --git a/docs/maintenance-operations/node-drain-coordination.md b/docs/maintenance-operations/node-drain-coordination.md index 631d249b..70c10d44 100644 --- a/docs/maintenance-operations/node-drain-coordination.md +++ b/docs/maintenance-operations/node-drain-coordination.md @@ -1,65 +1,70 @@ --- -title: "Kubernetes Node Drain Coordination" +title: "Draining Coordination of a Kubernetes Worker Node" description: "How the Simplyblock operator automatically protects storage availability during Kubernetes node maintenance such as cordon, drain, and rolling OS upgrades." 
weight: 10800 --- -When a Kubernetes worker node is cordoned or drained — for example during a rolling OS upgrade or node replacement — -the Simplyblock operator automatically coordinates the shutdown and restart of the backend storage node running on +When a Kubernetes worker node is cordoned or drained, for example, during a rolling OS upgrade or node replacement, +the Simplyblock Operator automatically coordinates the shutdown and restart of the backend storage node running on that worker. No manual intervention is required. -Concurrency is controlled by `StorageCluster.spec.maxFaultTolerance`. At most that many workers may be inside the -active drain window at once, preventing the cluster from entering a degraded state during bulk maintenance. +Concurrency is controlled by `StorageCluster.spec.maxFaultTolerance`. It defines the maximum number of Kubernetes +workers that can be drained at the same time. This prevents the cluster from entering a degraded state during bulk +maintenance operations and restart cycles. ## How It Works When the operator detects that a worker node has become cordoned, it executes the following sequence: -1. Create a PodDisruptionBudget to prevent premature pod eviction. -2. Call the Simplyblock shutdown API for the backend storage node and wait until `offline`. -3. Relax the PDB to allow pod eviction — Kubernetes can now drain the worker. -4. Wait for the worker to return to a ready, uncordoned state. -5. Call the Simplyblock restart API and wait until `online` and cluster `rebalancing` is `false`. -6. Mark drain coordination `complete` and remove the PDB. +1. Creates a `PodDisruptionBudget` to prevent premature pod eviction. +2. Calls the simplyblock shutdown API for the backend storage node and waits until `offline`. +3. Relaxes the `PodDisruptionBudget` to allow pod eviction. Kubernetes can now drain the worker. +4. Waits for the worker to return to a ready, uncordoned state. +5. 
Calls the simplyblock restart API and waits until the storage nodes are `online` and cluster `rebalancing` is `false`. +6. Marks drain coordination `complete` and removes the `PodDisruptionBudget`. !!! warning If another worker is already in the drain window and `maxFaultTolerance` would be exceeded, the operator holds - the new worker in the `detected` phase until an in-progress drain completes. + the new worker in the `detected` phase until an in-progress drain completes to ensure that the cluster remains + available and connection loss is mitigated. ## Drain Phases Each worker being drained progresses through the following phases, tracked in `StorageNode.status.drainCoordination`: -| Phase | Description | -|-------------------|-----------------------------------------------------------------------------| -| `detected` | Worker is cordoned; waiting for a drain slot within `maxFaultTolerance`. | -| `shutdown_called` | Backend shutdown API has been called; waiting for `offline`. | -| `draining` | Shutdown confirmed; PDB relaxed — Kubernetes may evict pods. | -| `restart_called` | Worker is back; backend restart API has been called; waiting for `online`. | -| `complete` | Node is back online and cluster rebalancing has finished. | -| `failed` | An unrecoverable error occurred; manual intervention may be required. | +| Phase | Description | +|-------------------|-------------------------------------------------------------------------------| +| `detected` | Worker is cordoned. Waiting for a drain slot within `maxFaultTolerance`. | +| `shutdown_called` | Backend shutdown API has been called. Waiting for `offline`. | +| `draining` | Shutdown confirmed. `PodDisruptionBudget` relaxed. Kubernetes may evict pods. | +| `restart_called` | Worker is back. Backend restart API has been called. Waiting for `online`. | +| `complete` | Node is back online and cluster rebalancing has finished. | +| `failed` | An unrecoverable error occurred. Manual intervention may be required. 
| ## Monitoring Drain State -```bash title="Inspect drain coordination status" +The progress of the drain coordination can be monitored using the `StorageNode` custom resource. + +```bash title="Inspecting drain coordination status" kubectl get storagenode simplyblock-node -n simplyblock \ -o jsonpath='{.status.drainCoordination}' | jq . ``` -```bash title="Stream live changes" +```bash title="Streaming live changes" kubectl get storagenode simplyblock-node -n simplyblock -w ``` ## Configuring Fault Tolerance -Set `spec.maxFaultTolerance` on the `StorageCluster` resource to control how many workers can be simultaneously -inside the drain window: +To control the number of workers that can be simultaneously drained, the property `spec.maxFaultTolerance` on the +`StorageCluster` resource can be configured. ```yaml title="Example: allow one worker in the drain window at a time" spec: maxFaultTolerance: 1 ``` -A value of `1` is the safest default. Increase it only if your erasure coding scheme and replication factor can -tolerate multiple simultaneous node outages without data unavailability. +A value of `1` is the safest default. The safe maximum for this value depends on the selected erasure coding scheme and +replication factor. It reflects the maximum number of simultaneous node outages that can be tolerated without connection loss or +traffic interruption. diff --git a/docs/maintenance-operations/operator-cluster-operations.md b/docs/maintenance-operations/operator-cluster-operations.md index 4ba69b3a..b4b1701b 100644 --- a/docs/maintenance-operations/operator-cluster-operations.md +++ b/docs/maintenance-operations/operator-cluster-operations.md @@ -1,12 +1,12 @@ --- -title: "Cluster and Node Operations via the Kubernetes Operator" +title: "Operating Storage Clusters via Simplyblock Operator" description: "How to perform lifecycle operations on a Simplyblock storage cluster and its nodes using the Kubernetes operator and Custom Resource Definitions." 
weight: 10750 --- -When Simplyblock is deployed on Kubernetes, cluster and node lifecycle operations are performed by patching the -`StorageCluster` and `StorageNode` Custom Resources rather than using the CLI directly. The operator picks up the -change, calls the backend API, polls for the expected terminal state, and records the result in `.status.actionStatus`. +When simplyblock is deployed on OpenShift or Kubernetes, cluster and node lifecycle operations are performed by patching +the `StorageCluster` and `StorageNode` Custom Resources rather than using the CLI directly. The operator picks up the +changes, calls the backend API, polls for the expected terminal state, and records the result in `.status.actionStatus`. !!! info For CLI-based node operations on non-Kubernetes deployments, see @@ -14,13 +14,15 @@ change, calls the backend API, polls for the expected terminal state, and record ## StorageCluster Actions -Trigger a cluster-wide action by patching `spec.action` on the `StorageCluster` resource. Only one action runs at -a time. The operator sets `.status.actionStatus.state` to `running` while the action is in progress and to -`success` or `failed` when it completes. +Storage cluster actions are cluster-wide operations that affect all nodes in the cluster. + +To trigger a storage cluster action, the `spec.action` property on a `StorageCluster` resource must be patched. Only +one action can run at any given time. The operator sets `.status.actionStatus.state` to `running` while the action is in +progress and to `success` or `failed` when it completes. 
### Shutdown -```bash title="Shut down the storage cluster" +```bash title="Shutting down the storage cluster" kubectl patch storagecluster simplyblock-cluster -n simplyblock \ --type=merge -p '{"spec": {"action": "shutdown"}}' ``` @@ -29,7 +31,7 @@ The operator calls the backend shutdown API and polls until the cluster reports ### Start -```bash title="Start a suspended storage cluster" +```bash title="Starting a suspended storage cluster" kubectl patch storagecluster simplyblock-cluster -n simplyblock \ --type=merge -p '{"spec": {"action": "start"}}' ``` @@ -38,17 +40,17 @@ The operator calls the backend start API and polls until the cluster reports `ac ### Restart -```bash title="Restart the storage cluster" +```bash title="Restarting the storage cluster" kubectl patch storagecluster simplyblock-cluster -n simplyblock \ --type=merge -p '{"spec": {"action": "restart"}}' ``` -Runs shutdown → waits for `suspended` → runs start → waits for `active`. The current sub-phase is stored in -`.status.actionStatus.message`. +The operator runs a shutdown, waits for `suspended`, runs start, and waits for `active`. The current sub-phase is stored +in `.status.actionStatus.message`. -### Activate +### Activate and Reactivate -```bash title="Activate a newly created cluster" +```bash title="Activating a newly created cluster" kubectl patch storagecluster simplyblock-cluster -n simplyblock \ --type=merge -p '{"spec": {"action": "activate"}}' ``` @@ -57,7 +59,7 @@ The operator calls the backend activate API and waits until the cluster reports ### Expand -```bash title="Finalize a cluster expansion" +```bash title="Finalizing a cluster expansion" kubectl patch storagecluster simplyblock-cluster -n simplyblock \ --type=merge -p '{"spec": {"action": "expand"}}' ``` @@ -65,7 +67,7 @@ kubectl patch storagecluster simplyblock-cluster -n simplyblock \ The operator calls the backend expand API and waits until the cluster returns to `active`. !!! 
info - To add new worker nodes to the storage fabric first, see + More information on how to add new worker nodes to the storage fabric first is available in [Expanding a Storage Cluster](scaling/expanding-storage-cluster.md). ### Node Recycle @@ -73,30 +75,30 @@ The operator calls the backend expand API and waits until the cluster returns to Node recycle sequentially restarts every backend storage node in the cluster. Use it after updating the storage-node container image or changing node configuration. -```bash title="Recycle all storage nodes" +```bash title="Restarting all storage nodes" kubectl patch storagecluster simplyblock-cluster -n simplyblock \ --type=merge -p '{"spec": {"action": "node-recycle"}}' ``` -To also refresh the storage-node DaemonSet pod on each worker after shutdown and before restart — for example when -rolling out a new container image — add `nodeRecycle.refreshSNodeAPI: true`: +To also refresh the storage-node DaemonSet pod on each worker after shutdown and before restart, add +`nodeRecycle.refreshSNodeAPI: true`. This is useful, for example, when rolling out a new container image: -```bash title="Recycle all storage nodes and refresh DaemonSet pods" +```bash title="Restarting all storage nodes and refreshing DaemonSet pods" kubectl patch storagecluster simplyblock-cluster -n simplyblock \ --type=merge -p '{"spec": {"action": "node-recycle", "nodeRecycle": {"refreshSNodeAPI": true}}}' ``` For each backend storage node the operator executes: -1. Shut down the node and wait until `offline` or `in_restart`. -2. If `refreshSNodeAPI: true`, restart the DaemonSet pod and wait for the storage-node API to become reachable. -3. Restart the node and wait until `online`. -4. Wait until cluster `rebalancing` is `false`. -5. Proceed to the next node. +1. Shuts down the node and waits until `offline` or `in_restart`. +2. If `refreshSNodeAPI: true`, restarts the DaemonSet pod and waits for the storage-node API to become reachable. +3. 
Restarts the node and waits until `online`. +4. Waits until cluster `rebalancing` is `false`. +5. Proceeds to the next node. Progress is tracked in `.status.actionStatus` and `.status.nodeRecycleStatus`: -```bash title="Watch node recycle progress" +```bash title="Watching node recycle progress" kubectl get storagecluster simplyblock-cluster -n simplyblock \ -o jsonpath='{.status.nodeRecycleStatus}' | jq . ``` @@ -104,10 +106,10 @@ kubectl get storagecluster simplyblock-cluster -n simplyblock \ ## StorageNode Actions Direct operations on individual backend storage nodes are triggered by patching `spec.action` and `spec.nodeUUID` -on the `StorageNode` resource. Both fields are required together — CRD validation rejects an `action` without a +on the `StorageNode` resource. Both fields are required together. The CRD validation rejects an `action` without a `nodeUUID`. -```bash title="Restart a specific storage node" +```bash title="Restarting a specific storage node" kubectl patch storagenode simplyblock-node -n simplyblock \ --type=merge -p '{ "spec": { @@ -117,52 +119,52 @@ kubectl patch storagenode simplyblock-node -n simplyblock \ }' ``` -After the action completes, clear `spec.action` and `spec.nodeUUID` from the CR — the operator does not clear them -automatically. +After the action completes, `spec.action` and `spec.nodeUUID` must be cleared from the custom resource. The operator +does not automatically clear them. 
### Supported Actions and Terminal States -| Action | Expected backend state after success | -|------------|------------------------------------------------| -| `shutdown` | `offline` | -| `restart` | `online` | -| `suspend` | `suspended` | -| `resume` | `online` | -| `remove` | node no longer present; `404` treated as success | +| Action | Expected backend state after success | +|------------|-----------------------------------------------------------------| +| `shutdown` | `offline` | +| `restart` | `online` | +| `suspend` | `suspended` | +| `resume` | `online` | +| `remove` | Node no longer present. A `404` response is treated as success. | -### Restart with Worker Relocation +### Moving a Storage Node to a Different Worker Node (Storage Node Relocation) For a `restart` action, two additional fields are available: -| Field | Type | Description | -|------------------|------|-------------| -| `workerNode` | string | Kubernetes worker to restart the node on. The operator labels the worker and waits for the storage-node API to become reachable before triggering restart. | -| `reattachVolume` | bool | Reattach volumes during restart where the backend supports it. | -| `force` | bool | Force the action where supported by the backend. | +| Field | Type | Description | +|------------------|--------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `workerNode` | string | Kubernetes worker to restart the storage node on. The operator labels the worker and waits for the storage node API to become reachable before triggering the move operation. | +| `reattachVolume` | bool | Reattach volumes during restart where the backend supports it. | +| `force` | bool | Force the action where supported by the backend. 
| ## Monitoring Action Progress -### Watch cluster action state +### Watch Cluster Action State -```bash title="Get current action status" +```bash title="Getting current action status" kubectl get storagecluster simplyblock-cluster -n simplyblock \ -o jsonpath='{.status.actionStatus}' | jq . ``` -```bash title="Stream live status changes" +```bash title="Streaming live status changes" kubectl get storagecluster simplyblock-cluster -n simplyblock -w ``` -### Read backend cluster status +### Read Backend Cluster Status -```bash title="Get backend lifecycle status" +```bash title="Getting backend lifecycle status" kubectl get storagecluster simplyblock-cluster -n simplyblock \ -o jsonpath='{.status.status}{"\n"}' ``` -### Inspect individual node states +### Inspect Individual Node States -```bash title="Get all storage node states" +```bash title="Getting all storage node states" kubectl get storagenode simplyblock-node -n simplyblock \ -o jsonpath='{.status.nodes}' | jq . ``` diff --git a/docs/maintenance-operations/scaling/expanding-storage-cluster.md b/docs/maintenance-operations/scaling/expanding-storage-cluster.md index b3365927..f60f29bf 100644 --- a/docs/maintenance-operations/scaling/expanding-storage-cluster.md +++ b/docs/maintenance-operations/scaling/expanding-storage-cluster.md @@ -33,8 +33,8 @@ After the expansion is complete, the cluster returns to **ACTIVE** and resumes n ## Adding Worker Nodes with the Kubernetes Operator -When running Simplyblock on Kubernetes, add new worker nodes to the storage fabric by appending them to -`StorageNode.spec.workerNodes`: +When running simplyblock on Kubernetes, new worker nodes are added to the storage fabric by appending them to +the current `StorageNode.spec.workerNodes` configuration: ```bash title="Add worker nodes via the operator" kubectl patch storagenode simplyblock-node -n simplyblock \ @@ -44,8 +44,10 @@ kubectl patch storagenode simplyblock-node -n simplyblock \ ]' ``` -The operator deploys 
the storage-node DaemonSet to the new workers, registers them with the Simplyblock backend, -and waits for each node to come online. The backend transitions to **IN_EXPANSION** during this process. +The Simplyblock Operator automatically picks up the change, deploys the storage-node DaemonSet to the newly +added workers, registers them with the simplyblock backend, and waits for each node to come online. + +The backend transitions to **IN_EXPANSION** during this process. Once the nodes are online, finalize the expansion using the `StorageCluster` action: @@ -54,7 +56,7 @@ kubectl patch storagecluster simplyblock-cluster -n simplyblock \ --type=merge -p '{"spec": {"action": "expand"}}' ``` -Monitor progress: +Progress can be monitored using the `StorageCluster` status: ```bash title="Watch expansion status" kubectl get storagecluster simplyblock-cluster -n simplyblock \