Add control plane recovery complex task #63
…ure and updating gcp variables resolver
…namespace secret ID fix
…event context window bloat
…l_resource to ensure GKE workloads are re-applied on namespace changes
| template: | ||
| metadata: | ||
| labels: | ||
| job-name: setup-and-corrupt |
We might not want to use job names that make it obvious this is a test. The agent is likely to figure that out from the name alone, before inspecting anything.
Ideally, we should look for a way to trigger the corruption from outside the cluster; otherwise we risk the agent inspecting the job and easily identifying the exact trigger.
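A minimal sketch of the naming point, assuming the `job-name` label mirrors the Job's own name; `etcd-maintenance` is a hypothetical placeholder, not a name taken from this PR:

```yaml
# Hypothetical rename: a neutral name that does not reveal the Job's purpose.
template:
  metadata:
    labels:
      job-name: etcd-maintenance  # was: setup-and-corrupt
```

Renaming only hides the intent at a glance. The Job spec is still readable from inside the cluster, so triggering from outside remains the stronger fix.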
| stack: "prebuilt/cp-recovery" | ||
| teardown: true | ||
| prompt: | | ||
| You are a Site Reliability Engineer managing a Kubernetes control plane (simulated by a custom API server 'mock-apiserver' and a 3-member etcd cluster 'etcd-0', 'etcd-1', 'etcd-2') in the namespace '{{NAMESPACE}}' on the cluster '{{GKE_CLUSTER_NAME}}' in project '{{GCP_PROJECT_ID}}'. |
This feels like a task that is better suited to kind than GKE, if I am being honest. With kind, we don't need to simulate etcd and the API server, and we can even trigger the corruption from outside the cluster. This is a fair amount of detail for the task and likely not what was intended.
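For reference, a rough sketch of what the kind alternative could look like, assuming a three-control-plane layout to mirror the three etcd members named in the prompt. The node count is an assumption, not something this PR specifies:

```yaml
# kind cluster config (sketch). Each control-plane node runs its own
# etcd member and kube-apiserver, so nothing needs to be simulated.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: control-plane
  - role: control-plane
```

A file like this would be passed to `kind create cluster --config <file>`. Since kind nodes run as containers on the host, the harness could in principle trigger the corruption from the host without leaving a Job in the cluster.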
| The control plane is currently experiencing a critical failure. | ||
| Restore the control plane to a healthy, consistent state from the latest verified backup in the GCS bucket 'cpr-{{GCP_PROJECT_ID}}-{{NAMESPACE}}'. | ||
| To prevent data drift or further corruption during recovery, ensure mutating requests are blocked and the system status is updated accordingly. |
Let's remove all of this. The agent should be able to work this out on its own; if it can't, we can assume the agent context is misconfigured.
No description provided.