OperAID is an open-source testbed for evaluating LLM agents as autonomous operators of 5G Core networks deployed on Kubernetes. It provides a closed-loop pipeline: Fault Injection → Agentic Diagnosis → Remediation → Execution-Based Verification.
```bash
# Install dependencies
pip install -r requirements.txt

# Set your API key
export OPENROUTER_API_KEY="sk-..."

# Run a single experiment
./run_experiment.sh --api-key "$OPENROUTER_API_KEY" --model z-ai/glm-5 --scenario 1

# Run without tools (no-tools condition)
./run_experiment.sh --api-key "$OPENROUTER_API_KEY" --model z-ai/glm-5 --scenario 1 --no-tools

# Run a full suite (YAML config)
python3 run_suite.py --suite suites/tool_impact.yaml --api-key "$OPENROUTER_API_KEY"

# Generate paper figures from results
python3 visualize_suite.py --stats paper/figures/suite_statistics.json --csv summary.csv -o paper/figures
```

Requirements:

- Kubernetes cluster (KinD recommended)
- openverso-charts — Helm charts for Open5GS + UERANSIM deployment
- Python 3.10+
- `kubectl` and `helm` configured and pointing to your cluster
- OpenRouter API key (or any OpenAI-compatible API)
Set the charts path in config.env or via environment variable:
```bash
export OPENVERSO_CHARTS_DIR=/path/to/openverso-charts
```

Repository layout:

```
operaid/
├── config.env                    # Main configuration (LLM, K8s, timeouts)
├── run_experiment.sh             # Single experiment runner
├── run_suite.py                  # Suite runner with YAML config
├── regenerate_stats.py           # Regenerate suite_statistics.json + summary.csv from per-run data
├── nuke-deployment.sh            # Full cleanup / redeploy
├── reset-deployment.sh           # Fast reset via helm upgrade
├── visualize.py                  # Single-run visualizer
├── visualize_suite.py            # Suite visualizer (generates paper figures)
├── scenario_definitions.json     # Scenario definitions + expected remediation
├── deployments/
│   └── open5gs.json              # Open5GS deployment profile (faults, health, tools)
├── engine/
│   ├── diagnose.py               # Multi-turn LLM diagnosis engine
│   └── profile.py                # Deployment profile loader
├── tools/
│   ├── kubectl_tools.py          # Built-in kubectl diagnostic tools + custom tool registry
│   └── __init__.py
├── health/
│   └── health_check.py           # Generic deployment health check
├── scenarios/
│   ├── scenario_1_netpol.yaml    # S1: NetworkPolicy fault
│   ├── scenario_2_configmap.yaml # S2: ConfigMap fault
│   └── scenario_3_upf_scale.yaml # S3: UPF scaling fault
├── suites/
│   ├── tool_impact.yaml          # 5 models × 2 conditions × 3 scenarios × 30 runs
│   ├── single_model_quick.yaml
│   ├── validation_test.yaml
│   ├── temperature_sweep.yaml
│   └── scenario_deep_dive.yaml
└── suite_results/                # Experiment run outputs (git-ignored)
```

Fault scenarios:
| ID | Type | Description |
|---|---|---|
| S1 | Network | NetworkPolicy blocks AMF→SMF SBI (port 7777) |
| S2 | Configuration | SMF references non-existent ConfigMap → CrashLoopBackOff |
| S3 | Scaling | UPF scaled to 0 replicas → no user plane |
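Each scenario ID maps to a `fault_injection` entry in the deployment profile. As a minimal sketch of how a `kubectl_apply`-style entry could be turned into a command (the `build_fault_command` helper is hypothetical, not part of the OperAID codebase):

```python
# Sketch: translate a fault-injection profile entry into a kubectl argv.
# The profile structure mirrors deployments/open5gs.json; build_fault_command
# is an illustrative helper, not OperAID's actual injection code.

def build_fault_command(profile: dict, scenario_id: str) -> list[str]:
    """Return the kubectl argv that would inject the given fault."""
    entry = profile["fault_injection"][scenario_id]
    if entry["method"] == "kubectl_apply":
        return ["kubectl", "apply", "-n", profile["namespace"], "-f", entry["file"]]
    raise ValueError(f"unsupported injection method: {entry['method']}")

profile = {
    "namespace": "open5gs",
    "fault_injection": {
        "1": {"method": "kubectl_apply", "file": "scenarios/scenario_1_netpol.yaml"},
    },
}

print(build_fault_command(profile, "1"))
# ['kubectl', 'apply', '-n', 'open5gs', '-f', 'scenarios/scenario_1_netpol.yaml']
```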
Main configuration file, sourced by `run_experiment.sh`:

- `LLM_PROVIDER` / `LLM_MODEL` — LLM provider and model
- `LLM_MAX_TURNS` — max diagnosis turns (default: 3)
- `LLM_MAX_TOKENS` — max output tokens (default: 4096)
- `NAMESPACE` — Kubernetes namespace (default: from deployment profile)
- `DEPLOYMENT_PROFILE` — deployment profile name or path (default: open5gs)
- Various timeouts for health checks, remediation, and API calls
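Since `config.env` is a plain KEY=VALUE shell file, it can also be read from Python. A minimal, hypothetical parser (OperAID itself sources the file from `run_experiment.sh`):

```python
# Sketch: parse a config.env-style KEY=VALUE file with defaults.
# parse_env is illustrative; OperAID sources config.env from the shell instead.

def parse_env(text: str) -> dict[str, str]:
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip().strip('"')
    return config

sample = '''
# LLM settings
LLM_PROVIDER=openrouter
LLM_MODEL="z-ai/glm-5"
LLM_MAX_TURNS=3
'''
cfg = parse_env(sample)
print(cfg["LLM_MODEL"], cfg.get("LLM_MAX_TOKENS", "4096"))
# z-ai/glm-5 4096
```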
Deployment profiles (deployments/*.json) define everything specific to a target deployment:
- Components — NF names and deployment prefixes
- Context prompt — deployment description injected into the LLM system prompt
- Fault injection — method and parameters per scenario
- Health check — expected deployments and readiness criteria
- Custom tools — additional diagnostic tools beyond the built-in set
Example (deployments/open5gs.json):
```json
{
  "name": "open5gs",
  "namespace": "open5gs",
  "context_prompt": "Open5GS is a 5G Core network running on Kubernetes...",
  "components": { "amf": { "deployment": "open5gs-amf" }, ... },
  "fault_injection": { "1": { "method": "kubectl_apply", "file": "scenarios/scenario_1_netpol.yaml" }, ... },
  "health_check": { "check_type": "deployments", "expected_deployments": [...] },
  "custom_tools": []
}
```

Suite YAML files (in `suites/`) define experiment matrices:
```yaml
common:
  profile: open5gs
  scenarios: [1, 2, 3]
  runs_per_scenario: 30
  max_turns: 3
  temperature: 0.0
  max_tokens: 4096
  custom_tools: []
experiments:
  - name: "glm-5-all-tools"
    model: "z-ai/glm-5"
    use_tools: true
  - name: "glm-5-no-tools"
    model: "z-ai/glm-5"
    use_tools: false
```

Custom diagnostic tools can be added via the deployment profile or suite YAML. Each tool defines an OpenAI function-calling schema and an executor:
```json
{
  "custom_tools": [
    {
      "schema": {
        "type": "function",
        "function": {
          "name": "get_configmaps",
          "description": "List all ConfigMaps in the namespace.",
          "parameters": {"type": "object", "properties": {}, "required": []}
        }
      },
      "executor": {"type": "kubectl", "command": "get configmaps"}
    },
    {
      "schema": {
        "type": "function",
        "function": {
          "name": "check_endpoint",
          "description": "Check an HTTP endpoint.",
          "parameters": {
            "type": "object",
            "properties": {"port": {"type": "integer"}},
            "required": ["port"]
          }
        }
      },
      "executor": {"type": "shell", "command": "curl -sf http://localhost:{port}/health || echo 'unhealthy'"}
    }
  ]
}
```

Two executor types are supported:
- `kubectl` — runs a kubectl command template. `{arg}` placeholders are substituted from tool arguments. Blocked commands (`edit`, `exec`, etc.) are rejected.
- `shell` — runs an arbitrary shell command. `{namespace}` and tool arguments are available as placeholders.
Suite-level `custom_tools` overrides the profile's list. The default is `[]` (built-in tools only).
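The placeholder substitution and command blocking described above can be sketched as follows. This is illustrative only: the real logic lives in `tools/kubectl_tools.py`, and the blocklist here is an assumption beyond the `edit`/`exec` examples the docs give.

```python
# Sketch: substitute {arg} placeholders into an executor command template and
# reject blocked kubectl verbs. Illustrative; see tools/kubectl_tools.py for
# the actual implementation. BLOCKED extends the documented "edit, exec, etc."
# with assumed entries.

BLOCKED = {"edit", "exec", "delete", "cp"}

def render_command(executor: dict, args: dict, namespace: str) -> str:
    rendered = executor["command"].format(namespace=namespace, **args)
    if executor["type"] == "kubectl":
        verb = rendered.split()[0]
        if verb in BLOCKED:
            raise PermissionError(f"blocked kubectl verb: {verb}")
        rendered = f"kubectl -n {namespace} {rendered}"
    return rendered

print(render_command({"type": "kubectl", "command": "get configmaps"}, {}, "open5gs"))
# kubectl -n open5gs get configmaps
print(render_command(
    {"type": "shell", "command": "curl -sf http://localhost:{port}/health || echo 'unhealthy'"},
    {"port": 8080}, "open5gs"))
# curl -sf http://localhost:8080/health || echo 'unhealthy'
```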
| Tool | Description |
|---|---|
| `get_pods` | List all pods with status and restart counts |
| `get_events` | List Kubernetes events for the namespace |
| `describe_pod(name)` | Detailed pod information including events |
| `get_pod_logs(name)` | Container stdout/stderr logs |
| `get_deployment(name)` | Deployment spec, status, and conditions |
| `get_networkpolicies` | List NetworkPolicies and their rules |
| `run_kubectl(cmd)` | Execute arbitrary read-only kubectl commands |
Each experiment run follows this pipeline:
- Pre-flight health check — first run in a suite performs a full nuke for a clean baseline; subsequent runs skip the nuke if the cluster is healthy (the previous run's LLM diagnosis phase provides a natural stabilization window)
- Fault injection — deterministic fault applied via profile-defined method
- Agentic diagnosis — multi-turn LLM reasoning with tool access (configurable)
- Remediation — proposed `kubectl` commands executed against the cluster
- Execution-based verification — health checks confirm all NFs are ready
- Fallback recovery — if LLM remediation fails, fast reset (helm upgrade) then full nuke as last resort
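The steps above can be sketched as a single closed loop. Every function here is a stand-in stub for an OperAID component (the actual orchestration is split between `run_experiment.sh` and `engine/diagnose.py`):

```python
# Sketch of the closed-loop pipeline with stand-in stubs; all names here are
# placeholders, not real OperAID APIs.
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    root_cause: str
    commands: list[str] = field(default_factory=list)

def run_experiment(inject, diagnose, execute, verify, fallback) -> str:
    inject()                      # fault injection (profile-defined)
    d = diagnose()                # multi-turn agentic diagnosis
    for cmd in d.commands:        # remediation: proposed kubectl commands
        execute(cmd)
    if verify():                  # execution-based verification
        return "success"
    fallback()                    # fallback recovery (reset, then nuke)
    return "fallback"

# Simulated S3 run: the "cluster" is a dict, remediation scales UPF back up.
state = {"upf_replicas": 0}
result = run_experiment(
    inject=lambda: None,
    diagnose=lambda: Diagnosis("UPF scaled to 0",
                               ["kubectl scale deploy open5gs-upf --replicas=1"]),
    execute=lambda cmd: state.update(upf_replicas=1),
    verify=lambda: state["upf_replicas"] >= 1,
    fallback=lambda: None,
)
print(result)  # success
```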
All figures use the seaborn rocket palette. To regenerate paper figures:
```bash
# Generate figures from suite statistics
python3 visualize_suite.py \
  --stats suite_results/tool-impact-analysis/20260401-144241/suite_statistics.json \
  --suite-dir suite_results/tool-impact-analysis/20260401-144241 \
  --pricing paper/pricing.csv \
  -o paper/figures
```

If you have per-run data but need to regenerate the aggregated files:
```bash
python3 regenerate_stats.py suite_results/tool-impact-analysis/20260401-144241
```

This recreates:
- `suite_statistics.json` — model/scenario aggregated metrics
- `summary.csv` — per-run results table
- `api_error_summary.csv` — API error breakdown
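The aggregation itself is a simple roll-up of per-run rows. A hypothetical sketch in the spirit of `regenerate_stats.py` (field names are illustrative, not the actual schema):

```python
# Sketch: aggregate per-run results into per-model success rates, in the
# spirit of regenerate_stats.py. Row fields are illustrative assumptions.
from collections import defaultdict

runs = [
    {"model": "z-ai/glm-5", "scenario": 1, "use_tools": True,  "success": True},
    {"model": "z-ai/glm-5", "scenario": 1, "use_tools": True,  "success": False},
    {"model": "z-ai/glm-5", "scenario": 1, "use_tools": False, "success": False},
]

totals = defaultdict(lambda: [0, 0])  # (model, use_tools) -> [successes, runs]
for run in runs:
    key = (run["model"], run["use_tools"])
    totals[key][0] += run["success"]
    totals[key][1] += 1

stats = {f"{m} ({'tools' if t else 'no-tools'})": round(100 * s / n, 1)
         for (m, t), (s, n) in totals.items()}
print(stats)
# {'z-ai/glm-5 (tools)': 50.0, 'z-ai/glm-5 (no-tools)': 0.0}
```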
| Metric | Value |
|---|---|
| Overall LLM success rate | 36.0% |
| Average with tools | 70.7% |
| Average without tools | 7.1% |
| Best performing model | Qwen3.5-35b-a3b (93.3% with tools) |
| Best small model | Qwen3.5-35b-a3b — 3B active params, 93.3% success |
Scenario breakdown:
- S1 (NetworkPolicy): 16.0% success — most challenging scenario
- S2 (ConfigMap): 42.0% success — 0% without tools for 4/5 models
- S3 (UPF Scale): 49.3% success — highest overall success rate
Key findings:
- Tool access raises average success from 7.1% to 70.7% (+63.6pp)
- Small models (3B active params) achieve 93.3% with tools
- S2 (ConfigMap): 0% without tools for most models — validates the Validity Gap
- Failure modes: 68% "no_remediation" (API limits), 31% "wrong_diagnosis" when tools unavailable
If you use OperAID, please cite:

```bibtex
@inproceedings{operaid2026,
  title={OperAID: Benchmarking LLM Agents for Autonomous Kubernetes Fault Remediation},
  author={de Castro, Ariel G. and Vandikas, Konstantinos and Ferlin-Reiter, Simone and Chiesa, Marco and Rothenberg, Christian E.},
  booktitle={IEEE NetSoft Trust 6G-Net Workshop},
  year={2026}
}
```