aws-samples/sample-apex-skills

APEX Skills — Agentic Platform Engineering eXperience

Curated platform-engineering skills that compress onboarding from months to weeks. Domain knowledge is authored by senior AWS SSAs, TAMs, and ProServe and delivered through agentic AI tools (Claude Code, Kiro CLI, etc.).

APEX combines agentic AI (frontier models and an agent harness such as Claude Code) with curated "skills" to give engineers SSA-grade platform engineering output.

Agent Skills are organized folders of instructions, scripts, and resources that frontier LLMs can discover and load dynamically to perform specialized tasks. By codifying expert platform engineering knowledge as Agent Skills, we amplify best practices and scale them across teams while reducing toil. They follow the Agent Skills open standard and are compatible with any supported agent harness.


What's in This Repo

sample-apex-skills/
├── skills/       → 📚 Domain knowledge (platform-engineering best practices, Terraform, skill creation)
├── steering/     → 🎯 Guided workflows (optional — structured engagement playbooks)
├── rules/        → 📏 Agent rules (project-level AGENTS.md for consumers)
├── examples/     → 🏗️ Hands-on exercises (deployable labs with planted issues)
└── misc/         → 🔧 Maintenance and tooling
    ├── evals/    → 🧪 5-layer skill evaluation framework (triggering, process, artifact, knowledge, quality)
    └── (scripts) → Sync skills from sources, update cross-references
| Directory | Purpose | Think of it as... |
|---|---|---|
| skills/ | What the agent knows — reusable domain knowledge | An expert's brain |
| steering/ | How the agent runs an engagement — slash commands, questionnaires, checkpoints, routing | A senior SA's playbook |
| rules/ | How the agent behaves — verification, source-checking, guardrails | A safety checklist |
| examples/ | How to try it — deploy, run APEX against it, see results | A workshop lab |
| misc/ | Maintenance tooling and per-skill evaluation inputs | The toolbox |

Key principle: Skills provide the knowledge. Steering provides the structure. Rules provide the guardrails.


Quick Start

NPX Installer (recommended)

Prerequisites: Node.js 18+ and git must be installed.

npx apex-skills

The installer detects Claude Code and/or Kiro CLI, clones the repo to ~/.apex-skills/, and symlinks all skills and steering files into the right locations.

npx apex-skills --update              # Pull latest skills
npx apex-skills --version v1.1.0      # Pin to a specific release
npx apex-skills --branch feat/new-eks # Install from a branch
npx apex-skills --help                # See all options

Manual Install

If you prefer not to use npx, clone the repo and copy skills directly.

Claude Code

git clone https://github.com/aws-samples/sample-apex-skills.git
cd sample-apex-skills

mkdir -p ~/.claude/skills ~/.claude/commands
cp -r skills/* ~/.claude/skills/
ln -sfn "$(pwd)/steering/commands/apex" ~/.claude/commands/apex
ln -sfn "$(pwd)/steering" ~/.claude/apex-steering

Usage: Start a Claude Code session and use slash commands:

  • /apex:eks — hub that auto-routes based on your request
  • /apex:eks-design "Help me design an EKS cluster"
  • /apex:eks-upgrade-check "Is my cluster ready to upgrade to 1.32?"
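
Both ln -sfn commands above link back into the cloned repo, so moving or deleting the clone leaves dangling links and the slash commands stop resolving. The following is a minimal sketch for checking them; check_link is a hypothetical helper written for illustration and is not part of the APEX repo.

```shell
# Sketch only: check_link is a hypothetical helper, not shipped with APEX.
# Reports whether a path is a symlink whose target still exists.
check_link() {
  local link="$1"
  if [ ! -L "$link" ]; then
    echo "missing: $link"
    return 1
  fi
  # -e follows the link, so it fails when the target was moved or deleted.
  if [ ! -e "$link" ]; then
    echo "broken: $link"
    return 1
  fi
  echo "ok: $link"
}
```

For example, check_link ~/.claude/commands/apex and check_link ~/.claude/apex-steering can be run after the install. A "broken" result usually means the clone has moved; re-running the two ln -sfn lines from the clone's new location should repair the links.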

Kiro CLI

git clone https://github.com/aws-samples/sample-apex-skills.git
cd sample-apex-skills

mkdir -p ~/.kiro/skills ~/.kiro/steering
cp -r skills/* ~/.kiro/skills/
cp steering/workflows/*.md ~/.kiro/steering/

Other Agent Harnesses

Skills follow the Agent Skills standard. Each skill lives in skills/{skill-name}/ with a SKILL.md and optional references/, scripts/, and assets/ directories. Clone and point your tool at them — see skills/README.md for the layout.
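
Because every skill is a directory holding a SKILL.md, a harness (or a quick sanity check) can discover skills by looking for that file. The sketch below assumes only the layout described above; list_skills is a hypothetical helper name, not something the repo provides.

```shell
# Sketch only: list_skills is a hypothetical helper, not part of the APEX repo.
# Prints the name of every directory under a skills root that contains a SKILL.md.
list_skills() {
  local root="${1:-skills}"
  local manifest
  for manifest in "$root"/*/SKILL.md; do
    # Skip the unexpanded glob when no skill directories exist.
    [ -f "$manifest" ] || continue
    basename "$(dirname "$manifest")"
  done
}
```

Run from the root of a clone, list_skills skills prints one skill name per line. Directories without a SKILL.md are ignored, which matches the rule that SKILL.md is the only required file.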

Agent Rules (optional)

AGENTS.md contains project-level guardrails — verification habits, source-checking, safety boundaries. These are personal to each user's workflow, so they are not auto-loaded by the installer. To activate them, add the contents to whatever file your agent harness reads for project instructions (e.g., CLAUDE.md, AGENTS.md, .cursorrules, .github/copilot-instructions.md, .kiro/steering/project.md).
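
One way to add the rules without duplicating them on repeated runs is to append them behind a marker line. This is a sketch under stated assumptions: install_rules and the marker text are hypothetical, and the rules/AGENTS.md location in the usage note is inferred from the directory layout above rather than confirmed.

```shell
# Sketch only: install_rules and the marker line are hypothetical.
# Appends a rules file to a project instruction file exactly once.
install_rules() {
  local src="$1" dst="$2"
  local marker="# --- APEX agent rules ---"
  if [ ! -f "$src" ]; then
    echo "rules file not found: $src" >&2
    return 1
  fi
  # Do nothing if an earlier run already appended the rules.
  if [ -f "$dst" ] && grep -qxF -- "$marker" "$dst"; then
    return 0
  fi
  { printf '\n%s\n' "$marker"; cat "$src"; } >> "$dst"
}
```

From the root of a clone, a call such as install_rules rules/AGENTS.md /path/to/project/CLAUDE.md would append the rules to a Claude Code project file. Existing content in the target file is left untouched, and a second call is a no-op.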

Skills Reference

This table is auto-generated by misc/update-skills-references.sh. Do not edit manually.

| Skill | What It Covers |
|---|---|
| eks-best-practices | Advisory guidance for Amazon EKS architecture and configuration decisions — compute strategy, networking, security, reliability, cost, autoscaling, observability, multi-tenancy, and upgrade planning. Also answers Terraform configuration questions about terraform-aws-modules/terraform-aws-eks. Use for any EKS planning or architectural judgment call, even when phrased casually. Do NOT use for generating documents or code (eks-design, eks-build), scoring or auditing a live cluster (eks-operation-review, eks-upgrade-check), discovering what is running (eks-recon), MCP tooling setup (eks-mcp-server), building developer platforms and IDPs (eks-platform-engineering), GenAI/LLM workload decisions — GPU vs Trainium/Inferentia, vLLM/Ray serving, distributed training, ML storage (eks-genai), or compliance-regime hardening and audit prep — HIPAA/PCI/FedRAMP, CIS benchmarks, GuardDuty, image signing (eks-security). |
| eks-build | Use when building EKS clusters. Generates complete, production-ready Terraform projects with optional ArgoCD GitOps integration. Handles environment-specific constraints: air-gapped/VPC-endpoint-only networks, enterprise proxies, private container registries, compliance requirements. Supports 3 patterns: full Terraform, ArgoCD+Terraform, ArgoCD+ACK/KRO. Includes validated modules, two-phase webhook ordering, IRSA/Pod Identity, and 29+ addon configurations. Ask interactive questions or accept requirements YAML. Also use when (1) generating EKS Terraform code from scratch, (2) creating GitOps-managed EKS addons with ArgoCD, (3) scaffolding EKS projects with compliance constraints, (4) implementing two-phase webhook ordering for EKS addons, (5) configuring IRSA or Pod Identity for EKS workloads, (6) generating ArgoCD ApplicationSets for EKS addon management, or (7) comparing deployment patterns for implementation decisions. |
| eks-cost-intelligence | Run a live EKS cluster cost efficiency assessment — analyze spending across 6 dimensions (compute efficiency, Spot/Graviton adoption, networking, storage, observability, idle resources), calculate a weighted 0-100 Cost Score, and generate a prioritized report with dollar-quantified findings and ready-to-apply remediation snippets. Use this skill when someone asks "how much am I wasting on EKS?", "run a cost audit on my cluster", "what's my biggest cost driver?", "analyze my cluster's cost efficiency", or needs dollar-denominated findings for a FinOps review — even if they don't say "cost intelligence" or "score". Combines live Cost Explorer data, CloudWatch utilization metrics, and Kubernetes resource analysis. Falls back to AWS CLI and kubectl when the EKS MCP server is unavailable. Distinct from eks-best-practices (static advisory guidance), eks-operation-review (operational health), and eks-recon (cluster discovery). |
| eks-design | Use when designing EKS architecture. Generates design documents with Mermaid diagrams, ADRs, security architecture, and validation reports. Translates requirements into tailored EKS designs guided by Well-Architected best practices. Covers cluster architecture, compute, networking, security, addons, observability, cost, and upgrade strategy. Also use when reviewing or validating existing EKS architectures, planning networking or security, evaluating deployment models, or generating architecture diagrams. Skip for short advisory recommendations without a formal document (eks-best-practices), Internal Developer Platforms or progressive delivery (eks-platform-engineering), and GenAI/LLM workload design — GPU vs Neuron, vLLM/Ray serving, distributed training (eks-genai). |
| eks-genai | Use whenever someone is building, training, fine-tuning, or serving a generative AI / LLM workload on Amazon EKS — phrased as "GPU vs Trainium/Inferentia", "vLLM on EKS", "Ray Serve / KubeRay", "distributed training on EKS", "FSx for Lustre for ML", "Karpenter for GPU", "EFA / NCCL multi-node", "DCGM / Neuron Monitor", "LiteLLM / AI gateway", "RAG on EKS", "agentic AI on EKS", or "self-host Llama / Mistral / Qwen". Walks the opinionated 6-layer stack (compute → cluster/scheduler → frameworks → storage → observability → AI gateway), the GPU-vs-Neuron decision, the JARK + vLLM + LiteLLM canonical reference, KV-cache tiering, cost levers (Neuron, Spot, Capacity Blocks), and a non-negotiable security baseline. Trigger even if "GenAI" is never said — any GPU/Neuron, inference-serving, or distributed-training decision on EKS qualifies. Skip for SageMaker-only or Bedrock-only (no self-hosting) asks, and for generic cluster design/build with no AI/ML workload (use eks-design / eks-build). |
| eks-ingress-migration | Assess a live EKS cluster's NGINX/Ingress estate and plan migration to Gateway API, the AWS Load Balancer Controller (ALB Ingress), or AWS Transform (ATX). Discovers ingress controllers and routes, scores migration difficulty 0–100 with a separate re-architecture gate, and generates per-cluster reports plus ready-to-apply manifests. Use when someone asks "how hard is it to move off nginx ingress?", "assess my ingress migration", "migrate nginx to ALB or Gateway API", "ingress migration audit", or "nginx ingress retirement plan". Not for upgrade readiness (eks-upgrade-check), operational audits (eks-operation-review), general cluster discovery (eks-recon), or general ingress configuration advice (eks-best-practices). |
| eks-mcp-server | Install, configure, and troubleshoot the EKS MCP Server connection in your AI assistant (Claude Code, Cursor, Kiro). Use ONLY for MCP server setup problems — config file location (.mcp.json), IAM permissions for eks-mcp actions, uvx installation, choosing AWS-hosted vs self-hosted mode, or debugging why MCP tools fail to appear after config. Also activate if user mentions "eks mcp", "mcp server", "mcp.json", or "mcp tools not showing". Do NOT use for actual cluster operations once MCP is working — those go to eks-recon (discovery), eks-operation-review (audits), or eks-upgrade-check (upgrades). |
| eks-operation-review | Run a structured EKS operational excellence assessment against a live cluster. Covers 10 areas — networking, autoscaling, observability, access & identity, add-ons, workload config, deployments, cluster lifecycle, IaC, operational processes — and produces a GREEN/AMBER/RED rated report with prioritized recommendations. Activate for any request to audit, review, health-check, or score an EKS cluster's operational posture, including section-scoped reviews of individual areas. Not for upgrade readiness, cluster discovery, or architectural design advice. |
| eks-platform-engineering | Use whenever someone is designing or building an Internal Developer Platform (IDP) or doing platform engineering on Amazon EKS — phrased as "build a developer platform", "self-service for developers", "golden paths", "IDP", or "set up Backstage / ArgoCD / Kargo". Covers the opinionated platform stack — developer portal (Backstage), GitOps delivery (ArgoCD, Argo Workflows), progressive delivery (Argo Rollouts) and multi-stage promotion (Kargo), infrastructure abstraction (ACK, kro), the developer-facing app abstraction (Backstage templates + kro, or KubeVela/OAM), self-service provisioning, hub-and-spoke topology with the GitOps Bridge, identity/SSO (Keycloak, Pod Identity), measuring success (DORA, Apache DevLake), GenAI-assisted platform engineering (Kiro), and golden paths for AI/ML and data. Trigger even if "platform engineering" is never said. Skip for single-cluster EKS architecture or cost/ops tuning with no platform angle (use eks-best-practices); for standalone Terraform use terraform-skill. |
| eks-recon | EKS cluster reconnaissance and environment discovery. Detects compute strategy (Karpenter, MNG, Auto Mode, Fargate), IaC tooling (Terraform, CloudFormation, CDK, eksctl), CI/CD pipelines (GitHub Actions, GitLab, ArgoCD, Flux), add-on inventory, networking, security posture, and observability. Use this skill whenever someone asks about their EKS cluster, wants to understand their setup, is planning an upgrade or migration, needs cluster context for any reason, asks what version am I running, mentions wanting to review or document their cluster, or is about to make any EKS-related decision - even if they don't explicitly say reconnaissance or discovery. When in doubt about cluster state, run recon first. Skip for upgrade readiness scoring or deprecated API checks (eks-upgrade-check), operational audits with GREEN/AMBER/RED ratings (eks-operation-review), and architecture design documents or Mermaid diagrams (eks-design). |
| eks-security | Use whenever someone needs security or compliance guidance for Amazon EKS — phrased as "CIS Benchmark for EKS", "HIPAA / PCI-DSS / FedRAMP / SOC 2 / GDPR on EKS", "harden my EKS cluster", "Bottlerocket vs AL2023 vs RHEL/Ubuntu AMI", "EKS Pod Identity vs IRSA", "Access Entries vs aws-auth", "GuardDuty for EKS", "Pod Security Admission / Kyverno / OPA", "NetworkPolicy / Security Groups for Pods", "ECR scanning / image signing (Cosign / Notation)", "EKS audit logging", "etcd / secrets encryption", or regulated-workload / audit-prep guidance. Walks the discovery-driven 7-layer security stack (OS/AMI → identity → workload → image → runtime → audit → compliance accelerators), the compliance-regime scope view, the AWS-canonical baseline, and a 30/60/90 hardening roadmap. Trigger even if "compliance" is never said — any EKS hardening, audit-prep, or regulated-workload decision qualifies. Skip for non-EKS (ECS/ROSA), account-level security with no EKS angle, or GenAI-workload security (use eks-genai). |
| eks-upgrade-check | Assess EKS cluster upgrade readiness — run automated checks across 8 areas (version, breaking changes, deprecated APIs, add-on compatibility, node readiness, workload risks, AWS Insights, upgrade plan), calculate a 0-100 readiness score with a hard-blocker override, and generate a markdown/HTML report with prioritized remediation. Use this skill whenever someone asks "can I upgrade my cluster?", "is my cluster ready for 1.32?", "are we good to go to 1.33?", "what is blocking my upgrade?", or "should we move to the next version?" — even if they do not say "readiness" or "score". Falls back to AWS CLI and kubectl when the EKS MCP server is unavailable. |
| skill-creator | Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy. |
| steering-workflow-creator | Author a new steering workflow for any AWS service and pair it with a matching slash-command shim. Use when the user asks to create a steering workflow, add a workflow to apex, standardize steering, write a new workflow for EKS / RDS / Lambda / IAM / any AWS service, or build a phased playbook that plugs into a service hub. Covers the convention (frontmatter, header block, required sections), tool routing (knowledge vs. live MCP vs. setup-bridge), and the lint pass before handoff. |
| terraform-skill | Use when writing, reviewing, or debugging Terraform/OpenTofu modules, tests, CI, scans, or state ops - diagnoses failure mode (identity churn, secrets, blast radius, CI drift, state corruption) with version-aware guards. |
| update-docs | Audit and update every documentation surface in the APEX repo against the current state of skills, steering workflows, README marker tables, and the Docusaurus site under misc/website/. After any change to a skill (rename, retire, add, edit description), walk the repo, re-run script-managed surfaces if their --check fails, and reason through every tracked prose *.md to catch references that need updating. Use after adding/removing/renaming a skill, after editing SKILL.md frontmatter, after editing README marker blocks, or before publishing a docs change. Also use when the user says "update docs", "sync docs", "check docs", "run update-docs", or mentions that documentation might be stale. |

Steering (Optional)

This table is auto-generated by misc/update-steering-references.sh. Do not edit manually.

| Steering File | Description |
|---|---|
| apex | APEX meta hub. Routes contributor requests about the repo itself — adding a new skill, authoring a new steering workflow, and other maintenance actions that are not tied to a specific AWS service. |
| eks | EKS platform engineering hub. Routes to design, build, upgrade-readiness, and operational-review workflows. Use as the entry point for any EKS-related request. |
| design | Day 0 architecture design workflow. 8-phase questionnaire for EKS cluster design, architecture reviews, and option comparisons. |
| eks-build | Day 1 infrastructure build workflow. Multi-phase questionnaire gathering requirements then generating production-ready Terraform code for EKS clusters. |
| eks-genai | Day 1 GenAI-on-EKS workflow. Guides building, training, fine-tuning, and serving generative AI / LLM workloads on Amazon EKS through the opinionated 6-layer stack — hardware (GPU vs Neuron), Karpenter scheduling, vLLM/Ray serving, distributed training, ML storage, GPU/Neuron observability, and the LiteLLM AI gateway — with a non-negotiable security baseline and cost levers. |
| eks-operation-review | Day 2 operational-review workflow. Runs the eks-operation-review skill end-to-end — 10-section structured assessment of a live cluster's operational excellence, with GREEN/AMBER/RED ratings and a markdown/HTML report including prioritized actions and AWS reference links. |
| eks-platform-engineering | Day 1 platform-engineering workflow. Guides building an Internal Developer Platform on EKS — golden paths, developer portal (Backstage), GitOps and progressive delivery, self-service infrastructure (ACK/KRO), tenancy, AI/ML golden paths, and measuring success with DORA. |
| eks-security | Day 1/Day 2 EKS security & compliance workflow. Guides hardening an Amazon EKS cluster and preparing it for a compliance audit through the discovery-driven 7-layer stack — OS/AMI hardening, identity & access, workload security, image supply chain, runtime security, audit logging, and compliance accelerators — with the compliance-regime scope view (HIPAA/PCI/FedRAMP/GDPR/SOC2/ISO), a non-negotiable security baseline, and a 30/60/90 hardening roadmap. |
| eks-upgrade-check | Day 2 upgrade-readiness assessment workflow. Runs the eks-upgrade-check skill end-to-end — 8 automated checks, 0-100 readiness score, markdown/HTML report with remediation steps. |
| new-skill | Meta contributor workflow. Onboards a new skill end-to-end — scope intake, optional skill-creator drafting, sibling-graph survey, repo fan-out diff, and eval scaffold. Bimodal — greenfield authoring or retrofit on an existing skill that skipped the process. |

Slash Commands (Claude Code)

| Command | Description |
|---|---|
| /apex:eks | EKS platform engineering hub. Routes to design or upgrade workflows based on your request. Use for any EKS-related task -- architecture design, cluster upgrades, reviews, comparisons, or general EKS questions. |
| /apex:eks-build | Build a production-ready EKS cluster. Multi-phase questionnaire gathering requirements then generating Terraform code via the eks-build skill. |
| /apex:eks-design | Design a new EKS cluster architecture. 8-phase questionnaire covering compute, networking, security, observability, cost, reliability, and multi-tenancy. Also handles architecture reviews and option comparisons. |
| /apex:eks-genai | Build, train, fine-tune, or serve a generative AI / LLM workload on Amazon EKS — walks the opinionated 6-layer stack (GPU vs Neuron, Karpenter scheduling, vLLM/Ray serving, distributed training, ML storage, GPU/Neuron observability, LiteLLM gateway) with a non-negotiable security baseline and cost levers. Use to design or stand up self-hosted GenAI on EKS. |
| /apex:eks-operation-review | Run a structured EKS operational excellence assessment — 10-section review (cluster lifecycle, IaC/GitOps, access/identity, observability, workload config, networking, autoscaling, deployment practices, ops processes, add-on management) producing a rated report with GREEN/AMBER/RED findings and prioritized actions. Use when someone asks "run an EKS operational review", "audit my cluster", "EKS health check", "review my EKS posture", or asks for a section-scoped review (networking, RBAC, observability, etc.). |
| /apex:eks-platform-engineering | Build an Internal Developer Platform on EKS — golden paths, developer portal (Backstage), GitOps and progressive delivery, self-service infrastructure (ACK/KRO), tenancy, AI/ML golden paths, and DORA-based measurement. Use to design or stand up developer self-service on EKS. |
| /apex:eks-security | Harden an Amazon EKS cluster or prepare it for a compliance audit — walks the discovery-driven 7-layer security stack (OS/AMI, identity, workload, image, runtime, audit, compliance accelerators), the compliance-regime scope (HIPAA/PCI/FedRAMP/GDPR/SOC2/ISO), the AWS-canonical baseline, and a 30/60/90 hardening roadmap. Use to design or harden EKS security for regulated workloads. |
| /apex:eks-upgrade-check | Assess EKS cluster upgrade readiness — automated checks across 8 areas (version, breaking changes, deprecated APIs, add-on compatibility, node readiness, workload risks, AWS Insights, upgrade plan), a 0-100 readiness score, and a markdown/HTML report with prioritized remediation. Use for upgrade-readiness assessments before running an actual upgrade. |
| /apex:new-skill | Onboard a new skill end-to-end — draft it, survey siblings, fan out the repo edits, and scaffold the eval set. Bimodal — greenfield authoring or retrofit on an existing skill. |

Steering files control how the agent runs an engagement — they don't contain domain knowledge (that's in skills), but define the interaction pattern. The hub (eks.md) is the entry point — it detects what the user wants and routes to the appropriate workflow. Each workflow follows a structured sequence with checkpoints and STOP gates. The commands/ directory provides agent-harness-specific entry points (e.g., Claude Code slash commands) that map to the hub and workflows.

The key test: If you removed all steering files, would the agent still know the right answers? Yes — skills provide the knowledge. But the agent wouldn't know how to run the engagement.


Examples

This table is auto-generated by misc/update-examples-references.sh. Do not edit manually.

| Example | Description | Workflow |
|---|---|---|
| EKS Upgrade Readiness Check | Deploy an EKS 1.32 cluster with Karpenter v1.0.2 and planted upgrade issues, then run the APEX EKS upgrade-check skill to produce a scored readiness report showing NOT READY status. | eks-upgrade-check |

Contributing

See CONTRIBUTING.md for guidelines on:

  • Where new content goes (skills vs steering vs examples)
  • How to create a new skill
  • How to create a new steering workflow
  • How to create a new example
  • How to add evals for a new skill

Sources

All best-practices content is sourced from public AWS documentation.


Disclaimer

This repository provides sample code for educational and demonstration purposes only. It is not intended for direct production use without proper review, testing, and validation. Always test generated infrastructure artifacts (Terraform, Helm charts, kubectl commands) in non-production environments first. Use at your own risk — the authors are not responsible for any issues, damages, or losses that may result from using this code in production.


License

This project is licensed under the MIT-0 License. See the LICENSE file.