Add AWS Inferentia2/Neuron support to k8s AI inference by kiryl-filatau · Pull Request #6678 · GoogleCloudPlatform/PerfKitBenchmarker

kiryl-filatau · 2026-05-19T11:32:16Z

Summary

Enables running the kubernetes_ai_inference benchmark on AWS EKS Auto with AWS Inferentia2 (Neuron): PKB provisions the cluster, deploys vLLM via wg-serving, loads weights from S3, and runs the same benchmark flow as GKE + TPU.

Companion PR: kubernetes-sigs/wg-serving #71 (aws-neuron/ overlays). Until it merges: --wg_serving_repo_url=https://github.com/vofish/wg-serving.git, --wg_serving_repo_branch=aws-tpu, catalog_provider=aws-neuron.

Main changes

kubernetes_ai_inference on AWS Neuron: EKS Auto (type=Auto), S3-backed weights, Neuron node pool, wg-serving aws-neuron catalog; log/event parsing for Neuron vLLM startup.
EKS Auto: Optional Mountpoint S3 CSI addon (--eks_install_s3_csi_addon) and Neuron device plugin (--eks_install_neuron_device_plugin). nodePools: [general-purpose, system] so the wg-serving catalog Job can schedule.
S3 weights: --k8s_inference_server_s3_bucket / _region, PV/PVC; model files under /data/models (HF snapshot in S3 before the run).
wg-serving inference server: catalog_provider=aws-neuron tells serving-catalog-cli to render vllm/<model>/aws-neuron/ (from kubernetes-sigs/wg-serving #71 ) instead of the GPU aws/ path. PKB adds a Karpenter NodePool for Inf / Trn instance families so the inference pod runs on a Neuron node. HF token may be a plain hf_… value or an s3://… URI.
Benchmark: Model-load parser refactor; K8s v1.35 "Container started" events; Neuron log patterns. GetEvents() tolerates empty kubectl output.
Tests: wg_serving_inference_server_test, elastic_kubernetes_service_test, kubernetes_ai_inference_benchmark_test.

Models and weights

S3: upload HF weights before the run. Bucket and read access (IAM / Pod Identity for the CSI driver) are prerequisites; in test E2E run they were set up manually, not by PKB.
Neuron compile: first pod start compiles for the chip even when weights are on S3. Pre-compiled Neuron artifacts on NEURON_COMPILED_ARTIFACTS can skip re-compile (not automated in this PR).
Model compatibility: not every HuggingFace checkpoint compiles on Neuron; check architecture and Neuron SDK support before adding a catalog overlay. Example we hit: meta-llama/Llama-3.2-1B (llama32-1b) failed during Neuron compile with NCC_ITEN404 (model uses "llama3" rope_type, Neuron SDK 2.28.0 in the image). Overlay removed from wg-serving kubernetes-sigs/wg-serving #71.
Neuron vLLM image from AWS ECR, not vllm/vllm-openai.
E2E validated: tinyllama-1b only (see below).
Not tested: llama3-8b. Overlay for it exist in kubernetes-sigs/wg-serving #71.

Tested with

tinyllama-1b, weights pre-loaded to S3, us-east-1:

./pkb.py \
  --benchmarks=kubernetes_ai_inference \
  --cloud=AWS \
  --config_override=kubernetes_ai_inference.container_cluster.type='Auto' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.model_server='vllm' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.model_name='tinyllama-1b' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.extra_deployment_args.pvc-name='s3-csi-static-pvc' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.catalog_components='1-Inf2,s3' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.catalog_provider='aws-neuron' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.deployment_timeout=3600 \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.extra_deployment_args.model-path='vllm_models/tinyllama-1b' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.extra_deployment_args.container-image='public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.13.0-neuronx-py312-sdk2.28.0-ubuntu24.04' \
  --k8s_ai_inference_hf_token=<hf_token> \
  --k8s_ai_inference_request_rate=5 \
  --k8s_inference_server_image_repo=<image_repo> \
  --metadata=cloud:AWS \
  --zone=us-east-1a \
  --k8s_inference_server_s3_bucket=<bucket> \
  --k8s_inference_server_s3_region=us-east-1 \
  --eks_install_s3_csi_addon=true \
  --eks_install_neuron_device_plugin=true \
  --wg_serving_repo_url=https://github.com/vofish/wg-serving.git \
  --wg_serving_repo_branch=aws-neuron

hubatish · 2026-05-22T21:39:45Z

-      storage_service.PrepareService()
+      # Pick the storage backend by URI scheme so existing gs:// usage is
+      # untouched while s3:// URIs work for AWS-only runs.
+      storage_cloud = (


I thought this part was just for getting the huggingface token? / can we not just always pull the huggingface token from GCS, or does that effect other usages of the storage class somewhere?

@hubatish

Yes, this block is only for the HuggingFace token in _InjectDefaultHuggingfaceToken.

If the flag value starts with hf_, we skip object storage and put the token straight into the K8s secret.

If it is a URI, we download the file once with Copy and then create the secret. Model weights are not loaded here (they use S3 CSI / GCS via other flags and the wg-serving catalog).

Before this PR we always used the GCP storage class (for gs:// tokens). On AWS runs the token can be in S3, so I pick AWS vs GCP from the URI scheme. I did not want to always use GCS because that needs GCP credentials even when the benchmark runs only on AWS.

This does not change how the storage class is used elsewhere in PKB. We create a new service instance only inside this function.

true, it does not, it just adds support. All right, sounds reasonable. Maybe I should add this to object_storage_service logic.

hubatish · 2026-05-22T21:45:04Z

+        # parameterised. Only the values differ for Neuron (instance family,
+        # taint, NodePool name).
+        kubernetes_commands.ApplyManifest(
+            'container/kubernetes_ai_inference/aws-gpu-nodepool.yaml.j2',


Might be a follow up, but in #6687 I started moving us away from "configure the gpu nodepool in one-off benchmarks" & to "configure the gpu nodepool during _Create/_PostCreate phases based off spec values". eg using nodepools & --config_override=kubernetes_scale.container_cluster.nodepools.pool1.vm_spec.AWS.machine_type='g6.2xlarge' in addition to a default nodepool.

Good point, and I saw the direction in #6687.

Right now _ProvisionGPUNodePool() runs from the inference server _Create() and applies aws-gpu-nodepool.yaml.j2. That was already the pattern for AWS GPU in this benchmark; I reused the same template for Neuron (different instance family + aws.amazon.com/neuron taint).

For EKS Auto we only get general-purpose and system from the cluster setup. The wg-serving pod still needs a pool that can schedule on Inf/Trn with the Neuron taint, so I kept this extra NodePool here for now.

I agree it would be cleaner to move this to cluster _Create / _PostCreate and spec nodepools like in #6687. However, for now I propose to leave that as a follow-up.

hubatish · 2026-05-22T21:46:14Z

    if 'gcsfuse' in self.spec.catalog_components:
      self._ApplyGCSFusePVC()
+    elif 's3' in self.spec.catalog_components:
+      self._ApplyS3PVC()


same comments as for Vlodymyr's #6654 (review)

Agreed to change after #6654 is merged

hubatish · 2026-05-22T21:47:43Z

 from perfkitbenchmarker.resources.container_service import kubectl
+# Imported to register --k8s_inference_server_s3_bucket so flagsaver can set
+# it from this test file.
+from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server  # pylint: disable=unused-import


move the flags over to aws/flags; I don't think you need them in wg_serving_inference_server (esp if you also move the ApplyFusePVC function)

moved to aws/

hubatish · 2026-05-22T21:50:28Z

+
+  def _InstallNeuronDevicePlugin(self):
+    """Applies the AWS Neuron Device Plugin DaemonSet to the cluster."""
+    # Non-empty kwargs force Jinja render (see ReadAndRenderJinja2Template); image


Why does forcing a jinja render matter? What happens if you don't?

Good question. In PKB, ReadAndRenderJinja2Template only runs Jinja when kwargs is non-empty. If we call ApplyManifest with no kwargs, the file is applied as raw text and the DaemonSet still has the literal {{ neuron_device_plugin_image | ... }} in the image field, so apply fails.

I pass neuron_device_plugin_image=... to trigger rendering. The value matches the template default for now. I do not think it is worth changing vm_util behavior in this PR.

I updated the comment in code to explain this.

hubatish · 2026-05-22T21:56:20Z

+  def _InstallS3CsiAddon(self):
+    """Installs the S3 CSI Driver and the IAM glue (Role/Policy + PIA)."""
+    # Local import: the inference-server module owns the bucket flag.
+    from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server  # pylint: disable=g-import-not-at-top


see other comments about moving flags here

moved to aws/

hubatish · 2026-05-22T22:06:27Z

+      vllm_start_timestamp = _ParseInferenceServerTimeStamp(line)
+      break
+  if container_init_timestamp is None:
+    raise ValueError('Container init timestamp is not found in the logs.')


You talked about sharing some code. It doesn't look too bad but some ideas:

Maybe have a shared _ParseModelLoadTimeMetrics which does some helper pieces. Rename _ParseModelLoadTimeMetrics below to _ParseRegularModelLoadTimeMetrics or add Shared/Common to the shared one somewhere.

This "if _ is None, raise, otherwise return X-Y" piece seems like it could be shared pretty easily. Have each of the individual functions return Tuple[float | None, float | None, float | None, float | None] & have the Shared/Common one the validation + subtraction & return Tuple[float,float,float,float].

The shared one could also call cluster.GetEvents & result_stdout.splitlines() while the individual ones could take those variables preprocessed.

A fancier refactor would possibly make 4 classes - a base class & 3 implementations. Then eg "FindStartupEvent" could be one function taking events while "find other timestamps" could be a second function taking logs.

@hubatish thanks for the suggestions.

I already moved some shared logic out:

_GetContainerInitTimestamp: when the inference container started (from K8s events, same for all backends).

_ComputeModelLoadDurations: takes 4 times (container start + 3 from logs: model load start, load end, vLLM ready) and returns the 3 numbers we publish as metrics (time until load starts, load duration, time until API server is up).

The three parsers only scan logs for different strings (GPU vs TPU vs Neuron). Then they call _ComputeModelLoadDurations.

I could still do the other ideas (rename default parser, one shared for line in logs loop). Do you think we need that for this PR, or is the current split enough?

About 4 classes: I would prefer not to overcomplicate this. We have 3 backends and the difference is mostly which log substring to match. Classes would mean more code than we need for something that is already clear in one function.

…in wg-serving inference server

…ol cases

hubatish · 2026-05-26T15:15:35Z

Looks like the test failure is in tests/providers/gcp/gce_managed_instance_group_test.py . If you can't see a relationship between your code & the failure try repulling/syncing & running through checks again

hubatish · 2026-05-28T15:18:36Z

-      storage_service.PrepareService()
+      # Pick the storage backend by URI scheme so existing gs:// usage is
+      # untouched while s3:// URIs work for AWS-only runs.
+      storage_cloud = (


true, it does not, it just adds support. All right, sounds reasonable. Maybe I should add this to object_storage_service logic.

hubatish · 2026-05-28T15:19:56Z

-  )
-  application_start_time = GetTimeDifference(
-      model_load_end_timestamp, vllm_start_timestamp
+  return _ComputeModelLoadDurations(


ty for making this refactor

pyink reformated

d172801

kiryl-filatau changed the title ~~Add AWS Inferentia2/Neuron support to k8s AI inference: EKS S3 CSI addon, Neuron device plugin installation, and Neuron-specific vLLM model load time metric parsing~~ Add AWS Inferentia2/Neuron support to k8s AI inference May 19, 2026

Cleaned up Neuron device plugin and S3 PVC manifests

aeacc80

hubatish reviewed May 22, 2026

View reviewed changes

kiryl-filatau added 5 commits May 23, 2026 16:28

pyink reformated

a40e3bd

Extend Neuron chip detection and refactor AWS node pool provisioning …

5dbf87c

…in wg-serving inference server

Refactor wg-serving inference server tests to deduplicate AWS node po…

576f606

…ol cases

Merge branch 'master' into aws-tpu

8eda18d

Handle empty kubectl events output in GetEvents.

d8677a2

kiryl-filatau mentioned this pull request May 24, 2026

Add AWS Inferentia2/Neuron support to k8s AI inference kubernetes-sigs/wg-serving#70

Closed

Move S3 inference config to aws/flags and document Neuron Jinja kwargs

3c6c22b

kiryl-filatau marked this pull request as ready for review May 26, 2026 16:10

Merge branch 'master' into aws-tpu

e08bd3d

kiryl-filatau mentioned this pull request May 27, 2026

Add AWS Inferentia2/Neuron support to k8s AI inference kubernetes-sigs/wg-serving#71

Open

remove temp local fix

33e3109

hubatish approved these changes May 28, 2026

View reviewed changes

hubatish self-assigned this May 28, 2026

hubatish added the ready to pull label May 28, 2026

Conversation

kiryl-filatau commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Main changes

Models and weights

Tested with

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kiryl-filatau May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hubatish commented May 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kiryl-filatau commented May 19, 2026 •

edited

Loading

kiryl-filatau May 26, 2026 •

edited

Loading