Add AWS Inferentia2/Neuron support to k8s AI inference#6678
Add AWS Inferentia2/Neuron support to k8s AI inference#6678kiryl-filatau wants to merge 10 commits into
Conversation
| storage_service.PrepareService() | ||
| # Pick the storage backend by URI scheme so existing gs:// usage is | ||
| # untouched while s3:// URIs work for AWS-only runs. | ||
| storage_cloud = ( |
There was a problem hiding this comment.
I thought this part was just for getting the huggingface token? / can we not just always pull the huggingface token from GCS, or does that effect other usages of the storage class somewhere?
There was a problem hiding this comment.
Yes, this block is only for the HuggingFace token in _InjectDefaultHuggingfaceToken.
If the flag value starts with hf_, we skip object storage and put the token straight into the K8s secret.
If it is a URI, we download the file once with Copy and then create the secret. Model weights are not loaded here (they use S3 CSI / GCS via other flags and the wg-serving catalog).
Before this PR we always used the GCP storage class (for gs:// tokens). On AWS runs the token can be in S3, so I pick AWS vs GCP from the URI scheme. I did not want to always use GCS because that needs GCP credentials even when the benchmark runs only on AWS.
This does not change how the storage class is used elsewhere in PKB. We create a new service instance only inside this function.
There was a problem hiding this comment.
true, it does not, it just adds support. All right, sounds reasonable. Maybe I should add this to object_storage_service logic.
| # parameterised. Only the values differ for Neuron (instance family, | ||
| # taint, NodePool name). | ||
| kubernetes_commands.ApplyManifest( | ||
| 'container/kubernetes_ai_inference/aws-gpu-nodepool.yaml.j2', |
There was a problem hiding this comment.
Might be a follow up, but in #6687 I started moving us away from "configure the gpu nodepool in one-off benchmarks" & to "configure the gpu nodepool during _Create/_PostCreate phases based off spec values". eg using nodepools & --config_override=kubernetes_scale.container_cluster.nodepools.pool1.vm_spec.AWS.machine_type='g6.2xlarge' in addition to a default nodepool.
There was a problem hiding this comment.
Good point, and I saw the direction in #6687.
Right now _ProvisionGPUNodePool() runs from the inference server _Create() and applies aws-gpu-nodepool.yaml.j2. That was already the pattern for AWS GPU in this benchmark; I reused the same template for Neuron (different instance family + aws.amazon.com/neuron taint).
For EKS Auto we only get general-purpose and system from the cluster setup. The wg-serving pod still needs a pool that can schedule on Inf/Trn with the Neuron taint, so I kept this extra NodePool here for now.
I agree it would be cleaner to move this to cluster _Create / _PostCreate and spec nodepools like in #6687. However, for now I propose to leave that as a follow-up.
| if 'gcsfuse' in self.spec.catalog_components: | ||
| self._ApplyGCSFusePVC() | ||
| elif 's3' in self.spec.catalog_components: | ||
| self._ApplyS3PVC() |
There was a problem hiding this comment.
same comments as for Vlodymyr's #6654 (review)
| from perfkitbenchmarker.resources.container_service import kubectl | ||
| # Imported to register --k8s_inference_server_s3_bucket so flagsaver can set | ||
| # it from this test file. | ||
| from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server # pylint: disable=unused-import |
There was a problem hiding this comment.
move the flags over to aws/flags; I don't think you need them in wg_serving_inference_server (esp if you also move the ApplyFusePVC function)
There was a problem hiding this comment.
moved to aws/
|
|
||
| def _InstallNeuronDevicePlugin(self): | ||
| """Applies the AWS Neuron Device Plugin DaemonSet to the cluster.""" | ||
| # Non-empty kwargs force Jinja render (see ReadAndRenderJinja2Template); image |
There was a problem hiding this comment.
Why does forcing a jinja render matter? What happens if you don't?
There was a problem hiding this comment.
Good question. In PKB, ReadAndRenderJinja2Template only runs Jinja when kwargs is non-empty. If we call ApplyManifest with no kwargs, the file is applied as raw text and the DaemonSet still has the literal {{ neuron_device_plugin_image | ... }} in the image field, so apply fails.
I pass neuron_device_plugin_image=... to trigger rendering. The value matches the template default for now. I do not think it is worth changing vm_util behavior in this PR.
I updated the comment in code to explain this.
| def _InstallS3CsiAddon(self): | ||
| """Installs the S3 CSI Driver and the IAM glue (Role/Policy + PIA).""" | ||
| # Local import: the inference-server module owns the bucket flag. | ||
| from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server # pylint: disable=g-import-not-at-top |
There was a problem hiding this comment.
see other comments about moving flags here
There was a problem hiding this comment.
moved to aws/
| vllm_start_timestamp = _ParseInferenceServerTimeStamp(line) | ||
| break | ||
| if container_init_timestamp is None: | ||
| raise ValueError('Container init timestamp is not found in the logs.') |
There was a problem hiding this comment.
You talked about sharing some code. It doesn't look too bad but some ideas:
- Maybe have a shared _ParseModelLoadTimeMetrics which does some helper pieces. Rename _ParseModelLoadTimeMetrics below to _ParseRegularModelLoadTimeMetrics or add Shared/Common to the shared one somewhere.
- This "if _ is None, raise, otherwise return X-Y" piece seems like it could be shared pretty easily. Have each of the individual functions return Tuple[float | None, float | None, float | None, float | None] & have the Shared/Common one the validation + subtraction & return Tuple[float,float,float,float].
- The shared one could also call cluster.GetEvents & result_stdout.splitlines() while the individual ones could take those variables preprocessed.
- A fancier refactor would possibly make 4 classes - a base class & 3 implementations. Then eg "FindStartupEvent" could be one function taking events while "find other timestamps" could be a second function taking logs.
There was a problem hiding this comment.
@hubatish thanks for the suggestions.
I already moved some shared logic out:
- _GetContainerInitTimestamp: when the inference container started (from K8s events, same for all backends).
- _ComputeModelLoadDurations: takes 4 times (container start + 3 from logs: model load start, load end, vLLM ready) and returns the 3 numbers we publish as metrics (time until load starts, load duration, time until API server is up).
The three parsers only scan logs for different strings (GPU vs TPU vs Neuron). Then they call _ComputeModelLoadDurations.
I could still do the other ideas (rename default parser, one shared for line in logs loop). Do you think we need that for this PR, or is the current split enough?
About 4 classes: I would prefer not to overcomplicate this. We have 3 backends and the difference is mostly which log substring to match. Classes would mean more code than we need for something that is already clear in one function.
…in wg-serving inference server
|
Looks like the test failure is in tests/providers/gcp/gce_managed_instance_group_test.py . If you can't see a relationship between your code & the failure try repulling/syncing & running through checks again |
| storage_service.PrepareService() | ||
| # Pick the storage backend by URI scheme so existing gs:// usage is | ||
| # untouched while s3:// URIs work for AWS-only runs. | ||
| storage_cloud = ( |
There was a problem hiding this comment.
true, it does not, it just adds support. All right, sounds reasonable. Maybe I should add this to object_storage_service logic.
| ) | ||
| application_start_time = GetTimeDifference( | ||
| model_load_end_timestamp, vllm_start_timestamp | ||
| return _ComputeModelLoadDurations( |
There was a problem hiding this comment.
ty for making this refactor
Summary
Enables running the
kubernetes_ai_inferencebenchmark on AWS EKS Auto with AWS Inferentia2 (Neuron): PKB provisions the cluster, deploys vLLM via wg-serving, loads weights from S3, and runs the same benchmark flow as GKE + TPU.Companion PR: kubernetes-sigs/wg-serving #71 (
aws-neuron/overlays). Until it merges:--wg_serving_repo_url=https://github.com/vofish/wg-serving.git,--wg_serving_repo_branch=aws-tpu,catalog_provider=aws-neuron.Main changes
kubernetes_ai_inferenceon AWS Neuron: EKS Auto (type=Auto), S3-backed weights, Neuron node pool, wg-servingaws-neuroncatalog; log/event parsing for Neuron vLLM startup.EKS Auto: Optional Mountpoint S3 CSI addon (
--eks_install_s3_csi_addon) and Neuron device plugin (--eks_install_neuron_device_plugin).nodePools: [general-purpose, system]so the wg-serving catalog Job can schedule.S3 weights:
--k8s_inference_server_s3_bucket/_region, PV/PVC; model files under/data/models(HF snapshot in S3 before the run).wg-serving inference server:
catalog_provider=aws-neurontellsserving-catalog-clito rendervllm/<model>/aws-neuron/(from kubernetes-sigs/wg-serving #71 ) instead of the GPUaws/path. PKB adds a Karpenter NodePool for Inf / Trn instance families so the inference pod runs on a Neuron node. HF token may be a plainhf_…value or ans3://…URI.Benchmark: Model-load parser refactor; K8s v1.35
"Container started"events; Neuron log patterns.GetEvents()tolerates empty kubectl output.Tests:
wg_serving_inference_server_test,elastic_kubernetes_service_test,kubernetes_ai_inference_benchmark_test.Models and weights
NEURON_COMPILED_ARTIFACTScan skip re-compile (not automated in this PR).meta-llama/Llama-3.2-1B(llama32-1b) failed during Neuron compile withNCC_ITEN404(model uses"llama3"rope_type, Neuron SDK 2.28.0 in the image). Overlay removed from wg-serving kubernetes-sigs/wg-serving #71.vllm/vllm-openai.tinyllama-1bonly (see below).llama3-8b. Overlay for it exist in kubernetes-sigs/wg-serving #71.Tested with
tinyllama-1b, weights pre-loaded to S3,us-east-1: