diff --git a/README.md b/README.md index e39cb944..b45d7310 100644 --- a/README.md +++ b/README.md @@ -1,25 +1,27 @@ +<<<<<<< HEAD # Intel® AI for Enterprise Inference Unleash the power of AI Inference on Intel Silicon The Intel® AI for Enterprise Inference is aimed to streamline and enhance the deployment and management of AI inference services on Intel hardware. Utilizing the power of Kubernetes Orchestration, this solution automates the deployment of LLM models to run faster inference, provision compute resources, and configure the optimal settings to minimize the complexities and reduce manual efforts. -It supports a broad range of Intel hardware platforms, including Intel® Xeon® Scalable processors and Intel® Gaudi® AI Accelerators, ensuring flexibility and scalability to meet diverse enterprise needs. +It supports a broad range of Intel hardware platforms, including Intel® Xeon® Scalable processors, Intel® Gaudi® AI Accelerators, and **Intel® Arc™ Battlemage (BMG) GPUs**, ensuring flexibility and scalability to meet diverse enterprise needs. -Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI standard APIs, enabling seamless integration to enterprise applications both on-premises and in cloud-native environments. This compatibility allows businesses to leverage the full capabilities of Intel hardware while deploying AI models with ease. With this suite, enterprises can efficiently configure and evolve their AI infrastructure, adapting to new models and growing demands effortlessly. +Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI standard APIs, enabling seamless integration to enterprise applications both on-premises and in cloud-native environments. This compatibility allows businesses to leverage the full capabilities of Intel hardware while deploying AI models with ease. 
With this suite, enterprises can efficiently configure and evolve their AI infrastructure, adapting to new models and growing demands effortlessly. ![Intel AI for Enterprise Inference](docs/pictures/Enterprise-Inference-Architecture.png) #### Key Components: - **Kubernetes**: A powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications, ensuring high availability and efficient resource utilization. - **Intel Gaudi Base Operator**: A specialized operator that manages the lifecycle of Habana AI resources within the Kubernetes cluster, enabling efficient utilization of Intel® Gaudi® hardware for AI workloads. (Applicable only to Gaudi based deployments) + - **Intel GPU Plugin**: A Kubernetes device plugin that manages Intel® Arc™ GPU resources within the cluster, enabling efficient utilization of Intel® Arc™ Battlemage (BMG) hardware for AI workloads. (Applicable only to BMG based deployments) - **Ingress NGINX Controller**: A high-performance reverse proxy and load balancer for traffic, responsible for routing incoming requests to the appropriate services within the Kubernetes cluster, ensuring seamless access to deployed AI models. - **Keycloak**: An open-source identity and access management solution that provides robust authentication and authorization capabilities, ensuring secure access to AI services and resources within the cluster. - **APISIX**: A cloud-native API gateway, handling API traffic and providing advanced features caching, and authentication, enabling efficient and secure access to deployed AI models. - **Observability**: An open-source monitoring solution designed to operate natively within Kubernetes clusters, providing comprehensive visibility into the performance, health, and resource utilization of deployed applications and cluster components through metrics, visualization, and alerting capabilities. 
- **Model Deployments**: Automated deployment and management of AI LLM models within the Kubernetes inference cluster, enabling scalable and reliable AI inference capabilities. - **GenAI Gateway**: An integrated gateway leveraging LiteLLM and Langfuse to provide flexible interfaces for routing and managing generative AI models. It enables user and key management, user token telemetry, and analytics for LLM inference workflows. - + ## Table of Contents - [Usage](#usage) - [Support](#support) @@ -32,6 +34,8 @@ Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI The Usage instructions for the AI Inference as a Service Deployment Automation can be found in the [docs/README.md](docs/README.md) file. To setup, follow the step-by-step instructions provided in the `docs/README.md` file. +For Intel® Arc™ Battlemage (BMG) GPU setup, refer to [docs/intel-arc-bmg-setup.md](docs/intel-arc-bmg-setup.md). + ## Support For feature requests, bugs or questions about the project, [open an issue](https://github.com/opea-project/Enterprise-Inference/issues) on the GitHub Issues page. Provide as much details as possible, including steps to reproduce the issue, expected behavior, and actual behavior. @@ -42,7 +46,7 @@ Intel® AI for Enterprise Inference is licensed under the [Apache License Versio The [Security Policy](SECURITY.md) outlines our guidelines and procedures for ensuring the highest level of security and trust for our users who consume Intel® AI for Enterprise Inference. ## Trademark Information -Intel, the Intel logo, Xeon, and Gaudi are trademarks of Intel Corporation or its subsidiaries. +Intel, the Intel logo, Xeon, Gaudi, and Arc are trademarks of Intel Corporation or its subsidiaries. * Other names and brands may be claimed as the property of others. 
© Intel Corporation diff --git a/core/helm-charts/vllm/bmg-values.yaml b/core/helm-charts/vllm/bmg-values.yaml new file mode 100644 index 00000000..dfd12cb5 --- /dev/null +++ b/core/helm-charts/vllm/bmg-values.yaml @@ -0,0 +1,194 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +# Intel® Arc™ Battlemage (BMG) GPU optimized override values for vLLM deployments. +# This file contains BMG-specific overrides for Intel Arc B-series GPU (e.g., B580, B770). +# Requires the Intel GPU Plugin (intel-device-plugins-gpu) to be installed on the cluster. + +# Intel XPU accelerator device (Arc GPU) +accelDevice: "xpu" +# Kubernetes resource name exposed by the Intel GPU device plugin +accelDeviceResource: "gpu.intel.com/xe" + +block_size: 64 # XPU-optimised KV cache block size (must be >= 64 for 0.14.1-xpu IPEX chunked prefill) +max_num_seqs: 128 # Max concurrent sequences (tuned for Arc B-series VRAM) +max_seq_len_to_capture: 2048 +d_type: "float16" +max_model_len: 8192 + +image: + repository: intel/vllm + tag: "0.14.1-xpu" + pullPolicy: IfNotPresent + command: ["vllm", "serve"] + +# intel/vllm:0.14.1-xpu runs as root (no user defined in image) +podSecurityContext: + fsGroup: 0 + runAsUser: 0 + +securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + add: + - SYS_NICE + readOnlyRootFilesystem: false + runAsNonRoot: false + runAsUser: 0 + +# Node affinity for BMG inference nodes +affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: ei-inference-eligible + operator: In + values: ["true"] + +# Intel XPU runtime settings +VLLM_NO_USAGE_STATS: 1 +DO_NOT_TRACK: 1 + +# vLLM device backend - set via env var in 0.14.1-xpu (VLLM_TARGET_DEVICE=xpu is already baked in) +VLLM_WORKER_MULTIPROC_METHOD: "spawn" + +LLM_MODEL_ID: "Qwen/Qwen2.5-Coder-3B-Instruct" + +modelConfigs: + + "meta-llama/Llama-3.1-8B-Instruct": + configMapValues: + 
VLLM_NO_USAGE_STATS: "1" + DO_NOT_TRACK: "1" + VLLM_WORKER_MULTIPROC_METHOD: "spawn" + HF_HUB_DISABLE_XET: "1" + extraCmdArgs: + [ + "--dtype", "float16", + "--block-size", "64", + "--max-model-len", "8192", + "--gpu-memory-utilization", "0.90", + "--max-num-seqs", "128", + "--enforce-eager", + "--enable-auto-tool-choice", + "--tool-call-parser", "llama3_json", + ] + tensor_parallel_size: "1" + pipeline_parallel_size: "1" + + "mistralai/Mistral-7B-Instruct-v0.3": + configMapValues: + VLLM_NO_USAGE_STATS: "1" + DO_NOT_TRACK: "1" + VLLM_WORKER_MULTIPROC_METHOD: "spawn" + HF_HUB_DISABLE_XET: "1" + extraCmdArgs: + [ + "--dtype", "float16", + "--block-size", "64", + "--max-model-len", "8192", + "--gpu-memory-utilization", "0.90", + "--max-num-seqs", "128", + "--enforce-eager", + "--enable-auto-tool-choice", + "--tool-call-parser", "mistral", + ] + tensor_parallel_size: "1" + pipeline_parallel_size: "1" + + "deepseek-ai/DeepSeek-R1-Distill-Llama-8B": + configMapValues: + VLLM_NO_USAGE_STATS: "1" + DO_NOT_TRACK: "1" + VLLM_WORKER_MULTIPROC_METHOD: "spawn" + HF_HUB_DISABLE_XET: "1" + extraCmdArgs: + [ + "--dtype", "float16", + "--block-size", "64", + "--max-model-len", "8192", + "--gpu-memory-utilization", "0.90", + "--max-num-seqs", "128", + "--enforce-eager", + ] + tensor_parallel_size: "1" + pipeline_parallel_size: "1" + + "Qwen/Qwen2.5-7B-Instruct": + configMapValues: + VLLM_NO_USAGE_STATS: "1" + DO_NOT_TRACK: "1" + VLLM_WORKER_MULTIPROC_METHOD: "spawn" + HF_HUB_DISABLE_XET: "1" + extraCmdArgs: + [ + "--dtype", "float16", + "--block-size", "64", + "--max-model-len", "8192", + "--gpu-memory-utilization", "0.90", + "--max-num-seqs", "128", + "--enforce-eager", + "--enable-auto-tool-choice", + "--tool-call-parser", "hermes", + ] + tensor_parallel_size: "1" + pipeline_parallel_size: "1" + + "Qwen/Qwen2.5-Coder-3B-Instruct": + configMapValues: + VLLM_NO_USAGE_STATS: "1" + DO_NOT_TRACK: "1" + VLLM_WORKER_MULTIPROC_METHOD: "spawn" + HF_HUB_DISABLE_XET: "1" + extraCmdArgs: + [ 
+ "--dtype", "float16", + "--block-size", "64", + "--max-model-len", "8192", + "--gpu-memory-utilization", "0.90", + "--max-num-seqs", "128", + "--enforce-eager", + "--enable-auto-tool-choice", + "--tool-call-parser", "hermes", + ] + tensor_parallel_size: "1" + pipeline_parallel_size: "1" + + "tiiuae/Falcon3-7B-Instruct": + configMapValues: + VLLM_NO_USAGE_STATS: "1" + DO_NOT_TRACK: "1" + VLLM_WORKER_MULTIPROC_METHOD: "spawn" + HF_HUB_DISABLE_XET: "1" + extraCmdArgs: + [ + "--dtype", "float16", + "--block-size", "64", + "--max-model-len", "8192", + "--gpu-memory-utilization", "0.90", + "--max-num-seqs", "128", + "--enforce-eager", + ] + tensor_parallel_size: "1" + pipeline_parallel_size: "1" + +defaultModelConfigs: + configMapValues: + VLLM_NO_USAGE_STATS: "1" + DO_NOT_TRACK: "1" + VLLM_WORKER_MULTIPROC_METHOD: "spawn" + HF_HUB_DISABLE_XET: "1" + extraCmdArgs: + [ + "--dtype", "float16", + "--block-size", "16", + "--max-model-len", "8192", + "--gpu-memory-utilization", "0.90", + "--max-num-seqs", "128", + "--enforce-eager", + ] + tensor_parallel_size: "{{ .Values.tensor_parallel_size }}" + pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}" diff --git a/core/helm-charts/vllm/templates/deployment.yaml b/core/helm-charts/vllm/templates/deployment.yaml index a6cf21eb..825a5d2d 100644 --- a/core/helm-charts/vllm/templates/deployment.yaml +++ b/core/helm-charts/vllm/templates/deployment.yaml @@ -126,11 +126,12 @@ spec: {{- if .Values.image.pullPolicy }} imagePullPolicy: {{ .Values.image.pullPolicy }} {{- end }} - # command: - # - /bin/bash - # - -c - # - | - # python3 -m vllm.entrypoints.openai.api_server --dtype {{ .Values.d_type }} --model {{ .Values.LLM_MODEL_ID }} --port {{ .Values.port }} --tensor-parallel-size {{ .Values.tensor_parallel_size }} --block-size {{ .Values.block_size }} --max-model-len {{ .Values.max_model_len }} --disable-log-requests + {{- if .Values.image.command }} + command: + {{- range .Values.image.command }} + - {{ . 
| quote }} + {{- end }} + {{- end }} args: {{- $modelConfig := (index .Values.modelConfigs $modelName | default dict) }} {{- $modelArgs := $modelConfig.extraCmdArgs | default .Values.defaultModelConfigs.extraCmdArgs }} @@ -195,7 +196,7 @@ spec: {{- end }} {{- else }} limits: - habana.ai/gaudi: {{ .Values.tensor_parallel_size | default (index .Values.modelConfigs .Values.LLM_MODEL_ID | default dict).tensor_parallel_size | default .Values.defaultModelConfigs.tensor_parallel_size | quote}} + {{ .Values.accelDeviceResource | default "habana.ai/gaudi" }}: {{ .Values.tensor_parallel_size | default (index .Values.modelConfigs .Values.LLM_MODEL_ID | default dict).tensor_parallel_size | default .Values.defaultModelConfigs.tensor_parallel_size | quote}} {{- end }} {{- end }} diff --git a/core/helm-charts/vllm/values.yaml b/core/helm-charts/vllm/values.yaml index 0a3d5df1..f79d595b 100644 --- a/core/helm-charts/vllm/values.yaml +++ b/core/helm-charts/vllm/values.yaml @@ -17,6 +17,8 @@ autoscaling: # empty for CPU (longer latencies are tolerated before HPA scaling unaccelerated service) accelDevice: "" +# Kubernetes resource name for the accelerator (e.g. habana.ai/gaudi, gpu.intel.com/xe) +accelDeviceResource: "" port: 2080 shmSize: 1Gi diff --git a/core/inference-stack-deploy.sh b/core/inference-stack-deploy.sh old mode 100644 new mode 100755 index 148ae5b8..edaf5596 --- a/core/inference-stack-deploy.sh +++ b/core/inference-stack-deploy.sh @@ -50,7 +50,7 @@ NC=$(tput sgr0) # --keycloak-admin-password : The Keycloak admin password. # --hugging-face-token : The token for Huggingface. # --models : The models to deploy (comma-separated list of model numbers or names). -# --cpu-or-gpu : Specify whether to run on CPU or GPU. +# --device : Specify the target device. 'cpu' for Xeon, 'hpu' for Gaudi GPU, 'xpu' for Intel Arc Battlemage GPU. 
# Main Menu @@ -86,7 +86,7 @@ NC=$(tput sgr0) # Example # To perform a fresh installation with specific parameters, you can run: -# ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/to/cert.pem" --key-file "/path/to/key.pem" --keycloak-client-id "my-client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --cpu-or-gpu "g" +# ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/to/cert.pem" --key-file "/path/to/key.pem" --keycloak-client-id "my-client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --device "hpu" ############################################################################## @@ -118,6 +118,7 @@ source "$SCRIPT_DIR/lib/cluster/drv-fw-update.sh" # Components deployment source "$SCRIPT_DIR/lib/components/kubernetes-setup.sh" source "$SCRIPT_DIR/lib/components/intel-base-operator.sh" +source "$SCRIPT_DIR/lib/components/intel-gpu-plugin.sh" source "$SCRIPT_DIR/lib/components/ingress-controller.sh" source "$SCRIPT_DIR/lib/components/keycloak-controller.sh" source "$SCRIPT_DIR/lib/components/genai-gateway-controller.sh" @@ -166,10 +167,11 @@ Options: --keycloak-admin-password Keycloak admin password. --hugging-face-token Huggingface token. --models Models to deploy (comma-separated). - --cpu-or-gpu Run on CPU (c) or GPU (g). + --device Target device: cpu (Xeon), hpu (Gaudi GPU), xpu (Intel Arc Battlemage GPU). 
Examples: - Setup cluster: ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --cpu-or-gpu "g" + Setup cluster (Gaudi GPU): ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --device "hpu" + Setup cluster (BMG GPU): ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "36" --device "xpu" ############################################################################### EOF diff --git a/core/inventory/hosts.yaml b/core/inventory/hosts.yaml index eb2cd095..5a792709 100644 --- a/core/inventory/hosts.yaml +++ b/core/inventory/hosts.yaml @@ -1,28 +1,19 @@ all: hosts: - master: - ansible_host: "{{ private_ip_control_plane_node }}" - ansible_user: "username_of_user_running_automation" - ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa" - worker1: - ansible_host: "{{ private_ip_workload_node_1 }}" - ansible_user: "username_of_user_running_automation" - ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa" - worker2: - ansible_host: "{{ private_ip_workload_node_2 }}" - ansible_user: "username_of_user_running_automation" - ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa" + master1: + ansible_connection: local + ansible_user: gta + ansible_become: true children: kube_control_plane: hosts: - master: + master1: kube_node: hosts: - worker1: - worker2: + master1: etcd: hosts: - master: + master1: k8s_cluster: children: kube_control_plane: diff 
--git a/core/inventory/inference-config.cfg b/core/inventory/inference-config.cfg index e63552d4..75be992a 100644 --- a/core/inventory/inference-config.cfg +++ b/core/inventory/inference-config.cfg @@ -2,14 +2,15 @@ cluster_url=api.example.com cert_file=~/certs/cert.pem key_file=~/certs/key.pem keycloak_client_id=my-client-id -keycloak_admin_user=your-keycloak-admin-user +keycloak_admin_user=admin keycloak_admin_password=changeme -hugging_face_token=your_hugging_face_token -hugging_face_token_falcon3=your_hugging_face_token -models=11 -cpu_or_gpu=cpu +hugging_face_token= +hugging_face_token_falcon3= +models=36 +# Hardware selection: 'cpu' for Xeon CPU, 'hpu'/'gpu'/'gaudi2'/'gaudi3' for Gaudi GPU, 'xpu'/'bmg' for Intel Arc Battlemage GPU +device=xpu vault_pass_code=place-holder-123 -deploy_kubernetes_fresh=on +deploy_kubernetes_fresh=yes deploy_ingress_controller=on deploy_keycloak_apisix=on deploy_genai_gateway=off @@ -20,4 +21,7 @@ deploy_istio=off uninstall_ceph=off # Agentic AI Plugin -deploy_agenticai_plugin=off \ No newline at end of file +deploy_agenticai_plugin=off +http_proxy="" +https_proxy="" +no_proxy="localhost,127.0.0.1,10.96.0.0/12,10.244.0.0/16,192.168.0.0/16,.svc,.cluster.local,10.233.0.1,10.233.0.0/18,10.0.0.0/8,api.example.com" diff --git a/core/inventory/metadata/all.yml b/core/inventory/metadata/all.yml index 0c8d7744..198b4088 100644 --- a/core/inventory/metadata/all.yml +++ b/core/inventory/metadata/all.yml @@ -38,7 +38,10 @@ loadbalancer_apiserver_healthcheck_port: 8081 # disable_host_nameservers: false ## Upstream dns servers -# upstream_dns_servers: +upstream_dns_servers: + - "10.248.2.1" + - "10.45.15.7" + - "10.2.71.6" # - 8.8.8.8 # - 8.8.4.4 @@ -55,15 +58,15 @@ loadbalancer_apiserver_healthcheck_port: 8081 # external_cloud_provider: ## Set these proxy values in order to update package manager and docker daemon to use proxies and custom CA for https_proxy if needed -# http_proxy: "" -# https_proxy: "" +http_proxy: 
"http://proxy-dmz.intel.com:912" +https_proxy: "http://proxy-dmz.intel.com:912" # https_proxy_cert_file: "" # DO NOT CHANGE INDENTATION - used by Automation env_proxy: - http_proxy: "" - https_proxy: "" - no_proxy: "" + http_proxy: "http://proxy-dmz.intel.com:912" + https_proxy: "http://proxy-dmz.intel.com:912" + no_proxy: "localhost,127.0.0.1,10.96.0.0/12,10.244.0.0/16,192.168.0.0/16,.svc,.cluster.local,10.233.0.1,10.233.0.0/18,10.0.0.0/8,intel.com,bmgaisolutions.com,api.example.com" ## Refer to roles/kubespray-defaults/defaults/main/main.yml before modifying no_proxy # no_proxy: "" diff --git a/core/inventory/metadata/inference-metadata.cfg b/core/inventory/metadata/inference-metadata.cfg index 48b01376..913d2ded 100644 --- a/core/inventory/metadata/inference-metadata.cfg +++ b/core/inventory/metadata/inference-metadata.cfg @@ -1,5 +1,6 @@ gaudi2_operator="1.22.0-740" gaudi3_operator="1.22.0-740" +intel_gpu_plugin_version="0.36.0" python3_interpreter="/usr/bin/python3" ingress_controller="4.12.2" keycloak_chart_version="22.1.0" diff --git a/core/inventory/metadata/vars/inference_llm_models.yml b/core/inventory/metadata/vars/inference_llm_models.yml index 51694ea4..7303c5c0 100644 --- a/core/inventory/metadata/vars/inference_llm_models.yml +++ b/core/inventory/metadata/vars/inference_llm_models.yml @@ -17,12 +17,14 @@ ingress_enabled: 'false' deploy_keycloak: 'no' tensor_parallel_size_vllm: 1 gaudi_deployment: 'true' +bmg_deployment: 'false' huggingface_model_id: 'false' hugging_face_model_deployment: 'false' huggingface_model_deployment_name: 'None' hugging_face_model_remove_name: 'false' balloon_policy_cpu: 'None' gaudi_values_file: "{{ remote_helm_charts_base }}/vllm/gaudi-values.yaml" +bmg_values_file: "{{ remote_helm_charts_base }}/vllm/bmg-values.yaml" huggingface_tensor_parellel_size: 'false' vllm_metrics_enabled: 'false' # Total CPUs reserved across all NUMA nodes for system components (keycloak, apisix, observability) diff --git a/core/lib/add-node.sh 
b/core/lib/add-node.sh index ae9e01fb..bd692916 100644 --- a/core/lib/add-node.sh +++ b/core/lib/add-node.sh @@ -44,7 +44,7 @@ add_worker_node() { #Rerun baloon policy if its cpu deployment - if [[ "$cpu_or_gpu" == "c" ]]; then + if [[ "$device" == "cpu" ]]; then echo "Reapplying NRI CPU Balloons for CPU deployments..." execute_and_check "Reapplying NRI CPU Balloons..." deploy_nri_balloons_playbook "$@" \ "NRI CPU Balloons re-applied successfully." \ diff --git a/core/lib/cluster/deployment/fresh-install.sh b/core/lib/cluster/deployment/fresh-install.sh index 1ec01aae..deb3e31d 100644 --- a/core/lib/cluster/deployment/fresh-install.sh +++ b/core/lib/cluster/deployment/fresh-install.sh @@ -57,10 +57,10 @@ fresh_installation() { # Deploy NRI CPU Balloons for CPU deployments (after all infrastructure, before models) if [[ "$deploy_nri_balloon_policy" == "yes" ]]; then # Ensure this is a CPU deployment - if [[ "$cpu_or_gpu" != "c" ]]; then - echo "${RED}Error: NRI Balloon Policy can only be deployed for CPU deployments (cpu_or_gpu='c')${NC}" - echo "${RED}Current cpu_or_gpu setting: '$cpu_or_gpu'${NC}" - echo "${RED}Please set cpu_or_gpu to 'c' or disable NRI balloon policy deployment. Exiting!${NC}" + if [[ "$device" != "cpu" ]]; then + echo "${RED}Error: NRI Balloon Policy can only be deployed for CPU deployments (device='cpu')${NC}" + echo "${RED}Current device setting: '$device'${NC}" + echo "${RED}Please set device to 'cpu' or disable NRI balloon policy deployment. Exiting!${NC}" exit 1 fi execute_and_check "Deploying CPU Optimization (NRI Balloons & Topology Detection)..." deploy_nri_balloons_playbook "$@" \ @@ -76,6 +76,14 @@ fresh_installation() { echo "Skipping Habana AI Operator installation..." fi + if [[ "$deploy_intel_gpu_plugin" == "yes" ]]; then + execute_and_check "Deploying Intel GPU Plugin for Arc BMG..." run_deploy_intel_gpu_plugin_playbook \ + "Intel GPU Plugin is deployed." \ + "Failed to deploy Intel GPU Plugin. Exiting." 
+ else + echo "Skipping Intel GPU Plugin installation..." + fi + if [[ "$uninstall_ceph" == "yes" ]]; then execute_and_check "Uninstalling CEPH storage..." uninstall_ceph_cluster "$@" \ "CEPH is uninstalled successfully." \ diff --git a/core/lib/cluster/nodes/add-node.sh b/core/lib/cluster/nodes/add-node.sh index 6e976731..b03189e9 100644 --- a/core/lib/cluster/nodes/add-node.sh +++ b/core/lib/cluster/nodes/add-node.sh @@ -42,7 +42,7 @@ add_worker_node() { #Rerun baloon policy if its cpu deployment - if [[ "$cpu_or_gpu" == "c" ]]; then + if [[ "$device" == "cpu" ]]; then echo "Reapplying NRI CPU Balloons for CPU deployments..." execute_and_check "Reapplying NRI CPU Balloons..." deploy_nri_balloons_playbook "$@" \ "NRI CPU Balloons re-applied successfully." \ diff --git a/core/lib/components/intel-gpu-plugin.sh b/core/lib/components/intel-gpu-plugin.sh new file mode 100644 index 00000000..af0699b1 --- /dev/null +++ b/core/lib/components/intel-gpu-plugin.sh @@ -0,0 +1,15 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +run_deploy_intel_gpu_plugin_playbook() { + echo "Running the deploy-intel-gpu-plugin.yml playbook to deploy the Intel GPU Plugin for Arc BMG..." + ansible-galaxy collection install community.kubernetes + ansible-playbook -i "${INVENTORY_PATH}" --become --become-user=root playbooks/deploy-intel-gpu-plugin.yml \ + --extra-vars "intel_gpu_plugin_version=${intel_gpu_plugin_version}" + if [ $? -eq 0 ]; then + echo "The deploy-intel-gpu-plugin.yml playbook ran successfully." + else + echo "The deploy-intel-gpu-plugin.yml playbook encountered an error." + exit 1 + fi +} diff --git a/core/lib/models/install-model-hf.sh b/core/lib/models/install-model-hf.sh index 08a2b146..c09de6c6 100644 --- a/core/lib/models/install-model-hf.sh +++ b/core/lib/models/install-model-hf.sh @@ -19,7 +19,7 @@ deploy_from_huggingface() { echo "${YELLOW}NOTICE: The model deployment name will be used as the release identifier for deployment. 
It must be unique, meaningful, and follow Kubernetes naming conventions — lowercase letters, numbers, and hyphens only. Capital letters or special characters are not allowed. ${NC}" read -p "Enter Deployment Name for the Model: " huggingface_model_deployment_name echo "${YELLOW}NOTICE: Ensure the Tensor Parallel size value corresponds to the number of available Gaudi cards. Providing an incorrect value may result in the model being in a not ready state. ${NC}" - if [ "$cpu_or_gpu" = "g" ] || [ "$cpu_or_gpu" = "gaudi2" ] || [ "$cpu_or_gpu" = "gaudi3" ]; then + if [ "$device" = "hpu" ] || [ "$device" = "gaudi2" ] || [ "$device" = "gaudi3" ] || [ "$device" = "xpu" ] || [ "$device" = "bmg" ]; then read -p "Enter the Tensor Parallel size:" -r huggingface_tensor_parellel_size if ! [[ "$huggingface_tensor_parellel_size" =~ ^[0-9]+$ ]]; then echo "Invalid input: Tensor Parallel size must be a positive integer." diff --git a/core/lib/models/install-model.sh b/core/lib/models/install-model.sh index 40321f8d..3d464f24 100644 --- a/core/lib/models/install-model.sh +++ b/core/lib/models/install-model.sh @@ -6,22 +6,32 @@ deploy_inference_llm_models_playbook() { install_true="true" enable_cpu_balloons="false" - if [ "$cpu_or_gpu" == "c" ]; then + if [ "$device" == "cpu" ]; then cpu_playbook="true" gpu_playbook="false" gaudi_deployment="false" + bmg_deployment="false" enable_cpu_balloons="true" # Enable NRI balloons for CPU deployments huggingface_model_deployment_name="${huggingface_model_deployment_name}-cpu" if [ "$balloon_policy_cpu" == "enabled" ]; then echo "${GREEN}CPU deployment detected - using generic NRI balloon policy${NC}" fi fi - if [ "$cpu_or_gpu" == "g" ]; then + if [ "$device" == "hpu" ]; then cpu_playbook="false" gpu_playbook="true" gaudi_deployment="true" + bmg_deployment="false" enable_cpu_balloons="false" fi + if [ "$device" == "xpu" ]; then + cpu_playbook="false" + gpu_playbook="true" + gaudi_deployment="false" + bmg_deployment="true" + 
enable_cpu_balloons="false" + echo "${GREEN}Intel Arc Battlemage (BMG) GPU deployment detected${NC}" + fi if [ "$deploy_apisix" == "no" ]; then apisix_enabled="false" else @@ -42,12 +52,17 @@ deploy_inference_llm_models_playbook() { gaudi_values_file=$gaudi2_values_file_path elif [[ "$gaudi_platform" == "gaudi3" ]]; then gaudi_values_file=$gaudi3_values_file_path - fi + fi + + if [[ "$bmg_deployment" == "true" ]]; then + bmg_values_file=$bmg_values_file_path + fi echo "Ingress based Deployment: $ingress_enabled" echo "APISIX Enabled: $apisix_enabled" - echo "Keycloak Enabled: $deploy_keycloak" + echo "Keycloak Enabled: $deploy_keycloak" echo "Gaudi based: $gaudi_deployment" + echo "BMG (Intel Arc) based: $bmg_deployment" echo "Model Metrics Enabled: $vllm_metrics_enabled" echo "CPU NRI Balloons: $enable_cpu_balloons" @@ -77,7 +92,7 @@ deploy_inference_llm_models_playbook() { fi ansible-playbook -i "${INVENTORY_PATH}" playbooks/deploy-inference-models.yml \ - --extra-vars "kubernetes_platform=${kubernetes_platform} secret_name=${cluster_url} cert_file=${cert_file} key_file=${key_file} keycloak_admin_user=${keycloak_admin_user} keycloak_admin_password=${keycloak_admin_password} keycloak_client_id=${keycloak_client_id} hugging_face_token=${hugging_face_token} install_true=${install_true} model_name_list='${model_name_list//\ /,}' cpu_playbook=${cpu_playbook} gpu_playbook=${gpu_playbook} hugging_face_token_falcon3=${hugging_face_token_falcon3} deploy_keycloak=${deploy_keycloak} apisix_enabled=${apisix_enabled} ingress_enabled=${ingress_enabled} gaudi_deployment=${gaudi_deployment} huggingface_model_id=${huggingface_model_id} hugging_face_model_deployment=${hugging_face_model_deployment} huggingface_model_deployment_name=${huggingface_model_deployment_name} deploy_inference_llm_models_playbook=${deploy_inference_llm_models_playbook} huggingface_tensor_parellel_size=${huggingface_tensor_parellel_size} deploy_genai_gateway=${deploy_genai_gateway} 
vllm_metrics_enabled=${vllm_metrics_enabled} gaudi_values_file=${gaudi_values_file} xeon_values_file=${xeon_values_file_path} deploy_ceph=${deploy_ceph} enable_cpu_balloons=${enable_cpu_balloons} balloon_policy_cpu=${balloon_policy_cpu} aws_certificate_arn=${aws_certificate_arn}" --tags "$tags" --vault-password-file "$vault_pass_file" + --extra-vars "kubernetes_platform=${kubernetes_platform} secret_name=${cluster_url} cert_file=${cert_file} key_file=${key_file} keycloak_admin_user=${keycloak_admin_user} keycloak_admin_password=${keycloak_admin_password} keycloak_client_id=${keycloak_client_id} hugging_face_token=${hugging_face_token} install_true=${install_true} model_name_list='${model_name_list//\ /,}' cpu_playbook=${cpu_playbook} gpu_playbook=${gpu_playbook} hugging_face_token_falcon3=${hugging_face_token_falcon3} deploy_keycloak=${deploy_keycloak} apisix_enabled=${apisix_enabled} ingress_enabled=${ingress_enabled} gaudi_deployment=${gaudi_deployment} bmg_deployment=${bmg_deployment} huggingface_model_id=${huggingface_model_id} hugging_face_model_deployment=${hugging_face_model_deployment} huggingface_model_deployment_name=${huggingface_model_deployment_name} deploy_inference_llm_models_playbook=${deploy_inference_llm_models_playbook} huggingface_tensor_parellel_size=${huggingface_tensor_parellel_size} deploy_genai_gateway=${deploy_genai_gateway} vllm_metrics_enabled=${vllm_metrics_enabled} gaudi_values_file=${gaudi_values_file} xeon_values_file=${xeon_values_file_path} bmg_values_file=${bmg_values_file_path} deploy_ceph=${deploy_ceph} enable_cpu_balloons=${enable_cpu_balloons} balloon_policy_cpu=${balloon_policy_cpu} aws_certificate_arn=${aws_certificate_arn}" --tags "$tags" --vault-password-file "$vault_pass_file" } @@ -113,10 +128,10 @@ add_model() { # Deploy NRI CPU Balloons for CPU deployments (after all infrastructure, before models) if [[ "$deploy_nri_balloon_policy" == "yes" ]]; then # Ensure this is a CPU deployment - if [[ "$cpu_or_gpu" != "c" ]]; 
then - echo "${RED}Error: NRI Balloon Policy can only be deployed for CPU deployments (cpu_or_gpu='c')${NC}" - echo "${RED}Current cpu_or_gpu setting: '$cpu_or_gpu'${NC}" - echo "${RED}Please set cpu_or_gpu to 'c' or disable NRI balloon policy deployment. Exiting!${NC}" + if [[ "$device" != "cpu" ]]; then + echo "${RED}Error: NRI Balloon Policy can only be deployed for CPU deployments (device='cpu')${NC}" + echo "${RED}Current device setting: '$device'${NC}" + echo "${RED}Please set device to 'cpu' or disable NRI balloon policy deployment. Exiting!${NC}" exit 1 fi execute_and_check "Deploying CPU Optimization (NRI Balloons & Topology Detection)..." deploy_nri_balloons_playbook "$@" \ diff --git a/core/lib/models/model-selection.sh b/core/lib/models/model-selection.sh index 2d09ee5f..2040db7f 100644 --- a/core/lib/models/model-selection.sh +++ b/core/lib/models/model-selection.sh @@ -23,7 +23,7 @@ model_selection(){ if [ "$hugging_face_model_deployment" != "true" ]; then if [ -z "$models" ]; then if [ "$hugging_face_model_remove_deployment" != "true" ]; then - if [ "$cpu_or_gpu" = "g" ]; then + if [ "$device" = "hpu" ]; then # Prompt for GPU models echo "Available Models for GPU Deployment:" echo "1. meta-llama/Llama-3.1-8B-Instruct" @@ -49,6 +49,24 @@ model_selection(){ exit 1 fi done + elif [ "$device" = "xpu" ]; then + # Prompt for BMG (Intel Arc Battlemage) GPU models + echo "Available Models for Intel Arc Battlemage (BMG) GPU Deployment:" + echo "31. meta-llama/Llama-3.1-8B-Instruct" + echo "32. mistralai/Mistral-7B-Instruct-v0.3" + echo "33. deepseek-ai/DeepSeek-R1-Distill-Llama-8B" + echo "34. Qwen/Qwen2.5-7B-Instruct" + echo "35. tiiuae/Falcon3-7B-Instruct" + echo "36. Qwen/Qwen2.5-Coder-3B-Instruct" + read -p "Enter the numbers of the BMG models you want to deploy/remove (comma-separated, e.g., 31,33): " models + # Validate input + IFS=',' read -ra selected <<< "$models" + for m in "${selected[@]}"; do + if ! 
[[ "$m" =~ ^(31|32|33|34|35|36)$ ]]; then + echo "Error: Invalid model selected ($m). Exiting." >&2 + exit 1 + fi + done else # Prompt for CPU models echo "Available Models for CPU Deployment:" @@ -78,8 +96,10 @@ model_selection(){ if [ "$hugging_face_model_remove_deployment" != "true" ]; then if [ -n "$model_names" ]; then if [ "$hugging_face_model_deployment" != "true" ]; then - if [ "$cpu_or_gpu" = "g" ]; then + if [ "$device" = "hpu" ]; then echo "Deploying/removing GPU models: $model_names" + elif [ "$device" = "xpu" ]; then + echo "Deploying/removing Intel Arc BMG GPU models: $model_names" else echo "Deploying/removing CPU models: $model_names" fi @@ -104,162 +124,211 @@ get_model_names() { for model in "${model_array[@]}"; do case "$model" in 1) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("llama-8b") ;; 2) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("llama-70b") ;; 3) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("llama3-405b") ;; 4) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." 
>&2 exit 1 fi model_names+=("llama-3-3-70b") ;; 5) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("llama-4-scout-17b") ;; 6) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("qwen-2-5-32b") ;; 7) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("deepseek-r1-distill-qwen-32b") ;; 8) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("deepseek-r1-distill-llama8b") ;; 9) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("mixtral-8x-7b") ;; 10) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." 
>&2 exit 1 fi model_names+=("mistral-7b") ;; 11) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("tei") ;; 12) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("rerank") ;; 13) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("codellama-34b") ;; 14) - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("falcon3-7b") ;; 21) - if [ "$cpu_or_gpu" = "g" ]; then - echo "Error: CPU model identifier provided for GPU deployment/removal." >&2 + if [ "$device" = "hpu" ] || [ "$device" = "xpu" ]; then + echo "Error: CPU model identifier provided for hpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("cpu-llama-8b") ;; 22) - if [ "$cpu_or_gpu" = "g" ]; then - echo "Error: CPU model identifier provided for GPU deployment/removal." >&2 + if [ "$device" = "hpu" ] || [ "$device" = "xpu" ]; then + echo "Error: CPU model identifier provided for hpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("cpu-llama-3-2-3b") ;; 23) - if [ "$cpu_or_gpu" = "g" ]; then - echo "Error: CPU model identifier provided for GPU deployment/removal." 
>&2 + if [ "$device" = "hpu" ] || [ "$device" = "xpu" ]; then + echo "Error: CPU model identifier provided for hpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("cpu-deepseek-r1-distill-llama8b") ;; 24) - if [ "$cpu_or_gpu" = "g" ]; then - echo "Error: CPU model identifier provided for GPU deployment/removal." >&2 + if [ "$device" = "hpu" ] || [ "$device" = "xpu" ]; then + echo "Error: CPU model identifier provided for hpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("cpu-deepseek-r1-distill-qwen-32b") ;; 25) - if [ "$cpu_or_gpu" = "g" ]; then - echo "Error: CPU model identifier provided for GPU deployment/removal." >&2 + if [ "$device" = "hpu" ] || [ "$device" = "xpu" ]; then + echo "Error: CPU model identifier provided for hpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("cpu-qwen3-1-7b") ;; 26) - if [ "$cpu_or_gpu" = "g" ]; then - echo "Error: CPU model identifier provided for GPU deployment/removal." >&2 + if [ "$device" = "hpu" ] || [ "$device" = "xpu" ]; then + echo "Error: CPU model identifier provided for hpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("cpu-qwen3-4b") ;; 27) - if [ "$cpu_or_gpu" = "g" ]; then - echo "Error: CPU model identifier provided for GPU deployment/removal." >&2 + if [ "$device" = "hpu" ] || [ "$device" = "xpu" ]; then + echo "Error: CPU model identifier provided for hpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("cpu-qwen3-coder-30b") ;; + 31) + if [ "$device" = "cpu" ] || [ "$device" = "hpu" ]; then + echo "Error: XPU model identifier provided for cpu/hpu deployment/removal." >&2 + exit 1 + fi + model_names+=("bmg-llama-8b") + ;; + 32) + if [ "$device" = "cpu" ] || [ "$device" = "hpu" ]; then + echo "Error: XPU model identifier provided for cpu/hpu deployment/removal." >&2 + exit 1 + fi + model_names+=("bmg-mistral-7b") + ;; + 33) + if [ "$device" = "cpu" ] || [ "$device" = "hpu" ]; then + echo "Error: XPU model identifier provided for cpu/hpu deployment/removal." 
>&2 + exit 1 + fi + model_names+=("bmg-deepseek-r1-distill-llama8b") + ;; + 34) + if [ "$device" = "cpu" ] || [ "$device" = "hpu" ]; then + echo "Error: XPU model identifier provided for cpu/hpu deployment/removal." >&2 + exit 1 + fi + model_names+=("bmg-qwen-2-5-7b") + ;; + 35) + if [ "$device" = "cpu" ] || [ "$device" = "hpu" ]; then + echo "Error: XPU model identifier provided for cpu/hpu deployment/removal." >&2 + exit 1 + fi + model_names+=("bmg-falcon3-7b") + ;; + 36) + if [ "$device" = "cpu" ] || [ "$device" = "hpu" ]; then + echo "Error: XPU model identifier provided for cpu/hpu deployment/removal." >&2 + exit 1 + fi + model_names+=("bmg-qwen-2-5-coder-3b") + ;; "llama-8b"|"llama-70b"|"codellama-34b"|"mixtral-8x-7b"|"mistral-7b"|"tei"|"tei-rerank"|"falcon3-7b"|"deepseek-r1-distill-qwen-32b"|"deepseek-r1-distill-llama8b"|"llama3-405b"|"llama-3-3-70b"|"llama-4-scout-17b"|"qwen-2-5-32b") - if [ "$cpu_or_gpu" = "c" ]; then - echo "Error: GPU model identifier provided for CPU deployment/removal." >&2 + if [ "$device" = "cpu" ] || [ "$device" = "xpu" ]; then + echo "Error: Gaudi GPU (hpu) model identifier provided for cpu/xpu deployment/removal." >&2 exit 1 fi model_names+=("$model") ;; "cpu-llama-8b"|"cpu-deepseek-r1-distill-qwen-32b"|"cpu-deepseek-r1-distill-llama8b"|"cpu-qwen3-1-7b"|"cpu-llama-3-2-3b"|"cpu-qwen3-4b"|"cpu-qwen3-coder-30b") - if [ "$cpu_or_gpu" = "g" ]; then - echo "Error: CPU model identifier provided for GPU deployment/removal." >&2 + if [ "$device" = "hpu" ] || [ "$device" = "xpu" ]; then + echo "Error: CPU model identifier provided for hpu/xpu deployment/removal." >&2 + exit 1 + fi + model_names+=("$model") + ;; + "bmg-llama-8b"|"bmg-mistral-7b"|"bmg-deepseek-r1-distill-llama8b"|"bmg-qwen-2-5-7b"|"bmg-falcon3-7b"|"bmg-qwen-2-5-coder-3b") + if [ "$device" = "cpu" ] || [ "$device" = "hpu" ]; then + echo "Error: XPU model identifier provided for cpu/hpu deployment/removal." 
>&2 exit 1 fi model_names+=("$model") diff --git a/core/lib/models/uninstall-model-hf.sh b/core/lib/models/uninstall-model-hf.sh index f389d6e3..ccab7575 100644 --- a/core/lib/models/uninstall-model-hf.sh +++ b/core/lib/models/uninstall-model-hf.sh @@ -23,7 +23,7 @@ remove_model_deployed_via_huggingface(){ fi read -p "Enter the deployment name of the model you wish to deprovision: " hugging_face_model_remove_name - if [ "$cpu_or_gpu" == "c" ]; then + if [ "$device" == "cpu" ]; then hugging_face_model_remove_name="${hugging_face_model_remove_name}-cpu" fi if [ -n "$hugging_face_model_remove_name" ]; then diff --git a/core/lib/system/config-vars.sh b/core/lib/system/config-vars.sh index a0245c61..f6a1b971 100644 --- a/core/lib/system/config-vars.sh +++ b/core/lib/system/config-vars.sh @@ -15,7 +15,7 @@ keycloak_admin_password="" hugging_face_token="" models="" model_name_list="" -cpu_or_gpu="" +device="" deploy_kubernetes_fresh="" deploy_habana_ai_operator="" deploy_ingress_controller="" @@ -43,6 +43,8 @@ gaudi_platform="" gaudi_operator="" gaudi2_values_file_path="" gaudi3_values_file_path="" +bmg_values_file_path="" +deploy_intel_gpu_plugin="" python3_interpreter="" skip_check="" purge_inference_cluster="" diff --git a/core/lib/system/precheck/read-config-file.sh b/core/lib/system/precheck/read-config-file.sh index 45c6baf1..5f5df6d6 100644 --- a/core/lib/system/precheck/read-config-file.sh +++ b/core/lib/system/precheck/read-config-file.sh @@ -10,13 +10,15 @@ read_config_file() { # Trim leading/trailing whitespace key=$(echo "$key" | xargs) value=$(echo "$value" | xargs) + # Skip empty lines and comments + [[ -z "$key" || "$key" =~ ^#.* ]] && continue # Set the variable using a temporary file if [[ "$value" == "on" ]]; then value="yes" elif [[ "$value" == "off" ]]; then value="no" fi - printf "%s=%s\n" "$key" "$value" >> temp_env_vars + printf "%s=%s\n" "$key" "$value" >> temp_env_vars done < "$config_file" # Load the environment variables from the temporary file 
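Review note: the key/value handling that this hunk adds to `read_config_file` can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper named `parse_config_line` (it is not part of the PR) that applies the same steps to one line: split on `=`, trim with `xargs`, skip blank and comment lines, and map `on`/`off` to `yes`/`no`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: applies the per-line logic of
# read_config_file to a single "key=value" string and prints the result.
parse_config_line() {
    local line="$1" key value
    # Split on the first '='; anything after it stays in $value
    IFS='=' read -r key value <<< "$line"
    # Trim leading/trailing whitespace, as the PR does with xargs
    key=$(echo "$key" | xargs)
    value=$(echo "$value" | xargs)
    # Skip empty lines and comments (the check added in this hunk)
    [[ -z "$key" || "$key" =~ ^# ]] && return 0
    # Normalize on/off to yes/no
    if [[ "$value" == "on" ]]; then
        value="yes"
    elif [[ "$value" == "off" ]]; then
        value="no"
    fi
    printf '%s=%s\n' "$key" "$value"
}

parse_config_line 'device = xpu'
parse_config_line '# device=cpu'
parse_config_line 'deploy_ceph=on'
```

Without the skip, a comment line would be written to `temp_env_vars` as an assignment with a key beginning with `#`, which is presumably what the added check is meant to prevent.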
@@ -26,9 +28,11 @@ read_config_file() { if [ -f "$metadata_config_file" ]; then echo "Metadata configuration file found, setting vars!" echo "---------------------------------------" - while IFS='=' read -r key value || [ -n "$key" ]; do + while IFS='=' read -r key value || [ -n "$key" ]; do key=$(echo "$key" | xargs) - value=$(echo "$value" | xargs) + value=$(echo "$value" | xargs) + # Skip empty lines and comments + [[ -z "$key" || "$key" =~ ^#.* ]] && continue printf "%s=%s\n" "$key" "$value" >> temp_env_vars_metadata done < "$metadata_config_file" source temp_env_vars_metadata @@ -58,25 +62,83 @@ read_config_file() { sed -i -E "/^env_proxy:/,/^[^[:space:]]/s|^[[:space:]]*no_proxy:.*| no_proxy: \"$no_proxy\"|" "$INVENTORY_ALL_FILE" export no_proxy fi + + # Detect real upstream DNS servers from the host, skipping link-local (169.254.x.x) + # and loopback (127.x.x.x) addresses that cause CoreDNS forwarding loops. + # /run/systemd/resolve/resolv.conf is used when available (more reliable than /etc/resolv.conf + # which may symlink to nodelocaldns or stub-resolver on systemd-resolved systems). + local _resolv_src + if [[ -f /run/systemd/resolve/resolv.conf ]]; then + _resolv_src=/run/systemd/resolve/resolv.conf + else + _resolv_src=/etc/resolv.conf + fi + local _upstream_dns + _upstream_dns=$(grep -E "^nameserver" "$_resolv_src" \ + | awk '{print $2}' \ + | grep -vE "^(127\.|169\.254\.)" \ + | head -3) + if [[ -n "$_upstream_dns" ]]; then + # Build the replacement lines for the upstream_dns_servers block. + # The block already exists commented-out in all.yml; we uncomment it + # and replace the placeholder IPs. Using sed keeps the block at its + # original position and is safe to run repeatedly (idempotent). 
+ local _dns_list_sed + _dns_list_sed=$(echo "$_upstream_dns" | awk 'BEGIN{ORS="\\n"} {printf " - \"%s\"", $1}') + + # Step 1: uncomment the key line (handles both commented and already-active) + sed -i -E 's|^#[[:space:]]*(upstream_dns_servers:.*)|\1|' "$INVENTORY_ALL_FILE" + + # Step 2: replace the entire list under upstream_dns_servers with detected IPs. + # Matches all consecutive " - ..." lines that follow the key and replaces them. + python3 - "$INVENTORY_ALL_FILE" "$_upstream_dns" <<'PYEOF' +import sys +path = sys.argv[1] +servers = [s for s in sys.argv[2].split('\n') if s.strip()] +lines = open(path).readlines() +out, in_block = [], False +for line in lines: + if line.startswith('upstream_dns_servers:'): + out.append(line) + for s in servers: + out.append(f' - "{s}"\n') + in_block = True + continue + if in_block: + # skip old list entries; stop skipping on any non-list line + if line.startswith(' - ') or (line.strip() == '' and in_block): + continue + in_block = False + out.append(line) +open(path, 'w').writelines(out) +PYEOF + fi - case "$cpu_or_gpu" in - "c" | "cpu") - cpu_or_gpu="c" + case "$device" in + "cpu") + device="cpu" deploy_habana_ai_operator="no" + deploy_intel_gpu_plugin="no" ;; - "g" | "gpu" | "gaudi2" | "gaudi3") - if [[ "$cpu_or_gpu" == "gaudi2" || "$cpu_or_gpu" == "gpu" || "$cpu_or_gpu" == "g" ]]; then + "hpu" | "gpu" | "gaudi2" | "gaudi3") + if [[ "$device" == "gaudi2" || "$device" == "gpu" || "$device" == "hpu" ]]; then gaudi_platform="gaudi2" - elif [[ "$cpu_or_gpu" == "gaudi3" ]]; then + elif [[ "$device" == "gaudi3" ]]; then gaudi_platform="gaudi3" fi - cpu_or_gpu="g" - deploy_habana_ai_operator="yes" + device="hpu" + deploy_habana_ai_operator="yes" + deploy_intel_gpu_plugin="no" + ;; + "xpu" | "bmg") + device="xpu" + deploy_habana_ai_operator="no" + deploy_intel_gpu_plugin="yes" ;; *) - echo "Invalid value for cpu_or_gpu. It should be 'c' or 'cpu' for CPU, or 'g', 'gpu', 'gaudi2', or 'gaudi3' for GPU." 
+ echo "Invalid value for device. It should be 'cpu' for CPU, 'hpu', 'gpu', 'gaudi2', or 'gaudi3' for Gaudi GPU, or 'xpu' or 'bmg' for Intel Arc Battlemage GPU." exit 1 ;; esac diff --git a/core/lib/system/setup-env.sh b/core/lib/system/setup-env.sh index 9df77aa6..54af4829 100644 --- a/core/lib/system/setup-env.sh +++ b/core/lib/system/setup-env.sh @@ -19,17 +19,22 @@ setup_initial_env() { git config --global http.proxy "$https_proxy" git config --global https.proxy "$https_proxy" fi - if [ ! -d "$KUBESPRAYDIR" ]; then + if [ ! -d "$KUBESPRAYDIR/.git" ]; then + # Remove incomplete kubespray directory if it exists + if [ -d "$KUBESPRAYDIR" ]; then + echo "Removing incomplete Kubespray directory..." + rm -rf "$KUBESPRAYDIR" + fi git clone https://github.com/kubernetes-sigs/kubespray.git $KUBESPRAYDIR if [ $? -ne 0 ] || [ ! -d "$KUBESPRAYDIR/.git" ]; then echo -e "${RED}----------------------------------------------------------------------------${NC}" - echo -e "${RED}| NOTICE: Failed to clone Kubespray Repository. |${NC}" - echo -e "${RED}| Unable to proceed with Inference Stack Deployment |${NC}" - echo -e "${RED}| due to missing dependency |${NC}" - echo -e "${RED}----------------------------------------------------------------------------${NC}" + echo -e "${RED}| NOTICE: Failed to clone Kubespray Repository. |${NC}" + echo -e "${RED}| Unable to proceed with Inference Stack Deployment |${NC}" + echo -e "${RED}| due to missing dependency |${NC}" + echo -e "${RED}----------------------------------------------------------------------------${NC}" exit 1 fi - cd $KUBESPRAYDIR + cd $KUBESPRAYDIR git checkout "$kubespray_version" else echo "Kubespray directory already exists, skipping clone." 
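Review note: the `case "$device"` block in `read-config-file.sh` (earlier in this diff) collapses several aliases into three canonical device values and toggles the two operator flags accordingly. The following is a minimal sketch of that mapping, assuming a hypothetical function `normalize_device` (not part of the PR) that prints the canonical device, `deploy_habana_ai_operator`, `deploy_intel_gpu_plugin`, and `gaudi_platform` on one line. The `-` placeholder for "no Gaudi platform" is an illustration only; the PR leaves `gaudi_platform` untouched in those branches.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR.
# Prints: <device> <deploy_habana_ai_operator> <deploy_intel_gpu_plugin> <gaudi_platform>
normalize_device() {
    case "$1" in
        cpu)
            echo "cpu no no -" ;;
        hpu|gpu|gaudi2)
            # 'hpu' and the legacy 'gpu' alias default to the Gaudi2 platform
            echo "hpu yes no gaudi2" ;;
        gaudi3)
            echo "hpu yes no gaudi3" ;;
        xpu|bmg)
            # Intel Arc Battlemage: GPU plugin on, Habana operator off
            echo "xpu no yes -" ;;
        *)
            echo "Invalid value for device: '$1'" >&2
            return 1 ;;
    esac
}

normalize_device bmg
normalize_device gaudi3
```

Keeping the mapping in one place makes it easier to see that the config file accepts the aliases `gpu`, `gaudi2`, `gaudi3`, and `bmg`, whereas the interactive prompt in `parse-user-prompts.sh` only recognizes `cpu`, `hpu`, and `xpu`.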
@@ -96,6 +101,7 @@ setup_initial_env() { gaudi2_values_file_path="$REMOTEDIR/vllm/gaudi-values.yaml" gaudi3_values_file_path="$REMOTEDIR/vllm/gaudi3-values.yaml" xeon_values_file_path="$REMOTEDIR/vllm/xeon-values.yaml" + bmg_values_file_path="$REMOTEDIR/vllm/bmg-values.yaml" cp "$HOMEDIR"/inventory/metadata/addons.yml $KUBESPRAYDIR/inventory/mycluster/group_vars/k8s_cluster/addons.yml cp "$HOMEDIR"/inventory/metadata/all.yml $KUBESPRAYDIR/inventory/mycluster/group_vars/all/all.yml cp -r "$HOMEDIR"/roles/* $KUBESPRAYDIR/roles/ diff --git a/core/lib/user-menu/parse-user-prompts.sh b/core/lib/user-menu/parse-user-prompts.sh index 0928bdf2..c404d767 100644 --- a/core/lib/user-menu/parse-user-prompts.sh +++ b/core/lib/user-menu/parse-user-prompts.sh @@ -12,7 +12,7 @@ parse_arguments() { --keycloak-admin-password) keycloak_admin_password="$2"; shift ;; --hugging-face-token) hugging_face_token="$2"; shift ;; --models) models="$2"; shift ;; - --cpu-or-gpu) cpu_or_gpu="$2"; shift ;; + --device) device="$2"; shift ;; --deploy-nri-balloon-policy) deploy_nri_balloon_policy="$2"; shift ;; --skip-check) skip_check="true" ;; -h|--help) usage; exit 0 ;; @@ -86,14 +86,14 @@ prompt_for_input() { if [ -z "$deploy_nri_balloon_policy" ]; then # Automatically enable NRI balloon policy for CPU deployments - if [ "$cpu_or_gpu" == "c" ]; then + if [ "$device" == "cpu" ]; then deploy_nri_balloon_policy="yes" if [ "$balloon_policy_cpu" == "enabled" ]; then echo "NRI CPU Balloon Policy automatically enabled for CPU deployment" fi else deploy_nri_balloon_policy="no" - echo "NRI CPU Balloon Policy disabled for GPU deployment" + echo "NRI CPU Balloon Policy disabled for GPU/BMG deployment" fi else echo "Proceeding with the setup of NRI CPU Balloon Policy: $deploy_nri_balloon_policy" @@ -135,24 +135,28 @@ prompt_for_input() { fi fi - if [[ -z "$cpu_or_gpu" ]]; then - read -p "Do you want to run on CPU or GPU? 
(c/g): " cpu_or_gpu - case "$cpu_or_gpu" in - c|C) - cpu_or_gpu="c" + if [[ -z "$device" ]]; then + read -p "Do you want to run on CPU, Gaudi GPU, or Intel Arc BMG GPU? (cpu/hpu/xpu): " device + case "$device" in + cpu|CPU) + device="cpu" echo "Running on CPU" ;; - g|G) - cpu_or_gpu="g" - echo "Running on GPU" + hpu|HPU) + device="hpu" + echo "Running on Gaudi GPU" + ;; + xpu|XPU) + device="xpu" + echo "Running on Intel Arc Battlemage (BMG) GPU" ;; *) echo "Invalid option. Defaulting to CPU." - cpu_or_gpu="c" + device="cpu" ;; esac else - echo "cpu_or_gpu is already set to $cpu_or_gpu" + echo "device is already set to $device" fi } \ No newline at end of file diff --git a/core/lib/xeon/ballon-policy.sh b/core/lib/xeon/ballon-policy.sh index 69069fa6..9466d03c 100644 --- a/core/lib/xeon/ballon-policy.sh +++ b/core/lib/xeon/ballon-policy.sh @@ -5,14 +5,14 @@ deploy_nri_balloons_playbook() { if [ "$balloon_policy_cpu" = "enabled" ]; then echo "Deploying CPU Optimization (NRI Balloons & Topology Detection)..." 
# Strict CPU deployment check - if [[ "$cpu_or_gpu" != "c" ]]; then + if [[ "$device" != "cpu" ]]; then echo "${RED}Error: CPU optimization can only be deployed for CPU deployments${NC}" - echo "${RED}Current cpu_or_gpu setting: '$cpu_or_gpu'${NC}" + echo "${RED}Current device setting: '$device'${NC}" echo "${RED}CPU optimization is specifically designed for CPU resource management${NC}" exit 1 fi - if [ "$deploy_nri_balloon_policy" == "yes" ] || [ "$cpu_or_gpu" == "c" ]; then + if [ "$deploy_nri_balloon_policy" == "yes" ] || [ "$device" == "cpu" ]; then echo "${GREEN}Deploying CPU optimization with topology detection and NRI balloon policy${NC}" ansible-playbook -i "${INVENTORY_PATH}" playbooks/deploy-cpu-optimization.yml \ --extra-vars "cpu_playbook=true" \ diff --git a/core/playbooks/deploy-inference-models.yml b/core/playbooks/deploy-inference-models.yml index ee370268..01e42b5f 100644 --- a/core/playbooks/deploy-inference-models.yml +++ b/core/playbooks/deploy-inference-models.yml @@ -2687,6 +2687,534 @@ + - name: Deploy BMG Llama-3.1-8B LLM Model + block: + - name: Delete Ingress resource BMG Llama-3.1-8B from default namespace + kubernetes.core.k8s: + kind: Ingress + namespace: default + name: vllm-bmg-llama-8b-ingress + state: absent + tags: install-bmg-llama-8b + - name: Delete Ingress resource BMG Llama-3.1-8B from auth-apisix namespace + kubernetes.core.k8s: + kind: Ingress + namespace: auth-apisix + name: vllm-bmg-llama-8b-ingress + state: absent + tags: install-bmg-llama-8b + - name: Deploy BMG LLM model Llama-3.1-8B + ansible.builtin.command: >- + helm upgrade --install vllm-bmg-llama-8b "{{ remote_helm_charts_base }}/vllm" + --set LLM_MODEL_ID="meta-llama/Llama-3.1-8B-Instruct" + --set global.monitoring="{{ vllm_metrics_enabled }}" + --set svcmonitor.enabled="{{ vllm_metrics_enabled }}" + --set global.HUGGINGFACEHUB_API_TOKEN={{ hugging_face_token }} + {% if bmg_deployment|lower == "true" %} + --values {{ bmg_values_file }} + --set 
tensor_parallel_size={{ tensor_parallel_size | default(1) }} + {% endif %} + {% if apisix_enabled %} + --set apisix.enabled={{ apisix_enabled }} + --set platform={{ kubernetes_platform }} + {% endif %} + {% if ingress_enabled %} + --set ingress.enabled={{ ingress_enabled }} + --set ingress.host={{ secret_name }} + --set ingress.secretname={{ secret_name }} + {% if kubernetes_platform == "eks" %} + --set aws_certificate_arn={{ aws_certificate_arn | default('') }} + {% endif %} + {% endif %} + {% if deploy_keycloak == 'yes' and apisix_enabled %} + --set oidc.client_id={{ keycloak_client_id | default('') }} + --set oidc.client_secret={{ client_secret | default('') }} + {% endif %} + {{ helm_proxy_args | default('') }} + --force + register: helm_upgrade_install_bmg_llama_8b + failed_when: helm_upgrade_install_bmg_llama_8b.rc != 0 + tags: install-bmg-llama-8b + - name: Register BMG Llama-3.1-8B model + import_tasks: register-model-genai-gateway.yml + vars: + reg_model_name: "meta-llama/Llama-3.1-8B-Instruct" + reg_litellm_model: "openai/meta-llama/Llama-3.1-8B-Instruct" + reg_custom_llm_provider: "openai" + reg_api_base: "http://vllm-bmg-llama-8b-service.default/v1" + reg_input_cost_per_token: 0.001 + reg_output_cost_per_token: 0.002 + tags: + - install-bmg-llama-8b + - install-genai-gateway + run_once: true + when: + - "'install-bmg-llama-8b' in ansible_run_tags" + - "'install-genai-gateway' in ansible_run_tags" + run_once: true + when: + - model_name_list is defined + - bmg_deployment|lower == "true" + - install_true == 'true' + - "'bmg-llama-8b' in (model_name_list | regex_replace(',', ' ') | split())" + + - name: Check if vllm-bmg-llama-8b Model is deployed + ansible.builtin.command: + cmd: "helm list --filter vllm-bmg-llama-8b --short" + register: helm_release_bmg_llama_8b_installed + ignore_errors: true + run_once: true + tags: uninstall-bmg-llama-8b + - name: Uninstall vllm-bmg-llama-8b Model + ansible.builtin.command: + cmd: "helm uninstall vllm-bmg-llama-8b" + 
run_once: true + tags: uninstall-bmg-llama-8b + when: + - "'bmg-llama-8b' in (model_name_list | regex_replace(',', ' ') | split())" + - uninstall_true == 'true' + - helm_release_bmg_llama_8b_installed.stdout != "" + + - name: Deploy BMG Mistral-7B LLM Model + block: + - name: Delete Ingress resource BMG Mistral-7B from default namespace + kubernetes.core.k8s: + kind: Ingress + namespace: default + name: vllm-bmg-mistral-7b-ingress + state: absent + tags: install-bmg-mistral-7b + - name: Delete Ingress resource BMG Mistral-7B from auth-apisix namespace + kubernetes.core.k8s: + kind: Ingress + namespace: auth-apisix + name: vllm-bmg-mistral-7b-ingress + state: absent + tags: install-bmg-mistral-7b + - name: Deploy BMG LLM model Mistral-7B + ansible.builtin.command: >- + helm upgrade --install vllm-bmg-mistral-7b "{{ remote_helm_charts_base }}/vllm" + --set LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3" + --set global.monitoring="{{ vllm_metrics_enabled }}" + --set svcmonitor.enabled="{{ vllm_metrics_enabled }}" + --set global.HUGGINGFACEHUB_API_TOKEN={{ hugging_face_token }} + {% if bmg_deployment|lower == "true" %} + --values {{ bmg_values_file }} + --set tensor_parallel_size={{ tensor_parallel_size | default(1) }} + {% endif %} + {% if apisix_enabled %} + --set apisix.enabled={{ apisix_enabled }} + --set platform={{ kubernetes_platform }} + {% endif %} + {% if ingress_enabled %} + --set ingress.enabled={{ ingress_enabled }} + --set ingress.host={{ secret_name }} + --set ingress.secretname={{ secret_name }} + {% if kubernetes_platform == "eks" %} + --set aws_certificate_arn={{ aws_certificate_arn | default('') }} + {% endif %} + {% endif %} + {% if deploy_keycloak == 'yes' and apisix_enabled %} + --set oidc.client_id={{ keycloak_client_id | default('') }} + --set oidc.client_secret={{ client_secret | default('') }} + {% endif %} + {{ helm_proxy_args | default('') }} + --force + register: helm_upgrade_install_bmg_mistral_7b + failed_when: 
helm_upgrade_install_bmg_mistral_7b.rc != 0 + tags: install-bmg-mistral-7b + - name: Register BMG Mistral-7B model + import_tasks: register-model-genai-gateway.yml + vars: + reg_model_name: "mistralai/Mistral-7B-Instruct-v0.3" + reg_litellm_model: "openai/mistralai/Mistral-7B-Instruct-v0.3" + reg_custom_llm_provider: "openai" + reg_api_base: "http://vllm-bmg-mistral-7b-service.default/v1" + reg_input_cost_per_token: 0.001 + reg_output_cost_per_token: 0.002 + tags: + - install-bmg-mistral-7b + - install-genai-gateway + run_once: true + when: + - "'install-bmg-mistral-7b' in ansible_run_tags" + - "'install-genai-gateway' in ansible_run_tags" + run_once: true + when: + - model_name_list is defined + - bmg_deployment|lower == "true" + - install_true == 'true' + - "'bmg-mistral-7b' in (model_name_list | regex_replace(',', ' ') | split())" + + - name: Check if vllm-bmg-mistral-7b Model is deployed + ansible.builtin.command: + cmd: "helm list --filter vllm-bmg-mistral-7b --short" + register: helm_release_bmg_mistral_7b_installed + ignore_errors: true + run_once: true + tags: uninstall-bmg-mistral-7b + - name: Uninstall vllm-bmg-mistral-7b Model + ansible.builtin.command: + cmd: "helm uninstall vllm-bmg-mistral-7b" + run_once: true + tags: uninstall-bmg-mistral-7b + when: + - "'bmg-mistral-7b' in (model_name_list | regex_replace(',', ' ') | split())" + - uninstall_true == 'true' + - helm_release_bmg_mistral_7b_installed.stdout != "" + + - name: Deploy BMG DeepSeek-R1-Distill-Llama-8B LLM Model + block: + - name: Delete Ingress resource BMG DeepSeek-R1-Distill-Llama-8B from default namespace + kubernetes.core.k8s: + kind: Ingress + namespace: default + name: vllm-bmg-deepseek-r1-distill-llama8b-ingress + state: absent + tags: install-bmg-deepseek-r1-distill-llama8b + - name: Delete Ingress resource BMG DeepSeek-R1-Distill-Llama-8B from auth-apisix namespace + kubernetes.core.k8s: + kind: Ingress + namespace: auth-apisix + name: vllm-bmg-deepseek-r1-distill-llama8b-ingress + 
state: absent + tags: install-bmg-deepseek-r1-distill-llama8b + - name: Deploy BMG LLM model DeepSeek-R1-Distill-Llama-8B + ansible.builtin.command: >- + helm upgrade --install vllm-bmg-deepseek-r1-distill-llama8b "{{ remote_helm_charts_base }}/vllm" + --set LLM_MODEL_ID="deepseek-ai/DeepSeek-R1-Distill-Llama-8B" + --set global.monitoring="{{ vllm_metrics_enabled }}" + --set svcmonitor.enabled="{{ vllm_metrics_enabled }}" + --set global.HUGGINGFACEHUB_API_TOKEN={{ hugging_face_token }} + {% if bmg_deployment|lower == "true" %} + --values {{ bmg_values_file }} + --set tensor_parallel_size={{ tensor_parallel_size | default(1) }} + {% endif %} + {% if apisix_enabled %} + --set apisix.enabled={{ apisix_enabled }} + --set platform={{ kubernetes_platform }} + {% endif %} + {% if ingress_enabled %} + --set ingress.enabled={{ ingress_enabled }} + --set ingress.host={{ secret_name }} + --set ingress.secretname={{ secret_name }} + {% if kubernetes_platform == "eks" %} + --set aws_certificate_arn={{ aws_certificate_arn | default('') }} + {% endif %} + {% endif %} + {% if deploy_keycloak == 'yes' and apisix_enabled %} + --set oidc.client_id={{ keycloak_client_id | default('') }} + --set oidc.client_secret={{ client_secret | default('') }} + {% endif %} + {{ helm_proxy_args | default('') }} + --force + register: helm_upgrade_install_bmg_deepseek_llama8b + failed_when: helm_upgrade_install_bmg_deepseek_llama8b.rc != 0 + tags: install-bmg-deepseek-r1-distill-llama8b + - name: Register BMG DeepSeek-R1-Distill-Llama-8B model + import_tasks: register-model-genai-gateway.yml + vars: + reg_model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" + reg_litellm_model: "openai/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" + reg_custom_llm_provider: "openai" + reg_api_base: "http://vllm-bmg-deepseek-r1-distill-llama8b-service.default/v1" + reg_input_cost_per_token: 0.001 + reg_output_cost_per_token: 0.002 + tags: + - install-bmg-deepseek-r1-distill-llama8b + - install-genai-gateway + run_once: 
true + when: + - "'install-bmg-deepseek-r1-distill-llama8b' in ansible_run_tags" + - "'install-genai-gateway' in ansible_run_tags" + run_once: true + when: + - model_name_list is defined + - bmg_deployment|lower == "true" + - install_true == 'true' + - "'bmg-deepseek-r1-distill-llama8b' in (model_name_list | regex_replace(',', ' ') | split())" + + - name: Check if vllm-bmg-deepseek-r1-distill-llama8b Model is deployed + ansible.builtin.command: + cmd: "helm list --filter vllm-bmg-deepseek-r1-distill-llama8b --short" + register: helm_release_bmg_deepseek_llama8b_installed + ignore_errors: true + run_once: true + tags: uninstall-bmg-deepseek-r1-distill-llama8b + - name: Uninstall vllm-bmg-deepseek-r1-distill-llama8b Model + ansible.builtin.command: + cmd: "helm uninstall vllm-bmg-deepseek-r1-distill-llama8b" + run_once: true + tags: uninstall-bmg-deepseek-r1-distill-llama8b + when: + - "'bmg-deepseek-r1-distill-llama8b' in (model_name_list | regex_replace(',', ' ') | split())" + - uninstall_true == 'true' + - helm_release_bmg_deepseek_llama8b_installed.stdout != "" + + - name: Deploy BMG Qwen2.5-7B LLM Model + block: + - name: Delete Ingress resource BMG Qwen2.5-7B from default namespace + kubernetes.core.k8s: + kind: Ingress + namespace: default + name: vllm-bmg-qwen-2-5-7b-ingress + state: absent + tags: install-bmg-qwen-2-5-7b + - name: Delete Ingress resource BMG Qwen2.5-7B from auth-apisix namespace + kubernetes.core.k8s: + kind: Ingress + namespace: auth-apisix + name: vllm-bmg-qwen-2-5-7b-ingress + state: absent + tags: install-bmg-qwen-2-5-7b + - name: Deploy BMG LLM model Qwen2.5-7B + ansible.builtin.command: >- + helm upgrade --install vllm-bmg-qwen-2-5-7b "{{ remote_helm_charts_base }}/vllm" + --set LLM_MODEL_ID="Qwen/Qwen2.5-7B-Instruct" + --set global.monitoring="{{ vllm_metrics_enabled }}" + --set svcmonitor.enabled="{{ vllm_metrics_enabled }}" + --set global.HUGGINGFACEHUB_API_TOKEN={{ hugging_face_token }} + {% if bmg_deployment|lower == "true" %} + 
--values {{ bmg_values_file }} + --set tensor_parallel_size={{ tensor_parallel_size | default(1) }} + {% endif %} + {% if apisix_enabled %} + --set apisix.enabled={{ apisix_enabled }} + --set platform={{ kubernetes_platform }} + {% endif %} + {% if ingress_enabled %} + --set ingress.enabled={{ ingress_enabled }} + --set ingress.host={{ secret_name }} + --set ingress.secretname={{ secret_name }} + {% if kubernetes_platform == "eks" %} + --set aws_certificate_arn={{ aws_certificate_arn | default('') }} + {% endif %} + {% endif %} + {% if deploy_keycloak == 'yes' and apisix_enabled %} + --set oidc.client_id={{ keycloak_client_id | default('') }} + --set oidc.client_secret={{ client_secret | default('') }} + {% endif %} + {{ helm_proxy_args | default('') }} + --force + register: helm_upgrade_install_bmg_qwen_2_5_7b + failed_when: helm_upgrade_install_bmg_qwen_2_5_7b.rc != 0 + tags: install-bmg-qwen-2-5-7b + - name: Register BMG Qwen2.5-7B model + import_tasks: register-model-genai-gateway.yml + vars: + reg_model_name: "Qwen/Qwen2.5-7B-Instruct" + reg_litellm_model: "openai/Qwen/Qwen2.5-7B-Instruct" + reg_custom_llm_provider: "openai" + reg_api_base: "http://vllm-bmg-qwen-2-5-7b-service.default/v1" + reg_input_cost_per_token: 0.001 + reg_output_cost_per_token: 0.002 + tags: + - install-bmg-qwen-2-5-7b + - install-genai-gateway + run_once: true + when: + - "'install-bmg-qwen-2-5-7b' in ansible_run_tags" + - "'install-genai-gateway' in ansible_run_tags" + run_once: true + when: + - model_name_list is defined + - bmg_deployment|lower == "true" + - install_true == 'true' + - "'bmg-qwen-2-5-7b' in (model_name_list | regex_replace(',', ' ') | split())" + + - name: Check if vllm-bmg-qwen-2-5-7b Model is deployed + ansible.builtin.command: + cmd: "helm list --filter vllm-bmg-qwen-2-5-7b --short" + register: helm_release_bmg_qwen_2_5_7b_installed + ignore_errors: true + run_once: true + tags: uninstall-bmg-qwen-2-5-7b + - name: Uninstall vllm-bmg-qwen-2-5-7b Model + 
ansible.builtin.command: + cmd: "helm uninstall vllm-bmg-qwen-2-5-7b" + run_once: true + tags: uninstall-bmg-qwen-2-5-7b + when: + - "'bmg-qwen-2-5-7b' in (model_name_list | regex_replace(',', ' ') | split())" + - uninstall_true == 'true' + - helm_release_bmg_qwen_2_5_7b_installed.stdout != "" + + - name: Deploy BMG Falcon3-7B LLM Model + block: + - name: Delete Ingress resource BMG Falcon3-7B from default namespace + kubernetes.core.k8s: + kind: Ingress + namespace: default + name: vllm-bmg-falcon3-7b-ingress + state: absent + tags: install-bmg-falcon3-7b + - name: Delete Ingress resource BMG Falcon3-7B from auth-apisix namespace + kubernetes.core.k8s: + kind: Ingress + namespace: auth-apisix + name: vllm-bmg-falcon3-7b-ingress + state: absent + tags: install-bmg-falcon3-7b + - name: Deploy BMG LLM model Falcon3-7B + ansible.builtin.command: >- + helm upgrade --install vllm-bmg-falcon3-7b "{{ remote_helm_charts_base }}/vllm" + --set LLM_MODEL_ID="tiiuae/Falcon3-7B-Instruct" + --set global.monitoring="{{ vllm_metrics_enabled }}" + --set svcmonitor.enabled="{{ vllm_metrics_enabled }}" + --set global.HUGGINGFACEHUB_API_TOKEN={{ hugging_face_token }} + {% if bmg_deployment|lower == "true" %} + --values {{ bmg_values_file }} + --set tensor_parallel_size={{ tensor_parallel_size | default(1) }} + {% endif %} + {% if apisix_enabled %} + --set apisix.enabled={{ apisix_enabled }} + --set platform={{ kubernetes_platform }} + {% endif %} + {% if ingress_enabled %} + --set ingress.enabled={{ ingress_enabled }} + --set ingress.host={{ secret_name }} + --set ingress.secretname={{ secret_name }} + {% if kubernetes_platform == "eks" %} + --set aws_certificate_arn={{ aws_certificate_arn | default('') }} + {% endif %} + {% endif %} + {% if deploy_keycloak == 'yes' and apisix_enabled %} + --set oidc.client_id={{ keycloak_client_id | default('') }} + --set oidc.client_secret={{ client_secret | default('') }} + {% endif %} + {{ helm_proxy_args | default('') }} + --force + register: 
helm_upgrade_install_bmg_falcon3_7b + failed_when: helm_upgrade_install_bmg_falcon3_7b.rc != 0 + tags: install-bmg-falcon3-7b + - name: Register BMG Falcon3-7B model + import_tasks: register-model-genai-gateway.yml + vars: + reg_model_name: "tiiuae/Falcon3-7B-Instruct" + reg_litellm_model: "openai/tiiuae/Falcon3-7B-Instruct" + reg_custom_llm_provider: "openai" + reg_api_base: "http://vllm-bmg-falcon3-7b-service.default/v1" + reg_input_cost_per_token: 0.001 + reg_output_cost_per_token: 0.002 + tags: + - install-bmg-falcon3-7b + - install-genai-gateway + run_once: true + when: + - "'install-bmg-falcon3-7b' in ansible_run_tags" + - "'install-genai-gateway' in ansible_run_tags" + run_once: true + when: + - model_name_list is defined + - bmg_deployment|lower == "true" + - install_true == 'true' + - "'bmg-falcon3-7b' in (model_name_list | regex_replace(',', ' ') | split())" + + - name: Check if vllm-bmg-falcon3-7b Model is deployed + ansible.builtin.command: + cmd: "helm list --filter vllm-bmg-falcon3-7b --short" + register: helm_release_bmg_falcon3_7b_installed + ignore_errors: true + run_once: true + tags: uninstall-bmg-falcon3-7b + - name: Uninstall vllm-bmg-falcon3-7b Model + ansible.builtin.command: + cmd: "helm uninstall vllm-bmg-falcon3-7b" + run_once: true + tags: uninstall-bmg-falcon3-7b + when: + - "'bmg-falcon3-7b' in (model_name_list | regex_replace(',', ' ') | split())" + - uninstall_true == 'true' + - helm_release_bmg_falcon3_7b_installed.stdout != "" + + - name: Deploy BMG Qwen2.5-Coder-3B LLM Model + block: + - name: Delete Ingress resource BMG Qwen2.5-Coder-3B from default namespace + kubernetes.core.k8s: + kind: Ingress + namespace: default + name: vllm-bmg-qwen-2-5-coder-3b-ingress + state: absent + tags: install-bmg-qwen-2-5-coder-3b + - name: Delete Ingress resource BMG Qwen2.5-Coder-3B from auth-apisix namespace + kubernetes.core.k8s: + kind: Ingress + namespace: auth-apisix + name: vllm-bmg-qwen-2-5-coder-3b-ingress + state: absent + tags: 
install-bmg-qwen-2-5-coder-3b + - name: Deploy BMG LLM model Qwen2.5-Coder-3B + ansible.builtin.command: >- + helm upgrade --install vllm-bmg-qwen-2-5-coder-3b "{{ remote_helm_charts_base }}/vllm" + --set LLM_MODEL_ID="Qwen/Qwen2.5-Coder-3B-Instruct" + --set global.monitoring="{{ vllm_metrics_enabled }}" + --set svcmonitor.enabled="{{ vllm_metrics_enabled }}" + --set global.HUGGINGFACEHUB_API_TOKEN={{ hugging_face_token }} + {% if bmg_deployment|lower == "true" %} + --values {{ bmg_values_file }} + --set tensor_parallel_size={{ tensor_parallel_size | default(1) }} + {% endif %} + {% if apisix_enabled %} + --set apisix.enabled={{ apisix_enabled }} + --set platform={{ kubernetes_platform }} + {% endif %} + {% if ingress_enabled %} + --set ingress.enabled={{ ingress_enabled }} + --set ingress.host={{ secret_name }} + --set ingress.secretname={{ secret_name }} + {% if kubernetes_platform == "eks" %} + --set aws_certificate_arn={{ aws_certificate_arn | default('') }} + {% endif %} + {% endif %} + {% if deploy_keycloak == 'yes' and apisix_enabled %} + --set oidc.client_id={{ keycloak_client_id | default('') }} + --set oidc.client_secret={{ client_secret | default('') }} + {% endif %} + {{ helm_proxy_args | default('') }} + --force + register: helm_upgrade_install_bmg_qwen_2_5_coder_3b + failed_when: helm_upgrade_install_bmg_qwen_2_5_coder_3b.rc != 0 + tags: install-bmg-qwen-2-5-coder-3b + - name: Register BMG Qwen2.5-Coder-3B model + import_tasks: register-model-genai-gateway.yml + vars: + reg_model_name: "Qwen/Qwen2.5-Coder-3B-Instruct" + reg_litellm_model: "openai/Qwen/Qwen2.5-Coder-3B-Instruct" + reg_custom_llm_provider: "openai" + reg_api_base: "http://vllm-bmg-qwen-2-5-coder-3b-service.default/v1" + reg_input_cost_per_token: 0.001 + reg_output_cost_per_token: 0.002 + tags: + - install-bmg-qwen-2-5-coder-3b + - install-genai-gateway + run_once: true + when: + - "'install-bmg-qwen-2-5-coder-3b' in ansible_run_tags" + - "'install-genai-gateway' in ansible_run_tags" + 
run_once: true + when: + - model_name_list is defined + - bmg_deployment|lower == "true" + - install_true == 'true' + - "'bmg-qwen-2-5-coder-3b' in (model_name_list | regex_replace(',', ' ') | split())" + + - name: Check if vllm-bmg-qwen-2-5-coder-3b Model is deployed + ansible.builtin.command: + cmd: "helm list --filter vllm-bmg-qwen-2-5-coder-3b --short" + register: helm_release_bmg_qwen_2_5_coder_3b_installed + ignore_errors: true + run_once: true + tags: uninstall-bmg-qwen-2-5-coder-3b + - name: Uninstall vllm-bmg-qwen-2-5-coder-3b Model + ansible.builtin.command: + cmd: "helm uninstall vllm-bmg-qwen-2-5-coder-3b" + run_once: true + tags: uninstall-bmg-qwen-2-5-coder-3b + when: + - "'bmg-qwen-2-5-coder-3b' in (model_name_list | regex_replace(',', ' ') | split())" + - uninstall_true == 'true' + - helm_release_bmg_qwen_2_5_coder_3b_installed.stdout != "" + - name: List of Models to be Installed tags: always run_once: true diff --git a/core/playbooks/deploy-intel-gpu-plugin.yml b/core/playbooks/deploy-intel-gpu-plugin.yml new file mode 100644 index 00000000..a0ac78a1 --- /dev/null +++ b/core/playbooks/deploy-intel-gpu-plugin.yml @@ -0,0 +1,106 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 +--- +- name: Deploy Intel GPU Plugin for Intel Arc Battlemage (BMG) GPU + hosts: "{{ inference_delegate | default('kube_control_plane') }}" + gather_facts: false + any_errors_fatal: "{{ any_errors_fatal | default(true) }}" + environment: "{{ proxy_disable_env | default(env_proxy | default({})) }}" + vars: + intel_gpu_plugin_namespace: "intel-system" + intel_gpu_plugin_version: "{{ intel_gpu_plugin_version | default('0.35.0') }}" + tasks: + - name: Create Intel System namespace + kubernetes.core.k8s: + name: "{{ intel_gpu_plugin_namespace }}" + api_version: v1 + kind: Namespace + state: present + run_once: true + + - name: Deploy Node Feature Discovery (NFD) + ansible.builtin.command: + cmd: >- + kubectl apply -k + 
https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd?ref=v{{ intel_gpu_plugin_version }} + run_once: true + register: nfd_deploy + retries: 3 + delay: 10 + until: nfd_deploy.rc == 0 + changed_when: "'configured' in nfd_deploy.stdout or 'created' in nfd_deploy.stdout" + + - name: Wait for NFD master to be ready + ansible.builtin.command: + cmd: kubectl rollout status deployment/nfd-master -n node-feature-discovery --timeout=120s + run_once: true + register: nfd_ready + retries: 6 + delay: 20 + until: nfd_ready.rc == 0 + changed_when: false + + - name: Deploy NodeFeatureRules for Intel GPU detection + ansible.builtin.command: + cmd: >- + kubectl apply -k + https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/node-feature-rules?ref=v{{ intel_gpu_plugin_version }} + run_once: true + register: nfd_rules_deploy + retries: 3 + delay: 10 + until: nfd_rules_deploy.rc == 0 + changed_when: "'configured' in nfd_rules_deploy.stdout or 'created' in nfd_rules_deploy.stdout" + + - name: Wait for nodes to be labeled with Intel GPU feature + ansible.builtin.command: + cmd: >- + kubectl get nodes -l intel.feature.node.kubernetes.io/gpu=true -o name + run_once: true + register: gpu_nodes + retries: 18 + delay: 10 + until: gpu_nodes.stdout_lines | length > 0 + changed_when: false + + - name: Deploy Intel GPU Device Plugin via NFD-labeled nodes overlay + ansible.builtin.command: + cmd: >- + kubectl apply -k + https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/nfd_labeled_nodes?ref=v{{ intel_gpu_plugin_version }} + run_once: true + register: gpu_plugin_deploy + retries: 3 + delay: 10 + until: gpu_plugin_deploy.rc == 0 + changed_when: "'configured' in gpu_plugin_deploy.stdout or 'created' in gpu_plugin_deploy.stdout" + + - name: Wait for Intel GPU Device Plugin DaemonSet to be ready + kubernetes.core.k8s_info: + kind: DaemonSet + label_selectors: + - "app=intel-gpu-plugin" + namespace: "{{ 
intel_gpu_plugin_namespace }}" + register: gpu_plugin_ds + until: + - gpu_plugin_ds.resources | length > 0 + - gpu_plugin_ds.resources[0].status.numberReady is defined + - gpu_plugin_ds.resources[0].status.numberReady >= 1 + retries: 30 + delay: 10 + run_once: true + ignore_errors: true + + - name: Print Intel GPU Plugin deployment status + debug: + msg: + - "================================================================" + - "Intel GPU Plugin Deployment Status" + - "================================================================" + - "Plugin Version: {{ intel_gpu_plugin_version }}" + - "Namespace: {{ intel_gpu_plugin_namespace }}" + - "GPU-labeled nodes: {{ gpu_nodes.stdout_lines | join(', ') }}" + - "The Intel GPU Plugin enables Intel Arc Battlemage (BMG) GPU" + - "resources (gpu.intel.com/xe) on Kubernetes nodes." + - "================================================================" + run_once: true diff --git a/core/roles/inference-tools/tasks/main.yml b/core/roles/inference-tools/tasks/main.yml index 12801161..6605697c 100644 --- a/core/roles/inference-tools/tasks/main.yml +++ b/core/roles/inference-tools/tasks/main.yml @@ -7,59 +7,21 @@ state: present become: true tags: always -- name: Install Kubernetes Python SDK - ansible.builtin.pip: - name: kubernetes +- name: Install Kubernetes Python SDK (apt) + ansible.builtin.package: + name: python3-kubernetes state: present become: true ignore_errors: true - register: pip_install_result + register: apt_k8s_result tags: always -- name: Install Kubernetes Python SDK Fallback +- name: Install Kubernetes Python SDK (pip fallback) ansible.builtin.pip: name: kubernetes state: present - extra_args: "--break-system-packages" + extra_args: "--break-system-packages --ignore-installed pyyaml" become: true - when: pip_install_result is failed - tags: always -- name: Deploy fix script for kubernetes SDK no_proxy bug - ansible.builtin.copy: - dest: /tmp/_fix_k8s_no_proxy.py - mode: '0644' - content: | - import re, sys, os - 
path = sys.argv[1] - with open(path, 'r') as f: - original = f.read() - # Remove the duplicate self.no_proxy = None that appears after the no_proxy - # env-loading block (bug introduced in kubernetes SDK >= 34.x by code generator). - # Pattern: the env-loading line is followed within 3 lines by a bare self.no_proxy = None - fixed = re.sub( - r'(if os\.getenv\("no_proxy"\)[^\n]+\n(?:.*\n){1,3}?)(\s+self\.no_proxy = None\n)', - r'\1', - original - ) - if fixed == original: - print("OK: no duplicate no_proxy line found, nothing to do") - sys.exit(0) - with open(path, 'w') as f: - f.write(fixed) - print("FIXED: removed duplicate self.no_proxy = None from " + path) - sys.exit(2) - become: true - tags: always -- name: Fix kubernetes SDK no_proxy bug (duplicate self.no_proxy = None after env-loading block) - ansible.builtin.shell: | - set -e - KUBE_CFG=$(python3 -c "import kubernetes, os; print(os.path.join(os.path.dirname(kubernetes.__file__), 'client', 'configuration.py'))" 2>/dev/null) || exit 0 - python3 /tmp/_fix_k8s_no_proxy.py "$KUBE_CFG" - args: - executable: /bin/bash - become: true - register: _no_proxy_fix - changed_when: _no_proxy_fix.rc == 2 - failed_when: _no_proxy_fix.rc not in [0, 2] + when: apt_k8s_result is failed tags: always - name: Deploy fix script for kubernetes SDK no_proxy bug ansible.builtin.copy: diff --git a/core/scripts/generate-token.sh b/core/scripts/generate-token.sh old mode 100644 new mode 100755 index 8ce5adf1..40f7c7d0 --- a/core/scripts/generate-token.sh +++ b/core/scripts/generate-token.sh @@ -4,9 +4,9 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" export BASE_URL=api.example.com # The base URL of the Keycloak server, note https:// is omitted -export KEYCLOAK_ADMIN_USERNAME=your-keycloak-admin-user # The username for Keycloak admin login -export KEYCLOAK_PASSWORD=changeme # The password for Keycloak admin login -export KEYCLOAK_CLIENT_ID=my-client-id # The client ID to be created in Keycloak +export 
KEYCLOAK_ADMIN_USERNAME=admin # The username for Keycloak admin login +export KEYCLOAK_PASSWORD=password # The password for Keycloak admin login +export KEYCLOAK_CLIENT_ID=my-client # The client ID to be created in Keycloak export KEYCLOAK_CLIENT_SECRET=$(bash "${SCRIPT_DIR}/keycloak-fetch-client-secret.sh" ${BASE_URL} ${KEYCLOAK_ADMIN_USERNAME} ${KEYCLOAK_PASSWORD} ${KEYCLOAK_CLIENT_ID} | awk -F': ' '/Client secret:/ {print $2}') @@ -43,4 +43,4 @@ export TOKEN=$(curl -k -s -X POST \ echo "BASE_URL=${BASE_URL}" echo "TOKEN=${TOKEN}" -echo "TOKEN_LIFESPAN=${TOKEN_LIFESPAN} seconds ($(( TOKEN_LIFESPAN / 60 )) minutes)" \ No newline at end of file +echo "TOKEN_LIFESPAN=${TOKEN_LIFESPAN} seconds ($(( TOKEN_LIFESPAN / 60 )) minutes)" diff --git a/core/scripts/generate-vault-secrets.sh b/core/scripts/generate-vault-secrets.sh old mode 100644 new mode 100755 diff --git a/core/scripts/keycloak-fetch-client-secret.sh b/core/scripts/keycloak-fetch-client-secret.sh index bbd3427c..5ee53de0 100644 --- a/core/scripts/keycloak-fetch-client-secret.sh +++ b/core/scripts/keycloak-fetch-client-secret.sh @@ -45,8 +45,8 @@ TOKEN=$(curl -s -k -X POST "$KEYCLOAK_URL/realms/master/protocol/openid-connect/ -H "Content-Type: application/x-www-form-urlencoded" \ -d "username=$USERNAME" \ -d "password=$PASSWORD" \ - -d 'grant_type=password' \ - -d 'client_id=admin-cli' | jq -r '.access_token') + -d "grant_type=password" \ + -d "client_id=admin-cli" | jq -r '.access_token') if [ -z "$TOKEN" ]; then echo "Login failed" diff --git a/core/scripts/vllm-quickstart/README.md b/core/scripts/vllm-quickstart/README.md index 7dba3473..5da5bb47 100644 --- a/core/scripts/vllm-quickstart/README.md +++ b/core/scripts/vllm-quickstart/README.md @@ -1,11 +1,12 @@ ## 📋 Overview -The `vllm-model-runner.sh` launcher script simplifies the deployment of popular open-source LLMs with optimized configurations for CPU-based inference. 
It handles dependency installation, hardware detection, Docker container management, and health monitoring automatically. +The `vllm-model-runner.sh` launcher script simplifies the deployment of popular open-source LLMs with optimized configurations for both CPU and Intel GPU/XPU inference. It handles dependency installation, hardware detection, Docker container management, and health monitoring automatically. ## ✨ Features - **One-Command Deployment** — Interactive model selection and automated setup - **Multi-Model Support** — Pre-configured profiles for popular LLMs +- **Dual Runtime Support** — Switch between CPU and Intel XPU profiles with `--runtime` - **Custom Port Configuration** — Run the server on any port with `-p` option - **Hardware Auto-Detection** — Automatically configures tensor/pipeline parallelism based on NUMA topology - **Dependency Management** — Installs Docker, jq, curl, and git if missing @@ -18,7 +19,9 @@ The `vllm-model-runner.sh` launcher script simplifies the deployment of popular - **Operating System**: Ubuntu - **HuggingFace Token**: Required for downloading models - **Sudo Access**: Required for dependency installation -- **Hardware**: CPU with sufficient RAM for model inference +- **Hardware**: + - CPU runtime: Intel Xeon class CPU with sufficient RAM + - XPU runtime: Intel GPU (for example Battlemage/BMG) with `/dev/dri` available on host > **Note:** The script will automatically install Docker, jq, curl, and git if they are not present. @@ -42,6 +45,13 @@ The `vllm-model-runner.sh` launcher script simplifies the deployment of popular ./vllm-model-runner.sh ``` +To explicitly choose runtime: + +```bash +./vllm-model-runner.sh --runtime cpu +./vllm-model-runner.sh --runtime xpu +``` + To run on a custom port: ```bash @@ -53,11 +63,12 @@ To run on a custom port: The script will: 1. Check and install any missing dependencies 2. Validate your environment and HuggingFace token -3. Display available models for selection -4. 
Detect hardware configuration for optimal parallelism -5. Pull the vLLM Docker image (if not cached) -6. Start the vLLM server container -7. Perform health checks until the server is ready +3. Resolve the runtime profile (`cpu` or `xpu`) +4. Display available models for selection +5. Detect hardware configuration for optimal parallelism +6. Pull the vLLM Docker image (if not cached) +7. Start the vLLM server container +8. Perform health checks until the server is ready ### Example Session @@ -116,16 +127,22 @@ The `models.json` file contains all configuration: ```json { "docker": { - "image": "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.11.2", + "default_runtime": "cpu", + "runtime_profiles": { + "cpu": { + "image": "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.11.2" + }, + "xpu": { + "image": "intel/vllm:0.14.1-xpu" + } + }, "port": "8000:8000", "environment": { ... }, "volumes": [ ... ] }, "global_defaults": { - "block_size": 128, - "dtype": "bfloat16", - "trust_remote_code": true, - ... + "cpu": { ... }, + "xpu": { ... 
} }, "models": { "model-key": { @@ -185,6 +202,7 @@ docker logs -f vllm-container | `Permission denied` | Add user to docker group: `sudo usermod -aG docker $USER` then logout/login | | `Container keeps stopping` | Check logs: `docker logs vllm-container` — usually indicates insufficient memory | | `Health check timeout` | Model loading can take several minutes; check logs for progress | +| `XPU runtime fails to start` | Ensure Intel GPU drivers are installed and `/dev/dri/renderD*` exists on host | ### Stop the Server diff --git a/core/scripts/vllm-quickstart/models.json b/core/scripts/vllm-quickstart/models.json index e7e24eff..53a2f6f5 100644 --- a/core/scripts/vllm-quickstart/models.json +++ b/core/scripts/vllm-quickstart/models.json @@ -1,31 +1,72 @@ { "docker": { - "image": "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.11.2", + "default_runtime": "cpu", "port": "8000:8000", - "environment": { - "VLLM_CPU_SGL_KERNEL": "1", - "VLLM_CPU_KVCACHE_SPACE": "40", - "VLLM_RPC_TIMEOUT": "100000", - "VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1", - "VLLM_ENGINE_ITERATION_TIMEOUT_S": "120", - "VLLM_CPU_NUM_OF_RESERVED_CPU": "0" - }, - "volumes": [ - "/root/.cache/huggingface:/root/.cache/huggingface", - "/opt/vllm/examples:/workspace/examples" - ] + "runtime_profiles": { + "cpu": { + "image": "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.11.2", + "environment": { + "VLLM_CPU_SGL_KERNEL": "1", + "VLLM_CPU_KVCACHE_SPACE": "40", + "VLLM_RPC_TIMEOUT": "100000", + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1", + "VLLM_ENGINE_ITERATION_TIMEOUT_S": "120", + "VLLM_CPU_NUM_OF_RESERVED_CPU": "0" + }, + "volumes": [ + "/root/.cache/huggingface:/root/.cache/huggingface", + "/opt/vllm/examples:/workspace/examples" + ] + }, + "xpu": { + "image": "intel/vllm:0.14.1-xpu", + "group_add": [ + "video", + "render" + ], + "devices": [ + "/dev/dri:/dev/dri" + ], + "shm_size": "16g", + "environment": { + "VLLM_RPC_TIMEOUT": "100000", + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1", + 
"VLLM_ENGINE_ITERATION_TIMEOUT_S": "120", + "SYCL_CACHE_PERSISTENT": "1" + }, + "volumes": [ + "/root/.cache/huggingface:/root/.cache/huggingface", + "/opt/vllm/examples:/workspace/examples" + ] + } + } }, "global_defaults": { - "block_size": 128, - "dtype": "bfloat16", - "distributed_executor_backend": "mp", - "trust_remote_code": true, - "enable_chunked_prefill": true, - "enforce_eager": true, - "max_num_batched_tokens": 2048, - "max_num_seqs": 256, - "disable_log_requests": true, - "enable_auto_tool_choice": true + "cpu": { + "block_size": 128, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "trust_remote_code": true, + "enable_chunked_prefill": true, + "enforce_eager": true, + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "disable_log_requests": true, + "enable_auto_tool_choice": true + }, + "xpu": { + "device": "xpu", + "block_size": 128, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "trust_remote_code": true, + "enable_chunked_prefill": true, + "enforce_eager": true, + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "disable_log_requests": true, + "enable_auto_tool_choice": true + } }, "models": { "llama-8B": { diff --git a/core/scripts/vllm-quickstart/vllm-model-runner.sh b/core/scripts/vllm-quickstart/vllm-model-runner.sh index 881cf02d..a2e62e0c 100755 --- a/core/scripts/vllm-quickstart/vllm-model-runner.sh +++ b/core/scripts/vllm-quickstart/vllm-model-runner.sh @@ -16,6 +16,7 @@ readonly CONTAINER_NAME="vllm-container" # Port configuration (can be overridden via command line) PORT="8000" HEALTHCHECK_URL="http://localhost:${PORT}/health" +RUNTIME="" # Colors for output readonly RED='\033[0;31m' @@ -33,13 +34,16 @@ show_usage() { echo "Usage: $0 [OPTIONS]" echo "" echo "Options:" - echo " -p, --port PORT Port to run the vLLM server on (default: 8000)" - echo " -h, --help Display this help message" + echo " -p, --port PORT Port to run the vLLM server on (default: 8000)" + echo " -r, --runtime RT Runtime 
profile: cpu or xpu (default from models.json)" + echo " -h, --help Display this help message" echo "" echo "Examples:" - echo " $0 # Start vLLM on default port 8000" - echo " $0 -p 8080 # Start vLLM on port 8080" - echo " $0 --port 9000 # Start vLLM on port 9000" + echo " $0 # Start vLLM with default runtime on port 8000" + echo " $0 -p 8080 # Start vLLM on port 8080" + echo " $0 -r cpu # Start vLLM using CPU runtime profile" + echo " $0 -r xpu # Start vLLM using XPU runtime profile" + echo " $0 --port 9000 # Start vLLM on port 9000" } # Parse command line arguments @@ -60,6 +64,15 @@ parse_arguments() { PORT="$2" shift 2 ;; + -r|--runtime) + if [[ -z "$2" || "$2" == -* ]]; then + echo "Error: --runtime requires a value (cpu or xpu)" + show_usage + exit 1 + fi + RUNTIME=$(echo "$2" | tr '[:upper:]' '[:lower:]') + shift 2 + ;; -h|--help) show_usage exit 0 @@ -105,7 +118,7 @@ cleanup_and_exit() { if [[ "$exit_code" -ne 0 ]]; then log "ERROR" "$message" - printf "${RED}❌ %s${NC}\n" "$message" + printf "${RED}[ERROR] %s${NC}\n" "$message" printf "${YELLOW}Check %s for detailed logs.${NC}\n" "$LOG_FILE" fi @@ -339,12 +352,19 @@ validate_environment() { cleanup_and_exit 1 "Invalid JSON syntax in $CONFIG_FILE" fi + # Resolve runtime profile from config/CLI + resolve_runtime + # Check Docker daemon check_docker_access if ! ${USE_SUDO}docker info >/dev/null 2>&1; then cleanup_and_exit 1 "Docker daemon is not running or not accessible." 
fi + if is_xpu_mode; then + validate_xpu_environment + fi + log "SUCCESS" "Environment validation completed" } @@ -362,6 +382,99 @@ check_docker_access() { fi } +# Resolve runtime profile from CLI or config +resolve_runtime() { + local has_profiles + has_profiles=$(jq -r 'if .docker.runtime_profiles then "yes" else "no" end' "$CONFIG_FILE") + + if [[ "$has_profiles" == "yes" ]]; then + if [[ -z "$RUNTIME" ]]; then + RUNTIME=$(jq -r '.docker.default_runtime // empty' "$CONFIG_FILE") + if [[ -z "$RUNTIME" || "$RUNTIME" == "null" ]]; then + RUNTIME=$(jq -r '.docker.runtime_profiles | keys[0]' "$CONFIG_FILE") + fi + fi + + if ! jq -e ".docker.runtime_profiles.\"$RUNTIME\"" "$CONFIG_FILE" >/dev/null 2>&1; then + cleanup_and_exit 1 "Runtime '$RUNTIME' not found in $CONFIG_FILE. Available: $(jq -r '.docker.runtime_profiles | keys | join(", ")' "$CONFIG_FILE")" + fi + else + if [[ -z "$RUNTIME" ]]; then + RUNTIME="cpu" + fi + fi + + log "INFO" "Using runtime profile: $RUNTIME" +} + +# Detect whether selected runtime is XPU +is_xpu_mode() { + local docker_image + docker_image=$(jq -r "if .docker.runtime_profiles then .docker.runtime_profiles.\"$RUNTIME\".image else .docker.image end" "$CONFIG_FILE") + + if [[ "$RUNTIME" == "xpu" || "$docker_image" == *"-xpu"* ]]; then + return 0 + fi + + return 1 +} + +# Validate host prerequisites for Intel GPU/XPU containers +validate_xpu_environment() { + log "INFO" "XPU mode detected. Validating Intel GPU device access..." + + if [[ ! -e "/dev/dri" ]]; then + cleanup_and_exit 1 "XPU mode requires /dev/dri on the host. Install Intel GPU runtime/driver and verify device nodes." + fi + + if ! ls /dev/dri/renderD* >/dev/null 2>&1; then + cleanup_and_exit 1 "No render device found under /dev/dri. Ensure Intel Battlemage GPU drivers are loaded." + fi + + log "SUCCESS" "Detected /dev/dri render node(s) for XPU execution" +} + +# Resolve configured Docker supplementary groups to a host-supported value. 
+# Prefer a named group when it exists; otherwise fall back to the numeric GID +# of the relevant /dev/dri device node so Docker can still grant access. +resolve_docker_group_add() { + local group_name="$1" + local device_glob="" + local resolved_gid="" + + if [[ -z "$group_name" ]]; then + return 1 + fi + + if getent group "$group_name" >/dev/null 2>&1; then + printf "%s\n" "$group_name" + return 0 + fi + + case "$group_name" in + render) + device_glob="/dev/dri/renderD*" + ;; + video) + device_glob="/dev/dri/card* /dev/dri/renderD*" + ;; + *) + ;; + esac + + if [[ -n "$device_glob" ]]; then + resolved_gid=$(stat -c '%g' $device_glob 2>/dev/null | awk 'NF { print; exit }') + if [[ -n "$resolved_gid" ]]; then + log "WARN" "Host group '$group_name' is missing; using device GID '$resolved_gid' instead" >&2 + printf "%s\n" "$resolved_gid" + return 0 + fi + fi + + log "WARN" "Skipping Docker group '$group_name' because it is not defined on the host" >&2 + return 1 +} + # Load and parse configuration load_configuration() { log "INFO" "Loading configuration from $CONFIG_FILE" @@ -479,14 +592,14 @@ build_vllm_args() { # Start building arguments local args="--model $model_path" - # Add global defaults + # Add global defaults (runtime-specific if configured) while IFS='=' read -r key value; do if [[ "$value" == "true" ]]; then args="$args --$key" elif [[ "$value" != "false" && "$value" != "null" ]]; then args="$args --$(echo "$key" | tr '_' '-') $value" fi - done < <(jq -r '.global_defaults | to_entries[] | "\(.key)=\(.value)"' "$CONFIG_FILE") + done < <(jq -r "if (.global_defaults.\"$RUNTIME\") then .global_defaults.\"$RUNTIME\" else .global_defaults end | to_entries[] | \"\(.key)=\(.value)\"" "$CONFIG_FILE") # Add model-specific arguments while IFS='=' read -r key value; do @@ -647,7 +760,11 @@ start_vllm_container() { # Get Docker image first local docker_image - docker_image=$(jq -r '.docker.image' "$CONFIG_FILE") + docker_image=$(jq -r "if .docker.runtime_profiles then 
.docker.runtime_profiles.\"$RUNTIME\".image else .docker.image end" "$CONFIG_FILE") + + if [[ -z "$docker_image" || "$docker_image" == "null" ]]; then + cleanup_and_exit 1 "Docker image is not configured for runtime '$RUNTIME'" + fi # Pull the image first to avoid confusion during container start if ! pull_docker_image "$docker_image"; then @@ -660,16 +777,54 @@ start_vllm_container() { # Add port mapping (use user-specified PORT, mapping host port to container port 8000) docker_cmd="$docker_cmd -p ${PORT}:8000" + # Add optional shared memory configuration + local shm_size + shm_size=$(jq -r "if .docker.runtime_profiles then .docker.runtime_profiles.\"$RUNTIME\".shm_size // empty else .docker.shm_size // empty end" "$CONFIG_FILE") + if [[ -n "$shm_size" ]]; then + docker_cmd="$docker_cmd --shm-size $shm_size" + fi + + # Add optional supplementary groups (skip render by request) + while read -r group_name; do + local resolved_group + if [[ "$group_name" == "render" ]]; then + continue + fi + if [[ -n "$group_name" ]] && resolved_group=$(resolve_docker_group_add "$group_name"); then + docker_cmd="$docker_cmd --group-add $resolved_group" + fi + done < <(jq -r "if .docker.runtime_profiles then .docker.runtime_profiles.\"$RUNTIME\".group_add[]? else .docker.group_add[]? end" "$CONFIG_FILE") + + local use_intel_xpu_image=false + if [[ "$docker_image" == "intel/vllm:0.14.1-xpu" ]]; then + use_intel_xpu_image=true + fi + + # Add optional device mappings (e.g. /dev/dri for XPU) + while read -r device; do + [[ -n "$device" ]] && docker_cmd="$docker_cmd --device $device" + done < <(jq -r "if .docker.runtime_profiles then .docker.runtime_profiles.\"$RUNTIME\".devices[]? else .docker.devices[]? 
end" "$CONFIG_FILE") + # Add environment variables - docker_cmd="$docker_cmd -e HUGGING_FACE_HUB_TOKEN=$HFToken" + local escaped_value + printf -v escaped_value '%q' "$HFToken" + docker_cmd="$docker_cmd -e HUGGING_FACE_HUB_TOKEN=$escaped_value" while IFS='=' read -r key value; do - docker_cmd="$docker_cmd -e $key=$value" - done < <(jq -r '.docker.environment | to_entries[] | "\(.key)=\(.value)"' "$CONFIG_FILE") + printf -v escaped_value '%q' "$value" + docker_cmd="$docker_cmd -e $key=$escaped_value" + done < <(jq -r "if .docker.runtime_profiles then (.docker.runtime_profiles.\"$RUNTIME\".environment // {}) else (.docker.environment // {}) end | to_entries[] | \"\(.key)=\(.value)\"" "$CONFIG_FILE") + + if [[ "$use_intel_xpu_image" == "true" ]]; then + docker_cmd="$docker_cmd -e VLLM_TARGET_DEVICE=xpu" + docker_cmd="$docker_cmd -e VLLM_LOGGING_LEVEL=DEBUG" + docker_cmd="$docker_cmd -e ZE_FLAT_DEVICE_HIERARCHY=FLAT" + docker_cmd="$docker_cmd -e ONEAPI_DEVICE_SELECTOR='level_zero:gpu;opencl:gpu'" + fi # Add volume mounts while read -r volume; do [[ -n "$volume" ]] && docker_cmd="$docker_cmd -v $volume" - done < <(jq -r '.docker.volumes[]?' "$CONFIG_FILE") + done < <(jq -r "if .docker.runtime_profiles then .docker.runtime_profiles.\"$RUNTIME\".volumes[]? else .docker.volumes[]? 
end" "$CONFIG_FILE") # Add Docker image and vLLM arguments docker_cmd="$docker_cmd --ipc=host $docker_image $vllm_args" @@ -756,7 +911,7 @@ perform_health_check() { if [[ "$http_code" == "200" ]]; then log "SUCCESS" "vLLM server is healthy and responding" - printf "${GREEN}✅ vLLM server is running successfully at %s${NC}\n" "$HEALTHCHECK_URL" + printf "${GREEN}[OK] vLLM server is running successfully at %s${NC}\n" "$HEALTHCHECK_URL" return 0 elif [[ -n "$http_code" && "$http_code" != "000" ]]; then # We got a response but not 200, show what we got @@ -771,7 +926,7 @@ perform_health_check() { done log "ERROR" "Health check failed after $max_attempts attempts" - printf "${RED}❌ vLLM server failed to start or is not responding${NC}\n" + printf "${RED}[ERROR] vLLM server failed to start or is not responding${NC}\n" printf "${YELLOW}The server may still be initializing. Check logs with: ${USE_SUDO}docker logs %s${NC}\n" "$CONTAINER_NAME" return 1 } @@ -797,8 +952,12 @@ main() { load_configuration # Hardware detection - local parallel_config - parallel_config=$(compute_parallel_config) + local parallel_config="" + if is_xpu_mode; then + log "INFO" "Skipping CPU NUMA parallel auto-configuration for XPU mode" + else + parallel_config=$(compute_parallel_config) + fi # User interaction local selected_model diff --git a/docs/examples/single-node/README.md b/docs/examples/single-node/README.md index 8a0786ac..b98abd32 100644 --- a/docs/examples/single-node/README.md +++ b/docs/examples/single-node/README.md @@ -1,6 +1,6 @@ # Setup Single Node Using Ansible -These playbooks sets up a single node inference environment on either a Intel® AI Accelerator or Intel® Xeon node using Ansible. It is designed to be run on the Intel® AI Accelerator or Intel® Xeon node where the Intel® AI for Enterprise Inference Service will be deployed. 
The playbooks installs all necessary dependencies, configures the environment, and prepares the system for the Intel® AI for Enterprise Inference Service. If you are going to use Intel® AI Accelerator, you will need to have the Intel® AI Accelerator drivers and firmware installed on the system before running this playbook, for more information on installing the Intel® AI Accelerator drivers and firmware, refer to the [Intel® AI Accelerator Drivers Installation Guide](../../intel-ai-accelerator-prerequisites.md). +These playbooks set up a single node inference environment on an Intel® AI Accelerator (Gaudi), Intel® Arc™ Battlemage (BMG) GPU, or Intel® Xeon node using Ansible. They are designed to be run on the target node where the Intel® AI for Enterprise Inference Service will be deployed. The playbooks install all necessary dependencies, configure the environment, and prepare the system for the Intel® AI for Enterprise Inference Service. If you are going to use Intel® AI Accelerator, you will need to have the Intel® AI Accelerator drivers and firmware installed on the system before running this playbook; for more information refer to the [Intel® AI Accelerator Drivers Installation Guide](../../intel-ai-accelerator-prerequisites.md). For Intel® Arc™ Battlemage GPU, refer to the [BMG Setup Guide](../../intel-arc-bmg-setup.md) for driver and prerequisites. Many of the defaults are setup to work out of the box, but you will need to update the **`cluster_ip`** and provide the **`hf_token`** for downloading models from Hugging Face. @@ -33,8 +33,8 @@ Before running the playbook, review and update the following variables in the pl - **`hf_token_falcon3`**: Hugging Face token for Falcon 3. This can be the same as `hf_token`. ### Model Configuration -- **`models`**: A comma-separated list of model IDs to deploy (e.g., `1,2`). See the main documentation for a list of the models. The current setup only allows for one model to be deployed at a time. 
-- **`cpu_or_gpu`**: Set to `cpu` or `gpu` depending on the hardware. +- **`models`**: A comma-separated list of model IDs to deploy (e.g., `1,2`). See the [supported models](../../supported-models.md) for the full list. The current setup only allows for one model to be deployed at a time. +- **`device`**: Set to `cpu` for Xeon CPU, `hpu` for Gaudi GPU, or `xpu` for Intel® Arc™ Battlemage GPU. ### Certificate Configuration - **`cert_dir`**: The directory where certificates will be stored. No need to update it. diff --git a/docs/intel-arc-bmg-setup.md b/docs/intel-arc-bmg-setup.md new file mode 100644 index 00000000..065ffdfe --- /dev/null +++ b/docs/intel-arc-bmg-setup.md @@ -0,0 +1,164 @@ +# Intel® Arc™ Battlemage (BMG) GPU Setup Guide + + + + +## Overview + +This guide describes how to deploy Intel® AI for Enterprise Inference on Intel® Arc™ Battlemage (BMG) GPU hardware (Arc Pro B-series, e.g., B60, B70) for testing purposes. + +Intel Arc Battlemage GPUs are supported via the Intel GPU Plugin for Kubernetes and the XPU backend in vLLM. + +## Supported Models + +The following models have been enabled for testing purposes on Intel Arc Battlemage GPU deployments: + +| Menu # | Model ID | VRAM Required | +|--------|----------|---------------| +| 36 | Qwen/Qwen2.5-Coder-3B-Instruct *(default)* | ~4 GB | + +## Prerequisites + +### Hardware Requirements + +- Intel® Arc™ Battlemage GPU +- Host system with PCIe Gen5 x16 slot +- Ubuntu 25.10 + +### Software Requirements + +Before deploying, ensure the following are installed on **all worker nodes** with BMG GPUs: + +#### 1. Enterprise Inference prerequisites: [docs/intel-arc-bmg-setup.md](intel-arc-bmg-setup.md) + +#### 2. Intel GPU Drivers + +Install the Intel GPU drivers by following https://dgpu-docs.intel.com/installation-guides/installing-packages-from-the-intel-ppa.html + + +#### 3. 
Verify Intel GPU is detected + +```bash +# Verify GPU detection +clinfo | grep "Device Name" +# Should show: Intel(R) Arc(TM) B580 Graphics (or similar) + +ls /dev/dri/ +# Should show renderD128 (or similar render node) +``` + +## Configuration + +### inference-config.cfg + +Set `device=xpu` in your `core/inventory/inference-config.cfg`: + +```ini +# Hardware selection: 'cpu' for Xeon CPU, 'hpu'/'gpu'/'gaudi2'/'gaudi3' for Gaudi, 'xpu'/'bmg' for Intel Arc Battlemage GPU +device=xpu + +# Select BMG-compatible models (31-36). Default is 36 (Qwen2.5-Coder-3B-Instruct). +models=36 + +# Other settings remain the same +deploy_kubernetes_fresh=on +deploy_ingress_controller=on +deploy_keycloak_apisix=on +deploy_llm_models=on +``` + +### Command Line Usage + +```bash +# Deploy with Intel Arc Battlemage GPU (Qwen2.5-Coder-3B-Instruct is the default XPU model) +./inference-stack-deploy.sh \ + --cluster-url "https://my-cluster.example.com" \ + --cert-file "/path/to/cert.pem" \ + --key-file "/path/to/key.pem" \ + --keycloak-client-id "my-client-id" \ + --keycloak-admin-user "admin" \ + --keycloak-admin-password "changeme" \ + --hugging-face-token "hf_your_token" \ + --models "36" \ + --device "xpu" +``` + +## Deployment Architecture + +For Intel Arc Battlemage GPU deployments, the Intel GPU Plugin is used instead of the Habana AI Operator: + +``` ++--------------------------------------------------+ +| Kubernetes Cluster | +| | +| +--------------------------------------------+ | +| | Intel GPU Plugin (DaemonSet) | | +| | - Exposes gpu.intel.com/xe resource | | +| | - Manages Arc B-series GPU allocation | | +| +--------------------------------------------+ | +| | +| +--------------------------------------------+ | +| | vLLM Pod (XPU Backend) | | +| | - Image: opea/vllm-xpu | | +| | - Device: xpu (Intel Arc GPU) | | +| | - Resources: gpu.intel.com/xe: 1 | | +| +--------------------------------------------+ | ++--------------------------------------------------+ +``` + +## Intel 
GPU Plugin + +The Intel GPU Plugin for Kubernetes is automatically deployed when `device=xpu`. It: + +- Exposes the `gpu.intel.com/xe` resource on nodes with Intel Arc GPUs +- Enables Kubernetes workloads to request Intel Arc GPU resources +- Uses Node Feature Discovery (NFD) to detect GPU-capable nodes + +**Plugin Version**: 0.36.0 (configurable in `inference-metadata.cfg`) + +## vLLM XPU Backend + +Intel Arc BMG GPU deployments use vLLM with the XPU backend: + +- **Image**: `intel/vllm:0.14.0-xpu` +- **Device**: set via `VLLM_TARGET_DEVICE=xpu` (baked into the image; do not pass `--device xpu` as a CLI argument) +- **Precision**: `float16` (optimized for Arc GPU) +- **Block size**: 16 (optimized for XPU KV cache) +- **GPU Memory Utilization**: 90% + +## Troubleshooting + +### Common Issues + +1. **GPU not detected**: Ensure Intel GPU drivers are installed and `/dev/dri/renderD128` exists +2. **NFD label missing**: Label the node manually with `intel.feature.node.kubernetes.io/gpu=true` +3. **Out of memory**: Use a smaller model or upgrade to B770 (16GB GDDR6) +4. **XPU backend error**: Verify that the vLLM XPU image (see **vLLM XPU Backend** above) is available and can be pulled from the registry + +### CoreDNS Crash-Loop (Keycloak / PostgreSQL DNS failure) + +**Symptom:** The Keycloak pod repeatedly logs `cannot resolve host "keycloak-postgresql"`, and the CoreDNS pod is in `CrashLoopBackOff` with: +``` +[FATAL] plugin/loop: Loop (127.0.0.1:... -> :53) detected for zone "." +``` + +**Cause:** On systems using `systemd-resolved`, `/etc/resolv.conf` is a symlink to +`/run/systemd/resolve/resolv.conf`, which lists the `nodelocaldns` address +as the first nameserver. CoreDNS `forward . /etc/resolv.conf` then forwards external queries +to nodelocaldns, which forwards them back to CoreDNS, creating a loop that the `loop` plugin +detects, at which point it terminates CoreDNS. 
+ +**Fix:** +sudo rm -f /etc/resolv.conf +sudo ln -s /run/systemd/resolve/resolv.conf /etc/resolv.conf + +## Supported Platforms + +- Ubuntu 25.10 + +## References + +- [Intel GPU Plugin for Kubernetes](https://github.com/intel/intel-device-plugins-for-kubernetes) +- [vLLM XPU Backend Documentation](https://docs.vllm.ai/en/latest/getting_started/xpu-installation.html) +- [Intel Arc GPU Driver Installation](https://dgpu-docs.intel.com/driver/client/overview.html) +- [Intel Extension for PyTorch (IPEX)](https://intel.github.io/intel-extension-for-pytorch/) diff --git a/docs/single-node-deployment.md b/docs/single-node-deployment.md index 823b7dbf..8d590b4f 100644 --- a/docs/single-node-deployment.md +++ b/docs/single-node-deployment.md @@ -10,9 +10,9 @@ Before running the automation, it is recommended to complete all [prerequisites] ## System Component Deployment Recommendations -For single-node Xeon clusters, **Keycloak** and **APISIX** are recommended. +For single-node Xeon or Intel® Arc™ Battlemage GPU clusters, **Keycloak** and **APISIX** are recommended. -For Intel® AI Accelerator or large multi-node Xeon clusters, the GenAI Gateway is well-suited. +For Intel® AI Accelerator (Gaudi) or large multi-node Xeon clusters, the GenAI Gateway is well-suited. ## Deployment @@ -53,17 +53,28 @@ Follow the steps below depending on the hardware platform. The `models` argument #### CPU only Run the command below to deploy the Llama 3.1 8B parameter model on CPU. ```bash -./inference-stack-deploy.sh --models "21" --cpu-or-gpu "cpu" --hugging-face-token $HUGGINGFACE_TOKEN +./inference-stack-deploy.sh --models "21" --device "cpu" --hugging-face-token $HUGGINGFACE_TOKEN ``` -#### Intel® AI Accelerators +#### Intel® AI Accelerators (Gaudi) > **📝 Note**: If running on Intel® AI Accelerators, ensure firmware and drivers are up to date using the [automated setup scripts](./intel-ai-accelerator-prerequisites.md#automated-installationupgrade-process) before deployment. 
-Run the command below to deploy the Llama 3.1 8B parameter model on Intel® AI Accelerator. For Gaudi 3, set `cpu-or-gpu` to `gaudi3` instead. +Run the command below to deploy the Llama 3.1 8B parameter model on Intel® AI Accelerator (Gaudi 2). For Gaudi 3, the same `hpu` value applies. ```bash -./inference-stack-deploy.sh --models "1" --cpu-or-gpu "gpu" --hugging-face-token $HUGGINGFACE_TOKEN +./inference-stack-deploy.sh --models "1" --device "hpu" --hugging-face-token $HUGGINGFACE_TOKEN ``` +#### Intel® Arc™ Battlemage (BMG) GPU + +> **📝 Note**: If running on Intel® Arc™ Battlemage GPU, ensure Intel GPU drivers are installed and the Intel GPU Plugin is ready. See the [BMG setup guide](./intel-arc-bmg-setup.md) for prerequisites. + +Run the command below to deploy the Qwen2.5-Coder-3B-Instruct model on Intel® Arc™ Battlemage GPU (default for XPU deployments). +```bash +./inference-stack-deploy.sh --models "36" --device "xpu" --hugging-face-token $HUGGINGFACE_TOKEN +``` + +For other BMG-compatible models (31–36), replace the model number. See [supported models](./intel-arc-bmg-setup.md#supported-models) for the full list. + Select Option 1 and confirm the Yes/No prompt. This will deploy the setup automatically. If any issues are encountered, double-check the prerequisites and configuration files. @@ -88,10 +99,15 @@ To test on CPU only. Note `vllmcpu` is appended to the URL. 
curl -k https://${BASE_URL}/Llama-3.1-8B-Instruct-vllmcpu/v1/completions -X POST -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "What is Deep Learning?", "max_tokens": 50, "temperature": 0}' -H 'Content-Type: application/json' -H "Authorization: Bearer $TOKEN" ``` -To test on Intel® AI Accelerators: +To test on Intel® AI Accelerators (Gaudi): ```bash curl -k https://${BASE_URL}/Llama-3.1-8B-Instruct/v1/completions -X POST -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "What is Deep Learning?", "max_tokens": 50, "temperature": 0}' -H 'Content-Type: application/json' -H "Authorization: Bearer $TOKEN" ``` +To test on Intel® Arc™ Battlemage GPU (XPU). Note `vllmxpu` is appended to the URL. +```bash +curl -k https://${BASE_URL}/Qwen2.5-Coder-3B-Instruct-vllmxpu/v1/completions -X POST -d '{"model": "Qwen/Qwen2.5-Coder-3B-Instruct", "prompt": "What is Deep Learning?", "max_tokens": 50, "temperature": 0}' -H 'Content-Type: application/json' -H "Authorization: Bearer $TOKEN" +``` + ## Post-Deployment With the deployed model on the server, refer to the [post-deployment instructions](./README.md#post-deployment) for options. diff --git a/docs/supported-models.md b/docs/supported-models.md index f9854cb9..81f8c820 100644 --- a/docs/supported-models.md +++ b/docs/supported-models.md @@ -1,23 +1,28 @@ -### Xeon and Gaudi Supported Models +### Supported Models -The following table lists the pre-validated models for Intel® AI for Enterprise Inference. +The following table lists the pre-validated models for Intel® AI for Enterprise Inference. 
### ✅ **Model Support Matrix** - **Model** | **Xeon** | **Gaudi** | ----------------------------------------------------------------------------------------------|:-------------------:|:---------------------:| -[**deepseek-ai/DeepSeek-R1-Distill-Qwen-32B**](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | ✓ | ✓ | -[**deepseek-ai/DeepSeek-R1-Distill-Llama-8B**](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) | ✓ | ✓ | -[**meta-llama/Llama-3.1-8B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | ✓ | ✓ | -[**meta-llama/Llama-3.2-3B-Instruct**](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | ✓ | | -[**Qwen/Qwen3-1.7B**](https://huggingface.co/Qwen/Qwen3-1.7B) | ✓ | | -[**Qwen/Qwen3-4B-Instruct-2507**](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) | ✓ | | -[**meta-llama/Llama-3.1-70B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | | ✓ | -[**meta-llama/Llama-3.1-405B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) | | ✓ (Gaudi3) | ✓ -[**meta-llama/Llama-4-Scout-17B-16E-Instruct**](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) | | ✓ | ✓ -[**Qwen/Qwen2.5-32B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | | ✓ | ✓ -[**mistralai/Mixtral-8x7B-Instruct-v0.1**](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | | ✓ | -[**mistralai/Mistral-7B-Instruct-v0.3**](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | | ✓ | -[**BAAI/bge-reranker-base**](https://huggingface.co/BAAI/bge-reranker-base) | | ✓ -[**BAAI/bge-base-en-v1.5**](https://huggingface.co/BAAI/bge-base-en-v1.5) | | ✓ +| **Model** | **Xeon** | **Gaudi** | **Arc BMG (XPU)** | +|-----------|:--------:|:---------:|:-----------------:| +| [**deepseek-ai/DeepSeek-R1-Distill-Qwen-32B**](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | ✓ | ✓ | | +| 
[**deepseek-ai/DeepSeek-R1-Distill-Llama-8B**](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) | ✓ | ✓ | ✓ (33) | +| [**meta-llama/Llama-3.1-8B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | ✓ | ✓ | ✓ (31) | +| [**meta-llama/Llama-3.2-3B-Instruct**](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | ✓ | | | +| [**Qwen/Qwen3-1.7B**](https://huggingface.co/Qwen/Qwen3-1.7B) | ✓ | | | +| [**Qwen/Qwen3-4B-Instruct-2507**](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) | ✓ | | | +| [**Qwen/Qwen2.5-Coder-3B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct) | | | ✓ (36) *(default)* | +| [**Qwen/Qwen2.5-7B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | | | ✓ (34) | +| [**meta-llama/Llama-3.1-70B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | | ✓ | | +| [**meta-llama/Llama-3.1-405B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) | | ✓ (Gaudi3) | | +| [**meta-llama/Llama-4-Scout-17B-16E-Instruct**](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) | | ✓ | | +| [**Qwen/Qwen2.5-32B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | | ✓ | | +| [**mistralai/Mixtral-8x7B-Instruct-v0.1**](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | | ✓ | | +| [**mistralai/Mistral-7B-Instruct-v0.3**](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | | ✓ | ✓ (32) | +| [**tiiuae/Falcon3-7B-Instruct**](https://huggingface.co/tiiuae/Falcon3-7B-Instruct) | | | ✓ (35) | +| [**BAAI/bge-reranker-base**](https://huggingface.co/BAAI/bge-reranker-base) | | ✓ | | +| [**BAAI/bge-base-en-v1.5**](https://huggingface.co/BAAI/bge-base-en-v1.5) | | ✓ | | + +> Arc BMG (XPU) model numbers (31–36) are used with `--device xpu`. See the [BMG setup guide](./intel-arc-bmg-setup.md) for details. 
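Reviewer note on the model matrix above: the Arc BMG (XPU) column assigns menu numbers 31–36 to model IDs. The mapping can be written out as a small lookup, shown here as a sketch only. The numbers and model IDs are copied from the table; the names `XPU_MODELS` and `resolve_xpu_model` are hypothetical and do not come from the repository.

```python
# Menu numbers and model IDs as listed in the "Arc BMG (XPU)" column of the
# matrix. The dict and helper names are illustrative, not part of the repo.
XPU_MODELS = {
    31: "meta-llama/Llama-3.1-8B-Instruct",
    32: "mistralai/Mistral-7B-Instruct-v0.3",
    33: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    34: "Qwen/Qwen2.5-7B-Instruct",
    35: "tiiuae/Falcon3-7B-Instruct",
    36: "Qwen/Qwen2.5-Coder-3B-Instruct",  # marked as the default in the table
}

DEFAULT_XPU_MODEL = 36


def resolve_xpu_model(menu_number: int = DEFAULT_XPU_MODEL) -> str:
    """Return the model ID for an XPU menu number, or raise ValueError."""
    try:
        return XPU_MODELS[menu_number]
    except KeyError:
        valid = ", ".join(str(n) for n in sorted(XPU_MODELS))
        raise ValueError(
            f"{menu_number} is not an XPU model number (expected one of: {valid})"
        ) from None


if __name__ == "__main__":
    print(resolve_xpu_model())    # default entry
    print(resolve_xpu_model(31))
```

A lookup like this makes it easy to reject a Xeon or Gaudi model number early when `--device xpu` is selected, before any deployment step runs.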
diff --git a/sample_solutions/CodeTranslation/api/services/api_client.py b/sample_solutions/CodeTranslation/api/services/api_client.py index f83c42a7..6538ee23 100644 --- a/sample_solutions/CodeTranslation/api/services/api_client.py +++ b/sample_solutions/CodeTranslation/api/services/api_client.py @@ -35,7 +35,7 @@ def get_inference_client(self): def translate_code(self, source_code: str, source_lang: str, target_lang: str) -> str: """ - Translate code from one language to another using CodeLlama-34b-instruct + Translate code from one language to another using an instruct model. Args: source_code: Code to translate @@ -48,42 +48,41 @@ def translate_code(self, source_code: str, source_lang: str, target_lang: str) - try: client = self.get_inference_client() - # Create prompt for code translation - prompt = f"""Translate the following {source_lang} code to {target_lang}. -Only output the translated code without any explanations or markdown formatting. - -{source_lang} code: -``` -{source_code} -``` - -{target_lang} code: -```""" - logger.info(f"Translating code from {source_lang} to {target_lang}") - # Use completions endpoint for CodeLlama - response = client.completions.create( + response = client.chat.completions.create( model=config.INFERENCE_MODEL_NAME, - prompt=prompt, + messages=[ + { + "role": "system", + "content": ( + f"You are an expert code translator. " + f"Translate {source_lang} code to {target_lang}. " + f"Output only the translated code with no explanations or markdown." 
+ ), + }, + { + "role": "user", + "content": f"Translate this {source_lang} code to {target_lang}:\n\n{source_code}", + }, + ], max_tokens=config.LLM_MAX_TOKENS, temperature=config.LLM_TEMPERATURE, - stop=["```"] # Stop at closing code block ) - # Handle response structure - if hasattr(response, 'choices') and len(response.choices) > 0: - choice = response.choices[0] - if hasattr(choice, 'text'): - translated_code = choice.text.strip() - logger.info(f"Successfully translated code ({len(translated_code)} characters)") - return translated_code - else: - logger.error(f"Unexpected response structure: {type(choice)}, {choice}") - return "" - else: - logger.error(f"Unexpected response: {type(response)}, {response}") - return "" + if response.choices: + translated_code = response.choices[0].message.content.strip() + # Strip markdown code fences if model wraps output anyway + if translated_code.startswith("```"): + lines = translated_code.splitlines() + translated_code = "\n".join( + lines[1:-1] if lines[-1].strip() == "```" else lines[1:] + ).strip() + logger.info(f"Successfully translated code ({len(translated_code)} characters)") + return translated_code + + logger.error(f"Empty choices in response: {response}") + return "" except Exception as e: logger.error(f"Error translating code: {str(e)}", exc_info=True) raise diff --git a/sample_solutions/CodeTranslation/docker-compose.yaml b/sample_solutions/CodeTranslation/docker-compose.yaml index 368c9d66..e3ec3422 100644 --- a/sample_solutions/CodeTranslation/docker-compose.yaml +++ b/sample_solutions/CodeTranslation/docker-compose.yaml @@ -3,6 +3,10 @@ services: build: context: ./api dockerfile: Dockerfile + args: + - http_proxy=${http_proxy} + - https_proxy=${https_proxy} + - no_proxy=${no_proxy} container_name: code-trans-backend ports: - "5001:5001" @@ -32,6 +36,10 @@ services: build: context: ./ui dockerfile: Dockerfile + args: + - http_proxy=${http_proxy} + - https_proxy=${https_proxy} + - no_proxy=${no_proxy} 
container_name: code-trans-frontend ports: - "3000:8080" @@ -39,7 +47,6 @@ services: - backend networks: - code-trans-network - restart: unless-stopped networks: code-trans-network:
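Reviewer note on the `api_client.py` change: the new fence-stripping branch in `translate_code` can be pulled out and exercised on its own. The sketch below mirrors the logic in the diff. The function name `strip_code_fences` is hypothetical, and the guard for empty or `None` content is an addition that is not in the patch (it assumes the message content may be absent).

```python
from typing import Optional


def strip_code_fences(text: Optional[str]) -> str:
    """Remove a wrapping markdown code fence from model output.

    Mirrors the branch added to translate_code in the diff. The empty/None
    guard is an extra assumption, not part of the patch.
    """
    if not text:
        return ""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        lines = cleaned.splitlines()
        # Drop the opening fence line; drop the last line only if it is a
        # bare closing fence.
        body = lines[1:-1] if lines[-1].strip() == "```" else lines[1:]
        cleaned = "\n".join(body).strip()
    return cleaned


if __name__ == "__main__":
    print(strip_code_fences("```python\nprint('hi')\n```"))
```

Keeping the step in a helper of this kind would let it be unit-tested without a live inference endpoint; whether to factor it out is left to the author.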