opea-project · tintisimone · Jun 10, 2026
diff --git a/README.md b/README.md
@@ -1,25 +1,27 @@
+<<<<<<< HEAD
 # Intel® AI for Enterprise Inference
 
 Unleash the power of AI Inference on Intel Silicon
 
 The Intel® AI for Enterprise Inference is aimed to streamline and enhance the deployment and management of AI inference services on Intel hardware. Utilizing the power of Kubernetes Orchestration, this solution automates the deployment of LLM models to run faster inference, provision compute resources, and configure the optimal settings to minimize the complexities and reduce manual efforts.
 
-It supports a broad range of Intel hardware platforms, including Intel® Xeon® Scalable processors and Intel® Gaudi® AI Accelerators, ensuring flexibility and scalability to meet diverse enterprise needs.
+It supports a broad range of Intel hardware platforms, including Intel® Xeon® Scalable processors, Intel® Gaudi® AI Accelerators, and **Intel® Arc™ Battlemage (BMG) GPUs**, ensuring flexibility and scalability to meet diverse enterprise needs.
 
-Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI standard APIs, enabling seamless integration to enterprise applications both on-premises and in cloud-native environments. This compatibility allows businesses to leverage the full capabilities of Intel hardware while deploying AI models with ease. With this suite, enterprises can efficiently configure and evolve their AI infrastructure, adapting to new models and growing demands effortlessly. 
+Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI standard APIs, enabling seamless integration to enterprise applications both on-premises and in cloud-native environments. This compatibility allows businesses to leverage the full capabilities of Intel hardware while deploying AI models with ease. With this suite, enterprises can efficiently configure and evolve their AI infrastructure, adapting to new models and growing demands effortlessly.
 
 ![Intel AI for Enterprise Inference](docs/pictures/Enterprise-Inference-Architecture.png)
 
 #### Key Components:
    - **Kubernetes**: A powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications, ensuring high availability and efficient resource utilization.
    - **Intel Gaudi Base Operator**: A specialized operator that manages the lifecycle of Habana AI resources within the Kubernetes cluster, enabling efficient utilization of Intel® Gaudi® hardware for AI workloads. (Applicable only to Gaudi based deployments)
+   - **Intel GPU Plugin**: A Kubernetes device plugin that manages Intel® Arc™ GPU resources within the cluster, enabling efficient utilization of Intel® Arc™ Battlemage (BMG) hardware for AI workloads. (Applicable only to BMG based deployments)
    - **Ingress NGINX Controller**: A high-performance reverse proxy and load balancer for traffic, responsible for routing incoming requests to the appropriate services within the Kubernetes cluster, ensuring seamless access to deployed AI models.
    - **Keycloak**: An open-source identity and access management solution that provides robust authentication and authorization capabilities, ensuring secure access to AI services and resources within the cluster.
    - **APISIX**: A cloud-native API gateway, handling API traffic and providing advanced features caching, and authentication, enabling efficient and secure access to deployed AI models.
    - **Observability**: An open-source monitoring solution designed to operate natively within Kubernetes clusters, providing comprehensive visibility into the performance, health, and resource utilization of deployed applications and cluster components through metrics, visualization, and alerting capabilities.
    - **Model Deployments**: Automated deployment and management of AI LLM models within the Kubernetes inference cluster, enabling scalable and reliable AI inference capabilities.
    - **GenAI Gateway**: An integrated gateway leveraging LiteLLM and Langfuse to provide flexible interfaces for routing and managing generative AI models. It enables user and key management, user token telemetry, and analytics for LLM inference workflows.
-   
+
 ## Table of Contents
 -   [Usage](#usage)
 -   [Support](#support)
@@ -32,6 +34,8 @@ Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI
 The Usage instructions for the AI Inference as a Service Deployment Automation can be found in the [docs/README.md](docs/README.md) file.
 To setup, follow the step-by-step instructions provided in the `docs/README.md` file.
 
+For Intel® Arc™ Battlemage (BMG) GPU setup, refer to [docs/intel-arc-bmg-setup.md](docs/intel-arc-bmg-setup.md).
+
 ## Support
 For feature requests, bugs or questions about the project, [open an issue](https://github.com/opea-project/Enterprise-Inference/issues) on the GitHub Issues page. Provide as much details as possible, including steps to reproduce the issue, expected behavior, and actual behavior.
 
@@ -42,7 +46,7 @@ Intel® AI for Enterprise Inference is licensed under the [Apache License Versio
 The [Security Policy](SECURITY.md) outlines our guidelines and procedures for ensuring the highest level of security and trust for our users who consume Intel® AI for Enterprise Inference.
 
 ## Trademark Information
-Intel, the Intel logo, Xeon, and Gaudi are trademarks of Intel Corporation or its subsidiaries.
+Intel, the Intel logo, Xeon, Gaudi, and Arc are trademarks of Intel Corporation or its subsidiaries.
 
 * Other names and brands may be claimed as the property of others.
 &copy; Intel Corporation
diff --git a/core/helm-charts/vllm/bmg-values.yaml b/core/helm-charts/vllm/bmg-values.yaml
@@ -0,0 +1,194 @@
+# Copyright (C) 2025-2026 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+# Intel® Arc™ Battlemage (BMG) GPU optimized override values for vLLM deployments.
+# This file contains BMG-specific overrides for Intel Arc B-series GPU (e.g., B580, B770).
+# Requires the Intel GPU Plugin (intel-device-plugins-gpu) to be installed on the cluster.
+
+# Intel XPU accelerator device (Arc GPU)
+accelDevice: "xpu"
+# Kubernetes resource name exposed by the Intel GPU device plugin
+accelDeviceResource: "gpu.intel.com/xe"
+
+block_size: 64           # XPU-optimised KV cache block size (must be >= 64 for 0.14.1-xpu IPEX chunked prefill)
+max_num_seqs: 128        # Max concurrent sequences (tuned for Arc B-series VRAM)
+max_seq_len_to_capture: 2048
+d_type: "float16"
+max_model_len: 8192
+
+image:
+  repository: intel/vllm
+  tag: "0.14.1-xpu"
+  pullPolicy: IfNotPresent
+  command: ["vllm", "serve"]
+
+# intel/vllm:0.14.1-xpu runs as root (no user defined in image)
+podSecurityContext:
+  fsGroup: 0
+  runAsUser: 0
+
+securityContext:
+  allowPrivilegeEscalation: false
+  capabilities:
+    drop:
+      - ALL
+    add:
+      - SYS_NICE
+  readOnlyRootFilesystem: false
+  runAsNonRoot: false
+  runAsUser: 0
+
+# Node affinity for BMG inference nodes
+affinity:
+  nodeAffinity:
+    requiredDuringSchedulingIgnoredDuringExecution:
+      nodeSelectorTerms:
+      - matchExpressions:
+        - key: ei-inference-eligible
+          operator: In
+          values: ["true"]
+
+# Intel XPU runtime settings
+VLLM_NO_USAGE_STATS: 1
+DO_NOT_TRACK: 1
+
+# vLLM device backend - set via env var in 0.14.1-xpu (VLLM_TARGET_DEVICE=xpu is already baked in)
+VLLM_WORKER_MULTIPROC_METHOD: "spawn"
+
+LLM_MODEL_ID: "Qwen/Qwen2.5-Coder-3B-Instruct"
+
+modelConfigs:
+
+  "meta-llama/Llama-3.1-8B-Instruct":
+    configMapValues:
+      VLLM_NO_USAGE_STATS: "1"
+      DO_NOT_TRACK: "1"
+      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
+      HF_HUB_DISABLE_XET: "1"
+    extraCmdArgs:
+      [
+        "--dtype", "float16",
+        "--block-size", "64",
+        "--max-model-len", "8192",
+        "--gpu-memory-utilization", "0.90",
+        "--max-num-seqs", "128",
+        "--enforce-eager",
+        "--enable-auto-tool-choice",
+        "--tool-call-parser", "llama3_json",
+      ]
+    tensor_parallel_size: "1"
+    pipeline_parallel_size: "1"
+
+  "mistralai/Mistral-7B-Instruct-v0.3":
+    configMapValues:
+      VLLM_NO_USAGE_STATS: "1"
+      DO_NOT_TRACK: "1"
+      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
+      HF_HUB_DISABLE_XET: "1"
+    extraCmdArgs:
+      [
+        "--dtype", "float16",
+        "--block-size", "64",
+        "--max-model-len", "8192",
+        "--gpu-memory-utilization", "0.90",
+        "--max-num-seqs", "128",
+        "--enforce-eager",
+        "--enable-auto-tool-choice",
+        "--tool-call-parser", "mistral",
+      ]
+    tensor_parallel_size: "1"
+    pipeline_parallel_size: "1"
+
+  "deepseek-ai/DeepSeek-R1-Distill-Llama-8B":
+    configMapValues:
+      VLLM_NO_USAGE_STATS: "1"
+      DO_NOT_TRACK: "1"
+      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
+      HF_HUB_DISABLE_XET: "1"
+    extraCmdArgs:
+      [
+        "--dtype", "float16",
+        "--block-size", "64",
+        "--max-model-len", "8192",
+        "--gpu-memory-utilization", "0.90",
+        "--max-num-seqs", "128",
+        "--enforce-eager",
+      ]
+    tensor_parallel_size: "1"
+    pipeline_parallel_size: "1"
+
+  "Qwen/Qwen2.5-7B-Instruct":
+    configMapValues:
+      VLLM_NO_USAGE_STATS: "1"
+      DO_NOT_TRACK: "1"
+      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
+      HF_HUB_DISABLE_XET: "1"
+    extraCmdArgs:
+      [
+        "--dtype", "float16",
+        "--block-size", "64",
+        "--max-model-len", "8192",
+        "--gpu-memory-utilization", "0.90",
+        "--max-num-seqs", "128",
+        "--enforce-eager",
+        "--enable-auto-tool-choice",
+        "--tool-call-parser", "hermes",
+      ]
+    tensor_parallel_size: "1"
+    pipeline_parallel_size: "1"
+
+  "Qwen/Qwen2.5-Coder-3B-Instruct":
+    configMapValues:
+      VLLM_NO_USAGE_STATS: "1"
+      DO_NOT_TRACK: "1"
+      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
+      HF_HUB_DISABLE_XET: "1"
+    extraCmdArgs:
+      [
+        "--dtype", "float16",
+        "--block-size", "64",
+        "--max-model-len", "8192",
+        "--gpu-memory-utilization", "0.90",
+        "--max-num-seqs", "128",
+        "--enforce-eager",
+        "--enable-auto-tool-choice",
+        "--tool-call-parser", "hermes",
+      ]
+    tensor_parallel_size: "1"
+    pipeline_parallel_size: "1"
+
+  "tiiuae/Falcon3-7B-Instruct":
+    configMapValues:
+      VLLM_NO_USAGE_STATS: "1"
+      DO_NOT_TRACK: "1"
+      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
+      HF_HUB_DISABLE_XET: "1"
+    extraCmdArgs:
+      [
+        "--dtype", "float16",
+        "--block-size", "64",
+        "--max-model-len", "8192",
+        "--gpu-memory-utilization", "0.90",
+        "--max-num-seqs", "128",
+        "--enforce-eager",
+      ]
+    tensor_parallel_size: "1"
+    pipeline_parallel_size: "1"
+
+defaultModelConfigs:
+  configMapValues:
+    VLLM_NO_USAGE_STATS: "1"
+    DO_NOT_TRACK: "1"
+    VLLM_WORKER_MULTIPROC_METHOD: "spawn"
+    HF_HUB_DISABLE_XET: "1"
+  extraCmdArgs:
+    [
+      "--dtype", "float16",
+      "--block-size", "16",
+      "--max-model-len", "8192",
+      "--gpu-memory-utilization", "0.90",
+      "--max-num-seqs", "128",
+      "--enforce-eager",
+    ]
+  tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
+  pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
diff --git a/core/helm-charts/vllm/templates/deployment.yaml b/core/helm-charts/vllm/templates/deployment.yaml
@@ -126,11 +126,12 @@ spec:
           {{- if .Values.image.pullPolicy }}
           imagePullPolicy: {{ .Values.image.pullPolicy }}
           {{- end }}
-          # command:
-          #   - /bin/bash
-          #   - -c
-          #   - |
-          #     python3 -m vllm.entrypoints.openai.api_server --dtype {{ .Values.d_type }}  --model {{ .Values.LLM_MODEL_ID }} --port {{ .Values.port }} --tensor-parallel-size {{ .Values.tensor_parallel_size }} --block-size {{ .Values.block_size }} --max-model-len {{ .Values.max_model_len }} --disable-log-requests
+          {{- if .Values.image.command }}
+          command:
+            {{- range .Values.image.command }}
+            - {{ . | quote }}
+            {{- end }}
+          {{- end }}
           args:
             {{- $modelConfig := (index .Values.modelConfigs $modelName | default dict) }}
             {{- $modelArgs := $modelConfig.extraCmdArgs | default .Values.defaultModelConfigs.extraCmdArgs }}
@@ -195,7 +196,7 @@ spec:
             {{- end }}
             {{- else }}
             limits:
-              habana.ai/gaudi: {{ .Values.tensor_parallel_size | default (index .Values.modelConfigs .Values.LLM_MODEL_ID | default dict).tensor_parallel_size | default .Values.defaultModelConfigs.tensor_parallel_size | quote}}
+              {{ .Values.accelDeviceResource | default "habana.ai/gaudi" }}: {{ .Values.tensor_parallel_size | default (index .Values.modelConfigs .Values.LLM_MODEL_ID | default dict).tensor_parallel_size | default .Values.defaultModelConfigs.tensor_parallel_size | quote}}
             {{- end }}
           {{- end }}
 

diff --git a/core/helm-charts/vllm/values.yaml b/core/helm-charts/vllm/values.yaml
@@ -17,6 +17,8 @@ autoscaling:
 
 # empty for CPU (longer latencies are tolerated before HPA scaling unaccelerated service)
 accelDevice: ""
+# Kubernetes resource name for the accelerator (e.g. habana.ai/gaudi, gpu.intel.com/xe)
+accelDeviceResource: ""
 
 port: 2080
 shmSize: 1Gi

diff --git a/core/inference-stack-deploy.sh b/core/inference-stack-deploy.sh
@@ -50,7 +50,7 @@ NC=$(tput sgr0)
 # --keycloak-admin-password <password>: The Keycloak admin password.
 # --hugging-face-token <token>: The token for Huggingface.
 # --models <models>: The models to deploy (comma-separated list of model numbers or names).
-# --cpu-or-gpu <c/g>: Specify whether to run on CPU or GPU.
+# --device <cpu/hpu/xpu>: Specify the target device. 'cpu' for Xeon, 'hpu' for Gaudi GPU, 'xpu' for Intel Arc Battlemage GPU.
 
 # Main Menu
 
@@ -86,7 +86,7 @@ NC=$(tput sgr0)
 
 # Example
 # To perform a fresh installation with specific parameters, you can run:
-# ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/to/cert.pem" --key-file "/path/to/key.pem" --keycloak-client-id "my-client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --cpu-or-gpu "g"
+# ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/to/cert.pem" --key-file "/path/to/key.pem" --keycloak-client-id "my-client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --device "hpu"
 
 ##############################################################################
 
@@ -118,6 +118,7 @@ source "$SCRIPT_DIR/lib/cluster/drv-fw-update.sh"
 # Components deployment
 source "$SCRIPT_DIR/lib/components/kubernetes-setup.sh"
 source "$SCRIPT_DIR/lib/components/intel-base-operator.sh"
+source "$SCRIPT_DIR/lib/components/intel-gpu-plugin.sh"
 source "$SCRIPT_DIR/lib/components/ingress-controller.sh"
 source "$SCRIPT_DIR/lib/components/keycloak-controller.sh"
 source "$SCRIPT_DIR/lib/components/genai-gateway-controller.sh"
@@ -166,10 +167,11 @@ Options:
   --keycloak-admin-password <pw> Keycloak admin password.
   --hugging-face-token <token>   Huggingface token.
   --models <models>              Models to deploy (comma-separated).
-  --cpu-or-gpu <c/g>             Run on CPU (c) or GPU (g).
+  --device <cpu/hpu/xpu>         Target device: cpu (Xeon), hpu (Gaudi GPU), xpu (Intel Arc Battlemage GPU).
 
 Examples:
-  Setup cluster: ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --cpu-or-gpu "g"
+  Setup cluster (Gaudi GPU): ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --device "hpu"
+  Setup cluster (BMG GPU):   ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "36" --device "xpu"
 
 ###############################################################################  
 EOF

diff --git a/core/inventory/hosts.yaml b/core/inventory/hosts.yaml
@@ -1,28 +1,19 @@
 all:
   hosts:
-    master:
-      ansible_host: "{{ private_ip_control_plane_node }}"
-      ansible_user: "username_of_user_running_automation"
-      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
-    worker1:
-      ansible_host: "{{ private_ip_workload_node_1 }}"
-      ansible_user: "username_of_user_running_automation"
-      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
-    worker2:
-      ansible_host: "{{ private_ip_workload_node_2 }}"
-      ansible_user: "username_of_user_running_automation"
-      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
+    master1:
+      ansible_connection: local
+      ansible_user: gta
+      ansible_become: true
   children:
     kube_control_plane:
       hosts:
-        master:
+        master1:
     kube_node:
       hosts:
-        worker1:
-        worker2:
+        master1:
     etcd:
       hosts:
-        master:
+        master1:
     k8s_cluster:
       children:
         kube_control_plane:

diff --git a/core/inventory/inference-config.cfg b/core/inventory/inference-config.cfg
@@ -2,14 +2,15 @@ cluster_url=api.example.com
 cert_file=~/certs/cert.pem
 key_file=~/certs/key.pem
 keycloak_client_id=my-client-id
-keycloak_admin_user=your-keycloak-admin-user
+keycloak_admin_user=admin
 keycloak_admin_password=changeme
-hugging_face_token=your_hugging_face_token
-hugging_face_token_falcon3=your_hugging_face_token
-models=11
-cpu_or_gpu=cpu
+hugging_face_token=
+hugging_face_token_falcon3=
+models=36
+# Hardware selection: 'cpu' for Xeon CPU, 'hpu'/'gpu'/'gaudi2'/'gaudi3' for Gaudi GPU, 'xpu'/'bmg' for Intel Arc Battlemage GPU
+device=xpu
 vault_pass_code=place-holder-123
-deploy_kubernetes_fresh=on
+deploy_kubernetes_fresh=yes
 deploy_ingress_controller=on
 deploy_keycloak_apisix=on
 deploy_genai_gateway=off
@@ -20,4 +21,7 @@ deploy_istio=off
 uninstall_ceph=off
 
 # Agentic AI Plugin
-deploy_agenticai_plugin=off
+deploy_agenticai_plugin=off
+http_proxy=""
+https_proxy=""
+no_proxy="localhost,127.0.0.1,10.96.0.0/12,10.244.0.0/16,192.168.0.0/16,.svc,.cluster.local,10.233.0.1,10.233.0.0/18,10.0.0.0/8,api.example.com"