12 changes: 8 additions & 4 deletions README.md
@@ -1,25 +1,27 @@
<<<<<<< HEAD
# Intel® AI for Enterprise Inference

Unleash the power of AI Inference on Intel Silicon

Intel® AI for Enterprise Inference aims to streamline and enhance the deployment and management of AI inference services on Intel hardware. Built on Kubernetes orchestration, this solution automates the deployment of LLMs for faster inference, provisions compute resources, and configures optimal settings to minimize complexity and reduce manual effort.

It supports a broad range of Intel hardware platforms, including Intel® Xeon® Scalable processors and Intel® Gaudi® AI Accelerators, ensuring flexibility and scalability to meet diverse enterprise needs.
It supports a broad range of Intel hardware platforms, including Intel® Xeon® Scalable processors, Intel® Gaudi® AI Accelerators, and **Intel® Arc™ Battlemage (BMG) GPUs**, ensuring flexibility and scalability to meet diverse enterprise needs.

Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI standard APIs, enabling seamless integration to enterprise applications both on-premises and in cloud-native environments. This compatibility allows businesses to leverage the full capabilities of Intel hardware while deploying AI models with ease. With this suite, enterprises can efficiently configure and evolve their AI infrastructure, adapting to new models and growing demands effortlessly.
Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI standard APIs, enabling seamless integration to enterprise applications both on-premises and in cloud-native environments. This compatibility allows businesses to leverage the full capabilities of Intel hardware while deploying AI models with ease. With this suite, enterprises can efficiently configure and evolve their AI infrastructure, adapting to new models and growing demands effortlessly.

![Intel AI for Enterprise Inference](docs/pictures/Enterprise-Inference-Architecture.png)

#### Key Components:
- **Kubernetes**: A powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications, ensuring high availability and efficient resource utilization.
- **Intel Gaudi Base Operator**: A specialized operator that manages the lifecycle of Habana AI resources within the Kubernetes cluster, enabling efficient utilization of Intel® Gaudi® hardware for AI workloads. (Applicable only to Gaudi based deployments)
- **Intel GPU Plugin**: A Kubernetes device plugin that manages Intel® Arc™ GPU resources within the cluster, enabling efficient utilization of Intel® Arc™ Battlemage (BMG) hardware for AI workloads. (Applicable only to BMG based deployments)
- **Ingress NGINX Controller**: A high-performance reverse proxy and load balancer for traffic, responsible for routing incoming requests to the appropriate services within the Kubernetes cluster, ensuring seamless access to deployed AI models.
- **Keycloak**: An open-source identity and access management solution that provides robust authentication and authorization capabilities, ensuring secure access to AI services and resources within the cluster.
- **APISIX**: A cloud-native API gateway that handles API traffic and provides advanced features such as caching and authentication, enabling efficient and secure access to deployed AI models.
- **Observability**: An open-source monitoring solution designed to operate natively within Kubernetes clusters, providing comprehensive visibility into the performance, health, and resource utilization of deployed applications and cluster components through metrics, visualization, and alerting capabilities.
- **Model Deployments**: Automated deployment and management of AI LLM models within the Kubernetes inference cluster, enabling scalable and reliable AI inference capabilities.
- **GenAI Gateway**: An integrated gateway leveraging LiteLLM and Langfuse to provide flexible interfaces for routing and managing generative AI models. It enables user and key management, user token telemetry, and analytics for LLM inference workflows.

## Table of Contents
- [Usage](#usage)
- [Support](#support)
@@ -32,6 +34,8 @@ Intel® AI for Enterprise Inference, powered by OPEA, is compatible with OpenAI
The Usage instructions for the AI Inference as a Service Deployment Automation can be found in the [docs/README.md](docs/README.md) file.
To set up, follow the step-by-step instructions provided in the `docs/README.md` file.

For Intel® Arc™ Battlemage (BMG) GPU setup, refer to [docs/intel-arc-bmg-setup.md](docs/intel-arc-bmg-setup.md).

## Support
For feature requests, bugs or questions about the project, [open an issue](https://github.com/opea-project/Enterprise-Inference/issues) on the GitHub Issues page. Provide as much detail as possible, including steps to reproduce the issue, expected behavior, and actual behavior.

@@ -42,7 +46,7 @@ Intel® AI for Enterprise Inference is licensed under the [Apache License Versio
The [Security Policy](SECURITY.md) outlines our guidelines and procedures for ensuring the highest level of security and trust for our users who consume Intel® AI for Enterprise Inference.

## Trademark Information
Intel, the Intel logo, Xeon, and Gaudi are trademarks of Intel Corporation or its subsidiaries.
Intel, the Intel logo, Xeon, Gaudi, and Arc are trademarks of Intel Corporation or its subsidiaries.

* Other names and brands may be claimed as the property of others.
&copy; Intel Corporation
194 changes: 194 additions & 0 deletions core/helm-charts/vllm/bmg-values.yaml
@@ -0,0 +1,194 @@
# Copyright (C) 2025-2026 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Intel® Arc™ Battlemage (BMG) GPU optimized override values for vLLM deployments.
# This file contains BMG-specific overrides for Intel Arc B-series GPU (e.g., B580, B770).
# Requires the Intel GPU Plugin (intel-device-plugins-gpu) to be installed on the cluster.

# Intel XPU accelerator device (Arc GPU)
accelDevice: "xpu"
# Kubernetes resource name exposed by the Intel GPU device plugin
accelDeviceResource: "gpu.intel.com/xe"

block_size: 64 # XPU-optimised KV cache block size (must be >= 64 for 0.14.1-xpu IPEX chunked prefill)
max_num_seqs: 128 # Max concurrent sequences (tuned for Arc B-series VRAM)
max_seq_len_to_capture: 2048
d_type: "float16"
max_model_len: 8192

image:
  repository: intel/vllm
  tag: "0.14.1-xpu"
  pullPolicy: IfNotPresent
  command: ["vllm", "serve"]

# intel/vllm:0.14.1-xpu runs as root (no user defined in image)
podSecurityContext:
  fsGroup: 0
  runAsUser: 0

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
    add:
      - SYS_NICE
  readOnlyRootFilesystem: false
  runAsNonRoot: false
  runAsUser: 0

# Node affinity for BMG inference nodes
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: ei-inference-eligible
              operator: In
              values: ["true"]

# Intel XPU runtime settings
VLLM_NO_USAGE_STATS: 1
DO_NOT_TRACK: 1

# vLLM device backend - set via env var in 0.14.1-xpu (VLLM_TARGET_DEVICE=xpu is already baked in)
VLLM_WORKER_MULTIPROC_METHOD: "spawn"

LLM_MODEL_ID: "Qwen/Qwen2.5-Coder-3B-Instruct"

modelConfigs:

  "meta-llama/Llama-3.1-8B-Instruct":
    configMapValues:
      VLLM_NO_USAGE_STATS: "1"
      DO_NOT_TRACK: "1"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
      HF_HUB_DISABLE_XET: "1"
    extraCmdArgs:
      [
        "--dtype", "float16",
        "--block-size", "64",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.90",
        "--max-num-seqs", "128",
        "--enforce-eager",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "llama3_json",
      ]
    tensor_parallel_size: "1"
    pipeline_parallel_size: "1"

  "mistralai/Mistral-7B-Instruct-v0.3":
    configMapValues:
      VLLM_NO_USAGE_STATS: "1"
      DO_NOT_TRACK: "1"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
      HF_HUB_DISABLE_XET: "1"
    extraCmdArgs:
      [
        "--dtype", "float16",
        "--block-size", "64",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.90",
        "--max-num-seqs", "128",
        "--enforce-eager",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "mistral",
      ]
    tensor_parallel_size: "1"
    pipeline_parallel_size: "1"

  "deepseek-ai/DeepSeek-R1-Distill-Llama-8B":
    configMapValues:
      VLLM_NO_USAGE_STATS: "1"
      DO_NOT_TRACK: "1"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
      HF_HUB_DISABLE_XET: "1"
    extraCmdArgs:
      [
        "--dtype", "float16",
        "--block-size", "64",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.90",
        "--max-num-seqs", "128",
        "--enforce-eager",
      ]
    tensor_parallel_size: "1"
    pipeline_parallel_size: "1"

  "Qwen/Qwen2.5-7B-Instruct":
    configMapValues:
      VLLM_NO_USAGE_STATS: "1"
      DO_NOT_TRACK: "1"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
      HF_HUB_DISABLE_XET: "1"
    extraCmdArgs:
      [
        "--dtype", "float16",
        "--block-size", "64",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.90",
        "--max-num-seqs", "128",
        "--enforce-eager",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "hermes",
      ]
    tensor_parallel_size: "1"
    pipeline_parallel_size: "1"

  "Qwen/Qwen2.5-Coder-3B-Instruct":
    configMapValues:
      VLLM_NO_USAGE_STATS: "1"
      DO_NOT_TRACK: "1"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
      HF_HUB_DISABLE_XET: "1"
    extraCmdArgs:
      [
        "--dtype", "float16",
        "--block-size", "64",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.90",
        "--max-num-seqs", "128",
        "--enforce-eager",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "hermes",
      ]
    tensor_parallel_size: "1"
    pipeline_parallel_size: "1"

  "tiiuae/Falcon3-7B-Instruct":
    configMapValues:
      VLLM_NO_USAGE_STATS: "1"
      DO_NOT_TRACK: "1"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
      HF_HUB_DISABLE_XET: "1"
    extraCmdArgs:
      [
        "--dtype", "float16",
        "--block-size", "64",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.90",
        "--max-num-seqs", "128",
        "--enforce-eager",
      ]
    tensor_parallel_size: "1"
    pipeline_parallel_size: "1"

defaultModelConfigs:
  configMapValues:
    VLLM_NO_USAGE_STATS: "1"
    DO_NOT_TRACK: "1"
    VLLM_WORKER_MULTIPROC_METHOD: "spawn"
    HF_HUB_DISABLE_XET: "1"
  extraCmdArgs:
    [
      "--dtype", "float16",
      "--block-size", "64",
      "--max-model-len", "8192",
      "--gpu-memory-utilization", "0.90",
      "--max-num-seqs", "128",
      "--enforce-eager",
    ]
  tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
  pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
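
The `block_size` comment near the top of this file says the 0.14.1-xpu image needs a KV cache block size of at least 64. A minimal sketch of a pre-flight check for one `extraCmdArgs` list could look like the following. `check_block_size` is a hypothetical helper, not part of the chart, and it assumes the flag and its value are passed as separate arguments, as they are in this file.

```shell
# Hypothetical pre-flight check (not part of the chart).
# Usage: check_block_size MIN ARG...
# Returns non-zero if a --block-size flag is present with a value below MIN.
check_block_size() {
    min=$1
    shift
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "--block-size" ]; then
            # A flag with no value after it is treated as an error.
            if [ "$#" -lt 2 ]; then
                echo "--block-size has no value" >&2
                return 1
            fi
            if [ "$2" -lt "$min" ]; then
                echo "--block-size $2 is below the minimum of $min" >&2
                return 1
            fi
            return 0
        fi
        shift
    done
    # No --block-size flag was passed, so there is nothing to validate.
    return 0
}
```

Calling `check_block_size 64 --dtype float16 --block-size 64 --enforce-eager` succeeds, while the same call with `--block-size 16` fails with a message on stderr.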
13 changes: 7 additions & 6 deletions core/helm-charts/vllm/templates/deployment.yaml
@@ -126,11 +126,12 @@ spec:
{{- if .Values.image.pullPolicy }}
imagePullPolicy: {{ .Values.image.pullPolicy }}
{{- end }}
# command:
# - /bin/bash
# - -c
# - |
# python3 -m vllm.entrypoints.openai.api_server --dtype {{ .Values.d_type }} --model {{ .Values.LLM_MODEL_ID }} --port {{ .Values.port }} --tensor-parallel-size {{ .Values.tensor_parallel_size }} --block-size {{ .Values.block_size }} --max-model-len {{ .Values.max_model_len }} --disable-log-requests
{{- if .Values.image.command }}
command:
{{- range .Values.image.command }}
- {{ . | quote }}
{{- end }}
{{- end }}
args:
{{- $modelConfig := (index .Values.modelConfigs $modelName | default dict) }}
{{- $modelArgs := $modelConfig.extraCmdArgs | default .Values.defaultModelConfigs.extraCmdArgs }}
@@ -195,7 +196,7 @@ spec:
{{- end }}
{{- else }}
limits:
habana.ai/gaudi: {{ .Values.tensor_parallel_size | default (index .Values.modelConfigs .Values.LLM_MODEL_ID | default dict).tensor_parallel_size | default .Values.defaultModelConfigs.tensor_parallel_size | quote}}
{{ .Values.accelDeviceResource | default "habana.ai/gaudi" }}: {{ .Values.tensor_parallel_size | default (index .Values.modelConfigs .Values.LLM_MODEL_ID | default dict).tensor_parallel_size | default .Values.defaultModelConfigs.tensor_parallel_size | quote}}
{{- end }}
{{- end }}

2 changes: 2 additions & 0 deletions core/helm-charts/vllm/values.yaml
@@ -17,6 +17,8 @@ autoscaling:

# empty for CPU (longer latencies are tolerated before HPA scaling unaccelerated service)
accelDevice: ""
# Kubernetes resource name for the accelerator (e.g. habana.ai/gaudi, gpu.intel.com/xe)
accelDeviceResource: ""

port: 2080
shmSize: 1Gi
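
The chart default for `accelDeviceResource` is an empty string, and the limits line in `deployment.yaml` pipes it through `default "habana.ai/gaudi"`. Helm's `default` function treats an empty string as unset, so existing Gaudi deployments keep their resource key unless an override file such as `bmg-values.yaml` sets a different one. The sketch below mirrors that behaviour with shell parameter expansion; `resolve_accel_resource` is a hypothetical name used only for illustration.

```shell
# Sketch of the fallback used in deployment.yaml:
#   {{ .Values.accelDeviceResource | default "habana.ai/gaudi" }}
# resolve_accel_resource is a hypothetical helper, not part of the repository.
resolve_accel_resource() {
    # ${1:-...} substitutes the default when $1 is unset OR empty,
    # which matches how Helm's `default` handles an empty string.
    printf '%s\n' "${1:-habana.ai/gaudi}"
}

resolve_accel_resource ""                  # prints habana.ai/gaudi (values.yaml default)
resolve_accel_resource "gpu.intel.com/xe"  # prints gpu.intel.com/xe (bmg-values.yaml)
```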
10 changes: 6 additions & 4 deletions core/inference-stack-deploy.sh
100644 → 100755
@@ -50,7 +50,7 @@ NC=$(tput sgr0)
# --keycloak-admin-password <password>: The Keycloak admin password.
# --hugging-face-token <token>: The token for Huggingface.
# --models <models>: The models to deploy (comma-separated list of model numbers or names).
# --cpu-or-gpu <c/g>: Specify whether to run on CPU or GPU.
# --device <cpu/hpu/xpu>: Specify the target device. 'cpu' for Xeon, 'hpu' for Gaudi GPU, 'xpu' for Intel Arc Battlemage GPU.

# Main Menu

@@ -86,7 +86,7 @@ NC=$(tput sgr0)

# Example
# To perform a fresh installation with specific parameters, you can run:
# ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/to/cert.pem" --key-file "/path/to/key.pem" --keycloak-client-id "my-client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --cpu-or-gpu "g"
# ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/to/cert.pem" --key-file "/path/to/key.pem" --keycloak-client-id "my-client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --device "hpu"

##############################################################################

@@ -118,6 +118,7 @@ source "$SCRIPT_DIR/lib/cluster/drv-fw-update.sh"
# Components deployment
source "$SCRIPT_DIR/lib/components/kubernetes-setup.sh"
source "$SCRIPT_DIR/lib/components/intel-base-operator.sh"
source "$SCRIPT_DIR/lib/components/intel-gpu-plugin.sh"
source "$SCRIPT_DIR/lib/components/ingress-controller.sh"
source "$SCRIPT_DIR/lib/components/keycloak-controller.sh"
source "$SCRIPT_DIR/lib/components/genai-gateway-controller.sh"
@@ -166,10 +167,11 @@ Options:
--keycloak-admin-password <pw> Keycloak admin password.
--hugging-face-token <token> Huggingface token.
--models <models> Models to deploy (comma-separated).
--cpu-or-gpu <c/g> Run on CPU (c) or GPU (g).
--device <cpu/hpu/xpu> Target device: cpu (Xeon), hpu (Gaudi GPU), xpu (Intel Arc Battlemage GPU).

Examples:
Setup cluster: ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --cpu-or-gpu "g"
Setup cluster (Gaudi GPU): ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "1,3,5" --device "hpu"
Setup cluster (BMG GPU): ./inference-stack-deploy.sh --cluster-url "https://example.com" --cert-file "/path/cert.pem" --key-file "/path/key.pem" --keycloak-client-id "client-id" --keycloak-admin-user "user" --keycloak-admin-password "password" --hugging-face-token "token" --models "36" --device "xpu"

###############################################################################
EOF
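
The `--device` option replaces `--cpu-or-gpu` and takes `cpu`, `hpu`, or `xpu`, while the comment in `inference-config.cfg` also lists aliases (`gpu`, `gaudi2`, `gaudi3`, `bmg`). One way such input could be normalised is sketched below. `normalize_device` is a hypothetical function, and its case-insensitive matching and error message are assumptions rather than the script's actual behaviour.

```shell
# Hypothetical helper: map a user-supplied device name to cpu, hpu, or xpu.
# The alias list follows the comment in inference-config.cfg.
normalize_device() {
    # Lower-case the input so "XPU" and "xpu" are treated the same.
    _device=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$_device" in
        cpu)
            echo "cpu" ;;
        hpu|gpu|gaudi2|gaudi3)
            echo "hpu" ;;
        xpu|bmg)
            echo "xpu" ;;
        *)
            echo "Unsupported device: $1 (expected cpu, hpu, or xpu)" >&2
            return 1
            ;;
    esac
}
```

With this sketch, `normalize_device bmg` prints `xpu` and `normalize_device gaudi3` prints `hpu`; an unknown value returns a non-zero status so the caller can stop before any deployment step runs.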
23 changes: 7 additions & 16 deletions core/inventory/hosts.yaml
@@ -1,28 +1,19 @@
all:
  hosts:
    master:
      ansible_host: "{{ private_ip_control_plane_node }}"
      ansible_user: "username_of_user_running_automation"
      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
    worker1:
      ansible_host: "{{ private_ip_workload_node_1 }}"
      ansible_user: "username_of_user_running_automation"
      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
    worker2:
      ansible_host: "{{ private_ip_workload_node_2 }}"
      ansible_user: "username_of_user_running_automation"
      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
    master1:
      ansible_connection: local
      ansible_user: gta
      ansible_become: true
  children:
    kube_control_plane:
      hosts:
        master:
        master1:
    kube_node:
      hosts:
        worker1:
        worker2:
        master1:
    etcd:
      hosts:
        master:
        master1:
    k8s_cluster:
      children:
        kube_control_plane:
18 changes: 11 additions & 7 deletions core/inventory/inference-config.cfg
@@ -2,14 +2,15 @@ cluster_url=api.example.com
cert_file=~/certs/cert.pem
key_file=~/certs/key.pem
keycloak_client_id=my-client-id
keycloak_admin_user=your-keycloak-admin-user
keycloak_admin_user=admin
keycloak_admin_password=changeme
hugging_face_token=your_hugging_face_token
hugging_face_token_falcon3=your_hugging_face_token
models=11
cpu_or_gpu=cpu
hugging_face_token=
hugging_face_token_falcon3=
models=36
# Hardware selection: 'cpu' for Xeon CPU, 'hpu'/'gpu'/'gaudi2'/'gaudi3' for Gaudi GPU, 'xpu'/'bmg' for Intel Arc Battlemage GPU
device=xpu
vault_pass_code=place-holder-123
deploy_kubernetes_fresh=on
deploy_kubernetes_fresh=yes
deploy_ingress_controller=on
deploy_keycloak_apisix=on
deploy_genai_gateway=off
@@ -20,4 +21,7 @@ deploy_istio=off
uninstall_ceph=off

# Agentic AI Plugin
deploy_agenticai_plugin=off
deploy_agenticai_plugin=off
http_proxy=""
https_proxy=""
no_proxy="localhost,127.0.0.1,10.96.0.0/12,10.244.0.0/16,192.168.0.0/16,.svc,.cluster.local,10.233.0.1,10.233.0.0/18,10.0.0.0/8,api.example.com"
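
The config file is a flat list of `key=value` lines with `#` comments, and the new proxy entries wrap their values in double quotes. A minimal sketch of reading a single key from such a file is shown below. `cfg_get` is a hypothetical helper, not the parser the deploy script uses, and it assumes keys contain only letters, digits, and underscores.

```shell
# Hypothetical reader for key=value files such as inference-config.cfg.
# Usage: cfg_get FILE KEY
cfg_get() {
    # Keep only lines that start with "KEY=" (commented lines do not match),
    # take the last one if the key is repeated, then strip one pair of
    # surrounding double quotes if present.
    sed -n "s/^$2=//p" "$1" | tail -n 1 | sed 's/^"\(.*\)"$/\1/'
}
```

For the file in this hunk, `cfg_get core/inventory/inference-config.cfg device` would print `xpu`, and a key that is missing or empty yields an empty string.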