diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..4d2009e6 --- /dev/null +++ b/.gitignore @@ -0,0 +1,11 @@ +# Local investigation / working notes — not for upstream +FIXES.md +INVESTIGATION.md +JOURNEY.md +REMAINING_WORK.md +UPSTREAM_BUG_REPORT.md + +# Security scan outputs +bandit-report.html +bandit-screen-output.txt +trivy-reports/ diff --git a/core/helm-charts/sglang/Chart.yaml b/core/helm-charts/sglang/Chart.yaml new file mode 100644 index 00000000..c1f9e636 --- /dev/null +++ b/core/helm-charts/sglang/Chart.yaml @@ -0,0 +1,17 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: v2 +name: sglang +description: A Helm chart for deploying SGLang inference server (Xeon CPU build) +type: application +version: 0.1.0 +appVersion: "v0.5.12-xeon" +keywords: + - sglang + - xeon + - cpu + - llm + - inference + - gpt-oss + - openai-compatible diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md new file mode 100644 index 00000000..8ebc61aa --- /dev/null +++ b/core/helm-charts/sglang/README.md @@ -0,0 +1,643 @@ +# SGLang Helm Chart — Intel Xeon CPU + +## Overview + +Deploys [SGLang](https://github.com/sgl-project/sglang) on a Kubernetes +cluster as a model-agnostic inference server on Intel Xeon CPU nodes, +including the OPEA-standard nginx-ingress → APISIX → Keycloak (OIDC) +auth chain. + +The chart has **no built-in default model** — `modelSource` and +`modelName` must be supplied at install time, either via `--set` or a +values file. Model-specific recipes (helm command, values overrides, +model card) live under `third_party/Dell/model-deployment//`. +Notable example: **gpt-oss-20b**, which required a patched SGLang image +to work on CPU (see [Noteworthy: gpt-oss-20b](#noteworthy-gpt-oss-20b) +below). + +The chart targets a **patched** SGLang image (`enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`). 
+The most important patch (fix1) rebuilds `sgl-kernel` with the correct +AVX-512-BF16 / AMX-BF16 compile flags — the upstream +`lmsysorg/sglang:v0.5.12-xeon` ships the shared library without them, so +every bf16 forward pass crashes with `tinygemm_kernel_nn: scalar path +not implemented!` regardless of model. The remaining patches are +gpt-oss-specific and are runtime no-ops for other models. The image is +built once via a self-contained Dockerfile and imported directly into +the local containerd image store — no registry required. + +## Features + +- **Model-agnostic SGLang on Xeon CPU** — any HF model SGLang supports loads through the same chart +- **Patched image** that unblocks bf16 inference on Xeon (every model benefits) and adds MXFP4 + sinks-attention support for gpt-oss +- **OPEA-standard auth chain**: TLS at nginx, OIDC bearer validation at APISIX, token issuance by Keycloak +- **No external registry**: image builds locally into the cluster's containerd image store (works on both kubeadm/containerd and k3s) +- **OpenAI-compatible API**: `/v1/chat/completions`, `/v1/models`, `/v1/completions` +- **Chart-only delivery**: same standalone pattern as `core/helm-charts/ovms`, not yet wired into the Ansible playbooks + +--- + +## Which Scenario Applies to You? + +| | Scenario 1 | Scenario 2 | +|---|---|---| +| **Cluster setup** | OPEA Ansible playbooks already run | Fresh box — no existing cluster | +| **k3s / nginx / APISIX / Keycloak** | Already provisioned | You set them up manually | +| **Starting point** | Go to [Prerequisites](#prerequisites) | Go to [Scenario 2: k3s Bootstrap](#scenario-2-k3s-bootstrap-standalone-setup) | +| **Converges at** | [Build the Image](#build-the-image) | [Build the Image](#build-the-image) | + +Both scenarios use the same chart, the same image, and the same `helm install` command. +They differ only in how the cluster and auth stack are set up beforehand. 
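Whichever scenario applies, the patched image only helps on hosts whose CPUs expose the BF16 instruction sets named in the Overview. The sketch below is one possible pre-flight check, not part of the chart: it assumes a Linux host and the kernel's `/proc/cpuinfo` flag spellings (`avx512_bf16`, `amx_bf16`, `amx_tile`). The helper names are hypothetical.

```python
"""Pre-flight sketch: does this host advertise the BF16 instruction sets?"""
from pathlib import Path

# Flag names as the Linux kernel spells them in /proc/cpuinfo.
REQUIRED_FLAGS = ("avx512_bf16", "amx_bf16", "amx_tile")


def parse_cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """List the required flags that the given cpuinfo text does not advertise."""
    flags = parse_cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    missing = missing_flags(text)
    if missing:
        print("Missing CPU flags:", ", ".join(missing))
    else:
        print("All required BF16/AMX flags present")
```

Run it on the node that will host the pod. A non-empty "Missing CPU flags" line suggests the rebuilt `sgl-kernel` would hit illegal instructions there.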
+ +--- + +## Scenario 1: EI Deployment (OPEA Ansible Cluster) + +Use this path when your cluster was provisioned by the OPEA Ansible playbooks. +k3s, nginx-ingress, APISIX, Keycloak, the Keycloak edge routes, and the OIDC +client are already in place. Skip straight to **Prerequisites** and then +**Build the Image**. + +### Prerequisites + +- **Operating System**: Ubuntu 22.04+ +- **Hardware**: Intel Xeon with AVX-512-BF16 / AMX-BF16 (Sapphire Rapids, Emerald Rapids, Granite Rapids) +- **Memory**: ≥ 64 GiB RAM for mid-size models (gpt-oss-20b uses ~25 GiB dequantized + KV cache) +- **Disk**: ≥ 100 GiB free on the root partition +- **Kubernetes**: 1.24+ — validated on kubeadm/containerd (the cluster `inference-stack-deploy.sh` produces) and on k3s +- **Helm**: 3+ +- **NodePorts free on the host**: 30080, 30443 (nginx), 32080 (APISIX) +- **HuggingFace token** for gated models (e.g. `meta-llama/*`); not required for open models like `openai/gpt-oss-20b` or `Qwen/Qwen3-8B` +- **Sudo access** for the one-shot image build + +--- + +## Scenario 2: k3s Bootstrap (Standalone Setup) + +Use this path when you are starting from a **fresh single-node Ubuntu box** +with no existing Kubernetes cluster. The steps below reproduce the same +cluster shape the OPEA Ansible playbooks produce: k3s + nginx + Keycloak + +APISIX (with GatewayProxy/IngressClass wiring), the TLS secret in both +namespaces that need it, a `KC_HOSTNAME`-pinned Keycloak, the `/realms`, +`/admin`, and `/token` edge Ingresses, and the `my-client-id` OIDC client. + +After completing this scenario, `generate-token.sh` and the model deploy +work identically to the OPEA Ansible flow — both scenarios converge at +**Build the Image** below. 
+ +### S2.1 k3s + Helm + +```bash +sudo bash scripts/bootstrap-k3s.sh +export KUBECONFIG=$HOME/.kube/config +kubectl get nodes -o wide +helm version --short +``` + +The script installs k3s (`--disable traefik`), symlinks `kubectl`, copies +kubeconfig to `~/.kube/config`, and installs Helm 3. + +### S2.2 nginx-ingress + +```bash +helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx +helm install ingress-nginx ingress-nginx/ingress-nginx \ + -n ingress-nginx --create-namespace \ + --set controller.service.type=NodePort \ + --set controller.service.nodePorts.http=30080 \ + --set controller.service.nodePorts.https=30443 \ + --set controller.admissionWebhooks.enabled=false \ + --set controller.ingressClassResource.default=true + +kubectl wait --for=condition=ready pod -n ingress-nginx \ + -l app.kubernetes.io/component=controller --timeout=120s +``` + +### S2.3 Keycloak (dev mode) + +```bash +kubectl apply -f - <<'EOF' +apiVersion: apps/v1 +kind: Deployment +metadata: { name: keycloak, namespace: default } +spec: + replicas: 1 + selector: { matchLabels: { app: keycloak } } + template: + metadata: { labels: { app: keycloak } } + spec: + containers: + - name: keycloak + image: quay.io/keycloak/keycloak:26.0 + args: ["start-dev"] + env: + - { name: KEYCLOAK_ADMIN, value: admin } + - { name: KEYCLOAK_ADMIN_PASSWORD, value: admin } + - { name: KC_HTTP_RELATIVE_PATH, value: "/" } + - { name: KC_PROXY_HEADERS, value: xforwarded } + # Pin the issuer hostname so tokens are always stamped with the + # cluster-internal name, no matter which edge hostname the + # request came in on. APISIX validates the `iss` claim against + # this hostname (chart's oidc.discovery default). 
+ - { name: KC_HOSTNAME, value: "http://keycloak.default.svc.cluster.local" } + - { name: KC_HOSTNAME_STRICT, value: "false" } + - { name: KC_HOSTNAME_BACKCHANNEL_DYNAMIC, value: "false" } + ports: [{ containerPort: 8080, name: http }] +--- +apiVersion: v1 +kind: Service +metadata: { name: keycloak, namespace: default } +spec: + selector: { app: keycloak } + ports: [{ port: 80, targetPort: 8080 }] +EOF +kubectl wait --for=condition=ready pod -l app=keycloak --timeout=300s +``` + +Create the OIDC client. The `clientId` and `secret` here must exactly +match what you'll later pass to the chart via `--set oidc.clientId=...` +and `--set oidc.clientSecret=...`. The values below are the defaults used +by the gpt-oss-20b deployment guide — substitute your own for any +non-test deployment. + +```bash +ADMIN=$(kubectl run kc-admin --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d "client_id=admin-cli" -d "username=admin" -d "password=admin" -d "grant_type=password"' \ + | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") + +CLIENT_ID=my-client-id +CLIENT_SECRET=tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR + +kubectl run kc-create --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c "curl -sS -X POST -H 'Authorization: Bearer $ADMIN' \ + -H 'Content-Type: application/json' \ + http://keycloak.default.svc.cluster.local/admin/realms/master/clients \ + -d '{\"clientId\":\"${CLIENT_ID}\",\"secret\":\"${CLIENT_SECRET}\",\"serviceAccountsEnabled\":true,\"publicClient\":false,\"directAccessGrantsEnabled\":true}'" +``` + +Verify the client was created: + +```bash +kubectl run kc-check --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c "curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d 'client_id=${CLIENT_ID}' -d 
'client_secret=${CLIENT_SECRET}' -d 'grant_type=client_credentials'" \ + | head -c 80 +# expect: JSON with "access_token":"..." +``` + +### S2.4 APISIX + +The Apache APISIX chart installs the dataplane + etcd + ingress +controller. On v2 of the ingress controller (current as of this writing) +you additionally need a `GatewayProxy` CR and an `IngressClass` whose +`parameters` reference it — without those, the controller silently drops +every `ApisixRoute` and the chart's route ends up unreachable. + +```bash +helm repo add apisix https://charts.apiseven.com +helm install auth-apisix apisix/apisix \ + -n auth-apisix --create-namespace \ + --set service.type=NodePort \ + --set ingress-controller.enabled=true \ + --set ingress-controller.config.apisix.serviceNamespace=auth-apisix + +kubectl wait --for=condition=ready pod -n auth-apisix --all --timeout=300s + +# Grab the admin key the chart generated for the dataplane +ADMIN_KEY=$(helm get values auth-apisix -n auth-apisix --all \ + | python3 -c "import sys,yaml; print(yaml.safe_load(sys.stdin)['apisix']['admin']['credentials']['admin'])") +echo "APISIX admin key: $ADMIN_KEY" + +# Create the GatewayProxy that the ingress controller will use as its +# dataplane handle. +kubectl apply -f - < **Both scenarios converge here.** Whether your cluster came from the OPEA +> Ansible playbooks (Scenario 1) or from the k3s bootstrap above (Scenario 2), +> the image build and all subsequent steps are identical. + +```bash +git clone https://github.com/cld2labs/Enterprise-Inference.git +cd Enterprise-Inference +git checkout cld2labs/sglang-gpt-oss + +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +``` + +First run takes ~5–10 minutes. The script auto-detects the runtime: + +- **kubeadm + containerd** (OPEA Ansible-deployed clusters): builds via + `nerdctl` directly into containerd's `k8s.io` namespace. Installs + `buildkit` from upstream GitHub on demand if it isn't already present. 
+- **k3s**: installs `docker.io` on demand, builds, then + `docker save | k3s ctr images import -`. + +In both cases the image lands where kubelet pulls from. Verify with +whichever tool matches your runtime: + +```bash +# kubeadm / containerd +sudo nerdctl --namespace k8s.io images | grep enterprise-inference/sglang + +# k3s +sudo k3s ctr images ls | grep enterprise-inference/sglang +``` + +Either way the expected line is `enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`. + +## Deploy a Model + +`modelSource` and `modelName` are required at install time. The chart +template fails fast if either is empty. + +### Generic install + +```bash +helm install ./core/helm-charts/sglang \ + --set modelSource="" \ + --set modelName="" \ + --set huggingface.token="$HF_TOKEN" # only if the model is gated +``` + +### Model-specific recipes + +Models that need additional configuration ship with their own values file +and deployment guide: + +| Model | Deployment guide | +| ----- | ---------------- | +| `openai/gpt-oss-20b` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | + +The deployment guide carries the full `helm install` command line for +that model — all model-specific flags (parsers, attention backend, +extraArgs) come through as `--set` overrides. The chart's own +`values.yaml` stays model-agnostic. 
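For scripted deployments, the generic install can be wrapped so that the "both values are required" rule is enforced before Helm is invoked. This is a minimal sketch under stated assumptions: the release name `qwen3` and served name `qwen3-8b` are hypothetical, `build_helm_install` is an illustrative helper rather than anything shipped in the repo, and only the `--set` keys documented in this README are used.

```python
import shlex

# Chart path as used in the generic install command.
CHART_PATH = "./core/helm-charts/sglang"


def build_helm_install(release, model_source, model_name, hf_token=None, extra_sets=None):
    """Return the argv list for `helm install`, rejecting empty required values early."""
    if not release:
        raise ValueError("a release name is required")
    if not model_source or not model_name:
        raise ValueError("modelSource and modelName are both required")
    argv = [
        "helm", "install", release, CHART_PATH,
        "--set", f"modelSource={model_source}",
        "--set", f"modelName={model_name}",
    ]
    if hf_token:  # only needed for gated models
        argv += ["--set", f"huggingface.token={hf_token}"]
    # Note: values containing commas would need escaping for --set; not handled here.
    for key, value in (extra_sets or {}).items():
        argv += ["--set", f"{key}={value}"]
    return argv


if __name__ == "__main__":
    # Hypothetical release and served names, for illustration only.
    cmd = build_helm_install("qwen3", "Qwen/Qwen3-8B", "qwen3-8b")
    print(shlex.join(cmd))
```

Returning an argv list keeps quoting out of the picture; pass it to `subprocess.run(cmd, check=True)` once the printed command looks right.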
+ +Wait for the pod (first start downloads the weights — duration depends +on model size and network): + +```bash +kubectl wait --for=condition=ready pod -l app=sglang --timeout=600s +kubectl logs -l app=sglang --tail=5 +# expect: INFO: Uvicorn running on http://0.0.0.0:30000 +``` + +## Inference + +### Smoke test (no auth, via port-forward) + +```bash +kubectl port-forward svc/-sglang 30000:30000 & +sleep 2 + +curl -sS http://localhost:30000/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "", + "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], + "max_tokens": 150, + "temperature": 0.3 + }' | python3 -m json.tool +``` + +### Auth-routed call (nginx → APISIX → Keycloak → sglang) + +Fetch a token from inside the cluster (so the `iss` claim matches what +APISIX validates against), then call through the ingress: + +```bash +TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d "client_id=my-client-id" \ + -d "client_secret=" \ + -d "grant_type=client_credentials"' \ + | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") + +curl -sSk https://localhost:30443/-sglang/v1/chat/completions \ + -H "Host: api.example.com" \ + -H "Authorization: Bearer $TOKEN" \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "", + "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], + "max_tokens": 150, + "temperature": 0.3 + }' | python3 -m json.tool +``` + +### API endpoints + +| Endpoint | Description | +|----------|-------------| +| `/v1/models` | List loaded models | +| `/v1/chat/completions` | OpenAI-compatible chat completions | +| `/v1/completions` | OpenAI-compatible text completions | +| `/health` | Liveness probe | + +## Configuration + +### Key values + +| Key | Default | Description | 
+|-----|---------|-------------| +| `image.repository` | `enterprise-inference/sglang` | Patched image (set to `lmsysorg/sglang` to use upstream, but bf16 inference will crash) | +| `image.tag` | `v0.5.12-xeon-fix11-debug` | Pinned to the validated build | +| `image.pullPolicy` | `IfNotPresent` | Set to `Never` if the image is only in local containerd | +| `modelSource` | _(required)_ | HuggingFace repo to load (chart fails to render if empty) | +| `modelName` | _(required)_ | Served name (also used in route URI; chart fails if empty) | +| `server.dtype` | `bfloat16` | Compute dtype | +| `server.extraArgs` | `[]` | Extra CLI flags to `sglang serve` | +| `server.maxTotalTokens` | `32768` | Caps KV-cache memory (SGLang reads host RAM, not cgroup limits) | +| `extraEnv` | `[MXFP4_NIBBLE_ORDER=low_first]` | Env vars; the default is required for MXFP4 models and a runtime no-op for others | +| `oidc.enabled` | `true` | Enable APISIX `openid-connect` plugin | +| `apisixRoute.enabled` | `true` | Create `ApisixRoute` for the service | +| `ingress.enabled` | `true` | Create `Ingress` for the service | +| `huggingface.token` | `""` | Required for gated models (e.g. `meta-llama/*`) | + +The complete configuration surface is documented inline in `values.yaml`. + +### Debug env vars (off by default, baked into the image) + +| Variable | Effect | Applies to | +| -------- | ------ | ---------- | +| `ALLOW_FP32_MXFP4=1` | Lets you pass `--dtype float32` with MXFP4 models | MXFP4 models only | +| `MXFP4_OUT_DTYPE=float32\|float16\|bfloat16` | Dequant output dtype | MXFP4 models only | +| `FP32_PROMOTE_MOE=1` | Compute per-expert MoE forward in fp32 | MoE models only | +| `--kv-cache-dtype float32` | Allowed by our patched allowlist (allocates fp32 KV) | All models | + +These were used during a precision investigation A/B; see commit history +on `cld2labs/sglang-gpt-oss` for context. 
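To see why `MXFP4_NIBBLE_ORDER` matters, consider how one MXFP4 block is unpacked. The sketch below follows the public MX format (4-bit E2M1 elements, two per byte, sharing a power-of-two E8M0 scale). It assumes `low_first` means the low nibble of each byte holds the earlier element. It is a pure-Python illustration and not the code in `enable-gpt-oss-cpu-dequant-v2.py`; the function names are hypothetical.

```python
import os

# Magnitudes of the eight non-negative FP4 (E2M1) codes; bit 3 is the sign.
FP4_E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_fp4(code):
    """Decode one 4-bit E2M1 code into a Python float."""
    magnitude = FP4_E2M1[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequant_block(packed, scale_e8m0, nibble_order="low_first"):
    """Unpack bytes holding two FP4 values each and apply the shared block scale."""
    if nibble_order not in ("low_first", "high_first"):
        raise ValueError(f"unknown nibble order: {nibble_order}")
    scale = 2.0 ** (scale_e8m0 - 127)  # E8M0 is a biased power-of-two exponent
    out = []
    for byte in packed:
        low, high = byte & 0x0F, byte >> 4
        first, second = (low, high) if nibble_order == "low_first" else (high, low)
        out.append(decode_fp4(first) * scale)
        out.append(decode_fp4(second) * scale)
    return out


if __name__ == "__main__":
    order = os.environ.get("MXFP4_NIBBLE_ORDER", "low_first")
    print(order, dequant_block(bytes([0x21, 0xF7]), 127, order))
```

Picking the wrong order swaps every adjacent pair of weights, which is consistent with the random-vocab gibberish symptom listed under Troubleshooting.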
+ +## What's Patched + +The image-build directory contains a series of small Python patches +applied to SGLang's installed source at image build time: + +| # | Patch | Scope | Purpose | +|---|-------|-------|---------| +| 1 | (Dockerfile step 1) | **All bf16 models** | Rebuild `sgl-kernel` with `-mavx512bf16 -mamx-bf16` so bf16 matmuls emit `vdpbf16ps` instead of crashing with "scalar path not implemented" | +| 2 | `enable-mxfp4-cpu.py` | MXFP4 models | Register `mxfp4` quantization for CPU (upstream gates it behind `is_cuda() or is_hip()`) | +| 2b | `enable-gpt-oss-cpu.py` | gpt-oss | Add `torch_native`/`intel_amx` to GptOss's CPU attention-backend allowlist | +| 3 | `enable-gpt-oss-cpu-loaders.py` | gpt-oss | Guard `.cuda()` calls in gpt-oss weight loaders for CPU-only torch | +| 4 | `enable-gpt-oss-cpu-moe.py` | MXFP4 MoE | Add a CPU branch to `Mxfp4MoEMethod` that dequants MXFP4 → bf16 at load time | +| 5 | `enable-cpu-sinks-attention.py` | sinks-attention models (gpt-oss) | Add sinks-attention support to `torch_native_backend` | +| 6/7 | `enable-gpt-oss-cpu-dequant-v2.py` | MXFP4 models | Self-contained MXFP4 dequant with explicit nibble-order control via `MXFP4_NIBBLE_ORDER` | +| 8 | `enable-gpt-oss-cpu-moe-v2.py` | gpt-oss | Route the MoE forward through `moe_forward_native` so gpt-oss's swiglu+α+clamp+biases is computed correctly | +| 9–11 | `enable-*-debug.py` | Debug knobs | Precision A/B knobs (off by default; see env-var table above) | + +Patch 1 is a **genuine upstream regression** that affects every Xeon +SGLang user, not just gpt-oss — the published image's `sgl-kernel` `.so` +contains zero AVX-512-BF16 instructions, so any bf16 forward pass +crashes with `tinygemm_kernel_nn: scalar path not implemented!`. + +Patches 2–8 are **gpt-oss-specific** in scope. They are runtime no-ops +for models that don't trigger them (e.g. 
a Qwen3 deployment never enters +the MXFP4 dequant path or the sinks-attention wrapper), so leaving them +baked into the image carries no cost for other models. + +## Noteworthy: gpt-oss-20b + +`openai/gpt-oss-20b` is the most complex model this chart serves and the +driver for most of the patch stack above. Specifically: + +- **MXFP4 quantization is GPU-gated upstream.** Patches 2, 4, 6/7 enable + it on CPU by registering the quantization method and adding a CPU + weight-load dequant path (MXFP4 → bf16 at startup). +- **gpt-oss uses sinks attention** (a learnable per-head scalar added to + the softmax denominator). No upstream CPU attention backend supports + it; patch 5 adds it to `torch_native_backend`. +- **MoE forward needs gpt-oss-specific math** (swiglu + α + clamp + + biases). Patch 8 routes through `moe_forward_native`, which handles + this correctly at the cost of throughput vs the AMX kernel. + +The full deployment recipe — model card, helm command, verification, +parameter reference — is in +[`third_party/Dell/model-deployment/gpt-oss-20b/`](../../third_party/Dell/model-deployment/gpt-oss-20b/). + +**Known limitations specific to gpt-oss-20b:** + +- **Long-form drift after ~150 tokens.** Output past ~150 tokens + collapses into broken tokens, emoji, and special-token leaks. A + precision A/B (fp32 per-expert MoE, fp32 KV cache, + `--enable-fp32-lm-head`) conclusively ruled out precision as the + cause. Surviving hypotheses: sliding-window-attention bookkeeping in + our patched `torch_native_backend`, or Harmony channel-switch + tokenization interacting with the sinks wrapper. +- **Throughput.** The chart routes through `moe_forward_native` for + correctness, not speed; expect ~4 tok/s. +- **No tensor parallelism.** Chart currently runs `--tp-size=1`. Setting + `--tp-size=2` to split across NUMA nodes should give multi-x speedup + but the patch stack has not been validated under TP. 
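The sinks-attention math behind patch 5 is small enough to show directly. The following is a minimal single-query, single-head sketch in pure Python, assuming the sink enters only the softmax denominator, as described above. It omits masking, batching, and sliding windows, and is not the patched `torch_native_backend` code; the function names are hypothetical.

```python
import math


def softmax_with_sink(scores, sink):
    """Softmax over scores with one extra sink logit in the denominator only."""
    m = max(max(scores), sink)  # shared max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


def attend_one_query(query, keys, values, sink):
    """Scaled dot-product attention for one query vector, with a sink logit."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    probs = softmax_with_sink(scores, sink)
    width = len(values[0])
    out = [sum(p * v[j] for p, v in zip(probs, values)) for j in range(width)]
    return out, probs


if __name__ == "__main__":
    out, probs = attend_one_query(
        [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]], sink=0.0
    )
    print("probs sum:", sum(probs), "output:", out)
```

With a finite sink the probabilities sum to less than 1, so the sink absorbs attention mass without contributing a value vector. A sink of `float("-inf")` reduces this to an ordinary softmax.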
+ +## Troubleshooting + +See [`third_party/Dell/model-deployment/sglang-troubleshooting.md`](../../third_party/Dell/model-deployment/sglang-troubleshooting.md) +for a symptom-indexed guide covering: + +- Gateway Timeout (504) on inference requests +- Response `content` field is null (gpt-oss Harmony format) +- "Unknown quantization method: mxfp4" at startup +- "scalar path not implemented!" on the first forward pass +- Random-vocab gibberish in `content` (nibble order) +- Long-form drift past ~150 tokens (gpt-oss) +- 401 Unauthorized from APISIX with a valid-looking token (issuer mismatch) + +Quick log + describe: + +```bash +kubectl logs -l app=sglang -f +kubectl describe pod -l app=sglang +``` + +### Stop / restart + +```bash +helm uninstall +kubectl delete pvc -l app.kubernetes.io/instance= # frees the model cache +``` + +## Project Structure + +``` +core/helm-charts/sglang/ +├── README.md # this file +├── Chart.yaml +├── values.yaml # full configuration surface +├── templates/ # Helm templates (Deployment, Service, PVC, Ingress, ApisixRoute, Secret) +└── image-build/ + ├── Dockerfile # FROM lmsysorg/sglang:v0.5.12-xeon + 11 patch steps + ├── build-and-import.sh # one-shot build + load into local containerd (kubeadm or k3s) + └── enable-*.py # patch scripts applied at image build time + +third_party/Dell/model-deployment/ +├── sglang-troubleshooting.md # symptom-indexed troubleshooting for the SGLang chart +└── gpt-oss-20b/ + ├── model-card.md # gpt-oss-20b model card + └── deployment.md # gpt-oss-20b deployment guide (carries the full helm command) +``` + +## References + +- [SGLang documentation](https://docs.sglang.io) +- [SGLang CPU server guide](https://docs.sglang.io/docs/hardware-platforms/cpu_server) +- [OpenAI gpt-oss model card](https://huggingface.co/openai/gpt-oss-20b) diff --git a/core/helm-charts/sglang/image-build/Dockerfile b/core/helm-charts/sglang/image-build/Dockerfile new file mode 100644 index 00000000..995533a7 --- /dev/null +++ 
b/core/helm-charts/sglang/image-build/Dockerfile @@ -0,0 +1,147 @@ +# Custom sglang xeon image with 12 fix steps layered onto the upstream image; the two foundational ones: +# 1. sgl-kernel rebuilt with AVX-512-BF16 / AMX flags so bf16 inference +# doesn't crash on the unimplemented tinygemm_kernel_nn stub. +# 2. mxfp4 quantization registered for CPU device so openai/gpt-oss-* +# can be loaded and served (it dequantizes to bf16 at weight-load +# time via gpt_oss._load_weights_mxfp4 → fp8_utils.dequant_mxfp4, +# which is pure PyTorch and CPU-friendly). +# +# Tested on Intel Xeon 6972P (Granite Rapids). +# Build: docker build -t enterprise-inference/sglang:v0.5.12-xeon-fix11-debug . + +FROM lmsysorg/sglang:v0.5.12-xeon + +# ---- 1) Rebuild sgl-kernel with proper CPU compile flags ---- +# The upstream image's published .so is compiled without -mavx512bf16, so +# the at::BFloat16 specialization of tinygemm_kernel_nn is effectively missing +# and falls through to a TORCH_CHECK(false, "scalar path not implemented!"). +# We rebuild it from the in-image source with the right flags. +ENV CMAKE_BF16_FLAGS="-march=sapphirerapids -mtune=native -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512bf16 -mamx-bf16 -mamx-int8 -mamx-tile -O3 -DNDEBUG" + +RUN bash -lc '\ + set -ex; \ + source /opt/.venv/bin/activate; \ + UV=/root/.local/bin/uv; \ + "$UV" pip install --no-deps scikit-build-core ninja cmake setuptools wheel pyproject_metadata pathspec; \ + cd /sgl-workspace/sglang/sgl-kernel; \ + cp pyproject_cpu.toml pyproject.toml; \ + export CMAKE_CXX_FLAGS="$CMAKE_BF16_FLAGS"; \ + export CMAKE_C_FLAGS="$CMAKE_BF16_FLAGS"; \ + export CMAKE_BUILD_PARALLEL_LEVEL=64; \ + export SKBUILD_CMAKE_ARGS="-DCMAKE_CXX_FLAGS=$CMAKE_BF16_FLAGS;-DCMAKE_C_FLAGS=$CMAKE_BF16_FLAGS;-DCMAKE_BUILD_TYPE=Release"; \ + "$UV" pip install --force-reinstall --no-deps --no-build-isolation -v . 
2>&1 | tail -20; \ + SO=$(find /opt/.venv -name "common_ops*.so" | head -1); \ + echo "=== rebuilt $SO ==="; \ + ls -la "$SO"; \ + BF16=$(objdump -d "$SO" 2>/dev/null | grep -cE "vdpbf16ps|vfmadd.*bh" || true); \ + echo "AVX-512 BF16 instructions in rebuilt .so: $BF16"; \ + if [ "$BF16" -lt 100 ]; then echo "ERROR: rebuild did not emit BF16 instructions"; exit 1; fi \ +' + +# ---- 2) Patch quantization registration so mxfp4 works on CPU ---- +# The upstream code gates "mxfp4" behind is_cuda() or is_hip(); on CPU it +# never registers, and any model with quant_method=mxfp4 fails at config +# validation. The CPU dequant + bf16 forward path for gpt_oss already exists +# in the codebase (fp8_utils.dequant_mxfp4 + gpt_oss._load_weights_mxfp4) — +# the registration gate is the only missing piece. +COPY enable-mxfp4-cpu.py /tmp/enable-mxfp4-cpu.py +RUN /opt/.venv/bin/python3 /tmp/enable-mxfp4-cpu.py && rm /tmp/enable-mxfp4-cpu.py + +# Sanity check: after the patch, importing should not error and mxfp4 should +# be in QUANTIZATION_METHODS when SGLANG_USE_CPU_ENGINE=1. +RUN SGLANG_USE_CPU_ENGINE=1 /opt/.venv/bin/python3 -c "\ +from sglang.srt.layers.quantization import QUANTIZATION_METHODS, CPU_QUANTIZATION_METHODS; \ +assert 'mxfp4' in QUANTIZATION_METHODS, 'mxfp4 not in QUANTIZATION_METHODS'; \ +assert 'mxfp4' in CPU_QUANTIZATION_METHODS, 'mxfp4 not in CPU_QUANTIZATION_METHODS'; \ +print('OK: mxfp4 registered for CPU')" + +# ---- 3) Allow CPU attention backends for GptOssForCausalLM ---- +# server_args.py hardcodes an allowlist of attention backends for gpt-oss +# that omits CPU options. Patch it to default to intel_amx on CPU and to +# accept intel_amx / torch_native as valid backends. 
+COPY enable-gpt-oss-cpu.py /tmp/enable-gpt-oss-cpu.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu.py && rm /tmp/enable-gpt-oss-cpu.py + +# ---- 4) Make gpt_oss.py weight loaders CPU-safe ---- +# The model file hardcodes `.cuda()` and `torch.cuda.empty_cache/synchronize` +# in its MXFP4 weight loader and dequant helper, which abort on CPU-only torch. +# Guard each call with `if torch.cuda.is_available():`. +COPY enable-gpt-oss-cpu-loaders.py /tmp/enable-gpt-oss-cpu-loaders.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-loaders.py && rm /tmp/enable-gpt-oss-cpu-loaders.py + +# ---- 5) Wire a CPU forward path into Mxfp4MoEMethod ---- +# Mxfp4MoEMethod ships only GPU branches (marlin/cutlass/flashinfer/aiter/ +# triton_kernels). This patch adds a CPU branch that dequantizes MXFP4 -> bf16 +# at weight-loading time and then routes the MoE forward through +# `torch.ops.sgl_kernel.fused_experts_cpu` (the same kernel the unquantized +# bf16 MoE method already uses in unquant.py:forward_cpu). +COPY enable-gpt-oss-cpu-moe.py /tmp/enable-gpt-oss-cpu-moe.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-moe.py && rm /tmp/enable-gpt-oss-cpu-moe.py + +# ---- 6) Add sinks-attention support to torch_native_backend ---- +# gpt-oss uses sink attention (a learnable per-head scalar added to the softmax +# denominator). No CPU backend in sglang supports the `sinks` kwarg today. +# This patch adds it to torch_native_backend with the exact math sglang's own +# triton kernel uses (extend_attention.py lines 535-537). +COPY enable-cpu-sinks-attention.py /tmp/enable-cpu-sinks-attention.py +RUN /opt/.venv/bin/python3 /tmp/enable-cpu-sinks-attention.py && rm /tmp/enable-cpu-sinks-attention.py + +# ---- 7) Replace _process_weights_for_cpu with self-contained dequant ---- +# fix6 produced /generate 200 but with random-vocab output — classic signature +# of corrupted weights. 
Hypothesis: MXFP4 nibble packing order in gpt-oss's +# storage doesn't match what MXFP4QuantizeUtil uses. This patch swaps the +# dequant to a self-contained function with explicit control over nibble +# order via MXFP4_NIBBLE_ORDER env var ("low_first" is correct for gpt-oss +# per the numerical sanity check on layer-0 weight magnitudes). +COPY enable-gpt-oss-cpu-dequant-v2.py /tmp/enable-gpt-oss-cpu-dequant-v2.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-dequant-v2.py && rm /tmp/enable-gpt-oss-cpu-dequant-v2.py + +# ---- 8) Route Mxfp4MoEMethod.forward_cpu through moe_forward_native ---- +# After fix7 dequant produces sane weights, but fused_experts_cpu uses plain +# silu(gate)*up with no biases, no alpha, no clamp — wrong activation for +# gpt-oss → gibberish output. moe_forward_native is sglang's pure-Python MoE +# reference that already handles gpt-oss-specific swiglu (alpha + clamp + +# interleaved gate/up + (up+1)) and W13/W2 biases. Also strips the AMX-pack +# call because moe_forward_native uses F.linear on un-packed bf16 weights. +COPY enable-gpt-oss-cpu-moe-v2.py /tmp/enable-gpt-oss-cpu-moe-v2.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-moe-v2.py && rm /tmp/enable-gpt-oss-cpu-moe-v2.py + +# ---- 9) DEBUG: allow --dtype float32 with mxfp4 (for precision-drift A/B) ---- +# server_args.py hard-forces dtype=bfloat16 for mxfp4 models. Gate that behind +# ALLOW_FP32_MXFP4=1 so we can A/B bf16 vs fp32 for Phase 2 numerical +# investigation. Not for production — fp32 is 2x memory and significantly +# slower than the bf16 path. +COPY enable-fp32-override-debug.py /tmp/enable-fp32-override-debug.py +RUN /opt/.venv/bin/python3 /tmp/enable-fp32-override-debug.py && rm /tmp/enable-fp32-override-debug.py + +# ---- 10) DEBUG: make MXFP4-CPU dequant output dtype env-controlled ---- +# fix7's _process_weights_for_cpu hardcoded bf16 output. 
With fix9-debug +# allowing --dtype half, the dequant output dtype needs to match the rest +# of the model's compute dtype. Drive it from MXFP4_OUT_DTYPE env var. +COPY enable-dequant-dtype-debug.py /tmp/enable-dequant-dtype-debug.py +RUN /opt/.venv/bin/python3 /tmp/enable-dequant-dtype-debug.py && rm /tmp/enable-dequant-dtype-debug.py + +# ---- 11) DEBUG: fp32 promotion inside moe_forward_native per-expert loop ---- +# Phase 2 confirmed bf16 intermediate precision is a contributor to long-form +# drift (fp16 shifted the drift point ~30%). Option 1 in REMAINING_WORK.md: +# keep layer weights/KV in their native dtype, but compute the per-expert +# forward in fp32 — both F.linear matmuls, biases, and the swiglu chain — +# casting back only at the expert output boundary. Gated behind +# FP32_PROMOTE_MOE=1 so the image is safe to run with the flag off. +COPY enable-fp32-moe-promotion-debug.py /tmp/enable-fp32-moe-promotion-debug.py +RUN /opt/.venv/bin/python3 /tmp/enable-fp32-moe-promotion-debug.py && rm /tmp/enable-fp32-moe-promotion-debug.py + +# ---- 12) DEBUG: allow --kv-cache-dtype float32 end-to-end ---- +# Phase 2 Option 2 — add float32 to the argparse choices, map it through +# configure_kv_cache_dtype, and fix torch_native_backend's dtype-mismatch +# handler so it upcasts Q to fp32 instead of silently downcasting K/V back +# to bf16. With anything other than float32/fp32 selected, all three sites +# are byte-identical to upstream. 
+COPY enable-fp32-kv-cache-debug.py /tmp/enable-fp32-kv-cache-debug.py +RUN /opt/.venv/bin/python3 /tmp/enable-fp32-kv-cache-debug.py && rm /tmp/enable-fp32-kv-cache-debug.py + +# Mirror the upstream env vars so behavior is unchanged +ENV SGLANG_USE_CPU_ENGINE=1 +ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:/usr/lib/x86_64-linux-gnu/libtbbmalloc.so:/opt/.venv/lib/libiomp5.so + +WORKDIR /sgl-workspace/sglang diff --git a/core/helm-charts/sglang/image-build/build-and-import.sh b/core/helm-charts/sglang/image-build/build-and-import.sh new file mode 100755 index 00000000..c83f324d --- /dev/null +++ b/core/helm-charts/sglang/image-build/build-and-import.sh @@ -0,0 +1,101 @@ +#!/usr/bin/env bash +# One-shot script to build the patched sglang xeon image and load it +# into the local containerd image store, so the chart can use it without +# pushing to an external registry. +# +# Auto-detects the runtime: +# - OPEA / kubeadm-based clusters: containerd accessed via `nerdctl` +# under the `k8s.io` namespace (where kubelet pulls from). Built +# directly there; no separate import step. +# - k3s clusters: `docker build` then `docker save | k3s ctr images +# import -`. Installs docker.io if missing. +# +# Run with: sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +set -euo pipefail + +IMAGE_TAG="${IMAGE_TAG:-enterprise-inference/sglang:v0.5.12-xeon-fix11-debug}" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + +cd "$SCRIPT_DIR" + +if command -v nerdctl >/dev/null 2>&1 && command -v containerd >/dev/null 2>&1; then + RUNTIME=nerdctl +elif command -v k3s >/dev/null 2>&1; then + RUNTIME=k3s +else + echo "ERROR: neither nerdctl (kubeadm/containerd) nor k3s detected." >&2 + echo "Install one of them, or build manually and push to a registry." >&2 + exit 1 +fi + +echo "==> Detected container runtime: $RUNTIME" + +case "$RUNTIME" in + nerdctl) + # nerdctl needs buildkitd to run `nerdctl build`. 
buildkit isn't in + # Ubuntu apt — install from upstream GitHub releases (~30 MB). + if ! command -v buildctl >/dev/null 2>&1; then + BUILDKIT_VERSION="${BUILDKIT_VERSION:-v0.18.1}" + echo "==> Installing buildkit ${BUILDKIT_VERSION} from GitHub releases" + tmpdir=$(mktemp -d) + curl -fsSL \ + "https://github.com/moby/buildkit/releases/download/${BUILDKIT_VERSION}/buildkit-${BUILDKIT_VERSION}.linux-amd64.tar.gz" \ + | tar -xz -C "$tmpdir" + install -m 0755 "$tmpdir/bin/buildctl" /usr/local/bin/buildctl + install -m 0755 "$tmpdir/bin/buildkitd" /usr/local/bin/buildkitd + rm -rf "$tmpdir" + fi + if ! pgrep -x buildkitd >/dev/null 2>&1; then + echo "==> Starting buildkitd in the background" + mkdir -p /run/buildkit + nohup /usr/local/bin/buildkitd >/var/log/buildkitd.log 2>&1 & + for i in 1 2 3 4 5 6 7 8 9 10; do + [ -S /run/buildkit/buildkitd.sock ] && break + sleep 1 + done + [ -S /run/buildkit/buildkitd.sock ] || { + echo "buildkitd did not come up; see /var/log/buildkitd.log" >&2 + exit 1 + } + fi + + # nerdctl builds directly into containerd's image store. Pin namespace + # to k8s.io so kubelet can find the image without a separate import. + echo "==> Building $IMAGE_TAG via nerdctl (namespace k8s.io)" + nerdctl --namespace k8s.io build -t "$IMAGE_TAG" . + + echo "==> Verifying" + nerdctl --namespace k8s.io images "$IMAGE_TAG" --format '{{.Repository}}:{{.Tag}}' \ + | grep -F "$IMAGE_TAG" || { + echo "Image not found in containerd k8s.io namespace" >&2 + exit 1 + } + ;; + + k3s) + echo "==> Ensuring docker is installed" + if ! command -v docker >/dev/null 2>&1; then + apt-get update + DEBIAN_FRONTEND=noninteractive apt-get install -y docker.io + systemctl enable --now docker + fi + docker version --format 'Server: {{.Server.Version}}' + + echo "==> Building $IMAGE_TAG via docker" + docker build -t "$IMAGE_TAG" . 
+ + echo "==> Importing into k3s containerd" + docker save "$IMAGE_TAG" | k3s ctr images import - + + echo "==> Verifying" + k3s ctr images ls -q | grep -F "$IMAGE_TAG" || { + echo "Imported image not found in k3s containerd" >&2 + exit 1 + } + ;; +esac + +echo +echo "==> Done. Image $IMAGE_TAG is loaded in the local containerd image store." +echo "==> The chart's values.yaml already defaults to this tag with" +echo " pullPolicy: IfNotPresent. No further overrides required." diff --git a/core/helm-charts/sglang/image-build/enable-cpu-sinks-attention.py b/core/helm-charts/sglang/image-build/enable-cpu-sinks-attention.py new file mode 100644 index 00000000..20aca592 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-cpu-sinks-attention.py @@ -0,0 +1,219 @@ +"""Add sinks-attention forward support to torch_native_backend. + +gpt-oss uses sink attention (a learnable per-head scalar added to the softmax +denominator). sglang's GPU kernels (triton, fa3, trtllm_mha, aiter) accept a +`sinks` kwarg in their `forward_extend` / `forward_decode`, but none of the +CPU backends do (`intel_amx`, `torch_native`). + +This patch teaches `TorchNativeAttnBackend` to accept and apply sinks. The +math is exactly what sglang's own triton kernel does +(see srt/layers/attention/triton_ops/extend_attention.py lines 535-537): + + deno += exp(cur_sink - e_max) + +i.e. a fake extra "row" with logit = sinks[h] is included in the softmax +denominator but excluded from the value-weighted sum. With sinks the +attention probabilities sum to <1. + +Implementation: when sinks is provided, bypass PyTorch's SDPA (which has no +sinks API) and do attention manually in ~15 lines. Falls back to SDPA fast +path when sinks is None (zero perf cost for non-sink models). 
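A minimal sketch of the sink-augmented softmax described above, for a single query row in plain Python. The name `softmax_with_sink` and the sample logits are illustrative assumptions; they are not part of sglang or of this patch, which operates on batched tensors instead:

```python
import math


def softmax_with_sink(scores, sink):
    # Hypothetical helper, for illustration only.
    # scores: attention logits for one query row; sink: that head's sink logit.
    row_max = max(scores)
    exp_scores = [math.exp(s - row_max) for s in scores]
    # The sink joins the denominator but contributes no value vector.
    denom = sum(exp_scores) + math.exp(sink - row_max)
    return [e / denom for e in exp_scores]


weights = softmax_with_sink([1.0, 2.0, 3.0], sink=0.5)
print(sum(weights))  # strictly less than 1 because of the sink term
```

With the sink logit at negative infinity the extra term vanishes and the weights sum to 1 again, which corresponds to the sinks=None fast path.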
+""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/attention/torch_native_backend.py" +) +src = F.read_text() +original = src + +# 1) Add sinks=None kwarg to forward_extend and forward_decode signatures, and +# store it on self._sinks for the SDPA wrapper. +src = src.replace( + " def forward_extend(\n" + " self,\n" + " q,\n" + " k,\n" + " v,\n" + " layer: RadixAttention,\n" + " forward_batch: ForwardBatch,\n" + " save_kv_cache=True,\n" + " ):\n", + " def forward_extend(\n" + " self,\n" + " q,\n" + " k,\n" + " v,\n" + " layer: RadixAttention,\n" + " forward_batch: ForwardBatch,\n" + " save_kv_cache=True,\n" + " sinks=None,\n" + " ):\n" + " self._sinks = sinks\n", +) + +src = src.replace( + " def forward_decode(\n" + " self,\n" + " q,\n" + " k,\n" + " v,\n" + " layer: RadixAttention,\n" + " forward_batch: ForwardBatch,\n" + " save_kv_cache=True,\n" + " ):\n", + " def forward_decode(\n" + " self,\n" + " q,\n" + " k,\n" + " v,\n" + " layer: RadixAttention,\n" + " forward_batch: ForwardBatch,\n" + " save_kv_cache=True,\n" + " sinks=None,\n" + " ):\n" + " self._sinks = sinks\n", +) + +# 2) Replace the SDPA call(s) inside _run_sdpa_forward_extend / _run_sdpa_forward_decode +# with our sinks-aware wrapper. The wrapper is injected as a module-level +# function, and the existing call sites are rewritten textually to call it +# (no runtime monkey-patching of torch.nn.functional is involved). + +# Inject the wrapper immediately before the class definition.
+WRAPPER = ''' + +# ---- sinks-aware SDPA wrapper (added by enable-cpu-sinks-attention.py) ---- +import math as _math +def _sdpa_with_sinks(query, key, value, *, attn_mask=None, dropout_p=0.0, + is_causal=False, scale=None, enable_gqa=False, + sinks=None): + """Forward-only scaled_dot_product_attention with optional sinks. + + When sinks is None this is equivalent to torch's SDPA. + When sinks is a (H,) tensor of per-head scalars, the softmax denominator + is augmented by exp(sinks[h] - row_max) — i.e. an attention sink. + """ + if sinks is None: + return torch.nn.functional.scaled_dot_product_attention( + query, key, value, + attn_mask=attn_mask, dropout_p=dropout_p, + is_causal=is_causal, scale=scale, enable_gqa=enable_gqa, + ) + + # Manual attention path with sinks + # query/key/value: (B, H_q, Sq, D) and (B, H_kv, Sk, D) + if scale is None: + scale = 1.0 / _math.sqrt(query.shape[-1]) + + if enable_gqa and key.shape[-3] != query.shape[-3]: + # repeat KV heads to match Q heads + rep = query.shape[-3] // key.shape[-3] + key = key.repeat_interleave(rep, dim=-3) + value = value.repeat_interleave(rep, dim=-3) + + # scores: (B, H, Sq, Sk) + scores = torch.matmul(query, key.transpose(-2, -1)) * scale + + if is_causal: + Sq, Sk = scores.shape[-2], scores.shape[-1] + causal_mask = torch.ones(Sq, Sk, dtype=torch.bool, device=scores.device).tril( + diagonal=Sk - Sq + ) + scores = scores.masked_fill(~causal_mask, float("-inf")) + if attn_mask is not None: + if attn_mask.dtype == torch.bool: + scores = scores.masked_fill(~attn_mask, float("-inf")) + else: + scores = scores + attn_mask + + # Stable softmax with sinks + row_max = scores.amax(dim=-1, keepdim=True) + row_max = row_max.masked_fill(row_max == float("-inf"), 0.0) + exp_scores = torch.exp(scores - row_max) + # sinks: (H,) -> broadcast to (1, H, 1) so sink_exp is (B, H, Sq) + sinks_t = sinks.to(scores.dtype).to(scores.device).view(1, -1, 1) + sink_exp = torch.exp(sinks_t - row_max.squeeze(-1)) + denom = 
exp_scores.sum(dim=-1) + sink_exp # (B, H, Sq) + attn_weights = exp_scores / denom.unsqueeze(-1) + if dropout_p > 0.0: + attn_weights = torch.nn.functional.dropout(attn_weights, p=dropout_p) + return torch.matmul(attn_weights, value) +# ---- end sinks wrapper ---- +''' + +# Place the wrapper immediately before the class definition (a simple, stable anchor). +anchor_for_wrapper = "class TorchNativeAttnBackend(AttentionBackend):" +if anchor_for_wrapper not in src: + print("ERROR: class anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace( + anchor_for_wrapper, + WRAPPER + "\n" + anchor_for_wrapper, + 1, +) + +# 3) Route the existing SDPA calls through _sdpa_with_sinks with the stored sink. +# The class has at least two call sites for `scaled_dot_product_attention` +# inside _run_sdpa_forward_extend / _run_sdpa_forward_decode. Both fully- +# qualified and bare-name (imported) forms appear. Rewrite both. +src = src.replace( + "torch.nn.functional.scaled_dot_product_attention(", + "_sdpa_with_sinks(", +) +# The bare form: the file does `from torch.nn.functional import scaled_dot_product_attention` +# and calls it directly. Match those too. Use a word boundary via the preceding +# whitespace + name to avoid matching the import line itself. +import re as _re +src = _re.sub( + r"(? bf16 dequant. + +After fix4-fix6 the gpt-oss-20b pipeline ran end-to-end and returned 200, but +the generated tokens were random vocabulary, the classic signature of +corrupted weights producing essentially random logits. The dequant math in +`MXFP4QuantizeUtil.dequantize` is OCP-spec-compliant, but there is one +implementation choice that differs in the wild: the **nibble packing order** +inside each uint8.
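To make the packing question concrete, the sketch below splits one packed byte under both possible orderings. The helper name `unpack_byte` and the example byte 0xA3 are assumptions chosen for illustration; the order labels mirror the `MXFP4_NIBBLE_ORDER` values used by this patch:

```python
def unpack_byte(byte, order):
    # Hypothetical helper: split one packed uint8 into two 4-bit codes,
    # returned as (even_index_value, odd_index_value).
    low = byte & 0x0F          # bits 3:0
    high = (byte >> 4) & 0x0F  # bits 7:4
    if order == 'low_first':
        return low, high
    return high, low


print(unpack_byte(0xA3, 'low_first'))   # (3, 10)
print(unpack_byte(0xA3, 'high_first'))  # (10, 3)
```

Reading a tensor with the wrong ordering swaps every adjacent pair of 4-bit codes, which is why the choice matters for the whole weight matrix.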
+ +`MXFP4QuantizeUtil` uses: + even index <- low 4 bits + odd index <- high 4 bits + +while triton_kernels / NVIDIA's reference uses: + even index <- high 4 bits + odd index <- low 4 bits + +If gpt-oss is stored with the latter convention, our previous dequant +swapped every (even, odd) pair, producing structurally garbage weights. + +This patch: + +1. Inlines a self-contained `_dequant_mxfp4_cpu` function that: + - Has explicit control over nibble order via `MXFP4_NIBBLE_ORDER` env var + ("low_first" or "high_first"; default "high_first" — the triton_kernels + convention which is what gpt-oss is stored as) + - Logs basic stats (shape, dtype, min/max/mean abs) so we can verify the + dequantized weights look sane +2. Calls it from `_process_weights_for_cpu` instead of MXFP4QuantizeUtil. + +The function is conservative: it only changes the nibble extraction logic; +sign/magnitude/E2M1/scale math is identical to MXFP4QuantizeUtil. +""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py" +) +src = F.read_text() +original = src + +# Replace the body of _process_weights_for_cpu and add a helper. +# Anchor: the full helper as written by fix4's enable-gpt-oss-cpu-moe.py. 
+old_helper = ( + " def _process_weights_for_cpu(self, layer):\n" + " \"\"\"Dequantize MXFP4 -> bf16 then AMX-pack for fused_experts_cpu.\n" + "\n" + " Layer params after this call:\n" + " - layer.w13_weight: bf16, AMX-packed, shape (E, 2*N, K)\n" + " - layer.w2_weight: bf16, AMX-packed, shape (E, K, N)\n" + " - layer.w13_weight_scale / w2_weight_scale: deleted\n" + " \"\"\"\n" + " import torch\n" + " from torch.nn import Parameter\n" + " from sglang.srt.layers.quantization.mxfp4_tensor import (\n" + " MXFP4QuantizeUtil,\n" + " )\n" + " from sglang.srt.layers.amx_utils import (\n" + " _amx_process_weight_after_loading,\n" + " )\n" + "\n" + " def _dequant(weight, scale):\n" + " return MXFP4QuantizeUtil.dequantize(\n" + " quantized_data=weight,\n" + " dtype=torch.bfloat16,\n" + " scale=scale,\n" + " block_sizes=[32],\n" + " )\n" + "\n" + " w13_bf16 = _dequant(layer.w13_weight, layer.w13_weight_scale)\n" + " w2_bf16 = _dequant(layer.w2_weight, layer.w2_weight_scale)\n" + "\n" + " del layer.w13_weight\n" + " del layer.w2_weight\n" + " del layer.w13_weight_scale\n" + " del layer.w2_weight_scale\n" + " layer.w13_weight = Parameter(w13_bf16.contiguous(), requires_grad=False)\n" + " layer.w2_weight = Parameter(w2_bf16.contiguous(), requires_grad=False)\n" + "\n" + " _amx_process_weight_after_loading(layer, [\"w13_weight\", \"w2_weight\"])\n" +) + +new_helper = ''' def _process_weights_for_cpu(self, layer): + """Dequantize MXFP4 -> bf16 then AMX-pack for fused_experts_cpu. 
+ + Layer params after this call: + - layer.w13_weight: bf16, AMX-packed, shape (E, 2*N, K) + - layer.w2_weight: bf16, AMX-packed, shape (E, K, N) + - layer.w13_weight_scale / w2_weight_scale: deleted + """ + import os + import torch + from torch.nn import Parameter + from sglang.srt.layers.amx_utils import ( + _amx_process_weight_after_loading, + ) + + nibble_order = os.environ.get("MXFP4_NIBBLE_ORDER", "high_first").lower() + + # E2M1 lookup table (OCP MXFP4 spec) + _E2M1 = torch.tensor( + [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], + dtype=torch.float32, + ) + + def _dequant_mxfp4_cpu(weight_packed, scale_e8m0): + """Dequantize MXFP4 packed uint8 weights to bf16. + + weight_packed: (..., K_packed) uint8, where K_packed = K / 2 + (2 mxfp4 values per uint8 byte) + scale_e8m0: (..., K_blocks) uint8, where K_blocks = K / 32 + (one E8M0 scale per 32 elements) + + Returns: (..., K) bf16 + """ + assert weight_packed.dtype == torch.uint8 + assert scale_e8m0.dtype == torch.uint8 + device = weight_packed.device + e2m1 = _E2M1.to(device) + + # Extract the two nibbles per byte + low_nibble = (weight_packed & 0x0F) # bits 3:0 + high_nibble = (weight_packed >> 4) & 0x0F # bits 7:4 + + # Interleave to undo the packing + shape = list(weight_packed.shape) + shape[-1] = shape[-1] * 2 + unfused = torch.empty(shape, dtype=torch.uint8, device=device) + if nibble_order == "low_first": + # MXFP4QuantizeUtil convention: even <- low, odd <- high + unfused[..., 0::2] = low_nibble + unfused[..., 1::2] = high_nibble + else: + # triton_kernels / NVIDIA reference convention: + # even <- high, odd <- low + unfused[..., 0::2] = high_nibble + unfused[..., 1::2] = low_nibble + + # E2M1: bit 3 = sign, bits 2:0 = magnitude index + sign = 1.0 - 2.0 * ((unfused >> 3) & 1).float() + magnitude_idx = (unfused & 0x07).long() + values = e2m1[magnitude_idx] * sign + + # Apply E8M0 scale: each scale covers 32 consecutive elements + *batch_dims, K = values.shape + K_blocks = scale_e8m0.shape[-1] + if K != 
K_blocks * 32: + raise ValueError( + f"dequant shape mismatch: dequantized K={K}, " + f"K_blocks*32={K_blocks*32} from scale shape {tuple(scale_e8m0.shape)}" + ) + values = values.view(*batch_dims, K_blocks, 32) + scale_f = torch.exp2(scale_e8m0.float() - 127.0).unsqueeze(-1) + out = (values * scale_f).view(*batch_dims, K).to(torch.bfloat16) + return out + + import logging as _logging + _log = _logging.getLogger(__name__) + + w13_bf16 = _dequant_mxfp4_cpu(layer.w13_weight, layer.w13_weight_scale) + w2_bf16 = _dequant_mxfp4_cpu(layer.w2_weight, layer.w2_weight_scale) + + # One-line sanity log so we can see if the dequantized values look sane. + # Healthy bf16 model weights typically have |w| in [1e-3, ~1.0]; gibberish- + # producing weights often show abs-mean either suspiciously huge or near 0. + _log.info( + "[mxfp4-cpu-dequant] nibble_order=%s w13: shape=%s abs(min=%.4g, max=%.4g, mean=%.4g) " + "w2: shape=%s abs(min=%.4g, max=%.4g, mean=%.4g)", + nibble_order, + tuple(w13_bf16.shape), + float(w13_bf16.abs().min()), + float(w13_bf16.abs().max()), + float(w13_bf16.abs().float().mean()), + tuple(w2_bf16.shape), + float(w2_bf16.abs().min()), + float(w2_bf16.abs().max()), + float(w2_bf16.abs().float().mean()), + ) + + del layer.w13_weight + del layer.w2_weight + del layer.w13_weight_scale + del layer.w2_weight_scale + layer.w13_weight = Parameter(w13_bf16.contiguous(), requires_grad=False) + layer.w2_weight = Parameter(w2_bf16.contiguous(), requires_grad=False) + + _amx_process_weight_after_loading(layer, ["w13_weight", "w2_weight"]) +''' + +if old_helper not in src: + print("ERROR: old _process_weights_for_cpu helper not found " + "(was fix4 applied?)", file=sys.stderr) + sys.exit(1) +src = src.replace(old_helper, new_helper) + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-loaders.py 
b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-loaders.py new file mode 100644 index 00000000..51ef69b6 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-loaders.py @@ -0,0 +1,68 @@ +"""Patch gpt_oss.py to make its weight-loading paths CPU-safe. + +The model file hard-codes a handful of `.cuda()` / `torch.cuda.*` calls +in the MXFP4 weight loader and the dequant helper. On a CPU-only torch +those fail with `AssertionError: Torch not compiled with CUDA enabled`. + +We guard each call so it becomes a no-op on CPU and behaves exactly as +before on a CUDA host. + +Patched call sites: + - _load_mxfp4_experts_weights: weight = weight.cuda() + - set_embed_and_head: torch.cuda.empty_cache(); torch.cuda.synchronize() + - _dequant_mlp_weight: w_blocks = w_blocks.cuda(); w_scales = w_scales.cuda() +""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py" +) +src = F.read_text() +original = src + +substitutions = [ + # _load_mxfp4_experts_weights: weight = weight.cuda() + ( + " for name, weight in weights:\n" + " weight = weight.cuda()\n", + " for name, weight in weights:\n" + " if torch.cuda.is_available():\n" + " weight = weight.cuda()\n", + ), + # set_embed_and_head: torch.cuda.empty_cache / synchronize + ( + " self.lm_head.weight = head\n" + " torch.cuda.empty_cache()\n" + " torch.cuda.synchronize()\n", + " self.lm_head.weight = head\n" + " if torch.cuda.is_available():\n" + " torch.cuda.empty_cache()\n" + " torch.cuda.synchronize()\n", + ), + # _dequant_mlp_weight: w_blocks / w_scales .cuda() + ( + " w_blocks = w_blocks.cuda()\n" + " w_scales = w_scales.cuda()\n", + " if torch.cuda.is_available():\n" + " w_blocks = w_blocks.cuda()\n" + " w_scales = w_scales.cuda()\n", + ), +] + +for needle, replacement in substitutions: + if needle not in src: + print( + f"ERROR: patch site not found:\n---\n{needle}---", + file=sys.stderr, + ) + sys.exit(1) + src = 
src.replace(needle, replacement) + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe-v2.py b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe-v2.py new file mode 100644 index 00000000..686403d3 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe-v2.py @@ -0,0 +1,122 @@ +"""Reroute Mxfp4MoEMethod's CPU forward through sglang's reference +``moe_forward_native`` instead of ``fused_experts_cpu``. + +After fix7 we got gpt-oss-20b past dequant with sane numerics (low_first +nibble order), but the output was still gibberish. The cause: gpt-oss uses +a custom Swish-GLU activation: + + gate, up = x[..., ::2], x[..., 1::2] # INTERLEAVED gate/up + gate = gate.clamp(max=gemm1_limit) + up = up.clamp(min=-gemm1_limit, max=gemm1_limit) + out = gate * sigmoid(gate * gemm1_alpha) * (up + 1) + +plus per-expert biases on both W13 and W2. ``fused_experts_cpu`` only +implements plain ``silu(gate) * up`` with no alpha, no clamp, no biases. + +sglang already has a pure-PyTorch reference that handles all of this: +``sglang.srt.layers.moe.fused_moe_native.moe_forward_native``. It calls +``swiglu_gpt_oss_sigmoid_alpha`` (pure torch with @torch.compile) when +``gemm1_alpha`` is set, and adds W13/W2 biases when present on the layer. + +This patch: + +1. Removes the ``_amx_process_weight_after_loading`` call from + ``_process_weights_for_cpu`` — we no longer need AMX-packed weights + because ``moe_forward_native`` uses ``F.linear`` and ``torch.einsum`` + on plain bf16 weights. +2. Rewrites ``forward_cpu`` to delegate to ``moe_forward_native``. +""" + +import re +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py" +) +src = F.read_text() +original = src + +# 1. Strip the AMX-pack call from _process_weights_for_cpu. 
+src = src.replace( + " _amx_process_weight_after_loading(layer, [\"w13_weight\", \"w2_weight\"])\n", + " # _amx_process_weight_after_loading skipped: moe_forward_native uses\n" + " # plain F.linear / einsum, which expect un-packed (E, OUT, IN) bf16.\n", +) + +# 2. Replace the body of forward_cpu with a delegation to moe_forward_native. +# Anchor on the full forward_cpu added by fix4's enable-gpt-oss-cpu-moe.py. +old_forward = ( + " def forward_cpu(self, layer, dispatch_output):\n" + " \"\"\"Mirrors unquant.py:UnquantizedFusedMoEMethod.forward_cpu.\n" + "\n" + " After _process_weights_for_cpu has run, the layer's weights are\n" + " plain bf16 AMX-packed tensors, so the CPU MoE kernel can serve\n" + " them with the UNQUANT quant method.\n" + " \"\"\"\n" + " import torch\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + " from sglang.srt.layers.moe.topk import apply_topk_weights_cpu\n" + " from sglang.srt.layers.amx_utils import CPUQuantMethod\n" + "\n" + " x = dispatch_output.hidden_states\n" + " topk_output = dispatch_output.topk_output\n" + "\n" + " topk_weights, topk_ids, _ = topk_output\n" + " x, topk_weights = apply_topk_weights_cpu(\n" + " self.moe_runner_config.apply_router_weight_on_input,\n" + " topk_weights,\n" + " x,\n" + " )\n" + " output = torch.ops.sgl_kernel.fused_experts_cpu(\n" + " x,\n" + " layer.w13_weight,\n" + " layer.w2_weight,\n" + " topk_weights,\n" + " topk_ids,\n" + " False, # inplace\n" + " CPUQuantMethod.UNQUANT,\n" + " None, # w1_scale\n" + " None, # w2_scale\n" + " None, # w1_zp\n" + " None, # w2_zp\n" + " None, # block_size\n" + " True, # is_vnni\n" + " )\n" + " return StandardCombineInput(hidden_states=output)\n" +) + +new_forward = ( + " def forward_cpu(self, layer, dispatch_output):\n" + " \"\"\"CPU MoE forward via moe_forward_native (gpt-oss-aware).\n" + "\n" + " Uses sglang's reference pure-PyTorch MoE forward, which handles:\n" + " - W13 / W2 biases (gpt-oss has both)\n" + " - The 
gpt-oss-specific swiglu variant\n" + " (interleaved gate/up + sigmoid(alpha * gate) + clamp + (up+1))\n" + " when ``moe_runner_config.gemm1_alpha`` is set.\n" + " \"\"\"\n" + " from sglang.srt.layers.moe.fused_moe_native import moe_forward_native\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + "\n" + " output = moe_forward_native(\n" + " layer,\n" + " dispatch_output.hidden_states,\n" + " dispatch_output.topk_output,\n" + " self.moe_runner_config,\n" + " )\n" + " return StandardCombineInput(hidden_states=output)\n" +) + +if old_forward not in src: + print("ERROR: old forward_cpu not found (fix4 may have been changed)", + file=sys.stderr) + sys.exit(1) +src = src.replace(old_forward, new_forward) + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe.py b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe.py new file mode 100644 index 00000000..fa269d25 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe.py @@ -0,0 +1,190 @@ +"""Add a CPU forward path to sglang.srt.layers.quantization.mxfp4.Mxfp4MoEMethod. + +Upstream `Mxfp4MoEMethod` only ships GPU branches (Marlin, FlashInfer cutlass +SM90, FlashInfer TRT-LLM SM100, AMD aiter, NVIDIA triton_kernels). On CPU, +both its `process_weights_after_loading` and `apply` raise (the former tries +to `import triton_kernels`; the latter has no CPU branch at all). + +This patch: + +1. Adds a CPU branch at the top of `process_weights_after_loading` that: + a. Dequantizes the MXFP4-packed `w13_weight` / `w2_weight` to bf16 + using the pure-PyTorch `MXFP4QuantizeUtil.dequantize` helper that + already ships in `mxfp4_tensor.py`. + b. 
Calls `_amx_process_weight_after_loading` (the same helper that the + bf16 unquantized MoE method uses in `unquant.py:process_weights_after_loading`) + to AMX-pack the bf16 weights for `fused_experts_cpu`. + c. Returns early so none of the CUDA-only branches run. + +2. Adds a `forward_cpu` method that mirrors the unquantized bf16 MoE method's + CPU forward path (`unquant.py:forward_cpu`) verbatim — apply_topk_weights_cpu, + then `torch.ops.sgl_kernel.fused_experts_cpu(..., CPUQuantMethod.UNQUANT, ...)`. + +After this patch the weights are stored as bf16 inside the layer (the MXFP4 +packed storage is replaced), so the existing CPU `fused_experts_cpu` AMX +kernel handles them like any other bf16 MoE. +""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py" +) +src = F.read_text() +original = src + +# ----- 1. Insert CPU branch + helper into Mxfp4MoEMethod.process_weights_after_loading ----- +# +# Anchor on the first line of the existing method body. We prepend a CPU +# branch that does the dequant + AMX pack, then returns. Existing logic +# (marlin / cutlass / flashinfer / triton_kernels / `torch.cuda.empty_cache`) +# is untouched on GPU. + +needle_pwal = ( + " def process_weights_after_loading(self, layer):\n" + " if self.use_marlin:\n" +) +replacement_pwal = ( + " def process_weights_after_loading(self, layer):\n" + " # ---- CPU branch added by enable-gpt-oss-cpu-moe.py ----\n" + " from sglang.srt.utils import is_cpu, cpu_has_amx_support\n" + " if is_cpu() and cpu_has_amx_support():\n" + " self._process_weights_for_cpu(layer)\n" + " return\n" + " # ---- end CPU branch ----\n" + " if self.use_marlin:\n" +) +if needle_pwal not in src: + print("ERROR: process_weights_after_loading anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle_pwal, replacement_pwal) + +# ----- 2. 
Add the _process_weights_for_cpu helper + forward_cpu method right +# BEFORE the `def apply(` of Mxfp4MoEMethod (so they live on the class). +# Anchor on the exact apply signature we read from the running image. +needle_apply = ( + " def apply(\n" + " self,\n" + " layer: torch.nn.Module,\n" + " dispatch_output: StandardDispatchOutput,\n" + " ) -> CombineInput:\n" +) +new_methods = ( + " # ---- CPU methods added by enable-gpt-oss-cpu-moe.py ----\n" + " def _process_weights_for_cpu(self, layer):\n" + " \"\"\"Dequantize MXFP4 -> bf16 then AMX-pack for fused_experts_cpu.\n" + "\n" + " Layer params after this call:\n" + " - layer.w13_weight: bf16, AMX-packed, shape (E, 2*N, K)\n" + " - layer.w2_weight: bf16, AMX-packed, shape (E, K, N)\n" + " - layer.w13_weight_scale / w2_weight_scale: deleted\n" + " \"\"\"\n" + " import torch\n" + " from torch.nn import Parameter\n" + " from sglang.srt.layers.quantization.mxfp4_tensor import (\n" + " MXFP4QuantizeUtil,\n" + " )\n" + " from sglang.srt.layers.amx_utils import (\n" + " _amx_process_weight_after_loading,\n" + " )\n" + "\n" + " def _dequant(weight, scale):\n" + " return MXFP4QuantizeUtil.dequantize(\n" + " quantized_data=weight,\n" + " dtype=torch.bfloat16,\n" + " scale=scale,\n" + " block_sizes=[32],\n" + " )\n" + "\n" + " w13_bf16 = _dequant(layer.w13_weight, layer.w13_weight_scale)\n" + " w2_bf16 = _dequant(layer.w2_weight, layer.w2_weight_scale)\n" + "\n" + " del layer.w13_weight\n" + " del layer.w2_weight\n" + " del layer.w13_weight_scale\n" + " del layer.w2_weight_scale\n" + " layer.w13_weight = Parameter(w13_bf16.contiguous(), requires_grad=False)\n" + " layer.w2_weight = Parameter(w2_bf16.contiguous(), requires_grad=False)\n" + "\n" + " _amx_process_weight_after_loading(layer, [\"w13_weight\", \"w2_weight\"])\n" + "\n" + " def forward_cpu(self, layer, dispatch_output):\n" + " \"\"\"Mirrors unquant.py:UnquantizedFusedMoEMethod.forward_cpu.\n" + "\n" + " After _process_weights_for_cpu has run, the layer's weights 
are\n" + " plain bf16 AMX-packed tensors, so the CPU MoE kernel can serve\n" + " them with the UNQUANT quant method.\n" + " \"\"\"\n" + " import torch\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + " from sglang.srt.layers.moe.topk import apply_topk_weights_cpu\n" + " from sglang.srt.layers.amx_utils import CPUQuantMethod\n" + "\n" + " x = dispatch_output.hidden_states\n" + " topk_output = dispatch_output.topk_output\n" + "\n" + " topk_weights, topk_ids, _ = topk_output\n" + " x, topk_weights = apply_topk_weights_cpu(\n" + " self.moe_runner_config.apply_router_weight_on_input,\n" + " topk_weights,\n" + " x,\n" + " )\n" + " output = torch.ops.sgl_kernel.fused_experts_cpu(\n" + " x,\n" + " layer.w13_weight,\n" + " layer.w2_weight,\n" + " topk_weights,\n" + " topk_ids,\n" + " False, # inplace\n" + " CPUQuantMethod.UNQUANT,\n" + " None, # w1_scale\n" + " None, # w2_scale\n" + " None, # w1_zp\n" + " None, # w2_zp\n" + " None, # block_size\n" + " True, # is_vnni\n" + " )\n" + " return StandardCombineInput(hidden_states=output)\n" + " # ---- end CPU methods ----\n" + "\n" +) +replacement_apply = new_methods + needle_apply +if needle_apply not in src: + print("ERROR: Mxfp4MoEMethod.apply anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle_apply, replacement_apply, 1) + +# ----- 3. Route Mxfp4MoEMethod.apply() to forward_cpu() on CPU. ----- +# FusedMoE.run_moe_core calls apply() directly; our forward_cpu would be +# dead code unless apply() itself delegates. Insert the delegation as the +# very first statement of apply() (after its imports). 
+needle_apply_body = ( + " ) -> CombineInput:\n" + "\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + " from sglang.srt.layers.moe.topk import TopKOutputChecker\n" +) +replacement_apply_body = ( + " ) -> CombineInput:\n" + "\n" + " # ---- CPU delegation added by enable-gpt-oss-cpu-moe.py ----\n" + " from sglang.srt.utils import is_cpu, cpu_has_amx_support\n" + " if is_cpu() and cpu_has_amx_support():\n" + " return self.forward_cpu(layer, dispatch_output)\n" + " # ---- end CPU delegation ----\n" + "\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + " from sglang.srt.layers.moe.topk import TopKOutputChecker\n" +) +if needle_apply_body not in src: + print("ERROR: apply() body anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle_apply_body, replacement_apply_body, 1) + +if src == original: + print("ERROR: nothing changed", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu.py b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu.py new file mode 100644 index 00000000..612ccd4f --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu.py @@ -0,0 +1,84 @@ +"""Patch sglang's server_args.py so GptOssForCausalLM accepts CPU attention backends. + +The upstream gate at the GptOssForCausalLM branch: + 1. Has no `is_cpu()` case for default backend selection — falls to "triton", + which has no CPU implementation. + 2. The `supported_backends` allowlist omits "intel_amx" and "torch_native", + even though both are valid CPU attention backends registered via + attention_registry.py. + +We extend both: pick `intel_amx` as the default for the CPU engine, and add +intel_amx + torch_native to the allowlist so users can choose either. 
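The patching approach used here is a fail-loud, exact-string replacement. A minimal sketch on a toy string follows; the helper name `replace_once` and the sample text are assumptions for illustration and do not appear in sglang:

```python
def replace_once(src, needle, replacement):
    # Hypothetical helper: fail loudly if the anchor text has drifted.
    if needle not in src:
        raise ValueError('patch site not found: ' + needle)
    return src.replace(needle, replacement, 1)


toy = 'supported = [triton, aiter]'
patched = replace_once(toy, 'aiter]', 'aiter, intel_amx, torch_native]')
print(patched)  # supported = [triton, aiter, intel_amx, torch_native]
```

Raising on a missing anchor means an upstream refactor stops the image build instead of silently producing an unpatched file.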
+""" + +import sys +from pathlib import Path + +SA = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/server_args.py" +) +src = SA.read_text() +original = src + +# 1) Inject is_cpu() branch into the default attention backend selector for +# GptOssForCausalLM. We sit between the existing `elif is_hip(): aiter` +# and the final `else: triton` so CPU users get intel_amx. +needle = ( + ' elif is_hip():\n' + ' self.attention_backend = "aiter"\n' + ' else:\n' + ' self.attention_backend = "triton"\n' +) +replacement = ( + ' elif is_hip():\n' + ' self.attention_backend = "aiter"\n' + ' elif os.getenv("SGLANG_USE_CPU_ENGINE", "0") == "1":\n' + ' self.attention_backend = "intel_amx"\n' + ' else:\n' + ' self.attention_backend = "triton"\n' +) +if needle not in src: + print("ERROR: default attention backend selector for GptOss not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle, replacement) + +# 2) Extend supported_backends to include CPU options. +needle2 = ( + ' supported_backends = [\n' + ' "triton",\n' + ' "trtllm_mha",\n' + ' "fa3",\n' + ' "fa4",\n' + ' "ascend",\n' + ' "intel_xpu",\n' + ' "aiter",\n' + ' ]\n' +) +replacement2 = ( + ' supported_backends = [\n' + ' "triton",\n' + ' "trtllm_mha",\n' + ' "fa3",\n' + ' "fa4",\n' + ' "ascend",\n' + ' "intel_xpu",\n' + ' "aiter",\n' + ' "intel_amx",\n' + ' "torch_native",\n' + ' ]\n' +) +if needle2 not in src: + print("ERROR: supported_backends list for GptOss not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle2, replacement2) + +# 3) Ensure `os` is imported (cheap idempotent check) +if "\nimport os" not in src and not src.startswith("import os"): + src = "import os\n" + src + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +SA.write_text(src) +print(f"Patched {SA}") diff --git a/core/helm-charts/sglang/image-build/enable-mxfp4-cpu.py b/core/helm-charts/sglang/image-build/enable-mxfp4-cpu.py new file mode 100644 index 00000000..19fa55b1 
--- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-mxfp4-cpu.py @@ -0,0 +1,61 @@ +"""Patch sglang's quantization/__init__.py to enable MXFP4 on CPU. + +The upstream code gates the mxfp4 registration behind is_cuda()/is_hip(). +On CPU this prevents loading models with quant_method=mxfp4 (e.g. +openai/gpt-oss-*), even though the model file's CPU-friendly dequantization +path (fp8_utils.dequant_mxfp4 → MXFP4QuantizeUtil.dequantize, pure PyTorch) +is fully functional. This patch widens the gate so mxfp4 is registered +when SGLANG_USE_CPU_ENGINE=1 is set and adds it to the CPU-supported +quantization allowlist. +""" + +import re +import sys +from pathlib import Path + +INIT = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/__init__.py" +) + +src = INIT.read_text() +original = src + +# 1) Ensure `os` is imported (we use it to gate behind the env var) +if not re.search(r"^import os\b", src, flags=re.M): + src = src.replace( + "import builtins\n", + "import builtins\nimport os\n", + 1, + ) + +# 2) Widen the gate: register mxfp4 also when running with the CPU engine +src = src.replace( + "if is_cuda() or (_is_mxfp_supported and is_hip()):\n" + " BASE_QUANTIZATION_METHODS.update(\n" + " {\n" + ' "mxfp4": Mxfp4Config,\n' + " }\n" + " )", + 'if is_cuda() or (_is_mxfp_supported and is_hip()) or os.getenv("SGLANG_USE_CPU_ENGINE", "0") == "1":\n' + " BASE_QUANTIZATION_METHODS.update(\n" + " {\n" + ' "mxfp4": Mxfp4Config,\n' + " }\n" + " )", +) + +# 3) Add mxfp4 to the CPU allowlist so get_quantization_config() returns it +src = src.replace( + "CPU_QUANTIZATION_METHODS = {\n" + ' "fp8": Fp8Config,\n', + "CPU_QUANTIZATION_METHODS = {\n" + ' "fp8": Fp8Config,\n' + ' "mxfp4": Mxfp4Config,\n', +) + +if src == original: + print("ERROR: no patch site matched. 
The file may have changed shape.", file=sys.stderr) + sys.exit(1) + +INIT.write_text(src) +print(f"Patched {INIT}") diff --git a/core/helm-charts/sglang/templates/_helpers.tpl b/core/helm-charts/sglang/templates/_helpers.tpl new file mode 100644 index 00000000..138d100c --- /dev/null +++ b/core/helm-charts/sglang/templates/_helpers.tpl @@ -0,0 +1,75 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- define "sglang.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{- define "sglang.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{- define "sglang.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{- define "sglang.labels" -}} +helm.sh/chart: {{ include "sglang.chart" . }} +{{ include "sglang.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{- define "sglang.selectorLabels" -}} +app.kubernetes.io/name: {{ include "sglang.name" . }} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- with .Values.podLabels }} +{{ toYaml . }} +{{- end }} +{{- end }} + +{{- define "sglang.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "sglang.fullname" .) 
.Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} + +{{- define "sglang.storageVolume" -}} +{{- if .Values.storage.persistentVolume.enabled }} +persistentVolumeClaim: + claimName: {{ .Values.storage.persistentVolume.existingClaim | default (include "sglang.fullname" .) }} +{{- else if .Values.storage.emptyDir.enabled }} +emptyDir: + {{- if .Values.storage.emptyDir.sizeLimit }} + sizeLimit: {{ .Values.storage.emptyDir.sizeLimit }} + {{- end }} +{{- else }} +emptyDir: {} +{{- end }} +{{- end }} + +{{- define "sglang.oidcSecretName" -}} +{{- printf "%s-oidc" (include "sglang.fullname" .) }} +{{- end }} + +{{- define "sglang.imagePullSecrets" -}} +{{- if .Values.imagePullSecrets }} +imagePullSecrets: +{{- range .Values.imagePullSecrets }} + - name: {{ . }} +{{- end }} +{{- end }} +{{- end }} diff --git a/core/helm-charts/sglang/templates/apisixroute.yaml b/core/helm-charts/sglang/templates/apisixroute.yaml new file mode 100644 index 00000000..c5aa0653 --- /dev/null +++ b/core/helm-charts/sglang/templates/apisixroute.yaml @@ -0,0 +1,47 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if .Values.apisixRoute.enabled }} +apiVersion: apisix.apache.org/v2 +kind: ApisixRoute +metadata: + name: {{ include "sglang.fullname" . }}-apisixroute + namespace: {{ .Values.apisixRoute.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} +spec: + {{- if .Values.apisixRoute.ingressClassName }} + ingressClassName: {{ .Values.apisixRoute.ingressClassName }} + {{- end }} + http: + - name: {{ .Values.modelName }}-route + match: + hosts: + - {{ .Values.apisixRoute.host }} + paths: + - /{{ .Values.modelName }}-sglang/* + backends: + - serviceName: {{ include "sglang.fullname" . 
}} + servicePort: {{ .Values.service.port }} + plugins: + - name: proxy-rewrite + enable: true + config: + regex_uri: + - ^/{{ .Values.modelName }}-sglang/(.*) + - /$1 + headers: + Content-Type: application/json + {{- if .Values.oidc.enabled }} + - name: openid-connect + enable: true + secretRef: {{ include "sglang.fullname" . }}-secret + config: + discovery: {{ .Values.oidc.discovery }} + scope: openid profile email + bearer_only: true + realm: {{ .Values.oidc.realm }} + introspection_endpoint: {{ .Values.oidc.introspectionEndpoint }} + introspection_endpoint_auth_method: client_secret_basic + {{- end }} +{{- end }} diff --git a/core/helm-charts/sglang/templates/deployment.yaml b/core/helm-charts/sglang/templates/deployment.yaml new file mode 100644 index 00000000..1f3422cb --- /dev/null +++ b/core/helm-charts/sglang/templates/deployment.yaml @@ -0,0 +1,192 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if not .Values.modelSource }} +{{- fail "modelSource is required. Set --set modelSource= (e.g. openai/gpt-oss-20b) or use a model-specific values file (see third_party/Dell/model-deployment//values.yaml)." }} +{{- end }} +{{- if not .Values.modelName }} +{{- fail "modelName is required. Set --set modelName= (e.g. gpt-oss-20b)." }} +{{- end }} + +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "sglang.fullname" . }} + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} +spec: + replicas: {{ .Values.replicaCount }} + selector: + matchLabels: + {{- include "sglang.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "sglang.selectorLabels" . | nindent 8 }} + spec: + {{- include "sglang.imagePullSecrets" . | nindent 6 }} + {{- if .Values.serviceAccount.create }} + serviceAccountName: {{ include "sglang.serviceAccountName" . 
}} + {{- end }} + {{- with .Values.podSecurityContext }} + securityContext: + {{- toYaml . | nindent 8 }} + {{- end }} + containers: + - name: sglang + image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + {{- with .Values.securityContext }} + securityContext: + {{- toYaml . | nindent 12 }} + {{- end }} + command: ["/opt/.venv/bin/python3", "-m", "sglang.launch_server"] + args: + - "--model-path={{ .Values.modelSource }}" + - "--served-model-name={{ .Values.modelName }}" + - "--host={{ .Values.server.host }}" + - "--port={{ .Values.server.port }}" + - "--device={{ .Values.server.device }}" + - "--tp-size={{ .Values.server.tpSize }}" + {{- if .Values.server.dpSize }} + - "--dp-size={{ .Values.server.dpSize }}" + {{- end }} + {{- if .Values.server.dtype }} + - "--dtype={{ .Values.server.dtype }}" + {{- end }} + {{- if .Values.server.quantization }} + - "--quantization={{ .Values.server.quantization }}" + {{- end }} + {{- if .Values.server.trustRemoteCode }} + - "--trust-remote-code" + {{- end }} + {{- if .Values.server.disableOverlapSchedule }} + - "--disable-overlap-schedule" + {{- end }} + {{- if .Values.server.enableTorchCompile }} + - "--enable-torch-compile" + - "--torch-compile-max-bs={{ .Values.server.torchCompileMaxBs }}" + {{- end }} + {{- if .Values.server.contextLength }} + - "--context-length={{ .Values.server.contextLength }}" + {{- end }} + {{- if .Values.server.maxRunningRequests }} + - "--max-running-requests={{ .Values.server.maxRunningRequests }}" + {{- end }} + {{- if .Values.server.maxTotalTokens }} + - "--max-total-tokens={{ .Values.server.maxTotalTokens }}" + {{- end }} + {{- if .Values.server.memFractionStatic }} + - "--mem-fraction-static={{ .Values.server.memFractionStatic }}" + {{- end }} + {{- range .Values.server.extraArgs }} + - {{ . 
| quote }} + {{- end }} + ports: + - containerPort: {{ .Values.server.port }} + name: http + protocol: TCP + env: + - name: HF_HOME + value: {{ .Values.hfCacheMountPath | quote }} + - name: HUGGINGFACE_HUB_CACHE + value: "{{ .Values.hfCacheMountPath }}/hub" + - name: TRANSFORMERS_CACHE + value: "{{ .Values.hfCacheMountPath }}/hub" + {{- if .Values.cpuEngine.enabled }} + - name: SGLANG_USE_CPU_ENGINE + value: "1" + {{- if .Values.cpuEngine.ldPreload }} + - name: LD_PRELOAD + value: {{ .Values.cpuEngine.ldPreload | quote }} + {{- end }} + {{- if .Values.cpuEngine.ompThreadsBind }} + - name: SGLANG_CPU_OMP_THREADS_BIND + value: {{ .Values.cpuEngine.ompThreadsBind | quote }} + {{- end }} + {{- end }} + {{- if .Values.huggingface.token }} + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: {{ .Values.huggingface.secretName | default "hf-token-secret" }} + key: {{ .Values.huggingface.secretKey | default "token" }} + - name: HUGGING_FACE_HUB_TOKEN + valueFrom: + secretKeyRef: + name: {{ .Values.huggingface.secretName | default "hf-token-secret" }} + key: {{ .Values.huggingface.secretKey | default "token" }} + {{- end }} + {{- with .Values.extraEnv }} + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.extraEnvFrom }} + envFrom: + {{- toYaml . | nindent 12 }} + {{- end }} + volumeMounts: + - name: hf-cache + mountPath: {{ .Values.hfCacheMountPath }} + {{- if .Values.shm.enabled }} + - name: dshm + mountPath: /dev/shm + {{- end }} + {{- with .Values.extraVolumeMounts }} + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.resources }} + resources: + {{- toYaml . 
| nindent 12 }} + {{- end }} + {{- if .Values.server.livenessProbe.enabled }} + livenessProbe: + httpGet: + path: {{ .Values.server.livenessProbe.httpGet.path }} + port: http + initialDelaySeconds: {{ .Values.server.livenessProbe.initialDelaySeconds }} + periodSeconds: {{ .Values.server.livenessProbe.periodSeconds }} + timeoutSeconds: {{ .Values.server.livenessProbe.timeoutSeconds }} + failureThreshold: {{ .Values.server.livenessProbe.failureThreshold }} + {{- end }} + {{- if .Values.server.readinessProbe.enabled }} + readinessProbe: + httpGet: + path: {{ .Values.server.readinessProbe.httpGet.path }} + port: http + initialDelaySeconds: {{ .Values.server.readinessProbe.initialDelaySeconds }} + periodSeconds: {{ .Values.server.readinessProbe.periodSeconds }} + timeoutSeconds: {{ .Values.server.readinessProbe.timeoutSeconds }} + failureThreshold: {{ .Values.server.readinessProbe.failureThreshold }} + {{- end }} + volumes: + - name: hf-cache + {{- include "sglang.storageVolume" . | nindent 10 }} + {{- if .Values.shm.enabled }} + - name: dshm + emptyDir: + medium: Memory + sizeLimit: {{ .Values.shm.sizeLimit }} + {{- end }} + {{- with .Values.extraVolumes }} + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . 
| nindent 8 }} + {{- end }} + {{- if .Values.priorityClassName }} + priorityClassName: {{ .Values.priorityClassName }} + {{- end }} diff --git a/core/helm-charts/sglang/templates/ingress.yaml b/core/helm-charts/sglang/templates/ingress.yaml new file mode 100644 index 00000000..75b4713a --- /dev/null +++ b/core/helm-charts/sglang/templates/ingress.yaml @@ -0,0 +1,33 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if .Values.ingress.enabled }} +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: {{ include "sglang.fullname" . }} + namespace: {{ .Values.ingress.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} + annotations: + nginx.ingress.kubernetes.io/rewrite-target: /{{ .Values.modelName }}-sglang/$1 +spec: + ingressClassName: {{ .Values.ingress.className }} + {{- if .Values.ingress.secretname }} + tls: + - hosts: + - {{ .Values.ingress.host }} + secretName: {{ .Values.ingress.secretname }} + {{- end }} + rules: + - host: {{ .Values.ingress.host }} + http: + paths: + - path: /{{ .Values.modelName }}-sglang/(.*) + pathType: ImplementationSpecific + backend: + service: + name: {{- if .Values.apisixRoute.enabled }} auth-apisix-gateway{{- else }} {{ include "sglang.fullname" . }}{{- end }} + port: + number: 80 +{{- end }} diff --git a/core/helm-charts/sglang/templates/pvc.yaml b/core/helm-charts/sglang/templates/pvc.yaml new file mode 100644 index 00000000..1c1b2796 --- /dev/null +++ b/core/helm-charts/sglang/templates/pvc.yaml @@ -0,0 +1,25 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if and .Values.storage.persistentVolume.enabled (not .Values.storage.persistentVolume.existingClaim) }} +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: {{ include "sglang.fullname" . }} + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . 
| nindent 4 }} + {{- if not .Values.storage.persistentVolume.deleteOnUninstall }} + annotations: + "helm.sh/resource-policy": keep + {{- end }} +spec: + accessModes: + - {{ .Values.storage.persistentVolume.accessMode | default "ReadWriteOnce" }} + {{- if .Values.storage.persistentVolume.storageClass }} + storageClassName: {{ .Values.storage.persistentVolume.storageClass }} + {{- end }} + resources: + requests: + storage: {{ .Values.storage.persistentVolume.size | default "80Gi" }} +{{- end }} diff --git a/core/helm-charts/sglang/templates/secret.yaml b/core/helm-charts/sglang/templates/secret.yaml new file mode 100644 index 00000000..135c441e --- /dev/null +++ b/core/helm-charts/sglang/templates/secret.yaml @@ -0,0 +1,36 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if or .Values.oidc.enabled .Values.secrets.enabled }} +apiVersion: v1 +kind: Secret +metadata: + name: {{ include "sglang.fullname" . }}-secret + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} +type: Opaque +data: + {{- if .Values.oidc.enabled }} + client_id: {{ .Values.oidc.clientId | b64enc }} + client_secret: {{ .Values.oidc.clientSecret | b64enc }} + {{- end }} + {{- if .Values.secrets.enabled }} + {{- range $key, $value := .Values.secrets.data }} + {{ $key }}: {{ $value | b64enc }} + {{- end }} + {{- end }} +{{- end }} +--- +{{- if .Values.huggingface.token }} +apiVersion: v1 +kind: Secret +metadata: + name: {{ .Values.huggingface.secretName | default "hf-token-secret" }} + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . 
| nindent 4 }} +type: Opaque +data: + {{ .Values.huggingface.secretKey | default "token" }}: {{ .Values.huggingface.token | b64enc }} +{{- end }} diff --git a/core/helm-charts/sglang/templates/service.yaml b/core/helm-charts/sglang/templates/service.yaml new file mode 100644 index 00000000..e1975fc1 --- /dev/null +++ b/core/helm-charts/sglang/templates/service.yaml @@ -0,0 +1,26 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: v1 +kind: Service +metadata: + name: {{ include "sglang.fullname" . }} + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} + {{- with .Values.service.labels }} + {{- toYaml . | nindent 4 }} + {{- end }} + {{- with .Values.service.annotations }} + annotations: + {{- toYaml . | nindent 4 }} + {{- end }} +spec: + type: {{ .Values.service.type }} + ports: + - name: http + port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + selector: + {{- include "sglang.selectorLabels" . | nindent 4 }} diff --git a/core/helm-charts/sglang/values.yaml b/core/helm-charts/sglang/values.yaml new file mode 100644 index 00000000..305f9cf1 --- /dev/null +++ b/core/helm-charts/sglang/values.yaml @@ -0,0 +1,266 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +# Default values for the sglang Helm chart. +# Targets the patched enterprise-inference/sglang image (built from +# image-build/) on an Intel Xeon (AMX) CPU node. +# +# IMPORTANT (quantization support on this image): +# The upstream Xeon CPU build of sglang supports a small, explicit subset +# of quantization methods (see CPU_QUANTIZATION_METHODS in +# sglang/srt/layers/quantization/__init__.py): +# fp8, w8a8_int8, compressed-tensors, awq, gptq +# The patched image additionally registers mxfp4, the native quantization
+# of openai/gpt-oss-{20b,120b}, when SGLANG_USE_CPU_ENGINE=1 is set +# (see image-build/enable-mxfp4-cpu.py), so gpt-oss can be served on CPU +# with this chart. See README.md for details. + +nameOverride: "" +fullnameOverride: "" + +namespace: default +replicaCount: 1 + +image: + # Patched image built from image-build/. fix1 (sgl-kernel rebuild with + # -mavx512bf16) benefits any bf16 model on Xeon; fix2..fix8 add MXFP4 + # + sinks-attention support specifically for gpt-oss and are runtime + # no-ops for other models; fix9..fix11 are precision-debug knobs (off + # by default). Build + import with `image-build/build-and-import.sh`. + # + # IMPORTANT: `tag` below MUST match the IMAGE_TAG in build-and-import.sh + # exactly. With `pullPolicy: IfNotPresent` (the right setting for a + # locally-imported image), a tag mismatch causes the kubelet to try a + # docker.io pull and ImagePullBackOff on a private/non-existent image. + repository: enterprise-inference/sglang + tag: "v0.5.12-xeon-fix11-debug" + pullPolicy: IfNotPresent + +imagePullSecrets: [] + +serviceAccount: + create: false + annotations: {} + name: "" + +podAnnotations: {} +podLabels: + app: sglang + +# SGLang writes to HF cache + /dev/shm; do not lock the root FS. +podSecurityContext: + runAsNonRoot: false + fsGroup: 0 + +securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + readOnlyRootFilesystem: false + +# ---- Model ---- +# Required. Set both at install time: +# --set modelSource= (HF repo, passed to --model-path) +# --set modelName= (served name; used in route URI) +# The chart has no default model: `helm install` fails loudly if either +# is empty. For a model-specific recipe (e.g. gpt-oss-20b), use the +# bundled values file (see README). +modelSource: "" +modelName: "" + +# HuggingFace Hub token. Required for gated repos (e.g. meta-llama/*). +# Either: +# 1. --set huggingface.token=$HF_TOKEN (chart creates the secret), or +# 2.
pre-create: kubectl create secret generic hf-token-secret \ +# --from-literal=token= +huggingface: + token: "" + secretName: "hf-token-secret" + secretKey: "token" + +# ---- Server / launch flags ---- +server: + port: 30000 + host: "0.0.0.0" + + # Force CPU device for the xeon image. + device: "cpu" + + # Tensor parallel rank count. CPU build typically runs tp=1. + # Set >1 only when binding ranks to separate NUMA domains via + # cpuEngine.ompThreadsBind. + tpSize: 1 + # Data parallel rank count. Omit (leave empty) unless you know you want + # multiple replicas of the model loaded in one process. + dpSize: "" + + # dtype: bfloat16 is the recommended dtype on Xeon AMX. Leave empty to + # let sglang infer from the checkpoint. + dtype: "bfloat16" + + # --quantization. Must be one of: fp8 | w8a8_int8 | compressed-tensors | + # awq | gptq | mxfp4 (mxfp4 only on the patched image), OR leave empty to + # use whatever is declared in the model config.json. Anything else + # (modelopt_fp4, etc.) will be rejected by sglang at startup. + quantization: "" + + # Trust model code from HF (required by some recent models). + trustRemoteCode: true + + # Recommended for CPU per sglang docs/platforms/cpu_server.md + disableOverlapSchedule: true + + # --enable-torch-compile can give a sizeable speedup on Xeon but slows + # cold start substantially. Off by default; flip on for benchmarks. + enableTorchCompile: false + torchCompileMaxBs: 4 + + # Optional caps. Leave empty to use the model default. + contextLength: "" + maxRunningRequests: "" + + # --max-total-tokens caps KV cache size in tokens. STRONGLY recommended on + # Kubernetes: sglang reads host memory via psutil and ignores cgroup + # limits, so without this it tries to claim ~85-93% of the *node's* RAM + # for KV cache and gets OOMKilled. Sized below for 32Ki context * a few + # in-flight requests on an 8B bf16 model. + maxTotalTokens: 32768 + + # --mem-fraction-static. Leave empty to keep sglang's default (0.85+). + # On k8s, prefer maxTotalTokens above.
If you must use a fraction, + # remember it is a fraction of host RAM, not the pod limit. + memFractionStatic: "" + + # Any extra command-line flags appended verbatim, e.g. + # extraArgs: ["--mem-fraction-static", "0.85"] + extraArgs: [] + + livenessProbe: + enabled: true + httpGet: + path: /health + port: http + initialDelaySeconds: 600 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 5 + readinessProbe: + enabled: true + httpGet: + path: /health + port: http + initialDelaySeconds: 120 + periodSeconds: 15 + timeoutSeconds: 10 + failureThreshold: 30 + +# ---- CPU engine tuning ---- +# The image already bakes ENV SGLANG_USE_CPU_ENGINE=1 and LD_PRELOAD into +# the runtime, but we set them explicitly here so the chart is +# self-documenting and survives image-tag changes. +cpuEngine: + enabled: true + # Per-rank core binding for SGLang's OMP threads. Format: pipe-separated + # per-rank ranges, e.g. "0-31|32-63" for a 2-rank tp on a 64-core node. + # Leave empty to let SGLang use defaults. + ompThreadsBind: "" + # LD_PRELOAD baked into xeon.Dockerfile. Set to "" to drop it entirely + # (only do this if you know why). + ldPreload: "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:/usr/lib/x86_64-linux-gnu/libtbbmalloc.so:/opt/.venv/lib/libiomp5.so" + +# ---- Resources ---- +# Starting points for Qwen3-8B (bf16, ~16Gi weights). For larger models +# bump both requests and limits. +resources: + requests: + cpu: "16" + memory: "32Gi" + limits: + cpu: "32" + memory: "64Gi" + +# SGLang uses /dev/shm heavily for inter-process tensor sharing on CPU. +shm: + enabled: true + sizeLimit: "16Gi" + +# ---- Storage (HuggingFace cache) ---- +# PVC keeps downloaded weights across pod restarts. +storage: + persistentVolume: + enabled: true + storageClass: "" + accessMode: ReadWriteOnce + size: 60Gi + existingClaim: "" + deleteOnUninstall: true + emptyDir: + enabled: false + sizeLimit: 60Gi + +# Mount point inside the container for the HF cache (HF_HOME). 
+hfCacheMountPath: "/root/.cache/huggingface" + +# ---- Service ---- +service: + type: ClusterIP + port: 30000 + annotations: {} + labels: {} + +# ---- OIDC + APISIX + Ingress ---- +# Mirrors the OVMS chart so the auth-apisix stack picks this up the same way. +oidc: + enabled: true + realm: master + clientId: "my-client-id" + clientSecret: "tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR" + discovery: "http://keycloak.default.svc.cluster.local/realms/master/.well-known/openid-configuration" + introspectionEndpoint: "http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token/introspect" + +apisixRoute: + enabled: true + namespace: default + name: "" + host: "api.example.com" + # IngressClass that the APISIX ingress controller v2 watches. Required + # so the controller picks up this ApisixRoute and syncs it into APISIX's + # runtime route table. Set to "" to omit the field (e.g. for older + # controller versions that auto-discover). + ingressClassName: "apisix" + +ingress: + enabled: true + className: nginx + namespace: auth-apisix + host: "api.example.com" + secretname: "api.example.com" + +secrets: + enabled: false + data: {} + +nodeSelector: {} +tolerations: [] +affinity: {} +priorityClassName: "" + +extraVolumes: [] +extraVolumeMounts: [] +# extraEnv: extra environment variables exposed to the sglang container. +# +# MXFP4_NIBBLE_ORDER=low_first is read only by the patched image's MXFP4 +# dequant path and is a runtime no-op for non-MXFP4 models. It is required +# for any MXFP4 model loaded on CPU (e.g. openai/gpt-oss-*) — without it +# the model serves but emits random-vocab garbage. 
+# +# Other debug flags exposed by the patched image (off by default; set +# only when reproducing a precision investigation): +# - FP32_PROMOTE_MOE=1 per-expert MoE forward in fp32 +# - ALLOW_FP32_MXFP4=1 allow --dtype float32 with mxfp4 models +# - MXFP4_OUT_DTYPE=float32|float16 dequant output dtype +extraEnv: + - name: MXFP4_NIBBLE_ORDER + value: "low_first" +extraEnvFrom: [] diff --git a/scripts/bootstrap-k3s.sh b/scripts/bootstrap-k3s.sh new file mode 100755 index 00000000..fc254a7a --- /dev/null +++ b/scripts/bootstrap-k3s.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash +# One-shot bootstrap for testing the sglang Helm chart on a single Xeon box. +# Installs: k3s (single-node), helm, kubectl symlink. Sets up kubeconfig for $USER. +# Run with: sudo bash scripts/bootstrap-k3s.sh +set -euo pipefail + +REAL_USER="${SUDO_USER:-$USER}" +REAL_HOME="$(getent passwd "$REAL_USER" | cut -d: -f6)" + +echo "==> Installing k3s (single-node, embedded containerd, default SQLite datastore)..." +# --write-kubeconfig-mode 644 so non-root can read it +# --disable traefik because we don't need an ingress for the smoke test +curl -sfL https://get.k3s.io | \ + INSTALL_K3S_EXEC="--write-kubeconfig-mode 644 --disable traefik" \ + sh - + +echo "==> Waiting for k3s API to be ready..." +for i in $(seq 1 60); do + if k3s kubectl get nodes >/dev/null 2>&1; then break; fi + sleep 2 +done +k3s kubectl get nodes -o wide + +echo "==> Setting up kubectl + kubeconfig for $REAL_USER..." +ln -sf /usr/local/bin/k3s /usr/local/bin/kubectl +install -d -o "$REAL_USER" -g "$REAL_USER" "$REAL_HOME/.kube" +install -m 600 -o "$REAL_USER" -g "$REAL_USER" /etc/rancher/k3s/k3s.yaml "$REAL_HOME/.kube/config" + +echo "==> Installing helm..." +curl -sfL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash + +echo "==> Versions:" +kubectl version --client=true 2>&1 | head -3 +helm version --short +echo +echo "==> Bootstrap complete.
As $REAL_USER, you can now run:" +echo " kubectl get nodes" +echo " helm lint core/helm-charts/sglang" diff --git a/third_party/Dell/model-deployment/README.md b/third_party/Dell/model-deployment/README.md deleted file mode 100644 index 43d98118..00000000 --- a/third_party/Dell/model-deployment/README.md +++ /dev/null @@ -1 +0,0 @@ -# PLACEHOLDER \ No newline at end of file diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md new file mode 100644 index 00000000..1a35686e --- /dev/null +++ b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md @@ -0,0 +1,186 @@ +## Step 1: Prerequisites to Deploy gpt-oss-20b Model on Xeon with Keycloak + +Ensure the Enterprise Inference stack with Keycloak is already deployed +before proceeding. If you're standing the cluster up from scratch +yourself, the appendix in `core/helm-charts/sglang/README.md` walks the +full bootstrap. + +Edit `core/scripts/generate-token.sh` and set your values before +sourcing it: + +| Variable | Description | +| ------------------------- | ------------------------------------------------------------------------ | +| `BASE_URL` | Hostname of your cluster (e.g. `api.example.com`), without `https://` | +| `KEYCLOAK_ADMIN_USERNAME` | Keycloak admin username | +| `KEYCLOAK_PASSWORD` | Keycloak admin password | +| `KEYCLOAK_CLIENT_ID` | Keycloak client ID configured during EI deployment | + +Then run: + +```bash +export HUGGING_FACE_HUB_TOKEN="your_token_here" + +cd ~/Enterprise-Inference +source core/scripts/generate-token.sh +``` + +This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, +and `TOKEN`. Verify with: + +```bash +echo "BASE_URL=$BASE_URL" +echo "TOKEN length=${#TOKEN} (expect 1000+; empty means the script failed silently)" +``` + +> Empty `TOKEN` means the script could not reach +> `https://${BASE_URL}/realms/master/...` or `https://${BASE_URL}/token`. 
+> The EI deployment provisions both as ingress routes to Keycloak — if +> they're missing, the cluster bootstrap is incomplete; see Appendix +> A.7 of the chart README. + +## Step 2: Build the Patched SGLang Image + +gpt-oss-20b ships natively in MXFP4 quantization, and the upstream +`lmsysorg/sglang:v0.5.12-xeon` image cannot serve it on CPU (MXFP4 is +GPU-gated, sinks attention is unsupported on the CPU backends, and the +published `sgl-kernel` shared library is missing the AVX-512-BF16 +compile flags needed for any bf16 matmul). + +The SGLang chart ships a one-shot build script that produces a patched +image and loads it directly into the local containerd image store. No +external registry is required. + +```bash +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +``` + +First run takes ~5-10 minutes. The script auto-detects the runtime — +`nerdctl` on a kubeadm/containerd cluster (what `inference-stack-deploy.sh` +produces) or `k3s ctr` on a k3s cluster. Verify with whichever matches +your cluster: + +```bash +# kubeadm / containerd +sudo nerdctl --namespace k8s.io images | grep enterprise-inference/sglang + +# k3s +sudo k3s ctr images ls | grep enterprise-inference/sglang +``` + +Either should report `enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`. + +For a detailed breakdown of what each patch does, see +`core/helm-charts/sglang/README.md` (section: What's Patched). 
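
The verification step above can also be done as an exact match instead of a `grep` on the repository name, which helps catch the tag-mismatch case that `values.yaml` warns about. The following is a minimal sketch, not part of the chart: the helper name `image_listed` is hypothetical, and the two listing layouts it handles (a single full-reference column for `k3s ctr images ls`, separate repository and tag columns for `nerdctl images`) are assumptions about the output format.

```python
REPOSITORY = "enterprise-inference/sglang"
TAG = "v0.5.12-xeon-fix11-debug"  # must equal image.tag in the chart's values.yaml


def image_listed(listing: str, repository: str, tag: str) -> bool:
    """Return True if an image listing contains exactly repository:tag.

    Hypothetical helper. It compares whole whitespace-separated fields, so
    a tag that is only a prefix of the listed tag does not match.
    """
    ref = f"{repository}:{tag}"
    for line in listing.splitlines():
        fields = line.split()
        # Assumed ctr-style layout: one field holds the full reference,
        # optionally prefixed with a registry host such as "docker.io/".
        if any(f == ref or f.endswith("/" + ref) for f in fields):
            return True
        # Assumed nerdctl-style layout: repository and tag in separate columns.
        if repository in fields and tag in fields:
            return True
    return False


# Illustrative, trimmed listing lines (not captured from a real cluster).
ctr_style = "docker.io/enterprise-inference/sglang:v0.5.12-xeon-fix11-debug\n"
nerdctl_style = "enterprise-inference/sglang v0.5.12-xeon-fix11-debug\n"

print(image_listed(ctr_style, REPOSITORY, TAG))
print(image_listed(nerdctl_style, REPOSITORY, TAG))
print(image_listed(ctr_style, REPOSITORY, "v0.5.12-xeon"))
```

In practice the `listing` argument would be the captured output of whichever listing command matches the cluster. The third call shows why whole-field comparison is used: the upstream tag `v0.5.12-xeon` is a prefix of the patched tag and would pass a plain substring check.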
+ +## Step 3: Deploy gpt-oss-20b + +```bash +helm install sglang-gpt-oss-20b ./core/helm-charts/sglang \ + --set modelSource="openai/gpt-oss-20b" \ + --set modelName="gpt-oss-20b" \ + --set huggingface.token="$HUGGING_FACE_HUB_TOKEN" \ + --set ingress.enabled=true \ + --set ingress.secretname="${BASE_URL}" \ + --set ingress.host="${BASE_URL}" \ + --set oidc.clientId="$KEYCLOAK_CLIENT_ID" \ + --set oidc.clientSecret="$KEYCLOAK_CLIENT_SECRET" \ + --set apisixRoute.enabled=true \ + --set 'server.extraArgs={--attention-backend,torch_native,--reasoning-parser,gpt-oss,--tool-call-parser,gpt-oss}' +``` + +The chart's `values.yaml` already targets the patched image, sets bf16, +sizes resources for a Xeon node, and enables the +`MXFP4_NIBBLE_ORDER=low_first` env var required for correct MXFP4 +weight decode. The `--set` flags above add the gpt-oss-specific runtime +flags (Harmony reasoning/tool-call parsers, CPU attention backend) and +the per-cluster ingress/OIDC overrides. Note that the TLS secret key is +all lowercase (`ingress.secretname`), matching the chart's `values.yaml` +and `templates/ingress.yaml`. + +## Step 4: Verify the Deployment + +```bash +kubectl get pods +kubectl get apisixroutes +``` + +Expected output (the sglang pod is what matters here; your existing +Keycloak / APISIX / ingress pods will appear in the listing too, with +names that depend on how those components were deployed in your +cluster): + +``` +NAME READY STATUS RESTARTS +sglang-gpt-oss-20b-- 1/1 Running 0 +... 1/1 Running 0 # keycloak, apisix, ingress-nginx, etc. +``` + +> Note: First pod start takes ~4-5 minutes (downloading ~12 GB of +> weights from Hugging Face, then dequantizing MXFP4 → bf16 in memory). +> Subsequent restarts are fast because the cache PVC persists the +> weights.
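
As a rough check of that startup time against the chart's default liveness probe (`initialDelaySeconds: 600`, `periodSeconds: 30`, `failureThreshold: 5` under `server.livenessProbe` in `values.yaml`), the sketch below estimates the earliest point at which the kubelet could restart a pod that never becomes healthy. The helper name is hypothetical, and the formula is an approximation that ignores `timeoutSeconds` and probe scheduling jitter.

```python
def earliest_liveness_restart_s(initial_delay_s: int, period_s: int,
                                failure_threshold: int) -> int:
    """Approximate seconds after container start before a restart can happen.

    Assumes the first probe fires at initial_delay_s and that
    failure_threshold consecutive failures, spaced period_s apart, are
    needed before the kubelet restarts the container.
    """
    return initial_delay_s + (failure_threshold - 1) * period_s


# Chart defaults from values.yaml (server.livenessProbe).
deadline_s = earliest_liveness_restart_s(600, 30, 5)

# Upper end of the ~4-5 minute first start quoted in the note above.
first_start_s = 5 * 60

print(f"earliest restart: ~{deadline_s} s after container start")
print(f"margin over a 5-minute first start: ~{deadline_s - first_start_s} s")
```

With the defaults this gives about 720 s, roughly 7 minutes of headroom over a 5-minute first start. On a slower download link or with a larger model, raising `server.livenessProbe.initialDelaySeconds` is the knob the chart exposes for widening that margin.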
+ +``` +NAME HOSTS +sglang-gpt-oss-20b-apisixroute api.example.com +``` + +## Step 5: Test the Deployed Model + +```bash +curl -k https://${BASE_URL}/gpt-oss-20b-sglang/v1/chat/completions \ + -X POST \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $TOKEN" \ + -d '{ + "model": "gpt-oss-20b", + "messages": [{"role": "user", "content": "In one sentence, what is deep learning?"}], + "max_tokens": 150, + "temperature": 0.3 + }' +``` + +> The exact `${BASE_URL}` value depends on how the cluster was +> bootstrapped — it's what `core/scripts/generate-token.sh` exports +> after sourcing. Self-bootstrapped clusters following the chart +> README's appendix will have `${BASE_URL}=api.example.com:30443`. + +If successful, the model returns a chat-completion response with the +answer in `choices[0].message.content` and the model's internal +reasoning in `choices[0].message.reasoning_content`. + +> If the request times out with a 504, CPU inference at ~4 tokens/s can exceed the default 60 s upstream timeout for longer responses. See [Gateway Timeout (504)](../sglang-troubleshooting.md#1-gateway-timeout-504-on-inference-requests) in the troubleshooting guide to bump both the nginx and APISIX timeouts. + +### A Note on `max_tokens` + +gpt-oss uses the Harmony chat format: every response starts in an +internal "analysis" channel and only switches to the user-visible +"final" channel when reasoning is complete. 
With small budgets the +model spends them all reasoning and the `content` field comes back null: + +| `max_tokens` | What you'll see | +| ------------ | -------------------------------------------------- | +| ≤ 100 | `content: null`, reasoning truncated | +| 150 | One short sentence — good for quick verification | +| 300 | Paragraph with light formatting | +| > 400 | Hits documented long-form drift (see troubleshooting) | + +## To Undeploy the Model + +```bash +helm uninstall sglang-gpt-oss-20b +kubectl delete pvc -l app.kubernetes.io/instance=sglang-gpt-oss-20b # frees the cached weights +``` + +## Parameters + +| Parameter | Description | +| --------- | ----------- | +| `--set modelSource="openai/gpt-oss-20b"` | HuggingFace repo to load (passed to `sglang serve --model-path`). | +| `--set modelName="gpt-oss-20b"` | Served name, also used in the ApisixRoute URI prefix `/gpt-oss-20b-sglang/*`. | +| `--set huggingface.token="..."` | HF token for gated models. `openai/gpt-oss-20b` is public, so leave empty. | +| `--set ingress.enabled=true` | Creates a Kubernetes Ingress that terminates TLS at nginx. | +| `--set ingress.host="${BASE_URL}"` | Hostname the ingress matches (same value used in the TLS secret name). | +| `--set ingress.secretName="${BASE_URL}"` | TLS Secret used at the ingress layer — its name equals the hostname by chart convention. | +| `--set oidc.clientId="..."` | Keycloak OIDC client ID; APISIX validates tokens against this client. | +| `--set oidc.clientSecret="..."` | Keycloak OIDC client secret. | +| `--set apisixRoute.enabled=true` | Creates the APISIX route with `openid-connect` plugin for bearer validation. | +| `--set 'server.extraArgs={...}'` | gpt-oss-specific runtime flags: `torch_native` CPU attention backend, Harmony `--reasoning-parser` and `--tool-call-parser`. 
| diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/model-card.md b/third_party/Dell/model-deployment/gpt-oss-20b/model-card.md new file mode 100644 index 00000000..2a295773 --- /dev/null +++ b/third_party/Dell/model-deployment/gpt-oss-20b/model-card.md @@ -0,0 +1,66 @@ +# gpt-oss-20b + +This deployment serves gpt-oss-20b, a 20.9-billion-parameter open-weight mixture-of-experts model from OpenAI. It is part of the gpt-oss family released under the Apache 2.0 license and is optimized for reasoning, agentic workflows, and tool use, with a configurable reasoning effort. + +For full details, including model specifications, licensing, intended use, safety guidance, and example prompts, see the official Hugging Face page: **Official Hugging Face Page** + +https://huggingface.co/openai/gpt-oss-20b + +This deployment provides inference services only; the weights are hosted on Hugging Face under OpenAI's Apache 2.0 release. + +Ensure compliance with OpenAI's Apache 2.0 release terms and usage policy before using this model.
+ +### Model Attribution + +**Developer:** OpenAI + +**Purpose:** Open-weight reasoning, agentic, and tool-using model with configurable analysis depth (low / medium / high reasoning effort) + +**Sizes/Variants:** 20 B total parameters with mixture-of-experts (3.6 B active per token); the gpt-oss family also includes a 120 B variant + +**Modalities:** Text input → Text (including code, structured outputs, and tool calls) output + +**Parameter Size:** ~20.9 billion total (~3.6 billion active per token) + +**Max Context:** Up to 128 k tokens + +**License:** Apache 2.0 + +**Native Quantization:** MXFP4 (4-bit microscaling float) on the MoE weights, dequantized to bf16 at weight-load time for CPU inference + +**Minimum required CPU Cores:** 64 (recommended 96+ for usable throughput) + +**Minimum required PCIe Cards:** 0 (CPU-only deployment via the patched SGLang Xeon image) + +### Usage Notice + +**By using this model, you agree that:** + +- Inputs and outputs are processed through gpt-oss-20b under OpenAI's Apache 2.0 release. +- You will comply with OpenAI's usage policy and the Apache 2.0 license terms, including attribution and notice requirements when redistributing. +- All generated content (text, code, or tool calls) must be reviewed for accuracy, compliance, and safety before deployment. +- The model should not be used to generate malicious or disallowed content, or to automate decisions in high-risk or regulated systems without appropriate safeguards. + +### Intended Applications + +- Reasoning-heavy chatbots and assistants with adjustable reasoning effort. +- Agentic workflows: tool calling, structured function invocation, multi-step task decomposition. +- Code generation, completion, and refactoring across multiple programming languages. +- Long-context tasks: summarization of long documents, dialog over long history, RAG (retrieval-augmented generation) over extended context (subject to the long-form notes in the deployment guide).
+- Research, prototyping, and commercial workflows under Apache 2.0 terms. + +### Limitations + +- The 20 B size — while strong for reasoning — still trails much larger models on knowledge-intensive tasks. +- As with all large language models, risk of hallucinations, biases, or unsafe outputs remains; outputs should be reviewed before downstream use. +- The model uses the Harmony chat format with separate analysis and final channels; small `max_tokens` budgets often leave responses in the analysis channel with no user-visible content. See the deployment guide for guidance. +- CPU-only deployment via the patched SGLang image is throughput-limited (~4 tokens/s on a Xeon 6972P with the current pure-Python MoE path) and exhibits a documented long-form drift past ~150 generated tokens. Short-form generation is solid. +- Native MXFP4 quantization requires the patched SGLang image; the upstream `lmsysorg/sglang:v0.5.12-xeon` cannot serve this model. + +### References + +"Introducing gpt-oss". https://openai.com/index/introducing-gpt-oss/ + +Hugging Face Model Card: https://huggingface.co/openai/gpt-oss-20b + +OpenAI gpt-oss GitHub Repository. https://github.com/openai/gpt-oss diff --git a/third_party/Dell/model-deployment/sglang-troubleshooting.md b/third_party/Dell/model-deployment/sglang-troubleshooting.md new file mode 100644 index 00000000..ccc2398f --- /dev/null +++ b/third_party/Dell/model-deployment/sglang-troubleshooting.md @@ -0,0 +1,174 @@ +# SGLang Troubleshooting Guide + +This section provides common issues observed when running inference against models deployed via the SGLang Helm chart on Intel® AI for Enterprise Inference, along with step-by-step resolutions. + +**Issues:** +1. [Gateway Timeout (504) on Inference Requests](#1-gateway-timeout-504-on-inference-requests) +2. [Response `content` field is null](#2-response-content-field-is-null) +3. 
[Pod startup fails with "Unknown quantization method: mxfp4"](#3-pod-startup-fails-with-unknown-quantization-method-mxfp4) +4. [Pod startup fails with "scalar path not implemented!"](#4-pod-startup-fails-with-scalar-path-not-implemented) +5. [Model serves but emits random-vocab gibberish in `content`](#5-model-serves-but-emits-random-vocab-gibberish-in-content) +6. [Long-form responses degrade into broken tokens after ~150 tokens](#6-long-form-responses-degrade-into-broken-tokens-after-150-tokens) +7. [401 Unauthorized from APISIX with a valid-looking token](#7-401-unauthorized-from-apisix-with-a-valid-looking-token-issuer-mismatch) + +--- + +### 1. Gateway Timeout (504) on Inference Requests + +**Context:** Model deployed via the SGLang chart. Inference request sent through the ingress stack (ingress-nginx → APISIX → SGLang service). + +**Error:** Inference requests return `504 Gateway Timeout` after 60 seconds. + +**Cause:** CPU-based MoE inference (gpt-oss-20b on Xeon) generates tokens at ~4 tokens/s. At that rate a 60 s window covers roughly 4 × 60 = 240 generated tokens, so responses longer than ~240 tokens exceed the default 60 s upstream timeout enforced by ingress-nginx and APISIX. + +**Fix:** + +**Step 1 – Increase the nginx ingress timeout** + +Find the ingress, then annotate it, substituting the name and namespace reported by the first command: + +```bash +kubectl get ingress -A | grep sglang +kubectl annotate ingress <ingress-name> -n <namespace> \ + nginx.ingress.kubernetes.io/proxy-read-timeout="600" \ + nginx.ingress.kubernetes.io/proxy-send-timeout="600" \ + nginx.ingress.kubernetes.io/proxy-connect-timeout="60" \ + --overwrite +``` + +**Step 2 – Increase the APISIX route timeout** + +```bash +kubectl get apisixroute -A | grep sglang +kubectl patch apisixroute <route-name> -n <namespace> --type='json' \ + -p='[{"op":"add","path":"/spec/http/0/timeout","value":{"connect":"5s","read":"600s","send":"600s"}}]' +``` + +**Verification:** Re-run the inference request and confirm a `200 OK` response within the new window. + +**Notes:** +- Annotations apply immediately; no pod restart required.
+- For shorter responses (`max_tokens ≤ 200`), the default 60s timeout is usually sufficient. + +--- + +### 2. Response `content` field is null + +**Context:** gpt-oss-20b deployed via the SGLang chart. Inference request returns HTTP 200 but `choices[0].message.content` is `null`; `choices[0].message.reasoning_content` is populated. + +**Cause:** gpt-oss uses the Harmony chat format with separate analysis and final channels. The model always begins in the analysis channel (internal reasoning) and only switches to the final channel when reasoning completes. With small `max_tokens` budgets, the model exhausts the budget while still reasoning and never emits visible content. `finish_reason` will be `length` and `reasoning_tokens` will equal `completion_tokens`. + +**Fix:** Raise `max_tokens` so the model has budget to finish reasoning AND emit a final answer: + +| `max_tokens` | Outcome | +| ------------ | ---------------------------------------- | +| ≤ 100 | Typically `content: null` | +| 150 | One short sentence (good for verification) | +| 300 | Paragraph with light formatting | + +The internal reasoning is always preserved in `reasoning_content` even when `content` is null. + +--- + +### 3. Pod startup fails with "Unknown quantization method: mxfp4" + +**Context:** Pod CrashLoopBackOff at startup. Logs show `ValueError: Unknown quantization method: mxfp4`. + +**Cause:** The pod is running the upstream `lmsysorg/sglang:v0.5.12-xeon` image. The upstream image gates MXFP4 quantization behind `is_cuda() or is_hip()`, so it cannot load MXFP4 models on CPU. + +**Fix:** Confirm the chart is using the patched image. 
The chart's `values.yaml` defaults to it, but a `--set image.repository=...` override may have switched it back: + +```bash +kubectl get pod -l app=sglang -o jsonpath='{.items[0].spec.containers[0].image}{"\n"}' +# expected: enterprise-inference/sglang:v0.5.12-xeon-fix11-debug +``` + +If the image is wrong, redeploy with the chart defaults or set the image explicitly, substituting your Helm release name: + +```bash +helm upgrade <release-name> ./core/helm-charts/sglang \ + --reuse-values \ + --set image.repository=enterprise-inference/sglang \ + --set image.tag=v0.5.12-xeon-fix11-debug +``` + +If the patched image is not present locally, build it first: + +```bash +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +``` + +--- + +### 4. Pod startup fails with "scalar path not implemented!" + +**Context:** Pod crashes on the first forward pass. Logs show `RuntimeError: tinygemm_kernel_nn: scalar path not implemented!`. + +**Cause:** The `sgl-kernel` shared library loaded by the pod was compiled without `-mavx512bf16`. The bf16 specialization of `tinygemm_kernel_nn` falls through to a stub that throws this error. This is the upstream regression that step 1 of the patched image's Dockerfile fixes. + +**Fix:** Verify the patched image is loaded (same check as issue #3). If the patched image is loaded and this error still appears, the build may have failed silently; rebuild and check the verification line: + +```bash +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh 2>&1 | grep "AVX-512 BF16 instructions" +# expected: ~1400+ instructions +``` + +A count under 100 indicates the compile flags did not take effect during the build. + +--- + +### 5. Model serves but emits random-vocab gibberish in `content` + +**Context:** gpt-oss-20b deployed. Inference returns HTTP 200 and `content` is non-null, but it reads as a sequence of unrelated tokens (e.g., `" the I the and a"`). + +**Cause:** MXFP4 weights are being dequantized with the wrong nibble packing order.
gpt-oss stores its MXFP4 weights with the low nibble first; the patched image's dequant defaults to this via the `MXFP4_NIBBLE_ORDER` environment variable. + +**Fix:** Verify the env var is set on the pod: + +```bash +kubectl get pod -l app=sglang -o jsonpath='{range .items[0].spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' | grep MXFP4_NIBBLE_ORDER +# expected: MXFP4_NIBBLE_ORDER=low_first +``` + +The chart's `values.yaml` includes this default. If it is missing, redeploy without overriding `extraEnv` to an empty list. + +--- + +### 6. Long-form responses degrade into broken tokens after ~150 tokens + +**Context:** gpt-oss-20b deployed via the SGLang chart. Short-form responses are coherent. Responses past ~150 generated tokens collapse into broken tokens, repeated characters, mixed emoji, foreign-script characters, or special-token leaks like `<|channel|>` appearing in the visible output. + +**Cause:** Known limitation of the current pure-Python CPU MoE path used by the chart. A precision A/B (fp32 per-expert MoE intermediates, fp32 KV cache, `--enable-fp32-lm-head`) ruled out numerical precision as the dominant cause. Surviving hypotheses point at sliding-window-attention bookkeeping inside the patched `torch_native_backend` or Harmony channel-switch tokenization interacting with the sinks-attention wrapper. + +**Fix:** No fix at the chart level yet. Workarounds: +- Keep `max_tokens` at or below 200 for production calls. +- Phrase prompts to produce short, focused answers (e.g., `"In one sentence, ..."`). +- The internal `reasoning_content` is unaffected and can still be inspected. + +This is documented under "Known Limitations" in `core/helm-charts/sglang/README.md`. Long-form coherence requires further work on the attention or channel-switch path. + +--- + +### 7. 
401 Unauthorized from APISIX with a valid-looking token (issuer mismatch) + +**Context:** Token was successfully obtained from Keycloak (via `source generate-token.sh` or equivalent), but the inference call returns `401 Unauthorized` from APISIX (response body mentions "openresty"). + +**Cause:** APISIX's OIDC plugin runs in `bearer_only` mode and validates the token's `iss` (issuer) claim against the issuer returned by the OIDC discovery URL the chart was configured with. If Keycloak was deployed without a fixed `KC_HOSTNAME`, it stamps the issuer based on the incoming request's host header — so a token fetched via `https://api.example.com:30443/token` carries `iss=https://api.example.com:30443/realms/master`, but the chart's default discovery URL is `http://keycloak.default.svc.cluster.local/realms/master`. The two don't match and APISIX rejects. + +**Fix:** Pin Keycloak's issuer at deploy time by setting `KC_HOSTNAME` on the Keycloak Deployment to the cluster-internal hostname the chart's `oidc.discovery` value points at. The appendix in `core/helm-charts/sglang/README.md` (A.3) shows the env vars; the relevant ones are: + +```yaml +- { name: KC_HOSTNAME, value: "http://keycloak.default.svc.cluster.local" } +- { name: KC_HOSTNAME_STRICT, value: "false" } +- { name: KC_HOSTNAME_BACKCHANNEL_DYNAMIC, value: "false" } +``` + +After updating the Deployment (`kubectl apply` the manifest from A.3 again, then wait for the new pod), re-source `generate-token.sh` to fetch a fresh token. Verify the issuer claim is now cluster-internal: + +```bash +echo "$TOKEN" | cut -d. 
-f2 | base64 -d 2>/dev/null \ + | python3 -c "import json,sys; print('iss =', json.loads(sys.stdin.read())['iss'])" +# expect: iss = http://keycloak.default.svc.cluster.local/realms/master +``` + +The mismatched-issuer 401 cannot happen on a production EI cluster — the Ansible playbooks set `KC_HOSTNAME` to the cluster's external hostname and the chart's `oidc.discovery` is set to the matching URL — but it's a common stumble for someone bootstrapping by hand from the appendix.
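The shell one-liner above pipes the token payload through `base64 -d`, but JWT segments are base64url-encoded without padding, so that decode can fail or truncate the JSON for some tokens. A more tolerant check can be sketched in Python as follows. The helper names `jwt_claims` and `issuer_matches` are hypothetical, and the sketch only inspects the payload; it does not verify the signature.

```python
import base64
import json
import os


def jwt_claims(token):
    """Decode the (unverified) payload segment of a JWT into a dict."""
    payload = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def issuer_matches(token, expected_issuer):
    """True when the token's 'iss' claim equals the expected issuer URL."""
    return jwt_claims(token).get("iss") == expected_issuer


if __name__ == "__main__":
    # Assumes generate-token.sh exported TOKEN into the environment.
    token = os.environ.get("TOKEN", "")
    if token.count(".") == 2:
        print("iss =", jwt_claims(token).get("iss"))
    else:
        print("TOKEN is not set or is not a three-part JWT")
```

Comparing the printed `iss` against the issuer advertised by the chart's `oidc.discovery` URL shows whether the mismatch described in this issue applies to a given token.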