fix(#405): honest CPU fallback for local embeddings + ADR-101 ROCm strategy#408
Conversation
Frames issue #405's root cause as image packaging: the published kg-api:latest ships PyPI's default torch wheel, which bundles CUDA runtime (works on NVIDIA and CPU) but has no ROCm support. AMD users need a separate wheel from the ROCm-specific PyTorch index. Proposes three published ROCm variants alongside kg-api:latest — kg-api:rocm60, kg-api:rocm61 (PYTORCH_VARIANT wheels), and kg-api:rocm71-host (AMD's official rocm/pytorch base image) — with KG_API_IMAGE_TAG in docker-compose.ghcr.yml driving install-time selection from operator GPU_MODE. NVIDIA support unchanged: nvidia and cpu modes continue to use kg-api:latest. Single-arch (linux/amd64) for ROCm variants — no production-quality arm64 ROCm path exists. publish.sh gains an api-rocm target group that builds all three variants in one invocation. Visual-embedding parallel path noted as a deferred follow-up; #405 only names the text path.
Issue #405 reported AMD ROCm init looking fixed but silently broken: the API logs "Falling back to API-based embeddings" while the active embedding profile is still 'local', and every ingest concept then fails with "Embedding model not loaded. Call load_model() first." Two coupled defects, both addressed here: 1. init_embedding_model_manager assigned the module-global _model_manager *before* calling load_model(). When load_model() raised, the global was left pointing at a half-built manager with self.model = None — so downstream get_embedding_model_manager() returned the broken manager instead of raising "not initialized". Now the global is only published after load_model() succeeds, and is explicitly cleared on terminal failure. 2. No CPU fallback existed when the configured device was unusable. Per ADR-101, retry once with device='cpu' before giving up — that matches the workaround documented in #405 and means a user who picks the wrong ROCm variant gets a degraded-but-working install, not a silent-broken one. main.py's misleading "Falling back to API-based embeddings" log line becomes an honest error: the active profile is 'local', and if both the configured device and CPU fail, ingestion will fail until the user switches profiles or fixes the model config. No more lying about a fallback that never happened. Tests lock in: configured-device success path, CUDA-fails / CPU-rescues fallback, both-fail terminal state (global stays None, get_embedding_* raises "not initialized"), and CPU-as-configured failure short-circuits without a redundant retry.
…ch test, honest log ordering Addresses code-review feedback on PR #408: * Extracts the build-with-CPU-fallback logic to a module-level `_build_with_cpu_fallback` helper. The boot-time init path uses it; hot-reload deliberately does NOT — see the long-form comment in `reload_embedding_model_manager` for the asymmetry argument (boot uses stale config that may predate hardware changes and benefits from fallback; reload carries fresh just-typed operator intent and should fail loudly while the atomic-swap pattern preserves the previous working manager). * Case-insensitive 'cpu' check in the retry guard — config strings like 'CPU', ' cpu', 'Cpu' no longer trigger a redundant retry that double-logs the same failure. Parametrized test locks the contract. * Reorders the fallback log lines: 'Attempting CPU fallback...' before the retry, success-after-the-fact 'Loaded <model> on CPU (configured device=<x> was unavailable). Re-run...' once the fallback completes. Reads correctly whether the retry succeeded or failed, instead of the prior "Falling back" message that implied an action which might then error out. * Adds tests/unit/lib/test_embedding_model_manager_real_torch.py — integration-style test that calls the real sentence-transformers loader (no `load_model` mock) on a CUDA-less host with the all-MiniLM tiny model. Proves the catchable-exception contract the mocked tests can only assume (torch raises catchable Exception on cuda-no-driver rather than segfaulting past the except clause), and verifies the fallback manager actually produces real 384-dim embeddings. Skipped on hosts with usable CUDA — assertions are meaningful only when the configured device is genuinely missing. * ADR-101 status Draft → Accepted. The image-variant scheme is operational from this PR forward; outstanding image-build/publish and operator-selection work tracks as follow-up tasks.
… (ADR-101, #405) Implements the image-build and operator-selection halves of ADR-101. Image build (publish.sh images-rocm): * New cmd_images_rocm publishes kg-api ROCm variants. Builds the appropriate Dockerfile (rocm60/rocm61 → api/Dockerfile with PYTORCH_VARIANT; rocm72-host → api/Dockerfile.rocm-host) and pushes tags `kg-api:<variant>`, `kg-api:<VERSION>-<variant>`, and `kg-api:sha-<GIT_SHA>-<variant>`. Does NOT touch :latest. * Single-arch linux/amd64 (ROCm has no production arm64 path). * Two-tier variant gating: DEFAULT_VARIANTS (rocm72-host) build by default; DEFERRED_VARIANTS (rocm60, rocm61) require --force because no maintainer-verified hardware test has run for them yet. * api/Dockerfile.rocm-host updated: base image bumped rocm/pytorch:rocm7.1_ubuntu24.04_py3.13 → rocm/pytorch:rocm7.2.3_ubuntu24.04_py3.12 to match modern Arch ROCm 7.2.3 hosts (closer than 7.1's ABI-compatible guess). Python 3.12 — no 3.13-specific syntax in the API code. * Tag rename: rocm71-host → rocm72-host to reflect the actual base. Operator wiring: * docker/docker-compose.ghcr.yml: api service image becomes `kg-api:${KG_API_IMAGE_TAG:-latest}`, so AMD hosts pull the matching variant. NVIDIA/CPU/mac default to latest (PyPI torch ships CUDA runtime bundled — works for both). * operator/lib/common.sh load_operator_config: derives KG_API_IMAGE_TAG from GPU_MODE when not explicitly set, exports for docker-compose substitution. amd → rocm60 (rocm61 with ROCM_VERSION=rocm61); amd-host → rocm72-host; else latest. * operator/lib/guided-init.sh: persists KG_API_IMAGE_TAG into .operator.conf during init so the choice survives env resets and is visible to operators. Adds a ROCR_VISIBLE_DEVICES=0 hint when AMD mode is chosen — discrete + iGPU coexistence is a real foot-gun on Ryzen 7000+ hosts. ADR-101 updated to reflect the operational tag name (rocm72-host) and adds a naming-convention note: the suffix tracks the base image's ROCm version. Future base bumps (rocm/pytorch:rocm7.3) add new tags rather than overloading the existing one — keeps tag meaning immutable. Published this commit (manually verified): ghcr.io/aaronsb/knowledge-graph-system/kg-api:rocm72-host ghcr.io/aaronsb/knowledge-graph-system/kg-api:0.13.1-rocm72-host Outstanding: end-to-end verification on the AMD 7900 XTX host (task #6) — operator tear-down + fresh init → choose AMD GPU (host ROCm) → confirm ROCm device picked, model loads, ingest runs. The rocm60/rocm61 variants remain deferred per ADR-101 until a tester volunteers.
…nd amd-host overlay (ADR-101, #405) Two drifts between operator.sh (top-level standalone entry-point) and operator/lib/common.sh (dev helpers) that prevented end-to-end ROCm verification: 1. load_config never derived/exported KG_API_IMAGE_TAG from GPU_MODE. common.sh's load_operator_config got it in the previous commit, but operator.sh has its own load_config that the standalone start path (`./operator.sh start` after `init --image-source ghcr`) goes through. Without the export, docker-compose substituted ${KG_API_IMAGE_TAG:-latest} with `latest`, so AMD hosts pulled the CUDA-bundled image and silently landed on the #405 failure mode (now caught by #408's CPU fallback but defeating the purpose of the ROCm variant). 2. get_compose_cmd's GPU overlay case only handled nvidia, amd, mac — it was missing amd-host entirely. So with GPU_MODE=amd-host, no docker-compose.gpu-amd-host.yml overlay applied: no /dev/kfd or /dev/dri device mounts, no privileged:true / ipc:host for ROCm HSA init, and the container's ROCm runtime had nothing to talk to — "No HIP GPUs are available" inside the container despite a perfectly functional 7900 XTX on the host. Verified on 7900 XTX (gfx1100) + iGPU 9950X3D (gfx1036) host running Arch ROCm 7.2.3: Container: ghcr.io/aaronsb/knowledge-graph-system/kg-api:rocm72-host rocminfo (inside): both gfx1100 and gfx1036 enumerated torch.cuda.is_available(): True torch.version.hip: 7.2.53211-c2d9476115 device_count: 2, [0] AMD Radeon RX 7900 XTX, [1] iGPU Embedding model loaded: nomic-ai/nomic-embed-text-v1.5 (768 dims, cuda) Vision model loaded: nomic-ai/nomic-embed-vision-v1.5 on cuda rocm-smi: GPU[0] 12% busy under embedding load, GPU[1] idle (1%) Default cuda:0 device routes work to the discrete 7900 XTX without requiring ROCR_VISIBLE_DEVICES — the iGPU is enumerated but not used. The wizard hint about ROCR_VISIBLE_DEVICES=0 remains useful as a contingency for hosts where the runtime enumerates the iGPU first. Closes the ROCm half of ADR-101. The CPU fallback half (PR #408 core) also showed honestly during diagnosis: when devices weren't mounted, embedding load failed with 'No HIP GPUs are available' and the new fallback engaged with the new log wording 'Attempting CPU fallback...' 'Loaded ... on CPU (configured device='cuda' was unavailable)' — proof that both halves work together as the ADR claimed.
Review — post-
|
…visual log, doc-surface fixes Addresses code-reviewer findings on PR #408: * **operator/lib/image-tag.sh** is the single source of truth for GPU_MODE → KG_API_IMAGE_TAG mapping. operator.sh's load_config, operator/lib/common.sh's load_operator_config, and operator/lib/guided-init.sh all source it and call derive_kg_api_image_tag(). The earlier inline-three-times shape drifted (commit c448f09 was the empirical proof — fixed one of the copies after the bug had silently shipped). One definition now. * **Wizard option [3] no longer points at an unpublished image.** Previously `amd` mode resolved to KG_API_IMAGE_TAG=rocm60, but rocm60/rocm61 are deferred per ADR-101 §Negative — GHCR has only `latest` and `rocm72-host`. Users picking option [3] would have hit `manifest unknown` on pull — same "looks fixed, silently broken" shape this PR is supposed to eliminate. Now `amd` mode resolves to `latest` and relies on ADR-101's CPU fallback. Setting ROCM_VERSION=rocm60 in .operator.conf forces the variant tag once a tester confirms a build (tracked in #409). Wizard text reframed: "Linux with AMD GPU (ROCm 6.x — preview) / Falls back to CPU embeddings via :latest until variant ships". * **Visual-embedding startup log honest at main.py:218.** Parity with the text-embedding fix at main.py:199. No more "Visual embedding features may be limited" when the active profile asks for visual embeddings and the load failed (image ingestion will actually fail, not "be limited"). The new message names the failure mode and gives two concrete remediation paths (switch profile, or repair model config). Mirrors the text path. * **Doc/code surface drift fixed.** ADR-101 examples and publish.sh comment referenced `--variants rocm60,rocm61` flag and `images api-rocm` command that don't exist. Real surface is `images-rocm rocm60 --force` (positional variant args + force flag to opt into deferred variants). Issues opened for the items intentionally deferred from this PR: #409 — Build & publish ROCm 6.x variants (rocm60, rocm61) #410 — CPU fallback for visual embedding generator (text-parity) #411 — Operator: --image-source flag silently dropped by guided-init.sh #412 — Operator: persisted KG_API_IMAGE_TAG goes stale on manual edits Reload-asymmetry contract test (review item 4) intentionally deferred — covers existing pre-PR behavior, separate concern. Verified the helper across the three call sites: bash> source operator/lib/image-tag.sh bash> for m in amd amd-host nvidia cpu; do echo "$m -> $(derive_kg_api_image_tag $m)" done amd -> latest # was rocm60 (unpublished — review footgun) amd-host -> rocm72-host nvidia -> latest cpu -> latest bash> derive_kg_api_image_tag amd rocm60 rocm60 # ROCM_VERSION override still works
Closes #405. End-to-end verified on AMD 7900 XTX (gfx1100) + iGPU 9950X3D (gfx1036), Arch ROCm 7.2.3 host.
Summary
Issue #405 reported that
./operator.sh init"AMD ROCm" path produced a usable-looking config but every ingest then failed withEmbedding model not loaded. Call load_model() first.The startup log claimed "Falling back to API-based embeddings" — a lie; no fallback actually occurred. The root cause is image packaging: publishedkg-api:latestships PyPI's default torch wheel (CUDA runtime bundled, no ROCm). This PR ships both halves of the fix.What's in this PR
ADR-101 (Accepted) —
docs/architecture/infrastructure/ADR-101-rocm-image-variant-and-install-time-selection.md. Frames the root cause, proposes three ROCm variant tags (rocm60,rocm61,rocm72-host) with install-time selection viaKG_API_IMAGE_TAGsubstitution indocker-compose.ghcr.yml. NVIDIA stays on:latest(default PyPI torch wheel carries the CUDA runtime — works on NVIDIA and CPU hosts unchanged).Honest CPU fallback —
api/app/lib/embedding_model_manager.py. Whenload_model()fails on the configured device (CUDA-built torch on AMD host, missing /dev/kfd for ROCm, MPS on non-Apple, etc.), retries once withdevice='cpu'rather than leaving the platform silent-broken. Also closes a global-pollution bug where a failed init left_model_managerpointing at a half-built manager (the actual reason concepts saw "model not loaded" instead of "not initialized"). Hot-reload deliberately does NOT carry the fallback — operator just made an explicit just-typed device choice, atomic-swap preserves the previous working model on failure, silently downgrading would be the worse failure.Honest startup logs —
api/app/main.py. No more "Falling back to API-based embeddings" when the active profile is stilllocal. Mirror fix at line 218 for visual-embedding init. On both-paths-failure, the message surfaces what actually happened with concrete remediation paths.Published ROCm image —
ghcr.io/aaronsb/knowledge-graph-system/kg-api:rocm72-host. Built fromapi/Dockerfile.rocm-hoston AMD's officialrocm/pytorch:rocm7.2.3_ubuntu24.04_py3.12_pytorch_release_2.9.1base image. Single-arch linux/amd64 (ROCm has no production arm64 path).publish.sh images-rocmis the new target group; rocm60/rocm61 variants stay gated behind--forceuntil someone runs them on real hardware (#409).Operator wiring with single-source-of-truth helper —
operator/lib/image-tag.shis the one placeGPU_MODE → KG_API_IMAGE_TAGis defined.operator.sh:load_config,operator/lib/common.sh:load_operator_config, andoperator/lib/guided-init.shall source it. The earlier inline-three-times shape drifted between commits, the helper ends that.docker/docker-compose.ghcr.ymlparameterized to substitute the tag.operator/lib/guided-init.shpersists the tag in.operator.conf, warns aboutROCR_VISIBLE_DEVICES=0on hosts with both a discrete dGPU and an iGPU, and reframes wizard option [3] as "preview" (it now resolves to:latest+ CPU fallback because rocm60/rocm61 are deferred — previously it pointed at an unpublished image).operator.sh:get_compose_cmdgained the missingamd-hostGPU overlay case — without it, no/dev/kfddevice mounts, noprivileged: true, ROCm runtime had nothing to talk to inside the container.Verified on real hardware (7900 XTX)
ghcr.io/.../kg-api:rocm72-host/dev/kfd+/dev/driin containerrocminfoinside containertorch.cuda.is_available()torch.version.hipnomic-ai/nomic-embed-text-v1.5 (768 dims, cuda)nomic-ai/nomic-embed-vision-v1.5 on cudaDiagnosis also incidentally verified the CPU fallback fires correctly on the ROCm container when the GPU overlay wasn't yet applied — proof the two halves of the ADR work together as designed.
Commits
fac3daa3— ADR-101 draft → Accepted49fc13ba— Real CPU fallback ininit_embedding_model_manager; global-pollution fix; honest main.py logfd2e72cf— Review follow-ups: extract_build_with_cpu_fallbackhelper, case-insensitive'cpu'retry guard, real-torch integration test, attempt-then-success log ordering, ADR status promoted to Accepted, async-reload asymmetry comment9c7db593—publish.sh images-rocmtarget +kg-api:rocm72-hostpublished; operator/lib/common.sh KG_API_IMAGE_TAG export; ghcr compose parameterization; guided-init persistence + multi-GPU note; Dockerfile.rocm-host base bumped to ROCm 7.2.3c448f098— Drift fix between operator.sh (top-level) and operator/lib/common.sh: KG_API_IMAGE_TAG export, missingamd-hostoverlay case2583051c— Final review pass: collapse GPU_MODE→TAG triplication intooperator/lib/image-tag.sh, fix wizard option [3] footgun (was pointing at unpublishedrocm60image), honest visual-embedding log at main.py:218, doc-surface drift in ADR-101 + publish.sh commentTest plan
pytest tests/unit/ tests/api/— 921 passed, 15 skipped (no regressions)tests/unit/lib/test_embedding_model_manager_fallback.py— 8 cases (happy path, CUDA-fails / CPU-rescues, both-fail terminal state with cleared global, CPU-as-configured no-retry, case-insensitive'cpu'parametrized)tests/unit/lib/test_embedding_model_manager_real_torch.py— exercises real sentence-transformers loader (noload_modelmock) on no-CUDA host, proves the catchable-exception contractdocs/scripts/adr lint)derive_kg_api_image_tagsmoke-tested across all three call sitesTracking issues for deferred follow-ups
--image-sourceflag silently dropped by guided-init.shKG_API_IMAGE_TAGgoes stale on manualGPU_MODEeditsHousekeeping
Stale branch
fix/catalog-refresh-embedding-couplingwas deleted local + remote during this work (commits already on main via direct push, never PR'd).