fix(#405): honest CPU fallback for local embeddings + ADR-101 ROCm strategy #408

Merged
aaronsb merged 6 commits into main from fix/issue-405-fallback-and-rocm-adr
May 25, 2026

aaronsb commented May 24, 2026

Closes #405. End-to-end verified on AMD 7900 XTX (gfx1100) + iGPU 9950X3D (gfx1036), Arch ROCm 7.2.3 host.

Summary

Issue #405 reported that the ./operator.sh init "AMD ROCm" path produced a usable-looking config, but every ingest then failed with "Embedding model not loaded. Call load_model() first." The startup log claimed "Falling back to API-based embeddings", which was false: no fallback actually occurred. The root cause is image packaging: the published kg-api:latest ships PyPI's default torch wheel (CUDA runtime bundled, no ROCm). This PR ships both halves of the fix.

What's in this PR

ADR-101 (Accepted): docs/architecture/infrastructure/ADR-101-rocm-image-variant-and-install-time-selection.md. Frames the root cause and proposes three ROCm variant tags (rocm60, rocm61, rocm72-host) with install-time selection via KG_API_IMAGE_TAG substitution in docker-compose.ghcr.yml. NVIDIA stays on :latest (the default PyPI torch wheel carries the CUDA runtime, so it works on NVIDIA and CPU hosts unchanged).

Honest CPU fallback: api/app/lib/embedding_model_manager.py. When load_model() fails on the configured device (CUDA-built torch on an AMD host, missing /dev/kfd for ROCm, MPS on non-Apple hardware, etc.), init retries once with device='cpu' rather than leaving the platform silently broken. Also closes a global-pollution bug where a failed init left _model_manager pointing at a half-built manager (the actual reason concepts saw "model not loaded" instead of "not initialized"). Hot-reload deliberately does NOT carry the fallback: the operator has just made an explicit device choice, and the atomic swap preserves the previous working model on failure, so silently downgrading would be the worse failure.
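A minimal sketch of the init-time behavior described above, under stated assumptions. Only the names `_model_manager`, `load_model`, `_build_with_cpu_fallback`, `init_embedding_model_manager`, and `get_embedding_model_manager` come from this PR; the class shape, the injected `loader` argument, the function signatures, and the log wording details are hypothetical stand-ins so the example runs without torch.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class EmbeddingModelManager:
    """Hypothetical stand-in for the real manager class."""

    def __init__(self, model_name: str, device: str,
                 loader: Callable[[str, str], object]):
        self.model_name = model_name
        self.device = device
        self._loader = loader  # assumption: injected so the sketch needs no torch
        self.model: Optional[object] = None

    def load_model(self) -> None:
        self.model = self._loader(self.model_name, self.device)


_model_manager: Optional[EmbeddingModelManager] = None


def _build_with_cpu_fallback(*, model_name: str, device: str,
                             loader) -> EmbeddingModelManager:
    """Try the configured device, then retry once on CPU."""
    manager = EmbeddingModelManager(model_name, device, loader)
    try:
        manager.load_model()
        return manager
    except Exception as exc:
        if device.strip().lower() == "cpu":
            raise  # CPU was already the configured device: no redundant retry
        logger.warning("Load on device=%s failed (%s). Attempting CPU fallback...",
                       device, exc)
    fallback = EmbeddingModelManager(model_name, "cpu", loader)
    fallback.load_model()  # a second failure propagates to the caller
    logger.warning("Loaded %s on CPU (configured device=%s was unavailable).",
                   model_name, device)
    return fallback


def init_embedding_model_manager(*, model_name: str, device: str,
                                 loader) -> EmbeddingModelManager:
    global _model_manager
    try:
        manager = _build_with_cpu_fallback(
            model_name=model_name, device=device, loader=loader)
    except Exception:
        _model_manager = None  # never leave a half-built manager published
        raise
    _model_manager = manager  # publish only after load_model() succeeded
    return manager


def get_embedding_model_manager() -> EmbeddingModelManager:
    if _model_manager is None:
        raise RuntimeError("Embedding model manager not initialized")
    return _model_manager
```

The point of the ordering is that the module-global is assigned only after a successful load, so a terminal failure surfaces as "not initialized" rather than as a manager whose `model` is `None`.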

Honest startup logs: api/app/main.py. No more "Falling back to API-based embeddings" when the active profile is still local. Mirror fix at line 218 for visual-embedding init. When both the configured device and CPU fail, the message surfaces what actually happened, with concrete remediation paths.

Published ROCm image: ghcr.io/aaronsb/knowledge-graph-system/kg-api:rocm72-host. Built from api/Dockerfile.rocm-host on AMD's official rocm/pytorch:rocm7.2.3_ubuntu24.04_py3.12_pytorch_release_2.9.1 base image. Single-arch linux/amd64 (ROCm has no production arm64 path). publish.sh images-rocm is the new target group; the rocm60/rocm61 variants stay gated behind --force until someone runs them on real hardware (#409).

Operator wiring with single-source-of-truth helper: operator/lib/image-tag.sh is the one place the GPU_MODE → KG_API_IMAGE_TAG mapping is defined. operator.sh:load_config, operator/lib/common.sh:load_operator_config, and operator/lib/guided-init.sh all source it. The earlier inline-three-times shape drifted between commits; the helper ends that. docker/docker-compose.ghcr.yml is parameterized to substitute the tag. operator/lib/guided-init.sh persists the tag in .operator.conf, warns about ROCR_VISIBLE_DEVICES=0 on hosts with both a discrete GPU and an iGPU, and reframes wizard option [3] as "preview" (it now resolves to :latest + CPU fallback because rocm60/rocm61 are deferred; previously it pointed at an unpublished image). operator.sh:get_compose_cmd gained the missing amd-host GPU overlay case. Without it, /dev/kfd was not mounted and privileged: true was not set, so the ROCm runtime had nothing to talk to inside the container.
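For readers who want the mapping at a glance, here is a Python model of it. This is an illustration only: the real helper is the shell function derive_kg_api_image_tag in operator/lib/image-tag.sh, and the Python signature below is an assumption. The mapping values are taken from the smoke-test output recorded in the final commit of this PR.

```python
from typing import Optional


def derive_kg_api_image_tag(gpu_mode: str, rocm_version: Optional[str] = None) -> str:
    """Python model of GPU_MODE -> KG_API_IMAGE_TAG (the real helper is shell)."""
    if gpu_mode == "amd-host":
        return "rocm72-host"
    if gpu_mode == "amd":
        # rocm60/rocm61 are deferred, so default to :latest plus the CPU
        # fallback unless an explicit ROCM_VERSION override is supplied.
        return rocm_version or "latest"
    return "latest"  # nvidia, cpu, mac


for mode in ("amd", "amd-host", "nvidia", "cpu"):
    print(f"{mode} -> {derive_kg_api_image_tag(mode)}")
print(derive_kg_api_image_tag("amd", "rocm60"))
```

Keeping the mapping in one function is what lets all three call sites agree after the next ROCm base bump.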

Verified on real hardware (7900 XTX)

| Check | Result |
| --- | --- |
| Container image pulled | ghcr.io/.../kg-api:rocm72-host |
| /dev/kfd + /dev/dri in container | Mounted |
| rocminfo inside container | gfx1100 (7900 XTX) + gfx1036 (iGPU) |
| torch.cuda.is_available() | True |
| torch.version.hip | 7.2.53211-c2d9476115 |
| Text embedding device | nomic-ai/nomic-embed-text-v1.5 (768 dims, cuda) |
| Vision embedding device | nomic-ai/nomic-embed-vision-v1.5 on cuda |
| GPU under load | 7900 XTX 12% busy, iGPU 1% idle; work routes to discrete |
| VRAM detected | 24268 MB free |

Diagnosis also incidentally verified the CPU fallback fires correctly on the ROCm container when the GPU overlay wasn't yet applied — proof the two halves of the ADR work together as designed.
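The torch-level rows of the table can be reproduced with a few lines of Python run inside the container. This is a sketch, not the script used for the verification above; the function name `describe_torch_runtime` is hypothetical. It relies on the fact that ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` API and set `torch.version.hip`.

```python
def describe_torch_runtime() -> dict:
    """Summarize which accelerator runtime the installed torch build can see."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        # ROCm builds report AMD GPUs through the torch.cuda namespace.
        "cuda_available": torch.cuda.is_available(),
        # None on CUDA/CPU builds, a version string on ROCm builds.
        "hip_version": getattr(torch.version, "hip", None),
        "devices": [],
    }
    if info["cuda_available"]:
        info["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    print(describe_torch_runtime())
```

On an image that ships the default PyPI wheel, `hip_version` would be `None` even on an AMD host, which is the packaging problem ADR-101 describes.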

Commits

  1. fac3daa3: ADR-101 draft → Accepted
  2. 49fc13ba: Real CPU fallback in init_embedding_model_manager; global-pollution fix; honest main.py log
  3. fd2e72cf: Review follow-ups: extract _build_with_cpu_fallback helper, case-insensitive 'cpu' retry guard, real-torch integration test, attempt-then-success log ordering, ADR status promoted to Accepted, async-reload asymmetry comment
  4. 9c7db593: publish.sh images-rocm target + kg-api:rocm72-host published; operator/lib/common.sh KG_API_IMAGE_TAG export; ghcr compose parameterization; guided-init persistence + multi-GPU note; Dockerfile.rocm-host base bumped to ROCm 7.2.3
  5. c448f098: Drift fix between operator.sh (top-level) and operator/lib/common.sh: KG_API_IMAGE_TAG export, missing amd-host overlay case
  6. 2583051c: Final review pass: collapse GPU_MODE→TAG triplication into operator/lib/image-tag.sh, fix wizard option [3] footgun (was pointing at unpublished rocm60 image), honest visual-embedding log at main.py:218, doc-surface drift in ADR-101 + publish.sh comment

Test plan

  • pytest tests/unit/ tests/api/ — 921 passed, 15 skipped (no regressions)
  • tests/unit/lib/test_embedding_model_manager_fallback.py — 8 cases (happy path, CUDA-fails / CPU-rescues, both-fail terminal state with cleared global, CPU-as-configured no-retry, case-insensitive 'cpu' parametrized)
  • tests/unit/lib/test_embedding_model_manager_real_torch.py — exercises real sentence-transformers loader (no load_model mock) on no-CUDA host, proves the catchable-exception contract
  • End-to-end ROCm verification on 7900 XTX — model loads on ROCm device, 24GB VRAM detected, embeddings generate on discrete GPU
  • ADR-101 lints clean (docs/scripts/adr lint)
  • Shared derive_kg_api_image_tag smoke-tested across all three call sites

Tracking issues for deferred follow-ups

Housekeeping

Stale branch fix/catalog-refresh-embedding-coupling was deleted locally and on the remote during this work (its commits were already on main via direct push and were never PR'd).

aaronsb added 5 commits May 24, 2026 16:39
Frames issue #405's root cause as image packaging: the published
kg-api:latest ships PyPI's default torch wheel, which bundles CUDA
runtime (works on NVIDIA and CPU) but has no ROCm support. AMD users
need a separate wheel from the ROCm-specific PyTorch index.

Proposes three published ROCm variants alongside kg-api:latest —
kg-api:rocm60, kg-api:rocm61 (PYTORCH_VARIANT wheels), and
kg-api:rocm71-host (AMD's official rocm/pytorch base image) — with
KG_API_IMAGE_TAG in docker-compose.ghcr.yml driving install-time
selection from operator GPU_MODE. NVIDIA support unchanged: nvidia and
cpu modes continue to use kg-api:latest.

Single-arch (linux/amd64) for ROCm variants — no production-quality
arm64 ROCm path exists. publish.sh gains an api-rocm target group
that builds all three variants in one invocation.

Visual-embedding parallel path noted as a deferred follow-up; #405
only names the text path.
Issue #405 reported AMD ROCm init looking fixed but silently broken:
the API logs "Falling back to API-based embeddings" while the active
embedding profile is still 'local', and every ingest concept then
fails with "Embedding model not loaded. Call load_model() first."

Two coupled defects, both addressed here:

1. init_embedding_model_manager assigned the module-global _model_manager
   *before* calling load_model(). When load_model() raised, the global
   was left pointing at a half-built manager with self.model = None —
   so downstream get_embedding_model_manager() returned the broken
   manager instead of raising "not initialized". Now the global is
   only published after load_model() succeeds, and is explicitly
   cleared on terminal failure.

2. No CPU fallback existed when the configured device was unusable.
   Per ADR-101, retry once with device='cpu' before giving up — that
   matches the workaround documented in #405 and means a user who
   picks the wrong ROCm variant gets a degraded-but-working install,
   not a silent-broken one.

main.py's misleading "Falling back to API-based embeddings" log line
becomes an honest error: the active profile is 'local', and if both
the configured device and CPU fail, ingestion will fail until the
user switches profiles or fixes the model config. No more lying
about a fallback that never happened.

Tests lock in: configured-device success path, CUDA-fails / CPU-rescues
fallback, both-fail terminal state (global stays None, get_embedding_*
raises "not initialized"), and CPU-as-configured failure short-circuits
without a redundant retry.
…ch test, honest log ordering

Addresses code-review feedback on PR #408:

* Extracts the build-with-CPU-fallback logic to a module-level
  `_build_with_cpu_fallback` helper. The boot-time init path uses it;
  hot-reload deliberately does NOT — see the long-form comment in
  `reload_embedding_model_manager` for the asymmetry argument (boot uses
  stale config that may predate hardware changes and benefits from
  fallback; reload carries fresh just-typed operator intent and should
  fail loudly while the atomic-swap pattern preserves the previous
  working manager).

* Case-insensitive 'cpu' check in the retry guard — config strings like
  'CPU', ' cpu', 'Cpu' no longer trigger a redundant retry that
  double-logs the same failure. Parametrized test locks the contract.

* Reorders the fallback log lines: 'Attempting CPU fallback...' before
  the retry, success-after-the-fact 'Loaded <model> on CPU (configured
  device=<x> was unavailable). Re-run...' once the fallback completes.
  Reads correctly whether the retry succeeded or failed, instead of the
  prior "Falling back" message that implied an action which might then
  error out.

* Adds tests/unit/lib/test_embedding_model_manager_real_torch.py —
  integration-style test that calls the real sentence-transformers
  loader (no `load_model` mock) on a CUDA-less host with the all-MiniLM
  tiny model. Proves the catchable-exception contract the mocked tests
  can only assume (torch raises catchable Exception on cuda-no-driver
  rather than segfaulting past the except clause), and verifies the
  fallback manager actually produces real 384-dim embeddings. Skipped
  on hosts with usable CUDA — assertions are meaningful only when the
  configured device is genuinely missing.

* ADR-101 status Draft → Accepted. The image-variant scheme is
  operational from this PR forward; outstanding image-build/publish and
  operator-selection work tracks as follow-up tasks.
… (ADR-101, #405)

Implements the image-build and operator-selection halves of ADR-101.

Image build (publish.sh images-rocm):
  * New cmd_images_rocm publishes kg-api ROCm variants. Builds the
    appropriate Dockerfile (rocm60/rocm61 → api/Dockerfile with
    PYTORCH_VARIANT; rocm72-host → api/Dockerfile.rocm-host) and pushes
    tags `kg-api:<variant>`, `kg-api:<VERSION>-<variant>`, and
    `kg-api:sha-<GIT_SHA>-<variant>`. Does NOT touch :latest.
  * Single-arch linux/amd64 (ROCm has no production arm64 path).
  * Two-tier variant gating: DEFAULT_VARIANTS (rocm72-host) build by
    default; DEFERRED_VARIANTS (rocm60, rocm61) require --force because
    no maintainer-verified hardware test has run for them yet.
  * api/Dockerfile.rocm-host updated: base image bumped
    rocm/pytorch:rocm7.1_ubuntu24.04_py3.13 →
    rocm/pytorch:rocm7.2.3_ubuntu24.04_py3.12 to match modern Arch
    ROCm 7.2.3 hosts (closer than 7.1's ABI-compatible guess).
    Python 3.12 — no 3.13-specific syntax in the API code.
  * Tag rename: rocm71-host → rocm72-host to reflect the actual base.

Operator wiring:
  * docker/docker-compose.ghcr.yml: api service image becomes
    `kg-api:${KG_API_IMAGE_TAG:-latest}`, so AMD hosts pull the matching
    variant. NVIDIA/CPU/mac default to latest (PyPI torch ships CUDA
    runtime bundled — works for both).
  * operator/lib/common.sh load_operator_config: derives
    KG_API_IMAGE_TAG from GPU_MODE when not explicitly set, exports
    for docker-compose substitution. amd → rocm60 (rocm61 with
    ROCM_VERSION=rocm61); amd-host → rocm72-host; else latest.
  * operator/lib/guided-init.sh: persists KG_API_IMAGE_TAG into
    .operator.conf during init so the choice survives env resets and
    is visible to operators. Adds a ROCR_VISIBLE_DEVICES=0 hint when
    AMD mode is chosen — discrete + iGPU coexistence is a real
    foot-gun on Ryzen 7000+ hosts.

ADR-101 updated to reflect the operational tag name (rocm72-host) and
adds a naming-convention note: the suffix tracks the base image's ROCm
version. Future base bumps (rocm/pytorch:rocm7.3) add new tags rather
than overloading the existing one — keeps tag meaning immutable.

Published this commit (manually verified):
  ghcr.io/aaronsb/knowledge-graph-system/kg-api:rocm72-host
  ghcr.io/aaronsb/knowledge-graph-system/kg-api:0.13.1-rocm72-host

Outstanding: end-to-end verification on the AMD 7900 XTX host
(task #6) — operator tear-down + fresh init → choose AMD GPU (host
ROCm) → confirm ROCm device picked, model loads, ingest runs.
The rocm60/rocm61 variants remain deferred per ADR-101 until a
tester volunteers.
…nd amd-host overlay (ADR-101, #405)

Two drifts between operator.sh (top-level standalone entry-point) and
operator/lib/common.sh (dev helpers) that prevented end-to-end ROCm
verification:

1. load_config never derived/exported KG_API_IMAGE_TAG from GPU_MODE.
   common.sh's load_operator_config got it in the previous commit, but
   operator.sh has its own load_config that the standalone start path
   (`./operator.sh start` after `init --image-source ghcr`) goes
   through. Without the export, docker-compose substituted ${KG_API_IMAGE_TAG:-latest}
   with `latest`, so AMD hosts pulled the CUDA-bundled image and
   silently landed on the #405 failure mode (now caught by #408's CPU
   fallback but defeating the purpose of the ROCm variant).

2. get_compose_cmd's GPU overlay case only handled nvidia, amd, mac —
   it was missing amd-host entirely. So with GPU_MODE=amd-host, no
   docker-compose.gpu-amd-host.yml overlay applied: no /dev/kfd or
   /dev/dri device mounts, no privileged:true / ipc:host for ROCm HSA
   init, and the container's ROCm runtime had nothing to talk to —
   "No HIP GPUs are available" inside the container despite a perfectly
   functional 7900 XTX on the host.

Verified on 7900 XTX (gfx1100) + iGPU 9950X3D (gfx1036) host running
Arch ROCm 7.2.3:

  Container: ghcr.io/aaronsb/knowledge-graph-system/kg-api:rocm72-host
  rocminfo (inside): both gfx1100 and gfx1036 enumerated
  torch.cuda.is_available(): True
  torch.version.hip: 7.2.53211-c2d9476115
  device_count: 2, [0] AMD Radeon RX 7900 XTX, [1] iGPU
  Embedding model loaded: nomic-ai/nomic-embed-text-v1.5 (768 dims, cuda)
  Vision model loaded: nomic-ai/nomic-embed-vision-v1.5 on cuda
  rocm-smi: GPU[0] 12% busy under embedding load, GPU[1] idle (1%)

Default cuda:0 device routes work to the discrete 7900 XTX without
requiring ROCR_VISIBLE_DEVICES — the iGPU is enumerated but not used.
The wizard hint about ROCR_VISIBLE_DEVICES=0 remains useful as a
contingency for hosts where the runtime enumerates the iGPU first.

Closes the ROCm half of ADR-101. The CPU fallback half (PR #408 core)
also showed honestly during diagnosis: when devices weren't mounted,
embedding load failed with 'No HIP GPUs are available' and the new
fallback engaged with the new log wording 'Attempting CPU fallback...'
'Loaded ... on CPU (configured device='cuda' was unavailable)' — proof
that both halves work together as the ADR claimed.

aaronsb commented May 25, 2026

Review — post-c448f098 final state

Focused on fd2e72cf, 9c7db593, c448f098. End-to-end verification on the 7900 XTX confirms the amd-host half lands clean; the amd (wheel-based) half ships a footgun that should be addressed before merge.


Must-fix / discuss

1. Wizard offers an image that doesn't exist in GHCR.

gh api .../kg-api/versions shows only latest, rocm72-host, and versioned aliases published. kg-api:rocm60 and kg-api:rocm61 are not in the registry: cmd_images_rocm gates them behind --force and they've never been built.

But the wizard at operator/lib/guided-init.sh:187-193 offers [3] "Linux with AMD GPU (ROCm wheels)" as a peer option, with description "For systems with ROCm 6.x installed." A user on Ubuntu 22.04 + ROCm 6.x picks the option that names their hardware → GPU_MODE=amd → both operator.sh:75 and operator/lib/common.sh:47 derive KG_API_IMAGE_TAG="${ROCM_VERSION:-rocm60}" → docker compose pull fails with manifest unknown for kg-api:rocm60.

This is the exact "looks fixed, silently broken at install" failure mode #405 was filed to eliminate — except now it bites the wheel-based AMD users instead of the host-mode ones. The PR claims to close #405 for AMD users; today it only closes it for amd-host users.

Pick one (don't ship as-is):

  • Gate option 3 in the wizard with [EXPERIMENTAL — requires --image-source local to build] until rocm60 ships
  • Move option 3 behind a "Show advanced/unverified options" prompt
  • Have operator.sh print a clear pre-pull warning when KG_API_IMAGE_TAG resolves to a variant that won't be in GHCR
  • Build + push rocm60/rocm61 unverified (contradicts your own cmd_images_rocm "refuse to ship untested" stance — probably wrong, listed for completeness)

2. Doc/code drift on the publish-flow surface.

  • ADR-101.md:138-141 documents the command as ./publish.sh images api-rocm --variants rocm60.
  • scripts/publish.sh:767 header comment also says --variants rocm60,rocm61.
  • The actual implemented surface is ./publish.sh images-rocm rocm60 rocm61 — a separate subcommand with positional args, no --variants flag.

Two stale references to a non-existent flag, both in load-bearing maintainer-facing docs. Cheap to fix; should align with what cmd_images_rocm actually parses.

3. Triplication of the GPU_MODE → KG_API_IMAGE_TAG derivation.

operator.sh:73-80, operator/lib/common.sh:39-54, and operator/lib/guided-init.sh:316-318 all carry the same case statement. The fact that c448f098 was necessary at all — because operator.sh's copy fell out of sync with common.sh's — is empirical evidence the duplication is unsafe.

You asked whether the two-copy version is intentional separation by entry-point. The answer would be yes for two. For three, the next ROCm version bump means three edits and a high probability of repeating exactly this drift. Suggest a tiny shared file — operator/lib/image-tag.sh exporting a derive_kg_api_image_tag() function — sourced by all three. The triplication is now the load-bearing problem, not the original two-copy split.


Should-fix

4. The asymmetry comment defends a contract no test locks.

The fix to extract _build_with_cpu_fallback + the long comment in reload_embedding_model_manager are sound, but tests/unit/lib/test_embedding_model_manager_fallback.py only exercises the init path. The reload path's claims — (a) does NOT fall back, (b) atomic-swap preserves the previous working manager on failure — are asserted by code reading only.

Add a test: existing _model_manager set → call reload_embedding_model_manager with a load_model that raises → assert _model_manager is still the original instance and a RuntimeError propagates. Otherwise a future refactor of reload_* breaks the contract silently and the comment becomes a lie.

5. Visual embedding init log still uses the dishonest pattern this PR explicitly killed for text.

api/app/main.py:218:

logger.info("   Visual embedding features may be limited")

Same shape as the "Falling back to API-based embeddings" line you just rewrote at line 199. When init_visual_embedding_generator raises and an image-using profile is active, "may be limited" is just as misleading as the old text-path message. The full fallback is correctly deferred per ADR-101, but the log line is a one-line edit (downgrade severity, surface what actually happens — ingestion of image content will fail until the profile is changed or the device is fixed) and it's in scope for this PR. Skipping it leaves a half-honest startup log.

6. No tracking issues for the deferred work.

PR description lists three follow-ups (rocm60/rocm61 publish, visual fallback, --image-source parity); GitHub issue search returns only #405. "Deferred to follow-up PR" without an issue is the kind of promise that decays. Open three before merge.

7. .operator.conf KG_API_IMAGE_TAG becomes stale if a user edits GPU_MODE by hand.

guided-init.sh persists the derived tag at init time. Both loaders derive the tag only when it is unset (if [ -z "$KG_API_IMAGE_TAG" ]). So a user who later edits GPU_MODE=cpu → GPU_MODE=amd-host in .operator.conf (which the auto-generated header comment at guided-init.sh:313 actively suggests they do) will keep the old KG_API_IMAGE_TAG=latest and silently land on the CUDA-bundled image, the same shape of bug c448f098 just fixed.

Two viable shapes:

  • Persist only the user-overridable bits (GPU_MODE, DEV_MODE, IMAGE_SOURCE) and always re-derive KG_API_IMAGE_TAG on load. Honors edits.
  • Keep persisting, but document inline in the generated file: # DO NOT edit GPU_MODE without also updating KG_API_IMAGE_TAG. Worse — relies on the user noticing.

Related: the .operator.conf auto-generated comment (./operator.sh config --dev true --gpu nvidia|amd|mac|cpu) references a config subcommand that doesn't exist in operator.sh. Another doc/code drift, low severity.


Nits

8. ROCM_VERSION override is undocumented in the persisted config.

common.sh:47 honors ROCM_VERSION as a shell-level override (amd → ${ROCM_VERSION:-rocm60}). Nobody writes it; the header comment in .operator.conf doesn't mention it; the wizard doesn't ask. Either drop the override path (currently dead weight outside test/dev) or document it as a comment line in the generated file.

9. guided-init.sh:317 doesn't honor ROCM_VERSION (only the other two copies do).

common.sh and operator.sh both honor the override for amd mode, but guided-init.sh ignores it. Either way, fixing this is moot if you collapse the triplication (item 3).


Praise

  • The _build_with_cpu_fallback extraction is cleanly keyword-only-args and module-scope; the asymmetry comment on reload_* is exactly the kind of long-form-rationale-in-code that earns its keep three months from now when someone asks "why don't we fall back here too?"
  • The case-insensitive cpu retry guard with the parametrized test is a small fix that prevents real double-log noise.
  • test_embedding_model_manager_real_torch.py is the right shape — proves the catchable-exception contract the mocked tests can only assume. The skip-if-CUDA guard makes the assertions honest.
  • The rocm71-host → rocm72-host rename is sound (the old tag was never published; gh api confirms) and the "naming convention" addition to ADR-101 — future ROCm 7.3 lands as a new tag rather than overloading — is the right immutability stance.
  • End-to-end verification on real hardware (7900 XTX gfx1100 + iGPU 9950X3D gfx1036), with the CPU fallback also firing correctly during diagnosis, is the kind of evidence that turns "should work" into "verified."
  • The ROCR_VISIBLE_DEVICES=0 hint at yellow-info-warning severity is the right level — discrete + iGPU coexistence on Ryzen 7000+ is a real foot-gun and runtime ordering isn't guaranteed across kernel/ROCm versions, even if your hardware happened to enumerate the discrete card first.

Empirical checks done in this review

  • gh api .../kg-api/versions → confirmed rocm60/rocm61 not in registry
  • requires-python floor → only fuse/pyproject.toml pins (>=3.11); API has no constraint. py3.13→py3.12 base bump safe.
  • grep -rn "rocm71" → no stale references survive the rename
  • ADR-101's configure.py embedding --device reference → verified the flag exists in operator/configure.py:774
  • kg-api:latest multi-arch (amd64+arm64) → Apple Silicon path through mac → latest is sound

…visual log, doc-surface fixes

Addresses code-reviewer findings on PR #408:

* **operator/lib/image-tag.sh** is the single source of truth for
  GPU_MODE → KG_API_IMAGE_TAG mapping. operator.sh's load_config,
  operator/lib/common.sh's load_operator_config, and
  operator/lib/guided-init.sh all source it and call
  derive_kg_api_image_tag(). The earlier inline-three-times shape
  drifted (commit c448f09 was the empirical proof — fixed one of the
  copies after the bug had silently shipped). One definition now.

* **Wizard option [3] no longer points at an unpublished image.**
  Previously `amd` mode resolved to KG_API_IMAGE_TAG=rocm60, but
  rocm60/rocm61 are deferred per ADR-101 §Negative — GHCR has only
  `latest` and `rocm72-host`. Users picking option [3] would have hit
  `manifest unknown` on pull — same "looks fixed, silently broken"
  shape this PR is supposed to eliminate. Now `amd` mode resolves to
  `latest` and relies on ADR-101's CPU fallback. Setting
  ROCM_VERSION=rocm60 in .operator.conf forces the variant tag once a
  tester confirms a build (tracked in #409). Wizard text reframed:
  "Linux with AMD GPU (ROCm 6.x — preview) / Falls back to CPU
  embeddings via :latest until variant ships".

* **Visual-embedding startup log honest at main.py:218.** Parity with
  the text-embedding fix at main.py:199. No more "Visual embedding
  features may be limited" when the active profile asks for visual
  embeddings and the load failed (image ingestion will actually fail,
  not "be limited"). The new message names the failure mode and gives
  two concrete remediation paths (switch profile, or repair model
  config). Mirrors the text path.

* **Doc/code surface drift fixed.** ADR-101 examples and publish.sh
  comment referenced `--variants rocm60,rocm61` flag and
  `images api-rocm` command that don't exist. Real surface is
  `images-rocm rocm60 --force` (positional variant args + force flag
  to opt into deferred variants).

Issues opened for the items intentionally deferred from this PR:
  #409 — Build & publish ROCm 6.x variants (rocm60, rocm61)
  #410 — CPU fallback for visual embedding generator (text-parity)
  #411 — Operator: --image-source flag silently dropped by guided-init.sh
  #412 — Operator: persisted KG_API_IMAGE_TAG goes stale on manual edits

Reload-asymmetry contract test (review item 4) intentionally deferred —
covers existing pre-PR behavior, separate concern.

Verified the helper across the three call sites:

  bash> source operator/lib/image-tag.sh
  bash> for m in amd amd-host nvidia cpu; do
          echo "$m -> $(derive_kg_api_image_tag $m)"
        done
  amd       -> latest        # was rocm60 (unpublished — review footgun)
  amd-host  -> rocm72-host
  nvidia    -> latest
  cpu       -> latest
  bash> derive_kg_api_image_tag amd rocm60
  rocm60                                  # ROCM_VERSION override still works
aaronsb merged commit fceabf3 into main May 25, 2026
3 checks passed

Development

Successfully merging this pull request may close these issues.

AMD ROCm init path silently broken: API image has CUDA-only PyTorch