Skip to content

Fix ROCm build: add absolute include paths and use repo_env flags#129

Open
jvantuyl wants to merge 1 commit into
elixir-nx:mainfrom
jvantuyl:fix/rocm_path_herm_fix/1
Open

Fix ROCm build: add absolute include paths and use repo_env flags#129
jvantuyl wants to merge 1 commit into
elixir-nx:mainfrom
jvantuyl:fix/rocm_path_herm_fix/1

Conversation

@jvantuyl
Copy link
Copy Markdown

When building XLA with ROCm, Bazel's header validation rejects absolute-path includes from the ROCm clang resource directory because rocm_configure.bzl only adds the bazel-cache-relative path to cxx_builtin_include_directories.

Additionally, the rocm_configure repository rule cannot see TF_ROCM_CLANG and CLANG_COMPILER_PATH because they were set via --action_env (build actions only) rather than --repo_env (repository rules), causing the crosstool wrapper to default to gcc.

This commit:

  • Adds a patch to rocm_configure.bzl that includes both the real and symlink-resolved absolute ROCm clang resource paths in cxx_builtin_include_directories
  • Changes --action_env to --repo_env for TF_ROCM_CLANG and CLANG_COMPILER_PATH so the repository rule generates a clang-based crosstool wrapper
  • Adds TF_ROCM_CLANG and CLANG_COMPILER_PATH as ENV vars in the Dockerfile for container builds

When building XLA with ROCm, Bazel's header validation rejects
absolute-path includes from the ROCm clang resource directory because
rocm_configure.bzl only adds the bazel-cache-relative path to
cxx_builtin_include_directories.

Additionally, the rocm_configure repository rule cannot see
TF_ROCM_CLANG and CLANG_COMPILER_PATH because they were set via
--action_env (build actions only) rather than --repo_env (repository
rules), causing the crosstool wrapper to default to gcc.

This commit:
- Adds a patch to rocm_configure.bzl that includes both the real
  and symlink-resolved absolute ROCm clang resource paths in
  cxx_builtin_include_directories
- Changes --action_env to --repo_env for TF_ROCM_CLANG and
  CLANG_COMPILER_PATH so the repository rule generates a
  clang-based crosstool wrapper
- Adds TF_ROCM_CLANG and CLANG_COMPILER_PATH as ENV vars in the
  Dockerfile for container builds
Comment on lines 21 to +34
# See https://github.com/tensorflow/tensorflow/pull/86413 and the
# referenced threads.
git apply $dir/cuda_ncrtc_builtins.patch

# When building XLA with ROCm, the compiler resolves symlinks in the
# local_config_rocm repository and reports include paths at the real
# absolute location (e.g. /opt/rocm/llvm/lib/clang/22/include)
# rather than the symlinked bazel-cache path. Bazel's header
# validation rejects these as "absolute path inclusions" unless they
# are listed in cxx_builtin_include_directories. This patch adds
# the absolute resource directory paths alongside the relative ones.
if [[ -n "${XLA_TARGET:-}" && "${XLA_TARGET}" == "rocm" ]]; then
git apply $dir/rocm_absolute_includes.patch
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to apply cuda patches conditionally too?

Copy link
Copy Markdown
Author

@jvantuyl jvantuyl May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a good idea. It was a bit out of scope for me, though.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're touching this file, how about we apply rocm patches if XLA_TARGET starts with rocm and cuda patches if XLA_TARGET starts with cuda?

This allows for the existing cuda targets and opens the way for version-specific rocm targets

Comment thread lib/xla.ex
# the crosstool wrapper. Otherwise the wrapper defaults to
# gcc and chokes on Clang-only flags like -Qunused-arguments.
~s/--repo_env=TF_ROCM_CLANG="1"/,
~s|--repo_env=CLANG_COMPILER_PATH="/usr/lib/llvm-18/bin/clang"|,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to hardcode the clang path like this

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like it either. I went with this because the clang path is already hard-coded (though less obviously) further down.

Specifically:

# We need to install "add-apt-repository" first
RUN apt-get update && \
  # Install basic system dependencies
  apt-get update && apt-get install -y ca-certificates curl git unzip wget xxd && \
  # Install Clang
  clang_version="18" && \
  apt-get install -y wget gnupg software-properties-common lsb-release && \
  wget -qO- https://apt.llvm.org/llvm.sh | bash -s -- $clang_version && \
  update-alternatives --install /usr/bin/clang clang /usr/bin/clang-$clang_version 100 && \
  update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-$clang_version 100 &&
 \
  apt-get clean -y && rm -rf /var/lib/apt/lists/*

It uses a variable, but the value is just hardcoded.

Any suggestions as to how to modify this to make this palatable?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this then! Maybe adding this reference as a comment to the file is a good idea.
My concern was exactly the path getting out of sync, but since we're hardcoding it in the installation too it's fine.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that the path is specific to the Docker build, but technically someone could run the build outside, so ideally the code here would not be Docker specific.

Would this work?

~s|--repo_env=CLANG_COMPILER_PATH="#{System.find_executable("clang")}"|

and if not, then I would do:

~s|--repo_env=CLANG_COMPILER_PATH="#{System.fetch_env!("CLANG_COMPILER_PATH")"|

so that we require the caller to set the env (which the Docker already does).

wdyt?

@jvantuyl
Copy link
Copy Markdown
Author

For what it's worth, I encountered these issues building on Debian Trixie on an AMD RYZEN AI MAX+ 395 (Halo Strix) system. It is relatively affordable and has 128GB of unified memory. For your average developer, it's really in the sweet spot.

In the absence of making ROCm pre-built extensions, getting the build working really helps independent developers adopt local AI. Or at least it would help me. 😅

Comment thread builds/build.sh
--build-arg BASE_IMAGE=$base_image \
--build-arg ROCM_VERSION=$rocm_ver \
--build-arg XLA_TARGET=rocm \
--network host \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants