Fix ROCm build: add absolute include paths and use repo_env flags#129
Fix ROCm build: add absolute include paths and use repo_env flags#129jvantuyl wants to merge 1 commit into
Conversation
When building XLA with ROCm, Bazel's header validation rejects absolute-path includes from the ROCm clang resource directory because rocm_configure.bzl only adds the bazel-cache-relative path to cxx_builtin_include_directories. Additionally, the rocm_configure repository rule cannot see TF_ROCM_CLANG and CLANG_COMPILER_PATH because they were set via --action_env (build actions only) rather than --repo_env (repository rules), causing the crosstool wrapper to default to gcc. This commit: - Adds a patch to rocm_configure.bzl that includes both the real and symlink-resolved absolute ROCm clang resource paths in cxx_builtin_include_directories - Changes --action_env to --repo_env for TF_ROCM_CLANG and CLANG_COMPILER_PATH so the repository rule generates a clang-based crosstool wrapper - Adds TF_ROCM_CLANG and CLANG_COMPILER_PATH as ENV vars in the Dockerfile for container builds
| # See https://github.com/tensorflow/tensorflow/pull/86413 and the | ||
| # referenced threads. | ||
| git apply $dir/cuda_ncrtc_builtins.patch | ||
|
|
||
| # When building XLA with ROCm, the compiler resolves symlinks in the | ||
| # local_config_rocm repository and reports include paths at the real | ||
| # absolute location (e.g. /opt/rocm/llvm/lib/clang/22/include) | ||
| # rather than the symlinked bazel-cache path. Bazel's header | ||
| # validation rejects these as "absolute path inclusions" unless they | ||
| # are listed in cxx_builtin_include_directories. This patch adds | ||
| # the absolute resource directory paths alongside the relative ones. | ||
| if [[ -n "${XLA_TARGET:-}" && "${XLA_TARGET}" == "rocm" ]]; then | ||
| git apply $dir/rocm_absolute_includes.patch | ||
| fi |
There was a problem hiding this comment.
do we want to apply cuda patches conditionally too?
There was a problem hiding this comment.
I think that is a good idea. It was a bit out of scope for me, though.
There was a problem hiding this comment.
Since we're touching this file, how about we apply rocm patches if XLA_TARGET starts with rocm and cuda patches if XLA_TARGET starts with cuda?
This allows for the existing cuda targets and opens the way for version-specific rocm targets
| # the crosstool wrapper. Otherwise the wrapper defaults to | ||
| # gcc and chokes on Clang-only flags like -Qunused-arguments. | ||
| ~s/--repo_env=TF_ROCM_CLANG="1"/, | ||
| ~s|--repo_env=CLANG_COMPILER_PATH="/usr/lib/llvm-18/bin/clang"|, |
There was a problem hiding this comment.
I'm not sure if we want to hardcode the clang path like this
There was a problem hiding this comment.
I don't like it either. I went with this because the clang path is already hard-coded (though less obviously) further down.
Specifically:
# We need to install "add-apt-repository" first
RUN apt-get update && \
# Install basic system dependencies
apt-get update && apt-get install -y ca-certificates curl git unzip wget xxd && \
# Install Clang
clang_version="18" && \
apt-get install -y wget gnupg software-properties-common lsb-release && \
wget -qO- https://apt.llvm.org/llvm.sh | bash -s -- $clang_version && \
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-$clang_version 100 && \
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-$clang_version 100 &&
\
apt-get clean -y && rm -rf /var/lib/apt/lists/*It uses a variable, but the value is just hardcoded.
Any suggestions as to how to modify this to make this palatable?
There was a problem hiding this comment.
I'm ok with this then! Maybe adding this reference as a comment to the file is a good idea.
My concern was exactly the path getting out of sync, but since we're hardcoding it in the installation too it's fine.
There was a problem hiding this comment.
The issue is that the path is specific to the Docker build, but technically someone could run the build outside, so ideally the code here would not be Docker specific.
Would this work?
~s|--repo_env=CLANG_COMPILER_PATH="#{System.find_executable("clang")}"|and if not, then I would do:
~s|--repo_env=CLANG_COMPILER_PATH="#{System.fetch_env!("CLANG_COMPILER_PATH")"|so that we require the caller to set the env (which the Docker already does).
wdyt?
|
For what it's worth, I encountered these issues building on Debian Trixie on an AMD RYZEN AI MAX+ 395 (Halo Strix) system. It is relatively affordable and has 128GB of unified memory. For your average developer, it's really in the sweet spot. In the absence of making ROCm pre-built extensions, getting the build working really helps independent developers adopt local AI. Or at least it would help me. 😅 |
| --build-arg BASE_IMAGE=$base_image \ | ||
| --build-arg ROCM_VERSION=$rocm_ver \ | ||
| --build-arg XLA_TARGET=rocm \ | ||
| --network host \ |
When building XLA with ROCm, Bazel's header validation rejects absolute-path includes from the ROCm clang resource directory because rocm_configure.bzl only adds the bazel-cache-relative path to cxx_builtin_include_directories.
Additionally, the rocm_configure repository rule cannot see TF_ROCM_CLANG and CLANG_COMPILER_PATH because they were set via --action_env (build actions only) rather than --repo_env (repository rules), causing the crosstool wrapper to default to gcc.
This commit: