Draft
2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@

# Emacs
*~

gpu/install_gpu_driver.sh.d
76 changes: 74 additions & 2 deletions gpu/README.md
@@ -28,8 +28,8 @@ CUDA | Full Version | Driver | cuDNN | NCCL | Tested Dataproc Image Ver
-----| ------------ | --------- | --------- | -------| ---------------------------
11.8 | 11.8.0 | 525.147.05| 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Ubuntu 22.04)
12.0 | 12.0.1 | 525.147.05| 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Rocky 9, Ubuntu 22.04)
12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
12.4 | 12.4.1 | 590.48.01| 9.1.0.70 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
12.6 | 12.6.3 | 590.48.01| 9.6.0.74 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+

**Supported Operating Systems:**

@@ -289,6 +289,78 @@ handles metric creation and reporting.
older versions of the `report_gpu_metrics.py` service. The current script
and agent versions aim to mitigate this. If encountered, check agent logs.

## Development and Testing

If you are modifying this initialization action, you can use the provided test infrastructure to validate your changes locally before deploying them to production.

### Local Integration Testing (Bazel / Podman)

Before pushing any changes to GitHub, you **must** run the integration tests locally to validate your modifications against the full test matrix (`test_gpu.py`). These tests use `absl.testing.parameterized` and the `integration_tests.dataproc_test_case` framework to spin up ephemeral Dataproc clusters and validate GPU functionality (SINGLE, STANDARD, KERBEROS, MIG, etc.).

We provide a Podman wrapper that runs the Bazel test suite locally in a container that closely mirrors the remote CI sandbox environment.

1. **Credentials:** Save your Google Cloud Application Default Credentials (ADC) locally (typically at `~/.config/gcloud/application_default_credentials.json`) and copy that file to `initialization-actions/key.json`.
2. **Environment:** You must have a configured `env.json` in the `gpu/` directory.
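
The two prerequisites above can be checked before a run with a small pre-flight sketch. `preflight` is a hypothetical helper, not part of this repository, and it assumes it is given the root of the `initialization-actions` checkout; it only confirms that the two files exist and are non-empty, not that their contents are valid.

```bash
#!/usr/bin/env bash
# Pre-flight sketch: confirm the inputs described above are present.
# preflight is a hypothetical helper; it does not validate file contents.

preflight() {
  local root="${1:-.}"   # repository root; defaults to the current directory
  local missing=0
  local f
  for f in "${root}/key.json" "${root}/gpu/env.json"; do
    if [[ ! -s "${f}" ]]; then
      echo "missing or empty: ${f}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

For example, `preflight . && ./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_gpu_allocation"` would start the tests only when both files are in place.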

To run the full suite in the Podman container (Unfiltered):

> ⚠️ **WARNING: HIGH RESOURCE CONSUMPTION**
> An unfiltered run executes the entire test matrix (currently ~12 shards). Because the script is configured to run up to 10 jobs in parallel, this will concurrently provision up to 10 separate Dataproc clusters. This requires massive GCP quota (e.g., ~900 vCPUs and ~30 GPUs simultaneously if using `n1-standard-32` profiles) and will take 60-90 minutes.

```bash
cd initialization-actions
# Test a specific Dataproc image version against the full suite
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22"
```
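
The quota figures in the warning above can be roughly reproduced with simple arithmetic. The sketch below is an estimate under assumptions that this README does not state: each test cluster has 3 nodes (1 master and 2 workers), every node is an `n1-standard-32` (32 vCPUs), and each node has 1 GPU attached. `estimate_quota` is a hypothetical helper.

```bash
# Rough quota estimate for a parallel test run.
# ASSUMPTIONS (not taken from this README): 3 nodes per cluster,
# 32 vCPUs per node (n1-standard-32), 1 GPU per node.
estimate_quota() {
  local jobs="$1" nodes="$2" vcpus_per_node="$3" gpus_per_node="$4"
  echo "vCPUs: $(( jobs * nodes * vcpus_per_node ))"
  echo "GPUs: $(( jobs * nodes * gpus_per_node ))"
}

# 10 parallel jobs, 3 nodes each
estimate_quota 10 3 32 1
# prints:
#   vCPUs: 960
#   GPUs: 30
```

Under those assumptions the result is on the order of the ~900 vCPUs and ~30 GPUs quoted above; actual usage depends on the cluster shapes each test configuration requests.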

To run a specific test filter to iterate quickly on a failure (Recommended):

```bash
cd initialization-actions

# Filter by a specific test function
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_gpu_allocation"

# Filter by another specific test function
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_install_gpu_cuda_nvidia_with_spark_job"

# Filter by the entire class
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=NvidiaGpuDriverTestCase"
```

### Manual Verification Scripts

If you have already provisioned a Dataproc cluster (e.g., `my-cluster`) and want to verify its GPU configuration without running the full Bazel test suite, you can use the standalone verification scripts.

```bash
# Verify using the local Python script
python3 gpu/verify_external_cluster.py \
--cluster=my-cluster \
--region=us-east4 \
--zone=us-east4-b \
--project=my-project \
--tests smi agent spark torch tf numa

# Or using the bash equivalent
export CLUSTER_NAME=my-cluster PROJECT_ID=my-project REGION=us-east4 ZONE=us-east4-b
./gpu/verify_external_gpu_cluster.sh
```

### Advanced Spark / ML Validation

For comprehensive validation of Spark RAPIDS, PyTorch, and TensorFlow on a running cluster, an external testing script is available in the associated `cloud-dataproc/gcloud` repository.

```bash
# Configure the gcloud test environment
cd ../cloud-dataproc/gcloud
source lib/env.sh # Populates environment variables from env.json

# Execute the comprehensive Spark GPU test suite against the configured cluster
./t/spark-gpu-test.sh
```

This script runs commands over SSH to validate the NUMA configuration, runs PyTorch and TensorFlow in their separate Conda environments, verifies NVCC and cuDNN, and submits `SparkPi` and `JavaIndexToStringExample` Spark jobs configured to use the RAPIDS accelerator plugin.

## Important notes

* This initialization script will install NVIDIA GPU drivers in all nodes in
110 changes: 110 additions & 0 deletions gpu/TESTING.md
@@ -0,0 +1,110 @@
# Testing the GPU Initialization Script

This document describes the recommended iterative development and testing process for the `install_gpu_driver.sh` script. It covers a fast manual loop that bypasses the slow integration runner during development, followed by the comprehensive integration tests to run once the change is complete.

## Fast Iterative Development (SSH/Manual)

This initialization action is designed to be **idempotent**, meaning it can be run multiple times on the same node without breaking the environment. It achieves this by writing "completion sentinels" to `/opt/install-dpgce/complete/` after successfully finishing each phase (e.g., `build-dependencies`, `nccl`, `cuda`).
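
The sentinel mechanism described above can be illustrated with a minimal sketch. This is not the code in `install_gpu_driver.sh`; `run_phase` and `SENTINEL_DIR` are hypothetical names chosen for illustration, and only the directory path comes from this document.

```bash
#!/usr/bin/env bash
# Sketch of the completion-sentinel pattern.
# run_phase and SENTINEL_DIR are hypothetical names, not taken from
# install_gpu_driver.sh.
SENTINEL_DIR="${SENTINEL_DIR:-/opt/install-dpgce/complete}"

run_phase() {
  local name="$1"; shift
  # A sentinel from an earlier run means the phase already succeeded.
  if [[ -f "${SENTINEL_DIR}/${name}" ]]; then
    echo "skip ${name}: already complete"
    return 0
  fi
  # Run the phase command; write the sentinel only if it succeeds,
  # so a failed phase is retried on the next run.
  "$@" || return $?
  mkdir -p "${SENTINEL_DIR}"
  touch "${SENTINEL_DIR}/${name}"
}
```

With this shape, a call such as `run_phase nccl build_nccl` (where `build_nccl` is a placeholder for the real work) does nothing on a second invocation, which is why deleting a sentinel is what forces a phase to run again.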

To facilitate rapid iteration, we use the tooling provided in the companion `cloud-dataproc/gcloud` repository. This repo contains the test infrastructure, environment configuration (`env.json`), and lifecycle management scripts (`recreate-dpgce`, `ssh-m`, `scp-m`) necessary to provision and interact with test clusters efficiently.

When making structural or execution logic changes, you want to avoid destroying and recreating the entire Dataproc cluster during each test cycle. Instead, follow this incremental workflow:

### 1. Provision a "Bare" GPU Cluster
First, configure your target OS and versions in `cloud-dataproc/gcloud/env.json`. Then, use the `--no-init-action` flag on the recreation script to provision a cluster with GPUs attached, but *without* running any initialization actions during boot.

```bash
cd cloud-dataproc/gcloud
./bin/recreate-dpgce --gpu --no-init-action
```

### 2. Compile and Stage the Script
The `install_gpu_driver.sh` script is built from fragments. First, compile the fragments by concatenating them, then use the `scp-m` command to transfer the result to the cluster's `-m` (master) node. `scp-m` stages the file in the GCS temp bucket and pulls it down to `/tmp/install_gpu_driver.sh` over SSH.

```bash
cd initialization-actions
cat gpu/install_gpu_driver.sh.d/*.sh > gpu/install_gpu_driver.sh
cd ../cloud-dataproc/gcloud
./bin/scp-m ../../initialization-actions/gpu/install_gpu_driver.sh
```
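
Because the compiled script is a plain concatenation, a broken fragment only surfaces once the script runs on the node. As an optional extra that is not part of the documented workflow, the build can be followed by a parse-only check with `bash -n`. `build_and_check` below is a hypothetical helper sketching that idea.

```bash
# Build the script from its fragments, then syntax-check it without running it.
# build_and_check is a hypothetical helper; `bash -n` parses the file only,
# so it catches syntax errors but not runtime failures.
build_and_check() {
  local frag_dir="$1" out="$2"
  # Same concatenation as the documented compile step (glob order).
  cat "${frag_dir}"/*.sh > "${out}" || return 1
  bash -n "${out}"
}
```

For example, `build_and_check gpu/install_gpu_driver.sh.d gpu/install_gpu_driver.sh` returns non-zero if the concatenated result does not parse, before any time is spent staging it.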

### 3. Execute and Monitor (Incremental Testing)
Execute the script manually over SSH as root. Piping the output through `tee` captures the logs the same way Dataproc normally records initialization script output.

**Crucially, when re-running the script to test a specific fix, you must purge the relevant completion sentinels** (and partial build directories like `nccl`) so the script doesn't skip the phase you are trying to test.

* To run the *entire* script from scratch: `sudo rm -rf /opt/install-dpgce/complete`
* To re-test only the NCCL build: `sudo rm -f /opt/install-dpgce/complete/nccl && sudo rm -rf /opt/install-dpgce/nccl`

```bash
cd cloud-dataproc/gcloud
./bin/ssh-m 'sudo rm -rf /opt/install-dpgce/complete' # Example: clear everything
cd ../../initialization-actions
./gpu/install-in-screen.sh
```

If your SSH connection drops, simply run `./gpu/install-in-screen.sh` again to instantly re-attach to the running session without losing context or interrupting the installation.
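
When re-testing a single phase rather than the whole script, the purge commands listed earlier in this step can be wrapped in a small helper. `reset_phase` is hypothetical, and it assumes every phase follows the layout this document shows only for `nccl`: a sentinel at `<root>/complete/<phase>` and any partial build at `<root>/<phase>`.

```bash
# reset_phase is a hypothetical helper. It ASSUMES each phase keeps its
# sentinel at <root>/complete/<phase> and any partial build at <root>/<phase>,
# a layout this document only shows for nccl.
reset_phase() {
  local phase="$1"
  local root="${2:-/opt/install-dpgce}"
  # Accept simple names only, and never delete the sentinel directory itself.
  if [[ ! "${phase}" =~ ^[A-Za-z0-9_-]+$ || "${phase}" == "complete" ]]; then
    echo "usage: reset_phase <phase> [root]" >&2
    return 2
  fi
  rm -f "${root}/complete/${phase}"
  rm -rf "${root:?}/${phase}"
}
```

On a cluster node this would need to run as root, for example from a root shell after sourcing the function: `reset_phase nccl`.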

### 4. Verify with the Test Suite
Once the installation script completes without errors, run the external testing suite to ensure all Conda environments (PyTorch, TensorFlow, RAPIDS) and Spark services correctly bind to the GPU.

```bash
cd cloud-dataproc/gcloud
bash t/spark-gpu-test.sh
```

## Continuous Integration Testing (Bazel/Podman)

Once the manual tests pass, you **must** verify the script behaves correctly within the isolated Python `absl` test harness (`test_gpu.py`) before pushing your changes to GitHub. This validates the full matrix of installation scenarios (SINGLE, STANDARD, KERBEROS, MIG, etc.).

We use a Podman wrapper to execute the Bazel test suite locally in a container that closely mirrors the remote CI environment.

1. **Credentials:** Ensure your Google Cloud Application Default Credentials (ADC) are saved locally (typically `~/.config/gcloud/application_default_credentials.json`). Copy them to the root of the repository:
```bash
cp ~/.config/gcloud/application_default_credentials.json ./key.json
```

2. **Execute Full Suite (Unfiltered):** To execute the entire parameterized test matrix, run the wrapper script without a test filter.

> ⚠️ **WARNING: HIGH RESOURCE CONSUMPTION**
> An unfiltered run executes all ~12 active parameterized shards. Because the script runs with `--jobs=10`, this will concurrently provision up to 10 separate Dataproc clusters. This requires a very large GCP quota (roughly 900 vCPUs and 30 GPUs simultaneously if using `n1-standard-32` profiles) and takes approximately 60 to 90 minutes to complete. Do not run this unless you are finalizing a major PR.

```bash
cd initialization-actions
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22"
```

3. **Execute Specific Tests (Recommended for Iteration):** When iterating on a specific feature or failure, always pass Bazel arguments to filter the test execution. This saves significant time and quota. You can filter by test function name or class.

*Filter by a specific test function:*
```bash
cd initialization-actions
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_gpu_allocation"
```

*Filter by a specific test function that executes spark jobs:*
```bash
cd initialization-actions
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_install_gpu_cuda_nvidia_with_spark_job"
```

*Filter by test class (runs all tests in the class):*
```bash
cd initialization-actions
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=NvidiaGpuDriverTestCase"
```

## Compiling the AST Splitter Tool (`split.go`)

If you need to re-split `install_gpu_driver.sh` into its `.d/` fragments (e.g., because the main script was modified instead of the fragments), use the Go-based AST parsing tool (`split.go`), which chunks the bash script along its syntax tree.

To compile the tool locally:

```bash
cd initialization-actions/gpu
go mod init split
go get mvdan.cc/sh/v3/syntax
go build -o split_ast split.go
```

Once compiled, executing `./split_ast install_gpu_driver.sh` will parse the script and populate the `install_gpu_driver.sh.d/` directory with the chunked components.
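
After re-splitting, one way to gain confidence in the fragments is a round-trip check: concatenate them as in the compile step and compare the result with the original script. The sketch below assumes the splitter is intended to be lossless, which this document does not state; `check_roundtrip` is a hypothetical helper. Take a copy of the original script before splitting if the tool might modify it.

```bash
# check_roundtrip is a hypothetical helper: it concatenates the fragments in
# glob order and compares the bytes with the given script.
check_roundtrip() {
  local script="$1" frag_dir="$2"
  if cat "${frag_dir}"/*.sh | cmp -s - "${script}"; then
    echo "fragments reproduce ${script}"
  else
    echo "fragments differ from ${script}" >&2
    return 1
  fi
}
```

For example, `check_roundtrip install_gpu_driver.sh install_gpu_driver.sh.d` run from the `gpu/` directory reports whether the fragments still concatenate back to the same script.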