`ray_implicit_intersection` improvements by swahtz · Pull Request #663 · openvdb/fvdb-core

swahtz · 2026-05-28T06:40:19Z

Rewrite the per-ray SDF zero-crossing kernel (ray_implicit_intersection) to improve performance (and in the process fixed a few correctness issues); surface a shared cache-hint helper that other ops can reuse; lift compile-time specialisations into SampleRaysUniform.cu; clean up HDDAIterators.h to remove warp-divergence in the iterator setters; and broaden the test surface with 7 new behavioural tests adapted from the OpenVDB level-set / volume ray-intersector unit tests.

This optimization started after being inspired by Efty's 'branchless getValue' work in NanoVDB and wanting to apply the same principles to the ray-marching kernels but as I pulled the thread, it lead to other fruit falling out of the tree (to branch metaphors).

End-to-end on my RTX PRO 6000 Blackwell, 1 M rays per workload: 1.29×–1.48× speedup for ray_implicit_intersection (−25 to −28% kernel instructions, −2 registers/thread) and 1.05×–1.10× speedup for uniform_ray_samples (−6 registers/thread on the count pass). c10::Half inputs are now dispatched for ray_implicit_intersection; previously they raised NotImplementedError.

Motivation

ray_implicit_intersection was the dominant cost in SDF-driven rendering workflows for the recent simulation rendering work we were doing, and the current kernel had a few aspects I wanted to address for speed and correctness:

Runtime branch on eps. Every per-voxel iteration checked if (deltaT < eps) continue; even though the overwhelmingly common caller passes eps = 0. NVCC kept the branch and the live register holding eps.
Fixed-sign SDF convention. The kernel assumed "positive outside, negative inside", baking a sentinel value into the seed comparison. Rays that started inside the surface (the primary concern), or grids using the opposite convention (an ancillary benefit of the new change), returned spurious results.
No band-continuity gating. When the HDDA iterator skipped over inactive voxels (a gap between two disjoint SDF regions, or a run of NaN tile values), the kernel still linear-interpolated between the bracketing samples — emitting a "hit" in the empty space.

uniform_ray_samples had a related issue: the per-sample inner loop did stepSize = clamp(t * coneFactor, minStepSize, maxStepSize) on every step even when coneAngle == 0.0 (cone tracing disabled). That kept several extra live registers and a Clamp+mul in this latency-bound traversal.

Both kernels load read-only grid data (gridScalars, leaf masks) and emit write-once outputs (hit times, sample intervals). Both were paying L1 write-allocate cost on the output stores, evicting active-mask data from L1.

Changes

`src/fvdb/detail/utils/cuda/Caching.cuh` (new)

Three inline helpers, host/device:

_loadReadOnly(ptr) — read via __ldg (.NC qualifier in SASS) so read-only SDF data shares the read-only data-cache capacity instead of competing with the active-mask leaf data.
_storeStreaming(dst, value) / _storeStreamingPair(dst, a, b) — write via __stwt (.CS qualifier) so write-once result tensors are not promoted into L1 and don't evict the working set.

Both fall back to plain assignment / dereference on host and for types without a matching intrinsic (e.g. c10::Half); NVCC fully inlines both branches so the CPU path is unaffected. Lives in fvdb::detail::ops so any op .cu file can reach it via unqualified name lookup from inside an anonymous namespace.

`src/fvdb/detail/ops/RayImplicitIntersection.cu`

Add a bool EpsZero template parameter on rayImplicitCallback. The launcher branches once on eps == 0.0f and dispatches the corresponding specialisation on both CPU and CUDA paths. With EpsZero=true NVCC drops the per-voxel deltaT < eps branch and one register entirely. ncu confirms the kernel is the RayImplicitIntersection<1> template instance on the wip path.
All interpolation / time arithmetic done in MathType = at::opmath_type<ScalarT> so c10::Half rays compute interpolation in fp32 and only cast back to ScalarT at the streaming-store boundary.
Route the per-voxel gridScalars load through _loadReadOnly and every outTimes write (early-out, hit, miss) through _storeStreaming.
Replace the nested ifs detecting a sign flip with a single predicated check (scalarSign != voxelSign).
Track band continuity via lastT1. When the next valid voxel is contiguous along the ray (t0 == lastT1), emit a sub-voxel linear-interpolated hit time between the bracketing samples; when there's a gap (inactive voxels in the iterator, or a run of NaN tile values), fall back to the bracket-entry time, matching nanovdb::ZeroCrossing precision in that case. Prevents the kernel from interpolating across empty space between disjoint SDF regions.
Seed the sign reference from the FIRST valid (non-NaN) voxel along the ray instead of a fixed sentinel. Handles "positive outside" and "negative outside" SDF conventions identically; a ray that enters the bbox already inside the surface is reported at the exit crossing along the ray.
Add c10::kHalf to the AT_DISPATCH_V2 types so half-precision rays are actually dispatched on CUDA. Previously raised NotImplementedError: "RayImplicitIntersection" not implemented for 'Half'.

`src/fvdb/detail/ops/SampleRaysUniform.cu`

Adopt the shared Caching.cuh helpers (drops the local _storeStreaming / _storeStreamingPair copies).
Factor _emitSample out of the per-launcher lambdas as a function template. NVCC forbids extended __device__ lambdas inside generic lambdas (which the launchers use), so a free-function template is the only place the streaming-store helpers can live without inline-at-call-site repetition. The template takes the (a, b) pair as separately-deduced types so callers don't have to cast at::opmath_type<Half> = float HDDA times back to c10::Half at every call site.
Compile-time specialise countSamplesPerRayCallback and generateRaySamplesCallback on three runtime predicates (ConeZero, IncludeEndpoints, ReturnMidpoint). NVCC then prunes the dead branches and, critically, hoists stepSize = minStepSize out of the inner while-loops in the ConeZero case, removing a Clamp+mul from the hot per-sample body. ncu confirms the count-pass register count drops from 80 → 74.

`src/fvdb/detail/utils/nanovdb/HDDAIterators.h`

HDDASegmentIterator::nextSegment: replace the entering / leaving if/else ladder around mTimespan.t0 / mTimespan.t1 with predicated select expressions. Only the "active region just ended" break remains a real control-flow branch; rays in the same warp whose active state differs no longer diverge at the setter level.
HDDAVoxelIterator::nextVoxel: replace the three unrolled getDim / level-update passes with a bounded for (pass < 3) loop. Same worst-case behaviour (the level hierarchy stabilises in ≤ 3 passes) but the body collapses to a single descent for level-aligned iterations, which dominate the trace.

These two cleanups are reached through the public iterator API by ray_implicit_intersection, uniform_ray_samples, voxels_along_rays, and segments_along_rays.

Python docstrings

fvdb/grid.py, fvdb/grid_batch.py, fvdb/functional/_ray.py: document the new convention-agnostic semantics of ray_implicit_intersection* — the first valid (non-NaN) voxel along each ray seeds the sign reference, and the first subsequent voxel with the opposite sign is reported as the intersection. Both "positive outside" and "negative outside" SDF conventions are handled identically; rays that enter the bbox already inside the surface are reported at the exit crossing.

Tests

tests/unit/test_basic_ops.py:

Drop the unconditional if dtype == torch.float16: return early return in test_ray_implicit_intersection so the half-precision dispatch is now actually exercised.
Two regressions for the algorithmic fixes:
- ..._starts_inside_surface — ray whose origin sits inside the SDF sphere must report the EXIT crossing (not -1, not the bracket entry of the very first active voxel).
- ..._two_disjoint_regions — ray crossing two separated SDF spheres must report the FIRST surface, not a time inside the empty gap.
7 new @parameterized.expand(all_device_dtype_combos) tests adapted from openvdb/openvdb/unittest/TestLevelSetRayIntersector.cc and TestNanoVDB.cc (sign-of-zero), giving 34 additional test instances: sign-of-zero equivalence across all six axis-aligned directions; axis-aligned and diagonal analytic-root checks; four explicit miss configurations; non-trivial transform with non-unit voxel_size + non-zero origins; 64×64 ray sweep with per-ray geometric check; single-voxel bracket-entry interpolation pinning the primal-voxel convention.

Performance

Measured on RTX PRO 6000 Blackwell. Timing uses torch.cuda.Event over a 1024² = 1,048,576-ray camera-pinhole bundle aimed at each grid's active bbox; median of 20 iterations after trimming the top and bottom two outliers. ncu metrics use --profile-from-start off with a cudaProfilerStart/Stop window around a single 512² = 262,144-ray launch, fp32, on the optimization-engaged setting (eps = 0 / cone_angle = 0).

Wall-clock timing (median ms per call, 1 M rays)

op	dataset	dtype	setting	main	wip	speedup
`ray_implicit_intersection`	dragon (267 MB SDF)	fp32	eps=0	1.264	0.950	1.33×
`ray_implicit_intersection`	dragon	fp32	eps=1e-3	1.271	0.950	1.34×
`ray_implicit_intersection`	dragon	fp16	eps=0	n/a (unsupported)	0.892	new on wip
`ray_implicit_intersection`	dragon	fp16	eps=1e-3	n/a (unsupported)	0.908	new on wip
`ray_implicit_intersection`	emu (1.0 GB SDF)	fp32	eps=0	1.830	1.399	1.31×
`ray_implicit_intersection`	emu	fp32	eps=1e-3	1.830	1.415	1.29×
`ray_implicit_intersection`	emu	fp16	eps=0	n/a (unsupported)	1.425	new on wip
`ray_implicit_intersection`	emu	fp16	eps=1e-3	n/a (unsupported)	1.459	new on wip
`ray_implicit_intersection`	crawler (1.6 GB SDF)	fp32	eps=0	3.572	2.431	1.47×
`ray_implicit_intersection`	crawler	fp32	eps=1e-3	3.629	2.455	1.48×
`ray_implicit_intersection`	crawler	fp16	eps=0	n/a (unsupported)	2.971	new on wip
`ray_implicit_intersection`	crawler	fp16	eps=1e-3	n/a (unsupported)	2.955	new on wip
`uniform_ray_samples`	dragon	fp32	cone=0	2.943	2.790	1.05×
`uniform_ray_samples`	dragon	fp32	cone=1e-3	2.944	2.760	1.07×
`uniform_ray_samples`	dragon	fp16	cone=0	2.967	2.715	1.09×
`uniform_ray_samples`	dragon	fp16	cone=1e-3	2.968	2.785	1.07×
`uniform_ray_samples`	wdas_cloud (2.6 GB fog)	fp32	cone=0	16.878	16.030	1.05×
`uniform_ray_samples`	wdas_cloud	fp32	cone=1e-3	16.810	16.002	1.05×
`uniform_ray_samples`	wdas_cloud	fp16	cone=0	17.797	16.161	1.10×
`uniform_ray_samples`	wdas_cloud	fp16	cone=1e-3	17.741	16.747	1.06×

All fp32 outputs are bit-equivalent main vs this PR branch (output checksum match within atol=1e-3 * max(|main|, |wip|)).

ncu microarchitecture (single launch, fp32, 262k rays)

Columns are main → wip. Lower-is-better for registers, instructions, cycles, and DRAM bytes; higher-is-better for occupancy, SIMD efficiency (smsp__thread_inst_executed_per_inst_executed.ratio, out of 32), and L1 hit rate.

op	dataset	pass	regs	occ %	SIMD eff /32	insts	inst Δ	cycles	L1 hit %	DRAM MB
`ray_implicit_intersection`	dragon	implicit	80 → 78	14.1 → 14.3	10.62 → 10.06	1.29e+08 → 9.61e+07	−25.3%	2.06e+06 → 1.49e+06	94.5 → 92.2	17.8 → 17.8
`ray_implicit_intersection`	emu	implicit	80 → 78	12.4 → 12.6	9.67 → 9.12	1.89e+08 → 1.42e+08	−24.7%	3.11e+06 → 2.27e+06	93.9 → 91.2	36.8 → 30.8
`ray_implicit_intersection`	crawler	implicit	80 → 78	8.4 → 8.8	7.32 → 7.27	1.57e+08 → 1.14e+08	−27.5%	9.20e+06 → 6.78e+06	93.4 → 90.7	29.8 → 34.8
`uniform_ray_samples`	dragon	count	80 → 74	14.0 → 14.4	7.36 → 7.74	2.23e+08 → 2.41e+08	+8.2%	3.22e+06 → 2.97e+06	93.5 → 93.4	27.8 → 27.3
`uniform_ray_samples`	dragon	emit	78 → 80	14.0 → 14.1	7.35 → 7.66	2.26e+08 → 2.20e+08	−2.6%	3.24e+06 → 3.05e+06	92.9 → 92.9	23.0 → 23.1
`uniform_ray_samples`	wdas_cloud	count	80 → 74	16.1 → 16.1	4.68 → 4.86	1.55e+09 → 1.62e+09	+4.4%	1.88e+07 → 1.77e+07	88.4 → 88.4	199.8 → 218.4
`uniform_ray_samples`	wdas_cloud	emit	78 → 80	16.1 → 16.1	4.68 → 4.82	1.56e+09 → 1.56e+09	−0.2%	1.89e+07 → 1.77e+07	88.2 → 88.1	223.5 → 223.8

How the numbers map onto the changes:

ray_implicit_intersection's −25 to −28% kernel instructions is the dominant cause of the 1.29×–1.48× speedup. It comes from the combination of (a) EpsZero=true removing the per-voxel eps branch, (b) the new _emitSample-style streaming-store path eliminating the L1-bounce on output writes, and (c) HDDAIterators predicated setters cutting the per-step setter cost.
Registers/thread drop 80 → 78 on the implicit kernel and 80 → 74 on the uniform_ray_samples count pass — the count pass is exactly where the ConeZero=true hoist of stepSize = minStepSize out of the inner loop applies.
Occupancy gains are small (+0.0 to +0.4 percentage points) because the kernel was already register-bound on Blackwell and lowering registers by 2 wasn't enough to cross an occupancy boundary. The wins are instruction-count- and cycle-driven, not occupancy-driven.
L1 hit rate dips slightly on ray_implicit_intersection (93.4 → 90.7 on crawler). That's the streaming-store path bypassing L1 deliberately — it's the intended behaviour, and the freed-up L1 capacity goes to the read-only side-buffer SDF data via _loadReadOnly. The wall-clock speedup confirms it's a net win.
SIMD efficiency on uniform_ray_samples improves (7.36 → 7.74, 4.68 → 4.86) — the HDDAIterators predication wins. On ray_implicit_intersection SIMD efficiency dips slightly (10.62 → 10.06); the band-continuity check is a small new divergence path that's paid for many times over by the instruction-count reduction.
Instruction count rises on the count pass (+8.2% dragon, +4.4% wdas_cloud) even as cycles fall. The new count callback does more work per thread (compile-time-specialised termination logic, factored emit helpers) but with much better register pressure profile, so the SM keeps more eligible warps in flight per cycle — the cycle drop matches the ~5% wall-clock win.

Notes / risks

API semantics change for ray_implicit_intersection. The function is now SDF-sign-agnostic and reports the EXIT crossing for rays that start inside the surface (previously: returned -1 or a spurious bracket-entry). New behaviour matches nanovdb::ZeroCrossing. Two dedicated regressions pin it (..._starts_inside_surface, ..._two_disjoint_regions). Existing callers that relied on the old "positive-outside-only" assumption will see different outputs on inside-surface rays, but those outputs were almost certainly being treated as no-hit anyway.
c10::Half is now dispatched for ray_implicit_intersection. Code that previously caught NotImplementedError and fell back to fp32 conversion at the Python boundary should drop that branch.

Rewrite the per-ray SDF zero-crossing kernel for performance, precision, and correctness, surface a shared cache-hint helper that other ops can reuse, lift compile-time specialisations into `SampleRaysUniform.cu`, and broaden the test surface with 7 new behavioural tests adapted from the OpenVDB level-set / volume ray-intersector unit tests. Indirectly benefits from `HDDAIterators.h` cleanups landed alongside this change (branchless TimeSpan bookkeeping in `HDDASegmentIterator`, bounded-loop level convergence in `HDDAVoxelIterator`), which the ray-implicit, sample-rays-uniform, voxels-along-rays, and segments-along-rays kernels all reach through the public iterator API. ## Source changes `src/fvdb/detail/utils/cuda/Caching.cuh` (new): - `_storeStreaming` / `_storeStreamingPair`: write-once stores via `__stwt` (`.CS` qualifier in SASS) so write-once output tensors don't get promoted into L1 and evict the voxel-data working set. - `_loadReadOnly`: read-mostly load via `__ldg` (`.NC` qualifier) so side-buffer SDF data shares cache capacity instead of competing with the active-mask leaf data. Both fall back to plain assignment/dereference on host and for types without a matching intrinsic overload (e.g. `c10::Half`); NVCC fully inlines both branches, so the CPU path is unaffected. Lives in `fvdb::detail::ops` so any op `.cu` file can reach it via unqualified name lookup from inside an anonymous namespace. `src/fvdb/detail/ops/RayImplicitIntersection.cu`: - Add a `bool EpsZero` template parameter on `rayImplicitCallback`. The launcher branches once on `eps == 0.0f` (the overwhelmingly common case) and dispatches the corresponding specialisation on both CPU and CUDA paths, so NVCC drops the per-voxel `if (deltaT < eps) continue;` branch and one register entirely when `eps == 0`. - Do all interpolation / time arithmetic in `MathType = at::opmath_type<ScalarT>` so `c10::Half` rays compute interpolation in fp32 and only cast back to `ScalarT` at the streaming-store boundary. - Route the per-voxel `gridScalars` load through `_loadReadOnly` and every `outTimes` write (early-out, hit, miss) through `_storeStreaming`. - Detect sign flips with a single predicated check (`scalarSign != voxelSign`) instead of nested `if`s. - Track band continuity via `lastT1`. When the next valid voxel is contiguous along the ray (`t0 == lastT1`), emit a sub-voxel linear-interpolated hit time between the bracketing samples; when there's a gap (inactive voxels in the iterator, or a run of NaN tile values), fall back to the bracket-entry time, matching `nanovdb::ZeroCrossing` precision in that case. This prevents the kernel from interpolating across empty space between disjoint SDF regions. - Seed the sign reference from the FIRST valid (non-NaN) voxel along the ray (matching `nanovdb::ZeroCrossing` semantics) instead of a fixed sentinel. This handles both rays that start outside the surface (first sample positive, hit on crossing into the negative band) AND rays that start inside the surface (first sample negative, hit on crossing back out) without baking a fixed "positive = outside" SDF convention into the kernel. - Add `c10::kHalf` to the `AT_DISPATCH_V2` types so half-precision rays are actually dispatched on CUDA (previously raised `NotImplementedError: "RayImplicitIntersection" not implemented for 'Half'` at runtime once the unconditional fp16 test skip was removed). `src/fvdb/detail/ops/SampleRaysUniform.cu`: - Adopt the shared `Caching.cuh` helpers (drops the local `_storeStreaming` / `_storeStreamingPair` definitions). - Factor `_emitSample` out of the per-launcher lambdas as a function template. NVCC forbids extended `__device__` lambdas inside generic lambdas (which the launchers use), so a free-function template is the only place the streaming-store helpers can live without redundant inline-at-call-site repetition. The template takes the `(a, b)` pair as separately-deduced types `A, B` so callers don't have to cast `at::opmath_type<Half> = float` HDDA times back to `c10::Half` at every call site. - Compile-time specialise `countSamplesPerRayCallback` and `generateRaySamplesCallback` on three runtime predicates (`ConeZero`, `IncludeEndpoints`, `ReturnMidpoint`). NVCC then prunes the dead branches and, critically, hoists `stepSize = minStepSize` out of the inner while-loops in the `ConeZero` case, removing a `Clamp+mul` from the hot per-sample body and dropping several live registers from this latency-bound traversal. `src/fvdb/detail/utils/nanovdb/HDDAIterators.h`: - `HDDASegmentIterator::nextSegment`: replace the entering / leaving `if`/`else` ladder around `mTimespan.t0` / `mTimespan.t1` with predicated select expressions. Only the "active region just ended" `break` remains a real control-flow branch; rays in the same warp whose `active` state differs no longer diverge at the setter level. - `HDDAVoxelIterator::nextVoxel`: replace the three unrolled `getDim` / level-update passes with a bounded `for (pass < 3)` loop. Same worst-case behaviour (the level hierarchy stabilises in <= 3 passes) but the body collapses to a single descent for level-aligned iterations, which dominate the trace. ## Python docstrings `fvdb/grid.py`, `fvdb/grid_batch.py`, `fvdb/functional/_ray.py`: - Document the new convention-agnostic semantics of `ray_implicit_intersection*`: the first valid (non-NaN) voxel along each ray seeds the sign reference, and the first subsequent voxel with the opposite sign is reported as the intersection. Both "positive outside" and "negative outside" SDF conventions are handled identically; rays that enter the bbox already inside the surface are reported at the *exit* crossing along the ray. ## Tests `tests/unit/test_basic_ops.py`: - Drop the unconditional `if dtype == torch.float16: return` early return in `test_ray_implicit_intersection` so the half-precision dispatch is now actually exercised by the existing test. - Two regressions for the algorithmic fixes in this commit: * `..._starts_inside_surface`: ray whose origin sits inside the SDF sphere must report the EXIT crossing — not -1, not the bracket-entry of the very first active voxel. Pins the `nanovdb::ZeroCrossing`-style "first valid voxel seeds the reference" semantics. * `..._two_disjoint_regions`: ray that crosses two separated SDF spheres must report the FIRST surface, not a time inside the empty gap between them. Pins the band-continuity gating. - 7 new `@parameterized.expand(all_device_dtype_combos)` tests adapted from `openvdb-jswartz/openvdb/openvdb/unittest/` (`TestLevelSetRayIntersector.cc` + `TestNanoVDB.cc` for sign-of- zero), giving 34 additional test instances: | Test | Adapts from | Purpose | |---|---|---| | `..._sign_of_zero` | `TestLevelSetRayIntersector.cc:71-215` + `TestNanoVDB.cc:1520-1552` | `dir(1, +0, +0)` and `dir(1, -0, -0)` produce identical hit times across all 6 axis-aligned directions (`torch.equal` exact match). | | `..._axis_aligned_analytic` | `TestLevelSetRayIntersector.cc:43-247` | Hit time for axis-aligned rays through sphere centre matches the analytic ray-sphere root within a voxel diagonal; covers `±x/±y/±z` (subsumes the OpenVDB negative-direction case). | | `..._diagonal_analytic` | `TestLevelSetRayIntersector.cc:249-278` | Diagonal ray exercising 3-axis HDDA stepping; hit time within voxel diagonal of analytic root. | | `..._explicit_misses` | `TestLevelSetRayIntersector.cc:311-389` (`testMissedIntersections`) | Four miss configurations all return `-1`: ray bypassing bbox, ray clipping bbox corner away from sphere, ray pointed away from bbox, ray grazing inside bbox outside sphere. | | `..._non_trivial_transform` | `TestLevelSetRayIntersector.cc:99-216` | Non-unit `voxel_size=0.25` and non-zero `origins=(10, 20, 30)` — exercises `transform.applyToRay` (`RayImplicitIntersection.cu:82`), previously untested. | | `..._high_resolution_sweep` | `TestLevelSetRayIntersector.cc:280-308` | 64×64 = 4096 ray sweep; geometric check `(hit_pt - center).norm() ≈ sphere_rad` per ray. fp16 explicitly skipped (precision insufficient for voxel-diagonal tolerance). | | `..._single_voxel_bracket_entry` | (no direct OpenVDB analogue — pins `RayImplicitIntersection.cu:121-136`) | Symmetric `+1/-1` step SDF; verifies the linear-interp branch lands the zero exactly at the midpoint between two bracketing primal-voxel samples. | - Convention pinned by the bracket-entry test: under fvdb's `voxel_to_world`, primal voxel `i` is at world `i*voxel_size + origin` (treated as a node, not a cell with a `+0.5` offset). Combined with the dual transform's `+0.5` shift, the kernel's linear interpolation between SDF samples at primal voxels 3 (+1) and 4 (-1) produces a zero crossing at world `x = 3.5`, not `x = 4.0` as you'd get under a cell-centred convention. The test comment documents this for future readers. ## Test plan - Build: `./build.sh install` (cp312, fvdb conda env). - Targeted: `cd tests && pytest unit/test_basic_ops.py -v -k ray_implicit` -> 49 passed, 1 skipped (deliberate fp16 skip in `..._high_resolution_sweep`). - Adjacent regressions (no behavioural changes expected, sanity only): * `pytest unit/test_basic_ops.py` -> 265 passed, 1 skipped. * `pytest unit/test_basic_ops_single.py` -> 154 passed. * `pytest unit/test_ray_marching.py unit/test_sample.py` -> 455 passed, 4 skipped. - C++: `RayImplicitIntersection` has no `gtest` coverage today, so `./build.sh ctest` would not exercise this change. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot

Pull request overview

This PR optimizes ray traversal kernels and updates ray_implicit_intersection semantics to be sign-convention agnostic, including CUDA half dispatch and broader behavioral coverage.

Changes:

Adds shared CUDA cache-hint helpers for read-only loads and streaming stores.
Reworks ray_implicit_intersection, uniform_ray_samples, and HDDA iterators with compile-time specialization and traversal optimizations.
Updates Python docs and adds/expands tests for ray implicit intersection edge cases.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`src/fvdb/detail/utils/cuda/Caching.cuh`	Adds reusable cache-hint load/store helpers.
`src/fvdb/detail/ops/RayImplicitIntersection.cu`	Rewrites zero-crossing traversal semantics and dispatch specialization.
`src/fvdb/detail/ops/SampleRaysUniform.cu`	Specializes sampling callbacks and reuses streaming-store helpers.
`src/fvdb/detail/utils/nanovdb/HDDAIterators.h`	Refactors HDDA segment/voxel iteration control flow.
`fvdb/grid.py`	Documents updated single-grid ray intersection semantics.
`fvdb/grid_batch.py`	Documents updated batch ray intersection semantics.
`fvdb/functional/_ray.py`	Documents functional ray intersection semantics.
`tests/unit/test_basic_ops.py`	Enables half coverage and adds ray intersection behavioral tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

swahtz added 2 commits April 30, 2026 11:53

Merge branch 'main' into raytrace_improvements_wip

2e7703d

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

swahtz added this to the v0.5 milestone May 28, 2026

swahtz added optimization Performance or memory optimization core library Core fVDB library. i.e. anything in the _Cpp module (C++) or fvdb python module labels May 28, 2026

swahtz requested a review from Copilot May 28, 2026 06:40

Copilot started reviewing on behalf of swahtz May 28, 2026 06:41 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ray_implicit_intersection` improvements#663

`ray_implicit_intersection` improvements#663
swahtz wants to merge 2 commits into
openvdb:mainfrom
swahtz:raytrace_improvements_wip

swahtz commented May 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

swahtz commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

src/fvdb/detail/utils/cuda/Caching.cuh (new)

src/fvdb/detail/ops/RayImplicitIntersection.cu

src/fvdb/detail/ops/SampleRaysUniform.cu

src/fvdb/detail/utils/nanovdb/HDDAIterators.h

Python docstrings

Tests

Performance

Wall-clock timing (median ms per call, 1 M rays)

ncu microarchitecture (single launch, fp32, 262k rays)

Notes / risks

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

swahtz commented May 28, 2026 •

edited

Loading

`src/fvdb/detail/utils/cuda/Caching.cuh` (new)

`src/fvdb/detail/ops/RayImplicitIntersection.cu`

`src/fvdb/detail/ops/SampleRaysUniform.cu`

`src/fvdb/detail/utils/nanovdb/HDDAIterators.h`