Reduce dynamic-shape compile cost and select_module dispatch overhead#4911
Reduce dynamic-shape compile cost and select_module dispatch overhead#4911chun-wan wants to merge 4 commits into
Conversation
783fe62 to
b8cb3ec
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #4911 +/- ##
===========================================
+ Coverage 92.66% 92.68% +0.02%
===========================================
Files 588 588
Lines 30412 30472 +60
===========================================
+ Hits 28180 28242 +62
+ Misses 2232 2230 -2
🚀 New features to boost your workflow:
|
23ab503 to
57bad58
Compare
Is this padding the dimension to the frozen dim? What happens when the dim is larger than the static size? |
| // on the hot path. When set, the entire cache infrastructure | ||
| // is bypassed -- no read, no populate -- so the measured cost | ||
| // matches the pre-cache implementation. | ||
| // cppcheck-suppress migraphx-UseCachedEnvVar |
There was a problem hiding this comment.
Dont suppress this cppcheck warning.
| // string form so tests (and runtime) can flip the value between | ||
| // compile passes within a single process without hitting a static | ||
| // cache. N=0 disables the pass entirely (the default). | ||
| const std::size_t n = value_of(MIGRAPHX_DYN_DIM_FREEZE_TO::value()); |
| auto new_param = mm->add_parameter(name, static_shape); | ||
| mm->replace_instruction(stand_param, new_param); | ||
| mm->remove_instruction(stand_param); | ||
| any_replaced = true; |
There was a problem hiding this comment.
This can already be achieved by setting the dimensions to a static size when parsing onnx. This pass should be removed.
There was a problem hiding this comment.
Agreed, that method is also more customizable such that multiple different static dimensions can be set for different dynamic dimensions. Also remove the env variable.
| // Cross-platform env helpers (Windows has _putenv_s instead of setenv). | ||
| namespace { | ||
| inline int set_env(const char* name, const char* value) | ||
| { | ||
| #ifdef _WIN32 | ||
| return _putenv_s(name, value); | ||
| #else | ||
| return ::setenv(name, value, /*overwrite=*/1); | ||
| #endif | ||
| } | ||
| inline int unset_env(const char* name) | ||
| { | ||
| #ifdef _WIN32 | ||
| return _putenv_s(name, ""); | ||
| #else | ||
| return ::unsetenv(name); | ||
| #endif | ||
| } | ||
| } // namespace | ||
|
|
||
| namespace { | ||
| // RAII guard that turns MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS on for the | ||
| // lifetime of the guard and restores the prior value on destruction. Lets a | ||
| // single test binary mix bucket-mode and default-mode test cases without | ||
| // polluting the environment across tests. | ||
| struct bucket_env_guard | ||
| { | ||
| static constexpr const char* name = "MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS"; | ||
| std::string prev; | ||
| bool had_prev = false; | ||
| explicit bucket_env_guard(const char* value) | ||
| { | ||
| if(auto* p = std::getenv(name)) | ||
| { | ||
| prev = p; | ||
| had_prev = true; | ||
| } | ||
| set_env(name, value); | ||
| } | ||
| ~bucket_env_guard() | ||
| { | ||
| if(had_prev) | ||
| set_env(name, prev.c_str()); | ||
| else | ||
| unset_env(name); | ||
| } | ||
| }; | ||
| } // namespace |
There was a problem hiding this comment.
Delete all this. Dont unit tests with env variables(or duplicate functionality we already have). These should be parameters to the pass and then passed in when running the tests: run_pass({.dyn_dim_bucket_by_optimals=true})
| // Cross-platform env helpers (Windows has _putenv_s instead of setenv). | |
| namespace { | |
| inline int set_env(const char* name, const char* value) | |
| { | |
| #ifdef _WIN32 | |
| return _putenv_s(name, value); | |
| #else | |
| return ::setenv(name, value, /*overwrite=*/1); | |
| #endif | |
| } | |
| inline int unset_env(const char* name) | |
| { | |
| #ifdef _WIN32 | |
| return _putenv_s(name, ""); | |
| #else | |
| return ::unsetenv(name); | |
| #endif | |
| } | |
| } // namespace | |
| namespace { | |
| // RAII guard that turns MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS on for the | |
| // lifetime of the guard and restores the prior value on destruction. Lets a | |
| // single test binary mix bucket-mode and default-mode test cases without | |
| // polluting the environment across tests. | |
| struct bucket_env_guard | |
| { | |
| static constexpr const char* name = "MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS"; | |
| std::string prev; | |
| bool had_prev = false; | |
| explicit bucket_env_guard(const char* value) | |
| { | |
| if(auto* p = std::getenv(name)) | |
| { | |
| prev = p; | |
| had_prev = true; | |
| } | |
| set_env(name, value); | |
| } | |
| ~bucket_env_guard() | |
| { | |
| if(had_prev) | |
| set_env(name, prev.c_str()); | |
| else | |
| unset_env(name); | |
| } | |
| }; | |
| } // namespace |
For models with a single non-fixed dynamic dimension, today's
split_single_dyn_dim pass enumerates every integer in [min, max] and
stamps out one static-shape submodule per integer, regardless of
dynamic_dimension::optimals. Combined with a per-inference linear scan
inside select_module::compute(), this makes wide [min, max] ranges
(e.g. {1, 5000000, opt:[400000]}) unusable: ~16 days of compile,
~40 GB engine and +80 - 100 us of per-inference host overhead vs a
hand-static engine on MI308X.
This change introduces three layered, opt-in / auto-guarded
optimizations:
1. MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS=1 makes split_single_dyn_dim
honour dynamic_dimension::optimals (one submodule per optimal plus
min/max endpoints) and adds a smallest-compatible-bucket fallback
in select_module::compute() (zero-pads on the ref backend).
Collapses compile time and .mxr size to O(|optimals|).
2. MIGRAPHX_DYN_DIM_FREEZE_TO=<N> adds a new freeze_dyn_dim pass that
runs before split_single_dyn_dim and rewrites every non-fixed
dynamic parameter to a static shape at size N. Downstream sees a
fully-static program -- no submodules, no select_module, no
per-inference dispatch. Matches hand-written-static inference
latency.
3. select_module::compute() gets a per-instance dispatch cache. Hot
loops with stable input shape skip the linear scan and the
per-call get_input_parameter_names / get_parameter_shapes
allocations. Auto-activates when select_module has 4+ submodules;
kill-switch MIGRAPHX_DISABLE_SELECT_MODULE_CACHE=1 for A/B
benchmarks and emergency rollback.
Validated on MI308X (gfx942, ROCm 7.2.0):
- Pointwise compile: 40.5 s (legacy max=256) -> 0.48 s (bucket
max=5M).
- Pointwise inference: 215 us legacy -> 121 us bucket + cache ->
115 us freeze (= true-static control at 115 us).
- All existing unit + ref tests pass, plus 16 new test cases:
4 split_single_dyn_dim, 4 freeze_dyn_dim, 2 ref/freeze_dyn_dim,
6 ref/select_module (incl. cache hot-loop and alternating-shape
regression). Validation ran with cache enabled and with the
kill-switch on; bitwise-identical outputs verified across
frozen / bucket / static engines.
Addresses ROCm#1023 (Dynamic Shape Support umbrella) and
ROCm#4618 (ONNX-Runtime MIGraphX EP recompiling per shape).
57bad58 to
160ff74
Compare
CharlieL7
left a comment
There was a problem hiding this comment.
Please make the docstrings more terse and/or use some type of bullet point format. There are way too many comment paragraphs to read.
| }; | ||
| } | ||
| auto& ctx = any_cast<context>(gctx); | ||
| ctx.set_exhaustive_tune_flag(options.exhaustive_tune); |
There was a problem hiding this comment.
You will need to update this file with respect to the GPU pipeline changes rather than just overwriting back to the previous.
…e kill switch Acts on pfultz2 + CharlieL7 review on ROCm#4911: - Remove `freeze_dyn_dim` pass + tests + docs. The same result is reachable via `parse_onnx(..., default_dyn_dim_value={N,N})` / `dim_params`, which is also more flexible (different static values per dynamic dim). Drops one env var (`MIGRAPHX_DYN_DIM_FREEZE_TO`), one source file, one header, two test files, and the corresponding CHANGELOG / env-vars.rst entries. - Lift `bucket_by_optimals` from an env-var read inside the pass to a field on the `split_single_dyn_dim` struct with `reflect()` support. The unit tests now construct the pass directly with the field (`run_passes(p, {split_single_dyn_dim{.bucket_by_optimals = true}, ...})`) instead of toggling env vars; no env-var state is observed by the pass body. The GPU/Ref targets still expose user opt-in via `MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS`, but it is read once in `target::get_passes()` (cached form) and forwarded as the pass field -- the test suite is now env-var free. - Make `select_module::compute()` bucket fallback always-on. Whether buckets exist in the submodule list is a compile-time decision (controlled by the pass field above); at runtime, if no exact-shape match is found and a smallest-compatible bucket exists, dispatch there. No env-var gate, no cppcheck-suppress. - Drop `MIGRAPHX_DISABLE_SELECT_MODULE_CACHE` + the `compute_legacy` branch + the kill-switch coverage tests. The dispatch cache is unconditionally active and identical to the uncached path on a miss; removing the kill switch lets us drop the cppcheck `UseCachedEnvVar` suppression that pfultz2 flagged. - Update CHANGELOG.md and docs/reference/MIGraphX-dev-env-vars.rst to match the new surface area. Locally verified inside the rocm/ali-private ROCm 7.2.0 build container: - `make -j$(nproc) migraphx` succeeds. - `test_split_single_dyn_dim_test` -- 11/11 pass (4 bucket-mode tests use the new pass field). - `test_ref select_module_*` -- 9/9 pass (the 3 kill-switch tests and the env-off-strict test are gone; bucket round-up + cache hot-loop + alternating-shapes tests still pass).
…e kill switch Acts on pfultz2 + CharlieL7 review on ROCm#4911: - Remove `freeze_dyn_dim` pass + tests + docs. The same result is reachable via `parse_onnx(..., default_dyn_dim_value={N,N})` / `dim_params`, which is also more flexible (different static values per dynamic dim). Drops one env var (`MIGRAPHX_DYN_DIM_FREEZE_TO`), one source file, one header, two test files, and the corresponding CHANGELOG / env-vars.rst entries. - Lift `bucket_by_optimals` from an env-var read inside the pass to a field on the `split_single_dyn_dim` struct with `reflect()` support. The unit tests now construct the pass directly with the field (`run_passes(p, {split_single_dyn_dim{.bucket_by_optimals = true}, ...})`) instead of toggling env vars; no env-var state is observed by the pass body. The GPU/Ref targets still expose user opt-in via `MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS`, but it is read once in `target::get_passes()` (cached form) and forwarded as the pass field -- the test suite is now env-var free. - Make `select_module::compute()` bucket fallback always-on. Whether buckets exist in the submodule list is a compile-time decision (controlled by the pass field above); at runtime, if no exact-shape match is found and a smallest-compatible bucket exists, dispatch there. No env-var gate, no cppcheck-suppress. - Drop `MIGRAPHX_DISABLE_SELECT_MODULE_CACHE` + the `compute_legacy` branch + the kill-switch coverage tests. The dispatch cache is unconditionally active and identical to the uncached path on a miss; removing the kill switch lets us drop the cppcheck `UseCachedEnvVar` suppression that pfultz2 flagged. - Update CHANGELOG.md and docs/reference/MIGraphX-dev-env-vars.rst to match the new surface area. Locally verified inside the rocm/ali-private ROCm 7.2.0 build container: - `make -j$(nproc) migraphx` succeeds. - `test_split_single_dyn_dim_test` -- 11/11 pass (4 bucket-mode tests use the new pass field). - `test_ref select_module_*` -- 9/9 pass (the 3 kill-switch tests and the env-off-strict test are gone; bucket round-up + cache hot-loop + alternating-shapes tests still pass).
Acts on @CharlieL7's two new review points: - "Update gpu/target.cpp with respect to the GPU pipeline changes rather than just overwriting back to the previous": my previous commit had inadvertently dropped the `pipeline_factory` refactor that landed on develop. Reset gpu/target.cpp to upstream/develop and re-apply just my 5-line surgical change: * declare `MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS` alongside the other GPU target env vars * inside `pipeline_factory::dynamic_shapes_pipeline()`, pass `split_single_dyn_dim{.bucket_by_optimals = enabled(...)}` instead of the default-constructed pass. Net diff vs upstream/develop is now just those 5 lines + 1 line. - "Please make the docstrings more terse and/or use some type of bullet point format": condensed comments across all changed files -- multi-paragraph rationales are now 1-2 lines or bullet lists. No comment-only edits change behaviour; tests still pass. Verified in the rocm/ali-private ROCm 7.2.0 container: * `make migraphx` -> ok * `test_split_single_dyn_dim_test` -> 11/11 pass * `test_ref "select_module_*"` -> 9/9 pass
…target.cpp
* split_single_dyn_dim.hpp: copyright was still 2024 -> bump to 2026
to satisfy `tools/check_stamped.py` (licensing CI).
* gpu/target.cpp: apply the 2-line clang-format diff that the CI
`Check formatting` step produced (line-break placement around
`split_single_dyn_dim{.bucket_by_optimals = enabled(...)}`).
e1d4dfb to
18ddc81
Compare
For models with a single non-fixed dynamic dimension, today's split_single_dyn_dim pass enumerates every integer in [min, max] and stamps out one static-shape submodule per integer, regardless of dynamic_dimension::optimals. Combined with a per-inference linear scan inside select_module::compute(), this makes wide [min, max] ranges (e.g. {1, 5000000, opt:[400000]}) unusable: ~16 days of compile, ~40 GB engine and +80 - 100 us of per-inference host overhead vs a hand-static engine on MI308X.
This change introduces three layered, opt-in / auto-guarded optimizations:
MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS=1 makes split_single_dyn_dim honour dynamic_dimension::optimals (one submodule per optimal plus min/max endpoints) and adds a smallest-compatible-bucket fallback in select_module::compute() (zero-pads on the ref backend). Collapses compile time and .mxr size to O(|optimals|).
MIGRAPHX_DYN_DIM_FREEZE_TO= adds a new freeze_dyn_dim pass that runs before split_single_dyn_dim and rewrites every non-fixed dynamic parameter to a static shape at size N. Downstream sees a fully-static program -- no submodules, no select_module, no per-inference dispatch. Matches hand-written-static inference latency.
select_module::compute() gets a per-instance dispatch cache. Hot loops with stable input shape skip the linear scan and the per-call get_input_parameter_names / get_parameter_shapes allocations. Auto-activates when select_module has 4+ submodules; kill-switch MIGRAPHX_DISABLE_SELECT_MODULE_CACHE=1 for A/B benchmarks and emergency rollback.
Validated on MI308X (gfx942, ROCm 7.2.0):
115 us freeze (= true-static control at 115 us).
Addresses #1023 (Dynamic Shape Support umbrella) and #4618 (ONNX-Runtime MIGraphX EP recompiling per shape).
Motivation
Technical Details
Changelog Category
Add a
CHANGELOG.mdentry for any option other thanNot Applicable