Skip to content

Reduce dynamic-shape compile cost and select_module dispatch overhead#4911

Open
chun-wan wants to merge 4 commits into
ROCm:developfrom
chun-wan:feat/dyn-dim-freeze-bucket-cache
Open

Reduce dynamic-shape compile cost and select_module dispatch overhead#4911
chun-wan wants to merge 4 commits into
ROCm:developfrom
chun-wan:feat/dyn-dim-freeze-bucket-cache

Conversation

@chun-wan
Copy link
Copy Markdown

For models with a single non-fixed dynamic dimension, today's split_single_dyn_dim pass enumerates every integer in [min, max] and stamps out one static-shape submodule per integer, regardless of dynamic_dimension::optimals. Combined with a per-inference linear scan inside select_module::compute(), this makes wide [min, max] ranges (e.g. {1, 5000000, opt:[400000]}) unusable: ~16 days of compile, ~40 GB engine and +80 - 100 us of per-inference host overhead vs a hand-static engine on MI308X.

This change introduces three layered, opt-in / auto-guarded optimizations:

  1. MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS=1 makes split_single_dyn_dim honour dynamic_dimension::optimals (one submodule per optimal plus min/max endpoints) and adds a smallest-compatible-bucket fallback in select_module::compute() (zero-pads on the ref backend). Collapses compile time and .mxr size to O(|optimals|).

  2. MIGRAPHX_DYN_DIM_FREEZE_TO= adds a new freeze_dyn_dim pass that runs before split_single_dyn_dim and rewrites every non-fixed dynamic parameter to a static shape at size N. Downstream sees a fully-static program -- no submodules, no select_module, no per-inference dispatch. Matches hand-written-static inference latency.

  3. select_module::compute() gets a per-instance dispatch cache. Hot loops with stable input shape skip the linear scan and the per-call get_input_parameter_names / get_parameter_shapes allocations. Auto-activates when select_module has 4+ submodules; kill-switch MIGRAPHX_DISABLE_SELECT_MODULE_CACHE=1 for A/B benchmarks and emergency rollback.

Validated on MI308X (gfx942, ROCm 7.2.0):

  • Pointwise compile: 40.5 s (legacy max=256) -> 0.48 s (bucket max=5M).
  • Pointwise inference: 215 us legacy -> 121 us bucket + cache ->
    115 us freeze (= true-static control at 115 us).
  • All existing unit + ref tests pass, plus 16 new test cases: 4 split_single_dyn_dim, 4 freeze_dyn_dim, 2 ref/freeze_dyn_dim, 6 ref/select_module (incl. cache hot-loop and alternating-shape regression). Validation ran with cache enabled and with the kill-switch on; bitwise-identical outputs verified across frozen / bucket / static engines.

Addresses #1023 (Dynamic Shape Support umbrella) and #4618 (ONNX-Runtime MIGraphX EP recompiling per shape).

Motivation

Technical Details

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@chun-wan chun-wan requested review from a team and causten as code owners May 26, 2026 10:13
@chun-wan chun-wan force-pushed the feat/dyn-dim-freeze-bucket-cache branch 13 times, most recently from 783fe62 to b8cb3ec Compare May 26, 2026 16:32
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 96.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/include/migraphx/op/select_module.hpp 95.38% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4911      +/-   ##
===========================================
+ Coverage    92.66%   92.68%   +0.02%     
===========================================
  Files          588      588              
  Lines        30412    30472      +60     
===========================================
+ Hits         28180    28242      +62     
+ Misses        2232     2230       -2     
Files with missing lines Coverage Δ
src/include/migraphx/split_single_dyn_dim.hpp 100.00% <ø> (ø)
src/split_single_dyn_dim.cpp 98.65% <100.00%> (+0.19%) ⬆️
src/targets/ref/target.cpp 100.00% <ø> (ø)
src/include/migraphx/op/select_module.hpp 95.15% <95.38%> (+8.61%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chun-wan chun-wan force-pushed the feat/dyn-dim-freeze-bucket-cache branch 4 times, most recently from 23ab503 to 57bad58 Compare May 26, 2026 19:40
@pfultz2
Copy link
Copy Markdown
Collaborator

pfultz2 commented May 26, 2026

MIGRAPHX_DYN_DIM_FREEZE_TO= adds a new freeze_dyn_dim pass that runs before split_single_dyn_dim and rewrites every non-fixed dynamic parameter to a static shape at size N. Downstream sees a fully-static program -- no submodules, no select_module, no per-inference dispatch. Matches hand-written-static inference latency.

Is this padding the dimension to the frozen dim? What happens when the dim is larger than the static size?

// on the hot path. When set, the entire cache infrastructure
// is bypassed -- no read, no populate -- so the measured cost
// matches the pre-cache implementation.
// cppcheck-suppress migraphx-UseCachedEnvVar
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont suppress this cppcheck warning.

Comment thread src/freeze_dyn_dim.cpp Outdated
// string form so tests (and runtime) can flip the value between
// compile passes within a single process without hitting a static
// cache. N=0 disables the pass entirely (the default).
const std::size_t n = value_of(MIGRAPHX_DYN_DIM_FREEZE_TO::value());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont use ::value.

Comment thread src/freeze_dyn_dim.cpp Outdated
auto new_param = mm->add_parameter(name, static_shape);
mm->replace_instruction(stand_param, new_param);
mm->remove_instruction(stand_param);
any_replaced = true;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can already be achieved by setting the dimensions to a static size when parsing onnx. This pass should be removed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that method is also more customizable such that multiple different static dimensions can be set for different dynamic dimensions. Also remove the env variable.

Comment thread test/split_single_dyn_dim_test.cpp Outdated
Comment on lines +39 to +86
// Cross-platform env helpers (Windows has _putenv_s instead of setenv).
namespace {
inline int set_env(const char* name, const char* value)
{
#ifdef _WIN32
return _putenv_s(name, value);
#else
return ::setenv(name, value, /*overwrite=*/1);
#endif
}
inline int unset_env(const char* name)
{
#ifdef _WIN32
return _putenv_s(name, "");
#else
return ::unsetenv(name);
#endif
}
} // namespace

namespace {
// RAII guard that turns MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS on for the
// lifetime of the guard and restores the prior value on destruction. Lets a
// single test binary mix bucket-mode and default-mode test cases without
// polluting the environment across tests.
struct bucket_env_guard
{
static constexpr const char* name = "MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS";
std::string prev;
bool had_prev = false;
explicit bucket_env_guard(const char* value)
{
if(auto* p = std::getenv(name))
{
prev = p;
had_prev = true;
}
set_env(name, value);
}
~bucket_env_guard()
{
if(had_prev)
set_env(name, prev.c_str());
else
unset_env(name);
}
};
} // namespace
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete all this. Dont unit tests with env variables(or duplicate functionality we already have). These should be parameters to the pass and then passed in when running the tests: run_pass({.dyn_dim_bucket_by_optimals=true})

Suggested change
// Cross-platform env helpers (Windows has _putenv_s instead of setenv).
namespace {
inline int set_env(const char* name, const char* value)
{
#ifdef _WIN32
return _putenv_s(name, value);
#else
return ::setenv(name, value, /*overwrite=*/1);
#endif
}
inline int unset_env(const char* name)
{
#ifdef _WIN32
return _putenv_s(name, "");
#else
return ::unsetenv(name);
#endif
}
} // namespace
namespace {
// RAII guard that turns MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS on for the
// lifetime of the guard and restores the prior value on destruction. Lets a
// single test binary mix bucket-mode and default-mode test cases without
// polluting the environment across tests.
struct bucket_env_guard
{
static constexpr const char* name = "MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS";
std::string prev;
bool had_prev = false;
explicit bucket_env_guard(const char* value)
{
if(auto* p = std::getenv(name))
{
prev = p;
had_prev = true;
}
set_env(name, value);
}
~bucket_env_guard()
{
if(had_prev)
set_env(name, prev.c_str());
else
unset_env(name);
}
};
} // namespace

For models with a single non-fixed dynamic dimension, today's
split_single_dyn_dim pass enumerates every integer in [min, max] and
stamps out one static-shape submodule per integer, regardless of
dynamic_dimension::optimals. Combined with a per-inference linear scan
inside select_module::compute(), this makes wide [min, max] ranges
(e.g. {1, 5000000, opt:[400000]}) unusable: ~16 days of compile,
~40 GB engine and +80 - 100 us of per-inference host overhead vs a
hand-static engine on MI308X.

This change introduces three layered, opt-in / auto-guarded
optimizations:

  1. MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS=1 makes split_single_dyn_dim
     honour dynamic_dimension::optimals (one submodule per optimal plus
     min/max endpoints) and adds a smallest-compatible-bucket fallback
     in select_module::compute() (zero-pads on the ref backend).
     Collapses compile time and .mxr size to O(|optimals|).

  2. MIGRAPHX_DYN_DIM_FREEZE_TO=<N> adds a new freeze_dyn_dim pass that
     runs before split_single_dyn_dim and rewrites every non-fixed
     dynamic parameter to a static shape at size N. Downstream sees a
     fully-static program -- no submodules, no select_module, no
     per-inference dispatch. Matches hand-written-static inference
     latency.

  3. select_module::compute() gets a per-instance dispatch cache. Hot
     loops with stable input shape skip the linear scan and the
     per-call get_input_parameter_names / get_parameter_shapes
     allocations. Auto-activates when select_module has 4+ submodules;
     kill-switch MIGRAPHX_DISABLE_SELECT_MODULE_CACHE=1 for A/B
     benchmarks and emergency rollback.

Validated on MI308X (gfx942, ROCm 7.2.0):

  - Pointwise compile: 40.5 s (legacy max=256) -> 0.48 s (bucket
    max=5M).
  - Pointwise inference: 215 us legacy -> 121 us bucket + cache ->
    115 us freeze (= true-static control at 115 us).
  - All existing unit + ref tests pass, plus 16 new test cases:
    4 split_single_dyn_dim, 4 freeze_dyn_dim, 2 ref/freeze_dyn_dim,
    6 ref/select_module (incl. cache hot-loop and alternating-shape
    regression). Validation ran with cache enabled and with the
    kill-switch on; bitwise-identical outputs verified across
    frozen / bucket / static engines.

Addresses ROCm#1023 (Dynamic Shape Support umbrella) and
ROCm#4618 (ONNX-Runtime MIGraphX EP recompiling per shape).
@chun-wan chun-wan force-pushed the feat/dyn-dim-freeze-bucket-cache branch from 57bad58 to 160ff74 Compare May 27, 2026 02:00
Copy link
Copy Markdown
Collaborator

@CharlieL7 CharlieL7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the docstrings more terse and/or use some type of bullet point format. There are way too many comment paragraphs to read.

Comment thread src/targets/gpu/target.cpp Outdated
};
}
auto& ctx = any_cast<context>(gctx);
ctx.set_exhaustive_tune_flag(options.exhaustive_tune);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to update this file with respect to the GPU pipeline changes rather than just overwriting back to the previous.

chun-wan pushed a commit to chun-wan/AMDMIGraphX that referenced this pull request May 27, 2026
…e kill switch

Acts on pfultz2 + CharlieL7 review on ROCm#4911:

- Remove `freeze_dyn_dim` pass + tests + docs.  The same result is
  reachable via `parse_onnx(..., default_dyn_dim_value={N,N})` /
  `dim_params`, which is also more flexible (different static values
  per dynamic dim).  Drops one env var (`MIGRAPHX_DYN_DIM_FREEZE_TO`),
  one source file, one header, two test files, and the corresponding
  CHANGELOG / env-vars.rst entries.

- Lift `bucket_by_optimals` from an env-var read inside the pass to a
  field on the `split_single_dyn_dim` struct with `reflect()` support.
  The unit tests now construct the pass directly with the field
  (`run_passes(p, {split_single_dyn_dim{.bucket_by_optimals = true}, ...})`)
  instead of toggling env vars; no env-var state is observed by the
  pass body.  The GPU/Ref targets still expose user opt-in via
  `MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS`, but it is read once in
  `target::get_passes()` (cached form) and forwarded as the pass
  field -- the test suite is now env-var free.

- Make `select_module::compute()` bucket fallback always-on.  Whether
  buckets exist in the submodule list is a compile-time decision
  (controlled by the pass field above); at runtime, if no exact-shape
  match is found and a smallest-compatible bucket exists, dispatch
  there.  No env-var gate, no cppcheck-suppress.

- Drop `MIGRAPHX_DISABLE_SELECT_MODULE_CACHE` + the `compute_legacy`
  branch + the kill-switch coverage tests.  The dispatch cache is
  unconditionally active and identical to the uncached path on a
  miss; removing the kill switch lets us drop the cppcheck
  `UseCachedEnvVar` suppression that pfultz2 flagged.

- Update CHANGELOG.md and docs/reference/MIGraphX-dev-env-vars.rst
  to match the new surface area.

Locally verified inside the rocm/ali-private ROCm 7.2.0 build container:
  - `make -j$(nproc) migraphx` succeeds.
  - `test_split_single_dyn_dim_test` -- 11/11 pass (4 bucket-mode
    tests use the new pass field).
  - `test_ref select_module_*` -- 9/9 pass (the 3 kill-switch tests
    and the env-off-strict test are gone; bucket round-up + cache
    hot-loop + alternating-shapes tests still pass).
chun-wan added 3 commits May 28, 2026 02:06
…e kill switch

Acts on pfultz2 + CharlieL7 review on ROCm#4911:

- Remove `freeze_dyn_dim` pass + tests + docs.  The same result is
  reachable via `parse_onnx(..., default_dyn_dim_value={N,N})` /
  `dim_params`, which is also more flexible (different static values
  per dynamic dim).  Drops one env var (`MIGRAPHX_DYN_DIM_FREEZE_TO`),
  one source file, one header, two test files, and the corresponding
  CHANGELOG / env-vars.rst entries.

- Lift `bucket_by_optimals` from an env-var read inside the pass to a
  field on the `split_single_dyn_dim` struct with `reflect()` support.
  The unit tests now construct the pass directly with the field
  (`run_passes(p, {split_single_dyn_dim{.bucket_by_optimals = true}, ...})`)
  instead of toggling env vars; no env-var state is observed by the
  pass body.  The GPU/Ref targets still expose user opt-in via
  `MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS`, but it is read once in
  `target::get_passes()` (cached form) and forwarded as the pass
  field -- the test suite is now env-var free.

- Make `select_module::compute()` bucket fallback always-on.  Whether
  buckets exist in the submodule list is a compile-time decision
  (controlled by the pass field above); at runtime, if no exact-shape
  match is found and a smallest-compatible bucket exists, dispatch
  there.  No env-var gate, no cppcheck-suppress.

- Drop `MIGRAPHX_DISABLE_SELECT_MODULE_CACHE` + the `compute_legacy`
  branch + the kill-switch coverage tests.  The dispatch cache is
  unconditionally active and identical to the uncached path on a
  miss; removing the kill switch lets us drop the cppcheck
  `UseCachedEnvVar` suppression that pfultz2 flagged.

- Update CHANGELOG.md and docs/reference/MIGraphX-dev-env-vars.rst
  to match the new surface area.

Locally verified inside the rocm/ali-private ROCm 7.2.0 build container:
  - `make -j$(nproc) migraphx` succeeds.
  - `test_split_single_dyn_dim_test` -- 11/11 pass (4 bucket-mode
    tests use the new pass field).
  - `test_ref select_module_*` -- 9/9 pass (the 3 kill-switch tests
    and the env-off-strict test are gone; bucket round-up + cache
    hot-loop + alternating-shapes tests still pass).
Acts on @CharlieL7's two new review points:

- "Update gpu/target.cpp with respect to the GPU pipeline changes
  rather than just overwriting back to the previous": my previous
  commit had inadvertently dropped the `pipeline_factory` refactor
  that landed on develop.  Reset gpu/target.cpp to upstream/develop
  and re-apply just my 5-line surgical change:
    * declare `MIGRAPHX_DYN_DIM_BUCKET_BY_OPTIMALS` alongside the
      other GPU target env vars
    * inside `pipeline_factory::dynamic_shapes_pipeline()`, pass
      `split_single_dyn_dim{.bucket_by_optimals = enabled(...)}`
      instead of the default-constructed pass.
  Net diff vs upstream/develop is now just those 5 lines + 1 line.

- "Please make the docstrings more terse and/or use some type of
  bullet point format": condensed comments across all changed
  files -- multi-paragraph rationales are now 1-2 lines or bullet
  lists.  No comment-only edits change behaviour; tests still pass.

Verified in the rocm/ali-private ROCm 7.2.0 container:
  * `make migraphx`                      -> ok
  * `test_split_single_dyn_dim_test`     -> 11/11 pass
  * `test_ref "select_module_*"`         -> 9/9 pass
…target.cpp

* split_single_dyn_dim.hpp: copyright was still 2024 -> bump to 2026
  to satisfy `tools/check_stamped.py` (licensing CI).
* gpu/target.cpp: apply the 2-line clang-format diff that the CI
  `Check formatting` step produced (line-break placement around
  `split_single_dyn_dim{.bucket_by_optimals = enabled(...)}`).
@chun-wan chun-wan force-pushed the feat/dyn-dim-freeze-bucket-cache branch from e1d4dfb to 18ddc81 Compare May 27, 2026 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants