Extend problem cache with hardware provenance metadata by danieyan-amd · Pull Request #4835 · ROCm/AMDMIGraphX

danieyan-amd · 2026-04-30T19:43:40Z

Two changes to problem_cache.cpp:

1. load(): Project deserialized keys to only {name, problem} so that extra metadata fields in the JSON don't break cache key matching. Previously, the full JSON object (all fields) was used as the map key, causing 100% cache misses when metadata was present.
2. save(): Enrich each key with hardware provenance before writing: gpu_arch, cu_count, graphics_clock_mhz, memory_clock_mhz, memory_bus_bits, vram_bytes, wavefront_size, regs_per_block, max_threads_per_cu. These are queried once via hipGetDeviceProperties at session end, so the performance cost is negligible.

The in-memory map always uses {name, problem} keys for O(1) lookups. The on-disk JSON carries additional hardware context for traceability. On load, the extra fields are projected away, preserving fast matching.
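
The projection described above can be sketched with standard containers only. This is a minimal illustration, not the code in problem_cache.cpp: `json_object`, `project_key`, and `load_projected` are hypothetical names, and a `std::map<std::string, std::string>` stands in for the `value` type the real cache uses.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a deserialized JSON object (the real code uses `value`).
using json_object = std::map<std::string, std::string>;

// Keep only the fields that identify a problem and drop provenance metadata.
// Throws std::out_of_range if "name" or "problem" is missing.
json_object project_key(const json_object& on_disk_key)
{
    json_object projected;
    projected["name"]    = on_disk_key.at("name");
    projected["problem"] = on_disk_key.at("problem");
    return projected;
}

// Rebuild the in-memory cache so that lookups match on {name, problem} only.
std::map<json_object, std::string> load_projected(const std::map<json_object, std::string>& raw)
{
    std::map<json_object, std::string> cache;
    for(const auto& [key, solution] : raw)
        cache[project_key(key)] = solution;
    return cache;
}
```

With this shape, an entry stored on disk with extra fields such as gpu_arch is still found by a lookup key that carries only name and problem.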

Motivation

Add hardware information to the problem cache, and handle the extra hardware data when doing cache lookups for solutions.

Technical Details

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

- Added: New functionality.
- Changed: Changes to existing functionality.
- Removed: Functionality or support that has been removed. (Compared to a previous release)
- Optimized: Component performance that has been optimized or improved.
- Resolved Issues: Known issues from a previous version that have been resolved.
- Not Applicable: This PR is not to be included in the changelog.


danieyan-amd · 2026-04-30T19:45:06Z

Sorry Chris, I didn't mean to hit ready for review.

Copilot

Pull request overview

This PR updates the GPU problem cache persistence format to remain resilient to extra on-disk metadata while also recording hardware provenance for traceability.

Changes:

In load(), deserialize into a temporary map and project keys down to {name, problem} to prevent metadata fields from breaking cache-key matching.
In save(), enrich persisted keys with HIP device properties (e.g., arch, CU count, clocks, VRAM) before writing the JSON file.

+    // Enrich keys with hardware provenance metadata on write.
+    // This runs once at session end — negligible cost.
+    hipDeviceProp_t props{};
+    auto status = hipGetDeviceProperties(&props, get_device_id());


danieyan-amd · 2026-05-06T17:40:28Z

+    std::unordered_map<value, value> raw;
+    from_value(from_json_string(read_string(pc_path)), raw);
+    for(auto& [k, v] : raw)
+    {
+        auto projected = create_key(k.at("name").to<std::string>(), k.at("problem"));
+        cache[projected] = v;
+    }


@copilot apply changes based on this feedback

codecov · 2026-04-30T19:56:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4835      +/-   ##
===========================================
- Coverage    92.86%   92.66%   -0.20%     
===========================================
  Files          586      588       +2     
  Lines        30287    30412     +125     
===========================================
+ Hits         28126    28180      +54     
- Misses        2161     2232      +71

see 15 files with indirect coverage changes


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

pfultz2 · 2026-05-08T15:42:51Z

+    {
+        auto projected = create_key(k.at("name").to<std::string>(), k.at("problem"));
+        cache[projected] = v;
+    }


Making an extra copy can get slow with larger problem caches.
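
One possible way to reduce that cost, sketched here under assumptions and not something proposed in this thread, is to move each stored solution out of the temporary map instead of copying it. The names `key_fields`, `solution`, and `project_by_move` are hypothetical, and standard containers stand in for the `value` type; whether the real type moves cheaply is not confirmed here.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the cache key and the stored solution.
using key_fields = std::map<std::string, std::string>;
using solution   = std::vector<std::string>;

// Consume `raw`: each solution is moved into the projected cache, so only the
// two small key fields are copied per entry.
std::map<key_fields, solution> project_by_move(std::map<key_fields, solution>&& raw)
{
    std::map<key_fields, solution> cache;
    for(auto& [key, sol] : raw)
    {
        key_fields projected;
        projected["name"]    = key.at("name");
        projected["problem"] = key.at("problem");
        cache.insert_or_assign(std::move(projected), std::move(sol));
    }
    raw.clear(); // the moved-from solutions are no longer meaningful
    return cache;
}
```

The temporary map is still built once during deserialization; this only avoids the second deep copy of every stored solution.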

pfultz2 · 2026-05-08T15:46:57Z

I think the metadata should be managed externally. In the future, we may use SQLite databases to manage problem caches, and inserting metadata like this may not be efficient there.

tperry-amd · 2026-05-18T20:24:36Z

> I think the metadata should be managed externally. In the future, we may use SQLite databases to manage problem caches, and inserting metadata like this may not be efficient there.

Do you mind if we move forward with the code as it is and optimize if it becomes a problem? We just want to incrementally improve the code and build up the database and caching logic. If it becomes a performance issue, we will optimize it since we are focusing on time-to-first-inference right now.

pfultz2 · 2026-05-19T13:30:43Z

> Do you mind if we move forward with the code as it is and optimize if it becomes a problem?

I don't understand why this is needed. I would like to keep the code as simple as possible so it is easier to update with newer features in the future such as multi-targets, multi-file, sqlite, etc.

tperry-amd · 2026-05-19T15:45:55Z

> I don't understand why this is needed. I would like to keep the code as simple as possible so it is easier to update with newer features in the future such as multi-targets, multi-file, sqlite, etc.

It's needed so that we can create problem cache databases that span different GPUs and GPU configurations. It will start off simple by selecting a single matching problem if there are multiple different matching hardware configurations, but we plan to give the algorithm better selection logic in the future as our data improves. This is so that we can start collecting the data and work on building the database.

Our next update will actually be adding multi-targets and sqlite support. Daniel has also been profiling different backend support and will be providing an abstraction interface to allow interchangeable backends for logging problem cache data.

We need this change so we don't need to redo all the data collection work in the future when we want to select a better solution for a hardware configuration that changes but keeps the same gfx arch.

pfultz2 · 2026-05-19T16:19:32Z

> It's needed so that we can create problem cache databases that span different GPUs and GPU configurations.

Well this is not the way we would approach this. We would make the device a key. So instead of it being std::unordered_map<value, value> it would be std::unordered_map<std::string, std::unordered_map<value, value>> or std::unordered_map<value, std::unordered_map<value, value>>.

> will be providing an abstraction interface

Make sure to use type erasure instead of inheritance for this.
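
A minimal sketch of the nested-map layout suggested above, assuming string keys for both the device and the problem. `problem_map`, `device_cache`, and `find_solution` are hypothetical names, and strings stand in for the `value` type.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Inner map: problem key -> solution. Outer map: device key -> inner map.
using problem_map  = std::unordered_map<std::string, std::string>;
using device_cache = std::unordered_map<std::string, problem_map>;

// Look up a solution for one device; returns std::nullopt when either the
// device or the problem is not present.
std::optional<std::string> find_solution(const device_cache& cache,
                                         const std::string& device,
                                         const std::string& problem)
{
    auto dev_it = cache.find(device);
    if(dev_it == cache.end())
        return std::nullopt;
    auto prob_it = dev_it->second.find(problem);
    if(prob_it == dev_it->second.end())
        return std::nullopt;
    return prob_it->second;
}
```

In this layout the per-problem key stays free of hardware fields, and the same problem can hold a different solution under each device.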

danieyan-amd · 2026-05-19T16:24:08Z

> > It's needed so that we can create problem cache databases that span different GPUs and GPU configurations.
>
> Well this is not the way we would approach this. We would make the device a key. So instead of it being std::unordered_map<value, value> it would be std::unordered_map<std::string, std::unordered_map<value, value>> or std::unordered_map<value, std::unordered_map<value, value>>.
>
> > will be providing an abstraction interface
>
> Make sure to use type erasure instead of inheritance for this.

I can look into this and make the necessary changes.

Addresses PR review feedback:
- Device (gpu_arch|cu_count|wavefront_size) used as composite cache key
- Type-erased problem_cache_backend wrapper (no virtual inheritance)
- JSON backend as default implementation
- load()/save() in problem_cache rewritten to use backend abstraction
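
For readers unfamiliar with the pattern, here is a minimal sketch of a type-erased backend wrapper together with a composite device key. It is an illustration under assumptions, not the code in this PR: `cache_backend`, `in_memory_backend`, `make_device_key`, and the `put`/`has` interface are hypothetical, and only the `concept_t` name and the `do_*` method prefix echo the commit messages.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Composite device key of the form "gpu_arch|cu_count|wavefront_size".
std::string make_device_key(const std::string& gpu_arch, int cu_count, int wavefront_size)
{
    return gpu_arch + "|" + std::to_string(cu_count) + "|" + std::to_string(wavefront_size);
}

// Type-erased wrapper: any type with put() and has() can be stored without
// deriving from a public base class.
class cache_backend
{
    struct concept_t
    {
        virtual ~concept_t() = default;
        virtual void do_put(const std::string& key, const std::string& v) = 0;
        virtual bool do_has(const std::string& key) const                 = 0;
    };

    template <class T>
    struct model_t final : concept_t
    {
        T impl;
        explicit model_t(T x) : impl(std::move(x)) {}
        void do_put(const std::string& key, const std::string& v) override { impl.put(key, v); }
        bool do_has(const std::string& key) const override { return impl.has(key); }
    };

    std::unique_ptr<concept_t> self;

    public:
    // The backend object is moved into heap storage owned by the wrapper.
    template <class T>
    cache_backend(T impl) : self(std::make_unique<model_t<T>>(std::move(impl)))
    {
    }

    void put(const std::string& key, const std::string& v) { self->do_put(key, v); }
    bool has(const std::string& key) const { return self->do_has(key); }
};

// Hypothetical backend with no base class; a JSON or SQLite backend would
// expose the same two member functions.
struct in_memory_backend
{
    std::map<std::string, std::string> entries;
    void put(const std::string& key, const std::string& v) { entries[key] = v; }
    bool has(const std::string& key) const { return entries.count(key) > 0; }
};
```

Here the virtual functions are a private implementation detail of the wrapper, so backend authors never inherit from anything, and the wrapper owns its backend by value semantics through `std::unique_ptr`.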

danieyan-amd

Any feedback on the changes to use the device key? @pfultz2

Fixes licensing CI failure detected by tools/check_stamped.py. Supersedes PR ROCm#4915 (one-line year bump) by folding the fix into this PR.

- problem_cache_backend.hpp: rename trailing-underscore virtual methods to do_* prefix (readability-identifier-naming); replace && / ! with and / not (UseNamedLogicOperator); cast bitwise shift literals to unsigned (hicpp-signed-bitwise); add rule-of-five defaults to concept_t
- problem_cache.cpp: replace ! with not on flagged lines; declare local backend as const auto& in has() and get() (cppcheck readability)
- json_cache_backend.cpp: replace && / ! with and / not on flagged lines; declare loop variable e as const auto& in load_entries()
- All three files re-run through clang-format (ROCm clang-format 21)

- problem_cache_backend.cpp: replace ! and && with not / and in fallback factory
- json_cache_backend.hpp: cast bitwise shift literals to unsigned (hicpp-signed-bitwise)
- All 3 files re-run through clang-format (ROCm clang-format 21)
- Verified all 6 PR4835 files pass clang-format --dry-run -Werror

…k FPs
- src/targets/gpu/CMakeLists.txt: register problem_cache_backend.cpp and json_cache_backend.cpp in add_library(migraphx_gpu ...). Without this, the linker fails with undefined references to make_default_cache_backend / make_cache_backend / make_cache_backend_with_fallback, which is the root cause of the 5 failing Jenkins build gates (All Targets Release, HIP Clang Release, Navi4x, RTC Debug, MLIR Debug).
- src/targets/gpu/problem_cache_backend.cpp: add 'cppcheck-suppress returnDanglingLifetime' before the three return statements in make_cache_backend / make_default_cache_backend / make_cache_backend_with_fallback. These are inconclusive false positives: the temporary json_cache_backend{} is sunk into a std::unique_ptr<concept_t> by problem_cache_backend's templated constructor (canonical type-erasure pattern), so there is no dangling reference. The suppressions silence the cppcheck CI gate.

Locally verified with clang-format 21 dry-run -Werror (exit 0) and cppcheck 2.20.0 using the project's CMakeLists suppress list (exit 0) on all 6 PR4835-introduced cache files.

danieyan-amd marked this pull request as ready for review April 30, 2026 19:44

danieyan-amd requested a review from causten as a code owner April 30, 2026 19:44

Copilot AI review requested due to automatic review settings April 30, 2026 19:44

danieyan-amd marked this pull request as draft April 30, 2026 19:44

Copilot started reviewing on behalf of danieyan-amd April 30, 2026 19:45

Copilot AI reviewed Apr 30, 2026


Potential fix for pull request finding

728917c


danieyan-amd marked this pull request as ready for review May 7, 2026 19:34

Merge branch 'develop' into feature/problem-cache-schema-extension

1c20ada

pfultz2 reviewed May 8, 2026


Merge branch 'develop' into feature/problem-cache-schema-extension

8022df0

danieyan-amd force-pushed the feature/problem-cache-schema-extension branch from 4b65154 to 3a47ddc Compare May 20, 2026 14:51

danieyan-amd commented May 25, 2026


Machine Learning Administrator added 4 commits May 27, 2026 15:58

Bump license year to 2026 in problem_cache.hpp

ef88503


danieyan-amd wants to merge 9 commits into ROCm:develop from danieyan-amd:feature/problem-cache-schema-extension

4 participants
