Skip to content

MI300A calibration: fp32_throughput CI vs real-hardware ground truth#271

Open
syifan wants to merge 17 commits into
mainfrom
mi300a-calibration
Open

MI300A calibration: fp32_throughput CI vs real-hardware ground truth#271
syifan wants to merge 17 commits into
mainfrom
mi300a-calibration

Conversation

@syifan

@syifan syifan commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Adds the first piece of MI300A timing-model calibration: a CI workflow that runs the simulator on the fp32_throughput microbenchmark and compares simulated GFLOPS against real-hardware ground truth.

Stacked on #269 (base = mi300a-bench-port, which holds the microbench code this depends on). Retarget to main once #269 merges.

What's here

  • .github/workflows/mi300a_calibration.yml — runs the sweep → compares → publishes a markdown report both as a downloadable artifact and rendered in the job summary ($GITHUB_STEP_SUMMARY). Self-hosted runner, 60-min timeout.
  • gpu_perf_scripts/calibration/
    • run_sim_sweep.sh — builds amd/samples/fp32_throughput, sweeps configs under -arch cdna3 -gpu mi300a -timing, extracts each kernel_time from the mgpusim_metrics SQLite table.
    • compare_to_ground_truth.py — derives GFLOPS, computes absolute error (sim − real) + MAE, fills the template.
    • summarize_ground_truth.py — regenerates the ground-truth CSV from mi300a_ground_truth.db.
    • report_template.md — editable report template.
    • mi300a_ground_truth.csv — committed ground truth (325 fp32 rows).

Why a CSV, not the DB

mi300a_ground_truth.db is ~9 MB (every individual repetition across 7 benchmarks + ~5 MB of indexes; atomic_operations alone is 56%). Calibration only needs the per-config mean over the 7 reps, so it's summarized to a 24 KB CSV. The DB is not committed. (Required a .gitignore negation since the repo globally ignores *.csv.)

Scope / known limitations

  • fp32_throughput only. cache_latency is deferred — it panics in timing mode (Panic: page not found in the Akita MMU page walk) for all parameters; emulation mode works.
  • Trigger: workflow_dispatch only registers once the file is on the default branch, so a path-filtered push trigger for mi300a-calibration is included to run it now. The manual button works after this reaches main.
  • Baseline is uncalibrated: sim fp32 currently runs ~21–22× faster than real HW (MAE ≈ 1,244 GFLOPS) — real HW has a ~0.5 ms fixed kernel-time floor the sim lacks, and the sim runs all blocks in parallel on 120 CUs. The GFLOPS comparison is most meaningful in the throughput-bound regime (large fmas). Closing this gap is the calibration work to follow.

🤖 Generated with Claude Code

syifan and others added 10 commits June 16, 2026 16:13
Update the DOCKER_IMAGE tag from rocm/dev-ubuntu-24.04:7.1.1 to 7.2.4 across
every native/ HIP-compilation Makefile and CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The functional emulator (amd/emu/computeunit.go) already packs CDNA3 (code
object v5) work-item IDs into a single VGPR as v0 = x | y<<10 | z<<20, but the
timing wavefront dispatcher still wrote them GCN3-style to separate registers
(v0=x, v1=y, v2=z). gfx942 kernels read threadIdx.y/.z from packed v0, so under
the timing model every 2D/3D-workgroup kernel saw threadIdx.y/.z == 0 and
produced wrong results -- including the existing amdappsdk/matrixtranspose
benchmark, which failed -verify on mi300a. Mirror the emulator's packing, gated
on CodeObjectV5 so the GCN3 (v2/v3) path is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add CDNA3 functional-emulator support for opcodes the gfx942 compiler emits from
common math/indexing that were previously undecodable or unimplemented:

  - unordered f32 compares v_cmp_ngt/nlt_f32 and family (e32 + e64), from
    expf/logf/sqrtf
  - v_add_f32_e64 (float add carrying abs/neg source modifiers)
  - v_mad_i64_i32 and a v_mad_u64_u32 decode-width fix (64-bit dst/src2), for
    64-bit/size_t index arithmetic
  - v_or3_b32, v_sub_u32 (VOP3A clamp form), s_mul_hi_i32
  - v_ldexp_f32 (inside expf/logf)
  - v_pk_add_f16 with round-to-nearest-even f16<->f32 conversion

Adds amd/emu/cdna3/newopcodes_test.go covering the new ops. Unblocks
polybench_correlation, npb_ep, rodinia_backprop/srad, altis_cfd,
tango_blackscholes, and polybench_3dconv on mi300a. GCN3 is unaffected (new
entries sit at gfx942-only opcodes; verified fir -arch gcn3 still passes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Port HIP compute kernels from sarchlab/gpu_benchmarks (tier-1 microbenchmarks,
PolyBench, Rodinia, Parboil, Tango, NPB, Altis, BabelStream) to run on the
MGPUSim MI300A (gfx942) model. Each port provides:
  - a standalone extern "C" kernel under native/ + a Docker hipcc Makefile,
  - a gfx942 kernels_gfx942.hsaco,
  - a Go driver whose KernelArgs match the compiled kernel's AMDGPU kernarg
    metadata, with a CPU-reference Verify(),
  - a runnable sample under amd/samples/.

~29 of these verify on -timing -gpu mi300a -arch cdna3 -verify. The remainder
compile and run but are blocked by deeper timing-model issues (dependent-load
MMU translation, DrainCommandQueue sync, multi-workgroup dispatch) to be
addressed gradually.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bare `conv2d` pattern in .gitignore (intended for the gpu_perf_scripts/conv2d
binary) also matched amd/benchmarks/polybench/conv2d/, so the package was never
committed and CI's `go build ./...` failed:
  amd/samples/polybench_2dconv/main.go: no required module provides package
  github.com/sarchlab/mgpusim/v5/amd/benchmarks/polybench/conv2d
Force-add the package source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
golangci-lint (funlen/gocognit) flagged the CPU-reference Verify() and data-setup
functions in several ported benchmarks. Annotate them with //nolint:funlen,gocognit
(matching existing usage in amd/emu and amd/benchmarks/amdappsdk) rather than
refactoring verified code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a manual GitHub Action that runs the MI300A (CDNA3) timing model on the
fp32_throughput microbenchmark, derives GFLOPS from the simulated kernel_time,
and compares against real-hardware ground truth using absolute error. The
report is published both as a downloadable artifact and rendered in the job
summary ($GITHUB_STEP_SUMMARY).

Ground truth is summarized from the ~9 MB mi300a_ground_truth.db into a small
committable CSV (gpu_perf_scripts/calibration/mi300a_ground_truth.csv, 325 fp32
rows) via summarize_ground_truth.py, so the DB itself is not committed.

cache_latency is deferred: it panics in timing mode ("page not found" in the
Akita MMU page walk) for all parameters; emulation mode works.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The summarized fp32 ground truth must be tracked for the calibration workflow's
compare step; add a .gitignore negation since the repo globally ignores *.csv.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
workflow_dispatch only works once the file is on the default branch, so add a
path-filtered push trigger for the mi300a-calibration branch to run it now.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef5cc451fc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

tag="fp32_nb${nb}_f${fmas}"
echo " $tag"
kt="$(run_and_extract "$BUILD_DIR/fp32_throughput" "$tag" -num-blocks "$nb" -fmas "$fmas")"
[[ -n "$kt" ]] && echo "fp32_throughput,fmas_per_thread,${fmas},${nb},256,${fmas},,${kt}" >> "$OUT_ABS"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail the sweep when a calibration point is missing

If any simulator invocation crashes or stops producing kernel_time, run_and_extract yields an empty kt and this condition simply omits that calibration row; because the comparison script accepts any number of rows, the workflow can still publish a green report with a partial or even header-only sweep. That makes exactly the failing fp32 configuration invisible in CI instead of flagging the regression, so this should fail the job when an expected point has no extracted time.

Useful? React with 👍 / 👎.

Base automatically changed from mi300a-bench-port to main June 17, 2026 12:55
…ruth-ordered sweep

- fp32 kernel now takes work-group size as an explicit arg (recompiled gfx942)
  so the sweep can vary threads_per_block and match all ground-truth configs.
  blockDim.x was unusable: it reads 0 in the MGPUSim model for multi-block
  launches. Verified correct across block sizes {1,32,64,256,1024} x multi-block.
- Sweep now varies num_blocks x threads_per_block x fmas, bounded by timeouts
  (per-run 10 min, whole-sweep 120 min) instead of a cost heuristic, and runs
  configs ordered by real-hardware execution time (shortest first) so the budget
  completes the cheap-to-simulate configs first.
- Comparator/report gain a threads_per_block column.
- Add plot_calibration.py: one execution-time figure per (num_blocks,
  threads_per_block), x=fmas, y=ms, simulation vs real-hardware lines
  (seaborn/matplotlib); workflow installs deps and uploads figures/.
- Calibration workflow is manual-only (workflow_dispatch), job timeout 150 min.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c908c7f5f0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread amd/emu/cdna3/vop3a.go
}
src0 := math.Float32frombits(uint32(applyF32Modifier(state.ReadOperand(inst.Src0, i), 0, inst)))
src1 := math.Float32frombits(uint32(applyF32Modifier(state.ReadOperand(inst.Src1, i), 1, inst)))
if !(src0 != src1) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve NLG's unordered NaN behavior

When either operand is NaN, Go evaluates src0 != src1 to true, so this condition is false and v_cmp_nlg_f32_e64 does not set the destination bit. NLG is the unordered-or-equal predicate, so NaN operands should pass here (the same issue exists in the new VOPC NLG helper, while LG above has the opposite NaN problem); kernels that branch on this comparison with NaNs will take the wrong path.

Useful? React with 👍 / 👎.

Comment thread amd/emu/cdna3/vop1.go Outdated
Comment on lines +263 to +271
// Subnormal half: normalize.
e := -1
m := mant
for (m & 0x0400) == 0 {
m <<= 1
e--
}
m &= 0x03FF
exp32 := uint32(127-15+e+1) << 23

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fix the half-subnormal exponent

For half-precision subnormals this normalization starts e at -1 and then decrements once per left shift, which makes every subnormal one exponent too small; for example 0x0001 converts to 2^-25 instead of the IEEE value 2^-24. Any v_pk_add_f16 input or result involving tiny half values will therefore be computed with the wrong magnitude.

Useful? React with 👍 / 👎.

syifan and others added 2 commits June 17, 2026 12:38
readFlatAddr decided SADDR vs OFF addressing mode from inst.Addr.RegCount
(RegCount==1 => SADDR), but gfx942 (CDNA3) emits SADDR-form global
load/store whose address VGPR operand still has RegCount==2. Those were
misclassified as OFF mode and the raw VGPR pair was used as a full 64-bit
address, yielding a bogus translation and an MMU "Panic: page not found"
in timing mode (emulation was unaffected -- it uses the correct rule).

Switch to the same test the CDNA3 emulator uses (amd/emu/cdna3/flat.go):
SADDR is active iff an SAddr operand is present and is not the 0x7F "OFF"
sentinel. This unblocks the cache_latency microbenchmark (single-thread
pointer chase over a read buffer) in timing mode; fp32_throughput, which
only uses an OFF-mode store, is unaffected. cu/cdna3 unit tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…port job

Restructure the MI300A calibration workflow from a single job into three:

  fp32_sweep ----\
                  >--> report (needs both; if: always())
  cache_latency --/

Each benchmark sweep is now its own job, so they run in parallel and a
failure in one (build, timeout, infra) cannot block the others. A single
report job runs after both sweeps, downloads their CSV artifacts, and
renders the combined markdown report + figures.

- Add run_cache_latency_sweep.sh: builds the cache_latency sample, sweeps
  array_bytes across the cache hierarchy (L1/L2/DRAM) x num_accesses under
  the CDNA3/MI300A timing config, and records kernel_time + derived
  ns/access. Same per-run / whole-sweep timeout bounding as the fp32 sweep;
  tolerates failed runs (records them, exits 0) so the report always runs.
- compare_to_ground_truth.py: optional --cache-sim input renders a
  sim-only cache_latency section (no committed ground truth yet) and a
  summary row; graceful note when the sweep produced no points.
- report_template.md: add the cache_latency section + tokens.

cache_latency now runs in timing mode (see the preceding coalescer fix);
its results are reported sim-only until real-hardware latencies are added
to the ground-truth CSV.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f018790bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +20 to +23
__half2 a0 = __float2half2_rn(base);
__half2 a1 = __float2half2_rn(base + 0.1f);
__half2 a2 = __float2half2_rn(base + 0.2f);
__half2 a3 = __float2half2_rn(base + 0.3f);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Support fp16 setup opcodes before adding benchmark

When this new fp16 kernel is run, it reaches instructions the CDNA3 emulator still does not decode or execute: llvm-objdump-20 -d --mcpu=gfx942 on the committed kernels_gfx942.hsaco shows these initial __float2half2_rn conversions emit v_fma_mixlo_f16, v_fma_mixhi_f16, and v_pack_b32_f16, while a repo-wide search for those mnemonics finds no decoder/table entry or ALU handler (only v_pk_add_f16 was added). As a result, the fp16_throughput sample will panic on the first run instead of producing a throughput point unless those opcodes are implemented or the kernel is changed to avoid them.

Useful? React with 👍 / 👎.

The array-size axis was too coarse to resolve the cache hierarchy: 9 sizes
skipping 32/128/512 KB and 2/32/128 MB. Expand to 21 sizes spanning 4 KB ->
128 MB, with extra density around the L1 (~32 KB) and L2 (~4-8 MB)
transitions where the latency steps live. Run both num_accesses values at
every size (256 quick + 1024 amortized cross-check) -> 42 points.

Raise the sweep budget (SWEEP_TIMEOUT 60 -> 100 min, per-run 10 -> 15 min)
and the cache job timeout (90 -> 110 min) so the full grid runs without any
config being skipped. Each run is a single-thread chase of <=1024 dependent
loads (~1e-5 sim-s), so the whole grid still completes quickly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0ae943f0b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if (ix >= 1 && ix < nx - 1 &&
iy >= 1 && iy < ny - 1) {
for (int iz = 1; iz < nz - 1; ++iz) {
int idx = iz * ny * nx + iy * nx + ix;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Support v_xad before enabling stencil sample

With the default sample size (N=32), this z-loop is entered, and disassembling the committed kernels_gfx942.hsaco shows the index computation emits v_xad_u32 (opcode 499). I checked the decoder/emulator tables with rg 'v_xad|499' amd/insts/decodetable.go amd/emu/cdna3 and there is no entry or ALU handler, so amd/samples/parboil_stencil will fail to decode before producing a result unless v_xad_u32 is implemented or the kernel is compiled to avoid it.

Useful? React with 👍 / 👎.

…-only report

Rework the calibration report around KERNEL EXECUTION TIME (sim vs real),
dropping the derived GFLOPS / ns-per-access metrics and the markdown report.

- Ground truth: cache_latency was never summarized (it used to panic in the
  sim). The DB has it -- 19 array_bytes points at num_accesses=2,000,000,
  metric latency_ns, with kernel times. Add cache_latency to
  summarize_ground_truth.py SPECS and regenerate mi300a_ground_truth.csv
  (now 325 fp32 + 19 cache_latency rows).

- cache_latency sweep now runs the REAL benchmark's 2,000,000 dependent loads
  (was a throwaway 256/1024) at exactly the 19 ground-truth array sizes, so
  each simulated kernel time pairs with a real one. 2M loads are expensive, so
  per-run (20 min) / sweep (90 min) timeouts bound it and the large
  DRAM-footprint arrays may be skipped; runs smallest-first so the
  cache-resident points always complete. Cache job timeout 110 -> 120 min.

- plot_calibration.py rewritten: figures only (no prose). One figure per
  non-scaling combo (fp32: num_blocks x threads_per_block, x=fmas; cache:
  num_accesses, x=array_bytes), y = kernel time (ms), two lines sim + real.
  Reports the average error = mean |sim - real| (ms) per figure (in the title)
  and overall (to the job summary).

- Drop the text report: remove compare_to_ground_truth.py and
  report_template.md; the report job now just plots and uploads figures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 59b05d26ab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

for (int j = tid; j < numNodes; j += bdim) {
float ST = opt.S * powf(u, (float)(2 * j - numSteps));
float payoff = fmaxf(opt.K - ST, 0.0f);
shared_values[j] = payoff;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Implement ds_write2st64 before enabling binomial sample

With the default NumSteps=64, this shared-memory write is executed, and disassembling the committed kernels_gfx942.hsaco shows it emits ds_write2st64_b32 at several terminal-payoff/backward-induction stores; the CDNA3 DS emulator only dispatches opcodes 13, 14, 30, 54, 55, 78, 118, 119, 223, and 255, so opcode 15 falls into log.Panicf("Opcode %d for DS format is not implemented"). As a result, amd/samples/tango_binomial_options will panic before producing prices unless ds_write2st64_b32 is implemented or the kernel is compiled to avoid it.

Useful? React with 👍 / 👎.

syifan and others added 2 commits June 17, 2026 15:45
Resolves 9 add/add + content conflicts from main re-landing the gpu_benchmarks
port (PR #269) and removing residual calibration artifacts (PR #270):

- fp32_throughput (fp32throughput.go, native cpp + both hsaco, sample main.go):
  KEEP OURS. Our branch parameterizes the work-group size (ThreadsPerBlock arg,
  kernarg_segment_size=16) so the calibration sweep can match the ground-truth
  threads_per_block dimension; main reverted it to a fixed 256 (kernarg=12).
  Go/cpp/hsaco are a single ABI unit, so all five take our version.

- CDNA3 emulator opcodes (vop1.go, vop3a.go, vopc.go, newopcodes_test.go):
  TAKE THEIRS. Main's PR #269 carries opcode bugfixes our base predates --
  half-subnormal float16->float32 decode (+ a new test), correct NaN semantics
  for VCMP_NLG/VCMP_LG, and the packed-f16 op_sel_hi convention. Unrelated to
  the calibration work; main's reviewed versions are canonical.

Main's deletion of the old gpu_perf_scripts/ root microbenchmarks and
mi300a_calibration.md applies cleanly (our new harness lives in
gpu_perf_scripts/calibration/ and the SADDR coalescer fix is untouched).

Verified: go build ./...; go test ./amd/emu/cdna3/... ./amd/timing/cu/...;
fp32_throughput (-threads-per-block) and cache_latency both Pass in timing mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The SADDR "page not found" fix was applied at the wrong layer and broke GCN3
timing (and Deterministicity) in CI: keying the coalescer off SAddr != 0x7F
mis-classifies GCN3 OFF-mode FLAT (SAddr == 0) as SADDR mode.

Root cause: the timing CU built its decoder with insts.NewDisassembler() and
never set IsCDNA3, so CDNA3 kernels were decoded with GCN3 FLAT semantics. The
disassembler already resolves the architecture-dependent SADDR/OFF rule at
decode time (CDNA3: OFF iff SAddr==0x7F; GCN3: OFF iff SAddr==0x7F or 0) and
encodes the result in inst.Addr.RegCount -- but only when IsCDNA3 is set.

Fix:
- shaderarray.Builder: add WithDecoderBuilder, mirroring WithALUBuilder, so the
  per-CU instruction decoder can be supplied.
- mi300a timing builder: supply an insts.Disassembler with IsCDNA3 = true
  (alongside the existing CDNA3 ALU), so CDNA3 SADDR loads decode to RegCount=1.
- defaultCoalescer.readFlatAddr: revert to the original, arch-agnostic
  `inst.Addr.RegCount == 1` rule. With correct decode it is right for both
  GCN3 and CDNA3, and GCN3 returns to its proven path.

Verified: go build ./...; go test ./amd/timing/cu/... ./amd/emu/cdna3/...;
GCN3 timing (fir, nbody, vectoradd) Pass; CDNA3 timing (cache_latency, fp32,
polybench_2mm) Pass; golangci-lint clean on the changed packages.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ab8f5b1aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +31 to +33
FP32_NUM_BLOCKS=(1 32 1024)
FP32_THREADS_PER_BLOCK=(1 32 64 256 1024)
FP32_FMAS=(256 1024 4096 16384 65536 262144 1048576)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Sweep every committed fp32 calibration point

The committed ground-truth CSV has fp32_throughput rows for 5 num_blocks values (including 4096 and 32768) and 13 fmas_per_thread values (including 512, 2048, 8192, 32768, 131072, and 524288), but this grid launches only 3 × 5 × 7 = 105 of the 5 × 5 × 13 expected points. Since the plotting step only iterates simulated groups/points, those omitted real-hardware configurations disappear from the calibration report rather than showing as missing or contributing to error, so the workflow no longer compares against the full committed ground truth it claims to cover.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant