MI300A calibration: fp32_throughput CI vs real-hardware ground truth#271
MI300A calibration: fp32_throughput CI vs real-hardware ground truth#271syifan wants to merge 17 commits into
Conversation
Update the DOCKER_IMAGE tag from rocm/dev-ubuntu-24.04:7.1.1 to 7.2.4 across every native/ HIP-compilation Makefile and CLAUDE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The functional emulator (amd/emu/computeunit.go) already packs CDNA3 (code object v5) work-item IDs into a single VGPR as v0 = x | y<<10 | z<<20, but the timing wavefront dispatcher still wrote them GCN3-style to separate registers (v0=x, v1=y, v2=z). gfx942 kernels read threadIdx.y/.z from packed v0, so under the timing model every 2D/3D-workgroup kernel saw threadIdx.y/.z == 0 and produced wrong results -- including the existing amdappsdk/matrixtranspose benchmark, which failed -verify on mi300a. Mirror the emulator's packing, gated on CodeObjectV5 so the GCN3 (v2/v3) path is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add CDNA3 functional-emulator support for opcodes the gfx942 compiler emits from
common math/indexing that were previously undecodable or unimplemented:
- unordered f32 compares v_cmp_ngt/nlt_f32 and family (e32 + e64), from
expf/logf/sqrtf
- v_add_f32_e64 (float add carrying abs/neg source modifiers)
- v_mad_i64_i32 and a v_mad_u64_u32 decode-width fix (64-bit dst/src2), for
64-bit/size_t index arithmetic
- v_or3_b32, v_sub_u32 (VOP3A clamp form), s_mul_hi_i32
- v_ldexp_f32 (inside expf/logf)
- v_pk_add_f16 with round-to-nearest-even f16<->f32 conversion
Adds amd/emu/cdna3/newopcodes_test.go covering the new ops. Unblocks
polybench_correlation, npb_ep, rodinia_backprop/srad, altis_cfd,
tango_blackscholes, and polybench_3dconv on mi300a. GCN3 is unaffected (new
entries sit at gfx942-only opcodes; verified fir -arch gcn3 still passes).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Port HIP compute kernels from sarchlab/gpu_benchmarks (tier-1 microbenchmarks,
PolyBench, Rodinia, Parboil, Tango, NPB, Altis, BabelStream) to run on the
MGPUSim MI300A (gfx942) model. Each port provides:
- a standalone extern "C" kernel under native/ + a Docker hipcc Makefile,
- a gfx942 kernels_gfx942.hsaco,
- a Go driver whose KernelArgs match the compiled kernel's AMDGPU kernarg
metadata, with a CPU-reference Verify(),
- a runnable sample under amd/samples/.
~29 of these verify on -timing -gpu mi300a -arch cdna3 -verify. The remainder
compile and run but are blocked by deeper timing-model issues (dependent-load
MMU translation, DrainCommandQueue sync, multi-workgroup dispatch) to be
addressed gradually.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bare `conv2d` pattern in .gitignore (intended for the gpu_perf_scripts/conv2d binary) also matched amd/benchmarks/polybench/conv2d/, so the package was never committed and CI's `go build ./...` failed: amd/samples/polybench_2dconv/main.go: no required module provides package github.com/sarchlab/mgpusim/v5/amd/benchmarks/polybench/conv2d Force-add the package source. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
golangci-lint (funlen/gocognit) flagged the CPU-reference Verify() and data-setup functions in several ported benchmarks. Annotate them with //nolint:funlen,gocognit (matching existing usage in amd/emu and amd/benchmarks/amdappsdk) rather than refactoring verified code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a manual GitHub Action that runs the MI300A (CDNA3) timing model on the
fp32_throughput microbenchmark, derives GFLOPS from the simulated kernel_time,
and compares against real-hardware ground truth using absolute error. The
report is published both as a downloadable artifact and rendered in the job
summary ($GITHUB_STEP_SUMMARY).
Ground truth is summarized from the ~9 MB mi300a_ground_truth.db into a small
committable CSV (gpu_perf_scripts/calibration/mi300a_ground_truth.csv, 325 fp32
rows) via summarize_ground_truth.py, so the DB itself is not committed.
cache_latency is deferred: it panics in timing mode ("page not found" in the
Akita MMU page walk) for all parameters; emulation mode works.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The summarized fp32 ground truth must be tracked for the calibration workflow's compare step; add a .gitignore negation since the repo globally ignores *.csv. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
workflow_dispatch only works once the file is on the default branch, so add a path-filtered push trigger for the mi300a-calibration branch to run it now. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef5cc451fc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| tag="fp32_nb${nb}_f${fmas}" | ||
| echo " $tag" | ||
| kt="$(run_and_extract "$BUILD_DIR/fp32_throughput" "$tag" -num-blocks "$nb" -fmas "$fmas")" | ||
| [[ -n "$kt" ]] && echo "fp32_throughput,fmas_per_thread,${fmas},${nb},256,${fmas},,${kt}" >> "$OUT_ABS" |
There was a problem hiding this comment.
Fail the sweep when a calibration point is missing
If any simulator invocation crashes or stops producing kernel_time, run_and_extract yields an empty kt and this condition simply omits that calibration row; because the comparison script accepts any number of rows, the workflow can still publish a green report with a partial or even header-only sweep. That makes exactly the failing fp32 configuration invisible in CI instead of flagging the regression, so this should fail the job when an expected point has no extracted time.
Useful? React with 👍 / 👎.
…ruth-ordered sweep
- fp32 kernel now takes work-group size as an explicit arg (recompiled gfx942)
so the sweep can vary threads_per_block and match all ground-truth configs.
blockDim.x was unusable: it reads 0 in the MGPUSim model for multi-block
launches. Verified correct across block sizes {1,32,64,256,1024} x multi-block.
- Sweep now varies num_blocks x threads_per_block x fmas, bounded by timeouts
(per-run 10 min, whole-sweep 120 min) instead of a cost heuristic, and runs
configs ordered by real-hardware execution time (shortest first) so the budget
completes the cheap-to-simulate configs first.
- Comparator/report gain a threads_per_block column.
- Add plot_calibration.py: one execution-time figure per (num_blocks,
threads_per_block), x=fmas, y=ms, simulation vs real-hardware lines
(seaborn/matplotlib); workflow installs deps and uploads figures/.
- Calibration workflow is manual-only (workflow_dispatch), job timeout 150 min.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c908c7f5f0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| src0 := math.Float32frombits(uint32(applyF32Modifier(state.ReadOperand(inst.Src0, i), 0, inst))) | ||
| src1 := math.Float32frombits(uint32(applyF32Modifier(state.ReadOperand(inst.Src1, i), 1, inst))) | ||
| if !(src0 != src1) { |
There was a problem hiding this comment.
Preserve NLG's unordered NaN behavior
When either operand is NaN, Go evaluates src0 != src1 to true, so this condition is false and v_cmp_nlg_f32_e64 does not set the destination bit. NLG is the unordered-or-equal predicate, so NaN operands should pass here (the same issue exists in the new VOPC NLG helper, while LG above has the opposite NaN problem); kernels that branch on this comparison with NaNs will take the wrong path.
Useful? React with 👍 / 👎.
| // Subnormal half: normalize. | ||
| e := -1 | ||
| m := mant | ||
| for (m & 0x0400) == 0 { | ||
| m <<= 1 | ||
| e-- | ||
| } | ||
| m &= 0x03FF | ||
| exp32 := uint32(127-15+e+1) << 23 |
There was a problem hiding this comment.
Fix the half-subnormal exponent
For half-precision subnormals this normalization starts e at -1 and then decrements once per left shift, which makes every subnormal one exponent too small; for example 0x0001 converts to 2^-25 instead of the IEEE value 2^-24. Any v_pk_add_f16 input or result involving tiny half values will therefore be computed with the wrong magnitude.
Useful? React with 👍 / 👎.
readFlatAddr decided SADDR vs OFF addressing mode from inst.Addr.RegCount (RegCount==1 => SADDR), but gfx942 (CDNA3) emits SADDR-form global load/store whose address VGPR operand still has RegCount==2. Those were misclassified as OFF mode and the raw VGPR pair was used as a full 64-bit address, yielding a bogus translation and an MMU "Panic: page not found" in timing mode (emulation was unaffected -- it uses the correct rule). Switch to the same test the CDNA3 emulator uses (amd/emu/cdna3/flat.go): SADDR is active iff an SAddr operand is present and is not the 0x7F "OFF" sentinel. This unblocks the cache_latency microbenchmark (single-thread pointer chase over a read buffer) in timing mode; fp32_throughput, which only uses an OFF-mode store, is unaffected. cu/cdna3 unit tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…port job
Restructure the MI300A calibration workflow from a single job into three:
fp32_sweep ----\
>--> report (needs both; if: always())
cache_latency --/
Each benchmark sweep is now its own job, so they run in parallel and a
failure in one (build, timeout, infra) cannot block the others. A single
report job runs after both sweeps, downloads their CSV artifacts, and
renders the combined markdown report + figures.
- Add run_cache_latency_sweep.sh: builds the cache_latency sample, sweeps
array_bytes across the cache hierarchy (L1/L2/DRAM) x num_accesses under
the CDNA3/MI300A timing config, and records kernel_time + derived
ns/access. Same per-run / whole-sweep timeout bounding as the fp32 sweep;
tolerates failed runs (records them, exits 0) so the report always runs.
- compare_to_ground_truth.py: optional --cache-sim input renders a
sim-only cache_latency section (no committed ground truth yet) and a
summary row; graceful note when the sweep produced no points.
- report_template.md: add the cache_latency section + tokens.
cache_latency now runs in timing mode (see the preceding coalescer fix);
its results are reported sim-only until real-hardware latencies are added
to the ground-truth CSV.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f018790bf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| __half2 a0 = __float2half2_rn(base); | ||
| __half2 a1 = __float2half2_rn(base + 0.1f); | ||
| __half2 a2 = __float2half2_rn(base + 0.2f); | ||
| __half2 a3 = __float2half2_rn(base + 0.3f); |
There was a problem hiding this comment.
Support fp16 setup opcodes before adding benchmark
When this new fp16 kernel is run, it reaches instructions the CDNA3 emulator still does not decode or execute: llvm-objdump-20 -d --mcpu=gfx942 on the committed kernels_gfx942.hsaco shows these initial __float2half2_rn conversions emit v_fma_mixlo_f16, v_fma_mixhi_f16, and v_pack_b32_f16, while a repo-wide search for those mnemonics finds no decoder/table entry or ALU handler (only v_pk_add_f16 was added). As a result, the fp16_throughput sample will panic on the first run instead of producing a throughput point unless those opcodes are implemented or the kernel is changed to avoid them.
Useful? React with 👍 / 👎.
The array-size axis was too coarse to resolve the cache hierarchy: 9 sizes skipping 32/128/512 KB and 2/32/128 MB. Expand to 21 sizes spanning 4 KB -> 128 MB, with extra density around the L1 (~32 KB) and L2 (~4-8 MB) transitions where the latency steps live. Run both num_accesses values at every size (256 quick + 1024 amortized cross-check) -> 42 points. Raise the sweep budget (SWEEP_TIMEOUT 60 -> 100 min, per-run 10 -> 15 min) and the cache job timeout (90 -> 110 min) so the full grid runs without any config being skipped. Each run is a single-thread chase of <=1024 dependent loads (~1e-5 sim-s), so the whole grid still completes quickly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0ae943f0b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (ix >= 1 && ix < nx - 1 && | ||
| iy >= 1 && iy < ny - 1) { | ||
| for (int iz = 1; iz < nz - 1; ++iz) { | ||
| int idx = iz * ny * nx + iy * nx + ix; |
There was a problem hiding this comment.
Support v_xad before enabling stencil sample
With the default sample size (N=32), this z-loop is entered, and disassembling the committed kernels_gfx942.hsaco shows the index computation emits v_xad_u32 (opcode 499). I checked the decoder/emulator tables with rg 'v_xad|499' amd/insts/decodetable.go amd/emu/cdna3 and there is no entry or ALU handler, so amd/samples/parboil_stencil will fail to decode before producing a result unless v_xad_u32 is implemented or the kernel is compiled to avoid it.
Useful? React with 👍 / 👎.
…-only report Rework the calibration report around KERNEL EXECUTION TIME (sim vs real), dropping the derived GFLOPS / ns-per-access metrics and the markdown report. - Ground truth: cache_latency was never summarized (it used to panic in the sim). The DB has it -- 19 array_bytes points at num_accesses=2,000,000, metric latency_ns, with kernel times. Add cache_latency to summarize_ground_truth.py SPECS and regenerate mi300a_ground_truth.csv (now 325 fp32 + 19 cache_latency rows). - cache_latency sweep now runs the REAL benchmark's 2,000,000 dependent loads (was a throwaway 256/1024) at exactly the 19 ground-truth array sizes, so each simulated kernel time pairs with a real one. 2M loads are expensive, so per-run (20 min) / sweep (90 min) timeouts bound it and the large DRAM-footprint arrays may be skipped; runs smallest-first so the cache-resident points always complete. Cache job timeout 110 -> 120 min. - plot_calibration.py rewritten: figures only (no prose). One figure per non-scaling combo (fp32: num_blocks x threads_per_block, x=fmas; cache: num_accesses, x=array_bytes), y = kernel time (ms), two lines sim + real. Reports the average error = mean |sim - real| (ms) per figure (in the title) and overall (to the job summary). - Drop the text report: remove compare_to_ground_truth.py and report_template.md; the report job now just plots and uploads figures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 59b05d26ab
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for (int j = tid; j < numNodes; j += bdim) { | ||
| float ST = opt.S * powf(u, (float)(2 * j - numSteps)); | ||
| float payoff = fmaxf(opt.K - ST, 0.0f); | ||
| shared_values[j] = payoff; |
There was a problem hiding this comment.
Implement ds_write2st64 before enabling binomial sample
With the default NumSteps=64, this shared-memory write is executed, and disassembling the committed kernels_gfx942.hsaco shows it emits ds_write2st64_b32 at several terminal-payoff/backward-induction stores; the CDNA3 DS emulator only dispatches opcodes 13, 14, 30, 54, 55, 78, 118, 119, 223, and 255, so opcode 15 falls into log.Panicf("Opcode %d for DS format is not implemented"). As a result, amd/samples/tango_binomial_options will panic before producing prices unless ds_write2st64_b32 is implemented or the kernel is compiled to avoid it.
Useful? React with 👍 / 👎.
Resolves 9 add/add + content conflicts from main re-landing the gpu_benchmarks port (PR #269) and removing residual calibration artifacts (PR #270): - fp32_throughput (fp32throughput.go, native cpp + both hsaco, sample main.go): KEEP OURS. Our branch parameterizes the work-group size (ThreadsPerBlock arg, kernarg_segment_size=16) so the calibration sweep can match the ground-truth threads_per_block dimension; main reverted it to a fixed 256 (kernarg=12). Go/cpp/hsaco are a single ABI unit, so all five take our version. - CDNA3 emulator opcodes (vop1.go, vop3a.go, vopc.go, newopcodes_test.go): TAKE THEIRS. Main's PR #269 carries opcode bugfixes our base predates -- half-subnormal float16->float32 decode (+ a new test), correct NaN semantics for VCMP_NLG/VCMP_LG, and the packed-f16 op_sel_hi convention. Unrelated to the calibration work; main's reviewed versions are canonical. Main's deletion of the old gpu_perf_scripts/ root microbenchmarks and mi300a_calibration.md applies cleanly (our new harness lives in gpu_perf_scripts/calibration/ and the SADDR coalescer fix is untouched). Verified: go build ./...; go test ./amd/emu/cdna3/... ./amd/timing/cu/...; fp32_throughput (-threads-per-block) and cache_latency both Pass in timing mode. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The SADDR "page not found" fix was applied at the wrong layer and broke GCN3 timing (and Deterministicity) in CI: keying the coalescer off SAddr != 0x7F mis-classifies GCN3 OFF-mode FLAT (SAddr == 0) as SADDR mode. Root cause: the timing CU built its decoder with insts.NewDisassembler() and never set IsCDNA3, so CDNA3 kernels were decoded with GCN3 FLAT semantics. The disassembler already resolves the architecture-dependent SADDR/OFF rule at decode time (CDNA3: OFF iff SAddr==0x7F; GCN3: OFF iff SAddr==0x7F or 0) and encodes the result in inst.Addr.RegCount -- but only when IsCDNA3 is set. Fix: - shaderarray.Builder: add WithDecoderBuilder, mirroring WithALUBuilder, so the per-CU instruction decoder can be supplied. - mi300a timing builder: supply an insts.Disassembler with IsCDNA3 = true (alongside the existing CDNA3 ALU), so CDNA3 SADDR loads decode to RegCount=1. - defaultCoalescer.readFlatAddr: revert to the original, arch-agnostic `inst.Addr.RegCount == 1` rule. With correct decode it is right for both GCN3 and CDNA3, and GCN3 returns to its proven path. Verified: go build ./...; go test ./amd/timing/cu/... ./amd/emu/cdna3/...; GCN3 timing (fir, nbody, vectoradd) Pass; CDNA3 timing (cache_latency, fp32, polybench_2mm) Pass; golangci-lint clean on the changed packages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ab8f5b1aa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| FP32_NUM_BLOCKS=(1 32 1024) | ||
| FP32_THREADS_PER_BLOCK=(1 32 64 256 1024) | ||
| FP32_FMAS=(256 1024 4096 16384 65536 262144 1048576) |
There was a problem hiding this comment.
Sweep every committed fp32 calibration point
The committed ground-truth CSV has fp32_throughput rows for 5 num_blocks values (including 4096 and 32768) and 13 fmas_per_thread values (including 512, 2048, 8192, 32768, 131072, and 524288), but this grid launches only 3 × 5 × 7 = 105 of the 5 × 5 × 13 expected points. Since the plotting step only iterates simulated groups/points, those omitted real-hardware configurations disappear from the calibration report rather than showing as missing or contributing to error, so the workflow no longer compares against the full committed ground truth it claims to cover.
Useful? React with 👍 / 👎.
Adds the first piece of MI300A timing-model calibration: a CI workflow that runs the simulator on the
fp32_throughputmicrobenchmark and compares simulated GFLOPS against real-hardware ground truth.What's here
.github/workflows/mi300a_calibration.yml— runs the sweep → compares → publishes a markdown report both as a downloadable artifact and rendered in the job summary ($GITHUB_STEP_SUMMARY). Self-hosted runner, 60-min timeout.gpu_perf_scripts/calibration/run_sim_sweep.sh— buildsamd/samples/fp32_throughput, sweeps configs under-arch cdna3 -gpu mi300a -timing, extracts eachkernel_timefrom themgpusim_metricsSQLite table.compare_to_ground_truth.py— derives GFLOPS, computes absolute error (sim − real) + MAE, fills the template.summarize_ground_truth.py— regenerates the ground-truth CSV frommi300a_ground_truth.db.report_template.md— editable report template.mi300a_ground_truth.csv— committed ground truth (325 fp32 rows).Why a CSV, not the DB
mi300a_ground_truth.dbis ~9 MB (every individual repetition across 7 benchmarks + ~5 MB of indexes;atomic_operationsalone is 56%). Calibration only needs the per-config mean over the 7 reps, so it's summarized to a 24 KB CSV. The DB is not committed. (Required a.gitignorenegation since the repo globally ignores*.csv.)Scope / known limitations
cache_latencyis deferred — it panics in timing mode (Panic: page not foundin the Akita MMU page walk) for all parameters; emulation mode works.workflow_dispatchonly registers once the file is on the default branch, so a path-filteredpushtrigger formi300a-calibrationis included to run it now. The manual button works after this reachesmain.fmas). Closing this gap is the calibration work to follow.🤖 Generated with Claude Code