MI300A calibration: fp32_throughput CI vs real-hardware ground truth by syifan · Pull Request #271 · sarchlab/mgpusim

syifan · 2026-06-16T22:47:51Z

Adds the first piece of MI300A timing-model calibration: a CI workflow that runs the simulator on the fp32_throughput microbenchmark and compares simulated GFLOPS against real-hardware ground truth.

Stacked on #269 (base = mi300a-bench-port, which holds the microbench code this depends on). Retarget to main once #269 merges.

What's here

.github/workflows/mi300a_calibration.yml — runs the sweep → compares → publishes a markdown report both as a downloadable artifact and rendered in the job summary ($GITHUB_STEP_SUMMARY). Self-hosted runner, 60-min timeout.
gpu_perf_scripts/calibration/
- run_sim_sweep.sh — builds amd/samples/fp32_throughput, sweeps configs under -arch cdna3 -gpu mi300a -timing, extracts each kernel_time from the mgpusim_metrics SQLite table.
- compare_to_ground_truth.py — derives GFLOPS, computes absolute error (sim − real) + MAE, fills the template.
- summarize_ground_truth.py — regenerates the ground-truth CSV from mi300a_ground_truth.db.
- report_template.md — editable report template.
- mi300a_ground_truth.csv — committed ground truth (325 fp32 rows).

Why a CSV, not the DB

mi300a_ground_truth.db is ~9 MB (every individual repetition across 7 benchmarks + ~5 MB of indexes; atomic_operations alone is 56%). Calibration only needs the per-config mean over the 7 reps, so it's summarized to a 24 KB CSV. The DB is not committed. (Required a .gitignore negation since the repo globally ignores *.csv.)

Scope / known limitations

fp32_throughput only. cache_latency is deferred — it panics in timing mode (Panic: page not found in the Akita MMU page walk) for all parameters; emulation mode works.
Trigger: workflow_dispatch only registers once the file is on the default branch, so a path-filtered push trigger for mi300a-calibration is included to run it now. The manual button works after this reaches main.
Baseline is uncalibrated: sim fp32 currently runs ~21–22× faster than real HW (MAE ≈ 1,244 GFLOPS) — real HW has a ~0.5 ms fixed kernel-time floor the sim lacks, and the sim runs all blocks in parallel on 120 CUs. The GFLOPS comparison is most meaningful in the throughput-bound regime (large fmas). Closing this gap is the calibration work to follow.

🤖 Generated with Claude Code

Update the DOCKER_IMAGE tag from rocm/dev-ubuntu-24.04:7.1.1 to 7.2.4 across every native/ HIP-compilation Makefile and CLAUDE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The functional emulator (amd/emu/computeunit.go) already packs CDNA3 (code object v5) work-item IDs into a single VGPR as v0 = x | y<<10 | z<<20, but the timing wavefront dispatcher still wrote them GCN3-style to separate registers (v0=x, v1=y, v2=z). gfx942 kernels read threadIdx.y/.z from packed v0, so under the timing model every 2D/3D-workgroup kernel saw threadIdx.y/.z == 0 and produced wrong results -- including the existing amdappsdk/matrixtranspose benchmark, which failed -verify on mi300a. Mirror the emulator's packing, gated on CodeObjectV5 so the GCN3 (v2/v3) path is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add CDNA3 functional-emulator support for opcodes the gfx942 compiler emits from common math/indexing that were previously undecodable or unimplemented: - unordered f32 compares v_cmp_ngt/nlt_f32 and family (e32 + e64), from expf/logf/sqrtf - v_add_f32_e64 (float add carrying abs/neg source modifiers) - v_mad_i64_i32 and a v_mad_u64_u32 decode-width fix (64-bit dst/src2), for 64-bit/size_t index arithmetic - v_or3_b32, v_sub_u32 (VOP3A clamp form), s_mul_hi_i32 - v_ldexp_f32 (inside expf/logf) - v_pk_add_f16 with round-to-nearest-even f16<->f32 conversion Adds amd/emu/cdna3/newopcodes_test.go covering the new ops. Unblocks polybench_correlation, npb_ep, rodinia_backprop/srad, altis_cfd, tango_blackscholes, and polybench_3dconv on mi300a. GCN3 is unaffected (new entries sit at gfx942-only opcodes; verified fir -arch gcn3 still passes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Port HIP compute kernels from sarchlab/gpu_benchmarks (tier-1 microbenchmarks, PolyBench, Rodinia, Parboil, Tango, NPB, Altis, BabelStream) to run on the MGPUSim MI300A (gfx942) model. Each port provides: - a standalone extern "C" kernel under native/ + a Docker hipcc Makefile, - a gfx942 kernels_gfx942.hsaco, - a Go driver whose KernelArgs match the compiled kernel's AMDGPU kernarg metadata, with a CPU-reference Verify(), - a runnable sample under amd/samples/. ~29 of these verify on -timing -gpu mi300a -arch cdna3 -verify. The remainder compile and run but are blocked by deeper timing-model issues (dependent-load MMU translation, DrainCommandQueue sync, multi-workgroup dispatch) to be addressed gradually. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The bare `conv2d` pattern in .gitignore (intended for the gpu_perf_scripts/conv2d binary) also matched amd/benchmarks/polybench/conv2d/, so the package was never committed and CI's `go build ./...` failed: amd/samples/polybench_2dconv/main.go: no required module provides package github.com/sarchlab/mgpusim/v5/amd/benchmarks/polybench/conv2d Force-add the package source. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

golangci-lint (funlen/gocognit) flagged the CPU-reference Verify() and data-setup functions in several ported benchmarks. Annotate them with //nolint:funlen,gocognit (matching existing usage in amd/emu and amd/benchmarks/amdappsdk) rather than refactoring verified code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds a manual GitHub Action that runs the MI300A (CDNA3) timing model on the fp32_throughput microbenchmark, derives GFLOPS from the simulated kernel_time, and compares against real-hardware ground truth using absolute error. The report is published both as a downloadable artifact and rendered in the job summary ($GITHUB_STEP_SUMMARY). Ground truth is summarized from the ~9 MB mi300a_ground_truth.db into a small committable CSV (gpu_perf_scripts/calibration/mi300a_ground_truth.csv, 325 fp32 rows) via summarize_ground_truth.py, so the DB itself is not committed. cache_latency is deferred: it panics in timing mode ("page not found" in the Akita MMU page walk) for all parameters; emulation mode works. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The summarized fp32 ground truth must be tracked for the calibration workflow's compare step; add a .gitignore negation since the repo globally ignores *.csv. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

workflow_dispatch only works once the file is on the default branch, so add a path-filtered push trigger for the mi300a-calibration branch to run it now. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef5cc451fc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-16T22:51:51Z

+    tag="fp32_nb${nb}_f${fmas}"
+    echo "  $tag"
+    kt="$(run_and_extract "$BUILD_DIR/fp32_throughput" "$tag" -num-blocks "$nb" -fmas "$fmas")"
+    [[ -n "$kt" ]] && echo "fp32_throughput,fmas_per_thread,${fmas},${nb},256,${fmas},,${kt}" >> "$OUT_ABS"


Fail the sweep when a calibration point is missing

If any simulator invocation crashes or stops producing kernel_time, run_and_extract yields an empty kt and this condition simply omits that calibration row; because the comparison script accepts any number of rows, the workflow can still publish a green report with a partial or even header-only sweep. That makes exactly the failing fp32 configuration invisible in CI instead of flagging the regression, so this should fail the job when an expected point has no extracted time.

Useful? React with 👍 / 👎.

…ruth-ordered sweep - fp32 kernel now takes work-group size as an explicit arg (recompiled gfx942) so the sweep can vary threads_per_block and match all ground-truth configs. blockDim.x was unusable: it reads 0 in the MGPUSim model for multi-block launches. Verified correct across block sizes {1,32,64,256,1024} x multi-block. - Sweep now varies num_blocks x threads_per_block x fmas, bounded by timeouts (per-run 10 min, whole-sweep 120 min) instead of a cost heuristic, and runs configs ordered by real-hardware execution time (shortest first) so the budget completes the cheap-to-simulate configs first. - Comparator/report gain a threads_per_block column. - Add plot_calibration.py: one execution-time figure per (num_blocks, threads_per_block), x=fmas, y=ms, simulation vs real-hardware lines (seaborn/matplotlib); workflow installs deps and uploads figures/. - Calibration workflow is manual-only (workflow_dispatch), job timeout 150 min. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c908c7f5f0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-17T16:12:10Z

+		}
+		src0 := math.Float32frombits(uint32(applyF32Modifier(state.ReadOperand(inst.Src0, i), 0, inst)))
+		src1 := math.Float32frombits(uint32(applyF32Modifier(state.ReadOperand(inst.Src1, i), 1, inst)))
+		if !(src0 != src1) {


Preserve NLG's unordered NaN behavior

When either operand is NaN, Go evaluates src0 != src1 to true, so this condition is false and v_cmp_nlg_f32_e64 does not set the destination bit. NLG is the unordered-or-equal predicate, so NaN operands should pass here (the same issue exists in the new VOPC NLG helper, while LG above has the opposite NaN problem); kernels that branch on this comparison with NaNs will take the wrong path.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-17T16:12:10Z

+		// Subnormal half: normalize.
+		e := -1
+		m := mant
+		for (m & 0x0400) == 0 {
+			m <<= 1
+			e--
+		}
+		m &= 0x03FF
+		exp32 := uint32(127-15+e+1) << 23


Fix the half-subnormal exponent

For half-precision subnormals this normalization starts e at -1 and then decrements once per left shift, which makes every subnormal one exponent too small; for example 0x0001 converts to 2^-25 instead of the IEEE value 2^-24. Any v_pk_add_f16 input or result involving tiny half values will therefore be computed with the wrong magnitude.

Useful? React with 👍 / 👎.

readFlatAddr decided SADDR vs OFF addressing mode from inst.Addr.RegCount (RegCount==1 => SADDR), but gfx942 (CDNA3) emits SADDR-form global load/store whose address VGPR operand still has RegCount==2. Those were misclassified as OFF mode and the raw VGPR pair was used as a full 64-bit address, yielding a bogus translation and an MMU "Panic: page not found" in timing mode (emulation was unaffected -- it uses the correct rule). Switch to the same test the CDNA3 emulator uses (amd/emu/cdna3/flat.go): SADDR is active iff an SAddr operand is present and is not the 0x7F "OFF" sentinel. This unblocks the cache_latency microbenchmark (single-thread pointer chase over a read buffer) in timing mode; fp32_throughput, which only uses an OFF-mode store, is unaffected. cu/cdna3 unit tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…port job Restructure the MI300A calibration workflow from a single job into three: fp32_sweep ----\ >--> report (needs both; if: always()) cache_latency --/ Each benchmark sweep is now its own job, so they run in parallel and a failure in one (build, timeout, infra) cannot block the others. A single report job runs after both sweeps, downloads their CSV artifacts, and renders the combined markdown report + figures. - Add run_cache_latency_sweep.sh: builds the cache_latency sample, sweeps array_bytes across the cache hierarchy (L1/L2/DRAM) x num_accesses under the CDNA3/MI300A timing config, and records kernel_time + derived ns/access. Same per-run / whole-sweep timeout bounding as the fp32 sweep; tolerates failed runs (records them, exits 0) so the report always runs. - compare_to_ground_truth.py: optional --cache-sim input renders a sim-only cache_latency section (no committed ground truth yet) and a summary row; graceful note when the sweep produced no points. - report_template.md: add the cache_latency section + tokens. cache_latency now runs in timing mode (see the preceding coalescer fix); its results are reported sim-only until real-hardware latencies are added to the ground-truth CSV. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f018790bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-17T16:45:36Z

+    __half2 a0 = __float2half2_rn(base);
+    __half2 a1 = __float2half2_rn(base + 0.1f);
+    __half2 a2 = __float2half2_rn(base + 0.2f);
+    __half2 a3 = __float2half2_rn(base + 0.3f);


Support fp16 setup opcodes before adding benchmark

When this new fp16 kernel is run, it reaches instructions the CDNA3 emulator still does not decode or execute: llvm-objdump-20 -d --mcpu=gfx942 on the committed kernels_gfx942.hsaco shows these initial __float2half2_rn conversions emit v_fma_mixlo_f16, v_fma_mixhi_f16, and v_pack_b32_f16, while a repo-wide search for those mnemonics finds no decoder/table entry or ALU handler (only v_pk_add_f16 was added). As a result, the fp16_throughput sample will panic on the first run instead of producing a throughput point unless those opcodes are implemented or the kernel is changed to avoid them.

Useful? React with 👍 / 👎.

The array-size axis was too coarse to resolve the cache hierarchy: 9 sizes skipping 32/128/512 KB and 2/32/128 MB. Expand to 21 sizes spanning 4 KB -> 128 MB, with extra density around the L1 (~32 KB) and L2 (~4-8 MB) transitions where the latency steps live. Run both num_accesses values at every size (256 quick + 1024 amortized cross-check) -> 42 points. Raise the sweep budget (SWEEP_TIMEOUT 60 -> 100 min, per-run 10 -> 15 min) and the cache job timeout (90 -> 110 min) so the full grid runs without any config being skipped. Each run is a single-thread chase of <=1024 dependent loads (~1e-5 sim-s), so the whole grid still completes quickly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0ae943f0b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-17T17:03:11Z

+    if (ix >= 1 && ix < nx - 1 &&
+        iy >= 1 && iy < ny - 1) {
+        for (int iz = 1; iz < nz - 1; ++iz) {
+            int idx = iz * ny * nx + iy * nx + ix;


Support v_xad before enabling stencil sample

With the default sample size (N=32), this z-loop is entered, and disassembling the committed kernels_gfx942.hsaco shows the index computation emits v_xad_u32 (opcode 499). I checked the decoder/emulator tables with rg 'v_xad|499' amd/insts/decodetable.go amd/emu/cdna3 and there is no entry or ALU handler, so amd/samples/parboil_stencil will fail to decode before producing a result unless v_xad_u32 is implemented or the kernel is compiled to avoid it.

Useful? React with 👍 / 👎.

…-only report Rework the calibration report around KERNEL EXECUTION TIME (sim vs real), dropping the derived GFLOPS / ns-per-access metrics and the markdown report. - Ground truth: cache_latency was never summarized (it used to panic in the sim). The DB has it -- 19 array_bytes points at num_accesses=2,000,000, metric latency_ns, with kernel times. Add cache_latency to summarize_ground_truth.py SPECS and regenerate mi300a_ground_truth.csv (now 325 fp32 + 19 cache_latency rows). - cache_latency sweep now runs the REAL benchmark's 2,000,000 dependent loads (was a throwaway 256/1024) at exactly the 19 ground-truth array sizes, so each simulated kernel time pairs with a real one. 2M loads are expensive, so per-run (20 min) / sweep (90 min) timeouts bound it and the large DRAM-footprint arrays may be skipped; runs smallest-first so the cache-resident points always complete. Cache job timeout 110 -> 120 min. - plot_calibration.py rewritten: figures only (no prose). One figure per non-scaling combo (fp32: num_blocks x threads_per_block, x=fmas; cache: num_accesses, x=array_bytes), y = kernel time (ms), two lines sim + real. Reports the average error = mean |sim - real| (ms) per figure (in the title) and overall (to the job summary). - Drop the text report: remove compare_to_ground_truth.py and report_template.md; the report job now just plots and uploads figures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 59b05d26ab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-17T17:46:52Z

+    for (int j = tid; j < numNodes; j += bdim) {
+        float ST = opt.S * powf(u, (float)(2 * j - numSteps));
+        float payoff = fmaxf(opt.K - ST, 0.0f);
+        shared_values[j] = payoff;


Implement ds_write2st64 before enabling binomial sample

With the default NumSteps=64, this shared-memory write is executed, and disassembling the committed kernels_gfx942.hsaco shows it emits ds_write2st64_b32 at several terminal-payoff/backward-induction stores; the CDNA3 DS emulator only dispatches opcodes 13, 14, 30, 54, 55, 78, 118, 119, 223, and 255, so opcode 15 falls into log.Panicf("Opcode %d for DS format is not implemented"). As a result, amd/samples/tango_binomial_options will panic before producing prices unless ds_write2st64_b32 is implemented or the kernel is compiled to avoid it.

Useful? React with 👍 / 👎.

Resolves 9 add/add + content conflicts from main re-landing the gpu_benchmarks port (PR #269) and removing residual calibration artifacts (PR #270): - fp32_throughput (fp32throughput.go, native cpp + both hsaco, sample main.go): KEEP OURS. Our branch parameterizes the work-group size (ThreadsPerBlock arg, kernarg_segment_size=16) so the calibration sweep can match the ground-truth threads_per_block dimension; main reverted it to a fixed 256 (kernarg=12). Go/cpp/hsaco are a single ABI unit, so all five take our version. - CDNA3 emulator opcodes (vop1.go, vop3a.go, vopc.go, newopcodes_test.go): TAKE THEIRS. Main's PR #269 carries opcode bugfixes our base predates -- half-subnormal float16->float32 decode (+ a new test), correct NaN semantics for VCMP_NLG/VCMP_LG, and the packed-f16 op_sel_hi convention. Unrelated to the calibration work; main's reviewed versions are canonical. Main's deletion of the old gpu_perf_scripts/ root microbenchmarks and mi300a_calibration.md applies cleanly (our new harness lives in gpu_perf_scripts/calibration/ and the SADDR coalescer fix is untouched). Verified: go build ./...; go test ./amd/emu/cdna3/... ./amd/timing/cu/...; fp32_throughput (-threads-per-block) and cache_latency both Pass in timing mode. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The SADDR "page not found" fix was applied at the wrong layer and broke GCN3 timing (and Deterministicity) in CI: keying the coalescer off SAddr != 0x7F mis-classifies GCN3 OFF-mode FLAT (SAddr == 0) as SADDR mode. Root cause: the timing CU built its decoder with insts.NewDisassembler() and never set IsCDNA3, so CDNA3 kernels were decoded with GCN3 FLAT semantics. The disassembler already resolves the architecture-dependent SADDR/OFF rule at decode time (CDNA3: OFF iff SAddr==0x7F; GCN3: OFF iff SAddr==0x7F or 0) and encodes the result in inst.Addr.RegCount -- but only when IsCDNA3 is set. Fix: - shaderarray.Builder: add WithDecoderBuilder, mirroring WithALUBuilder, so the per-CU instruction decoder can be supplied. - mi300a timing builder: supply an insts.Disassembler with IsCDNA3 = true (alongside the existing CDNA3 ALU), so CDNA3 SADDR loads decode to RegCount=1. - defaultCoalescer.readFlatAddr: revert to the original, arch-agnostic `inst.Addr.RegCount == 1` rule. With correct decode it is right for both GCN3 and CDNA3, and GCN3 returns to its proven path. Verified: go build ./...; go test ./amd/timing/cu/... ./amd/emu/cdna3/...; GCN3 timing (fir, nbody, vectoradd) Pass; CDNA3 timing (cache_latency, fp32, polybench_2mm) Pass; golangci-lint clean on the changed packages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ab8f5b1aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-17T20:04:38Z

+FP32_NUM_BLOCKS=(1 32 1024)
+FP32_THREADS_PER_BLOCK=(1 32 64 256 1024)
+FP32_FMAS=(256 1024 4096 16384 65536 262144 1048576)


Sweep every committed fp32 calibration point

The committed ground-truth CSV has fp32_throughput rows for 5 num_blocks values (including 4096 and 32768) and 13 fmas_per_thread values (including 512, 2048, 8192, 32768, 131072, and 524288), but this grid launches only 3 × 5 × 7 = 105 of the 5 × 5 × 13 expected points. Since the plotting step only iterates simulated groups/points, those omitted real-hardware configurations disappear from the calibration report rather than showing as missing or contributing to error, so the workflow no longer compares against the full committed ground truth it claims to cover.

Useful? React with 👍 / 👎.

syifan and others added 10 commits June 16, 2026 16:13

build(benchmarks): bump ROCm Docker image to 7.2.4 for gfx942 builds

74e9bb6

Update the DOCKER_IMAGE tag from rocm/dev-ubuntu-24.04:7.1.1 to 7.2.4 across every native/ HIP-compilation Makefile and CLAUDE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

style: gofmt ported benchmarks and cdna3 opcode handlers

10e9e8b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ci(mi300a): commit ground-truth CSV (un-ignore from *.csv)

01399e0

The summarized fp32 ground truth must be tracked for the calibration workflow's compare step; add a .gitignore negation since the repo globally ignores *.csv. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed Jun 16, 2026

View reviewed changes

Base automatically changed from mi300a-bench-port to main June 17, 2026 12:55

chatgpt-codex-connector Bot reviewed Jun 17, 2026

View reviewed changes

syifan and others added 2 commits June 17, 2026 12:38

chatgpt-codex-connector Bot reviewed Jun 17, 2026

View reviewed changes

syifan and others added 2 commits June 17, 2026 15:45

chatgpt-codex-connector Bot reviewed Jun 17, 2026

View reviewed changes

Conversation

syifan commented Jun 16, 2026

What's here

Why a CSV, not the DB

Scope / known limitations

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant