Skip to content

Add dejagnu tests for cooperative group GWS debugging#116

Open
spatrang wants to merge 1 commit into
amd-stagingfrom
users/spatrang/coop-group-gws-tests
Open

Add dejagnu tests for cooperative group GWS debugging#116
spatrang wants to merge 1 commit into
amd-stagingfrom
users/spatrang/coop-group-gws-tests

Conversation

@spatrang
Copy link
Copy Markdown

@spatrang spatrang commented May 7, 2026

Summary

Add dejagnu coverage for debugging AMD GPU cooperative-group kernels —
i.e. kernels launched via hipLaunchCooperativeKernel /
hipLaunchCooperativeKernelMultiDevice that synchronize at the grid /
multi-grid level. On AMD GPUs these synchronization primitives are
implemented in hardware via Global Wave Sync (GWS), and they have a
distinct wave/scheduling model that has historically only been covered by
out-of-tree tests. This PR brings that coverage into the dejagnu testsuite
so it runs as part of the regular ROCgdb regression suite.

Tests added

File Scenario
gdb.rocm/coop-group-grid-sync.{cpp,exp} Single-device cooperative kernel using cooperative_groups::this_grid().sync() (intra-device GWS), launched via hipLaunchCooperativeKernel.
gdb.rocm/coop-group-multi-grid-sync.{cpp,exp} Multi-device cooperative kernel using both this_grid().sync() and cooperative_groups::this_multi_grid().sync() (intra + cross-device GWS), launched via hipLaunchCooperativeKernelMultiDevice.

What gets verified

coop-group-grid-sync.exp — two sub-tests:

  • test_break_around_grid_sync
    • Hit a breakpoint before grid.sync() inside a cooperative dispatch.
    • Confirm multiple AMDGPU Wave threads are stopped (waves participating
      in the GWS barrier).
    • Confirm info dispatches lists the cooperative dispatch.
    • Move the breakpoint to after grid.sync() and continue: it must
      fire (proves GWS-protected code can be debugged across the barrier).
    • Continue to clean program exit.
  • test_threads_in_coop_kernel
    • For every AMDGPU Wave parked inside the kernel, switch to it and
      confirm bt 1 reports a frame inside coop_grid_sync_kernel.

coop-group-multi-grid-sync.exp — runs in non-stop mode:

  • After continue -a &, confirm a kernel-side breakpoint fires inside
    coop_multi_grid_sync_kernel. Per-GPU child breakpoint instances
    (Breakpoint X.Y) are observed for every participating GPU.
  • Continue all threads to program exit, which only succeeds if both
    this_grid().sync() and this_multi_grid().sync() release correctly
    under the debugger.

The host-side post-conditions in the .cpp programs additionally validate
the cooperative semantics numerically (cross-workgroup data dependency for
the single-device case, cross-device sum aggregation for the multi-device
case), so any regression in GWS behavior under the debugger turns into a
test failure rather than a silent miscompare.

Skip / unsupported handling

The tests degrade cleanly on systems that cannot run them:

  • Single-device test: queries cooperativeLaunch; if unsupported the
    program prints a recognizable message and exits, and the .exp marks
    the test UNSUPPORTED.
  • Multi-device test: requires >= 2 GPUs and
    cooperativeMultiDeviceLaunch on every device. It is also gated by
    the existing hip_devices_support_debug_multi_process requirement.
    Any of those missing → UNSUPPORTED.

No new dejagnu helpers are required; both .exp files use existing
infrastructure in lib/rocm.exp.

Out of scope / follow-ups

Intentionally left out of this PR; happy to extend if reviewers ask:

  • Stepping (next / step / stepi) across grid.sync() /
    mgrid.sync() boundaries.
  • Conditional breakpoints inside cooperative kernels.
  • lane apply / per-lane register inspection while waves are parked at
    the GWS barrier.
  • Watchpoints on cooperative shared buffers.

@spatrang spatrang requested review from Copilot and lumachad May 7, 2026 13:08
@spatrang spatrang marked this pull request as ready for review May 7, 2026 13:14
@spatrang spatrang requested a review from a team as a code owner May 7, 2026 13:14
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds new ROCm dejagnu coverage to exercise ROCgdb debugging of cooperative-group HIP kernels that synchronize via GWS, covering both single-device this_grid().sync() and multi-device this_multi_grid().sync() scenarios.

Changes:

  • Introduces a single-device cooperative-kernel test that breaks before/after grid.sync() and validates waves/dispatch visibility.
  • Introduces a multi-device cooperative-kernel non-stop test that breaks inside a multi-grid kernel and runs through grid + multi-grid barriers to completion.
  • Adds two HIP C++ test programs that implement the cooperative-group synchronization patterns and validate results on the host side.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp DejaGnu test for single-device cooperative kernel debugging around this_grid().sync().
gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp HIP program implementing single-device cooperative grid.sync() and host-side validation.
gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp DejaGnu non-stop test for multi-device cooperative kernel debugging through this_grid().sync() + this_multi_grid().sync().
gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp HIP program implementing multi-device cooperative launch with cross-device aggregation and validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from af78fac to 33f1926 Compare May 7, 2026 13:35
Copy link
Copy Markdown
Collaborator

@lancesix lancesix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
Thanks a lot for this, this is a great starting point.

My main concern for now is gfx110x. We do not support debugging cooperative group on those (documented limitation), the testcase should look for them to not FAIL. This is known that the test will not pass even if the arch do support GWS.

I have added a couple of small comments, I'll get back to a more detailed review after the gfx11 concern has been addressed.

Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp
@spatrang
Copy link
Copy Markdown
Author

Hi, Thanks a lot for this, this is a great starting point.

My main concern for now is gfx110x. We do not support debugging cooperative group on those (documented limitation), the testcase should look for them to not FAIL. This is known that the test will not pass even if the arch do support GWS.

I have added a couple of small comments, I'll get back to a more detailed review after the gfx11 concern has been addressed.

Addressed. Added a supports_cooperative_groups helper in lib/rocm.exp that excludes gfx1100/1101/1102/1103, and both .exp files now require it, so on gfx110x the run reports UNSUPPORTED: …: require failed: supports_cooperative_groups instead of FAIL. Mirrors the existing hip_devices_support_debug_multi_process pattern in the same lib.

@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from 33f1926 to 73b1b20 Compare May 11, 2026 06:54
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
CHECK (hipGetDeviceProperties (&props, device_id));
if (!props.cooperativeLaunch)
{
printf ("Device does not support cooperative launch, skipping.\n");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output the device id so when this happens it is clear what device we're using.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted the existing skip message to include the device id range that was actually checked, so it’s clear which devices were evaluated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it say 0..n though? If the last ones fails in a 8 GPU setup, we will report 0..7, which doesn't seem very helpful.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point — the 0..N framing did read like a single-device range when it really meant "we checked all of them and none qualified." Reworded to drop the range and just report how many devices were checked:

None of the %d HIP device(s) support cooperative launch, skipping.
So in an 8 GPU setup it now prints "None of the 8 HIP device(s) support cooperative launch, skipping.", which is unambiguous about both the count and the fact that nothing matched.

Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp Outdated
Comment on lines +67 to +72
# First, advance to the eligibility marker so we can read the
# final value of "n_gpus" (the count of GPUs the inferior is
# actually going to use for the cooperative dispatch). If the
# program self-skips before reaching the marker (insufficient
# GPUs or no cooperative-multi-device support on some device),
# mark the test UNSUPPORTED.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the requires above catch the case where GWS support is not there?

Also, since this uses multiple GPU's, what are the chances of some system being busy with other workloads and capacity just not being there to run these tests and spawn the waves we want. Have we considered this case? Do we need some detection to make sure the test doesn't fail randomly?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes — with require supports_cooperative_groups (architecture gate) and require hip_devices_support_debug_multi_process already in place above, plus the recent .cpp change that picks 2 GWS-capable devices via cooperativeMultiDeviceLaunch, the runtime "or skip" branch was redundant. Dropped it; the marker advance now uses the standard gdb_continue_to_breakpoint and bails on failure. The awkward [^\r\n]\r\n[^\r\n]\r\n regex goes away with it.

  2. On the busy-system / capacity question: any HIP API failure during launch or sync is checked by the CHECK (...) macro in the inferior, which terminates with a non-zero exit. That surfaces as a loud test FAIL (mismatched exit), not a silent skip or random pass — so we won't get random false skips just because the system is loaded. We could add an explicit launch-time timeout, but that wasn't the historical pattern in gdb.rocm tests. Updated the comment in the .exp to make this rationale explicit.


set eligible 0
gdb_test_multiple "continue" "advance to eligibility marker or skip" {
-re "hit Breakpoint $::decimal,\[^\r\n\]*main \\(\\)\[^\r\n\]*\r\n\[^\r\n\]*\r\n$::gdb_prompt " {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another one of those patterns I'm not sure we need: [^\r\n]\r\n[^\r\n]\r\n

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

@lumachad lumachad May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really helpful if you could give a hint of what was changed. I can't really tell what this was replaced with.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole gdb_test_multiple "continue" block (with the awkward [^\r\n]\r\n[^\r\n]\r\n$gdb_prompt regex and the eligible flag) was replaced with the standard gdb_continue_to_breakpoint helper from lib/gdb.exp. The block now reads:

gdb_breakpoint \
    [gdb_get_line_number "n-gpus-final line"] allow-pending
if {[gdb_continue_to_breakpoint "advance to n-gpus-final"] != 0} {
    return
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this change pushed? I don't see it.

Comment thread gdb/testsuite/lib/rocm.exp Outdated
Comment thread gdb/testsuite/lib/rocm.exp Outdated
Comment on lines +427 to +444
proc supports_cooperative_groups {} {
set unsupported_targets {
gfx1100 gfx1101 gfx1102 gfx1103
}

set targets [find_amdgpu_devices]
if { [llength $targets] == 0 } {
return 0
}

foreach target $targets {
if { [lsearch -exact $unsupported_targets $target] != -1 } {
return 0
}
}
return 1
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we check that each GPU supports GWS? It seems this code assumes we do support it unless we're using one of the unsupported targets.

Should this check at least validate in some way that gdb does the right thing instead of letting potentially unsupported cases going through with testing and possibly causing issues?

Also, if we can check per-GPU, I think that would be nice given we have the single GPU test as well, and we need to select a particular device for that. If we have a list of GPU's and entries saying if each of those support GWS, then we can pick a good one.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per-GPU predicate: refactored — extracted target_supports_cooperative_groups ; supports_cooperative_groups now iterates and delegates per-target. Same caller-visible behavior, but the per-target seam is now reusable for future filtering.

Validating gdb actually works on a given GPU: I checked amd-dbgapi.h. There's no per-agent capability query for "supports cooperative-group debug" — the API only enforces the restriction at attach time after a coop queue is created (AMD_DBGAPI_STATUS_ERROR_RESTRICTION). So we can't probe it cheaply from a requires gate today. The static denylist matches exactly what the header documents as unsupported (Restrictions §9: gfx1100/1101/1102) and mirrors the hip_devices_support_debug_multi_process pattern just above. Long-term fix is a per-agent capability bit on amd-dbgapi; once that lands, target_supports_cooperative_groups is the single place to swap in a runtime query.

Picking a supported device in the single-GPU test: the natural follow-up — needs us to plumb the supported-target allowlist into the inferior so the .cpp can pick a debugger-supported device, not just a cooperativeLaunch-supporting one. Worth a separate change, ideally after (2) lands so we plumb agent IDs instead of gfx names.

Side note: our denylist also includes gfx1103, which isn't in the public header. Drop it or keep it?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a bit too much work for now. OK. I'm happy not doing the heavy-handed check and instead go with the assumption about supported features in specific gfx's. Then the code you have to handle lack of working GWS stays in place and my comments there can be ignore.

@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch 2 times, most recently from ae4f615 to 5aa9e19 Compare May 20, 2026 06:12
Comment thread gdb/testsuite/gdb.rocm/deep-stack.exp Outdated
@@ -26,7 +26,7 @@

load_lib rocm.exp

require allow_hipcc_tests
require allow_hip_tests
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spurious change?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry — these 4 allow_hipcc_tests → allow_hip_tests edits are unrelated to this PR and the rename target doesn't exist (only allow_hipcc_tests is defined in lib/rocm.exp). They leaked in from a different local WIP branch during a rebase. Reverted in all 4 files; the next force-push will drop them from the PR.

@@ -20,7 +20,7 @@

load_lib rocm.exp

require allow_hipcc_tests
require allow_hip_tests
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spurious change?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry — these 4 allow_hipcc_tests → allow_hip_tests edits are unrelated to this PR and the rename target doesn't exist (only allow_hipcc_tests is defined in lib/rocm.exp). They leaked in from a different local WIP branch during a rebase. Reverted in all 4 files; the next force-push will drop them from the PR.

@@ -17,7 +17,7 @@

load_lib rocm.exp

require allow_hipcc_tests
require allow_hip_tests
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spurious change?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry — these 4 allow_hipcc_tests → allow_hip_tests edits are unrelated to this PR and the rename target doesn't exist (only allow_hipcc_tests is defined in lib/rocm.exp). They leaked in from a different local WIP branch during a rebase. Reverted in all 4 files; the next force-push will drop them from the PR.

Comment thread gdb/testsuite/gdb.rocm/deep-stack.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
array unset gpu_seen
set distinct_gpus 0
gdb_test_multiple "" "breakpoint hit on every participating GPU" {
-re "Thread $::decimal\[^\r\n\]* hit Breakpoint $::decimal\\.($::decimal),\[^\r\n\]*\r\n" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is risk here that the implicit ".*" at the beginning of the regexp may match an already printed "Thread ... hit Breakpoint" line. We may have to do line-by-line matching to avoid this possibility.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anchored the regex with a leading \r\n so each match must consume a fresh newline first. That forces line-by-line consumption and removes the risk of the implicit "scan whole buffer" behavior re-matching an already-seen Thread N … hit Breakpoint X.Y … line. In non-stop mode every such notification is preceded by a newline, so the anchor is always satisfied for new hits.

Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
# cooperative dispatch's multi-workgroup property is exercised.
set blocks {}
gdb_test_multiple "info threads" "blocks present pre-sync" -lbl {
-re "AMDGPU Wave \[^(\r\n\]*\\((\[^()\r\n\]+)\\)" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that there is an implicit .* at the beginning of the regexp and the regexp is not matching until the end of the line. To make sure we match such lines one by one, we may have to use ^ at the regexp start and go until the we see \r\n.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anchored to a single line. Two changes to the regex:

Added ^[^\r\n]* at the start so the match starts at the beginning of the current line (in -lbl mode the framework already prepends ^, but the explicit [^\r\n]* makes it clear no \r\n can be skipped before AMDGPU Wave).
Added [^\r\n]*\r\n at the end so the entire matching line is consumed, leaving the next info threads row at the buffer's start for the next iteration.
Net effect: one row

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked again — the original regex is actually correct here because the -lbl flag on gdb_test_multiple already gives us line-by-line consumption: it prepends ^ to each pattern and auto-consumes any line that no pattern matches. So adding ^[^\r\n]* / [^\r\n]*\r\n to the pattern is redundant and, when I tried it, broke the match entirely (the test FAILed because no blocks were collected). Reverted to the original pattern; the line-by-line guarantee is provided by -lbl.

Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from 5aa9e19 to d541e7d Compare May 21, 2026 10:50
@spatrang spatrang requested review from aktemur, lancesix and lumachad May 21, 2026 10:59
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from d541e7d to 28f6ae0 Compare May 21, 2026 12:45
@@ -0,0 +1,168 @@
/* Copyright 2026 Free Software Foundation, Inc.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a description to the commit message? There is currently only the title. It would also help if a couple tags are added to the commit title, like so: "gdb, testsuite: add gdb.rocm tests for cooperative group debugging"


set eligible 0
gdb_test_multiple "continue" "advance to eligibility marker or skip" {
-re "hit Breakpoint $::decimal,\[^\r\n\]*main \\(\\)\[^\r\n\]*\r\n\[^\r\n\]*\r\n$::gdb_prompt " {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this change pushed? I don't see it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants