Skip to content

gdb: add "catch hiperr" to break on HIP API errors#129

Open
amd-shahab wants to merge 1 commit into
amd-stagingfrom
users/shvahedi/hip-catch-errs
Open

gdb: add "catch hiperr" to break on HIP API errors#129
amd-shahab wants to merge 1 commit into
amd-stagingfrom
users/shvahedi/hip-catch-errs

Conversation

@amd-shahab
Copy link
Copy Markdown
Contributor

@amd-shahab amd-shahab commented May 15, 2026

This change relies on "Add __hipOnError debugger hook for HIP API failures" PR to be functional. Although, without it, ROCgdb won't be broken. Please read the commit message for the description of this change itself.

--

What

This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned. An abridged sample of execution looks
like this:

(gdb) catch hiperr
(gdb) run
Thread 1 "hip" hit Catchpoint 1 (hip error)
hipErrorInvalidDevice (101): invalid device ordinal

Moreover, a $_hiperr convenience variable is added that holds
the error code. For further details, please see the updated
documentation in this change.

Design

Most of the implementation is in the newly added module called
break-catch-hiperr.c. The naming and its context follows the
break-catch-throw.c module.

The whole mechanism work by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library. Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.

There's a architect dependent section that is responsible for
extracting the error parameters (code, name, description) from
a single argument passed to the __hipOnError() function as a
structure pointer. The expected format of this struct is:

uint32_t version; // at this point, only 1 is supported
uint32_t code; // the numerical error code
const char *name; // name of the error
const char *desc; // a phrase about the error

The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered. It will
affect the reported data of the catchpoint though.

Test

A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:

  1. (Pending) catchpoints work.
  2. Multiple catchpoints in one execution are reported accordingly.
  3. Errors of indirect HIP calls like "kenel<<<...>>>()" are caught.
  4. The convenience variable can be used for conditional situations.
  5. Conditional and temporary catchpoints work as expected.

ROCM-1548

@amd-shahab amd-shahab self-assigned this May 15, 2026
@amd-shahab amd-shahab requested a review from a team as a code owner May 15, 2026 18:55
@lumachad
Copy link
Copy Markdown
Collaborator

@amd-shahab If this shouldn't be tested yet and isn't mergeable until the dependent PR, feel free to label it "ci:skip".

Copy link
Copy Markdown
Collaborator

@lancesix lancesix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First quick first pass of comments.

One question regarding the test: what kind of guarantee do we have that the error code and error string are going to be stable over time? Let's imagine that the hip runtime decides to localize the error string (not the name). Couldn't that be a legitimate thing to do, but it would break the test, would it?

Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
@amd-shahab amd-shahab added the ci:skip Skip all pre-commit / CI jobs while the label is up label May 18, 2026
Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
Comment thread gdb/amd64-tdep.c
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 823bf4f to 745f0bc Compare May 18, 2026 10:43
@amd-shahab
Copy link
Copy Markdown
Contributor Author

amd-shahab commented May 18, 2026

  • error string not part of API

  • __hipOnError -> @samp{__hipOnError}

  • "int main" -> "int\nmain"

  • gdb_test -> gdb_test_on_output

  • target_read_stack -> target_read_memory

  • drop the "ignored" param name

  • Else -> Otherwise

  • NULL -> nullptr

  • removed dangling "backtrace" command from the documentation

  • !arg -> arg != nullptr

  • binay -> binary

  • relvant -> relevant

  • bool print_one (const bp_location **) -> bool print_one (const bp_location **last_loc)

  • rename variable "x" to "dontcare"

  • comment on true/false return of amd64_fetch_hiperr_parameters()

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 745f0bc to 1da282b Compare May 18, 2026 10:52
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • emit a warning when the "version" is not expected.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 1da282b to d616325 Compare May 18, 2026 12:37
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • make the invocation of the test fully argument dependent
  • add comments for every method in break-catch-hiperr.c

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from d616325 to 4b40f98 Compare May 18, 2026 14:19
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • check for the existence of __hipOnError symbol. If not found, skip the test.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 4b40f98 to 5005bc1 Compare May 18, 2026 19:21
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • check for the right frame (__hipOnError)

Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 5 times, most recently from 81df767 to 5a4629e Compare May 19, 2026 14:34
@amd-shahab amd-shahab removed the ci:skip Skip all pre-commit / CI jobs while the label is up label May 19, 2026
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 3 times, most recently from b2e7ebe to 3ff9b7d Compare May 20, 2026 14:14
@amd-shahab
Copy link
Copy Markdown
Contributor Author

amd-shahab commented May 20, 2026

  • Use separate hooks for Linux and Windows implementation of amd64 targets
  • The test comes up with its own reference value instead of having them hardcoded.
  • Better string output. Also, tell user about $_hiperr and backtrace.

Comment thread gdb/gdbarch_components.py Outdated
Comment thread gdb/gdbarch_components.py
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/amd64-tdep.h Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/doc/gdb.texinfo Outdated
@aktemur
Copy link
Copy Markdown
Contributor

aktemur commented May 26, 2026

If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check?

@lancesix
Copy link
Copy Markdown
Collaborator

If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check?

I have not tested, but am highly confident that this will trigger the catch point.

@aktemur
Copy link
Copy Markdown
Contributor

aktemur commented May 26, 2026

If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check?

I have not tested, but am highly confident that this will trigger the catch point.

I also think the catchpoint will be triggered but just in case I wanted make sure GDB can handle the infcall interruption correctly. It would be great to test with and without set unwind-on-signal on. (I cannot test because I don't have the hipOnError symbol.

@amd-shahab
Copy link
Copy Markdown
Contributor Author

amd-shahab commented May 27, 2026

If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check?

As Lancelot predicted:

(gdb) catch hiperr
Catchpoint 2 (HIP error)
(gdb) p hipSetDevice(98765)
[New Thread 0x7ffff4fff6c0 (LWP 8965) (id 2)]
[New Thread 0x7ffff47fe6c0 (LWP 8966) (id 3)]
[New Thread 0x7ffff3cff6c0 (LWP 8967) (id 4)]
[Thread 0x7ffff3cff6c0 (LWP 8967) (id 4) exited]

Thread 1 "hip-catch-error" hit Catchpoint 2 (HIP error)
HIP API call failed with error hipErrorInvalidDevice (101): invalid device ordinal

The $_hiperr holds the error number.
A "backtrace" will show the call stack.
The program being debugged stopped while in a function called from GDB.
Evaluation of the expression containing the function
(hipSetDevice(int)) will be abandoned.
When the function is done executing, GDB will silently stop.


(gdb) bt
#0  __hipOnError (err_info=0x7fffffffd7a0) at /src/clr/hipamd/src/hip_error.cpp:11
#1  0x00007ffff62863e9 in hip::hipSetDevice (device=98765) at /src/clr/hipamd/src/hip_device_runtime.cpp:784
#2  0x00007ffff668440f in hipSetDevice (deviceId=98765) at /src/clr/hipamd/src/hip_table_interface.cpp:2133
#3  <function called from gdb>
#4  main (argc=1, argv=0x7fffffffe108) at /src/rocgdb/gdb/testsuite/gdb.rocm/hip-catch-errors.cpp:26

With set unwind-on-signal on (default is off), we get the same output. "Signal"s have nothing to do with these errors.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 2 times, most recently from 5c918d6 to 0ac52b6 Compare May 27, 2026 14:42
@amd-shahab
Copy link
Copy Markdown
Contributor Author

amd-shahab commented May 27, 2026

  • In catchpoint output, refer to "$_hiperr" as a "convenience variable"
  • In catchpoint output, change the "backtrace" section to the suggested sentence
  • gdb.texinfo updated to reflect the latest output.
  • Use "$::hip_" instead of "global hip_"
  • Use ATTRIBUTE_UNUSED
  • rename the fetch_hiperr parameter names to err_{no,name,str}
  • a few typos are fixed

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 0ac52b6 to 61d1b84 Compare May 28, 2026 12:48
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • test: Add an implicit and faulty kernel<<<1, big_size>>> call to HIP API as test case
  • test: Make the routine for getting reference values and causing the errors the same

What
----
This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned.  An abridged sample of execution looks
like this:

(gdb) catch hiperr
(gdb) run
Thread 1 "hip" hit Catchpoint 1 (hip error)
hipErrorInvalidDevice (101): invalid device ordinal

Moreover, a $_hiperr convenience variable is added that holds
the error code.  For further details, please see the updated
documentation in this change.

Design
------
Most of the implementation is in the newly added module called
break-catch-hiperr.c.  The naming and its context follows the
break-catch-throw.c module.

The whole mechanism work by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library.  Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.

There's a architect dependent section that is responsible for
extracting the error parameters (code, name, description) from
a single argument passed to the __hipOnError() function as a
structure pointer.  The expected format of this struct is:

  uint32_t    version;   // at this point, only 1 is supported
  uint32_t    code;      // the numerical error code
  const char *name;      // name of the error
  const char *desc;      // a phrase about the error

The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered.  It will
affect the reported data of the catchpoint though.

Test
----
A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:

1. (Pending) catchpoints work.
2. Multiple catchpoints in one execution are reported accordingly.
3. Errors of indirect HIP calls like "kenel<<<...>>>()" are caught.
4. The convenience variable can be used for conditional situations.
5. Conditional and temporary catchpoints work as expected.

Change-Id: Iff0808350cf7856da17ff07be8ed10c76f9370f8
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 61d1b84 to 11fde27 Compare May 28, 2026 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants