gdb: add "catch hiperr" to break on HIP API errors#129
Conversation
|
@amd-shahab If this shouldn't be tested yet and isn't mergeable until the dependent PR, feel free to label it "ci:skip". |
lancesix
left a comment
There was a problem hiding this comment.
First quick first pass of comments.
One question regarding the test: what kind of guarantee do we have that the error code and error string are going to be stable over time? Let's imagine that the hip runtime decides to localize the error string (not the name). Couldn't that be a legitimate thing to do, but it would break the test, would it?
823bf4f to
745f0bc
Compare
|
745f0bc to
1da282b
Compare
|
1da282b to
d616325
Compare
|
d616325 to
4b40f98
Compare
|
4b40f98 to
5005bc1
Compare
|
81df767 to
5a4629e
Compare
b2e7ebe to
3ff9b7d
Compare
|
3ff9b7d to
257e870
Compare
257e870 to
66a737d
Compare
|
If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check? |
I have not tested, but am highly confident that this will trigger the catch point. |
I also think the catchpoint will be triggered but just in case I wanted make sure GDB can handle the infcall interruption correctly. It would be great to test with and without |
As Lancelot predicted: With |
5c918d6 to
0ac52b6
Compare
|
0ac52b6 to
61d1b84
Compare
|
What ---- This change adds a new feature to GDB, when it is targeted for amdgcn-amd-amdhsa, that enables it to stop whenever HIP API errors are returned. An abridged sample of execution looks like this: (gdb) catch hiperr (gdb) run Thread 1 "hip" hit Catchpoint 1 (hip error) hipErrorInvalidDevice (101): invalid device ordinal Moreover, a $_hiperr convenience variable is added that holds the error code. For further details, please see the updated documentation in this change. Design ------ Most of the implementation is in the newly added module called break-catch-hiperr.c. The naming and its context follows the break-catch-throw.c module. The whole mechanism work by setting a breakpoint (catchpoint) on a __hipOnError() symbol, that must be provided by the CLR library. Enabling the catchpoint when there's no __hipOnError symbol, simply has no effect. There's a architect dependent section that is responsible for extracting the error parameters (code, name, description) from a single argument passed to the __hipOnError() function as a structure pointer. The expected format of this struct is: uint32_t version; // at this point, only 1 is supported uint32_t code; // the numerical error code const char *name; // name of the error const char *desc; // a phrase about the error The code is devised in such a way that if the __hipOnError() exists, but for whatever reason the parameters cannot be deduced, it won't interfere with the catchpoint being triggered. It will affect the reported data of the catchpoint though. Test ---- A new test, gdb.rocm/hip-catch-errors, is added to the mix to ensure that: 1. (Pending) catchpoints work. 2. Multiple catchpoints in one execution are reported accordingly. 3. Errors of indirect HIP calls like "kenel<<<...>>>()" are caught. 4. The convenience variable can be used for conditional situations. 5. Conditional and temporary catchpoints work as expected. Change-Id: Iff0808350cf7856da17ff07be8ed10c76f9370f8
61d1b84 to
11fde27
Compare
This change relies on "Add __hipOnError debugger hook for HIP API failures" PR to be functional. Although, without it, ROCgdb won't be broken. Please read the commit message for the description of this change itself.
--
What
This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned. An abridged sample of execution looks
like this:
(gdb) catch hiperr
(gdb) run
Thread 1 "hip" hit Catchpoint 1 (hip error)
hipErrorInvalidDevice (101): invalid device ordinal
Moreover, a $_hiperr convenience variable is added that holds
the error code. For further details, please see the updated
documentation in this change.
Design
Most of the implementation is in the newly added module called
break-catch-hiperr.c. The naming and its context follows the
break-catch-throw.c module.
The whole mechanism work by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library. Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.
There's a architect dependent section that is responsible for
extracting the error parameters (code, name, description) from
a single argument passed to the __hipOnError() function as a
structure pointer. The expected format of this struct is:
uint32_t version; // at this point, only 1 is supported
uint32_t code; // the numerical error code
const char *name; // name of the error
const char *desc; // a phrase about the error
The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered. It will
affect the reported data of the catchpoint though.
Test
A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:
ROCM-1548