Kernel OOPS in nv_audio_dynamic_power due to invalid memory state/access. #1141

@SpacingBat3

Description

NVIDIA Open GPU Kernel Modules Version

595.71.05-2 (distributor package build; the -2 suffix means the package was rebuilt once from 595.71.05 after the initial release)

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

7.0.5, distribution-built; it should not matter which flavour

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

TU117 (GeForce GTX 1650)

Describe the bug

I have a single-GPU setup where I sometimes "swap" my GPU between a VM and the host (a consequence of there being no GPU virtualisation for at least consumer-grade TU117 cards), which has many quirks of its own. After a recent freeze affecting TTY sessions (even remote ones over SSH, although non-interactive command execution over SSH still worked) I decided to check whether there was anything suspicious in the kernel journal, and this entry caught my attention (a cleaned-up log including CPU register states, with some information about my host machine stripped so as not to leak it publicly):

journalctl -b -1 --dmesg
BUG: kernel NULL pointer dereference, address: 000000000000041c
fbcon: Taking over console
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 4 UID: 0 PID: 576538 Comm: kmscon Tainted: G        W  OE       [INFO_REDACTED]
Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: [INFO_REDACTED]
RIP: 0010:nv_audio_dynamic_power+0xcd/0x140 [nvidia]
Code: 01 00 00 c7 44 24 04 00 00 00 00 48 39 d0 75 1a e9 77 ff ff ff 0f 1f 84 00 00 00 00 00 48 8b 40 08 48 39 d0 0f 84 62 ff ff ff <83> 78 1c 03 75 ed 48 8b 78 20 48 83 bf 68 03 00 00 00 0f 84 4a ff
RSP: 0018:ffffd2151992e9a8 EFLAGS: 00010207
RAX: 0000000000000400 RBX: ffff8cea7da4c020 RCX: 0000000000000009
RDX: ffff8ce8db1791a0 RSI: ffffd2151992e930 RDI: ffff8ce8c2440108
RBP: ffffd2151992e9c0 R08: 0000000000000000 R09: 0000000000000000
R10: ffffd2151992eaa0 R11: 0000000020803d03 R12: ffffd2151992ebd0
R13: ffffd2151992eaa0 R14: ffffffffc16d1210 R15: ffffffffc08275e0
FS:  00007f4e9d63c880(0000) GS:ffff8cec4be28000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000041c CR3: 00000002c98b5000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 subdeviceCtrlCmdOsUnixAudioDynamicPower_IMPL+0x27/0x2b [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 resControl_IMPL+0x1b5/0x1c0 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 ? srso_return_thunk+0x5/0x5f
 gpuresControl_IMPL+0x55/0xa0 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 serverControl+0x4bd/0x5c0 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 ? srso_return_thunk+0x5/0x5f
 _rmapiRmControl+0x78a/0xa10 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 rmapiControlWithSecInfo+0x79/0x140 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 ? srso_return_thunk+0x5/0x5f
 rmapiControlWithSecInfoTls+0x76/0xe0 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 _nv04ControlWithSecInfo.constprop.0+0x84/0x90 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 ? _nv04ControlWithSecInfo.constprop.0+0x84/0x90 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 Nv04ControlKernel+0x60/0x70 [nvidia 150ac0faf5cc194d977f0f7b07309e91a63173e5]
 nvkms_call_rm+0x4c/0x80 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 nvRmApiControl+0x6e/0x80 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 ? srso_return_thunk+0x5/0x5f
 RmSetELDAudioCaps+0xca/0x190 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 nvHdmiDpEnableDisableAudio+0xcd/0x3b0 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 KickoffProposedModeSetHwState+0xe44/0xf70 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 nvSetDispModeEvo+0x2f09/0x42a0 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 nvKmsIoctl+0x123/0x300 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 ? down+0x1e/0x70
 nvkms_ioctl_from_kapi_try_pmlock+0x60/0xa0 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 ApplyModeSetConfig+0x151/0xc70 [nvidia_modeset 6f9b7ae829e3025142ba1c8e45c5d4ab49daae0b]
 ? srso_return_thunk+0x5/0x5f
 nv_drm_atomic_apply_modeset_config+0x5aa/0x810 [nvidia_drm 113f4d30e12df7db5dc061bbb1e758e96ed27000]
 nv_drm_atomic_commit+0x201/0x4e0 [nvidia_drm 113f4d30e12df7db5dc061bbb1e758e96ed27000]
 drm_atomic_commit+0xb1/0xe0
 ? __pfx___drm_printfn_info+0x10/0x10
 drm_client_modeset_commit_atomic.constprop.0+0x22e/0x2a0
 drm_client_modeset_commit_locked+0x5c/0x190
 ? srso_return_thunk+0x5/0x5f
 drm_client_modeset_commit+0x25/0x40
 drm_fb_helper_restore_fbdev_mode_unlocked+0xaa/0x110
 ? srso_return_thunk+0x5/0x5f
 drm_fbdev_client_restore+0x12/0x20
 drm_client_dev_restore+0x96/0x140
 drm_release+0x12c/0x150
 __fput+0xf5/0x2f0
 __x64_sys_close+0x90/0x140
 do_syscall_64+0x119/0x1630
 ? __x64_sys_close+0xe1/0x140
 ? __fput+0x18f/0x2f0
 ? srso_return_thunk+0x5/0x5f
 ? __x64_sys_close+0xe1/0x140
 ? srso_return_thunk+0x5/0x5f
 ? do_syscall_64+0x119/0x1630
 ? srso_return_thunk+0x5/0x5f
 ? kmem_cache_free+0x34c/0x420
 ? __x64_sys_close+0xe1/0x140
 ? srso_return_thunk+0x5/0x5f
 ? __check_object_size+0x85/0x220
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? drm_copy_field+0x7e/0xf0
 ? srso_return_thunk+0x5/0x5f
 ? drm_copy_field+0x32/0xf0
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? drm_ioctl_kernel+0xae/0x100
 ? srso_return_thunk+0x5/0x5f
 ? __check_object_size+0x43/0x220
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? drm_ioctl+0x30f/0x530
 ? __pfx_drm_version+0x10/0x10
 ? srso_return_thunk+0x5/0x5f
 ? __x64_sys_ioctl+0xd5/0xf0
 ? srso_return_thunk+0x5/0x5f
 ? do_syscall_64+0x119/0x1630
 ? srso_return_thunk+0x5/0x5f
 ? __x64_sys_ioctl+0xd5/0xf0
 ? srso_return_thunk+0x5/0x5f
 ? do_syscall_64+0x119/0x1630
 ? srso_return_thunk+0x5/0x5f
 ? kmem_cache_free+0x34c/0x420
 ? __x64_sys_close+0xe1/0x140
 ? __fput+0x18f/0x2f0
 ? srso_return_thunk+0x5/0x5f
 ? __x64_sys_close+0xe1/0x140
 ? srso_return_thunk+0x5/0x5f
 ? do_syscall_64+0x119/0x1630
 ? srso_return_thunk+0x5/0x5f
 ? do_syscall_64+0x119/0x1630
 ? srso_return_thunk+0x5/0x5f
 ? do_syscall_64+0x119/0x1630
 ? srso_return_thunk+0x5/0x5f
 ? exc_page_fault+0x90/0x1e0
 ? irq_exit_rcu+0x55/0x100
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f4e9d4a0a52
Code: 08 0f 85 c1 3f ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 55 bf 01 00
RSP: 002b:00007fff61551348 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 00005583852041e0 RCX: 00007f4e9d4a0a52
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000010
RBP: 00007fff61551370 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000558385195330 R14: 0000000000000000 R15: 0000000000000010
 </TASK>
Modules linked in: [REDACTED_INFO]
CR2: 000000000000041c
---[ end trace 0000000000000000 ]---

The kernel clearly points out that the bug happened in the out-of-tree nvidia module (the non-proprietary flavour), which led me to write this report.

I'm also not sure how much code is shared between this and the proprietary kernel driver, and I don't really want to test the proprietary driver given that my distribution decided to make the open-source flavour the default set of NVIDIA-provided drivers. I am also sure that the issue is in kernel space at least, and that this is not a report about any of the userspace tools that are also part of the proprietary toolset.

I'm mostly reporting this in the hope that it will improve the module's robustness in future releases, given that the kernel points to an invalid memory access as the symptom.

To Reproduce

It might be a bit troublesome to reproduce reliably: my machine does not always freeze this badly, and reproducing it might also require the rather complex libvirt setup for a Windows 11 VM that I currently run with QEMU hooks. I could export the domain XML and the hooks that run 1) before machine start and 2) after complete machine shutdown, so you can see the overall workflow of what is really going on with the GPU.

The issue may well come down to the unclear state the GPU is left in when switching back and forth, unless NVIDIA claims to support single-GPU systems where delegating the GPU between host and guest is fine. I would still like to pin it down if there is any interest in making this part of the code more robust, ideally avoiding the invalid memory access entirely at this stage of execution.

Bug Incidence

Once

nvidia-bug-report.log.gz

(Could I skip it, given that I have identified the source of the bug in the code and provided the relevant kernel logs in the Description, and continue the discussion in More Info? I'm also not a fan of sharing information about many of my system resources without any justification or rationale.)

More Info

The component the kernel log points at is actually associated with the audio controller rather than the GPU itself; the audio controller is likewise exposed to and used by both the VM and the host, given that my monitor has a set of speakers built in.

A bit of disassembly of the module led me to this place in the code (offset 0xa190+0xcd; see also the kernel log pointing at this exact opcode)…

a25d: 83 78 1c 03 cmpl $0x3,0x1c(%rax)

…which (according to addr2line, run against another build of the modules with debug symbols enabled) might correspond to this line of code:

if (pdev->type == SNDRV_DEV_CODEC)

…and that makes sense to me, given the context of the assembly and how SNDRV_DEV_CODEC is defined in the kernel:

enum snd_device_type {
	SNDRV_DEV_LOWLEVEL,
	SNDRV_DEV_INFO,
	SNDRV_DEV_BUS,
	SNDRV_DEV_CODEC,
	/* … remaining members omitted … */
};

FYI, the pdev pointer is defined on the immediately preceding line, and list_entry is a macro that supposedly (I have yet to gain expertise in the Linux kernel API) converts the p pointer into a pointer to the containing structure. The macro used for loop iteration resolves to this:

/**
 * list_for_each_prev	-	iterate over a list backwards
 * @pos:	the &struct list_head to use as a loop cursor.
 * @head:	the head for your list.
 */
#define list_for_each_prev(pos, head) \
	for (pos = (head)->prev; pos != (head); pos = pos->prev)
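
For what it's worth, list_entry() itself is just container_of(): it turns a pointer to the embedded list_head into a pointer to the containing structure by subtracting the member's offset, with no validation of any kind. Since list is the first member of struct snd_device (per the upstream definition), the offset is zero here and the conversion is numerically an identity:

/* include/linux/list.h — list_entry() is a thin wrapper: */
#define list_entry(ptr, type, member) \
	container_of(ptr, type, member)

/* which, ignoring the kernel's type-checking machinery, boils down to: */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

So no validation happens in the conversion itself: whatever value p carries flows straight into pdev, and the first dereference of it is the pdev->type read.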

Could it be that p = (&card->devices)->prev (approximately) is a NULL pointer? I still haven't verified whether that makes sense by analysing the code and putting all the pieces together.

list_for_each_prev(p, &card->devices)
{
	struct snd_device *pdev = list_entry(p, struct snd_device, list);
	if (pdev->type == SNDRV_DEV_CODEC)
	{
		codec = pdev->device_data;
		/* … */

I still have very little knowledge of the NVIDIA kernel modules, nor am I experienced with kernel module development for GPUs, so this is the information I have been able to gather on this issue so far. I might do more digging in the future and provide more details if needed. I would also like this issue to stay open for as long as I might potentially gather more information and re-approach reproducing it. Right now I want to bring it up for discussion and take a collective approach to pinning down the exact cause of the unclear memory state, and to whether it can somehow be validated.
