tg3: add napi_enabled flag to track napi_enable/napi_disable calls #570

Open
yurypm wants to merge 1 commit into sonic-net:master from yurypm:master

@yurypm yurypm commented May 7, 2026

We need this patch to fix a soft lockup in the Linux kernel on Arista modular chassis in the 202511 branch.
During linecard resets, uncorrectable errors can be reported. As a result, the AER kernel driver can initiate AER recovery for the tg3 device. The tg3_io_error_detected function is the AER error recovery handler.

From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
napi_disable and return PCI_ERS_RESULT_NEED_RESET on a non-fatal error.
We expect tg3_io_slot_reset and tg3_io_resume to be called during AER
recovery, but AER error recovery can fail, for example when one of the PCIe
devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER. In that case
tg3_io_slot_reset and tg3_io_resume are not called, and the PCIe device and
NAPI both stay disabled (pci_disable_device and napi_disable are called
from tg3_io_error_detected). If we then disable the PCIe link, napi_disable
is called again:

  napi_disable+0x1b/0x1b0
  tg3_napi_disable+0x89/0xa0 [tg3]
  tg3_netif_stop+0x37/0xe3 [tg3]
  tg3_stop+0x30/0x160 [tg3]
  tg3_close+0x2a/0x60 [tg3]
  __dev_close_many+0xad/0x130
  dev_close_many+0xb2/0x190
  unregister_netdevice_many_notify+0x19d/0xa00
  ? try_to_wake_up+0x302/0x680
  unregister_netdevice_queue+0xf8/0x140
  unregister_netdev+0x1c/0x30
  tg3_remove_one+0xaa/0x150 [tg3]
  pci_device_remove+0x42/0xb0
  device_release_driver_internal+0x19c/0x200
  pci_stop_bus_device+0x85/0xb0
  pci_stop_bus_device+0x2c/0xb0
  pci_stop_bus_device+0x2c/0xb0
  pci_stop_and_remove_bus_device+0x12/0x20
  pciehp_unconfigure_device+0x9f/0x160
  pciehp_disable_slot+0x67/0x100
  pciehp_handle_presence_or_link_change+0x77/0x350

napi_disable does not expect this, and the calling thread can block in
napi_disable forever. We already have the pcierr_recovery flag to cover a
similar issue, but only for fatal errors. We cannot reuse that flag because
it is reset in tg3_io_resume, which is not called when AER recovery fails.
This patch adds a new napi_enabled flag to the tg3 struct and skips
napi_disable if napi_enable was not called before.
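
The guard described above can be sketched as a small user-space model. This is not the patch itself: the `model_*` names, the `would_hang` counter, and the single boolean standing in for the NAPI state are all hypothetical, and the model assumes that a second napi_disable on an already-disabled instance never returns.

```c
#include <stdbool.h>

/* Toy stand-in for the kernel NAPI instance (assumption: one bool is enough). */
struct model_napi {
    bool disabled;            /* state left behind by a disable call */
};

/* Toy stand-in for struct tg3, carrying the flag the patch describes. */
struct model_tg3 {
    struct model_napi napi;
    bool napi_enabled;        /* true between enable and disable */
    int would_hang;           /* disables that would have blocked forever */
};

static void model_tg3_init(struct model_tg3 *tp)
{
    tp->napi.disabled = true; /* a NAPI instance starts out disabled */
    tp->napi_enabled = false;
    tp->would_hang = 0;
}

static void model_napi_enable(struct model_napi *n)
{
    n->disabled = false;
}

/* Returns false where the real call is assumed to wait forever. */
static bool model_napi_disable(struct model_napi *n)
{
    if (n->disabled)
        return false;
    n->disabled = true;
    return true;
}

static void model_tg3_napi_enable(struct model_tg3 *tp)
{
    model_napi_enable(&tp->napi);
    tp->napi_enabled = true;
}

/* Guarded disable: return early if NAPI was not enabled before. */
static void model_tg3_napi_disable(struct model_tg3 *tp)
{
    if (!tp->napi_enabled)
        return;
    tp->napi_enabled = false;
    if (!model_napi_disable(&tp->napi))
        tp->would_hang++;
}
```

In this model, calling model_tg3_napi_disable twice after one model_tg3_napi_enable (once for the error handler, once for the removal path) leaves would_hang at zero, whereas calling model_napi_disable directly a second time reports the blocking case.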

@yurypm yurypm requested a review from a team as a code owner May 7, 2026 15:46

linux-foundation-easycla Bot commented May 7, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: yurypm / name: Yury Murashka (9564cef)

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch Outdated
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch Outdated
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch

@yurypm yurypm requested a review from paulmenzel May 12, 2026 23:17

@paulmenzel paulmenzel left a comment

Sorry, I forgot to publish the comments I made some days ago.

Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch Outdated

We need this patch to fix a soft lockup in the Linux kernel on Arista
modular chassis in the 202511 branch.

During PCIe hot-plug events, uncorrectable errors can be reported and
AER recovery for the tg3 device is initiated by the AER kernel driver.
The tg3_io_error_detected function is the AER error recovery handler.

From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
napi_disable and return PCI_ERS_RESULT_NEED_RESET on a non-fatal error.
We expect tg3_io_slot_reset and tg3_io_resume to be called during AER
recovery, but AER error recovery can fail, for example when one of the
PCIe devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER. In
that case tg3_io_slot_reset and tg3_io_resume are not called, and the
PCIe device and NAPI both stay disabled (pci_disable_device and
napi_disable are called from tg3_io_error_detected). If we then disable
the PCIe link, napi_disable is called again:

  napi_disable+0x1b/0x1b0
  tg3_napi_disable+0x89/0xa0 [tg3]
  tg3_netif_stop+0x37/0xe3 [tg3]
  tg3_stop+0x30/0x160 [tg3]
  tg3_close+0x2a/0x60 [tg3]
  __dev_close_many+0xad/0x130
  dev_close_many+0xb2/0x190
  unregister_netdevice_many_notify+0x19d/0xa00
  unregister_netdevice_queue+0xf8/0x140
  unregister_netdev+0x1c/0x30
  tg3_remove_one+0xaa/0x150 [tg3]
  pci_device_remove+0x42/0xb0
  device_release_driver_internal+0x19c/0x200
  pci_stop_bus_device+0x85/0xb0
  pci_stop_bus_device+0x2c/0xb0
  pci_stop_bus_device+0x2c/0xb0
  pci_stop_and_remove_bus_device+0x12/0x20
  pciehp_unconfigure_device+0x9f/0x160
  pciehp_disable_slot+0x67/0x100
  pciehp_handle_presence_or_link_change+0x77/0x350

napi_disable does not expect this, and the calling thread can block in
napi_disable forever. We have pcierr_recovery to cover a similar issue,
but only for fatal errors. We cannot reuse that flag because it is reset
in tg3_io_resume, which is not called when AER recovery fails.

Similarly, if an AER error is reported and tg3_io_error_detected calls
pci_disable_device, a subsequent device removal via tg3_remove_one or
tg3_shutdown will call pci_disable_device again for the already-disabled
device.

Add a napi_enabled flag to struct tg3 to track whether napi_enable has
been called. Guard tg3_napi_disable() so it returns early if NAPI was
not previously enabled. Also guard pci_disable_device() calls in
tg3_remove_one() and tg3_shutdown() with pci_is_enabled() to avoid
disabling an already-disabled device.

Fixes: b45aa2f6192e ("tg3: Add EEH support")
Signed-off-by: Yury Murashka <yurypm@arista.com>
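
The pci_is_enabled() guard mentioned in the commit message can be illustrated with a similar user-space model. It assumes the kernel tracks device enablement with a per-device counter and warns when an already-disabled device is disabled again; the `model_*` names and the `warnings` counter are hypothetical and exist only for this sketch.

```c
#include <stdbool.h>

/* Toy stand-in for a PCI device (assumption: an enable counter is tracked). */
struct model_pci_dev {
    int enable_cnt;   /* number of outstanding enable calls */
    int warnings;     /* "disabling already-disabled device" events */
};

static void model_pci_enable_device(struct model_pci_dev *d)
{
    d->enable_cnt++;
}

static bool model_pci_is_enabled(const struct model_pci_dev *d)
{
    return d->enable_cnt > 0;
}

/* Unguarded disable: records a warning if the device is already disabled. */
static void model_pci_disable_device(struct model_pci_dev *d)
{
    if (d->enable_cnt <= 0) {
        d->warnings++;
        return;
    }
    d->enable_cnt--;
}

/* Removal path with the guard: only disable a device that is still enabled. */
static void model_remove_guarded(struct model_pci_dev *d)
{
    if (model_pci_is_enabled(d))
        model_pci_disable_device(d);
}
```

If the error handler has already called model_pci_disable_device once, model_remove_guarded becomes a no-op and no warning is recorded; an unguarded second disable increments warnings instead.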
