bnxt_en NIC driver crashes IO_PAGE_FAULT

Bug #1931106 reported by aft2d
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi all,

We received a bunch of new servers with a Supermicro H12SSL-NT mainboard that has an embedded Broadcom BCM57416 NIC.

On all of these servers the NIC driver (bnxt_en) crashes from time to time. We cannot reproduce the issue manually; it just occurs at some point. Our monitoring does not show any irregularities either (such as unusually high traffic).

Syslog: https://pastebin.com/yDAyjHvF
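Since the crash cannot be triggered on demand, one option is to scan the kernel log for IOMMU fault events after the fact. The following is a minimal sketch, assuming the faults show up as bracketed `[IO_PAGE_FAULT key=value ...]` entries; the function name, the sample line, and the PCI address in it are made up for illustration and are not taken from the syslog linked above.

```python
import re

# Assumed shape of an IOMMU fault entry in the kernel log; adjust if yours differs.
FAULT_RE = re.compile(r"\[IO_PAGE_FAULT\s+([^\]]*)\]")


def parse_io_page_faults(log_text):
    """Return one dict of key=value fields per IO_PAGE_FAULT entry found."""
    events = []
    for match in FAULT_RE.finditer(log_text):
        fields = dict(
            pair.split("=", 1) for pair in match.group(1).split() if "=" in pair
        )
        events.append(fields)
    return events


# Made-up sample line, for illustration only (not from the reporter's log).
SAMPLE = (
    "kernel: bnxt_en 0000:00:00.0: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]\n"
)

if __name__ == "__main__":
    for event in parse_io_page_faults(SAMPLE):
        print(event)
```

In practice the input would be the contents of the syslog file or the output of `dmesg` rather than the sample string.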

All servers are running with up-to-date packages:
$ lsb_release -rd
Description: Ubuntu 20.04.2 LTS
Release: 20.04
$ uname -r
5.4.0-73-generic

It also happens on older kernel versions (tested 5.4.0-66) as well as the HWE kernel (tested 5.8.0-55).

Thanks in advance.
~ Roman

Revision history for this message
aft2d (aft2d) wrote :
Kai-Heng Feng (kaihengfeng) wrote :

Please test latest mainline kernel:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.13-rc5/amd64/

Headers are not needed.

Changed in linux (Ubuntu):
status: New → Incomplete
aft2d (aft2d) wrote :

Did that. About 1 hour after I put them back into production, 4 of the 5 servers I had upgraded to v5.13-rc5 crashed, and the last one crashed an hour later.

For comparison, with the old kernel version it occurred about 2 to 3 times a week.

New syslog: https://pastebin.com/GWqtVaA3

Kai-Heng Feng (kaihengfeng) wrote :

Can you please raise the issue with the following mail addresses:
$ scripts/get_maintainer.pl -f drivers/net/ethernet/broadcom/bnxt
Michael Chan <email address hidden> (supporter:BROADCOM BNXT_EN 50 GIGABIT ETHERNET DRIVER)
"David S. Miller" <email address hidden> (maintainer:NETWORKING DRIVERS)
Jakub Kicinski <email address hidden> (maintainer:NETWORKING DRIVERS)
<email address hidden> (open list:BROADCOM BNXT_EN 50 GIGABIT ETHERNET DRIVER)
<email address hidden> (open list)

aft2d (aft2d) wrote :

Update:
I've contacted the mail addresses from Kai-Heng's post.
Michael Chan (from Broadcom) replied that they had seen similar issues on other AMD systems and were working with AMD to resolve them.
The plan was to put me in contact with AMD, but unfortunately this never happened. My attempt to contact AMD through the official channel (tech support) failed because I could not answer AMD's questions without feedback from Broadcom, who then stopped replying as well.

Workaround:
Luckily, with the information that came out of the conversation with Broadcom, I was able to troubleshoot a bit myself since I knew at least somewhat where to look.
It appears that setting Advanced -> NB Configuration -> IOMMU to "disabled" (default is "Auto") in the Supermicro BIOS stops the problem from occurring.
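To confirm from the running system that the BIOS setting took effect, one can check whether the kernel exposes any IOMMU groups. This is only a sketch, assuming the usual sysfs location `/sys/kernel/iommu_groups`; the function name `count_iommu_groups` is hypothetical, and a count of 0 is a hint that no IOMMU is in use rather than proof.

```python
import os


def count_iommu_groups(groups_dir="/sys/kernel/iommu_groups"):
    """Count IOMMU groups listed in sysfs; 0 suggests no IOMMU is active."""
    try:
        entries = os.listdir(groups_dir)
    except FileNotFoundError:
        # Directory missing entirely: treat as "no groups".
        return 0
    return sum(
        1 for name in entries if os.path.isdir(os.path.join(groups_dir, name))
    )


if __name__ == "__main__":
    print("IOMMU groups found:", count_iommu_groups())
```

Running it before and after changing the BIOS option should show the count dropping to 0 if the IOMMU is really off.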

Since then the whole topic has been stuck.

It's just a workaround and not really a fix, but at least the servers are running stably for me now. Since I don't know where the actual problem lies (AMD hardware, BIOS, kernel, or something else), I can't say whether this bug report can be marked as closed.

Kai-Heng Feng (kaihengfeng) wrote :

One possible workaround is to make the device use identity mapping, but that requires changing the kernel source code.
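Whether a device currently uses identity mapping can be inspected from userspace. Below is a rough sketch, assuming the kernel exposes a `type` attribute for each IOMMU group (reporting values such as `identity` or `DMA`) and that the device links to its group via `iommu_group` in sysfs. The function name and the PCI address are placeholders, and this only reads the current state; it does not change the mapping.

```python
import os


def iommu_domain_type(pci_addr, sysfs_root="/sys"):
    """Return the domain type of the device's IOMMU group, or None if unavailable."""
    type_file = os.path.join(
        sysfs_root, "bus", "pci", "devices", pci_addr, "iommu_group", "type"
    )
    try:
        with open(type_file) as fh:
            return fh.read().strip()
    except OSError:
        # No IOMMU group for this device, or the attribute is not exposed.
        return None


if __name__ == "__main__":
    # Placeholder address; substitute the NIC's address as shown by lspci -D.
    print(iommu_domain_type("0000:00:00.0"))
```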

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired