mpt3sas - storage controller resets under heavy disk io

Bug #1841132 reported by Drew Woodard
This bug affects 2 people
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  Undecided   Unassigned
Bionic          Fix Released  Undecided   Unassigned
Disco           Fix Released  Undecided   Unassigned
Eoan            Fix Released  Undecided   Unassigned

Bug Description

[summary]
when a server running ubuntu 18.04 with an lsi sas controller experiences high disk io there is a chance the storage controller will reset
this can take weeks or months, but once the controller resets it will keep resetting every few seconds or few minutes, dramatically degrading disk io
the server must be rebooted to restore the controller to a normal state

[hardware configuration]
server: dell poweredge r7415, purchased 2019-02
cpu/chipset: amd epyc naples
storage controller: "dell hba330 mini" with chipset "lsi sas3008"
drives: 4x samsung 860 pro 2TB ssd

[software configuration]
ubuntu 18.04 server
mdadm raid6
all firmware is fully updated (bios 1.9.3) (hba330 16.17.00.03) (ssd rvm01b6q)

[what happened]
server was operating as a vm host for months without issue
one day the syslog was flooded with messages like "mpt3sas_cm0: sending diag reset !!" and "Power-on or device reset occurred", along with unusably-slow disk io
the server was removed from production and I looked for a way to reproduce the issue
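
To gauge how often the controller is resetting, the two messages quoted above can be counted in a saved log. This is a minimal sketch, assuming the messages end up in a plain-text file such as /var/log/syslog; the function name count_resets is made up for illustration.

```shell
# count_resets FILE: print how many lines of FILE contain either of the
# two messages quoted in the report.
# Note: grep -c still prints 0 when nothing matches, but exits non-zero.
count_resets() {
    grep -c -e 'sending diag reset' -e 'Power-on or device reset occurred' "$1"
}

# Usage (the path is an assumption; adjust to wherever the kernel log is kept):
#   count_resets /var/log/syslog
```

A count that keeps climbing between two runs a few minutes apart would match the "keeps resetting" behaviour described in the summary.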

[how to reproduce the issue]
there are probably many ways to produce this issue, the hackish way I found to reliably reproduce it was:
have the four ssds in a mdadm raid6 with ext4 filesystem
create three 500GB files containing random data
open three terminals. one calculates md5sum of file1 in a loop, another does the same for file2, the third does a copy of file3 to file3-temp in a loop
the number of files is arbitrary, the goal is just to generate a lot of disk io on files too large to be cached in memory
then initiate an array check with "/usr/share/mdadm/checkarray -a" to cause even more drive thrashing
within 1-15min the controller will enter the broken state. the longest I ever saw it take was 30min. I reproduced this several times
rebooting the server restores the controller to a normal state
if the server is not rebooted and the controller is left in this broken state eventually drives will fall out of the array, and sometimes array/filesystem corruption will occur
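
The steps above can be sketched as a single script. This is an illustration, not the exact commands used in the report: DIR, SIZE_MB, RUN_STRESS and the function names are assumptions, and background jobs stand in for the three terminals. Nothing runs unless RUN_STRESS=1 is set.

```shell
#!/bin/sh
# Sketch of the stress test described above (names are hypothetical).
DIR=${DIR:-/mnt/array}        # assumed mount point of the ext4 filesystem on the md array
SIZE_MB=${SIZE_MB:-500000}    # roughly 500GB per file, as in the report

# make_file NAME: fill $DIR/NAME with SIZE_MB MiB of random data
make_file() {
    dd if=/dev/urandom of="$DIR/$1" bs=1M count="$SIZE_MB" 2>/dev/null
}

# hash_loop NAME: checksum $DIR/NAME over and over
hash_loop() {
    while :; do md5sum "$DIR/$1" >/dev/null; done
}

# copy_loop NAME: copy $DIR/NAME to $DIR/NAME-temp over and over
copy_loop() {
    while :; do cp "$DIR/$1" "$DIR/$1-temp"; done
}

if [ "${RUN_STRESS:-0}" = 1 ]; then
    for f in file1 file2 file3; do make_file "$f"; done
    hash_loop file1 &
    hash_loop file2 &
    copy_loop file3 &
    /usr/share/mdadm/checkarray -a   # start the array check, as in the report
    wait                             # the loops never exit; stop with Ctrl-C
fi
```

The file size only matters in that the files must be too large to be cached in memory, so SIZE_MB can be lowered on machines with less RAM.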

[why this is being reported here]
It's unlikely I am exceeding limits of the hardware since this server chassis can hold 24 drives and I am only using 4. The controller specs indicate I should not hit pcie bandwidth limits until at least 16 drives.
My first thought was that the lsi controller firmware was at fault since they have been historically buggy, however I reproduced this with the newest firmware "16.17.00.03" and the previous version "15.17.09.06" (versions may be dell-specific).
I then tried the most recent motherboard bios "1.9.3", and downgraded to "1.9.2", no change.
I then wanted to eliminate the possibility of a bad drive. swapped out all 4 drives with different ones of the same model, no change.
I then upgraded from the standard 18.04 kernel to the newer backported hwe kernel, which also came with a newer mpt3sas driver, no change.
I then ran the same test on the same array but with rhel 8, to my surprise I could no longer reproduce the issue.
-
tl;dr version:
ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) storage controller breaks in 1-10min
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) storage controller breaks in 1-15min, max 30min
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) same stress test on same array for 19h, no errors

[caveats]
Server os misconfiguration is possible, however this is a rather basic vm host running kvm and no 3rd-party packages.
I can't conclusively prove this isn't a hardware fault since I don't have a second unused identical server to test on right now, however the fact that the problem can be easily reproduced under ubuntu but not under rhel seems noteworthy.
There is another bug (LP: #1810781) similar to this, I didn't post there because it's already marked as fixed.
There is also a debian bug (Debian #926202) that encountered this on kernel 4.19.0, but I'm unable to tell if it's the same issue.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1841132

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Drew Woodard (drew-woodard) wrote :

attaching copy of syslog from moment of error

Kai-Heng Feng (kaihengfeng) wrote :
Drew Woodard (drew-woodard) wrote :

Hi Kai-Heng, thanks for your response.

I followed your advice and installed these packages from the ppa:
linux-headers-5.3.0-050300rc5_5.3.0-050300rc5.201908182231_all.deb
linux-headers-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
linux-image-unsigned-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
linux-modules-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb

I then rebooted the system and checked versions with "uname -a" and "modinfo mpt3sas":
(kernel 5.3.0) (mpt3sas driver 29.100.00.00)
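
The version check above can be wrapped in a small helper so the same two values are recorded for every kernel tested. A sketch only: report_versions is a made-up name, and it relies on "modinfo -F version" printing just the driver's version field.

```shell
# report_versions: print the running kernel release and the mpt3sas driver
# version, in the same spirit as the "(kernel ...) (mpt3sas driver ...)" lines.
report_versions() {
    printf 'kernel %s\n' "$(uname -r)"
    if command -v modinfo >/dev/null 2>&1 &&
       drv=$(modinfo -F version mpt3sas 2>/dev/null); then
        printf 'mpt3sas driver %s\n' "$drv"
    else
        # modinfo is missing or the module is not available for this kernel
        printf 'mpt3sas driver unknown\n'
    fi
}

report_versions
```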

The stress test has now been running for 4h with no errors, which is 8x as long as the previous best on 18.04.
I will leave the stress test running overnight in the event that the bug still exists but occurs less frequently.

Drew Woodard (drew-woodard) wrote :

I let the stress test run on the mainline kernel for 22h, no errors.

so in summary:
ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

Kai-Heng Feng (kaihengfeng) wrote :

Would it be possible for you to do a reverse kernel bisection?

First, find the first -rc kernel that works and the last -rc kernel that doesn’t work from http://kernel.ubuntu.com/~kernel-ppa/mainline/

Then,
$ sudo apt build-dep linux
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect new <the working version you found>
$ git bisect old <the non-working version you found>
$ make localmodconfig
$ make -j`nproc` deb-pkg
Install the newly built kernel, then reboot with it.
If it doesn’t work,
$ git bisect old
Otherwise,
$ git bisect new
Repeat from "make -j`nproc` deb-pkg" until you find the commit that fixes the issue.

Mauricio Faria de Oliveira (mfo) wrote :

Hi Drew,

There's a mpt3sas fix in v5.3-rc3 for a problem that may cause an adapter firmware fault
(although not sure of the exact fault state code; but it should cause a reset anyway).

If you could please test either
1) v5.3-rc2 [1] to confirm the issue happens with v5.3-rc2 but not with v5.3-rc3;
or
2) 4.15.0-60.67 (in bionic-proposed), which has the fix (so checking whether the issue no longer happens)
that would be great.

If that doesn't help, please continue with the great regression tip provided by Kai-Heng Feng.

Thanks!
Mauricio

[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.3-rc2/

Mauricio Faria de Oliveira (mfo) wrote :

Mentioned upstream candidate fix is:

commit df9a606184bfdb5ae3ca9d226184e9489f5c24f7
Author: Suganath Prabu <email address hidden>
Date: Tue Jul 30 03:43:57 2019 -0400

    scsi: mpt3sas: Use 63-bit DMA addressing on SAS35 HBA

    Although SAS3 & SAS3.5 IT HBA controllers support 64-bit DMA addressing, as
    per hardware design, if DMA-able range contains all 64-bits
    set (0xFFFFFFFF-FFFFFFFF) then it results in a firmware fault.

    E.g. SGE's start address is 0xFFFFFFFF-FFFF000 and data length is 0x1000
    bytes. when HBA tries to DMA the data at 0xFFFFFFFF-FFFFFFFF location then
    HBA will fault the firmware.

    Driver will set 63-bit DMA mask to ensure the above address will not be
    used.

    Cc: <email address hidden> # 5.1.20+
    Signed-off-by: Suganath Prabu <email address hidden>
    Reviewed-by: Christoph Hellwig <email address hidden>
    Signed-off-by: Martin K. Petersen <email address hidden>

git/linux $ git describe --contains df9a606184bfdb5ae3ca9d226184e9489f5c24f7
v5.3-rc3~21^2~1

git/ubuntu-bionic $ git log --oneline Ubuntu-4.15.0-60.67 -- drivers/scsi/mpt3sas/
395f1e3037b8 scsi: mpt3sas: Use 63-bit DMA addressing on SAS35 HBA
...
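
The address arithmetic in the commit message can be checked by hand or with a few lines of shell. This sketch splits the 64-bit address into upper and lower 32-bit words, the way the commit message writes it, so it stays within what plain shell arithmetic handles. It assumes the example's start address has a lower word of 0xFFFFF000, which is what makes a 0x1000-byte transfer end at 0xFFFFFFFF-FFFFFFFF; fits_63bit is a hypothetical helper, not a kernel function.

```shell
# fits_63bit HI: succeed if an address whose upper 32-bit word is HI lies
# within a 63-bit DMA mask, i.e. bit 63 of the full address is clear.
fits_63bit() {
    hi=$1
    [ $(( (hi >> 31) & 1 )) -eq 0 ]   # bit 63 overall is bit 31 of the upper word
}

start_hi=$((0xFFFFFFFF))
start_lo=$((0xFFFFF000))   # assumed lower word of the start address
len=$((0x1000))

# Address of the last byte touched: start + len - 1, with carry into the upper word.
last_lo=$(( start_lo + len - 1 ))
last_hi=$(( start_hi + (last_lo >> 32) ))
last_lo=$(( last_lo & 0xFFFFFFFF ))
last_addr=$(printf '0x%08X-%08X' "$last_hi" "$last_lo")

echo "last byte touched: $last_addr"
if fits_63bit "$last_hi"; then
    echo "within a 63-bit mask"
else
    echo "outside a 63-bit mask, so a 63-bit DMA mask rules this buffer out"
fi
```

Any address with bit 63 set is excluded by a 63-bit mask, which is a broader restriction than the single faulting address but is enough to guarantee the all-ones address is never used.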

Drew Woodard (drew-woodard) wrote :

rc2 does contain the bug, annoyingly it took 54min to trigger which is longer than any previous version.
rc3 is stress testing at the moment.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

Drew Woodard (drew-woodard) wrote :

rc3 has been stress testing for 7h without error so I believe Mauricio Faria de Oliveira correctly identified the patch that corrects this issue.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc3) (mpt3sas driver 29.100.00.00) working
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

Mauricio Faria de Oliveira (mfo) wrote :

Hi Drew,

That's very good news! So it looks like that patch resolves the problem.

Could you please test the kernel in bionic-proposed [1] (4.15.0-60-generic)
which has that patch to confirm it's also working correctly?

Thanks!
Mauricio

[1] https://wiki.ubuntu.com/Testing/EnableProposed

Drew Woodard (drew-woodard) wrote :

Stress tested bionic-proposed kernel 4.15.0-60-generic #67-Ubuntu for 6.5h with no errors so it appears to be patched in that version as expected.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 proposed (kernel 4.15.0) (mpt3sas driver 17.100.00.00) working
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc3) (mpt3sas driver 29.100.00.00) working
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

Mauricio Faria de Oliveira (mfo) wrote :

Drew,

Thanks for testing bionic-proposed!
So it will be resolved for bionic kernels shortly, when it hits bionic-updates.

Disco/19.04 will get this patch via stable updates in the near future [1].

Eoan has it applied (LP: #1839588).

So this is all good.

Thanks again,
Mauricio

[1] https://lists.ubuntu.com/archives/kernel-team/2019-August/103416.html

Changed in linux (Ubuntu Eoan):
status: Incomplete → Fix Released
Changed in linux (Ubuntu Disco):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
Mauricio Faria de Oliveira (mfo) wrote :

Marking Bionic as Fix Released as the kernel from bionic-proposed has been promoted to bionic-updates.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Drew Woodard (drew-woodard) wrote :

Are there any plans to include this fix in the hwe kernel?
I tested today on the current 18.04 hwe kernel 5.0.0-27 and the bug appeared in 18min.

Mauricio Faria de Oliveira (mfo) wrote :

Hi Drew,

Yes, the HWE kernel syncs automatically from the normal kernel as it moves forward.

Once Disco gets the patch, the HWE from Disco in Bionic should get it as well (same version number plus ~18.04.1 suffix).

Mauricio Faria de Oliveira (mfo) wrote :

Disco currently has the mpt3sas fix in disco-proposed (version 5.0.0-30.32),
also available as linux-hwe kernel in bionic-proposed.

Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed
Mauricio Faria de Oliveira (mfo) wrote :

Disco is now fix released with linux 5.0.0-31.33 (bionic: linux-hwe 5.0.0-31.33~18.04.1).
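
With the fixed versions now known for each series, an installed kernel package can be compared against them. A sketch under assumptions: version_ge and the FIXED_* variable names are hypothetical, the comparison uses "dpkg --compare-versions" where dpkg is available, and the sort -V fallback only approximates Debian version ordering.

```shell
# version_ge A B: succeed when package version A is at least B.
version_ge() {
    if command -v dpkg >/dev/null 2>&1; then
        dpkg --compare-versions "$1" ge "$2"
    else
        # Fallback for systems without dpkg: B must sort first (or be equal).
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
    fi
}

# Versions named in this thread as carrying the mpt3sas fix.
FIXED_BIONIC='4.15.0-60.67'
FIXED_DISCO='5.0.0-31.33'
FIXED_BIONIC_HWE='5.0.0-31.33~18.04.1'

# Usage on an Ubuntu system (the dpkg-query call is an assumed way to read
# the version of the running kernel's package):
#   installed=$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)")
#   version_ge "$installed" "$FIXED_BIONIC" && echo "kernel has the fix"
```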

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released