[SRU] IO's are issued with incorrect Scatter Gather Buffer

Bug #1795453 reported by Sasikumar on 2018-10-01
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Marcelo Cerri
Bionic           Undecided    Marcelo Cerri
Cosmic           Undecided    Marcelo Cerri

Bug Description

[Impact]
We have observed that the OS is sending I/Os (SCSI Read/Write) with an incorrect Scatter Gather Buffer address.

That is, the OS sends the I/O with a 64-bit Scatter Gather Buffer address such that adding the length to the buffer address causes a rollover.

Here is the data we captured in our driver (the SGE details sent by the OS):

sgl_ptr->Address = fffffffffffff000
[14547.313240] sgl_ptr->Length = 1000
[14547.313241] sge_count = 18
[14547.313242] cmd->index = 9d9
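
The rollover described here can be checked with plain 64-bit arithmetic. The following is a minimal sketch, not driver code: the helper name `sge_end` is hypothetical, and the only inputs are the address and length from the captured SGE.

```python
# Illustrative only: models a 64-bit DMA address space in plain Python.
U64_MODULUS = 1 << 64  # one past the largest 64-bit address


def sge_end(address, length):
    """Return (exclusive end address truncated to 64 bits, whether it wrapped)."""
    end = address + length
    return end % U64_MODULUS, end >= U64_MODULUS


# Values taken from the SGE dump in this report (hexadecimal).
address = 0xFFFFFFFFFFFFF000
length = 0x1000

end, wrapped = sge_end(address, length)
print(f"end = {end:#x}, wrapped = {wrapped}")  # end = 0x0, wrapped = True
```

The buffer occupies the last 4 KiB of the 64-bit space, so the address one past its end cannot be represented in 64 bits and truncates to zero.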

Note : This issue is observed only when virtualization is enabled

Product : Broadcom (LSI) Megaraid H840 controller

[Test Case]
Steps found in comment #60
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1795453/comments/60

[Regression Potential]

The patches in Comment #59 have been accepted upstream.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1795453/comments/59

[Other Info]

Product : Broadcom (LSI) Megaraid H840 controller
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw----+ 1 root audio 116, 1 Oct 12 16:35 seq
 crw-rw----+ 1 root audio 116, 33 Oct 12 16:35 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 18.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
MachineType: Dell Inc. PowerEdge R7415
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.18.0-8-generic root=UUID=43e89410-c7ff-11e8-9229-d094661013d8 ro
ProcVersionSignature: Ubuntu 4.18.0-8.9~18.04.1-generic 4.18.7
RelatedPackageVersions:
 linux-restricted-modules-4.18.0-8-generic N/A
 linux-backports-modules-4.18.0-8-generic N/A
 linux-firmware 1.173.1
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic uec-images
Uname: Linux 4.18.0-8-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lxd plugdev sudo
_MarkForUpload: True
dmi.bios.date: 09/07/2018
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.6.2
dmi.board.vendor: Dell Inc.
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.6.2:bd09/07/2018:svnDellInc.:pnPowerEdgeR7415:pvr:rvnDellInc.:rn:rvr:cvnDellInc.:ct23:cvr:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R7415
dmi.product.sku: SKU=NotProvided;ModelName=PowerEdge R7415
dmi.sys.vendor: Dell Inc.
---
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.4
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 18.04
InstallationDate: Installed on 2018-09-20 (25 days ago)
InstallationMedia: Ubuntu-Server 18.04 LTS "Bionic Beaver" - Release amd64 (20180426)
IwConfig:
 lo no wireless extensions.

 eno2 no wireless extensions.

 eno1 no wireless extensions.
MachineType: Dell Inc. PowerEdge R7415
Package: linux (not installed)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-36-generic root=UUID=0fdc9101-73c3-4eec-9136-4384e760b449 ro
ProcVersionSignature: Ubuntu 4.15.0-36.39-generic 4.15.18
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-36-generic N/A
 linux-backports-modules-4.15.0-36-generic N/A
 linux-firmware 1.173.1
RfKill:

Tags: bionic
Uname: Linux 4.15.0-36-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 09/07/2018
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.6.2
dmi.board.name: 065PKD
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.6.2:bd09/07/2018:svnDellInc.:pnPowerEdgeR7415:pvr:rvnDellInc.:rn065PKD:rvrA00:cvnDellInc.:ct23:cvr:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R7415
dmi.sys.vendor: Dell Inc.

Michael Reed (mreed8855) wrote :

Can you provide more information, such as the driver version, uname -a, logs, and how to recreate the issue?

Sasikumar (sasikumarpc) wrote :

uname -r:

4.15.0-34-generic

Driver Version :

MEGASAS_VERSION "07.703.05.00"

How to recreate the issue :

Run I/Os (SCSI R/W) using the vdbench tool on raw Virtual Disks (which are on the Broadcom (LSI) Megaraid H840 controller).

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe (Ubuntu):
status: New → Confirmed
Jerry Clement (jerry-clement) wrote :

@mreed8855 This issue may be related to the amd_iommu driver. There are patches upstream that are not in the 4.15 kernel. How can we get a list of those patches? Is there an amd_iommu expert out there?

Sasikumar (sasikumarpc) wrote :

No, we own only the megaraid Linux driver; you may have to reach out to the AMD IOMMU driver team.

Jeff Lane (bladernr) wrote :

Can any of you recreate this using the mainline kernel?

https://wiki.ubuntu.com/Kernel/MainlineBuilds

That should at least give a point in time where we know either that it's not fixed in 4.18, or that it IS fixed in 4.18 and the patches would be easier to derive.

Jeff Lane (bladernr) wrote :

moving to kernel itself rather than the hwe package

affects: linux-signed-hwe (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: Confirmed → New

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1795453

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

We are in the process of collecting logs.

Sasikumar (sasikumarpc) wrote :

It looks like the issue is not seen with the 4.18 kernel. Please let me know whether Ubuntu still requires the logs from the 4.15 kernel.

Joseph Salisbury (jsalisbury) wrote :

Could you also test the latest upstream 4.15 stable kernel? It is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15.18/

We should be able to perform a reverse bisect to identify the commits in 4.18 that resolve this bug.
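
As a rough illustration of what a reverse bisect does, the sketch below binary-searches an ordered list for the first entry where the bug no longer reproduces. It is not `git bisect` itself; the function `first_fixed`, the commit labels, and the fix position are all made-up placeholders, and the search assumes the bug stays fixed once it is fixed.

```python
def first_fixed(commits, is_fixed):
    """Return the first commit for which is_fixed() is True, or None.

    Assumes commits are ordered oldest to newest and that every commit
    after the fix is also fixed (the usual bisect precondition).
    """
    if not commits or not is_fixed(commits[-1]):
        return None
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid          # fix is at mid or earlier
        else:
            lo = mid + 1      # fix is after mid
    return commits[lo]


# Hypothetical history: eight commits, with the fix landing at "c5".
commits = [f"c{i}" for i in range(8)]
fixed_from = 5  # placeholder; in practice each probe is a boot-and-test cycle

result = first_fixed(commits, lambda c: int(c[1:]) >= fixed_from)
print(result)
```

Each call to `is_fixed` stands in for building a kernel at that commit and rerunning the stress test, which is why halving the search range matters.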

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
tags: added: bionic
Sasikumar (sasikumarpc) wrote :

Also, another version of the 4.15 kernel (other than the default version in Ubuntu 18.04.01) fixes the issue.

Sasikumar (sasikumarpc) wrote :

It would be good if you could provide us with the exact commit (patch) that fixes the issue.

Sujith Pandel (sujithpandel) wrote :

sasikumarpc -
Can you share the specific kernel versions that you are using, where the issue is fixed?
Can you also share the logs of pass & failure cases?

Sasikumar (sasikumarpc) wrote :

Issue reported Kernel Version
-----------------------------

4.15.0-34-generic

Issue resolved Kernel Version
-----------------------------

4.18.0-8-generic
                Linux user 4.18.0-8-generic #9~18.04.1-Ubuntu SMP Mon Sep 17 12:49:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

It looks like we do not have a system with the 4.15.0-34-generic kernel version from which to upload the log files.

Sujith Pandel (sujithpandel) wrote :

> Also another version (other than default version of Ubuntu 18.04.01) of 4.15 kernel fixes the issue

Can you share this kernel version as well?

apport information

tags: added: apport-collected uec-images
description: updated

> Also another version (other than default version of Ubuntu 18.04.01) of 4.15 kernel fixes the issue

Can you share this kernel version as well?

Actually, adding additional memory to the system running the 4.15.0-34-generic kernel fixes the issue; it is not a different version of the 4.15 kernel that fixes it. My apologies for the misunderstanding.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Sujith Pandel (sujithpandel) wrote :

@sasikumarpc - OK. So when you say the v4.18 kernel below fixes it, is that with increased memory, or with the same setup where the issue was originally seen?

Issue resolved Kernel Version
-----------------------------
4.18.0-8-generic

Sasikumar (sasikumarpc) wrote :

Yes, the kernel below 4.18 (that is, 4.15) is fixed only with increased memory.

Sujith Pandel (sujithpandel) wrote :

With the latest 4.15 LTS kernels, I see the vdbench process crash, and the PERC controller and the related disks go offline.

Sujith Pandel (sujithpandel) wrote :

With the 4.18.0-8-generic kernel, I see AMD IOMMU IO_PAGE_FAULTs for invalid addresses
AMD-Vi: Event logged [IO_PAGE_FAULT device=e1:00.0 domain=0x0000 address=0xfffffffffffd14c0 flags=0x0030]

The PERC controller is still present, by the way, but the vdbench stress run crashed due to these page faults.
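
To see how close the faulting address is to the top of the 64-bit space, the event line can be parsed and the address converted to an integer. This is a minimal sketch: the regular expression is written only against the single AMD-Vi line quoted in this comment, and the helper name `parse_fault` is hypothetical.

```python
import re

# Pattern written against the one AMD-Vi event line quoted in this comment.
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT device=(?P<device>[0-9a-f:.]+) "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)"
)


def parse_fault(line):
    """Return the fault fields as a dict, or None if the line does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["address"] = int(fields["address"], 16)
    return fields


line = ("AMD-Vi: Event logged [IO_PAGE_FAULT device=e1:00.0 domain=0x0000 "
        "address=0xfffffffffffd14c0 flags=0x0030]")
fault = parse_fault(line)

# Distance between the faulting address and the top of the 64-bit space.
distance_from_top = (1 << 64) - fault["address"]
print(fault["device"], hex(fault["address"]), hex(distance_from_top))
```

For the quoted event the distance is 0x2eb40 bytes, so the fault lies within the last megabyte of the 64-bit range, the same region as the SGE address reported originally.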

Sujith Pandel (sujithpandel) wrote :

I see the disks going offline during stress in mainline 4.19.0-041900rc7-generic as well.

Although the symptoms look different in the 4.15 and 4.18 kernels, I feel the bug is not fixed; the address being accessed is still out of range on this 16GB server.

Sasikumar (sasikumarpc) wrote :

Yes, it seems that the bug is still not fixed.

FYI - We also observed page faults in the 4.18 kernel.

Sujith Pandel (sujithpandel) wrote :

@sasikumarpc -
We observed this scenario where the PERC goes offline:
sgl_ptr->Address = fffffffffffff000
sgl_ptr->Length = 1000

We do not see a rollover here.

The kernel is indicating a read of 0x1000 bytes starting from address 0xfffffffffffff000, i.e. up to 0xffffffffffffffff, which aligns with the 64-bit dma_mask assigned by the megaraid_sas driver.

It looks like the kernel is working as expected based on the dma_mask set by the megaraid_sas driver.
Once the 32-bit addresses are exhausted, the kernel tries to get a buffer based on the 64-bit dma_mask, top-down.

We suspect that the PERC firmware might not be able to handle this corner case.
The PERC did not go offline after changing the dma_mask to 63 bits, and these initial results appear to support this view of the firmware.
Can you please check this and get back to us?
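
The difference between the two mask widths can be worked through numerically. The sketch below mirrors the kernel's `DMA_BIT_MASK(n)` macro in Python and checks the reported buffer against both masks; the helper names `dma_bit_mask` and `fits` are hypothetical, and nothing here models what the firmware actually does.

```python
def dma_bit_mask(bits):
    """Mirror the kernel's DMA_BIT_MASK(n): a value with the n low bits set."""
    return (1 << bits) - 1


def fits(address, length, mask):
    """True if the last byte of [address, address + length) is within mask."""
    return address + length - 1 <= mask


# Buffer reported in this bug.
address, length = 0xFFFFFFFFFFFFF000, 0x1000
last_byte = address + length - 1

for bits in (64, 63):
    mask = dma_bit_mask(bits)
    print(f"{bits}-bit mask {mask:#x}: fits = {fits(address, length, mask)}")

# Highest page-aligned 4 KiB buffer allowed by a 63-bit mask, and its
# exclusive end address, which still fits in 64 bits without wrapping.
top_63 = dma_bit_mask(63) - length + 1
print(f"{top_63:#x} .. {top_63 + length:#x}")
```

With a 64-bit mask the buffer is legal because its last byte is exactly 0xffffffffffffffff, yet its exclusive end address overflows 64 bits. With a 63-bit mask no legal buffer can end above 0x8000000000000000, so that overflow cannot occur.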

Sasikumar (sasikumarpc) wrote :

I would just like to understand more about the system behavior when we change the DMA mask from 64 bit to 63 bit.

Do we expect any double buffering?

Sujith Pandel (sujithpandel) wrote :

The 63-bit dma_mask was just a test to understand the address wraparound by the fw. Nothing specific.
PERC didn't go offline. It always ended up with OOM, btw.

Sasikumar (sasikumarpc) wrote :

Will 63-bit masking cause double buffering of data?

Sujith Pandel (sujithpandel) wrote :

Broadcom submitted the fix upstream:
https://marc.info/?l=linux-scsi&m=154514161019901&w=2
https://marc.info/?l=linux-scsi&m=154514161319908&w=2
https://marc.info/?l=linux-scsi&m=154514161619911&w=2

and the commit IDs of these patches from upstream:

894169db1246 scsi: megaraid_sas: Use 63-bit DMA addressing
7b9e2d348c2a scsi: megaraid_sas: driver version update

Please include them in the Ubuntu 18.04 LTS kernels.

Michael Reed (mreed8855) wrote :

How to Recreate

1. Install Ubuntu 18.04.01 on a separate storage disk.
2. Create VDs on H840/H740 using 9 physical disks
    a. 16 sliced RAID5; 10 GB each
    b. 16 sliced RAID1; 10 GB each
    c. 16 sliced RAID6; 10 GB each
3. Install JAVA
4. Download and copy vdbench to system
    a. https://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html
    b. I used version 5.4.6 but reporter had 5.4.7
5. Run vdbench test “./vdbench -t”
6. Edit/create vdbench workprofile to match disks on H840
    a. Use lsscsi to show 48 H840/H740 disks
7. Run vdbench with profile
    a. ./vdbench -f vdbenchworkloadprofile.txt
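
The workload profile used in step 6 is not included in this report. As a rough sketch of what generating one might look like, the script below emits a vdbench parameter file with one storage definition per disk. The device paths, transfer size, read percentage, and run time are all placeholder assumptions, not the reporter's settings, and the function name `build_profile` is hypothetical. Writes to raw devices destroy their contents, so the device list must be checked against lsscsi output first.

```python
def build_profile(devices, xfersize="4k", rdpct=50, elapsed=600):
    """Build a vdbench parameter file as a string.

    One sd= (storage definition) line per device, then a single wd=
    (workload definition) and rd= (run definition) covering all of them.
    All workload values are placeholders, not the reporter's settings.
    """
    lines = []
    for index, device in enumerate(devices, start=1):
        lines.append(f"sd=sd{index},lun={device},openflags=o_direct")
    lines.append(f"wd=wd1,sd=sd*,xfersize={xfersize},rdpct={rdpct},seekpct=100")
    lines.append(f"rd=run1,wd=wd1,iorate=max,elapsed={elapsed},interval=1")
    return "\n".join(lines) + "\n"


# Hypothetical device names; substitute the H840/H740 virtual disks.
profile = build_profile(["/dev/sdb", "/dev/sdc"])
print(profile, end="")
```

Saving the printed text as vdbenchworkloadprofile.txt would give a file usable with the command in step 7.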

Michael Reed (mreed8855) on 2019-01-11
description: updated
Michael Reed (mreed8855) on 2019-01-11
tags: added: verification-needed
tags: added: tpp
Michael Reed (mreed8855) on 2019-01-11
summary: - IO's are issued with incorrect Scatter Gather Buffer
+ [SRU] IO's are issued with incorrect Scatter Gather Buffer
Michael Reed (mreed8855) on 2019-01-11
tags: added: cosmic disco xenial
Marcelo Cerri (mhcerri) wrote :

I built test kernels for bionic and cosmic with the fix applied. Could you please test them and check whether the problem is solved?

You can download the debian packages from:

https://kernel.ubuntu.com/~mhcerri/lp1795453/bionic/
https://kernel.ubuntu.com/~mhcerri/lp1795453/cosmic/

tags: added: patch
Jerry Clement (jerry-clement) wrote :

Testing Status Update:
Testing ongoing since last week.
4.15 kernel ran for 72 hours without failure.
Will update tomorrow on status of the 4.18 kernel (so far 24+ hours without failure).

Jerry Clement (jerry-clement) wrote :

Testing Status Update:
4.18 kernel has run for 72+ hours without failure.

Please proceed for the next SRU cycle. This is critical.

Michael Reed (mreed8855) on 2019-01-29
tags: added: verification-needed-bionic verification-needed-xenial
removed: verification-needed
Michael Reed (mreed8855) on 2019-01-29
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Marcelo Cerri (mhcerri) wrote :
Changed in linux (Ubuntu):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Cosmic):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: In Progress → Fix Committed
Sujith Pandel (sujithpandel) wrote :

Can we also get the above fixes in Xenial HWE kernels?
I suppose the Xenial HWE kernel is a fork of the bionic LTS kernel.

Michael Reed (mreed8855) wrote :

This landed in the 4.15 kernel, so it should be picked up in 16.04.5.

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-cosmic' to 'verification-done-cosmic'. If the problem still exists, change the tag 'verification-needed-cosmic' to 'verification-failed-cosmic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-cosmic
Jerry Clement (jerry-clement) wrote :

How long does it take for content to show up in proposed?
As of today it does not appear to be available yet.

Hi,

This bug has been marked as affecting both the Bionic and Cosmic main linux packages, so the fix has been applied as such. Please note that comment #67 is requesting verification with the Cosmic kernel only (4.18); the Bionic kernel (4.15) has not been published to -proposed yet. Another comment will be added automatically when the Bionic kernel hits -proposed, which is scheduled to happen by the end of this week.

The verification is requested only for the main kernels, however since xenial/linux-hwe is based on bionic/linux (4.15), it will also receive the fix with the next update.

Thank you.

tags: added: verification-done-cosmic
removed: verification-needed-cosmic
Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released
tags: added: verification-done-bionic
removed: verification-needed-bionic