qla2xxx: Fix page fault at kmem_cache_alloc_node()

Bug #1770003 reported by bugproxy
Affects                           Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project  Fix Released  Critical    Canonical Kernel Team
linux (Ubuntu)                    Fix Released  Critical    Joseph Salisbury
  Bionic                          Fix Released  Critical    Joseph Salisbury

Bug Description

== SRU Justification ==
IBM is requesting that these nine patches be SRU'd to Bionic. IBM found
that the current Bionic kernel contains a problem in the qla2xxx
driver, which causes the following:

[ 66.295233] Unable to handle kernel paging request for data at address 0x8882f6ed90e9151a
[ 66.295297] Faulting instruction address: 0xc00000000038a110
cpu 0x50: Vector: 380 (Data Access Out of Range) at [c00000000692f650]
    pc: c00000000038a110: kmem_cache_alloc_node+0x2f0/0x350
    lr: c00000000038a0fc: kmem_cache_alloc_node+0x2dc/0x350
    sp: c00000000692f8d0
   msr: 9000000000009033
   dar: 8882f6ed90e9151a
  current = 0xc00000000698fd00
  paca = 0xc00000000fab7000 softe: 0 irq_happened: 0x01
    pid = 1762, comm = systemd-journal
Linux version 4.15.0-20-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 4.15.0-20.21-generic 4.15.20)
enter ? for help
[c00000000692f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[c00000000692f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
[c00000000692f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
[c00000000692fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
[c00000000692fae0] c000000000c5705c unix_dgram_sendmsg+0x15c/0x8f0
[c00000000692fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
[c00000000692fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
[c00000000692fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
[c00000000692fe30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 000074826f6fa9c4
SP (7ffff5dc5510) is in userspace

We were able to get rid of this problem by cherry-picking some of the upstream patches. Do you think they might meet the SRU criteria?

The commit IDs are below; they were easy to cherry-pick.

eaf75d1815dad230dac2f1e8f1dc0349b2d50071: scsi: qla2xxx: Fix double free bug after firmware timeout
6d67492764b39ad6efb6822816ad73dc141752f4: scsi: qla2xxx: Prevent relogin trigger from sending too many commands
7ac0c332f96bb9688560726f5e80c097ed8de59a: scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
045d6ea200af794ba15515984cff63787a7fc3c0: scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
1ae634eb28533b82f9777a47c1ade44cb8c0182b: scsi: qla2xxx: Serialize session free in qlt_free_session_done
d8630bb95f46ea118dede63bd75533faa64f9612: scsi: qla2xxx: Serialize session deletion by using work_lock
        Requires: 1c6cacf4ea6c04a58a0e3057f5ed60c24a4ffeff ('scsi: qla2xxx: Fixup locking for session deletion')
94cff6e114df56d0df74cdabe3481df38d9b0c1e: scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_dele?
9cd883f07a54e5301d51e259acd250bb035996be: scsi: qla2xxx: Fix session cleanup for N2N
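
A minimal sketch of how the commits listed above could be applied to a Bionic kernel tree. The order is an assumption taken from the [PATCH][SRU][Bionic n/9] series attached to this bug, with the "Fixup locking" dependency placed right after the commit it fixes. The function name print_cherry_pick_plan is hypothetical; it only prints the commands, so nothing is changed until the output is reviewed and run by hand.

```shell
# Hypothetical helper: print (not run) the cherry-pick commands for the
# nine upstream qla2xxx commits named in this bug.
# The order is an assumption based on the n/9 patch series numbering.
print_cherry_pick_plan() {
    local sha
    for sha in \
        9cd883f07a54e5301d51e259acd250bb035996be \
        94cff6e114df56d0df74cdabe3481df38d9b0c1e \
        d8630bb95f46ea118dede63bd75533faa64f9612 \
        1c6cacf4ea6c04a58a0e3057f5ed60c24a4ffeff \
        1ae634eb28533b82f9777a47c1ade44cb8c0182b \
        045d6ea200af794ba15515984cff63787a7fc3c0 \
        7ac0c332f96bb9688560726f5e80c097ed8de59a \
        6d67492764b39ad6efb6822816ad73dc141752f4 \
        eaf75d1815dad230dac2f1e8f1dc0349b2d50071
    do
        # -x records the upstream commit id in the backport's message.
        echo "git cherry-pick -x $sha"
    done
}
```

Running print_cherry_pick_plan inside a checkout and piping the reviewed output to sh would apply the commits one by one, stopping at the first conflict.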

== Regression Potential ==
Medium. There are nine patches in this pull request. They are not specific to a
particular arch, but they are specific to qla2xxx.

== Test Case ==
A test kernel was built with these patches and tested by IBM.
IBM states the test kernel resolved the bug.


bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-167562 severity-high targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-08 15:31 EDT-------
Also, if you could generate a custom kernel for us to test before committing to the master tree, we would appreciate it.

Manoj Iyer (manjo)
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Manoj Iyer (manjo) wrote :

The patches identified here do not seem to cherry-pick cleanly. However, I cherry-picked several of the dependencies and came up with this list.

scsi: qla2xxx: Fix double free bug after firmware timeout
scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
scsi: qla2xxx: Serialize session free in qlt_free_session_done
scsi: qla2xxx: Serialize session deletion by using work_lock
scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
scsi: qla2xxx: Prevent relogin trigger from sending too many commands
scsi: qla2xxx: Prevent multiple active discovery commands per session
scsi: qla2xxx: Delay loop id allocation at login
scsi: qla2xxx: Allow relogin and session creation after reset
scsi: qla2xxx: Add ability to use GPNFT/GNNFT for RSCN handling
scsi: qla2xxx: Properly extract ADISC error codes
scsi: qla2xxx: Fix GPNFT/GNNFT error handling
scsi: qla2xxx: Fix login state machine freeze
scsi: qla2xxx: Add lock protection around host lookup
scsi: qla2xxx: Add switch command to simplify fabric discovery
scsi: qla2xxx: Fix session cleanup for N2N
scsi: qla2xxx: Allow target mode to accept PRLI in dual mode
scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
scsi: qla2xxx: Add ability to send PRLO
scsi: qla2xxx: Add option for use reserve exch for ELS
scsi: qla2xxx: Move work element processing out of DPC thread
scsi: qla2xxx: Replace GPDB with async ADISC command
scsi: qla2xxx: Fix Firmware dump size for Extended login and Exchange Offload
scsi: qla2xxx: Use IOCB path to submit Control VP MBX command

It is a longer list of patches to the driver; please let me know if this is acceptable.

Joseph Salisbury (jsalisbury) wrote :

If there are too many required patches, a bug fix would not pass the SRU requirements[1]. The SRU process requires the least amount of changes to implement the fix.

How did you go about identifying all of the required commits in the bug description? Is that the minimum number of commits required to fix the bug? Is it possible to "Reverse" bisect down to a fewer number of commits?

Is there a specific commit that you know of that introduced this bug in Bionic? One other option could be to find the offending commit and revert it.

Breno Leitão (breno-leitao) wrote :

Hi Joseph,

The original patchset just included 4 patches, they are:

d8630bb scsi: qla2xxx: Serialize session deletion by using work_lock
1ae634e scsi: qla2xxx: Serialize session free in qlt_free_session_done
9cd883f scsi: qla2xxx: Fix session cleanup for N2N
eaf75d1 scsi: qla2xxx: Fix double free bug after firmware timeout

The other three patches came as a requirement to backport these 4 patches.

On top of that, commit d8630bb95f46 introduced a regression, which requires one last patch:

1c6cacf scsi: qla2xxx: Fixup locking for session deletion

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 1/9] scsi: qla2xxx: Fix session cleanup for N2N

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:03 EDT-------

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 2/9] scsi: qla2xxx: Remove unused argument from

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:03 EDT-------

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 3/9] scsi: qla2xxx: Serialize session deletion by

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:04 EDT-------

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 4/9] scsi: qla2xxx: Fixup locking for session

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:04 EDT-------

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 5/9] scsi: qla2xxx: Serialize session free in qlt_free_session_done

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:05 EDT-------

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 6/9] scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:06 EDT-------

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 7/9] scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:06 EDT-------

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 8/9] scsi: qla2xxx: Prevent relogin trigger from sending too many commands

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:07 EDT-------

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 9/9] scsi: qla2xxx: Fix double free bug after firmware timeout

------- Comment (attachment only) From <email address hidden> 2018-05-09 14:07 EDT-------

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-09 14:10 EDT-------
Canonical,

I have attached the patchset for this bug here, as described in Breno's comment.

Manoj Iyer (manjo) wrote :

Here is the set of patches I needed to backport/cherry-pick to get a kernel that would build.

2853192e154b scsi: qla2xxx: Use IOCB path to submit Control VP MBX command
11aea16ab3f5 scsi: qla2xxx: Add ability to send PRLO
1c6cacf4ea6c scsi: qla2xxx: Fixup locking for session deletion
eaf75d1815da scsi: qla2xxx: Fix double free bug after firmware timeout
6d67492764b3 scsi: qla2xxx: Prevent relogin trigger from sending too many commands
7ac0c332f96b scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
045d6ea200af scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
1ae634eb2853 scsi: qla2xxx: Serialize session free in qlt_free_session_done
d8630bb95f46 scsi: qla2xxx: Serialize session deletion by using work_lock
94cff6e114df scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
a4239945b8ad scsi: qla2xxx: Add switch command to simplify fabric discovery
040036bb0bc1 scsi: qla2xxx: Delay loop id allocation at login
9cd883f07a54 scsi: qla2xxx: Fix session cleanup for N2N
f13515acdcb5 scsi: qla2xxx: Replace GPDB with async ADISC command

I have this kernel built in a PPA: ppa:ubuntu-power-triage/lp1770003. This kernel needs to be tested on the platform (bug does not say if this was found in P8 or P9) where the issue was reported, and will also need to be tested on Cavium (ARM64) to make sure there are no regressions.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-09 15:26 EDT-------
>bug does not say if this was found in P8 or P9

Sorry, it was identified and is being debugged on a POWER9 machine. We are not sure yet whether this also happens on POWER8.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-09 15:31 EDT-------
We are focusing on P9 now, but we believe this exists on all platforms

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-09 18:14 EDT-------
I installed this kernel image and it overwrote the stock 4.15.0-20-generic kernel image, wiping out any chance to boot the original kernel if something goes wrong. This isn't the way the package was supposed to work, is it?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-09 18:25 EDT-------
(In reply to comment #25)
> I installed this kernel image and it overwrote the stock 4.15.0-20-generic
> kernel image, wiping out any chance to boot the original kernel if something
> goes wrong. This isn't the way the package was supposed to work, is it?

Looks like it actually failed to install as well, so I'm looking at what happened.

------- Comment From <email address hidden> 2018-05-09 18:29 EDT-------
Here are the pertinent messages from dpkg:
-----------------------------------
Selecting previously unselected package linux-image-unsigned-4.15.0-20-generic.
dpkg: regarding linux-image-unsigned-4.15.0-20-generic_4.15.0-20.21~lp1770003+build.6_ppc64el.deb containing linux-image-unsigned-4.15.0-20-generic:
linux-image-unsigned-4.15.0-20-generic conflicts with linux-image-4.15.0-20-generic
linux-image-4.15.0-20-generic (version 4.15.0-20.21) is present and installed.

dpkg: error processing archive linux-image-unsigned-4.15.0-20-generic_4.15.0-20.21~lp1770003+build.6_ppc64el.deb (--install):
conflicting packages - not installing linux-image-unsigned-4.15.0-20-generic
------------------------------------

I would have expected this kernel to be installed alongside the default 18.04 kernel. Is there something wrong with the packages or my setup? This was on a scratch install of 18.04 -proposed.
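
The dpkg messages above show a conflict between two image packages that carry the same ABI string (4.15.0-20-generic). As a rough sketch of that check, assuming that image packages sharing an ABI string cannot be installed side by side (the authoritative rule is each package's Conflicts field), the names can be compared as follows. The helper names kernel_abi and can_coexist are hypothetical.

```shell
# Hypothetical helpers: extract the ABI string from a kernel image package
# name and report whether two image packages could be installed side by side.
# Assumption: packages sharing an ABI string conflict, as dpkg reported.
kernel_abi() {
    local name
    # Strip the "linux-image-" prefix, then an optional "unsigned-" part.
    name=${1#linux-image-}
    echo "${name#unsigned-}"
}

can_coexist() {
    # Succeeds (exit status 0) only when the two ABI strings differ.
    [ "$(kernel_abi "$1")" != "$(kernel_abi "$2")" ]
}
```

With the two package names from the dpkg output, can_coexist fails, which matches the install error; a test package built with a different ABI number would pass this check.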

Manoj Iyer (manjo) wrote :

That is because the package I provided has the same version number as the -proposed package. My bad, I should have accounted for the -proposed version. I will bump the version and update the package in the PPA.

Changed in ubuntu-power-systems:
importance: High → Critical
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
importance: High → Critical
Changed in linux (Ubuntu Bionic):
importance: High → Critical
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-10 13:01 EDT-------
I have had some luck reproducing this, on ltc-boston113 (previously unable to reproduce there). I had altered the boot parameters to remove "quiet splash" and added "qla2xxx.logging=0x1e400000", and got the kworker panic during boot (did not even reach login prompt). This was a fresh install of 18.04 -proposed, with the 4.15.0-20.21 kernel. What is the status of the experimental patched kernel?
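
For anyone repeating this, the boot parameter change described above can be sketched as a small string transformation. The helper name debug_cmdline is hypothetical, and the logging mask is simply the value quoted in the comment; how the resulting command line is made persistent depends on the boot loader setup.

```shell
# Hypothetical helper: take a kernel command line, drop "quiet" and "splash",
# and append the qla2xxx logging mask quoted in the comment above.
# The body runs in a subshell so "set -f" and the variables stay local.
debug_cmdline() (
    set -f          # split on whitespace only, no filename globbing
    out=""
    for word in $1; do
        case $word in
            quiet|splash) ;;              # drop these two words
            *) out="$out$word " ;;        # keep everything else, in order
        esac
    done
    echo "${out}qla2xxx.logging=0x1e400000"
)
```

On a typical Ubuntu install the result would go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub; that step is an assumption about the test machine, not something stated in this bug.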

Manoj Iyer (manjo) wrote :

It's building in the PPA as we speak; it should be ready for your use in about 45 minutes. You can monitor progress here: https://launchpad.net/~ubuntu-power-triage/+archive/ubuntu/lp1770003/+packages

Manoj Iyer (manjo) wrote :

You could preseed your installer to install the kernel from the PPA. I have not tested the instructions below, but they should give you an idea of how to preseed. From your comment above, it looks like you might not have a booting system on which to install the new kernel.

Assuming you are using a netboot install.

-- tell installer grub to use preseed --
Edit ubuntu-installer/ppc64el/grub/grub.cfg and add your preseed file. Like for example:

menuentry 'Install' {
    set background_color=black
    linux /ubuntu-installer/ppc64el/linux auto=true priority=critical url=http://<web server>/preseed/preseed.ppc64el --- quiet
    initrd /ubuntu-installer/ppc64el/initrd.gz
}

-- preseed example --
You can use late-commands in your preseed file to install the PPA kernel like:

# Install kernel from PPA
# (preseed/late_command can be set only once, so the commands are chained)
d-i preseed/late_command string \
    in-target add-apt-repository -y ppa:ubuntu-power-triage/lp1770003; \
    in-target apt update; \
    in-target apt install -y linux-image-unsigned-4.15.0-20-generic linux-modules-extra-4.15.0-20-generic

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-10 14:31 EDT-------
I know we didn't get notification yet, but I saw that "build.7" was on the PPA so I tried to install it. I get the same results as before: errors from dpkg and failed install, and the system is left in a crippled state where the qla2xxx driver (perhaps others) loads but is not functional (cannot access the SAN).

------- Comment From <email address hidden> 2018-05-10 14:39 EDT-------
I should add that I am *not* using "force" on dpkg when trying to install, and yet the system seems to have been broken even though the dpkg command reported an error.

Manoj Iyer (manjo) wrote :

I don't see the issue you are reporting. I don't have the storage device that uses this driver, but I am able to install the new kernel from PPA and load the qla2xxx driver on our P9.

ubuntu@dradis:~$ uname -a
Linux dradis 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:14:44 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

ubuntu@dradis:~$ sudo add-apt-repository -y ppa:ubuntu-power-triage/lp1770003

ubuntu@dradis:~$ apt search linux-image-unsigned
Sorting... Done
Full Text Search... Done
linux-image-unsigned-4.15.0-20-generic/bionic 4.15.0-20.23~lp1770003+build.7 ppc64el
  Linux kernel image for version 4.15.0 on PowerPC 64el SMP

ubuntu@dradis:~$ apt search linux-modules
Sorting... Done
Full Text Search... Done
linux-modules-4.15.0-20-generic/bionic 4.15.0-20.23~lp1770003+build.7 ppc64el [upgradable from: 4.15.0-20.21]
  Linux kernel extra modules for version 4.15.0 on PowerPC 64el SMP

linux-modules-extra-4.15.0-20-generic/bionic 4.15.0-20.23~lp1770003+build.7 ppc64el [upgradable from: 4.15.0-20.21]
  Linux kernel extra modules for version 4.15.0 on PowerPC 64el SMP

ubuntu@dradis:~$ sudo apt install -y --assume-yes linux-image-unsigned-4.15.0-20-generic linux-modules-4.15.0-20-generic linux-modules-extra-4.15.0-20-generic
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  linux-headers-generic
Use 'sudo apt autoremove' to remove it.
Suggested packages:
  fdutils linux-doc-4.15.0 | linux-source-4.15.0 linux-tools
The following packages will be REMOVED:
  linux-generic linux-image-4.15.0-20-generic linux-image-generic
The following NEW packages will be installed:
  linux-image-unsigned-4.15.0-20-generic
The following packages will be upgraded:
  linux-modules-4.15.0-20-generic linux-modules-extra-4.15.0-20-generic
2 upgraded, 1 newly installed, 3 to remove and 2 not upgraded.
Need to get 50.1 MB of archives.
After this operation, 72.7 kB of additional disk space will be used.
Get:1 http://ppa.launchpad.net/ubuntu-power-triage/lp1770003/ubuntu bionic/main ppc64el linux-image-unsigned-4.15.0-20-generic ppc64el 4.15.0-20.23~lp1770003+build.7 [6143 kB]
Get:2 http://ppa.launchpad.net/ubuntu-power-triage/lp1770003/ubuntu bionic/main ppc64el linux-modules-4.15.0-20-generic ppc64el 4.15.0-20.23~lp1770003+build.7 [12.5 MB]
Get:3 http://ppa.launchpad.net/ubuntu-power-triage/lp1770003/ubuntu bionic/main ppc64el linux-modules-extra-4.15.0-20-generic ppc64el 4.15.0-20.23~lp1770003+build.7 [31.5 MB]
Fetched 50.1 MB in 1min 10s (714 kB/s)
(Reading database ... 64762 files and directories currently installed.)
Removing linux-generic (4.15.0.20.23) ...
Removing linux-image-generic (4.15.0.20.23) ...
dpkg: linux-image-4.15.0-20-generic: dependency problems, but removing anyway as you requested:
 linux-modules-extra-4.15.0-20-generic depends on linux-image-4.15.0-20-generic | linux-image-unsigned-4.15.0-20-generic; however:
  Package linux-image-4.15.0-20-generic is to be removed.
  Package linux-image-unsigned-4.15.0-20-generic is not installed.

Removing linux-image-4.15.0-20-generic (4.15.0-20.21...


bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-10 16:47 EDT-------
I can't tell what 'dpkg' command syntax is being used underneath, but I do see this message in your output which indicates something is not quite right:

Removing linux-image-4.15.0-20-generic (4.15.0-20.21) ...
W: Removing the running kernel
W: Last kernel image has been removed, so removing the default symlinks

That suggests that this package is still replacing the normal kernel instead of installing in addition to it. What is in /boot after your install? Do you have two kernels or only one?
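
One way to answer the /boot question is to list the kernel versions that have an image there. This is a minimal sketch; the helper name list_installed_kernels is hypothetical, and it assumes the usual image naming (vmlinux-<version> on ppc64el, vmlinuz-<version> on some other architectures).

```shell
# Hypothetical helper: list the kernel versions that have an image in a boot
# directory (default /boot). Matches vmlinux-<version> and vmlinuz-<version>.
list_installed_kernels() {
    local dir f base
    dir=${1:-/boot}
    for f in "$dir"/vmlinu[xz]-*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        base=${f##*/}                # drop the directory part
        echo "${base#vmlinu?-}"      # drop the vmlinux-/vmlinuz- prefix
    done | sort -u
}
```

Two lines of output would mean the test kernel was installed in addition to the stock one; a single line would mean the stock image was replaced.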

Manoj Iyer (manjo) wrote :

Sorry, my apologies .. I misunderstood your requirement ..again.. I have bumped the kernel version to 4.15.0-21 from 4.15.0-20. After you install you should be able to choose between old and new kernel. It will take approx 5hrs for the kernel to be published in the PPA.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-10 20:49 EDT-------
Thank you, Manoj

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-11 09:55 EDT-------
While trying out the proposed kernel, which does now install as expected so that I can boot either, I run into a problem accessing the SAN. With the stock 4.15.0-20-generic kernel it works fine. With the 4.15.0-21 (lp1770003+build.1) kernel the system boots without any SAN disks discovered. I will attach boot logs for both cases, with qla2xxx extended error logging enabled. Since our internal-built test kernel doesn't exhibit this problem, I have to think something has changed in the set of patches being applied. We will need to review the patch lists.

bugproxy (bugproxy) wrote : boot of test kernel, no SAN disks detected

------- Comment (attachment only) From <email address hidden> 2018-05-11 09:56 EDT-------

bugproxy (bugproxy) wrote : boot of standard kernel, all SAN disks detected

------- Comment (attachment only) From <email address hidden> 2018-05-11 09:56 EDT-------

Breno Leitão (breno-leitao) wrote :

Manoj,

Do you have the git tree you used to build this kernel? I would like to check whether any backport is missing.

Breno Leitão (breno-leitao) wrote :

I also created a vimdiff of both logs, and I noted something that caused the whole difference later:

On the OK kernel, I see:
   [0001:03:00.0]-001d: : Found an ISP2532 irq 41 iobase 0x00000000db2a8857.

On the NOK kernel I see:
   [0001:03:00.0]-001d: : Found an ISP2532 irq 41 iobase 0x000000006baeaf0e.

Later, the addresses are different again:

OK Kernel:
   [0001:03:00.1]-001d: : Found an ISP2532 irq 42 iobase 0x00000000c67eaba1

NOK Kernel:
   [0001:03:00.1]-001d: : Found an ISP2532 irq 42 iobase 0x00000000507ae35c

After that, the OK kernel follows with:
   [0001:03:00.0]-580e:2: Asynchronous P2P MODE received
   [0001:03:00.0]-18b9:2: Format 1: VP[0] enabled - status 0 - with port id 050500.
   [0001:03:00.0]-5875:2: Format 1: Remote WWPN 20:05:00:05:1e:02:da:3e.

While the NOK kernel follows with:
   [0001:03:00.0]-5809:2: LIP occurred (f700)
   [0001:03:00.0]-580c:2: LIP reset occurred (f7f7).

Later, when the OK kernel seems to detect something, as:
   [0001:03:00.0]-289f:2: Device wrap (030a00).
   [0001:03:00.0]-28d8:2: qla24xx_fcport_handle_login 50:05:07:68:02:16:5e:37 DS 0 LS 7 P 0 fl 3 confl
   [0001:03:00.0]-28bd:2: qla24xx_fcport_handle_login 982 50:05:07:68:02:16:5e:37 post gnl

The NOK kernel just prints:
   [0001:03:00.0]-107ff:2: Async-gpnft hdl=2 FC4Type 8.

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-12 08:42 EDT-------
(In reply to comment #42)
>...
>
> After that, the OK kernel follows with:
> [0001:03:00.0]-580e:2: Asynchronous P2P MODE received
> [0001:03:00.0]-18b9:2: Format 1: VP[0] enabled - status 0 - with port id
> 050500.
> [0001:03:00.0]-5875:2: Format 1: Remote WWPN 20:05:00:05:1e:02:da:3e.
>
> While the NOK kernel follows with:
> [0001:03:00.0]-5809:2: LIP occurred (f700)
> [0001:03:00.0]-580c:2: LIP reset occurred (f7f7).
>
> Later, when the OK kernel seems to detect something, as:
> [0001:03:00.0]-289f:2: Device wrap (030a00).
> [0001:03:00.0]-28d8:2: qla24xx_fcport_handle_login 50:05:07:68:02:16:5e:37
> DS 0 LS 7 P 0 fl 3 confl
> [0001:03:00.0]-28bd:2: qla24xx_fcport_handle_login 982
> 50:05:07:68:02:16:5e:37 post gnl
>
> The NOK kernel just prints:
> [0001:03:00.0]-107ff:2: Async-gpnft hdl=2 FC4Type 8.

Looks like the OK kernel completes the FC login, while the NOK kernel does not. I believe there were some extra patches added that may be missing other requirements. Looking at the two different patch lists, I see this:

1) "scsi: qla2xxx: Fixup locking for session deletion" is missing from the NOK kernel.

2) There are 16 added patches in the NOK kernel, which were not required when we built our test kernel.

It's possible those 16 patches are missing some critical companion patches, but it should not be necessary to add those 16.
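
The comparison of the two patch lists can be reproduced mechanically. This is a sketch only, assuming each list has been saved to a file with one patch subject per line; the helper name only_in_first is hypothetical.

```shell
# Hypothetical helper: given two files with one patch subject per line,
# print the subjects that appear in the first file but not in the second.
only_in_first() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -Fxv -f "$2" "$1"
}
```

Called with our list first and the NOK list second, it prints what is missing from the NOK kernel; with the arguments swapped, it prints the added patches. The subject lists themselves could come from something like git log --format=%s restricted to drivers/scsi/qla2xxx on each tree.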

I don't think it is any more acceptable for SRU, but I'll bring it up anyway: another option is to do a full qla2xxx driver refresh to version 10.00.00.04-k, plus the "scsi: qla2xxx: Fixup locking for session deletion" patch (a vital fix).

bugproxy (bugproxy)
tags: removed: bugnameltc-167562 severity-high triage-g
bugproxy (bugproxy)
tags: added: bugnameltc-167562 severity-high
Frank Heimes (fheimes)
tags: added: triage-g
Breno Leitão (breno-leitao) wrote :

Hi,

Since Manoj's kernel didn't work, I created a kernel with the fixes above and it is working on ppc64el (in a 24-hour test).

These are the patches I added:

 scsi: qla2xxx: Fixup locking for session deletion
 scsi: qla2xxx: Fix double free bug after firmware timeout
 scsi: qla2xxx: Prevent relogin trigger from sending too many commands
 scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
 scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
 scsi: qla2xxx: Serialize session free in qlt_free_session_done
 scsi: qla2xxx: Serialize session deletion by using work_lock
 scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
 scsi: qla2xxx: Fix session cleanup for N2N

You can find the patches at https://github.com/leitao/linux/commits/bionic

Is it possible to include these patches in the next SRU?

tags: added: kernel-key
Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

Hello,

I built a test kernel with your backports. They all apply to the current Bionic kernel in the master-next branch: version 4.15.0-22.23.

 The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1770003

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-image-unsigned, linux-modules and linux-modules-extra .deb packages.
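
The package rule in the note above can be written as a small check. This is a sketch under the assumption that sort -V (GNU coreutils) is available; the helper name test_kernel_debs is hypothetical and it prints package name prefixes only, not full .deb file names.

```shell
# Hypothetical helper: print the .deb package name prefixes to install for a
# test kernel of the given version, following the note above.
test_kernel_debs() {
    local version lowest
    version=$1
    # sort -V orders version strings; if 4.15 sorts first (or ties),
    # the given version is 4.15 or newer.
    lowest=$(printf '%s\n' "$version" 4.15 | sort -V | head -n 1)
    if [ "$lowest" = "4.15" ]; then
        echo "linux-image-unsigned linux-modules linux-modules-extra"
    else
        echo "linux-image linux-image-extra"
    fi
}
```

For example, test_kernel_debs 4.15.0-22 selects the three-package set, while a 4.13.0 test kernel selects the older two-package set.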

Thanks in advance!

Breno Leitão (breno-leitao) wrote :

Doug said:

Using kernel from http://kernel.ubuntu.com/~jsalisbury/lp1770003, I have
confirmed that the disks are discovered and I am running some scenarios now.
Doing portdisable/portenable from the FC switch occasionally, while running HTX
I/O load. No problems seen so far.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-15 13:47 EDT-------
Using kernel from http://kernel.ubuntu.com/~jsalisbury/lp1770003, I have confirmed that the disks are discovered and I am running some scenarios now. Doing portdisable/portenable from the FC switch occasionally, while running HTX I/O load. No problems seen so far.

Joseph Salisbury (jsalisbury) wrote :
description: updated
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-16 10:04 EDT-------
(In reply to comment #56)
> Created attachment 127215 [details]
> [PATCH][SRU][Bionic 1/9] scsi: qla2xxx: Fix session cleanup for N2N

sorry folks,

After the submission I noticed that the authorship I had set in the patch set was wrong, so I updated patch 1/9, fixing the author to Quinn Tran instead.

bugproxy (bugproxy) wrote : [PATCH][SRU][Bionic 1/9] scsi: qla2xxx: Fix session cleanup for N2N

------- Comment (attachment only) From <email address hidden> 2018-05-16 10:00 EDT-------

Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-24 14:32 EDT-------
Canonical, please confirm. I see kernel linux-image-4.15.0-23-generic as the latest for -proposed. Is this the kernel that contains this fix?

Joseph Salisbury (jsalisbury) wrote :

That is the correct version, 4.15.0-23.25. Here are the fixes in the changelog:

* qla2xxx: Fix page fault at kmem_cache_alloc_node() (LP: #1770003)
    - scsi: qla2xxx: Fix session cleanup for N2N
    - scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
    - scsi: qla2xxx: Serialize session deletion by using work_lock
    - scsi: qla2xxx: Serialize session free in qlt_free_session_done
    - scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
    - scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
    - scsi: qla2xxx: Prevent relogin trigger from sending too many commands
    - scsi: qla2xxx: Fix double free bug after firmware timeout
    - scsi: qla2xxx: Fixup locking for session deletion

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-25 07:47 EDT-------
We have installed this kernel and verified that it is working. We are not seeing the panics.

I'm not certain how to change the tag from this end, but I consider this verified now.
Thanks.

tags: added: verification-done-bionic
removed: verification-needed-bionic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-04 09:42 EDT-------
Canonical or Breno,

can we move this to accepted/verified? When is the release to customers due?

Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Manoj Iyer (manjo) wrote :

As per the kernel team, the release from -proposed to -updates should happen today, June 11th.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-23.25

---------------
linux (4.15.0-23.25) bionic; urgency=medium

  * linux: 4.15.0-23.25 -proposed tracker (LP: #1772927)

  * arm64 SDEI support needs trampoline code for KPTI (LP: #1768630)
    - arm64: mmu: add the entry trampolines start/end section markers into
      sections.h
    - arm64: sdei: Add trampoline code for remapping the kernel

  * Some PCIe errors not surfaced through rasdaemon (LP: #1769730)
    - ACPI: APEI: handle PCIe AER errors in separate function
    - ACPI: APEI: call into AER handling regardless of severity

  * qla2xxx: Fix page fault at kmem_cache_alloc_node() (LP: #1770003)
    - scsi: qla2xxx: Fix session cleanup for N2N
    - scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
    - scsi: qla2xxx: Serialize session deletion by using work_lock
    - scsi: qla2xxx: Serialize session free in qlt_free_session_done
    - scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
    - scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
    - scsi: qla2xxx: Prevent relogin trigger from sending too many commands
    - scsi: qla2xxx: Fix double free bug after firmware timeout
    - scsi: qla2xxx: Fixup locking for session deletion

  * Several hisi_sas bug fixes (LP: #1768974)
    - scsi: hisi_sas: dt-bindings: add an property of signal attenuation
    - scsi: hisi_sas: support the property of signal attenuation for v2 hw
    - scsi: hisi_sas: fix the issue of link rate inconsistency
    - scsi: hisi_sas: fix the issue of setting linkrate register
    - scsi: hisi_sas: increase timer expire of internal abort task
    - scsi: hisi_sas: remove unused variable hisi_sas_devices.running_req
    - scsi: hisi_sas: fix return value of hisi_sas_task_prep()
    - scsi: hisi_sas: Code cleanup and minor bug fixes

  * [bionic] machine stuck and bonding not working well when nvmet_rdma module
    is loaded (LP: #1764982)
    - nvmet-rdma: Don't flush system_wq by default during remove_one
    - nvme-rdma: Don't flush delete_wq by default during remove_one

  * Warnings/hang during error handling of SATA disks on SAS controller
    (LP: #1768971)
    - scsi: libsas: defer ata device eh commands to libata

  * Hotplugging a SATA disk into a SAS controller may cause crash (LP: #1768948)
    - ata: do not schedule hot plug if it is a sas host

  * ISST-LTE:pKVM:Ubuntu1804: rcu_sched self-detected stall on CPU follow by CPU
    ATTEMPT TO RE-ENTER FIRMWARE! (LP: #1767927)
    - powerpc/powernv: Handle unknown OPAL errors in opal_nvram_write()
    - powerpc/64s: return more carefully from sreset NMI
    - powerpc/64s: sreset panic if there is no debugger or crash dump handlers

  * fsnotify: Fix fsnotify_mark_connector race (LP: #1765564)
    - fsnotify: Fix fsnotify_mark_connector race

  * Hang on network interface removal in Xen virtual machine (LP: #1771620)
    - xen-netfront: Fix hang on device removal

  * HiSilicon HNS NIC names are truncated in /proc/interrupts (LP: #1765977)
    - net: hns: Avoid action name truncation

  * Ubuntu 18.04 kernel crashed while in degraded mode (LP: #1770849)
    - SAUCE: powerpc/perf: Fix memory allocation for...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-18 08:33 EDT-------
closed per previous comment

Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released