Boston-LC:bos1u1: Stress test on Qlogic Fibre Channel in Ubuntu KVM guest crashes KVM host in qlt_free_session_done

Bug #1750441 reported by bugproxy on 2018-02-19
This bug affects 1 person
Affects: The Ubuntu-power-systems project | Importance: High | Assigned to: Canonical Kernel Team
Affects: linux (Ubuntu) | Importance: High | Assigned to: Joseph Salisbury
Affects: linux (Ubuntu Bionic) | Importance: High | Assigned to: Joseph Salisbury

Bug Description

Problem Description:
=============
- A Qlogic Fibre Channel adapter is passed through (PCI passthrough) from an Ubuntu 18.04 KVM host to an Ubuntu 18.04 KVM guest.

- A stress test on the Qlogic Fibre Channel adapter in the Ubuntu KVM guest caused the KVM host to crash in qlt_free_session_done.

- Stack trace, exception summary, and register dump from the KVM host (xmon):

91:mon> t
[c000200e4e81fb60] c00800001162f044 qlt_free_session_done+0x4ec/0x680 [qla2xxx] (unreliable)
[c000200e4e81fc90] c00000000012fbb8 process_one_work+0x298/0x5a0
[c000200e4e81fd20] c00000000012ff58 worker_thread+0x98/0x630
[c000200e4e81fdc0] c000000000138ae8 kthread+0x1a8/0x1b0
[c000200e4e81fe30] c00000000000b528 ret_from_kernel_thread+0x5c/0xb4

91:mon> e
cpu 0x91: Vector: 300 (Data Access) at [c000200e4e81f8e0]
    pc: c00800001162ed58: qlt_free_session_done+0x200/0x680 [qla2xxx]
    lr: c00800001162eca8: qlt_free_session_done+0x150/0x680 [qla2xxx]
    sp: c000200e4e81fb60
   msr: 900000000280b033
   dar: 20
 dsisr: 40000000
  current = 0xc000200e4e7b0e00
  paca = 0xc00000000fae3b00 softe: 0 irq_happened: 0x01
    pid = 1119, comm = kworker/145:1
Linux version 4.15.0-041500rc9-generic (kernel@tangerine) (gcc version 7.2.0 (Ubuntu 7.2.0-6ubuntu1)) #201801212130 SMP Mon Jan 22 03:36:42 UTC 2018

91:mon> r
R00 = c00800001162eca8 R16 = 0000000000000000
R01 = c000200e4e81fb60 R17 = 0000000000000000
R02 = c00800001166ad60 R18 = 0000000000000000
R03 = 0000000000000001 R19 = 0000000000000000
R04 = c000200e44f8c7f8 R20 = c000200e618e7d80
R05 = 000000000000f087 R21 = 0000000000000000
R06 = c00800001165e6c8 R22 = 0000000000000001
R07 = c00800001164adb0 R23 = c000200e44f99d24
R08 = 0000000000000000 R24 = 0000000000000402
R09 = 0000000000000000 R25 = 0000000000000000
R10 = 0000000000000000 R26 = c000000fe1270c20
R11 = c00800001163e170 R27 = c000200e44f99000
R12 = c000000000cfccf0 R28 = c00800001164adb0
R13 = c00000000fae3b00 R29 = c000000fe1270c00
R14 = c000000000138948 R30 = c000200e44f8c7f8
R15 = c000200e4f019440 R31 = c000000fe1270cc0
pc = c00800001162ed58 qlt_free_session_done+0x200/0x680 [qla2xxx]
cfar= c00800001162ed1c qlt_free_session_done+0x1c4/0x680 [qla2xxx]
lr = c00800001162eca8 qlt_free_session_done+0x150/0x680 [qla2xxx]
msr = 900000000280b033 cr = 28002284
ctr = c000000000cfccf0 xer = 0000000000000000 trap = 300
dar = 0000000000000020 dsisr = 40000000
91:mon>

The crash location seems close to this one fixed about two weeks ago:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/scsi/qla2xxx/qla_os.c?h=next-20180212&id=2ce87cc5b269510de9ca1185ca8a6e10ec78c069

scsi: qla2xxx: Fix memory corruption during hba reset test
This patch fixes memory corruption while performing HBA Reset test.

Following stack trace is seen:

[ 466.397219] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 466.433669] IP: [<ffffffffc06f5dd0>] qlt_free_session_done+0x260/0x5f0 [qla2xxx]
[ 466.467731] PGD 0
[ 466.476718] Oops: 0000 [#1] SMP
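
The similarity between the two crashes can be checked mechanically: the xmon output from the KVM host reports dar = 20 (hex) inside qlt_free_session_done, and the upstream oops reports a NULL pointer dereference at 0000000000000020 in the same function. The following is a minimal Python sketch of that comparison, assuming the two input lines are copied from the traces above. The helper names and the 4 KiB cutoff are illustrative assumptions, not part of xmon or any kernel tool.

```python
import re

# Lines copied from this report: the xmon exception summary from the KVM host
# and the x86 oops quoted in the upstream commit message.
XMON_DAR = "   dar: 20"
UPSTREAM_OOPS = ("BUG: unable to handle kernel NULL pointer dereference "
                 "at 0000000000000020")

PAGE_SIZE = 0x1000  # assumption: 4 KiB, used only as a "near NULL" cutoff


def fault_address(line):
    """Return the hex address that ends a 'dar: <hex>' or '... at <hex>' line."""
    match = re.search(r"(?:dar:|\bat)\s+([0-9a-fA-F]+)\s*$", line)
    if match is None:
        raise ValueError("no fault address found in: " + line)
    return int(match.group(1), 16)


def looks_like_null_deref(addr):
    """Heuristic: an address in the first page is usually NULL plus a field offset."""
    return 0 <= addr < PAGE_SIZE


host_addr = fault_address(XMON_DAR)
upstream_addr = fault_address(UPSTREAM_OOPS)

print(hex(host_addr), hex(upstream_addr))
print(looks_like_null_deref(host_addr) and host_addr == upstream_addr)
```

A matching fault address in the same function is only a hint that both crashes share a cause. The weekend test run with the patched kernel is what confirms it.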

- Luciano built and provided a patched kernel containing the new Qlogic change on Friday of last week.

root@bos1u1p1:~/chavez# ls linux-image*
linux-image-4.15.0-041500rc9-generic_4.15.0-041500rc9.201801212130_ppc64el.deb
linux-image-extra-4.15.0-041500rc9-generic_4.15.0-041500rc9.201801212130_ppc64el.deb

- I configured and ran the same test over the weekend, and it ran cleanly. The KVM host did not crash in qlt_free_session_done as it did before.

- So the patch fixed the problem.

Hi Canonical,

Please review and consider this a request to pull in commit 2ce87cc5b269510de9ca1185ca8a6e10ec78c069. Thanks!

bugproxy (bugproxy) on 2018-02-19
tags: added: architecture-ppc64le bugnameltc-164551 severity-high targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → High
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit 2ce87cc5b269510de9ca1185ca8a6e10ec78c069. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1750441

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

Changed in ubuntu-power-systems:
status: New → In Progress

------- Comment From <email address hidden> 2018-02-21 16:40 EDT-------
The test system will be available at the end of this week. I will set up the test and verify the test kernel at that time.

Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-21 15:10 EDT-------
- As noted in my previous comment, I was planning to set up my system and verify the patch.

- However, after I updated my system to the new Ubuntu 18.04, I ran into a new issue where an Ubuntu KVM guest could not start because of a Transactional Memory error (LTC bug 165081).

- I need to wait for the patch from LTC bug 165081 to become available so that I can get the KVM guest started again. Once that works, I can go back and verify the fix for this bug.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-13.14

---------------
linux (4.15.0-13.14) bionic; urgency=medium

  * linux: 4.15.0-13.14 -proposed tracker (LP: #1756408)

  * devpts: handle bind-mounts (LP: #1755857)
    - SAUCE: devpts: hoist out check for DEVPTS_SUPER_MAGIC
    - SAUCE: devpts: resolve devpts bind-mounts
    - SAUCE: devpts: comment devpts_mntget()
    - SAUCE: selftests: add devpts selftests

  * [bionic][arm64] d-i: add hisi_sas_v3_hw to scsi-modules (LP: #1756103)
    - d-i: add hisi_sas_v3_hw to scsi-modules

  * [Bionic][ARM64] enable ROCE and HNS3 driver support for hip08 SoC
    (LP: #1756097)
    - RDMA/hns: Refactor eq code for hip06
    - RDMA/hns: Add eq support of hip08
    - RDMA/hns: Add detailed comments for mb() call
    - RDMA/hns: Add rq inline data support for hip08 RoCE
    - RDMA/hns: Update the usage of sr_max and rr_max field
    - RDMA/hns: Set access flags of hip08 RoCE
    - RDMA/hns: Filter for zero length of sge in hip08 kernel mode
    - RDMA/hns: Fix QP state judgement before sending work requests
    - RDMA/hns: Assign dest_qp when deregistering mr
    - RDMA/hns: Fix endian problems around imm_data and rkey
    - RDMA/hns: Assign the correct value for tx_cqn
    - RDMA/hns: Create gsi qp in hip08
    - RDMA/hns: Add gsi qp support for modifying qp in hip08
    - RDMA/hns: Fill sq wqe context of ud type in hip08
    - RDMA/hns: Assign zero for pkey_index of wc in hip08
    - RDMA/hns: Update the verbs of polling for completion
    - RDMA/hns: Set the guid for hip08 RoCE device
    - net: hns3: Refactor of the reset interrupt handling logic
    - net: hns3: Add reset service task for handling reset requests
    - net: hns3: Refactors the requested reset & pending reset handling code
    - net: hns3: Add HNS3 VF IMP(Integrated Management Proc) cmd interface
    - net: hns3: Add mailbox support to VF driver
    - net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support
    - net: hns3: Add HNS3 VF driver to kernel build framework
    - net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC
    - net: hns3: Add mailbox support to PF driver
    - net: hns3: Change PF to add ring-vect binding & resetQ to mailbox
    - net: hns3: Add mailbox interrupt handling to PF driver
    - net: hns3: add support to query tqps number
    - net: hns3: add support to modify tqps number
    - net: hns3: change the returned tqp number by ethtool -x
    - net: hns3: free the ring_data structrue when change tqps
    - net: hns3: get rss_size_max from configuration but not hardcode
    - net: hns3: add a mask initialization for mac_vlan table
    - net: hns3: add vlan offload config command
    - net: hns3: add ethtool related offload command
    - net: hns3: add handling vlan tag offload in bd
    - net: hns3: cleanup mac auto-negotiation state query
    - net: hns3: fix for getting auto-negotiation state in hclge_get_autoneg
    - net: hns3: add support for set_pauseparam
    - net: hns3: add support to update flow control settings after autoneg
    - net: hns3: add Asym Pause support to phy default features
    - net: hns3: add support for querying advertised pause frame by ethtool ethx
    - net:...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: In Progress → Fix Released
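
To decide whether a given Bionic host already carries the fix, its installed linux package version can be compared against 4.15.0-13.14, the version named in the Launchpad Janitor message above. The following is a rough Python sketch under stated assumptions: it compares only the numeric fields and does not implement dpkg's full version-ordering rules. The function names and the sample versions passed to it are illustrative.

```python
import re

FIXED_IN = "4.15.0-13.14"  # package version from the Launchpad Janitor message


def version_key(version):
    """Split a package version such as '4.15.0-13.14' into integer fields."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def has_fix(installed, fixed_in=FIXED_IN):
    """True when the installed version is at or past the fixed version.

    Simplification: compares numeric fields only, not dpkg's ordering.
    """
    return version_key(installed) >= version_key(fixed_in)


# Sample inputs chosen for illustration only.
print(has_fix("4.15.0-12.13"))  # False
print(has_fix("4.15.0-13.14"))  # True
print(has_fix("4.15.0-20.21"))  # True
```

On a real system, `dpkg --compare-versions` is the authoritative way to order package versions. The sketch only illustrates the check.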