Boston-LC:bos1u1: Stress test on Qlogic Fibre Channel on Ubuntu KVM guest caused KVM host to crash in qlt_free_session_done call

Bug #1750441 reported by bugproxy
Affects                            Status         Importance   Assigned to             Milestone
The Ubuntu-power-systems project   Fix Released   High         Canonical Kernel Team
linux (Ubuntu)                     Fix Released   High         Joseph Salisbury
  Bionic                           Fix Released   High         Joseph Salisbury

Bug Description

Problem Description:
=============
- A Qlogic Fibre Channel adapter is passed through (PCI passthrough) from an Ubuntu 18.04 KVM host to an Ubuntu 18.04 KVM guest.

- A stress test on the Qlogic Fibre Channel adapter in the Ubuntu KVM guest caused the KVM host to crash in the qlt_free_session_done call.

- Stack traces from the KVM host are below:

91:mon> t
[c000200e4e81fb60] c00800001162f044 qlt_free_session_done+0x4ec/0x680 [qla2xxx] (unreliable)
[c000200e4e81fc90] c00000000012fbb8 process_one_work+0x298/0x5a0
[c000200e4e81fd20] c00000000012ff58 worker_thread+0x98/0x630
[c000200e4e81fdc0] c000000000138ae8 kthread+0x1a8/0x1b0
[c000200e4e81fe30] c00000000000b528 ret_from_kernel_thread+0x5c/0xb4

91:mon> e
cpu 0x91: Vector: 300 (Data Access) at [c000200e4e81f8e0]
    pc: c00800001162ed58: qlt_free_session_done+0x200/0x680 [qla2xxx]
    lr: c00800001162eca8: qlt_free_session_done+0x150/0x680 [qla2xxx]
    sp: c000200e4e81fb60
   msr: 900000000280b033
   dar: 20
 dsisr: 40000000
  current = 0xc000200e4e7b0e00
  paca = 0xc00000000fae3b00 softe: 0 irq_happened: 0x01
    pid = 1119, comm = kworker/145:1
Linux version 4.15.0-041500rc9-generic (kernel@tangerine) (gcc version 7.2.0 (Ubuntu 7.2.0-6ubuntu1)) #201801212130 SMP Mon Jan 22 03:36:42 UTC 2018

91:mon> r
R00 = c00800001162eca8 R16 = 0000000000000000
R01 = c000200e4e81fb60 R17 = 0000000000000000
R02 = c00800001166ad60 R18 = 0000000000000000
R03 = 0000000000000001 R19 = 0000000000000000
R04 = c000200e44f8c7f8 R20 = c000200e618e7d80
R05 = 000000000000f087 R21 = 0000000000000000
R06 = c00800001165e6c8 R22 = 0000000000000001
R07 = c00800001164adb0 R23 = c000200e44f99d24
R08 = 0000000000000000 R24 = 0000000000000402
R09 = 0000000000000000 R25 = 0000000000000000
R10 = 0000000000000000 R26 = c000000fe1270c20
R11 = c00800001163e170 R27 = c000200e44f99000
R12 = c000000000cfccf0 R28 = c00800001164adb0
R13 = c00000000fae3b00 R29 = c000000fe1270c00
R14 = c000000000138948 R30 = c000200e44f8c7f8
R15 = c000200e4f019440 R31 = c000000fe1270cc0
pc = c00800001162ed58 qlt_free_session_done+0x200/0x680 [qla2xxx]
cfar= c00800001162ed1c qlt_free_session_done+0x1c4/0x680 [qla2xxx]
lr = c00800001162eca8 qlt_free_session_done+0x150/0x680 [qla2xxx]
msr = 900000000280b033 cr = 28002284
ctr = c000000000cfccf0 xer = 0000000000000000 trap = 300
dar = 0000000000000020 dsisr = 40000000
91:mon>
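
The exception record above can be read mechanically. The following is a minimal sketch, not part of the original report, that extracts the faulting symbol and the data address register (dar) from the xmon "e" output. The function names and the 4 KiB threshold are assumptions made for illustration; a small fault address such as 0x20 is usually read as a member access through a NULL pointer, but that is a triage heuristic and not a proof.

```python
import re

# Excerpt copied from the xmon exception record ("e" command) in this report.
XMON_EXCEPTION = """\
cpu 0x91: Vector: 300 (Data Access) at [c000200e4e81f8e0]
    pc: c00800001162ed58: qlt_free_session_done+0x200/0x680 [qla2xxx]
    lr: c00800001162eca8: qlt_free_session_done+0x150/0x680 [qla2xxx]
    sp: c000200e4e81fb60
   dar: 20
 dsisr: 40000000
"""

# Assumption: any fault address below one 4 KiB page is treated as a
# NULL-base-plus-offset access. This is a heuristic chosen for the sketch.
NULL_PAGE_LIMIT = 0x1000


def parse_xmon_exception(text):
    """Return symbol, offset, module and fault address from an xmon record."""
    pc = re.search(
        r"^\s*pc:\s*[0-9a-f]+:\s*(\w+)\+0x([0-9a-f]+)/0x[0-9a-f]+\s*\[(\w+)\]",
        text,
        re.MULTILINE,
    )
    dar = re.search(r"^\s*dar:\s*([0-9a-f]+)", text, re.MULTILINE)
    if pc is None or dar is None:
        raise ValueError("text does not look like an xmon exception record")
    return {
        "symbol": pc.group(1),
        "offset": int(pc.group(2), 16),  # xmon prints hex without a 0x prefix
        "module": pc.group(3),
        "dar": int(dar.group(1), 16),
    }


def looks_like_null_deref(address):
    """Heuristic: a very small fault address suggests a NULL base pointer."""
    return address < NULL_PAGE_LIMIT


if __name__ == "__main__":
    info = parse_xmon_exception(XMON_EXCEPTION)
    print(info["symbol"], hex(info["offset"]), hex(info["dar"]))
    print("likely NULL dereference:", looks_like_null_deref(info["dar"]))
```

Under these assumptions the record resolves to qlt_free_session_done in the qla2xxx module with a fault address of 0x20.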

The crash location seems close to this one fixed about two weeks ago:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/scsi/qla2xxx/qla_os.c?h=next-20180212&id=2ce87cc5b269510de9ca1185ca8a6e10ec78c069

scsi: qla2xxx: Fix memory corruption during hba reset test
This patch fixes memory corruption while performing HBA Reset test.

Following stack trace is seen:

[ 466.397219] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 466.433669] IP: [<ffffffffc06f5dd0>] qlt_free_session_done+0x260/0x5f0 [qla2xxx]
[ 466.467731] PGD 0
[ 466.476718] Oops: 0000 [#1] SMP
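
The match between the two crashes can be checked field by field. This is a rough sketch, assuming that function name, module, and fault address together form a usable crash signature; the helper names are hypothetical. The instruction offsets (+0x200 on the Power host, +0x260 in the upstream trace) are left out of the comparison because they come from different architectures and builds.

```python
import re

# Oops lines quoted in the upstream commit message.
UPSTREAM_OOPS = """\
[ 466.397219] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 466.433669] IP: [<ffffffffc06f5dd0>] qlt_free_session_done+0x260/0x5f0 [qla2xxx]
"""

# Values reported by xmon on the KVM host in this bug (pc symbol and dar).
HOST_FAULT = {
    "symbol": "qlt_free_session_done",
    "module": "qla2xxx",
    "address": 0x20,
}


def parse_x86_oops(text):
    """Extract symbol, module and fault address from an x86 oops excerpt."""
    addr = re.search(r"NULL pointer dereference at ([0-9a-f]+)", text)
    ip = re.search(
        r"IP: \[<[0-9a-f]+>\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]", text
    )
    if addr is None or ip is None:
        raise ValueError("text does not look like an x86 oops excerpt")
    return {
        "symbol": ip.group(1),
        "module": ip.group(2),
        "address": int(addr.group(1), 16),
    }


def same_signature(a, b):
    """Assumed signature: identical symbol, module and fault address."""
    return all(a[key] == b[key] for key in ("symbol", "module", "address"))


if __name__ == "__main__":
    upstream = parse_x86_oops(UPSTREAM_OOPS)
    print("signatures match:", same_signature(upstream, HOST_FAULT))
```

A matching signature only suggests the same defect; the weekend test run with the patched kernel is what confirms it.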

- Luciano built and provided the patch with the new Qlogic change on Friday of last week.

root@bos1u1p1:~/chavez# ls linux-image*
linux-image-4.15.0-041500rc9-generic_4.15.0-041500rc9.201801212130_ppc64el.deb
linux-image-extra-4.15.0-041500rc9-generic_4.15.0-041500rc9.201801212130_ppc64el.deb

- I configured and ran the same test over the weekend, and it ran without problems. The KVM host did not crash in the qlt_free_session_done call as it did before.

- So the patch fixed the problem.

Hi Canonical,

Please review and consider this a request to pull in commit 2ce87cc5b269510de9ca1185ca8a6e10ec78c069. Thanks!

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-164551 severity-high targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → High
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit 2ce87cc5b269510de9ca1185ca8a6e10ec78c069. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1750441

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!
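
Since both packages are needed, a quick check of the downloaded files can catch a missing one before installation. This is a minimal sketch: the helper name is hypothetical, and the file names in the example are the ones listed earlier in this report, not the contents of the test kernel directory, which are not shown here.

```python
import re

# Both package families must be present, per the note in this comment.
REQUIRED_PACKAGES = ("linux-image", "linux-image-extra")


def missing_packages(filenames):
    """Return the required package names that have no .deb in filenames."""
    found = set()
    for name in filenames:
        # Match "linux-image-<version>" or "linux-image-extra-<version>".
        match = re.match(r"(linux-image(?:-extra)?)-\d", name)
        if match and name.endswith(".deb"):
            found.add(match.group(1))
    return [pkg for pkg in REQUIRED_PACKAGES if pkg not in found]


if __name__ == "__main__":
    # Example input: file names from the listing in the bug description.
    downloaded = [
        "linux-image-4.15.0-041500rc9-generic_4.15.0-041500rc9.201801212130_ppc64el.deb",
        "linux-image-extra-4.15.0-041500rc9-generic_4.15.0-041500rc9.201801212130_ppc64el.deb",
    ]
    print("missing:", missing_packages(downloaded))
```

The check looks only at file names; it does not verify that the two packages carry the same version.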

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → In Progress
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-02-21 16:40 EDT-------
The test system will be available at the end of this week. I will set up the test and verify the test kernel at that time.

Joseph Salisbury (jsalisbury) wrote :
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-21 15:10 EDT-------
- As noted in my previous comment, I was planning to set up my system and verify the patch.

- However, after I updated my system to the new Ubuntu 18.04, I ran into a new Ubuntu 18.04 issue where an Ubuntu KVM guest could not be started due to a Transactional Memory error (LTC bug 165081).

- I need to wait for the patch from LTC bug 165081 to become available so that I can get the KVM guest started again. Once that works, I can go back and verify the fix for this bug.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-13.14

---------------
linux (4.15.0-13.14) bionic; urgency=medium

  * linux: 4.15.0-13.14 -proposed tracker (LP: #1756408)

  * devpts: handle bind-mounts (LP: #1755857)
    - SAUCE: devpts: hoist out check for DEVPTS_SUPER_MAGIC
    - SAUCE: devpts: resolve devpts bind-mounts
    - SAUCE: devpts: comment devpts_mntget()
    - SAUCE: selftests: add devpts selftests

  * [bionic][arm64] d-i: add hisi_sas_v3_hw to scsi-modules (LP: #1756103)
    - d-i: add hisi_sas_v3_hw to scsi-modules

  * [Bionic][ARM64] enable ROCE and HNS3 driver support for hip08 SoC
    (LP: #1756097)
    - RDMA/hns: Refactor eq code for hip06
    - RDMA/hns: Add eq support of hip08
    - RDMA/hns: Add detailed comments for mb() call
    - RDMA/hns: Add rq inline data support for hip08 RoCE
    - RDMA/hns: Update the usage of sr_max and rr_max field
    - RDMA/hns: Set access flags of hip08 RoCE
    - RDMA/hns: Filter for zero length of sge in hip08 kernel mode
    - RDMA/hns: Fix QP state judgement before sending work requests
    - RDMA/hns: Assign dest_qp when deregistering mr
    - RDMA/hns: Fix endian problems around imm_data and rkey
    - RDMA/hns: Assign the correct value for tx_cqn
    - RDMA/hns: Create gsi qp in hip08
    - RDMA/hns: Add gsi qp support for modifying qp in hip08
    - RDMA/hns: Fill sq wqe context of ud type in hip08
    - RDMA/hns: Assign zero for pkey_index of wc in hip08
    - RDMA/hns: Update the verbs of polling for completion
    - RDMA/hns: Set the guid for hip08 RoCE device
    - net: hns3: Refactor of the reset interrupt handling logic
    - net: hns3: Add reset service task for handling reset requests
    - net: hns3: Refactors the requested reset & pending reset handling code
    - net: hns3: Add HNS3 VF IMP(Integrated Management Proc) cmd interface
    - net: hns3: Add mailbox support to VF driver
    - net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support
    - net: hns3: Add HNS3 VF driver to kernel build framework
    - net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC
    - net: hns3: Add mailbox support to PF driver
    - net: hns3: Change PF to add ring-vect binding & resetQ to mailbox
    - net: hns3: Add mailbox interrupt handling to PF driver
    - net: hns3: add support to query tqps number
    - net: hns3: add support to modify tqps number
    - net: hns3: change the returned tqp number by ethtool -x
    - net: hns3: free the ring_data structrue when change tqps
    - net: hns3: get rss_size_max from configuration but not hardcode
    - net: hns3: add a mask initialization for mac_vlan table
    - net: hns3: add vlan offload config command
    - net: hns3: add ethtool related offload command
    - net: hns3: add handling vlan tag offload in bd
    - net: hns3: cleanup mac auto-negotiation state query
    - net: hns3: fix for getting auto-negotiation state in hclge_get_autoneg
    - net: hns3: add support for set_pauseparam
    - net: hns3: add support to update flow control settings after autoneg
    - net: hns3: add Asym Pause support to phy default features
    - net: hns3: add support for querying advertised pause frame by ethtool ethx
    - net:...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
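
A host can be checked against the fixed package version with a simple comparison. This is only a sketch under stated assumptions: it handles Ubuntu kernel versions of the exact form "4.15.0-13.14", it assumes later uploads in the same 4.15.0 series keep the fix, and it is not a replacement for dpkg's own version comparison. The function names are hypothetical.

```python
import re

# Package version in which this bug is reported as fixed.
FIXED_VERSION = "4.15.0-13.14"

# Assumption: only versions shaped like "X.Y.Z-ABI.UPLOAD" are accepted.
VERSION_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)$")


def version_key(version):
    """Turn '4.15.0-13.14' into a tuple of integers for comparison."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        # Mainline builds such as '4.15.0-041500rc9' are rejected on purpose,
        # because their numbering is not comparable to Ubuntu package versions.
        raise ValueError(f"unsupported version format: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(installed_version):
    """Return True if installed_version is at or after the fixed version."""
    return version_key(installed_version) >= version_key(FIXED_VERSION)


if __name__ == "__main__":
    for candidate in ("4.15.0-12.13", "4.15.0-13.14", "4.15.0-20.21"):
        print(candidate, "has fix:", has_fix(candidate))
```

Rejecting unrecognised formats is deliberate: the mainline test kernel named in the crash dump would otherwise compare as newer than the fixed package even though it predates the fix.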
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Released
Brad Figg (brad-figg)
tags: added: cscc