IPR driver causes multipath to fail paths/stuck IO on Medium Errors

Bug #1682644 reported by bugproxy on 2017-04-13
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   High         Canonical Kernel Team
  Xenial         High         Unassigned
  Zesty          High         Unassigned

Bug Description

SRU Justification:

Impact: I/O to multipath disks with medium errors gets stuck (on IPR controllers)
Fix: upstream commit for the IPR driver that lets the SCSI layer handle the error
Testcase: perform I/O to a failing, multipathed disk (on an IPR
          controller) that returns SCSI Medium Errors; without the fix,
          the I/O gets stuck.
          The commit message describes a test case with sg_dd.

---Problem Description---
IPR driver causes multipath to fail paths/stuck IO on Medium Errors

This problem is resolved by the following patch, which was accepted upstream and is scheduled for 4.11.
The detailed problem description and resolution are described in the commit message.

> scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=785a470496d8e0a32e3d39f376984eb2c98ca5b3

Please apply to 17.04 (target 16.04.3 HWE kernel) and 16.04 (GA kernel).
Patch already applied to 17.10.

The business justification for the SRU is:

Clients with a dual-controller, multipathed IPR configuration that eventually run into a failing disk or failing sectors will experience an I/O hang once the drive reports a Medium Error. This can hang an application or even the root filesystem (whatever is doing I/O to the failing drive), potentially hanging the whole system.

Thanks.

---Additional Hardware Info---
Dual (IPR) controller setup, multipath enabled

---Steps to Reproduce---
1) Use a disk with bad sectors (or force such a condition via internal/special tools)
2) Multipath that disk
3) Run I/O to the multipath device on the bad sectors
4) Result: both paths are failed, and the I/O is stuck due to queue_if_no_path (enabled by default for IPR)
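
The outcome of these steps can be illustrated with a small toy model. This is only a sketch under stated assumptions, not kernel code: the function names are hypothetical, and the errno mapping (ENODATA when the SCSI layer evaluates the sense data, plain EIO when the driver's DID_PASSTHROUGH makes it skip that step) and the set of errors that multipath does not answer by failing a path reflect a reading of kernels of this era rather than anything stated in this report.

```python
# Toy model of the failure mode. NOT kernel code; all names are hypothetical.

# Assumption: errors that multipath attributes to the target/media and
# therefore does not answer by failing the path.
NO_PATH_FAILURE = {"EOPNOTSUPP", "EREMOTEIO", "EILSEQ", "ENODATA", "ENOSPC"}


def completion_error(scsi_layer_checks_sense):
    """Error a Medium Error completes with in this model."""
    # Assumption: if the SCSI layer gets to evaluate the sense data, the
    # command is reported as a medium error (ENODATA); if it is bypassed,
    # only a generic EIO reaches multipath.
    return "ENODATA" if scsi_layer_checks_sense else "EIO"


def submit_to_multipath(paths, scsi_layer_checks_sense, queue_if_no_path=True):
    """Return (outcome, paths still usable) for one I/O to a bad sector."""
    usable = list(paths)
    while usable:
        err = completion_error(scsi_layer_checks_sense)
        if err in NO_PATH_FAILURE:
            # The error belongs to the media, not the path: report it upward
            # and keep every path usable.
            return "failed with " + err, usable
        # The error is blamed on the path: fail it and retry on the next one.
        usable.pop(0)
    # No usable path is left: the I/O is queued indefinitely or failed.
    return ("queued (stuck)" if queue_if_no_path else "failed with EIO"), usable


if __name__ == "__main__":
    print(submit_to_multipath(["path-a", "path-b"], scsi_layer_checks_sense=False))
    print(submit_to_multipath(["path-a", "path-b"], scsi_layer_checks_sense=True))
```

In this model the first call corresponds to the unpatched driver (both paths failed, I/O queued) and the second to the patched one (the error is returned to the caller and both paths stay usable).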

bugproxy (bugproxy) on 2017-04-13
tags: added: architecture-ppc64le bugnameltc-153445 severity-critical targetmilestone-inin16043
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
tags: added: kernel-da-key
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged

------- Comment From <email address hidden> 2017-08-14 13:12 EDT-------
This one hasn't been touched in months... We need to get this fix into a 16.04 SRU...

Bug description updated w/ SRU template.

Patch submitted to kernel-team mailing list.

[SRU Z/X][PATCH] scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
https://lists.ubuntu.com/archives/kernel-team/2017-August/086518.html

Requested for
- Zesty/17.04 (target 16.04.3 HWE kernel / v4.10-based) and
- Xenial/16.04 (GA kernel / v4.4-based).

Already applied on
- Artful/17.10 (16.04.4 HWE kernel).

description: updated
Changed in linux (Ubuntu Xenial):
status: New → Triaged
Changed in linux (Ubuntu Zesty):
status: New → Triaged
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
Changed in linux (Ubuntu Zesty):
importance: Undecided → High
bugproxy (bugproxy) on 2017-08-24
tags: added: severity-high
removed: severity-critical
Changed in linux (Ubuntu Zesty):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Xenial):
status: Triaged → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
tags: added: verification-needed-xenial

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-01 19:50 EDT-------
Marking as verification done on Zesty.

Verified that the commit is applied in the master-next branch of the zesty kernel (it cannot be verified on hardware currently, but the code change is trivial and was tested on real hardware before, both in Ubuntu kernels and upstream for the upstream submission).

------- Comment From <email address hidden> 2017-09-01 19:55 EDT-------
(commit in zesty)
http://kernel.ubuntu.com/git/ubuntu/ubuntu-zesty.git/commit/?h=master-next&id=67317c9194ff46c043b10749b7ded3c2fed4be9a

tags: added: verification-done-zesty
removed: verification-needed-xenial verification-needed-zesty
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-01 20:30 EDT-------
Likewise for Xenial.

http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?h=master-next&id=6020175400028e2b5cdd8212a6b20d208a033973

Thanks.

tags: added: verification-done-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-35.39

---------------
linux (4.10.0-35.39) zesty; urgency=low

  * linux: 4.10.0-35.39 -proposed tracker (LP: #1716606)

  * kernel panic -not syncing: Fatal exception: panic_on_oops (LP: #1708399)
    - SAUCE: s390/mm: fix local TLB flushing vs. detach of an mm address space
    - SAUCE: s390/mm: fix race on mm->context.flush_mm

  * CVE-2017-1000251
    - Bluetooth: Properly check L2CAP config option output buffer length

linux (4.10.0-34.38) zesty; urgency=low

  * linux: 4.10.0-34.38 -proposed tracker (LP: #1713470)

  * Ubuntu 16.04.03: perf tool does not count pm_run_inst_cmpl with rcode on
    POWER9 DD2.0 (LP: #1709964)
    - powerpc/perf: Fix Power9 test_adder fields

  * HID: multitouch: Support ALPS PTP Stick and Touchpad devices (LP: #1712481)
    - HID: multitouch: Support PTP Stick and Touchpad device
    - SAUCE: HID: multitouch: Support ALPS PTP stick with pid 0x120A

  * igb: Support using Broadcom 54616 as PHY (LP: #1712024)
    - SAUCE: igb: add support for using Broadcom 54616 as PHY

  * RPT related fixes missing in Ubuntu 16.04.3 (LP: #1709220)
    - powerpc/mm/radix: Optimise tlbiel flush all case
    - powerpc/mm/radix: Improve _tlbiel_pid to be usable for PWC flushes
    - powerpc/mm/radix: Improve TLB/PWC flushes
    - powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_range

  * AMD RV platforms with SNPS 3.1 USB controller stop responding (S3 issue)
    (LP: #1711098)
    - usb: xhci: Issue stop EP command only when the EP state is running

  * dma-buf: performance issue when looking up the fence status (LP: #1711096)
    - dma-buf: avoid scheduling on fence status query v2

  * IPR driver causes multipath to fail paths/stuck IO on Medium Errors
    (LP: #1682644)
    - scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

  * Disable CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (LP: #1709171)
    - [Config] CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n for ppc64el

  * memory-hotplug test needs to be fixed (LP: #1710868)
    - selftests: typo correction for memory-hotplug test
    - selftests: check hot-pluggagble memory for memory-hotplug test
    - selftests: check percentage range for memory-hotplug test
    - selftests: add missing test name in memory-hotplug test
    - selftests: fix memory-hotplug test

  * Ubuntu 16.04.3: Qemu fails on P9 (LP: #1686019)
    - KVM: PPC: Pass kvm* to kvmppc_find_table()
    - KVM: PPC: Use preregistered memory API to access TCE list
    - KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
    - powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
    - powerpc/powernv/ioda2: Update iommu table base on ownership change
    - powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
    - powerpc/vfio_spapr_tce: Add reference counting to iommu_table
    - powerpc/mmu: Add real mode support for IOMMU preregistered memory
    - KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
    - KVM: PPC: Book3S HV: Add radix checks in real-mode hypercall handlers

  * [SRU][Zesty] [QDF2400] pl011 E44 erratum patch needed for 2.0 firmware and
    1.1 silicon (LP: #1709123)
    - tty: pl011: fix initialization or...

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-96.119

---------------
linux (4.4.0-96.119) xenial; urgency=low

  * linux: 4.4.0-96.119 -proposed tracker (LP: #1716613)

  * kernel panic -not syncing: Fatal exception: panic_on_oops (LP: #1708399)
    - s390/mm: no local TLB flush for clearing-by-ASCE IDTE
    - SAUCE: s390/mm: fix local TLB flushing vs. detach of an mm address space
    - SAUCE: s390/mm: fix race on mm->context.flush_mm

  * CVE-2017-1000251
    - Bluetooth: Properly check L2CAP config option output buffer length

linux (4.4.0-95.118) xenial; urgency=low

  * linux: 4.4.0-95.118 -proposed tracker (LP: #1715651)

  * Xenial update to 4.4.78 stable release broke Address Sanitizer
    (LP: #1715636)
    - mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes

linux (4.4.0-94.117) xenial; urgency=low

  * linux: 4.4.0-94.117 -proposed tracker (LP: #1713462)

  * mwifiex causes kernel oops when AP mode is enabled (LP: #1712746)
    - SAUCE: net/wireless: do not dereference invalid pointer
    - SAUCE: mwifiex: do not dereference invalid pointer

  * Backport more recent Broadcom bnxt_en driver (LP: #1711056)
    - SAUCE: bnxt_en_bpo: Import bnxt_en driver version 1.8.1
    - SAUCE: bnxt_en_bpo: Drop distro out-of-tree detection logic
    - SAUCE: bnxt_en_bpo: Remove unnecessary compile flags
    - SAUCE: bnxt_en_bpo: Move config settings to Kconfig
    - SAUCE: bnxt_en_bpo: Remove PCI_IDs handled by the regular driver
    - SAUCE: bnxt_en_bpo: Rename the backport driver to bnxt_en_bpo
    - bnxt_en_bpo: [Config] Enable CONFIG_BNXT_BPO=m

  * HID: multitouch: Support ALPS PTP Stick and Touchpad devices (LP: #1712481)
    - HID: multitouch: Support PTP Stick and Touchpad device
    - SAUCE: HID: multitouch: Support ALPS PTP stick with pid 0x120A

  * igb: Support using Broadcom 54616 as PHY (LP: #1712024)
    - SAUCE: igb: add support for using Broadcom 54616 as PHY

  * IPR driver causes multipath to fail paths/stuck IO on Medium Errors
    (LP: #1682644)
    - scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

  * accessing /dev/hvc1 with stress-ng on Ubuntu xenial causes crash
    (LP: #1711401)
    - tty/hvc: Use IRQF_SHARED for OPAL hvc consoles

  * memory-hotplug test needs to be fixed (LP: #1710868)
    - selftests: typo correction for memory-hotplug test
    - selftests: check hot-pluggagble memory for memory-hotplug test
    - selftests: check percentage range for memory-hotplug test
    - selftests: add missing test name in memory-hotplug test
    - selftests: fix memory-hotplug test

  * HP lt4132 LTE/HSPA+ 4G Module (03f0:a31d) does not work (LP: #1707643)
    - net: cdc_mbim: apply "NDP to end" quirk to HP lt4132

  * Migrating KSM page causes the VM lock up as the KSM page merging list is too
    large (LP: #1680513)
    - ksm: introduce ksm_max_page_sharing per page deduplication limit
    - ksm: fix use after free with merge_across_nodes = 0
    - ksm: cleanup stable_node chain collapse case
    - ksm: swap the two output parameters of chain/chain_prune
    - ksm: optimize refile of stable_node_dup at the head of the chain

  * sort ABI files with C.UTF-8 locale (LP: #1712345)
    - [Packaging] sort ABI ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released