IPR driver causes multipath to fail paths/stuck IO on Medium Errors

Bug #1682644 reported by bugproxy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Triaged
High
Canonical Kernel Team
Xenial
Fix Released
High
Unassigned
Zesty
Fix Released
High
Unassigned

Bug Description

SRU Justification:

Impact: stuck I/O to multipath disks with medium errors (on IPR controllers)
Fix: upstream commit for IPR driver to allow SCSI layer to handle the error
Testcase: perform I/O to a failing disk which is multipathed (on IPR
          controller), which returns SCSI Medium Errors (without the fix,
          the I/O gets stuck).
          the commit message describes a test-case w/ sg_dd.

---Problem Description---
IPR driver causes multipath to fail paths/stuck IO on Medium Errors

This problem is resolved with this upstream accepted patch, scheduled for 4.11.
The detailed problem description and resolution are described in the commit message.

> scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=785a470496d8e0a32e3d39f376984eb2c98ca5b3

Please apply to 17.04 (target 16.04.3 HWE kernel) and 16.04 (GA kernel).
Patch already applied to 17.10.

The business justification for the SRU is:

Clients with a dual-controller multipathed IPR configuration that eventually runs into failing disk/sectors, will experience an I/O hang once the drive reports a Medium Error, which can hang an application or even the root filesystem (whatever is doing I/O to the failing drive), potentially hanging the system.

Thanks.

---Additional Hardware Info---
Dual (IPR) controller setup, multipath enabled

---Steps to Reproduce---
1) Use a disk with bad sectors (or force such condition, via internal/special tools)
2) Multipath that disk
3) Run IO to the multipath device on the bad sectors
4) Both paths will be failed, and IO is stuck due to queue_if_no_path (enabled by default for IPR)

The detailed problem description and resolution are described in the commit message.

CVE References

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-153445 severity-critical targetmilestone-inin16043
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
tags: added: kernel-da-key
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-08-14 13:12 EDT-------
This one hasn't been touched in months... We need to get this fix into a 16.04 SRU...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Bug description updated w/ SRU template.

Patch submitted to kernel-team mailing list.

[SRU Z/X][PATCH] scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
https://lists.ubuntu.com/archives/kernel-team/2017-August/086518.html

Requested for
- Zesty/17.04 (target 16.04.3 HWE kernel / v4.10-based) and
- Xenial/16.04 (GA kernel / v4.4-based).

Already applied on
- Artful/17.10 (16.04.4 HWE kernel).

description: updated
Changed in linux (Ubuntu Xenial):
status: New → Triaged
Changed in linux (Ubuntu Zesty):
status: New → Triaged
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
Changed in linux (Ubuntu Zesty):
importance: Undecided → High
bugproxy (bugproxy)
tags: added: severity-high
removed: severity-critical
Changed in linux (Ubuntu Zesty):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Xenial):
status: Triaged → Fix Committed
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
tags: added: verification-needed-xenial
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-01 19:50 EDT-------
Marking as verification done on Zesty.

Verified that the commit is applied in the zesty's kernel master-next branch (cannot verify in hardware currently, but the code change is trivial and has been tested on real hardware before in Ubuntu kernels and upstream for upstream submission).

------- Comment From <email address hidden> 2017-09-01 19:55 EDT-------
(commit in zesty)
http://kernel.ubuntu.com/git/ubuntu/ubuntu-zesty.git/commit/?h=master-next&id=67317c9194ff46c043b10749b7ded3c2fed4be9a

tags: added: verification-done-zesty
removed: verification-needed-xenial verification-needed-zesty
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-01 20:30 EDT-------
Likewise for Xenial.

http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?h=master-next&id=6020175400028e2b5cdd8212a6b20d208a033973

Thanks.

tags: added: verification-done-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (4.2 KiB)

This bug was fixed in the package linux - 4.10.0-35.39

---------------
linux (4.10.0-35.39) zesty; urgency=low

  * linux: 4.10.0-35.39 -proposed tracker (LP: #1716606)

  * kernel panic -not syncing: Fatal exception: panic_on_oops (LP: #1708399)
    - SAUCE: s390/mm: fix local TLB flushing vs. detach of an mm address space
    - SAUCE: s390/mm: fix race on mm->context.flush_mm

  * CVE-2017-1000251
    - Bluetooth: Properly check L2CAP config option output buffer length

linux (4.10.0-34.38) zesty; urgency=low

  * linux: 4.10.0-34.38 -proposed tracker (LP: #1713470)

  * Ubuntu 16.04.03: perf tool does not count pm_run_inst_cmpl with rcode on
    POWER9 DD2.0 (LP: #1709964)
    - powerpc/perf: Fix Power9 test_adder fields

  * HID: multitouch: Support ALPS PTP Stick and Touchpad devices (LP: #1712481)
    - HID: multitouch: Support PTP Stick and Touchpad device
    - SAUCE: HID: multitouch: Support ALPS PTP stick with pid 0x120A

  * igb: Support using Broadcom 54616 as PHY (LP: #1712024)
    - SAUCE: igb: add support for using Broadcom 54616 as PHY

  * RPT related fixes missing in Ubuntu 16.04.3 (LP: #1709220)
    - powerpc/mm/radix: Optimise tlbiel flush all case
    - powerpc/mm/radix: Improve _tlbiel_pid to be usable for PWC flushes
    - powerpc/mm/radix: Improve TLB/PWC flushes
    - powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_range

  * AMD RV platforms with SNPS 3.1 USB controller stop responding (S3 issue)
    (LP: #1711098)
    - usb: xhci: Issue stop EP command only when the EP state is running

  * dma-buf: performance issue when looking up the fence status (LP: #1711096)
    - dma-buf: avoid scheduling on fence status query v2

  * IPR driver causes multipath to fail paths/stuck IO on Medium Errors
    (LP: #1682644)
    - scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

  * Disable CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (LP: #1709171)
    - [Config] CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n for ppc64el

  * memory-hotplug test needs to be fixed (LP: #1710868)
    - selftests: typo correction for memory-hotplug test
    - selftests: check hot-pluggagble memory for memory-hotplug test
    - selftests: check percentage range for memory-hotplug test
    - selftests: add missing test name in memory-hotplug test
    - selftests: fix memory-hotplug test

  * Ubuntu 16.04.3: Qemu fails on P9 (LP: #1686019)
    - KVM: PPC: Pass kvm* to kvmppc_find_table()
    - KVM: PPC: Use preregistered memory API to access TCE list
    - KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
    - powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
    - powerpc/powernv/ioda2: Update iommu table base on ownership change
    - powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
    - powerpc/vfio_spapr_tce: Add reference counting to iommu_table
    - powerpc/mmu: Add real mode support for IOMMU preregistered memory
    - KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
    - KVM: PPC: Book3S HV: Add radix checks in real-mode hypercall handlers

  * [SRU][Zesty] [QDF2400] pl011 E44 erratum patch needed for 2.0 firmware and
    1.1 silicon (LP: #1709123)
    - tty: pl011: fix initialization or...

Read more...

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (14.4 KiB)

This bug was fixed in the package linux - 4.4.0-96.119

---------------
linux (4.4.0-96.119) xenial; urgency=low

  * linux: 4.4.0-96.119 -proposed tracker (LP: #1716613)

  * kernel panic -not syncing: Fatal exception: panic_on_oops (LP: #1708399)
    - s390/mm: no local TLB flush for clearing-by-ASCE IDTE
    - SAUCE: s390/mm: fix local TLB flushing vs. detach of an mm address space
    - SAUCE: s390/mm: fix race on mm->context.flush_mm

  * CVE-2017-1000251
    - Bluetooth: Properly check L2CAP config option output buffer length

linux (4.4.0-95.118) xenial; urgency=low

  * linux: 4.4.0-95.118 -proposed tracker (LP: #1715651)

  * Xenial update to 4.4.78 stable release broke Address Sanitizer
    (LP: #1715636)
    - mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes

linux (4.4.0-94.117) xenial; urgency=low

  * linux: 4.4.0-94.117 -proposed tracker (LP: #1713462)

  * mwifiex causes kernel oops when AP mode is enabled (LP: #1712746)
    - SAUCE: net/wireless: do not dereference invalid pointer
    - SAUCE: mwifiex: do not dereference invalid pointer

  * Backport more recent Broadcom bnxt_en driver (LP: #1711056)
    - SAUCE: bnxt_en_bpo: Import bnxt_en driver version 1.8.1
    - SAUCE: bnxt_en_bpo: Drop distro out-of-tree detection logic
    - SAUCE: bnxt_en_bpo: Remove unnecessary compile flags
    - SAUCE: bnxt_en_bpo: Move config settings to Kconfig
    - SAUCE: bnxt_en_bpo: Remove PCI_IDs handled by the regular driver
    - SAUCE: bnxt_en_bpo: Rename the backport driver to bnxt_en_bpo
    - bnxt_en_bpo: [Config] Enable CONFIG_BNXT_BPO=m

  * HID: multitouch: Support ALPS PTP Stick and Touchpad devices (LP: #1712481)
    - HID: multitouch: Support PTP Stick and Touchpad device
    - SAUCE: HID: multitouch: Support ALPS PTP stick with pid 0x120A

  * igb: Support using Broadcom 54616 as PHY (LP: #1712024)
    - SAUCE: igb: add support for using Broadcom 54616 as PHY

  * IPR driver causes multipath to fail paths/stuck IO on Medium Errors
    (LP: #1682644)
    - scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

  * accessing /dev/hvc1 with stress-ng on Ubuntu xenial causes crash
    (LP: #1711401)
    - tty/hvc: Use IRQF_SHARED for OPAL hvc consoles

  * memory-hotplug test needs to be fixed (LP: #1710868)
    - selftests: typo correction for memory-hotplug test
    - selftests: check hot-pluggagble memory for memory-hotplug test
    - selftests: check percentage range for memory-hotplug test
    - selftests: add missing test name in memory-hotplug test
    - selftests: fix memory-hotplug test

  * HP lt4132 LTE/HSPA+ 4G Module (03f0:a31d) does not work (LP: #1707643)
    - net: cdc_mbim: apply "NDP to end" quirk to HP lt4132

  * Migrating KSM page causes the VM lock up as the KSM page merging list is too
    large (LP: #1680513)
    - ksm: introduce ksm_max_page_sharing per page deduplication limit
    - ksm: fix use after free with merge_across_nodes = 0
    - ksm: cleanup stable_node chain collapse case
    - ksm: swap the two output parameters of chain/chain_prune
    - ksm: optimize refile of stable_node_dup at the head of the chain

  * sort ABI files with C.UTF-8 locale (LP: #1712345)
    - [Packaging] sort ABI ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.