ISST-LTE:pNV: ppc64_cpu command is hung w HDs, SSDs and NVMe

Bug #1662666 reported by bugproxy on 2017-02-07
Affects           Importance   Assigned to
linux (Ubuntu)    Undecided    Tim Gardner
  Xenial          Undecided    Tim Gardner
  Yakkety         Undecided    Tim Gardner
  Zesty           Undecided    Tim Gardner

Bug Description

-- Problem Description --
The following upstream patches are needed for Ubuntu to fix a hang situation reported when executing ppc64_cpu --smt=on that occurs with various disk types. We need whichever ones have not yet been pulled into the base.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e57690fe009b2ab0cee8a57f53be634540e49c9d
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0e87e58bf60edb6bb28e493c7a143f41b091a5e5
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d1b1cea1e58477dad88ff769f54c0d2dfa56d923
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=36e1f3d107867b25c616c2fd294f5a1c9d4e5d09
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=71f79fb3179e69b0c1448a2101a866d871c66e7f
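
One way to work out which of the commits above are still missing from a given kernel tree is to look for each ID in that tree's history. The sketch below is illustrative only: the function names are hypothetical, and it assumes a backport either keeps the upstream ID or mentions it in its commit message (for example in a "cherry picked from commit" line), which may not hold for every tree.

```shell
# Print the commit ID embedded in a cgit URL of the form ...?id=<sha>.
commit_id_from_url() {
    printf '%s\n' "${1##*id=}"
}

# Return 0 if the git tree in $1 appears to carry upstream commit $2,
# either as an ancestor of HEAD or referenced in some commit message.
# Assumption: backports mention the upstream ID in their message.
commit_is_present() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD </dev/null 2>/dev/null && return 0
    [ -n "$(git -C "$1" log -1 --format=%h --grep="$2" </dev/null 2>/dev/null)" ]
}

# Read commit URLs on stdin and print the IDs not found in the tree $1.
missing_commits() {
    while IFS= read -r _url; do
        [ -n "$_url" ] || continue
        _sha=$(commit_id_from_url "$_url")
        commit_is_present "$1" "$_sha" || printf '%s\n' "$_sha"
    done
}
```

For example, with the seven URLs saved one per line in a file (hypothetically commits.txt) and a kernel checkout in a hypothetical directory ~/ubuntu-xenial, running `missing_commits ~/ubuntu-xenial < commits.txt` would print the IDs that this check could not find.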

The hang problem can be reproduced with the following shell script using an NVMe device executed on kernel versions 4.4.0-45 and 4.4.0-59 within 30 minutes. It was also reproduced on a 4.8.0-32-generic kernel, although it took over 3 hours to manifest.

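#!/bin/bash
# With no arguments: start the SMT toggler in the background, then read the
# NVMe device in a loop. With "breaker": toggle SMT off and on every 5 seconds.

if [[ ${#} -eq 0 ]]; then
        "${0}" breaker &
        while true; do
                # Keep sequential reads in flight on the NVMe device.
                dd if=/dev/nvme0n1 bs=1024k of=/dev/null
        done
elif [[ ${1} == "breaker" ]]; then
        while true; do
                ppc64_cpu --smt=off
                sleep 5
                ppc64_cpu --smt=on
                sleep 5
        done
fi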
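
Because a hung ppc64_cpu never returns, the breaker loop simply stalls when the problem hits. A minimal sketch of one way to notice that, assuming a per-command deadline is acceptable: run the command in the background and poll it instead of waiting on it, so the caller is not blocked even if the command never exits. The name run_with_deadline and the exit status 124 (borrowed from the coreutils timeout convention) are choices made for this example and are not part of the original report.

```shell
# Run a command with a deadline in seconds.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Returns the command's exit status if it finishes in time, or 124 if it is
# still running once the deadline has passed (the command is left running).
run_with_deadline() {
    _deadline=$1
    shift
    "$@" &
    _pid=$!
    _elapsed=0
    # Poll once per second; kill -0 only checks that the process still exists.
    while kill -0 "$_pid" 2>/dev/null; do
        if [ "$_elapsed" -ge "$_deadline" ]; then
            return 124
        fi
        sleep 1
        _elapsed=$((_elapsed + 1))
    done
    # The command has exited; collect and return its status.
    wait "$_pid"
}
```

In the breaker loop, something like `run_with_deadline 60 ppc64_cpu --smt=on; [ $? -eq 124 ] && echo "ppc64_cpu still running after 60s"` would flag a suspected hang. The 60-second limit is arbitrary and would need tuning for the machine under test.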

bugproxy (bugproxy) on 2017-02-07
tags: added: architecture-ppc64le bugnameltc-146759 severity-critical targetmilestone-inin16041
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)

Steve,

Can Foundations take a look at this request, please?

                  Michael


--
Michael Hohnbaum
OIL Program Manager
Power (ppc64el) Development Project Manager
Canonical, Ltd.

------- Comment From <email address hidden> 2017-02-07 15:53 EDT-------
Correction: the hang reproduced by the previous shell script is actually being fixed separately. These commits fix various other problems with NVMe drives and are required as a prerequisite.

On Tue, Feb 07, 2017 at 12:42:39PM -0800, Michael Hohnbaum wrote:
> Can Foundations take a look at this request, please.

The bug is assigned to the linux package, so the kernel team should probably
be looking at it.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Michael Hohnbaum (hohnbaum) wrote :

Leann,

While the problem is in the ppc64_cpu command, it appears the fix is in a
set of kernel patches. Can you have the kernel team take a look at
these? Thanks.

                     Michael


--
Michael Hohnbaum
OIL Program Manager
Power (ppc64el) Development Project Manager
Canonical, Ltd.

Tim Gardner (timg-tpi) on 2017-02-08
Changed in linux (Ubuntu Zesty):
status: New → Fix Released
assignee: Taco Screen team (taco-screen-team) → Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Tim Gardner (timg-tpi) on 2017-02-16
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-yakkety
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-02-27 16:29 EDT-------
Verified, kernel 4.8.0-40-generic fixes this problem.
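
When verifying on another machine, it may help to confirm first that the running kernel is at least the build that carries the fix (4.8.0-40 for Yakkety and 4.4.0-65 for Xenial, per this report). The following is a rough sketch under stated assumptions: kernel_has_fix is a hypothetical helper, it compares only the ABI part of the release string (for example 4.8.0-40 from 4.8.0-40-generic), and it relies on GNU sort -V for version ordering.

```shell
# Return 0 if kernel release $1 (as printed by `uname -r`) is at least
# the fixed ABI version $2, e.g. kernel_has_fix 4.8.0-40-generic 4.8.0-40.
# Only compare against the fixed version for the same series.
kernel_has_fix() {
    # Strip a trailing flavour suffix such as -generic or -lowlatency.
    _abi=${1%-[a-z]*}
    _fixed=$2
    # With version sort, the fixed version sorts first when it is <= the ABI.
    _lowest=$(printf '%s\n%s\n' "$_fixed" "$_abi" | sort -V | head -n 1)
    [ "$_lowest" = "$_fixed" ]
}
```

A typical call would be `kernel_has_fix "$(uname -r)" 4.8.0-40 && echo "fix should be present"` on Yakkety, substituting 4.4.0-65 on Xenial.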

Tim Gardner (timg-tpi) on 2017-03-01
tags: added: verification-done-yakkety
removed: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-40.43

---------------
linux (4.8.0-40.43) yakkety; urgency=low

  * linux: 4.8.0-40.43 -proposed tracker (LP: #1667066)

  [ Andy Whitcroft ]
  * NFS client : permission denied when trying to access subshare, since kernel
    4.4.0-31 (LP: #1649292)
    - fs: Better permission checking for submounts

  * shaking screen (LP: #1651981)
    - drm/radeon: drop verde dpm quirks

  * [0bda:0328] Card reader failed after S3 (LP: #1664809)
    - usb: hub: Wait for connection to be reestablished after port reset

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * In Ubuntu 17.04 : after reboot getting message in console like Unable to
    open file: /etc/keys/x509_ima.der (-2) (LP: #1656908)
    - SAUCE: ima: Downgrade error to warning

  * 16.04.2: Extra patches for POWER9 (LP: #1664564)
    - powerpc/mm: Fix no execute fault handling on pre-POWER5
    - powerpc/mm: Fix spurrious segfaults on radix with autonuma

  * ibmvscsis: Add SGL LIMIT (LP: #1662551)
    - ibmvscsis: Add SGL limit

  * [Hyper-V] Bug fixes for storvsc (tagged queuing, error conditions)
    (LP: #1663687)
    - scsi: storvsc: Enable tracking of queue depth
    - scsi: storvsc: Remove the restriction on max segment size
    - scsi: storvsc: Enable multi-queue support
    - scsi: storvsc: use tagged SRB requests if supported by the device
    - scsi: storvsc: properly handle SRB_ERROR when sense message is present
    - scsi: storvsc: properly set residual data length on errors

  * Ubuntu16.10-KVM:Big configuration with multiple guests running SRIOV VFs
    caused KVM host hung and all KVM guests down. (LP: #1651248)
    - KVM: PPC: Book 3S: XICS cleanup: remove XICS_RM_REJECT
    - KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
    - KVM: PPC: Book 3S: XICS: Fix potential issue with duplicate IRQ resends
    - KVM: PPC: Book 3S: XICS: Implement ICS P/Q states
    - KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resend

  * ISST-LTE:pNV: ppc64_cpu command is hung w HDs, SSDs and NVMe (LP: #1662666)
    - blk-mq: Avoid memory reclaim when remapping queues
    - blk-mq: Fix failed allocation path when mapping queues
    - blk-mq: Always schedule hctx->next_cpu

  * systemd-udevd hung in blk_mq_freeze_queue_wait testing unpartitioned NVMe
    drive (LP: #1662673)
    - percpu-refcount: fix reference leak during percpu-atomic transition

  * [Yakkety SRU] Enable KEXEC support in ARM64 kernel (LP: #1662554)
    - [Config] Enable KEXEC support in ARM64.

  * [Hyper-V] Fix ring buffer handling to avoid host throttling (LP: #1661430)
    - Drivers: hv: vmbus: On write cleanup the logic to interrupt the host
    - Drivers: hv: vmbus: On the read path cleanup the logic to interrupt the host
    - Drivers: hv: vmbus: finally fix hv_need_to_signal_on_read()

  * brd module compiled as built-in (LP: #1593293)
    - CONFIG_BLK_DEV_RAM=m

  * regession tests failing after stackprofile test is run (LP: #1661030)
    - SAUCE: fix regression with domain change in compla...


Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-65.86

---------------
linux (4.4.0-65.86) xenial; urgency=low

  * linux: 4.4.0-65.86 -proposed tracker (LP: #1667052)

  [ Stefan Bader ]
  * Upgrade Redpine RS9113 driver to support AP mode (LP: #1665211)
    - SAUCE: Redpine driver to support Host AP mode

  * NFS client : permission denied when trying to access subshare, since kernel
    4.4.0-31 (LP: #1649292)
    - fs: Better permission checking for submounts

  * [Hyper-V] SAUCE: pci-hyperv fixes for SR-IOV on Azure (LP: #1665097)
    - SAUCE: PCI: hv: Fix wslot_to_devfn() to fix warnings on device removal
    - SAUCE: pci-hyperv: properly handle pci bus remove
    - SAUCE: pci-hyperv: lock pci bus on device eject

  * [Hyper-V/Azure] Please include Mellanox OFED drivers in Azure kernel and
    image (LP: #1650058)
    - net/mlx4_en: Fix bad WQE issue
    - net/mlx4_core: Fix racy CQ (Completion Queue) free
    - net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT
      transitions
    - net/mlx4_core: Avoid command timeouts during VF driver device shutdown

  * Xenial update to v4.4.49 stable release (LP: #1664960)
    - ARC: [arcompact] brown paper bag bug in unaligned access delay slot fixup
    - selinux: fix off-by-one in setprocattr
    - Revert "x86/ioapic: Restore IO-APIC irq_chip retrigger callback"
    - cpumask: use nr_cpumask_bits for parsing functions
    - hns: avoid stack overflow with CONFIG_KASAN
    - ARM: 8643/3: arm/ptrace: Preserve previous registers for short regset write
    - target: Don't BUG_ON during NodeACL dynamic -> explicit conversion
    - target: Use correct SCSI status during EXTENDED_COPY exception
    - target: Fix early transport_generic_handle_tmr abort scenario
    - target: Fix COMPARE_AND_WRITE ref leak for non GOOD status
    - ARM: 8642/1: LPAE: catch pending imprecise abort on unmask
    - mac80211: Fix adding of mesh vendor IEs
    - netvsc: Set maximum GSO size in the right place
    - scsi: zfcp: fix use-after-free by not tracing WKA port open/close on failed
      send
    - scsi: aacraid: Fix INTx/MSI-x issue with older controllers
    - scsi: mpt3sas: disable ASPM for MPI2 controllers
    - xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
    - ALSA: seq: Fix race at creating a queue
    - ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()
    - drm/i915: fix use-after-free in page_flip_completed()
    - Linux 4.4.49

  * NFS client : kernel 4.4.0-57 crash with nfsv4 enries in /etc/fstab
    (LP: #1650336)
    - SUNRPC: fix refcounting problems with auth_gss messages.

  * [0bda:0328] Card reader failed after S3 (LP: #1664809)
    - usb: hub: Wait for connection to be reestablished after port reset

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * ibmvscsis: Add SGL LIMIT (LP: #1662551)
    - ibmvscsis: Add SGL limit

  * [Hyper-V] Bug fixes for storvsc (tagged queuing, error conditions)
    (LP: #1663687)
    - scsi: storvsc: Enable tracking of queue depth
    - scsi: storvsc: Remove the ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released