ISST-LTE:pNV: ppc64_cpu command is hung w HDs, SSDs and NVMe

Bug #1662666 reported by bugproxy
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   Undecided    Tim Gardner
Xenial           Fix Released   Undecided    Tim Gardner
Yakkety          Fix Released   Undecided    Tim Gardner
Zesty            Fix Released   Undecided    Tim Gardner

Bug Description

-- Problem Description --
The following upstream patches are needed for Ubuntu to fix a hang situation reported when executing ppc64_cpu --smt=on that occurs with various disk types. We need whichever ones have not yet been pulled into the base.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e57690fe009b2ab0cee8a57f53be634540e49c9d
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0e87e58bf60edb6bb28e493c7a143f41b091a5e5
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d1b1cea1e58477dad88ff769f54c0d2dfa56d923
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=36e1f3d107867b25c616c2fd294f5a1c9d4e5d09
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=71f79fb3179e69b0c1448a2101a866d871c66e7f
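
One way to find out which of the commits above are not yet in a given kernel tree is sketched below. It is only an illustration: commit_id_from_url and commit_status are hypothetical helper names, and the fallback search assumes that backported commits mention the upstream ID in their message (for example in a "cherry picked from commit ..." line), which may not hold for every tree.

```shell
#!/bin/bash
# Sketch: report whether an upstream commit is already in a kernel git tree.
# Both helper names are hypothetical and exist only for this example.

# Extract the commit ID from a git.kernel.org cgit URL (text after "id=").
commit_id_from_url() {
    local url=$1
    printf '%s\n' "${url##*id=}"
}

# Print "present" or "missing" for commit ID $2 in the git tree at $1.
commit_status() {
    local tree=$1 id=$2
    if git -C "$tree" merge-base --is-ancestor "$id" HEAD 2>/dev/null; then
        # The upstream commit itself is an ancestor of HEAD.
        echo present
    elif [ -n "$(git -C "$tree" log -1 --format=%h --grep="$id" HEAD 2>/dev/null)" ]; then
        # Assumption: a backport names the upstream ID in its commit message.
        echo present
    else
        echo missing
    fi
}
```

For example, with a kernel checkout at a hypothetical path ~/ubuntu-xenial, running commit_status ~/ubuntu-xenial "$(commit_id_from_url "$url")" for each URL in the list prints one result per commit. A "missing" result only means neither check matched, so it is worth confirming by patch subject before requesting a backport.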

The hang can be reproduced within 30 minutes by running the following shell script against an NVMe device on kernel versions 4.4.0-45 and 4.4.0-59. It was also reproduced on a 4.8.0-32-generic kernel, although there it took over 3 hours to manifest.

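#!/bin/bash
# With no arguments: start the SMT toggler in the background, then read
# the NVMe device in a loop to keep I/O in flight.
# With the argument "breaker": switch SMT off and on every 5 seconds.

if [[ ${#} -eq 0 ]]; then
        "${0}" breaker &
        while true; do
                dd if=/dev/nvme0n1 bs=1024k of=/dev/null
        done
elif [[ ${1} == "breaker" ]]; then
        while true; do
                ppc64_cpu --smt=off
                sleep 5
                ppc64_cpu --smt=on
                sleep 5
        done
fi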
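
When a reproducer like the one above runs unattended, a stuck ppc64_cpu call is easier to notice if each call is bounded by a deadline. The sketch below is one possible way to do that and is not part of the original report; run_with_deadline is a hypothetical helper. It polls the background process rather than waiting on it, on the assumption that a process blocked inside the kernel may not respond to signals.

```shell
#!/bin/bash
# Sketch: run a command in the background and report if it is still
# running after a deadline. run_with_deadline is a hypothetical helper.

# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Returns the command's exit status, or 124 if the deadline passed.
run_with_deadline() {
    local deadline=$1
    shift
    "$@" &
    local pid=$!
    local waited=0
    # kill -0 only checks whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$deadline" ]; then
            echo "HUNG: '$*' (pid $pid) still running after ${deadline}s" >&2
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # The process has exited; collect its exit status.
    wait "$pid"
}
```

In the "breaker" loop, a call such as run_with_deadline 120 ppc64_cpu --smt=on would then log a message instead of blocking the loop indefinitely. The 120-second value is an arbitrary choice for illustration.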

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-146759 severity-critical targetmilestone-inin16041
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Michael Hohnbaum (hohnbaum) wrote : Re: [Bug 1662666] [NEW] ISST-LTE:pNV: ppc64_cpu command is hung w HDs, SSDs and NVMe

Steve,

Can Foundations take a look at this request, please.

                  Michael

--
Michael Hohnbaum
OIL Program Manager
Power (ppc64el) Development Project Manager
Canonical, Ltd.

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-02-07 15:53 EDT-------
Correction: the hang reproduced by the previous shell script is actually being fixed separately. These commits fix various other problems with NVMe drives and are required as a prerequisite.

Steve Langasek (vorlon) wrote : Re: [Bug 1662666] [NEW] ISST-LTE:pNV: ppc64_cpu command is hung w HDs, SSDs and NVMe

On Tue, Feb 07, 2017 at 12:42:39PM -0800, Michael Hohnbaum wrote:
> Can Foundations take a look at this request, please.

The bug is assigned to the linux package, so the kernel team should probably
be looking at it.

--
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
Ubuntu Developer                                    http://www.debian.org/
<email address hidden>                              <email address hidden>

Michael Hohnbaum (hohnbaum) wrote :

Leann,

While the problem is in the ppc64_cpu command, it appears the fix is in a
set of kernel patches. Can you have the kernel team take a look at
these? Thanks.

                     Michael

On 02/07/2017 01:05 PM, Steve Langasek wrote:
> On Tue, Feb 07, 2017 at 12:42:39PM -0800, Michael Hohnbaum wrote:
>> Can Foundations take a look at this request, please.
> The bug is assigned to the linux package, so the kernel team should probably
> be looking at it.
>

--
Michael Hohnbaum
OIL Program Manager
Power (ppc64el) Development Project Manager
Canonical, Ltd.

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Zesty):
status: New → Fix Released
assignee: Taco Screen team (taco-screen-team) → Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-yakkety
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-02-27 16:29 EDT-------
Verified, kernel 4.8.0-40-generic fixes this problem.

Tim Gardner (timg-tpi)
tags: added: verification-done-yakkety
removed: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-40.43

---------------
linux (4.8.0-40.43) yakkety; urgency=low

  * linux: 4.8.0-40.43 -proposed tracker (LP: #1667066)

  [ Andy Whitcroft ]
  * NFS client : permission denied when trying to access subshare, since kernel
    4.4.0-31 (LP: #1649292)
    - fs: Better permission checking for submounts

  * shaking screen (LP: #1651981)
    - drm/radeon: drop verde dpm quirks

  * [0bda:0328] Card reader failed after S3 (LP: #1664809)
    - usb: hub: Wait for connection to be reestablished after port reset

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * In Ubuntu 17.04 : after reboot getting message in console like Unable to
    open file: /etc/keys/x509_ima.der (-2) (LP: #1656908)
    - SAUCE: ima: Downgrade error to warning

  * 16.04.2: Extra patches for POWER9 (LP: #1664564)
    - powerpc/mm: Fix no execute fault handling on pre-POWER5
    - powerpc/mm: Fix spurrious segfaults on radix with autonuma

  * ibmvscsis: Add SGL LIMIT (LP: #1662551)
    - ibmvscsis: Add SGL limit

  * [Hyper-V] Bug fixes for storvsc (tagged queuing, error conditions)
    (LP: #1663687)
    - scsi: storvsc: Enable tracking of queue depth
    - scsi: storvsc: Remove the restriction on max segment size
    - scsi: storvsc: Enable multi-queue support
    - scsi: storvsc: use tagged SRB requests if supported by the device
    - scsi: storvsc: properly handle SRB_ERROR when sense message is present
    - scsi: storvsc: properly set residual data length on errors

  * Ubuntu16.10-KVM:Big configuration with multiple guests running SRIOV VFs
    caused KVM host hung and all KVM guests down. (LP: #1651248)
    - KVM: PPC: Book 3S: XICS cleanup: remove XICS_RM_REJECT
    - KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
    - KVM: PPC: Book 3S: XICS: Fix potential issue with duplicate IRQ resends
    - KVM: PPC: Book 3S: XICS: Implement ICS P/Q states
    - KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resend

  * ISST-LTE:pNV: ppc64_cpu command is hung w HDs, SSDs and NVMe (LP: #1662666)
    - blk-mq: Avoid memory reclaim when remapping queues
    - blk-mq: Fix failed allocation path when mapping queues
    - blk-mq: Always schedule hctx->next_cpu

  * systemd-udevd hung in blk_mq_freeze_queue_wait testing unpartitioned NVMe
    drive (LP: #1662673)
    - percpu-refcount: fix reference leak during percpu-atomic transition

  * [Yakkety SRU] Enable KEXEC support in ARM64 kernel (LP: #1662554)
    - [Config] Enable KEXEC support in ARM64.

  * [Hyper-V] Fix ring buffer handling to avoid host throttling (LP: #1661430)
    - Drivers: hv: vmbus: On write cleanup the logic to interrupt the host
    - Drivers: hv: vmbus: On the read path cleanup the logic to interrupt the host
    - Drivers: hv: vmbus: finally fix hv_need_to_signal_on_read()

  * brd module compiled as built-in (LP: #1593293)
    - CONFIG_BLK_DEV_RAM=m

  * regession tests failing after stackprofile test is run (LP: #1661030)
    - SAUCE: fix regression with domain change in compla...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-65.86

---------------
linux (4.4.0-65.86) xenial; urgency=low

  * linux: 4.4.0-65.86 -proposed tracker (LP: #1667052)

  [ Stefan Bader ]
  * Upgrade Redpine RS9113 driver to support AP mode (LP: #1665211)
    - SAUCE: Redpine driver to support Host AP mode

  * NFS client : permission denied when trying to access subshare, since kernel
    4.4.0-31 (LP: #1649292)
    - fs: Better permission checking for submounts

  * [Hyper-V] SAUCE: pci-hyperv fixes for SR-IOV on Azure (LP: #1665097)
    - SAUCE: PCI: hv: Fix wslot_to_devfn() to fix warnings on device removal
    - SAUCE: pci-hyperv: properly handle pci bus remove
    - SAUCE: pci-hyperv: lock pci bus on device eject

  * [Hyper-V/Azure] Please include Mellanox OFED drivers in Azure kernel and
    image (LP: #1650058)
    - net/mlx4_en: Fix bad WQE issue
    - net/mlx4_core: Fix racy CQ (Completion Queue) free
    - net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT
      transitions
    - net/mlx4_core: Avoid command timeouts during VF driver device shutdown

  * Xenial update to v4.4.49 stable release (LP: #1664960)
    - ARC: [arcompact] brown paper bag bug in unaligned access delay slot fixup
    - selinux: fix off-by-one in setprocattr
    - Revert "x86/ioapic: Restore IO-APIC irq_chip retrigger callback"
    - cpumask: use nr_cpumask_bits for parsing functions
    - hns: avoid stack overflow with CONFIG_KASAN
    - ARM: 8643/3: arm/ptrace: Preserve previous registers for short regset write
    - target: Don't BUG_ON during NodeACL dynamic -> explicit conversion
    - target: Use correct SCSI status during EXTENDED_COPY exception
    - target: Fix early transport_generic_handle_tmr abort scenario
    - target: Fix COMPARE_AND_WRITE ref leak for non GOOD status
    - ARM: 8642/1: LPAE: catch pending imprecise abort on unmask
    - mac80211: Fix adding of mesh vendor IEs
    - netvsc: Set maximum GSO size in the right place
    - scsi: zfcp: fix use-after-free by not tracing WKA port open/close on failed
      send
    - scsi: aacraid: Fix INTx/MSI-x issue with older controllers
    - scsi: mpt3sas: disable ASPM for MPI2 controllers
    - xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
    - ALSA: seq: Fix race at creating a queue
    - ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()
    - drm/i915: fix use-after-free in page_flip_completed()
    - Linux 4.4.49

  * NFS client : kernel 4.4.0-57 crash with nfsv4 enries in /etc/fstab
    (LP: #1650336)
    - SUNRPC: fix refcounting problems with auth_gss messages.

  * [0bda:0328] Card reader failed after S3 (LP: #1664809)
    - usb: hub: Wait for connection to be reestablished after port reset

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * ibmvscsis: Add SGL LIMIT (LP: #1662551)
    - ibmvscsis: Add SGL limit

  * [Hyper-V] Bug fixes for storvsc (tagged queuing, error conditions)
    (LP: #1663687)
    - scsi: storvsc: Enable tracking of queue depth
    - scsi: storvsc: Remove the ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
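
To check whether a given machine is already running a kernel at or past the fixed levels named in the changelogs above (4.8.0-40 for Yakkety, 4.4.0-65 for Xenial), something like the following could be used. It is a sketch only: kernel_abi, version_ge and has_fix are hypothetical helpers, and the comparison relies on GNU sort -V. Other series, such as Zesty, are not covered because this report gives no fixed version for them.

```shell
#!/bin/bash
# Sketch: decide whether a kernel release string is at or past the ABI
# level that carries the fix. All three helper names are hypothetical.

# Reduce a `uname -r` style string such as "4.8.0-40-generic" to "4.8.0-40".
kernel_abi() {
    printf '%s\n' "$1" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/'
}

# Succeed if version $1 sorts at or after version $2 (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare against the fixed ABI for the two series named in this bug.
# Returns 0 if fixed, 1 if older, 2 if the series is not covered here.
has_fix() {
    local abi
    abi=$(kernel_abi "$1")
    case "$abi" in
        4.4.0-*) version_ge "$abi" 4.4.0-65 ;;
        4.8.0-*) version_ge "$abi" 4.8.0-40 ;;
        *) return 2 ;;
    esac
}
```

A typical call would be has_fix "$(uname -r)" && echo "fix present". A return value of 2 means the running series is outside what this report documents, not that the fix is absent.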