NVMe stress test fails after 12 hours on Ubuntu 16.04

Bug #1604995 reported by bugproxy
This bug affects 1 person
Affects             Importance  Assigned to
linux (Ubuntu)      High        Tim Gardner
Xenial              High        Tim Gardner
Yakkety             High        Tim Gardner

Bug Description

==== State: Open by: mdate on 21 June 2016 10:22:28 ====

Performance stress testing is being done in preparation for the Tencent PoC Sort Competition, using the attached perf scripts. After about 12 hours, either the system hangs or the NVMe device goes offline.
The NVMe drives are set up in a RAID 0 configuration. The script is run like this:

./perf_nvme_fio2.sh nvme_jobfile_raid0_huawei

mdate (<email address hidden>) added native attachment /opt/IBM/WebSphere/AppServer/profiles/cqweb/temp/ausratsrv5Node01/server1/TeamEAR/cqweb.war/perf_nvme_fio2.sh on 2016-06-21 10:22:28
mdate (<email address hidden>) added native attachment /opt/IBM/WebSphere/AppServer/profiles/cqweb/temp/ausratsrv5Node01/server1/TeamEAR/cqweb.war/nvme_jobfile_raid0_huawei on 2016-06-21 10:22:28

== Comment: #48 - Gabriel Krisman Bertazi - 2016-07-18 12:33:13 ==
Hi,

We need to apply the following patch to the Ubuntu kernel to prevent misidentification of Atari partitions, as described in the commit log.

It has not made it into Linus's tree yet, but Jens Axboe has already approved it on linux-block.

https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-4.8/core
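
To illustrate the idea behind the patch (its attachment name is 0001-block-atari-Return-early-for-unsupported-sector-size.patch), here is a minimal Python model of a partition probe that bails out before inspecting any data when the logical block size is not 512 bytes. This is only a sketch and not the kernel code: the function names, the simplified validity heuristic, and the assumed root-sector offsets (four 12-byte entries starting at 0x1C6) are illustrative assumptions.

```python
import struct

# Assumption for this sketch: the AHDI layout is defined in 512-byte sectors.
SUPPORTED_LBA_SIZE = 512
ENTRY_OFFSET = 0x1C6   # assumed offset of the first partition entry
ENTRY_SIZE = 12        # flag (1) + id (3) + start (4) + size (4), big-endian


def looks_like_ahdi_entry(entry: bytes, hd_size: int) -> bool:
    """Weak heuristic: active flag, alphanumeric id, extent inside the disk."""
    flag, ident, start, size = struct.unpack(">B3sII", entry)
    return (
        bool(flag & 1)
        and ident.isalnum()
        and start <= hd_size
        and start + size <= hd_size
    )


def probe_atari(sector0: bytes, logical_block_size: int, disk_bytes: int) -> bool:
    """Return True if sector0 looks like an Atari root sector (hypothetical model)."""
    # Early return: never interpret the data on devices with another LBA size,
    # so arbitrary data cannot be mistaken for an AHDI partition table.
    if logical_block_size != SUPPORTED_LBA_SIZE:
        return False
    hd_size = disk_bytes // 512
    entries = (
        sector0[ENTRY_OFFSET + i * ENTRY_SIZE: ENTRY_OFFSET + (i + 1) * ENTRY_SIZE]
        for i in range(4)
    )
    return any(looks_like_ahdi_entry(e, hd_size) for e in entries)
```

Because the heuristic is so weak, data that happens to match it can be taken for a partition table; the guard at the top of `probe_atari` is what keeps a 4K-LBA device from reaching that check at all.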

bugproxy (bugproxy) wrote : full kernel log - after a failure: v4.7-rc5 (I/O error + EEH)

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-142903 severity-critical targetmilestone-inin16041
bugproxy (bugproxy) wrote : Log of overnight runs - 4k LBA (issue reproduced)

Default Comment by Bridge

bugproxy (bugproxy) wrote : Log of overnight runs - 512b LBA

Default Comment by Bridge

bugproxy (bugproxy) wrote : 4K block file used to create 3 Atari partitions on md0 device

Default Comment by Bridge

bugproxy (bugproxy) wrote : 0001-block-atari-Return-early-for-unsupported-sector-size.patch

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Xenial):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: Triaged → In Progress
Changed in linux (Ubuntu Yakkety):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: Triaged → Fix Committed
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-07-29 13:00 EDT-------
(In reply to comment #51)
> https://lists.ubuntu.com/archives/kernel-team/2016-July/079275.html

Patch is in Ubuntu's master-next: ("3d5038ae701eb677c210f2c606e6e89e5d91f0a4")

Looks good.

Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-08-22 10:51 EDT-------
(In reply to comment #53)
> This bug is awaiting verification that the kernel in -proposed solves the
> problem. Please test the kernel and update this bug with the results. If the
> problem is solved, change the tag 'verification-needed-xenial' to
> 'verification-done-xenial'.
>
> If verification is not done by 5 working days from today, this fix will be
> dropped from the source code, and this bug will be closed.
>
> See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to
> enable and use -proposed. Thank you!

I have verified that with the -proposed kernel the NVMe disk is no longer identified as an AHDI partition, which prevents the bad request from being sent. Marking as verified and changing to accepted.
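
For anyone repeating this kind of check, it can help to confirm which logical block size a device exposes, since the attached logs distinguish 4k and 512b LBA runs. The sketch below reads the value from sysfs; the helper names and the device name `nvme0n1` are assumptions for illustration, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def logical_block_size(device: str, sysfs_root: str = "/sys") -> int:
    """Read the logical block size (in bytes) that the kernel reports for a block device."""
    path = Path(sysfs_root) / "block" / device / "queue" / "logical_block_size"
    return int(path.read_text().strip())


def uses_512_byte_lba(device: str, sysfs_root: str = "/sys") -> bool:
    """True if the device uses 512-byte logical blocks (hypothetical helper)."""
    return logical_block_size(device, sysfs_root) == 512
```

On a live system this would be called as `logical_block_size("nvme0n1")`, with the device name adjusted to match the machine under test.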

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-36.55

---------------
linux (4.4.0-36.55) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1612305

  * I2C touchpad does not work on AMD platform (LP: #1612006)
    - SAUCE: pinctrl/amd: Remove the default de-bounce time

  * CVE-2016-5696
    - tcp: make challenge acks less predictable

linux (4.4.0-35.54) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1611215

  * [i915_bpo] Sync with v4.7 (LP: #1609742)
    - SAUCE: i915_bpo: Sync with v4.7

  * s390/cio: fix reset of channel measurement block (LP: #1609415)
    - s390/cio: allow to reset channel measurement block

  * in Ubuntu16.10: Hit on Call traces and system goes down when transactional
    memory tests are running in 32TB Brazos system (LP: #1606786)
    - powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
    - powerpc/tm: Fix stack pointer corruption in __tm_recheckpoint()

  * Power Menu does not display after press the Power Button (LP: #1609204)
    - intel-vbtn: new driver for Intel Virtual Button
    - [config] enable CONFIG_INTEL_VBTN=m

  * OptiPlex 7450 AIO hangs when rebooting (LP: #1608762)
    - x86/reboot: Add Dell Optiplex 7450 AIO reboot quirk

  * virtualbox+usb 3.0 breaks boot, -28 kernel works (LP: #1604058)
    - SAUCE: xhci: Fix soft lockup in xhci_pci_probe path when XHCI_STATE_HALTED

  * linux-kernel: Freeing IRQ from IRQ context (LP: #1597908)
    - block: defer timeouts to a workqueue

  * Tunnel offload indications not stripped from encapsulated packets, causing
    performance overhead (LP: #1602755)
    - tunnels: Remove encapsulation offloads on decap.

  * lm-sensors is throwing "ERROR: Can't get value of subfeature temp1_input:
    I/O error" for be2net driver (LP: #1607387)
    - be2net: perform temperature query in adapter regardless of its interface
      state

  * Dell dock MAC Address pass through doesn't work in Ubuntu (LP: #1579984)
    - r8152: Add support for setting pass through MAC address on RTL8153-AD

  * vmxnet3 LRO IPv6 performance issues (stalling TCP) (LP: #1605494)
    - Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets

  * ISST-LTE:pVM:monklp5:Ubuntu16.04.1:system crashed at
    lpfc_sli4_scmd_to_wqidx_distr (LP: #1597974)
    - SAUCE: lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from
      lpfc_send_taskmgmt()

  * Backport cxlflash shutdown patch to Xenial SRU (LP: #1605405)
    - SAUCE: cxlflash: Verify problem state area is mapped before notifying
      shutdown

  * Xenial update to v4.4.16 stable release (LP: #1607404)
    - mac80211: fix fast_tx header alignment
    - mac80211: mesh: flush mesh paths unconditionally
    - mac80211_hwsim: Add missing check for HWSIM_ATTR_SIGNAL
    - mac80211: Fix mesh estab_plinks counting in STA removal case
    - EDAC, sb_edac: Fix rank lookup on Broadwell
    - IB/cm: Fix a recently introduced locking bug
    - IB/mlx4: Properly initialize GRH TClass and FlowLabel in AHs
    - powerpc/pseries: Fix IBM_ARCH_VEC_NRCORES_OFFSET since POWER8NVL was added
    - powerpc/tm: Always reclaim in start_thread() for exec() class syscalls
    - usb: dwc2: fix reg...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-08-29 10:30 EDT-------
(In reply to comment #55)
> This bug was fixed in the package linux - 4.4.0-36.55

------- Comment From <email address hidden> 2016-08-29 10:30 EDT-------
(In reply to comment #56)
> (In reply to comment #55)
> > This bug was fixed in the package linux - 4.4.0-36.55

Thanks. Closing it now.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-09-09 16:21 EDT-------

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released