zfs 0.7.9 fixes a bug (https://github.com/zfsonlinux/zfs/pull/7343) that hangs the system completely

Bug #1772412 reported by Sam Van den Eynde on 2018-05-21
This bug affects 1 person
Affects               Status        Importance  Assigned to
linux (Ubuntu)        Invalid       Undecided   Unassigned
  Bionic              Fix Released  Undecided   Unassigned
zfs-linux (Ubuntu)    Fix Released  High        Colin Ian King
  Bionic              Fix Released  Undecided   Unassigned

Bug Description

SRU Bionic

== SRU Justification ==

Intensive I/O, such as that performed by ZFS send/receive, can cause hangs because the global system_taskq gets monopolized. As a result, z_zvol tasks hang and I/O is blocked.

== Fix ==

Upstream ZFS commit 77d8a0f1a4d0b2f59cee63088f7987cb38e66538 ("Fix hung z_zvol tasks during 'zfs receive'") fixes this issue. It adds a dedicated per-pool prefetch taskq that prevents the traverse code from monopolizing the global (and limited) system_taskq by inappropriately scheduling long-running tasks on it. This fixes the z_zvol hung tasks. A trivial backport is required for Bionic ZFS.
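
The starvation described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the ZFS implementation: Python thread pools stand in for kernel taskqs, the pool sizes and the 0.3 second task length are arbitrary, and the function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LONG_TASK_SECONDS = 0.3  # arbitrary stand-in for a long-running prefetch task


def long_prefetch_task():
    """Simulate a long-running traverse/prefetch job."""
    time.sleep(LONG_TASK_SECONDS)


def short_task_latency(dedicated_prefetch_pool):
    """Return how long a short task takes to complete on the shared pool.

    With dedicated_prefetch_pool=False the long tasks are queued on the
    shared pool and occupy all of its workers. With True they run on their
    own pool, so the shared pool stays free for short tasks.
    """
    shared = ThreadPoolExecutor(max_workers=2)    # models the small global system_taskq
    prefetch = ThreadPoolExecutor(max_workers=2)  # models a per-pool prefetch taskq
    try:
        target = prefetch if dedicated_prefetch_pool else shared
        for _ in range(2):
            target.submit(long_prefetch_task)
        start = time.monotonic()
        shared.submit(lambda: None).result()  # the short task
        return time.monotonic() - start
    finally:
        shared.shutdown(wait=True)
        prefetch.shutdown(wait=True)


if __name__ == "__main__":
    print("shared pool only:   %.3f s" % short_task_latency(False))
    print("dedicated prefetch: %.3f s" % short_task_latency(True))
```

In this model the short task has to wait for a long task to finish when everything shares one pool, and completes almost immediately once the long tasks have a pool of their own.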

== Testcase ==

Perform large send/receives; occasionally they lock up. With the fix, no more lockups occur. The package must also pass the full Ubuntu ZFS autotest suite to show that no regressions have been introduced.
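
A minimal sketch of how such a test run might be scripted, assuming the `zfs` command is on the PATH and the script runs with sufficient privileges. The helper names are hypothetical, and the log filter assumes the kernel's hung-task detector reports stuck tasks with the phrase "blocked for more than".

```python
import subprocess


def send_receive_commands(snapshot, target):
    """Build the argv lists for `zfs send SNAPSHOT | zfs receive TARGET`."""
    return ["zfs", "send", snapshot], ["zfs", "receive", target]


def run_send_receive(snapshot, target):
    """Pipe a full send stream into receive; return True if both sides exit 0."""
    send_cmd, recv_cmd = send_receive_commands(snapshot, target)
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(recv_cmd, stdin=sender.stdout)
    sender.stdout.close()  # let the sender get SIGPIPE if the receiver exits early
    receiver.wait()
    sender.wait()
    return sender.returncode == 0 and receiver.returncode == 0


def hung_zvol_lines(kernel_log):
    """Return kernel log lines that report a blocked z_zvol task."""
    return [line for line in kernel_log.splitlines()
            if "z_zvol" in line and "blocked for more than" in line]
```

For example, `run_send_receive("tank/data@snap1", "backup/copy1")` could be called in a loop with a fresh target each time (both dataset names are placeholders), and the output of `dmesg` passed to `hung_zvol_lines` afterwards to look for hung-task reports.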

== Regression Potential ==

This fix adds per-pool prefetch taskqs, so more kernel resources are used. There is therefore a very small risk that it may affect ZFS on memory- and CPU-constrained systems. However, the fix is small, has been upstream for a while, is in Cosmic and later releases, and has not caused any regressions, so I think this is a relatively safe fix.

------

I have experienced the problems fixed by this pull request https://github.com/zfsonlinux/zfs/pull/7343 a few times on my NAS. The system hangs completely when it occurs. It looks like 0.7.9 also brings fixes for other bugs that can freeze the system.

Changed in zfs-linux (Ubuntu):
status: New → Fix Released
Saurabh Nanda (saurabhnanda) wrote :

I noticed that the status of this bug has been changed to `fix released`. Where/how has the fix been released and how does one patch a live Ubuntu 18.04.2 system to get this bugfix?

Richard Laager (rlaager) wrote :

ZFS 0.7.9 was released in Cosmic (18.10). You could update to Cosmic. Alternatively, on 18.04, you can install the HWE kernel package: linux-image-generic-hwe-18.04
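
To check which ZFS version a machine is actually running after a kernel change, something like the following sketch may help. It assumes the loaded module exposes its version at /sys/module/zfs/version and that the upstream part of the version is purely numeric; the function names are hypothetical. It compares upstream version numbers only, so it will not detect a distribution backport that keeps an older upstream number.

```python
from pathlib import Path

FIXED_UPSTREAM = (0, 7, 9)  # first upstream release said to carry the fix


def upstream_version(version):
    """Parse the upstream part of a version such as '0.7.5-1ubuntu16.6'."""
    upstream = version.strip().split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def has_upstream_fix(version):
    """Return True if the upstream version is 0.7.9 or newer."""
    return upstream_version(version) >= FIXED_UPSTREAM


def loaded_zfs_version(path="/sys/module/zfs/version"):
    """Return the version string of the loaded zfs module, or None if absent."""
    version_file = Path(path)
    return version_file.read_text().strip() if version_file.exists() else None


if __name__ == "__main__":
    version = loaded_zfs_version()
    if version is None:
        print("zfs module not loaded")
    else:
        print(version, "has upstream fix:", has_upstream_fix(version))
```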

Saurabh Nanda (saurabhnanda) wrote :

What does `hwe` stand for? Is upgrading the kernel like this safe in production? What are the chances that the system will not reboot cleanly?

Saurabh Nanda (saurabhnanda) wrote :

Is this answer correct with respect to `hwe` - https://askubuntu.com/a/248936/911062 ?

Sam Van den Eynde (samvde) wrote :

Check out https://wiki.ubuntu.com/Kernel/LTSEnablementStack. This is the formal way Ubuntu provides newer kernels (and as such newer zfs versions) during the lifecycle of an LTS release.

Saurabh Nanda (saurabhnanda) wrote :

Thanks - I upgraded the server which was receiving the ZFS snapshot and things _seem_ to be working well.

Richard Laager (rlaager) wrote :

Your upgrade is done, but for the record, installing the HWE kernel doesn't remove the old kernel. So you still have the option to go back to that in the GRUB menu.

Also, once you're sure the HWE kernel is working, you'll probably want to remove the linux-image-generic package so you're not continuously upgrading two sets of kernels.

Changed in zfs-linux (Ubuntu):
status: Fix Released → In Progress
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
description: updated
Colin Ian King (colin-king) wrote :

Fix sent to kernel team mailing list for review:

https://lists.ubuntu.com/archives/kernel-team/2019-May/100990.html

description: updated
Changed in linux (Ubuntu):
status: New → Invalid
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed

Hello Sam, or anyone else affected,

Accepted zfs-linux into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/0.7.5-1ubuntu16.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in zfs-linux (Ubuntu Bionic):
status: New → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-55.60

---------------
linux (4.15.0-55.60) bionic; urgency=medium

  * linux: 4.15.0-55.60 -proposed tracker (LP: #1834954)

  * Request backport of ceph commits into bionic (LP: #1834235)
    - ceph: use atomic_t for ceph_inode_info::i_shared_gen
    - ceph: define argument structure for handle_cap_grant
    - ceph: flush pending works before shutdown super
    - ceph: send cap releases more aggressively
    - ceph: single workqueue for inode related works
    - ceph: avoid dereferencing invalid pointer during cached readdir
    - ceph: quota: add initial infrastructure to support cephfs quotas
    - ceph: quota: support for ceph.quota.max_files
    - ceph: quota: don't allow cross-quota renames
    - ceph: fix root quota realm check
    - ceph: quota: support for ceph.quota.max_bytes
    - ceph: quota: update MDS when max_bytes is approaching
    - ceph: quota: add counter for snaprealms with quota
    - ceph: avoid iput_final() while holding mutex or in dispatch thread

  * QCA9377 isn't being recognized sometimes (LP: #1757218)
    - SAUCE: USB: Disable USB2 LPM at shutdown

  * hns: fix ICMP6 neighbor solicitation messages discard problem (LP: #1833140)
    - net: hns: fix ICMP6 neighbor solicitation messages discard problem
    - net: hns: fix unsigned comparison to less than zero

  * Fix occasional boot time crash in hns driver (LP: #1833138)
    - net: hns: Fix probabilistic memory overwrite when HNS driver initialized

  * use-after-free in hns_nic_net_xmit_hw (LP: #1833136)
    - net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()

  * hns: attempt to restart autoneg when disabled should report error
    (LP: #1833147)
    - net: hns: Restart autoneg need return failed when autoneg off

  * systemd 237-3ubuntu10.14 ADT test failure on Bionic ppc64el (test-seccomp)
    (LP: #1821625)
    - powerpc: sys_pkey_alloc() and sys_pkey_free() system calls
    - powerpc: sys_pkey_mprotect() system call

  * [UBUNTU] pkey: Indicate old mkvp only if old and curr. mkvp are different
    (LP: #1832625)
    - pkey: Indicate old mkvp only if old and current mkvp are different

  * [UBUNTU] kernel: Fix gcm-aes-s390 wrong scatter-gather list processing
    (LP: #1832623)
    - s390/crypto: fix gcm-aes-s390 selftest failures

  * System crashes on hot adding a core with drmgr command (4.15.0-48-generic)
    (LP: #1833716)
    - powerpc/numa: improve control of topology updates
    - powerpc/numa: document topology_updates_enabled, disable by default

  * Kernel modules generated incorrectly when system is localized to a non-
    English language (LP: #1828084)
    - scripts: override locale from environment when running recordmcount.pl

  * [UBUNTU] kernel: Fix wrong dispatching for control domain CPRBs
    (LP: #1832624)
    - s390/zcrypt: Fix wrong dispatching for control domain CPRBs

  * CVE-2019-11815
    - net: rds: force to destroy connection if t_sock is NULL in
      rds_tcp_kill_sock().

  * Sound device not detected after resume from hibernate (LP: #1826868)
    - drm/i915: Force 2*96 MHz cdclk on glk/cnl when audio power is enabled
    - drm/i915: Save the old CDCLK atomic state
...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Colin Ian King (colin-king) wrote :

Verified zfsutils-linux 0.7.5-1ubuntu16.6 with kernel 4.15.0-55.60 against the Ubuntu ZFS regression tests: no failures. Looks good to me.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.7.5-1ubuntu16.6

---------------
zfs-linux (0.7.5-1ubuntu16.6) bionic; urgency=medium

  * Fix hung z_zvol tasks during 'zfs receive' (LP: #1772412)
    - Adds a dedicated, per-pool, prefetch taskq to prevent the traverse
      code from monopolizing the global (and limited) system_taskq by
      inappropriately scheduling long running tasks on it. This fixes
      z_zvol hung tasks.

zfs-linux (0.7.5-1ubuntu16.5) bionic; urgency=medium

  * Fix build error with tracepoints enabled (LP: #1828763)
    - In b49151d684f44 (bionic kernel master-next branch) tx_waited has been
      renamed to tx_dirty_delayed, but only in the tracepoint definition (in
      trace_dmu.h) and not in the rest of the code, causing build errors if
      zfs tracepoints are enabled; fix by reverting tx_dirty_delayed back to
      the original name tx_waited.

 -- Colin Ian King <email address hidden> Wed, 29 May 2019 17:24:22 +0100

Changed in zfs-linux (Ubuntu Bionic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for zfs-linux has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in zfs-linux (Ubuntu):
status: In Progress → Fix Released
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial