Livelock between ZFS evict and writeback threads

Bug #1856084 reported by Heitor Alves de Siqueira on 2019-12-11
This bug affects 1 person
Affects               Status         Importance   Assigned to                Milestone
zfs-linux (Debian)    Fix Released   Unknown
zfs-linux (Ubuntu)                   Medium       Heitor Alves de Siqueira
  Bionic                             Medium       Heitor Alves de Siqueira
  Disco                              Medium       Heitor Alves de Siqueira
  Eoan                               Medium       Heitor Alves de Siqueira
  Focal                              Medium       Heitor Alves de Siqueira

Bug Description

[Impact]
ZIO pipeline stalls, causing ZFS workloads to hang indefinitely

[Description]
For certain ZFS workloads, we start seeing hung task timeouts in the kernel logs due to zil_commit() stalling. This is due to zfs_zget() not detecting whether a znode has been marked for deletion before attempting to access it, causing a constant "retry loop" in zfs_get_data() if that znode has been unlinked already. An example of the stack traces follows:

[72742.051703] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[72742.070429] mysqld D 0 5713 2881 0x00000320
[72742.073220] Call Trace:
[72742.075305] __schedule+0x24e/0x880
[72742.090436] schedule+0x2c/0x80
[72742.090438] schedule_preempt_disabled+0xe/0x10
[72742.090441] __mutex_lock.isra.5+0x276/0x4e0
[72742.090547] ? dmu_tx_destroy+0x105/0x130 [zfs]
[72742.090555] __mutex_lock_slowpath+0x13/0x20
[72742.115374] ? __mutex_lock_slowpath+0x13/0x20
[72742.132266] mutex_lock+0x2f/0x40
[72742.134207] zil_commit_impl+0x1b0/0x1b30 [zfs]
[72742.150428] ? spl_kmem_alloc+0x115/0x180 [spl]
[72742.152622] ? mutex_lock+0x12/0x40
[72742.154819] ? zfs_refcount_add_many+0x9a/0x100 [zfs]
[72742.171450] zil_commit+0xde/0x150 [zfs]
[72742.173687] zfs_fsync+0x77/0xe0 [zfs]
[72742.175044] zpl_fsync+0x80/0x110 [zfs]
[72742.191690] vfs_fsync_range+0x51/0xb0
[72742.193876] do_fsync+0x3d/0x70
[72742.195126] SyS_fsync+0x10/0x20
[72742.211059] do_syscall_64+0x73/0x130
[72742.214078] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

It's possible to hit this issue due to a race between the ZFS evict and writeback threads. If the z_iput task is trying to evict a znode that's currently sitting in the writeback thread, both will "livelock" each other and stall the ZIO pipeline, causing other ZFS operations (such as zil_commit) to hang indefinitely.

This has been documented and fixed upstream in PR#9583 [0]. We need to pull two fixes from upstream: the first one fixes the zfs_zget() issue in the writeback thread, while the second fixes a regression on O_TMPFILE descriptors caused by the first one.

Upstream patches:
 - Break out of zfs_zget early if unlinked znode (41e1aa2a06f8) [1]
 - Check for unlinked znodes after igrab() (0c46813805f4) [2]
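
The effect of the two patches can be illustrated with a toy model. This is a minimal sketch based only on the description above and the patch titles, not the upstream code: struct toy_znode, toy_grab(), toy_zget() and the policy names are hypothetical stand-ins, and the bounded retry count stands in for a loop that would otherwise never end. It assumes that a znode under eviction can no longer be grabbed, and that an O_TMPFILE-style znode is unlinked but still grabbable.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a znode; NOT the real ZFS structure. */
struct toy_znode {
    bool unlinked;  /* marked for deletion */
    bool evicting;  /* eviction started; no new references can be taken */
};

/* Toy stand-in for igrab(): fails while the inode is being evicted. */
static bool toy_grab(const struct toy_znode *zp)
{
    return !zp->evicting;
}

enum toy_policy {
    RETRY_ALWAYS,              /* behaviour described in the bug */
    CHECK_UNLINKED_FIRST,      /* assumed shape of the first fix */
    CHECK_UNLINKED_AFTER_GRAB  /* assumed shape of the follow-up fix */
};

/*
 * Returns 0 if a reference was taken, ENOENT if the znode is treated as
 * gone, or EAGAIN if the retry budget ran out (the model's "livelock").
 * The znode state never changes inside the loop, mirroring the case where
 * eviction cannot finish because it is waiting on the caller.
 */
static int toy_zget(const struct toy_znode *zp, enum toy_policy policy,
                    int max_retries)
{
    for (int attempt = 0; attempt < max_retries; attempt++) {
        if (policy == CHECK_UNLINKED_FIRST && zp->unlinked)
            return ENOENT;
        if (toy_grab(zp))
            return 0;
        if (policy == CHECK_UNLINKED_AFTER_GRAB && zp->unlinked)
            return ENOENT;
        /* real code would yield the CPU here before trying again */
    }
    return EAGAIN;
}
```

In this model, a znode with both flags set exhausts its retries under RETRY_ALWAYS but returns ENOENT under either fix. A znode that is unlinked but not being evicted is wrongly rejected by CHECK_UNLINKED_FIRST and succeeds under CHECK_UNLINKED_AFTER_GRAB, which matches the ordering named in the second patch title.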

[Test Case]
Being a race condition, this issue has been hard to reproduce consistently. The race window between evict() and the ZFS writeback thread is quite narrow, but users have reported the hang showing up after some hours of running LXD-containerized MySQL workloads.

[Regression Potential]
These patches have been tested both in the ZFS test suite and in production environments, so the potential for further regressions should be low.
Additional regressions would likely cause issues with the ZFS writeback/commit and IO pipeline, so they should be spotted fairly quickly.

[0] https://github.com/zfsonlinux/zfs/pull/9583
[1] https://github.com/zfsonlinux/zfs/commit/41e1aa2a06f8
[2] https://github.com/zfsonlinux/zfs/commit/0c46813805f4

Changed in zfs-linux (Ubuntu):
importance: Undecided → Medium
Changed in zfs-linux (Ubuntu Bionic):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Disco):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Bionic):
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in zfs-linux (Ubuntu Eoan):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Disco):
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in zfs-linux (Ubuntu Eoan):
assignee: nobody → Heitor Alves de Siqueira (halves)
tags: added: sts-sponsor

The attachment "lp1856084-bionic.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
tags: removed: sts-sponsor
Colin Ian King (colin-king) wrote :

I've checked that the zfs kernel driver builds and it passes the ZFS regression tests. Patches look good, so I've uploaded these packages.

Changed in zfs-linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in zfs-linux (Ubuntu Disco):
importance: Undecided → Medium
Changed in zfs-linux (Ubuntu Eoan):
importance: Undecided → Medium
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.8.2-3ubuntu3

---------------
zfs-linux (0.8.2-3ubuntu3) focal; urgency=medium

  * Fix livelock between ZFS evict and writeback threads (LP: #1856084)
    - Upstream ZFS fix 41e1aa2a06f8 ("Break out of zfs_zget early if unlinked
      znode")
    - Upstream ZFS fix 0c46813805f4 ("Check for unlinked znodes after
      igrab()")

 -- Heitor Alves de Siqueira <email address hidden> Fri, 13 Dec 2019 11:27:39 -0300

Changed in zfs-linux (Ubuntu Focal):
status: Confirmed → Fix Released

Hello Heitor, or anyone else affected,

Accepted zfs-linux into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/0.8.1-1ubuntu14.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in zfs-linux (Ubuntu Eoan):
status: Confirmed → Fix Committed
Łukasz Zemczak (sil2100) wrote :

Hello Heitor, or anyone else affected,

Accepted zfs-linux into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/0.7.12-1ubuntu5.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in zfs-linux (Ubuntu Disco):
status: Confirmed → Fix Committed
Changed in zfs-linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Łukasz Zemczak (sil2100) wrote :

Hello Heitor, or anyone else affected,

Accepted zfs-linux into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/0.7.5-1ubuntu16.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Colin Ian King (colin-king) wrote :

I've tested zfs from the -proposed pockets with the ubuntu ZFS autotest regression tests:

ubuntu_zfs_fstest
ubuntu_zfs_smoke_test
ubuntu_zfs_stress
ubuntu_zfs_xfs_generic

All of the following passed regression testing:

bionic: 0.7.5-1ubuntu16.7
disco: 0.7.12-1ubuntu5.1
eoan: 0.8.1-1ubuntu14.3

I was unable to trip and lockups, so as far as I'm concerned I'm happy for these updates to be released.

Colin Ian King (colin-king) wrote :

*I was unable to trip any lockups

tags: added: verification-done-bionic verification-done-disco verification-done-eoan
Changed in zfs-linux (Debian):
status: Unknown → Fix Released

The verification of the Stable Release Update for zfs-linux has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.7.5-1ubuntu16.7

---------------
zfs-linux (0.7.5-1ubuntu16.7) bionic; urgency=medium

  * Fix livelock between ZFS evict and writeback threads (LP: #1856084)
    - Upstream ZFS fix 41e1aa2a06f8 ("Break out of zfs_zget early if unlinked
      znode")
    - Upstream ZFS fix 0c46813805f4 ("Check for unlinked znodes after
      igrab()")

 -- Heitor Alves de Siqueira <email address hidden> Thu, 12 Dec 2019 12:51:35 -0300

Changed in zfs-linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.7.12-1ubuntu5.1

---------------
zfs-linux (0.7.12-1ubuntu5.1) disco; urgency=medium

  * Fix livelock between ZFS evict and writeback threads (LP: #1856084)
    - Upstream ZFS fix 41e1aa2a06f8 ("Break out of zfs_zget early if unlinked
      znode")
    - Upstream ZFS fix 0c46813805f4 ("Check for unlinked znodes after
      igrab()")

 -- Heitor Alves de Siqueira <email address hidden> Thu, 12 Dec 2019 13:19:29 -0300

Changed in zfs-linux (Ubuntu Disco):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.8.1-1ubuntu14.3

---------------
zfs-linux (0.8.1-1ubuntu14.3) eoan; urgency=medium

  * Fix livelock between ZFS evict and writeback threads (LP: #1856084)
    - Upstream ZFS fix 41e1aa2a06f8 ("Break out of zfs_zget early if unlinked
      znode")
    - Upstream ZFS fix 0c46813805f4 ("Check for unlinked znodes after
      igrab()")

 -- Heitor Alves de Siqueira <email address hidden> Thu, 12 Dec 2019 13:21:35 -0300

Changed in zfs-linux (Ubuntu Eoan):
status: Fix Committed → Fix Released