PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed

Bug #1906476 reported by Trent Lloyd
This bug affects 31 people
Affects                                   Status         Importance   Assigned to     Milestone
Native ZFS for Linux                      Fix Released   Unknown
linux (Ubuntu)                            Invalid        Undecided    Unassigned
linux (Ubuntu Impish)                     Fix Released   Critical     Stefan Bader
linux-raspi (Ubuntu)                      Fix Released   Undecided    Unassigned
linux-raspi (Ubuntu Impish)               Fix Released   Undecided    Unassigned
ubuntu-release-upgrader (Ubuntu)          Confirmed      Undecided    Unassigned
ubuntu-release-upgrader (Ubuntu Impish)   Won't Fix      Undecided    Unassigned
zfs-linux (Ubuntu)                        Fix Released   Critical     Unassigned
zfs-linux (Ubuntu Impish)                 Fix Released   Critical     Unassigned

Bug Description

Since today, while running Ubuntu 21.04 Hirsute, I have started getting a ZFS panic in the kernel log, which was also hanging disk I/O for all Chrome/Electron apps.

I have narrowed down a few important notes:
- It does not happen with module version 0.8.4-1ubuntu11 built and included with 5.8.0-29-generic

- It was happening when using zfs-dkms 0.8.4-1ubuntu16 built with DKMS on the same kernel and also on 5.8.18-acso (a custom kernel).

- For whatever reason multiple Chrome/Electron apps were affected, specifically Discord, Chrome and Mattermost. In all cases they seemed to be hung trying to open files in their 'Cache' directory, e.g. ~/.cache/google-chrome/Default/Cache and ~/.config/Mattermost/Cache (I was unable to strace the processes, so it was a bit hard to confirm 100%, but I deduced this from /proc/PID/fd and the hanging ls; see the sketch after this list). While the issue was going on I could not list those directories either; "ls" would just hang.

- Once I removed zfs-dkms to revert to the kernel built-in version, it immediately worked again without changing anything else, removing files, etc.

- It happened every time over multiple reboots and kernels; all my Chrome apps stopped working, but for whatever reason nothing else seemed affected.

- It would log a series of spl_panic dumps into kern.log that look like this:
Dec 2 12:36:42 optane kernel: [ 72.857033] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
Dec 2 12:36:42 optane kernel: [ 72.857036] PANIC at zfs_znode.c:335:zfs_znode_sa_init()

I could only find one other Google reference to this issue, with 2 other users reporting the same error, but on 20.04, here:
https://github.com/openzfs/zfs/issues/10971

- I was not experiencing the issue on 0.8.4-1ubuntu14, and I am fairly sure it was working on 0.8.4-1ubuntu15 but broke after upgrading to 0.8.4-1ubuntu16. I will reinstall those zfs-dkms versions to verify that.
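
For reference, this is roughly how the /proc deduction above can be done; a minimal sketch where the process name and <PID> are placeholders:

```
# Find the PID of the hung application (Chrome here is only an example)
pgrep -a chrome

# List its open file descriptors; a Cache file it cannot make progress on is the usual suspect
sudo ls -l /proc/<PID>/fd

# If available, the in-kernel stack shows whether the task is stuck inside the zfs module
sudo cat /proc/<PID>/stack
```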

There were a few originating call stacks, but the first one I hit was:

Call Trace:
 dump_stack+0x74/0x95
 spl_dumpstack+0x29/0x2b [spl]
 spl_panic+0xd4/0xfc [spl]
 ? sa_cache_constructor+0x27/0x50 [zfs]
 ? _cond_resched+0x19/0x40
 ? mutex_lock+0x12/0x40
 ? dmu_buf_set_user_ie+0x54/0x80 [zfs]
 zfs_znode_sa_init+0xe0/0xf0 [zfs]
 zfs_znode_alloc+0x101/0x700 [zfs]
 ? arc_buf_fill+0x270/0xd30 [zfs]
 ? __cv_init+0x42/0x60 [spl]
 ? dnode_cons+0x28f/0x2a0 [zfs]
 ? _cond_resched+0x19/0x40
 ? _cond_resched+0x19/0x40
 ? mutex_lock+0x12/0x40
 ? aggsum_add+0x153/0x170 [zfs]
 ? spl_kmem_alloc_impl+0xd8/0x110 [spl]
 ? arc_space_consume+0x54/0xe0 [zfs]
 ? dbuf_read+0x4a0/0xb50 [zfs]
 ? _cond_resched+0x19/0x40
 ? mutex_lock+0x12/0x40
 ? dnode_rele_and_unlock+0x5a/0xc0 [zfs]
 ? _cond_resched+0x19/0x40
 ? mutex_lock+0x12/0x40
 ? dmu_object_info_from_dnode+0x84/0xb0 [zfs]
 zfs_zget+0x1c3/0x270 [zfs]
 ? dmu_buf_rele+0x3a/0x40 [zfs]
 zfs_dirent_lock+0x349/0x680 [zfs]
 zfs_dirlook+0x90/0x2a0 [zfs]
 ? zfs_zaccess+0x10c/0x480 [zfs]
 zfs_lookup+0x202/0x3b0 [zfs]
 zpl_lookup+0xca/0x1e0 [zfs]
 path_openat+0x6a2/0xfe0
 do_filp_open+0x9b/0x110
 ? __check_object_size+0xdb/0x1b0
 ? __alloc_fd+0x46/0x170
 do_sys_openat2+0x217/0x2d0
 ? do_sys_openat2+0x217/0x2d0
 do_sys_open+0x59/0x80
 __x64_sys_openat+0x20/0x30

Revision history for this message
Trent Lloyd (lathiat) wrote :

I should mention that Chrome itself always showed "Waiting for cache", which backs up the story around the cache files.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in zfs-linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Trent Lloyd (lathiat) wrote :

I hit this problem again today, but now without zfs-dkms. After upgrading my kernel from 5.8.0-29-generic to 5.8.0-36-generic, my Google Chrome Cache directory is broken again; I had to rename it and then reboot to get out of the problem.

Changed in zfs-linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Trent Lloyd (lathiat) wrote :

Another user report here:
https://github.com/openzfs/zfs/issues/10971

Curiously, I found a 2016(?) report of something similar here:
https://bbs.archlinux.org/viewtopic.php?id=217204

Revision history for this message
Trent Lloyd (lathiat) wrote :

This issue seems to have appeared somewhere between zfs-linux 0.8.4-1ubuntu11 (last known working version) and 0.8.4-1ubuntu16.

When the issue first hit, I had zfs-dkms installed, which was on 0.8.4-1ubuntu16, whereas the kernel build had 0.8.4-1ubuntu11. I removed zfs-dkms to go back to the kernel-built version and it was working OK. linux-image-5.8.0-36-generic has now been released on Hirsute with 0.8.4-1ubuntu16, so now the out-of-the-box kernel is also broken and I am regularly having problems with this.

linux-image-5.8.0-29-generic: working
linux-image-5.8.0-36-generic: broken

```
lathiat@optane ~/src/zfs[zfs-2.0-release]$ sudo modinfo /lib/modules/5.8.0-29-generic/kernel/zfs/zfs.ko|grep version
version: 0.8.4-1ubuntu11

lathiat@optane ~/src/zfs[zfs-2.0-release]$ sudo modinfo /lib/modules/5.8.0-36-generic/kernel/zfs/zfs.ko|grep version
version: 0.8.4-1ubuntu16
```

I don't have a good quick/easy reproducer, but just using my desktop for a day or two means I am likely to hit the issue after a while.

I tried to install the upstream zfs-dkms package for 2.0 to see if I could bisect the issue on upstream versions, but it breaks my boot for some weird systemd reason I cannot quite figure out yet.

Looking at the Ubuntu changelog, I'd say the fix for https://bugs.launchpad.net/bugs/1899826 that landed in 0.8.4-1ubuntu13 to backport the 5.9 and 5.10 compatibility patches is a prime suspect, but it could also be any other version. I'm going to try to 'bisect' 0.8.4-1ubuntu11 through 0.8.4-1ubuntu16 to figure out which version actually introduced it.
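
For anyone wanting to repeat that package 'bisect', here is a rough sketch of stepping through zfs-dkms versions. It assumes the intermediate versions can still be fetched (from the archive if still published, otherwise from the package's Launchpad publishing history), and the version string below is only an example:

```
# See which versions apt still offers
apt-cache policy zfs-dkms

# Install a specific candidate version if it is still published
sudo apt-get install zfs-dkms=0.8.4-1ubuntu14

# Otherwise install a .deb downloaded from the Launchpad publishing history
sudo dpkg -i zfs-dkms_0.8.4-1ubuntu14_all.deb

# DKMS rebuilds the module on install; reboot so the candidate module is actually loaded
sudo reboot
```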

Since the default kernel is now hitting this, there have been 2 more user reports of the same thing on the upstream bug in the past few days since that kernel landed, and I am regularly getting inaccessible files, not just from Chrome but even in a Linux git tree among other things. I am going to raise the priority of this bug to Critical, as you lose access to files, so it has data-loss potential. I have not yet determined whether you can somehow get the data back; so far it has only affected files I can replace, such as cache/git files. It seems like snapshots might be OK (which would make sense).

Changed in zfs-linux (Ubuntu):
importance: High → Critical
tags: added: seg
Revision history for this message
Colin Ian King (colin-king) wrote :

Can you test the zfs 2.0.1 in https://launchpad.net/~colin-king/+archive/ubuntu/zfs-hirsute using:

sudo add-apt-repository ppa:colin-king/zfs-hirsute
sudo apt-get update

Hopefully this will address the issue.

Changed in zfs-linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
status: Confirmed → In Progress
Revision history for this message
Trent Lloyd (lathiat) wrote :

Using 2.0.1 from hirsute-proposed, it seems like I'm still hitting this. I moved and replaced .config/google-chrome, and it seems that after using it for a day, shutting down and booting up, the same issue appears again.

Going to see if I can somehow try to reproduce this on a different disk or in a VM with xfstests or something.

Changed in zfs:
status: Unknown → New
Revision history for this message
Trent Lloyd (lathiat) wrote :

I can confirm 100% that this bug is still happening with 2.0.1 from hirsute-proposed, even with a brand new install on a different disk (SATA SSD instead of the NVMe Intel Optane 900p SSD), using 2.0.1 inside the installer and from first boot. I can reproduce it reliably within about 2 hours of just using the desktop with Google Chrome (after restoring my Google Chrome sync, so a common set of data and extensions). It always seems to trigger first on an access from Google Chrome for some reason - that part is very reliable - but other files can get corrupted or lose access, including git trees and the like.

So I am at a loss to explain the cause, given that no one outside of Ubuntu seems to be hitting this. It also, for whatever reason, always seems to cause my Tampermonkey and LastPass extension files to show as corrupt - but not other extensions - and this very reliably happens every time.

The only notable change from default is I am using encryption=on with passphrase for /home/user. I have not tested with encryption off.

Revision history for this message
Randall Leeds (randall-leeds) wrote :

2.0.2 is now in hirsute-proposed.

Revision history for this message
Colin Ian King (colin-king) wrote :

I've uploaded 2.0.2 with some minor extra fixes, please let us know if this addresses the issue.

Revision history for this message
Randall Leeds (randall-leeds) wrote :

I'm trying out the ZFS 2.0.2 userspace packages now, from hirsute (not -proposed), and will report back. I note that the kernel module for linux-modules-5.10.0-14-lowlatency that I'm using is still 2.0.1.

Revision history for this message
Randall Leeds (randall-leeds) wrote :

I've installed zfs-dkms to ensure that I'm using the 2.0.2 kmod. I'll post back later in the week about whether I see any panics with regular use.

Revision history for this message
Randall Leeds (randall-leeds) wrote :

I have not yet had any problems with 2.0.2.

Revision history for this message
Colin Ian King (colin-king) wrote :

Excellent news. I'll mark this as Fix Released. If this problem occurs again, please feel free to re-open the bug report.

Changed in zfs-linux (Ubuntu):
status: In Progress → Fix Released
milestone: none → ubuntu-21.04
Revision history for this message
Andrew Paglusch (andrew-paglusch) wrote :
Download full text (6.9 KiB)

I just upgraded to the latest ZoL release and am still having the same problem. I also upgraded my pool (after creating a checkpoint).

$ modinfo zfs | head -12
filename: /lib/modules/5.4.0-66-generic/updates/dkms/zfs.ko
version: 2.0.3-0york0~20.04
license: CDDL
author: OpenZFS
description: ZFS
alias: devname:zfs
alias: char-major-10-249
srcversion: DCE77834FDF1B30075B1328
depends: spl,znvpair,icp,zlua,zzstd,zunicode,zcommon,zavl
retpoline: Y
name: zfs
vermagic: 5.4.0-66-generic SMP mod_unload

$ dpkg -l | grep zfs
ii libzfs2linux 2.0.3-0york0~20.04 amd64 OpenZFS filesystem library for Linux
ii zfs-dkms 2.0.3-0york0~20.04 all OpenZFS filesystem kernel modules for Linux
ii zfs-zed 0.8.3-1ubuntu12.6 amd64 OpenZFS Event Daemon
ii zfsutils-linux 2.0.3-0york0~20.04 amd64 command-line tools to manage OpenZFS filesystems

$ zfs --version
zfs-2.0.3-0york0~20.04
zfs-kmod-0.8.3-1ubuntu12.6

Panic:
```
Mar 7 20:29:28 home-nas kernel: [ 181.778239] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
Mar 7 20:29:28 home-nas kernel: [ 181.778605] PANIC at zfs_znode.c:339:zfs_znode_sa_init()
Mar 7 20:29:28 home-nas kernel: [ 181.778778] Showing stack for process 2854
Mar 7 20:29:28 home-nas kernel: [ 181.778793] CPU: 0 PID: 2854 Comm: ls Tainted: P OE 5.4.0-66-generic #74-Ubuntu
Mar 7 20:29:28 home-nas kernel: [ 181.778796] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
Mar 7 20:29:28 home-nas kernel: [ 181.778798] Call Trace:
Mar 7 20:29:28 home-nas kernel: [ 181.778948] dump_stack+0x6d/0x9a
Mar 7 20:29:28 home-nas kernel: [ 181.779068] spl_dumpstack+0x29/0x2b [spl]
Mar 7 20:29:28 home-nas kernel: [ 181.779081] spl_panic+0xd4/0xfc [spl]
Mar 7 20:29:28 home-nas kernel: [ 181.779491] ? __zfs_dbgmsg+0xe0/0x110 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.779624] ? sa_cache_constructor+0x27/0x50 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.779643] ? _cond_resched+0x19/0x30
Mar 7 20:29:28 home-nas kernel: [ 181.779657] ? mutex_lock+0x13/0x40
Mar 7 20:29:28 home-nas kernel: [ 181.779760] ? dmu_buf_replace_user+0x60/0x80 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.779863] ? dmu_buf_set_user_ie+0x1a/0x20 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.780000] zfs_znode_sa_init.isra.0+0xdf/0xf0 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.780138] zfs_znode_alloc+0x102/0x6d0 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.780249] ? aggsum_add+0x196/0x1b0 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.780343] ? dmu_buf_unlock_parent+0x38/0x80 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.780429] ? dbuf_read_impl.constprop.0+0x614/0x700 [zfs]
Mar 7 20:29:28 home-nas kernel: [ 181.780442] ? spl_kmem_cache_alloc+0xc1/0x7d0 [spl]
Mar 7 20:29:28 home-nas kernel: [ 181.780447] ? _cond_resched+0x19/0x30
Mar 7 20:29:28 home-nas kerne...

Read more...

Revision history for this message
Colin Ian King (colin-king) wrote :

@Andrew, zfs-dkms 2.0.3-0york0~20.04 is not a recognized supported ZFS dkms package.

Revision history for this message
Andrew Paglusch (andrew-paglusch) wrote :

@Colin Would the package in your PPA be a better source of information for trying to reproduce this bug?

Revision history for this message
Trent Lloyd (lathiat) wrote :

It's worth noting that, as best I can understand, the patches won't fix an already broken filesystem. You have to remove all of the affected files, and it's difficult to know exactly which files are affected. I try to guess based on which ones show a ??? mark in "ls -la", but sometimes the "ls" hangs, etc.

I've been running zfs-dkms 2.0.2-1ubuntu2 for 24 hours now and so far so good. I won't call it conclusive, but I'm hoping this has solved it. Though I am thoroughly confused as to which patch solved it; nothing *seems* relevant, which is frustrating.

I will try to update in a few days as to whether it has definitely not hit again; most of the time I hit it within a day, but that wasn't strictly 100%.

Revision history for this message
Trent Lloyd (lathiat) wrote :

I got another couple of days out of it without issue - so I think it's likely fixed.

This issue looks very similar to the following upstream reports (same behaviour but a different error), so I wonder if it was ultimately the same bug. It looks like this patch from 2.0.3 was pulled into the package?
https://github.com/openzfs/zfs/issues/11621
https://github.com/openzfs/zfs/issues/11474
https://github.com/openzfs/zfs/pull/11576

Further testing has been hampered because zsys deleted all of my home datasets entirely (including all snapshots), tracked in https://github.com/ubuntu/zsys/issues/196. I am using a non-ZFS boot until I finish recovering that, but this still seems likely fixed, as I was hitting it most days before.

Revision history for this message
jamminfe (jamminfe) wrote :

The issue appeared :(

2021 May 16 21:19:09 laptop VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
2021 May 16 21:19:09 laptop PANIC at zfs_znode.c:339:zfs_znode_sa_init()

Linux laptop 5.8.0-45-generic

zfs-2.0.4
zfs-kmod-2.0.4

Revision history for this message
jamminfe (jamminfe) wrote :

Is there anything I can do to provide more debug info needed for the fix?

Revision history for this message
Trent Lloyd (lathiat) wrote :

Are you confident that the issue is a new one? Unfortunately, as best I can tell, the corruption can occur and will then still appear on a fixed system if it reads corruption created in the past, which scrub unfortunately doesn't seem to detect.

I've still had no recurrence here after a few weeks on hirsute with 2.0.2-1ubuntu5 (which includes the https://github.com/openzfs/zfs/issues/11474 fix) - but from a fresh install.

Revision history for this message
mhosken (martin-hosken) wrote :

Is there a way to clear the corruption without having to do a fresh install? This bug is crippling me on hirsute, kernel 5.13.0-12, zfs-2.0.2.

Revision history for this message
mhosken (martin-hosken) wrote :

Actually zfs-2.0.2-1ubuntu5, zfs-kmod-2.0.3-8ubuntu5.

I'm not sure how to create corruption, but I have plenty of it causing processes to kernel panic. https://github.com/openzfs/zfs/issues/11474 contains some instructions on how to create a corrupted file using wine.

Even after the fixed zfs lands, those of us with scrambled filesystems are going to need help in clearing out the mess :( So please don't close this bug until there is a mitigation process as well as a fix to zfs to stop it happening further.

Changed in zfs-linux (Ubuntu):
status: Fix Released → In Progress
Revision history for this message
Colin Ian King (colin-king) wrote :

Just for clarification, fixing the corruption caused by panic as noted in https://github.com/openzfs/zfs/issues/11474 is as follows:

"For anyone who hit this issue you should be able to fix the panic by temporarily enabling the zfs_recover module option (set /sys/module/zfs/parameters/zfs_recover=1). This will convert the panic in to a warning for any effected files/directories/symlinks/etc. Since the mode information is what was long the code will assume the inode is a regular file and you should be able to remove it."

Revision history for this message
mhosken (martin-hosken) wrote :

Does using an encrypted pool have any impact on this?

Revision history for this message
Trent Lloyd (lathiat) wrote :

Try the zfs_recover step from Colin's comment above. And then look for invalid files and try to move them out of the way.

I'm not aware of encrypted pools being specifically implicated (there is no such mention in the bug and it doesn't seem like it). Having said that, I am using encryption on the dataset where I was hitting this.

Revision history for this message
William Wilson (jawn-smith) wrote :

I am also hitting this error. I think I've been having it for a little while, but have just today started digging into it.

I tried the zfs_recover suggestion to get rid of the corrupted files, but that did not work:

jawn-smith@desktop:~$ zfs version
zfs-2.0.3-8ubuntu7
zfs-kmod-2.0.2-1ubuntu5
jawn-smith@desktop:~$ cat /sys/module/zfs/parameters/zfs_recover
1
jawn-smith@desktop:~$ cat /proc/cmdline
BOOT_IMAGE=/BOOT/ubuntu_ixviql@/vmlinuz-5.11.0-22-generic root=ZFS=rpool/ROOT/ubuntu_ixviql ro quiet splash acpi=force zfs.zfs_recover=1 vt.handoff=1

Trying to then run `rm -rf <corrupted_files>` resulted in this: https://paste.ubuntu.com/p/fh336kcStH/

The machine is usable for now, so I'll leave it in this state for a while in order to help troubleshoot.

Revision history for this message
Stéphane Graber (stgraber) wrote :

In my case I was constantly getting corruption of /etc/apparmor.d with the matching zfs PANIC. I'd fix that directory and it'd break again on next boot.

System is impish with 5.13 kernel (same on 5.11) using zfs encryption.

After fighting with this for over a day, I just gave the 2.1.0 dkms a go (won't recommend as that's quite hackish on impish and doesn't play super well with the initrd scripts) and so far so good, scanned the entire filesystem and all broken files are now readable.

Let's see if it ends up coming back for me...

Revision history for this message
Trent Lloyd (lathiat) wrote :

This has re-appeared for me today after upgrading to 5.13.0-14 on Impish. Same call stack, and same chrome-based applications (Mattermost was hit first) affected.

Not currently running DKMS, so:

Today:
5.13.0-14-lowlat Tue Aug 24 10:59 still running (zfs module is 2.0.3-8ubuntu6)

Yesterday:
5.11.0-25-lowlat Mon Aug 23 12:52 - 08:05 (19:13) (zfs module is 2.0.2-1ubuntu5)

I am a bit confused because the patched line "newmode = zp->z_mode;" still seems present in the package.

Revision history for this message
Trent Lloyd (lathiat) wrote :

I traced the call failure. I found that the failing code is in sa.c:1291, in sa_build_index():

  if (BSWAP_32(sa_hdr_phys->sa_magic) != SA_MAGIC) {

This code prints debug info to /proc/spl/kstat/zfs/dbgmsg, which for me is:
1629791353 sa.c:1293:sa_build_index(): Buffer Header: cb872954 != SA_MAGIC:2f505a object=0x45175e

So in this case it seems the data is somehow corrupted, since this is supposed to be a magic value that is always correct and never changes. It's not entirely clear how this actually plays into the original bug, so it may be that this is really a different bug. Hrm.
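
For anyone who wants to check for the same signature, the debug buffer mentioned above can be read like this (a sketch; the exact message text may differ between ZFS versions):

```
# Clear the SPL/ZFS debug message buffer, then reproduce the hang
echo 0 | sudo tee /proc/spl/kstat/zfs/dbgmsg

# Look for the sa_build_index() message complaining that the buffer header does not match SA_MAGIC
sudo grep sa_build_index /proc/spl/kstat/zfs/dbgmsg
```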

Revision history for this message
Colin Ian King (colin-king) wrote :

@Trent, can you open a new bug? It does seem to be a different bug, and I'd like to separate out the issues to reduce the debugging/fixing/tracking complexity on the original bug.

Revision history for this message
Trent Lloyd (lathiat) wrote :

@Colin To be clear, this is the same bug I originally hit and opened this Launchpad bug for; it just doesn't quite match what most people saw in the upstream bugs. But it seemed to get fixed anyway for a while, and it has somehow regressed again.

Same exception as from the original description and others reporting:
2021 May 16 21:19:09 laptop VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed

The upstream bug mostly reported slightly different errors though similar symptoms (files get stuck and can't be accessed).

I also tried to use 'zdb' to check whether incorrect file modes were saved. Unfortunately, it seems zdb does not work for encrypted datasets; it only dumps the unencrypted block info and doesn't dump info about file modes etc. from the encrypted part, so I can't check that.

I've reverted back to 5.11.0-25 for now and it's stable again.

Revision history for this message
Trent Lloyd (lathiat) wrote (last edit ):

3 more user reports on the upstream bug of people hitting it on Ubuntu 5.13.0:
https://github.com/openzfs/zfs/issues/10971

I think this needs some priority. It doesn't seem like it's hitting upstream; for some reason it is only really hitting Ubuntu.

Revision history for this message
Trent Lloyd (lathiat) wrote :

While trying to set up a reproducer that would exercise Chrome or Wine or something, I stumbled across the following reproducer, which worked twice in a row in a libvirt VM on my machine today.

The general gist is to
(1) Create a zfs filesystem with "-o encryption=aes-256-gcm -o compression=zstd -o atime=off -o keyformat=passphrase"
(2) rsync a copy of the openzfs git tree into it
(3) Reboot
(4) Use silversearcher-ag to search the directory for "DISKS="

Precise steps:
mkdir src
cd src
git clone https://github.com/openzfs/zfs
sudo apt install zfsutils-linux zfs-initramfs
sudo zpool create tank /dev/vdb
sudo zfs create tank/lathiat2 -o encryption=aes-256-gcm -o compression=zstd -o atime=off -o keyformat=passphrase
rsync -va --progress -HAX /etc/skel /tank/lathiat2/; chown -R lathiat:lathiat /tank/lathiat2; rsync -va --progress /home/lathiat/src/ /tank/lathiat2/src/; chown -R lathiat:lathiat /tank/lathiat2/src/
# reboot
sudo zfs load-key tank/lathiat2
sudo zfs mount -a
cd /tank/lathiat2/src/zfs/
ag DISKS=

Hit on the exact same crash:
[ 61.377929] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
[ 61.377930] PANIC at zfs_znode.c:339:zfs_znode_sa_init()

Now I will test this out on the beta 2.0.6 package, and also see whether the standard ZFS test suite will trigger it or not, as a matter of curiosity.

Revision history for this message
Trent Lloyd (lathiat) wrote :

I have created a 100% reliable reproducer test case and also determined that the Ubuntu-specific patch 4701-enable-ARC-FILL-LOCKED-flag.patch, added to fix Bug #1900889, is likely the cause.

[Test Case]

The important parts are:
- Use encryption
- rsync the zfs git tree
- Use parallel I/O from silversearcher-ag to access it after a reboot. A simple "find ." or "find . -exec cat {} > /dev/null \;" does not reproduce the issue.

Reproduction was done using a libvirt VM installed from the Ubuntu Impish daily live CD, using a normal ext4 root but with a second 4GB /dev/vdb disk for ZFS.

= Preparation
apt install silversearcher-ag git zfs-dkms zfsutils-linux
echo -n testkey2 > /root/testkey
git clone https://github.com/openzfs/zfs /root/zfs

= Test Execution
zpool create test /dev/vdb
zfs create test/test -o encryption=on -o keyformat=passphrase -o keylocation=file:///root/testkey
rsync -va --progress -HAX /root/zfs/ /test/test/zfs/

# If you access the data now it works fine.
reboot

zfs load-key test/test
zfs mount -a
cd /test/test/zfs/
ag DISKS=

= Test Result
ag hangs, "sudo dmesg" shows an exception

[Analysis]
I rebuilt the zfs-linux 2.0.6-1ubuntu1 package from ppa:colin-king/zfs-impish without the Ubuntu-specific patch ubuntu/4701-enable-ARC-FILL-LOCKED-flag.patch, which fixed Bug #1900889. With this patch disabled, the issue does not reproduce. Re-enabling the patch, it reproduces reliably every time again.
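
For reference, the rebuild described above can be reproduced roughly like this. This is a sketch assuming the usual quilt-based Debian packaging layout, with the PPA and patch name taken from this comment; the source directory name is approximate:

```
# Get build dependencies and the package source from the PPA under test
sudo add-apt-repository -s ppa:colin-king/zfs-impish
sudo apt-get update
sudo apt-get build-dep zfs-linux
apt-get source zfs-linux
cd zfs-linux-2.0.6

# Unapply the patch series, drop the suspect patch, and let the build reapply the rest
QUILT_PATCHES=debian/patches quilt pop -a
sed -i '/4701-enable-ARC-FILL-LOCKED-flag.patch/d' debian/patches/series

# Rebuild the binary packages and install the rebuilt DKMS module
dpkg-buildpackage -b -uc -us
sudo dpkg -i ../zfs-dkms_*.deb
```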

It seems this change was never sent upstream. No upstream code changes setting the flag ARC_FILL_IN_PLACE appear to have been added since, as far as I can see. Interestingly, however, the code for this ARC_FILL_IN_PLACE handling was added to fix a similar-sounding issue, "Raw receive fix and encrypted objset security fix", in https://github.com/openzfs/zfs/commit/69830602de2d836013a91bd42cc8d36bbebb3aae . This first shipped in zfs 0.8.0, and the original bug was filed against 0.8.3.

I have also found the same issue as this Launchpad bug reported upstream, with no fixes but a lot of discussion (and quite a few duplicates linking back to 11679):
https://github.com/openzfs/zfs/issues/11679
https://github.com/openzfs/zfs/issues/12014

Without fully understanding the ZFS code in relation to this flag: the code at https://github.com/openzfs/zfs/blob/ce2bdcedf549b2d83ae9df23a3fa0188b33327b7/module/zfs/arc.c#L2026 talks about how this flag relates to decrypting blocks in the ARC and doing so 'in place'. It thus makes some sense that I need encryption to reproduce it, that it works best after a reboot (which flushes the ARC), and that I can still read the data in the test case before rebooting, after which it fails.

This patch was added in 0.8.4-1ubuntu15 and I first experienced the issue somewhere between 0.8.4-1ubuntu11 and 0.8.4-1ubuntu16.

So it all adds up and I suggest that this patch should be reverted.

Revision history for this message
Sebastian Heiden (seb-heiden) wrote :

I also experienced this bug on Ubuntu 21.10 Beta during boot on a fresh ZFS Install with Encryption.

The only way to boot into GNOME without crashing was booting through recovery mode.

Revision history for this message
Colin Ian King (colin-king) wrote :

Fix is currently waiting to be accepted into the archive. Meanwhile the dkms package with the kernel drivers is available in ppa:colin-king/zfs-impish

Changed in zfs-linux (Ubuntu Impish):
status: In Progress → Fix Committed
Revision history for this message
Robin H. Johnson (robbat2) wrote :

Will new media be available w/ this fix in place?

I ask because running "dkms status" or any dkms command presently triggers this bug for me, and the only way forward I can see to being able to fix that system is new media w/ a patched kernel.

Revision history for this message
Randall Leeds (randall-leeds) wrote :

If the issue doesn't affect hirsute then it should be possible to update an impish system by chroot from hirsute media, I would think.

I've been booting up only occasionally this past week when I need to access something, but I find that apt, or possibly initramfs, updates seem to trigger the issue for me and render my system unbootable. I've been recovering with zfs rollback from hirsute media. Unfortunately, something also seems to go wrong with the zsys recovery entries in grub, where I end up with a dataset called rpool/ROOT/ubuntu_ (no trailing unique ID). I'll file that as a separate bug if it persists after this is resolved, though.

Anyway, I plan to try to chroot and update later and I'll report back.

Revision history for this message
Randall Leeds (randall-leeds) wrote :

Actually, I realize I may have been overconfident about that. Should it work to fix from a chroot or is the fix in the userland tools such that a chroot from hirsute would not help due to an issue in the userland ZFS tools under the chroot?

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 2.0.6-1ubuntu2

---------------
zfs-linux (2.0.6-1ubuntu2) impish; urgency=medium

  * Revert workaround on fill locked flag patch (LP: #1906476)
    - disable 4701-enable-ARC-FILL-LOCKED-flag.patch as this
      is causing zfs_node.c panics

 -- Colin Ian King <email address hidden> Sun, 26 Sep 2021 12:59:43 +0100

Changed in zfs-linux (Ubuntu Impish):
status: Fix Committed → Fix Released
Revision history for this message
Simon May (socob) wrote :

@#41: Dataset “rpool/ROOT/ubuntu_” sounds like bug 1894329, although that one should have been fixed.

Revision history for this message
Randall Leeds (randall-leeds) wrote :

Thanks, Simon. I'll verify a revert from the menu after I update and post there if things don't go well.

Revision history for this message
Trent Lloyd (lathiat) wrote :

So to be clear, this patch revert stops new corruption from being caused, but if the issue already happened on your filesystem it will continue to occur, because the exception is reporting corruption on disk. I don't currently have a good fix for this other than moving the affected files to a directory you don't use (but it's sometimes tricky to figure out which files are the cause).

For dkms status, you could try checking ls -la /proc/$(pidof dkms)/fd to see what files it has opened, or strace it, to try to figure out which file it is on when it hangs. Then move that file or directory out of the way and replace it.

Revision history for this message
Trent Lloyd (lathiat) wrote :

Relatedly, say you wanted to recover a system from a boot disk and copy all the data off to another disk. If you use a sequential file copy, like tar/cp in verbose mode, and watch it, eventually it will hang on the file triggering the issue (watch dmesg/kern.log). Once that happens, move that file into a directory like /broken which you exclude from tar/cp, reboot to get back into a working state, then start the copy again. That is basically what I did incrementally to find all the broken files. Fortunately for me they were mostly inside Chrome or Electron app dirs.
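
A rough sketch of that incremental copy; /backup is a placeholder destination and the dataset path is only an example:

```
# /backup is the destination disk; ./broken collects files that trigger the bug
mkdir -p /backup
cd /tank/mydata          # example path of the affected dataset
mkdir -p ./broken

# Copy verbosely so the last path printed is the file that hung;
# watch "dmesg -w" in another terminal for the VERIFY/PANIC message.
tar --exclude=./broken -cvpf - . | tar -xpf - -C /backup

# When it hangs: note the last file printed, reboot to clear the stuck task,
# then move that file (or its whole directory) into ./broken and restart the copy.
mv ./path/to/hung-file ./broken/
```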

Revision history for this message
Doki (lkishalmi) wrote :

Is the fix released on 20.04 as well?
My whole rpool is encrypted. This bug rendered my system useless.

Revision history for this message
Colin Ian King (colin-king) wrote (last edit ):

Just to clarify, Trent identified the following patch as problematic:

4701-enable-ARC-FILL-LOCKED-flag.patch

This does not appear in the following ZFS releases:

0.6.5.6-0ubuntu30 (xenial)
0.7.5-1ubuntu16.12 (bionic)
0.8.3-1ubuntu12.12 (focal)
2.0.2-1ubuntu5.2 (hirsute)
2.0.6-1ubuntu2 (impish)

Please double check that you have the latest version, e.g. after booting use:

sudo dmesg | grep ZFS

or cat /sys/module/zfs/version

If you are seeing the *same* crash, please add ZFS version in your comment. If it is a different crash, please file a new bug.

Revision history for this message
Brent Spillner (spillner) wrote :

> So to be clear, this patch revert stops new corruption from being caused, but if the issue already happened on your filesystem it will continue to occur, because the exception is reporting corruption on disk. I don't currently [...]

I don't think that's quite correct: like the OP, I can boot an older kernel with a pre-regression ZFS driver (ZFS 2.0.2), with the same filesystem(s) (and the same/newest userspace library and utility versions) on the same hardware, and it works quite happily, without any error or warning messages. I'm not at all convinced that the error message truly indicates irreparable filesystem damage; there may not even be anything wrong at all with the on-disk data structures, only with the driver's in-memory reconstruction or interpretation of them.

I hit this bug after allowing a 20.04 LTS installation to upgrade from a 5.11.0-7620-generic kernel to the 5.13.0-7614-generic in stable. The panics occurred on every boot attempt under 5.13.0, at the same fairly early point in the boot process (first page of kernel messages) every time. Rebooting with 5.11.0 doesn't generate any errors and has been running stably for over two weeks (as it did for months before the failed attempt to upgrade to 5.13.0). The 5.11.0 build reports ZFS module version v2.0.2-1ubuntu5, while the 5.13.0 has 2.0.3-8ubuntu6. My zfs package family is all 0.8.3-1ubuntu12.12.

It's annoying to get a regression like this in an LTS kernel, but at least reverting is easy and seems effective.

Revision history for this message
Doki (lkishalmi) wrote :

I can also confirm the versions.
Kernel: 5.13.0-7614-generic
ZFS package: 0.8.3-1ubuntu12.12
ZFS Module Version: 2.0.3-8ubuntu6

I can't confirm it yet, but the system seems to be more stable when booting with the 5.11 kernel.

Revision history for this message
Sebastian Heiden (seb-heiden) wrote :

The latest Ubuntu 21.10 install still comes with zfs-kmod-2.0.3, which prevents the system from booting in the first place.

Installing zfs-dkms via the recovery mode, which updates the zfs-kmod to version 2.0.6, fixed the issue for me.

I hope that the release image of Ubuntu 21.10 will ship with zfs-kmod-2.0.6 in the kernel by default.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

I ran into this over the past few days, and the zfs-dkms install via recovery also worked for me, though be warned that you will need to take careful steps with secureboot to enrol the MOK that this generates for you.

Revision history for this message
Evan Mesterhazy (emesterhazy) wrote :

Looking for advice, if anyone has it, as to how I can recover my filesystem after being affected by this bug. I installed zfs 2.0.6 via dkms and have moved affected files, like my Chrome cache, to a quarantine location as suggested by Trent. However, I keep stumbling on affected files, and even with /sys/module/zfs/parameters/zfs_recover=1 set I see panics in dmesg and hung processes that can't be killed.

From what I've gathered:

Running 2.0.6 via dkms means that I'm not running the affected code anymore. However, it seems like there is permanent FS corruption (which is very unfortunate since this is the sort of thing ZFS is supposed to be good at preventing).

Is there a best course of action to prevent future kernel panics? Would reverting to a snapshot from before I noticed this issue work, or is there shared metadata that's been corrupted? Should I just copy the files I can access onto another disk and re-create my ZFS pool / dataset?

Any advice is appreciated :/ Thanks

Revision history for this message
David Griffin (habilain) wrote :

Having just gotten bitten by this: At present, fresh installs of Ubuntu 21.10 have the 2.0.3 ZFS kernel module, and as such get impacted by this bug. This means that out of the box installs of Ubuntu 21.10 with ZFS essentially don't work at present because of this - someone might want to look into this!

Stefan Bader (smb)
no longer affects: zfs-linux (Ubuntu Impish)
Changed in zfs-linux (Ubuntu Impish):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → Critical
status: New → Fix Released
Changed in linux (Ubuntu Impish):
importance: Undecided → Critical
status: New → In Progress
Changed in linux (Ubuntu):
status: New → Invalid
Changed in linux (Ubuntu Impish):
assignee: nobody → Stefan Bader (smb)
Stefan Bader (smb)
Changed in linux (Ubuntu Impish):
status: In Progress → Fix Committed
Revision history for this message
David D Lowe (flimm) wrote :

I've also gotten bitten by this. Upgrading to Ubuntu 21.10 caused weird filesystem issues to be present on my ZFS filesystem, with errors on the command-line such as:

Cannot access 'foobar': No such file or directory

When running ls -l in some directory, I get question marks, like this:

-????????? ? ? ? ? ? enum_binary_params.hpp
-????????? ? ? ? ? ? enum.hpp

I tried to select the "revert" option in the Grub boot menu to go back to Ubuntu 21.04, and that worked.

I then tried to install a fresh installation of Ubuntu 21.10 from a USB stick, selecting the ZFS option with encryption, and I am getting these errors again.

I only noticed now this note in the release notes:

> The version of the ZFS driver included in the 5.13.0-19 kernel contains a bug that can result in filesystem corruption. Users of ZFS are advised to wait until the first Stable Release Update of the kernel in 21.10 before upgrading.

I know that this comment is not that useful to Ubuntu developers, but I'm posting it for the benefit of those searching for this issue in their favourite search engine.

Revision history for this message
David D Lowe (flimm) wrote :

I installed Ubuntu 21.10 system with ZFS and encryption, and I installed all APT updates. I quickly started experiencing filesystem corruption within an hour or two, and now my system won't boot. I see that this bug has been marked "Fix Released", but I am still experiencing it.

Even before the bug fix is released, I urge the developers to take urgent preventative measures to stop users from experiencing filesystem corruption, especially users who are upgrading, since they have more data to lose. Specifically, I ask you to take these three actions:

1. Make the upgrade tool refuse to upgrade to Ubuntu 21.10 if ZFS is being used, until a permanent fix is released.

2. If possible, don't allow users to install a fresh installation of Ubuntu 21.10 with ZFS enabled. I don't know if it's possible to release a new ISO, but please consider that it might be worth releasing a 21.10.01 ISO.

3. The release notes for Ubuntu 21.10 do contain warnings about filesystem corruption when using ZFS, but you have to read 2800 words before reaching the paragraph that warns about this. Please consider modifying the release notes to make this warning more prominent. Also, modify any other web pages that users might check before installing Ubuntu 21.10 or upgrading to it.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu-release-upgrader (Ubuntu Impish):
status: New → Confirmed
Changed in ubuntu-release-upgrader (Ubuntu):
status: New → Confirmed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.13.0-20.20 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-impish' to 'verification-done-impish'. If the problem still exists, change the tag 'verification-needed-impish' to 'verification-failed-impish'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-impish
Revision history for this message
William Wilson (jawn-smith) wrote (last edit ):

I have confirmed that the kernel 5.13.0-20 in impish-proposed solves this problem for me.

To confirm this, I performed the following steps:

On a ZFS installation:
1) boot from kernel 5.13.0-19.19
    a) observe the boot failed with PANIC at zfs_znode.c:339:zfs_znode_sa_init()
2) install kernel 5.13.0-20.20 from impish-proposed
    a) observe that the boot succeeds
    b) observe that there are no PANIC messages in journalctl for the most recent boot.

tags: added: verification-done-impish
removed: verification-needed-impish
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.13.0-20.20

---------------
linux (5.13.0-20.20) impish; urgency=medium

  * impish/linux: 5.13.0-20.20 -proposed tracker (LP: #1947380)

  * PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 ==
    sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl))
    failed (LP: #1906476)
    - debian/dkms-versions -- Update zfs to latest version

 -- Stefan Bader <email address hidden> Fri, 15 Oct 2021 15:53:08 +0200

Changed in linux (Ubuntu Impish):
status: Fix Committed → Fix Released
Revision history for this message
andrum99 (andrum99) wrote (last edit ):

I'm on Hirsute and zfs -V reports:

zfs-2.0.2-1ubuntu5.3
zfs-kmod-2.0.2-1ubuntu5.1

I'm guessing it's the zfs-kmod version that's important here - is that right? I'm not 100% sure from reading https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1906476/comments/49 which version I need on Hirsute to ensure I don't have the buggy code. Luckily I seem to have escaped any corruption, perhaps because none of my filesystems are encrypted. I've only ever used the ZFS kernel module binaries from the default hirsute repos for my Pi.

Also, were the kernel module binaries for 21.04 from the 'default' repos affected by the bug? I can't tell from the comments above if they were or not.

Revision history for this message
Harshal Prakash Patankar (pharshalp) wrote (last edit ):

I understand that "zpool scrub" isn't going to show any errors for this type of corruption. So, to check if any of the files in a given directory were corrupted, would it be sufficient to run "sudo find ." and check if the command returns without any error or without getting stuck at a file?

--- UPDATE based on the comment below ---
Would it be sufficient to run "sudo find . -exec stat {} +" and check if the command hangs? If it doesn't, would it imply that no files were corrupted?

Revision history for this message
Randall Leeds (randall-leeds) wrote :

I don't believe `fimd` is sufficient, but running `stat` on a file was enough to trigger it for me.

Revision history for this message
Jacquin Antoine (antitbone) wrote :
Download full text (4.6 KiB)

ubuntu 21.10

Linux i5 5.13.0-20-generic #20-Ubuntu SMP Fri Oct 15 14:21:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

zfs-2.0.6-1ubuntu2
zfs-kmod-2.0.6-1ubuntu2

$ rm -rfv .steam
...
removed '.steam/steam/config/htmlcache/Cache/5f67979416a221e5_0'
removed '.steam/steam/config/htmlcache/Cache/2b3b061ac9d6b292_0'
removed '.steam/steam/config/htmlcache/Cache/88ec9cdf0f2e7cf6_0'
removed '.steam/steam/config/htmlcache/Cache/446e83c112a55833_0'
removed '.steam/steam/config/htmlcache/Cache/25bce87ba6a10af5_0'
removed '.steam/steam/config/htmlcache/Cache/a409ef32a0f5a1b3_0'
removed '.steam/steam/config/htmlcache/Cache/2e8722be934b8d51_0'
removed '.steam/steam/config/htmlcache/Cache/cb27d7e85cfb9396_0'
removed '.steam/steam/config/htmlcache/Cache/f7bb287f03ab70bb_0'
removed '.steam/steam/config/htmlcache/Cache/18b17be83cac58df_0'
removed '.steam/steam/config/htmlcache/Cache/9f7e378b5b8fe6cf_0'
removed '.steam/steam/config/htmlcache/Cache/3742398e7e6ac7aa_0'
(rm stalls here)

[ 549.052760] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
[ 549.052763] PANIC at zfs_znode.c:339:zfs_znode_sa_init()
[ 549.052765] Showing stack for process 18526
[ 549.052766] CPU: 0 PID: 18526 Comm: rm Tainted: P OE 5.13.0-20-generic #20-Ubuntu
[ 549.052768] Hardware name: System manufacturer System Product Name/PRIME Z270-A, BIOS 0505 11/08/2016
[ 549.052769] Call Trace:
[ 549.052772] show_stack+0x52/0x58
[ 549.052776] dump_stack+0x7d/0x9c
[ 549.052781] spl_dumpstack+0x29/0x2b [spl]
[ 549.052791] spl_panic+0xd4/0xfc [spl]
[ 549.052799] ? queued_spin_unlock+0x9/0x10 [zfs]
[ 549.052889] ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 549.052944] ? __raw_spin_unlock+0x9/0x10 [zfs]
[ 549.052998] ? dmu_buf_replace_user+0x65/0x80 [zfs]
[ 549.053053] ? dmu_buf_set_user+0x13/0x20 [zfs]
[ 549.053107] ? dmu_buf_set_user_ie+0x15/0x20 [zfs]
[ 549.053160] zfs_znode_sa_init+0xd9/0xe0 [zfs]
[ 549.053242] zfs_znode_alloc+0x101/0x580 [zfs]
[ 549.053325] ? dmu_buf_unlock_parent+0x5d/0x90 [zfs]
[ 549.053380] ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 549.053436] ? dbuf_read_impl.constprop.0+0x30a/0x3e0 [zfs]
[ 549.053489] ? dbuf_rele_and_unlock+0x13b/0x520 [zfs]
[ 549.053541] ? __cond_resched+0x1a/0x50
[ 549.053544] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
[ 549.053547] ? queued_spin_unlock+0x9/0x10 [zfs]
[ 549.053597] ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 549.053647] ? __cond_resched+0x1a/0x50
[ 549.053648] ? down_read+0x13/0x90
[ 549.053650] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
[ 549.053652] ? queued_spin_unlock+0x9/0x10 [zfs]
[ 549.053711] ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 549.053770] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
[ 549.053773] ? dmu_object_info_from_dnode+0x8e/0xa0 [zfs]
[ 549.053829] zfs_zget+0x235/0x280 [zfs]
[ 549.053909] zfs_dirent_lock+0x420/0x560 [zfs]
[ 549.053990] zfs_dirlook+0x91/0x2d0 [zfs]
[ 549.054070] zfs_lookup+0x257/0x400 [zfs]
[ 549.054149] zpl_lookup+0xcb/0x220 [zfs]
[ 549.054227] ? __d_alloc+0x138/0x1f0
[ 549.054229] __lookup_hash+0x70/0xa0
[ 549.054231] ? __cond_resched+0x1a/0x50
...

Read more...

Revision history for this message
Alexey Gusev (alexeygusev) wrote :

I have the same problem on my two machines that have identical setup.

ZFS on root and data (two pools), compression enabled, some subvolumes are encrypted.

ubuntu 21.10
Kernel 5.13.0-20-generic

$ zfs --version
zfs-2.0.6-1ubuntu2
zfs-kmod-2.0.6-1ubuntu2

The first panic happens after I log in with my OS user; this triggers decryption of the ZFS subvolumes via PAM, and voilà:

VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
PANIC at zfs_znode.c:339:zfs_znode_sa_init()
Showing stack for process 8821
CPU: 6 PID: 8821 Comm: Cache2 I/O Tainted: P O 5.13.0-20-generic #20-Ubuntu
Hardware name: ASUS System Product Name/PRO H410T, BIOS 1401 07/27/2020
Call Trace:
 show_stack+0x52/0x58
 dump_stack+0x7d/0x9c
 spl_dumpstack+0x29/0x2b [spl]
 spl_panic+0xd4/0xfc [spl]
 ? queued_spin_unlock+0x9/0x10 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? dmu_buf_replace_user+0x65/0x80 [zfs]
 ? dmu_buf_set_user+0x13/0x20 [zfs]
 ? dmu_buf_set_user_ie+0x15/0x20 [zfs]
 zfs_znode_sa_init+0xd9/0xe0 [zfs]

The system itself is still usable but becomes unresponsive here and there, on an irregular basis. It then goes on and on with messages like this in dmesg:

INFO: task Cache2 I/O:8821 blocked for more than 1208 seconds.
      Tainted: P O 5.13.0-20-generic #20-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:Cache2 I/O state:D stack: 0 pid: 8821 ppid: 4247 flags:0x00000000
Call Trace:
 __schedule+0x268/0x680
 schedule+0x4f/0xc0
 spl_panic+0xfa/0xfc [spl]
 ? queued_spin_unlock+0x9/0x10 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? dmu_buf_replace_user+0x65/0x80 [zfs]
 ? dmu_buf_set_user+0x13/0x20 [zfs]
 ? dmu_buf_set_user_ie+0x15/0x20 [zfs]
 zfs_znode_sa_init+0xd9/0xe0 [zfs]

Processes that end up locked forever in the 'D' state (second column of the ps output) are usually firefox, gsd-housekeeping, sometimes gnome-shell and, as in the case above, find. I believe gnome-shell causes nautilus to misbehave. What also sucks is that this seems to cause my laptop to abort entering sleep mode with a 'resource busy' error, recursively: it tries to enter sleep, aborts (there's a message in syslog) and tries again, until the battery depletes completely.

Adding `zfs.zfs_recover=1` to the kernel boot parameter list maybe helps (thank you https://launchpad.net/~jawn-smith). At least it prevented the first zfs_znode panic message from appearing in dmesg after login, but this needs longer and more detailed observation under different loads. Also, an open question remains whether having such a kernel parameter set for regular use is appropriate.
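
For anyone else trying that, here is a sketch of making the parameter persistent (a standard GRUB setup is assumed; remove it again once cleanup is done):

```
# Append zfs.zfs_recover=1 to the default kernel command line
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&zfs.zfs_recover=1 /' /etc/default/grub
sudo update-grub

# Verify after the next reboot
cat /proc/cmdline
cat /sys/module/zfs/parameters/zfs_recover
```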

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oem-5.13/5.13.0-1018.22 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Changed in zfs-linux (Ubuntu):
assignee: Colin Ian King (colin-king) → nobody
Changed in zfs-linux (Ubuntu Impish):
assignee: Colin Ian King (colin-king) → nobody
Revision history for this message
Seth Arnold (seth-arnold) wrote :

Stefan, do filesystems affected by this need to be rebuilt in order to be used by other OpenZFS distributions or releases of Ubuntu that predate the introduction of the bug?

Thanks

Revision history for this message
Mason Loring Bliss (y-mason) wrote :

(Or Ubuntu systems post-fix but with pools created while the bug was active - and is there a fix possible, or is it "make a new pool"? Is there a diagnostic possible to be sure either way?)

Revision history for this message
Ben Wibking (bwibking) wrote :

I installed 21.10 with ZFS+encryption before this was fixed. Do I need to reformat and reinstall? Or is there a way to check for corruption on each zpool?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'm on a PI4, where the latest impish kernel is linux-image-5.13.0-1009-raspi 5.13.0-1009.10 and I saw the bug there. I reverted to kernel linux-image-5.11.0-1021-raspi 5.11.0-1021.22 and was able to mount my encrypted datasets again.

The kernel versioning seems to be different in the pi4 tree, and I don't see 5.13.0-20.20 there.

Revision history for this message
Juerg Haefliger (juergh) wrote :

The raspi kernel based on 5.13.0-20.20 is not yet in updates, it's still in proposed. Can you test 5.13.0-1010.12 from impish-proposed?

Revision history for this message
Fred (enoriel) wrote (last edit ):

Like many people here, I upgraded my encrypted ZFS Ubuntu system to 21.10, unaware of the issues with ZFS.

I only used 5.13.0-19 for a day and began to notice odd things: inability to suspend to RAM, freezes, and some processes that were impossible to kill.
I updated my kernel to 5.13.0-20 and then 5.13.0-21, but I still have issues with my ZFS encrypted filesystem. It seems the kernel update does not fix everything.

Even now it is unable to suspend (it says some processes refuse to suspend), which is really annoying for a laptop. Moreover, the automatically scheduled "updatedb" and "apt update" get stuck.
Even system shutdown often takes a long time, to the point that I have to force it with the power button. Many times it seems to get stuck waiting for AppArmor.
In addition, I keep getting errors with stack traces in dmesg; I have attached an example.

Now I am wondering what I should do to repair it. I considered downgrading my kernel to 5.11.0-37, which was my version before upgrading to 21.10, but it was automatically removed, and it does not seem very easy to put it back. I am wondering whether it is worth the hassle.

Besides, the ZFS version was also upgraded in the process, and it is not clear to me whether I could keep the new version with the old kernel or not. It seems to have gone from 0.8.4-1ubuntu11.3 to 2.0.2-1ubuntu5.1, then 2.0.2-1ubuntu5.2, and then 2.0.6-1ubuntu2 (though oddly I cannot find any trace of this last version being installed in the apt logs).
I am wondering whether rolling back to 0.8.4 would do any good.

My whole system is based on ZFS; if I have to reinstall it all, it will take me days to put everything back. Is there real corruption on the filesystem, which would be seen as such by an older ZFS version too, or is it just the new 2.0.x versions that are disturbed by it?
Would there be a way to fix the corruption, e.g. by deleting some files or changing metadata?

[EDIT] Also, I have a ZFS snapshot of the / and /var datasets (but not /home) from before the update to 21.10, and I am wondering whether it would help to roll back to them, or whether the corruption is so bad that it could affect snapshots too. The issue may arise if some corrupted files were not modified by the upgrade, in which case they would not have been copied, and both the snapshot and the current tree would reference the same corrupted file. Anyway, I have no backup of /boot, so I am not very sure it would work to just restore snapshots from a USB live system, or whether I would have to roll back the kernel to 5.11.0-37 and ZFS to 0.8.4 first, so that everything looks identical to the system.

Any help would be greatly appreciated !
Thank you.

Revision history for this message
Christian Sarrasin (sxc731) wrote :

@enoriel - I've read carefully the info in this ticket; I'm still contemplating the Hirsute to Impish upgrade and *really* glad I didn't pull the trigger earlier!!

The bad news is that it seems clear that the original Impish kernel (5.13.0-19) corrupted filesystems it ran against; thus running any (older/newer) kernel that doesn't include the bug won't "undo" the corruption. I *believe* only ZFS-encrypted filesystems were affected.

https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1906476/comments/64 and https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1906476/comments/47 have suggestions on how to rescue your filesystem. It may be best to boot the system to be rescued by other means and import the corrupted filesystem onto it; a Hirsute ISO should do as its ZFS version is close enough to the one used by Impish (2.0.x) but I haven't tested any of this; merely collating info above.

I'm still unclear as to whether a Hirsute -> Impish upgrade is safe. A clear cut statement by Canonical wouldn't go amiss given the severity of this issue (to put it mildly)...

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-raspi (Ubuntu Impish):
status: New → Confirmed
Changed in linux-raspi (Ubuntu):
status: New → Confirmed
Revision history for this message
Jonas Dedden (jonded94) wrote (last edit ):

I can also confirm that I am facing this bug, since I ran 5.13.0-19 for a short amount of time. I cannot report any new broken files since switching to 5.13.0-21-generic. I'm using an encrypted ZFS filesystem.

Nevertheless, it seems like I still have a small handful of broken files in just one single folder.

Even when I reboot into Ubuntu recovery mode with the zfs_recover parameter additionally enabled, I get kernel panics when trying to ls the broken directory or remove the broken file:

VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl))
PANIC at zfs_znode.c:335:zfs_znode_sa_init()

After that, even recovery mode does not respond to any user input.

Is there any way to remove the broken directory?

Revision history for this message
Fred (enoriel) wrote (last edit ):
Download full text (3.6 KiB)

Thank you Christian,
I think I managed to repair my system.
Here is how I did it, in case it helps others.
By the way, Jonas, it is impossible to remove broken files/folders, so the strategy I suggest is to destroy the dataset and restore it from a backup, while running from bootable media.
One can back up everything in the dataset except the corrupted files, and finally try to restore those by other means: reinstalling the package, or using any existing backups for personal files.

I scanned every dataset with find and stat, as suggested in this thread, until stat stalled, for example with /var (I did it for /, /var, /opt and /home, which each have their own dataset):

sudo find /var -mount -exec echo '{}' \; -exec stat {} \;

At the same time I monitored kernel errors:

tail -f /var/log/kern.log

When the scan freezes on a file, its name is the last thing printed by the echo command, and a stack trace appears in the log.

Each time a corrupted file is found, it is necessary to restart the scan from the beginning while excluding it, for example:

sudo find /var -mount -not -path '/var/lib/app-info/icons/ubuntu-impish-universe/*' -exec echo '{}' \; -exec stat {} \;
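
A rough bash sketch of automating that re-scan (excludes.txt is a hypothetical helper file holding one corrupted path pattern per line, e.g. '/var/lib/app-info/icons/ubuntu-impish-universe/*'):

touch excludes.txt                     # add each corrupted path pattern here as you find it
EXCL=()
while read -r p; do EXCL+=(-not -path "$p"); done < excludes.txt
sudo find /var -mount "${EXCL[@]}" -exec echo '{}' \; -exec stat {} \;
# when the scan hangs, note the last path echoed, append it to excludes.txt, reboot if needed, and re-run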

I was lucky: only one file was corrupted, /var/lib/app-info/icons/ubuntu-impish-universe/48x48/plasma-workspace_preferences-desktop-color.png
It is quite amazing that such a harmless file could cause such a mess on my system, but now that I have repaired /var there are no more spl_panic stack traces, and I can use apt and updatedb without them hanging.

Apparently, my corrupted file did not belong to any package (I checked with apt-file search <filepath>), and in the end I found it was recreated automatically, I don't know how...
Otherwise, I would have reinstalled the package after restoring the rest.

I backed up the whole /var with tar:

sudo tar --exclude=/var/lib/app-info/icons/ubuntu-impish-universe/48x48 --acls --xattrs --numeric-owner --one-file-system -zcpvf backup_var.tar.gz /var

At first I did not use --numeric-owner, but the owners were all messed up after restoring, and that prevented the system from reaching graphical mode (GDM complained about not having write access to /var/lib/gdm3/.config/).
This is probably because, by default, owner/group are stored by name and get mapped to different uid/gid values on the bootable media.
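
The matching restore step is not shown here; it would presumably pass the same attribute flags when extracting into the imported pool, something like:

sudo tar --acls --xattrs --numeric-owner -zxpvf backup_var.tar.gz -C /mnt/install
# GNU tar stores the members as var/..., so -C /mnt/install recreates /mnt/install/var,
# matching the dataset imported with -R /mnt/install in the steps below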

The backup process should not stall; if it does, there might be other corrupted files that the stat scan did not catch (I don't know whether that is possible).

To be extra sure that my root dataset (/) was not corrupted, I also created a backup of it, watching for a possible freeze, but none occurred.
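
The exact command for that root backup is not given; mirroring the /var one, it would presumably be:

sudo tar --acls --xattrs --numeric-owner --one-file-system -zcpvf backup_root.tar.gz /
# --one-file-system keeps tar from descending into /var, /home, etc. if they are separate datasets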

I created bootable USB media with Ubuntu 21.04 and booted from it.
Then I imported my ZFS pool:

sudo mkdir /mnt/install
sudo zpool import -f -R /mnt/install rpool
zfs list

I destroyed and recreated the dataset for /var (with options from my installation notes):

sudo zfs destroy -r rpool/root/var
sudo zfs create -o quota=16G -o mountpoint=/var rpool/root/var
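
If you do not have installation notes, the locally set properties of the old dataset can be captured before destroying it, for example:

sudo zfs get -s local all rpool/root/var
# lists only the properties explicitly set on this dataset (quota, mountpoint, ...)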

It is necessary to export and re-import the pool, otherwise a simple mount does not let you populate the new dataset (in my case, /var had been created inside the root dataset):

sudo zpool export -a
sudo zpool import -R /mnt/install rpool
s...

Read more...

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-raspi - 5.13.0-1010.12

---------------
linux-raspi (5.13.0-1010.12) impish; urgency=medium

  * impish/linux-raspi: 5.13.0-1010.12 -proposed tracker (LP: #1950142)

  * [Impish] Unable to boot rpi3b and cm3+ after upgrade to kernel
    5.13.0-1010.11-raspi (LP: #1950117)
    - firmware: raspberrypi: Partially revert 'firmware: bcm2835: Support
      ARCH_BCM270x'

  * Packaging resync (LP: #1786013)
    - [Packaging] update Ubuntu.md

  * HDMI output freezes under current/proposed impish kernels (LP: #1946368)
    - Revert "drm/vc4: Increase the core clock to a minimum of 500MHz"
    - Revert "drm/vc4: Increase the core clock based on HVS load"
    - Revert "drm/vc4: fix vc4_atomic_commit_tail() logic"
    - Revert "drm: Introduce a drm_crtc_commit_wait helper"
    - Revert "drm/vc4: kms: Convert to atomic helpers"
    - Revert "drm/vc4: kms: Remove async modeset semaphore"
    - Revert "drm/vc4: kms: Remove unassigned_channels from the HVS state"
    - Revert "drm/vc4: kms: Wait on previous FIFO users before a commit"
    - drm/vc4: Increase the core clock based on HVS load
    - drm/vc4: Increase the core clock to a minimum of 500MHz

  [ Ubuntu: 5.13.0-21.21 ]

  * impish/linux: 5.13.0-21.21 -proposed tracker (LP: #1947347)
  * It hangs while booting up with AMD W6800 [1002:73A3] (LP: #1945553)
    - drm/amdgpu: Rename flag which prevents HW access
    - drm/amd/pm: Fix a bug communicating with the SMU (v5)
    - drm/amd/pm: Fix a bug in semaphore double-lock
  * Add final-checks to check certificates (LP: #1947174)
    - [Packaging] Add system trusted and revocation keys final check
  * No sound on Lenovo laptop models Legion 15IMHG05, Yoga 7 14ITL5, and 13s
    Gen2 (LP: #1939052)
    - ALSA: hda/realtek: Quirks to enable speaker output for Lenovo Legion 7i
      15IMHG05, Yoga 7i 14ITL5/15ITL5, and 13s Gen2 laptops.
    - ALSA: hda/realtek: Fix for quirk to enable speaker output on the Lenovo 13s
      Gen2
  * Check for changes relevant for security certifications (LP: #1945989)
    - [Packaging] Add a new fips-checks script
    - [Packaging] Add fips-checks as part of finalchecks
  * BCM57800 SRIOV bug causes interfaces to disappear (LP: #1945707)
    - bnx2x: Fix enabling network interfaces without VFs
  * CVE-2021-3759
    - memcg: enable accounting of ipc resources
  * [impish] Remove the downstream xr-usb-uart driver (LP: #1945938)
    - SAUCE: xr-usb-serial: remove driver
    - [Config] update modules list
  * Fix A yellow screen pops up in an instant (< 1 second) and then disappears
    before loading the system (LP: #1945932)
    - drm/i915: Stop force enabling pipe bottom color gammma/csc
  * Impish update: v5.13.18 upstream stable release (LP: #1946249)
    - Linux 5.13.18
  * Impish update: v5.13.17 upstream stable release (LP: #1946247)
    - locking/mutex: Fix HANDOFF condition
    - regmap: fix the offset of register error log
    - regulator: tps65910: Silence deferred probe error
    - crypto: mxs-dcp - Check for DMA mapping errors
    - sched/deadline: Fix reset_on_fork reporting of DL tasks
    - power: supply: axp288_fuel_gauge: Report register-address on readb / writeb
      ...

Changed in linux-raspi (Ubuntu Impish):
status: Confirmed → Fix Released
Revision history for this message
Ethan (packruler) wrote :

Is there any way to resolve this issue without making a full backup of the ZFS volume? I have a volume that is multiple TB, and a backup-and-restore is not simple, especially considering the corrupted files are relatively minor JPG files.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

In my pi4 case, I reverted to a previous kernel and didn't see the problem anymore. I was seeing it during snapshot send, though I didn't do a full `find /storage` and touch every single file to confirm the problem was gone.

Revision history for this message
mog (launchpad-net-mog) wrote :

It would be really nice if the Ubuntu live CD ISO for USB booting could also get the fixed kernel.

Revision history for this message
AndreK (andre-k) wrote :

I have a laptop that was installed soon after the 21.10 release with full disk encryption and ZFS. After several crazy issues, I found this thread - it explains them.
1: What can I do to properly re-install the system (and fix all corrupted files)?
2: How can I even clean-install a disk-encrypted system if the 21.10 bootable ISO corrupts things with ZFS?

Revision history for this message
AndreK (andre-k) wrote :

...maybe installing from scratch with the current 22.04 daily build from https://cdimage.ubuntu.com/daily-live/current/
would solve my problem? Will that dev install eventually become a real release, or is a dev install upgraded after release somehow worse than a clean release install?

Revision history for this message
Trent Lloyd (lathiat) wrote :

Re-installing from scratch should resolve the issue. I suspect that in most cases, if you install with the 21.10 installer (even though it has the old kernel) and install updates during the install, this issue probably won't hit you. It mostly seems to occur after a reboot, when data is loaded back from disk again.

As per some of the other comments, you'll have a bit of a hard time copying data off the old broken install: you need to work through which files/folders are corrupt, reboot, and then exclude those from the next rsync.
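
A rough sketch of that iterative rsync (the paths and the exclude pattern are placeholders, not from this thread):

sudo rsync -aHAX --numeric-ids --exclude='/home/user/.cache/google-chrome/Default/Cache/' /mnt/oldpool/ /mnt/backup/
# add one --exclude per corrupt file/directory found so far, then re-run after each hang/reboot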

You could use the 22.04 daily build; it will eventually upgrade into the final release. However, this is not usually recommended, as there may be bugs or other problems in those daily images, and it's not uncommon for the development release to break at some point during the development cycle. Most of the time it works, but breakage is much more likely than with 21.10.

I'd try a re-install with 21.10 as I described. Obviously you'll need to backup all of your data from the existing install first.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-raspi - 5.15.0-1002.2

---------------
linux-raspi (5.15.0-1002.2) jammy; urgency=medium

  * jammy/linux-raspi: 5.15.0-1002.2 -proposed tracker (LP: #1958834)

  * Packaging resync (LP: #1786013)
    - [Packaging] update Ubuntu.md
    - debian/dkms-versions -- update from kernel-versions (main/master)

  * Kernel fails to boot in ScalingStack (LP: #1959102)
    - [Config] raspi: Set VIRTIO_PCI=m
    - [Config] raspi: Set ACPI=y

  * jammy/linux-raspi: Update to upstream raspberrypi rpi-5.15.y (2022-01-24)
    (LP: #1958854)
    - brcmfmac: firmware: Fix crash in brcm_alt_fw_path
    - ARM: dts: Remove VL805 USB node from CM4 dts
    - mfd: simple-mfd-i2c: Add configuration for RPi POE HAT
    - pwm: raspberrypi-poe: Add option of being created by MFD or FW
    - power: rpi-poe: Drop CURRENT_AVG as it is not hardware averaged
    - power: rpi-poe: Add option of being created by MFD or FW
    - defconfigs: Add MFD_RASPBERRYPI_POE_HAT to Pi defconfigs.
    - dtoverlays: Add option for PoE HAT to use Linux I2C instead of FW.
    - drivers: bcm2835_unicam: Disable trigger mode operation
    - arm: Remove spurious .fnend directive
    - drm/vc4: Fix deadlock on DSI device attach error
    - drm/vc4: dsi: Correct max divider to 255 (not 7)
    - defconfig: Add BACKLIGHT_PWM to bcm2709 and bcmrpi defconfigs
    - dtoverlays: Add pwm backlight option to vc4-kms-dpi-generic
    - dtoverlays: Correct [h|v]sync_invert config in vc4-kms-dpi-generic
    - ARM: dts: BCM2711 AON_INTR2 generates IRQ edges
    - update rpi-display-overlay.dts pins for 5.10+

  [ Ubuntu: 5.15.0-18.18 ]

  * jammy/linux: 5.15.0-18.18 -proposed tracker (LP: #1958638)
  * CVE-2021-4155
    - xfs: map unwritten blocks in XFS_IOC_{ALLOC, FREE}SP just like fallocate
  * CVE-2022-0185
    - SAUCE: vfs: test that one given mount param is not larger than PAGE_SIZE
  * [UBUNTU 20.04] KVM hardware diagnose data improvements for guest kernel -
    kernel part (LP: #1953334)
    - KVM: s390: add debug statement for diag 318 CPNC data
  * OOB write on BPF_RINGBUF (LP: #1956585)
    - SAUCE: bpf: prevent helper argument PTR_TO_ALLOC_MEM to have offset other
      than 0
  * Miscellaneous Ubuntu changes
    - [Config] re-enable shiftfs
    - [SAUCE] shiftfs: support kernel 5.15
    - [Config] update toolchain versions
  * Miscellaneous upstream changes
    - vfs: fs_context: fix up param length parsing in legacy_parse_param

linux-raspi (5.15.0-1001.1) jammy; urgency=medium

  * Missing overlays/README (LP: #1954757)
    - SAUCE: Install overlays/README

  * dtoverlay=uart4 breaks Raspberry Pi 4B boot (LP: #1875454)
    - SAUCE: arm: dts: Add 'brcm,bcm2835-pl011' compatible for uart2-5

  * jammy/linux-raspi: Update to upstream raspberrypi rpi-5.15.y (2022-01-14)
    (LP: #1958146)
    - clk: bcm-2835: Pick the closest clock rate
    - clk: bcm-2835: Remove rounding up the dividers
    - drm/vc4: hdmi: Set a default HSM rate
    - drm/vc4: hdmi: Move the HSM clock enable to runtime_pm
    - drm/vc4: hdmi: Make sure the controller is powered in detect
    - drm/vc4: hdmi: Make sure the controller is powered up during bind
    - drm/vc4: hdmi: Rework the ...

Changed in linux-raspi (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Evan Mesterhazy (emesterhazy) wrote :

Following up on #57. IMO, filesystem corruption is one of the worst types of bugs a Linux distro can introduce. That this bug was known, released anyway, and only mentioned 2000+ words into the patch notes is particularly egregious. I doubt a bug in EXT4 would be treated with the same carelessness. If you're shipping ZFS as a root filesystem in Ubuntu, then you need to treat it as such.

Has Canonical made any statement regarding how they will prevent this from happening in the future?

Revision history for this message
Brian Murray (brian-murray) wrote :

Ubuntu 21.10 (Impish Indri) has reached end of life, so this bug will not be fixed for that specific release.

Changed in ubuntu-release-upgrader (Ubuntu Impish):
status: Confirmed → Won't Fix
Changed in zfs:
status: New → Fix Released