PANIC at zfs_znode.c:339:zfs_znode_sa_init()

Bug #1931660 reported by Chris Halse Rogers
This bug affects 5 people

Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm seeing a non-deterministic panic in the ZFS code. Last boot, this occurred in systemd-udevd, resulting in a failed boot. The boot before that, firefox hit the same thing, and this boot it looks like it's hit an Evolution component.

ProblemType: Bug
DistroRelease: Ubuntu 21.10
Package: linux-image-5.11.0-18-generic 5.11.0-18.19+21.10.1
ProcVersionSignature: Ubuntu 5.11.0-18.19+21.10.1-generic 5.11.17
Uname: Linux 5.11.0-18-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu67
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: chris 4988 F.... pulseaudio
 /dev/snd/controlC1: chris 4988 F.... pulseaudio
CasperMD5CheckResult: unknown
CurrentDesktop: ubuntu:GNOME
Date: Fri Jun 11 11:49:47 2021
InstallationDate: Installed on 2021-05-13 (28 days ago)
InstallationMedia: Ubuntu 20.04.2.0 LTS "Focal Fossa" - Release amd64 (20210209.1)
MachineType: Dell Inc. XPS 15 9575
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/BOOT/ubuntu_z8j7yc@/vmlinuz-5.11.0-18-generic root=ZFS=rpool/ROOT/ubuntu_z8j7yc ro quiet splash
RelatedPackageVersions:
 linux-restricted-modules-5.11.0-18-generic N/A
 linux-backports-modules-5.11.0-18-generic N/A
 linux-firmware 1.198
SourcePackage: linux
UpgradeStatus: Upgraded to impish on 2021-05-13 (28 days ago)
dmi.bios.date: 07/07/2019
dmi.bios.release: 1.7
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.7.1
dmi.board.name: 0C32VW
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 31
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.7.1:bd07/07/2019:br1.7:svnDellInc.:pnXPS159575:pvr:rvnDellInc.:rn0C32VW:rvrA00:cvnDellInc.:ct31:cvr:
dmi.product.family: XPS
dmi.product.name: XPS 15 9575
dmi.product.sku: 080D
dmi.sys.vendor: Dell Inc.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Robert Bordelon (riselstrom) wrote :

I encountered this problem on kernel 5.11.0-25 with version 2.0.2-1ubuntu5.1 of the ZFS libraries (I have the following ZFS-related packages installed: libnvpair3linux, libuutil3linux, libzfs4linux, libzpool4linux, zfs-initramfs, zfs-zed and zfsutils-linux), but it went away after downgrading those packages to version 2.0.2-1ubuntu5.
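
For anyone who wants to try the same downgrade, something like the following should work; the package list and version string come from the comment above, so adjust them to match what apt reports on your own system:

   # See which ZFS packages and versions are currently installed:
   apt list --installed 2>/dev/null | grep -E 'zfs|nvpair|uutil|zpool'

   # Downgrade the userspace packages (the older version must still
   # be available in your configured apt sources):
   sudo apt install \
       libnvpair3linux=2.0.2-1ubuntu5 \
       libuutil3linux=2.0.2-1ubuntu5 \
       libzfs4linux=2.0.2-1ubuntu5 \
       libzpool4linux=2.0.2-1ubuntu5 \
       zfs-initramfs=2.0.2-1ubuntu5 \
       zfs-zed=2.0.2-1ubuntu5 \
       zfsutils-linux=2.0.2-1ubuntu5

   # Optionally hold them so a routine upgrade doesn't bump them back:
   sudo apt-mark hold zfsutils-linux zfs-initramfs zfs-zed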

Revision history for this message
mhosken (martin-hosken) wrote :

Downgrading these libraries did not fix it for me.

One canary is the state of the updatedb.mlocate process, which goes into an uninterruptible ("live") hang when it hits the currently faulty file. For example:

   2728 ? DNs 0:01 /usr/bin/updatedb.mlocate

Using lsof one can find out which file is the faulty one, but if you try to ls the directory containing it, your terminal hangs. And if you try to delete the directory containing the faulty file, that silently fails, leaving the directory in place and the hang waiting for anyone who next tries to read it.
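
A quick way to find such stuck processes and see what they were touching is sketched below; the PID is illustrative:

   # Tasks in uninterruptible sleep (state D) are the canaries:
   ps -eo pid,stat,comm | awk '$2 ~ /^D/'

   # List the files a stuck task has open (PID from the output above):
   sudo lsof -p 2728

   # Its kernel-side stack usually points straight into the ZFS code:
   sudo cat /proc/2728/stack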

This bug reportedly relates to bug 1906476 and, via that, to https://github.com/openzfs/zfs/issues/11474, but I think I have that one worked around with zfs_recover=1 (see the sketch after the trace below for how to set it). The syslog message is very similar:

Aug 11 10:13:45 silmh9 kernel: [ 242.531205] INFO: task updatedb.mlocat:2728 blocked for more than 120 seconds.
Aug 11 10:13:45 silmh9 kernel: [ 242.531211] Tainted: P U O 5.13.0-13-generic #13-Ubuntu
Aug 11 10:13:45 silmh9 kernel: [ 242.531212] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 11 10:13:45 silmh9 kernel: [ 242.531213] task:updatedb.mlocat state:D stack: 0 pid: 2728 ppid: 1 flags:0x00004220
Aug 11 10:13:45 silmh9 kernel: [ 242.531216] Call Trace:
Aug 11 10:13:45 silmh9 kernel: [ 242.531221] __schedule+0x268/0x680
Aug 11 10:13:45 silmh9 kernel: [ 242.531225] ? arch_local_irq_enable+0xb/0xd
Aug 11 10:13:45 silmh9 kernel: [ 242.531228] schedule+0x4f/0xc0
Aug 11 10:13:45 silmh9 kernel: [ 242.531231] spl_panic+0xfa/0xfc [spl]
Aug 11 10:13:45 silmh9 kernel: [ 242.531238] ? queued_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531287] ? do_raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531313] ? __raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531339] ? dmu_buf_replace_user+0x65/0x80 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531367] ? dmu_buf_set_user+0x13/0x20 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531393] ? dmu_buf_set_user_ie+0x15/0x20 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531420] zfs_znode_sa_init+0xd9/0xe0 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531479] zfs_znode_alloc+0x101/0x560 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531536] ? dmu_buf_unlock_parent+0x5d/0x90 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531564] ? do_raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531590] ? dbuf_read_impl.constprop.0+0x316/0x3e0 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531616] ? __cond_resched+0x1a/0x50
Aug 11 10:13:45 silmh9 kernel: [ 242.531618] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
Aug 11 10:13:45 silmh9 kernel: [ 242.531620] ? queued_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531642] ? __cond_resched+0x1a/0x50
Aug 11 10:13:45 silmh9 kernel: [ 242.531644] ? down_read+0x13/0x90
Aug 11 10:13:45 silmh9 kernel: [ 242.531645] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
Aug 11 10:13:45 silmh9 kernel: [ 242.531646] ? queued_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531680] ? do_raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh...
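
For reference, zfs_recover mentioned above is an OpenZFS module parameter; a minimal sketch for flipping it at runtime and persisting it across reboots, using the standard sysfs and modprobe.d paths:

   # Enable it on the running system:
   echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover

   # Persist across reboots, then refresh the initramfs so early
   # boot (e.g. a ZFS root pool) picks it up as well:
   echo 'options zfs zfs_recover=1' | sudo tee /etc/modprobe.d/zfs.conf
   sudo update-initramfs -u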


Revision history for this message
mhosken (martin-hosken) wrote :

Does the fact that this is an encrypted pool have any impact on this?
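
For anyone comparing notes, encryption status can be read straight off the dataset properties (the pool name here is an example):

   # Shows 'off' for unencrypted datasets, otherwise the cipher in use:
   zfs get -r encryption rpool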

Revision history for this message
Doki (lkishalmi) wrote :

It started happening recently on my encrypted pool. A filesystem scrub reported no errors.
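
For comparison on your own machine, running a scrub and checking the result looks like this (pool name is an example):

   sudo zpool scrub rpool     # runs in the background
   zpool status -v rpool      # progress, plus any errors found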

Revision history for this message
Robert Bordelon (riselstrom) wrote :

I've hit this bug on a non-encrypted pool, and a scrub also reported no errors for me. I'm currently on kernel 5.11.0-37-generic with ZFS libraries 2.0.2-1ubuntu5.2 (Hirsute), though the corruption may have occurred under a previous kernel or library version. The directory triggering the hang for me was created on Jun 7 of this year, so it appears to be a recent problem. I've identified the corrupted directory and moved it out of the way (fortunately it contains nothing important), but I'm unable to delete it from the filesystem: any attempt to read or delete it hangs. Setting zfs_recover=1 does not help, and the only way to recover from the hang is to reboot. Is there any way to get rid of the corrupted directory other than creating a new filesystem and copying over everything except the problem directory?
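
Since the broken directory can neither be read nor unlinked, the usual escape hatch is exactly what this comment describes: recreate the dataset and copy everything else across, excluding the bad path. A rough sketch, assuming the affected dataset is rpool/data mounted at /rpool/data and the corrupted directory is baddir at its root (all of these names are placeholders):

   # Create a fresh dataset alongside the damaged one:
   sudo zfs create rpool/data_new

   # Copy everything except the corrupted directory; rsync skips it
   # entirely, so nothing ever tries to read it:
   sudo rsync -aHAX --exclude='/baddir' /rpool/data/ /rpool/data_new/

   # Once satisfied the copy is complete, swap the datasets over:
   sudo zfs rename rpool/data rpool/data_broken
   sudo zfs rename rpool/data_new rpool/data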
