Downgrading these libraries did not fix it for me.
One canary is the state of the updatedb.mlocate process, which goes into a live hang as soon as it hits the currently faulty file. For example:
2728 ? DNs 0:01 /usr/bin/updatedb.mlocate
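A quick way to spot such a process is to filter for tasks in uninterruptible sleep (the "D" state shown above). This is just a sketch using standard ps and awk, not anything specific to mlocate:

```shell
# List tasks stuck in uninterruptible sleep (state "D"); a hung
# updatedb.mlocate shows up here once it touches the faulty file.
ps -eo pid,stat,args | awk 'NR==1 || $2 ~ /^D/'
```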
Using lsof one can find out which file is faulty, but if you try to ls the directory containing it, your terminal live-hangs too. And if you try to delete the directory containing the faulty file, that silently fails, leaving the directory in place and the live hang waiting should you try to read it again.
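A minimal sketch of that inspection step, assuming the PID observed above (2728; substitute your own). Reading the process's fd table under /proc is an alternative that avoids running ls on the affected directory itself:

```shell
# PID of the hung process; defaults to the current shell so the
# snippet runs anywhere, substitute the real PID (e.g. 2728).
PID="${1:-$$}"
# lsof (if installed) shows which files the process has open,
# including the one it is blocked on:
command -v lsof >/dev/null && lsof -p "$PID"
# Alternatively, inspect the open file descriptors directly via
# /proc, without touching the faulty directory:
ls -l "/proc/$PID/fd"
```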
It's said that this bug relates to bug #1906476 and, via that, to https://github.com/openzfs/zfs/issues/11474, but I think I have that one fixed with zfs_recover=1. The syslog message is very similar, though:
Aug 11 10:13:45 silmh9 kernel: [ 242.531205] INFO: task updatedb.mlocat:2728 blocked for more than 120 seconds.
Aug 11 10:13:45 silmh9 kernel: [ 242.531211] Tainted: P U O 5.13.0-13-generic #13-Ubuntu
Aug 11 10:13:45 silmh9 kernel: [ 242.531212] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 11 10:13:45 silmh9 kernel: [ 242.531213] task:updatedb.mlocat state:D stack: 0 pid: 2728 ppid: 1 flags:0x00004220
Aug 11 10:13:45 silmh9 kernel: [ 242.531216] Call Trace:
Aug 11 10:13:45 silmh9 kernel: [ 242.531221] __schedule+0x268/0x680
Aug 11 10:13:45 silmh9 kernel: [ 242.531225] ? arch_local_irq_enable+0xb/0xd
Aug 11 10:13:45 silmh9 kernel: [ 242.531228] schedule+0x4f/0xc0
Aug 11 10:13:45 silmh9 kernel: [ 242.531231] spl_panic+0xfa/0xfc [spl]
Aug 11 10:13:45 silmh9 kernel: [ 242.531238] ? queued_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531287] ? do_raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531313] ? __raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531339] ? dmu_buf_replace_user+0x65/0x80 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531367] ? dmu_buf_set_user+0x13/0x20 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531393] ? dmu_buf_set_user_ie+0x15/0x20 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531420] zfs_znode_sa_init+0xd9/0xe0 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531479] zfs_znode_alloc+0x101/0x560 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531536] ? dmu_buf_unlock_parent+0x5d/0x90 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531564] ? do_raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531590] ? dbuf_read_impl.constprop.0+0x316/0x3e0 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531616] ? __cond_resched+0x1a/0x50
Aug 11 10:13:45 silmh9 kernel: [ 242.531618] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
Aug 11 10:13:45 silmh9 kernel: [ 242.531620] ? queued_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531642] ? __cond_resched+0x1a/0x50
Aug 11 10:13:45 silmh9 kernel: [ 242.531644] ? down_read+0x13/0x90
Aug 11 10:13:45 silmh9 kernel: [ 242.531645] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
Aug 11 10:13:45 silmh9 kernel: [ 242.531646] ? queued_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531680] ? do_raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531712] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
Aug 11 10:13:45 silmh9 kernel: [ 242.531713] ? dmu_object_info_from_dnode+0x8e/0xa0 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531742] zfs_zget+0x237/0x280 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531796] zfs_dirent_lock+0x42a/0x570 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531850] zfs_dirlook+0x91/0x2a0 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531902] zfs_lookup+0x1fb/0x3f0 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.531954] zpl_lookup+0xcb/0x230 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.532007] __lookup_slow+0x84/0x150
Aug 11 10:13:45 silmh9 kernel: [ 242.532009] walk_component+0x141/0x1b0
Aug 11 10:13:45 silmh9 kernel: [ 242.532010] ? dput+0x5c/0x180
Aug 11 10:13:45 silmh9 kernel: [ 242.532012] path_lookupat+0x6e/0x1c0
Aug 11 10:13:45 silmh9 kernel: [ 242.532013] filename_lookup+0xbb/0x1c0
Aug 11 10:13:45 silmh9 kernel: [ 242.532015] ? __check_object_size.part.0+0x128/0x150
Aug 11 10:13:45 silmh9 kernel: [ 242.532017] ? __check_object_size+0x1c/0x20
Aug 11 10:13:45 silmh9 kernel: [ 242.532018] ? strncpy_from_user+0x44/0x150
Aug 11 10:13:45 silmh9 kernel: [ 242.532020] ? __raw_spin_unlock+0x9/0x10 [zfs]
Aug 11 10:13:45 silmh9 kernel: [ 242.532066] user_path_at_empty+0x59/0x90
Aug 11 10:13:45 silmh9 kernel: [ 242.532068] vfs_statx+0x7a/0x120
Aug 11 10:13:45 silmh9 kernel: [ 242.532070] __do_sys_newlstat+0x3e/0x70
Aug 11 10:13:45 silmh9 kernel: [ 242.532072] ? __secure_computing+0x42/0xe0
Aug 11 10:13:45 silmh9 kernel: [ 242.532073] ? syscall_trace_enter.constprop.0+0xa0/0x1c0
Aug 11 10:13:45 silmh9 kernel: [ 242.532075] __x64_sys_newlstat+0x16/0x20
Aug 11 10:13:45 silmh9 kernel: [ 242.532076] do_syscall_64+0x61/0xb0
Aug 11 10:13:45 silmh9 kernel: [ 242.532078] ? exit_to_user_mode_prepare+0x37/0xb0
Aug 11 10:13:45 silmh9 kernel: [ 242.532079] ? syscall_exit_to_user_mode+0x27/0x50
Aug 11 10:13:45 silmh9 kernel: [ 242.532080] ? do_syscall_64+0x6e/0xb0
Aug 11 10:13:45 silmh9 kernel: [ 242.532081] ? do_syscall_64+0x6e/0xb0
Aug 11 10:13:45 silmh9 kernel: [ 242.532086] ? do_syscall_64+0x6e/0xb0
Aug 11 10:13:45 silmh9 kernel: [ 242.532087] entry_SYSCALL_64_after_hwframe+0x44/0xae
Aug 11 10:13:45 silmh9 kernel: [ 242.532088] RIP: 0033:0x7fbfb3cd97ba
Aug 11 10:13:45 silmh9 kernel: [ 242.532090] RSP: 002b:00007ffc377eead8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Aug 11 10:13:45 silmh9 kernel: [ 242.532091] RAX: ffffffffffffffda RBX: 000055f3abe4af70 RCX: 00007fbfb3cd97ba
Aug 11 10:13:45 silmh9 kernel: [ 242.532092] RDX: 00007ffc377eeb50 RSI: 00007ffc377eeb50 RDI: 000055f3abe68499
Aug 11 10:13:45 silmh9 kernel: [ 242.532092] RBP: 000055f3abe68499 R08: 0000000000000001 R09: 000055f3ab979740
Aug 11 10:13:45 silmh9 kernel: [ 242.532093] R10: 000055f3ab979580 R11: 0000000000000246 R12: 00007ffc377eed20
Aug 11 10:13:45 silmh9 kernel: [ 242.532093] R13: 0000000000000000 R14: 00007ffc377eed30 R15: 0000000000000000
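For reference, the zfs_recover=1 setting mentioned above can be applied to the loaded module roughly like this (a sketch; /sys/module/zfs/parameters/zfs_recover is the standard sysfs path for that zfs module parameter, but verify it exists on your kernel, and note the modprobe.d filename here is just an example):

```shell
# Enable zfs_recover on the running zfs module (root required):
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
# Make it persistent across reboots via modprobe configuration:
echo 'options zfs zfs_recover=1' | sudo tee /etc/modprobe.d/zfs.conf
```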