Multiple data corruption issues in zfs

Bug #2044657 reported by Tobias Heider
This bug affects 31 people
Affects: zfs-linux (Ubuntu) - Status tracked in Noble

  Series   Status         Importance   Assigned to           Milestone
  Xenial   Confirmed      Low          Unassigned            -
  Bionic   Confirmed      Medium       Unassigned            -
  Focal    Fix Released   Medium       Dimitri John Ledkov   -
  Jammy    Fix Released   High         Dimitri John Ledkov   -
  Lunar    Won't Fix      Undecided    Unassigned            -
  Mantic   Fix Released   High         Dimitri John Ledkov   -
  Noble    Fix Released   Undecided    Dimitri John Ledkov   -

Bug Description

[ Impact ]

 * Multiple data corruption issues have been identified and fixed in ZFS. Some of them, with varying real-life reproducibility, have been determined to affect very old zfs releases. The recommendation is to upgrade to 2.2.2 or 2.1.14, or to backport the dnode dirty-check patch alone. This is to ensure users get other potentially related fixes and runtime tunables that may mitigate related bugs which are being fixed upstream for future releases.

 * For jammy the 2.1.14 upgrade will bring HWE kernel support and also compatibility/support for hardened kernel builds that mitigate SLS (straight-line speculation).

 * In the absence of the upgrade, a cherry-pick will address this particular widely reported issue alone - without addressing other issues w.r.t. Retbleed / SLS, bugfixes around trim support, and other related improvements that were discovered and fixed around the same time.

[ Test Plan ]

 * !!! Danger !!! Use the reproducer from https://zfsonlinux.topicbox.com/groups/zfs-discuss/T12876116b8607cdb and confirm whether the issue is resolved. Do not run it on production ZFS pools / systems (see the throwaway-pool sketch at the end of this list).

 * autopkgtest pass (from https://ubuntu-archive-team.ubuntu.com/proposed-migration/ )

 * adt-matrix pass (from https://kernel.ubuntu.com/adt-matrix/ )

 * kernel regression zfs testsuite pass (from Kernel team RT test results summary, private)

 * zsys integration test pass (upgrade of zsys installed systems for all releases)

 * zsys install test pass (for daily images of LTS releases only that have such installer support, as per iso tracker test case)

 * LXD (ping LXD team to upgrade vendored in tooling to 2.2.2 and 2.1.14, and test LXD on these updated kernels)
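
 * A disposable, file-backed pool keeps the reproducer away from real data. A minimal sketch (pool and file names are illustrative; it mirrors the bootstrap quoted in the comments below):

   truncate -s 4G /dev/shm/zfs-test
   zpool create -f testpool /dev/shm/zfs-test
   zfs create -o mountpoint=/test testpool/test
   # ... run the reproducer against /test ...
   zpool destroy testpool
   rm /dev/shm/zfs-test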

[ Where problems could occur ]

 * The upgrade to 2.1.14 on jammy with SLS mitigation compatibility will introduce a slight slowdown on amd64 (only for hw-accelerated assembly code paths in the encryption primitives).

 * The performance impact of the extra checks in the dnode patch itself is uncertain. It possibly affects speed of operation, to the benefit of correctness.

 * The cherry-picked dnode patch changes the dirty data check, but only makes it stronger and not weaker; thus, if it were incorrect, likely only performance would be impacted (and it is unlikely to be incorrect given upstream reviews and the attention to data corruption issues; also, there are no additional changes to that function upstream).

[ Other Info ]

 * https://github.com/openzfs/zfs/pull/15571 reflects the most current state of affairs

Revision history for this message
Tobias Heider (tobhe) wrote :

There is also a CVE that seems to be caused by the same bug: https://nvd.nist.gov/vuln/detail/CVE-2023-49298

Revision history for this message
Tobias Heider (tobhe) wrote :

Below is a potential fix/workaround for noble. This includes the two upstream commits https://github.com/openzfs/zfs/commit/03e9caaec006134b3db9d02ac40fe9369ee78b03 and https://github.com/openzfs/zfs/commit/479dca51c66a731e637bd2d4f9bba01a05f9ac9f to make block cloning optional and then disable it by default.

The same should also work for mantic.

Please let me know if there is a better way to submit the fix.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in zfs-linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "0001-Fix-block-cloning-corruption-bug.patch" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Revision history for this message
Tobias Heider (tobhe) wrote :

FreeBSD has released an official advisory for the issue today: https://lists.freebsd.org/archives/freebsd-stable/2023-November/001726.html

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

zfs-linux (2.2.1-0ubuntu1) noble; urgency=medium

  * New upstream release.

Uploaded. But I don't want to close this bug report, because it is not clear if that is everything yet or not.

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Apparently 2.2.1 did not fix the issue, there is an on-going PR https://github.com/openzfs/zfs/pull/15571 that has the latest fix for the data corruption. As a side note it also affects zfs 2.1 series (block cloning made it more evident, but apparently possible to happen on earlier zfs releases), would we see a linux-hwe-6.2 bump for this with a 2.1 patch?

Below is the 2.1.x series PR:
https://github.com/openzfs/zfs/pull/15578

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-hwe-6.5 (Ubuntu):
status: New → Confirmed
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

I'm not sure why linux-hwe-6.2 is marked as affected. hwe-6.2 use 2.1.9 based zfs-linux and will not be upgraded to use 2.2.0 series.

affects: linux-hwe-6.2 (Ubuntu) → linux-hwe-6.5 (Ubuntu)
Revision history for this message
Cam Cope (ccope) wrote :

Apparently the bug was not in block cloning, it was just exacerbated by it. The issue affects older versions of ZFS as well, including 2.1.x. That is why the upstream OpenZFS project has backported the patch to 2.1.x (Chad linked the PR above). Thus, linux-hwe-6.2 is affected.

Revision history for this message
Chad Wagner (chad-wagner) wrote :

It appears upstream is prepping a 2.1.14 and 2.2.2 release that includes this fix to both branches.
https://github.com/openzfs/zfs/pull/15601
https://github.com/openzfs/zfs/pull/15602

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-hwe-6.2 (Ubuntu):
status: New → Confirmed
Changed in linux-hwe-6.5 (Ubuntu):
status: New → Confirmed
Revision history for this message
Chad Wagner (chad-wagner) wrote :

Reproducible on Ubuntu 22.04 LTS w/ linux-hwe-6.2 (zfs-2.1.9) using the NixOS test suite:

[zhammer::647] checking 10000 files at iteration 0
[zhammer::647] zhammer_647_0 differed from zhammer_647_576!
[zhammer::647] Hexdump diff follows
--- zhammer_647_0.hex 2023-11-30 15:37:43.887596987 +0000
+++ zhammer_647_576.hex 2023-11-30 15:37:43.891596970 +0000
@@ -1,3 +1,3 @@
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
 00004000
[zhammer::639] checking 10000 files at iteration 0
[zhammer::631] checking 10000 files at iteration 0
[zhammer::635] checking 10000 files at iteration 0
[zhammer::677] checking 10000 files at iteration 0
parallel: This job failed:
./zhammer-min.sh /test 10000000 16k 10000 7

The zhammer script is available at:
https://github.com/numinit/nixpkgs/blob/zhammer/nixos/tests/zhammer.sh

The bootstrap for it is to use a ramdisk backed file:
# back the throwaway pool with a 4 GiB file on a ramdisk
truncate -s 4G /dev/shm/zfs
zpool create -f -o ashift=12 -O canmount=off -O mountpoint=none -O xattr=sa -O dnodesize=auto -O acltype=posix -O atime=off -O relatime=on tank /dev/shm/zfs
zfs create -o canmount=on -o mountpoint=/test tank/test
# one zhammer instance per CPU; parallel aborts all jobs as soon as one fails
parallel --lb --halt-on-error now,fail=1 zhammer /test 10000000 16k 10000 ::: $(seq $(nproc))

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Forgot to mention you also need coreutils 9.0 or later, or some other program that uses lseek SEEK_HOLE/SEEK_DATA, as that is the trigger for the bug. Essentially, ZFS erroneously reports holes in files that still have dirty buffers. I had a local copy of "cp" from coreutils/sid compiled for jammy.

So the primary trigger of the original report was always coreutils, but obviously any software can use hole detection features of lseek and run into it. I personally haven't noticed any data corruption.
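
A quick way to check whether a given cp binary actually probes for holes is to trace it (a rough sketch; srcfile/dstfile are placeholder names, strace must be installed, and a coreutils 8.x cp will typically show no such calls):

# create a sparse source file, then watch whether cp probes it with SEEK_HOLE/SEEK_DATA
truncate -s 1M srcfile
strace -e trace=lseek cp srcfile dstfile 2>&1 | grep -E 'SEEK_HOLE|SEEK_DATA'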

Revision history for this message
Dylan (themttakeover) wrote :

2.2.2 (version containing a fix for this) is now out

https://github.com/openzfs/zfs/releases/tag/zfs-2.2.2

Revision history for this message
Mike Ferreira (mafoelffen) wrote (last edit ):

This is my merge of a bug report marked as a duplicate of this one: https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2044969

In the OpenZFS patch thread, because of the underlying bug, OpenZFS said they had turned feature@block_cloning off by default in 2.2.1. (They actually didn't.)

2.2.1 is in Noble Proposed. Rick S. and I have been testing it (in DEV Noble) ever since I reported my bug report, which has recently been marked as a duplicate of this bug (noted above).

In my tests with version 2.2.1, that package does not turn off that feature by default. Noble (with zfsutils 2.2.1) creates the rpool with feature@block_cloning=active as a default. I debugged the install activity for Noble during the ZFS install, and it is passing features=defaults... That is not something that was injected in the code of ubuntu-desktop-installer, so it is still set as a default.

Just to test that, I confirmed it. If you create a new pool with defaults, that feature gets set to enabled, unless you explicitly pass: -o feature@block_cloning=disabled
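
For reference, a minimal sketch of that check (the pool name and backing file are illustrative, not from the installer):

truncate -s 1G /tmp/zfs-testdev
zpool create testpool /tmp/zfs-testdev        # defaults: the feature comes out enabled
zpool get feature@block_cloning testpool
zpool destroy testpool
zpool create -o feature@block_cloning=disabled testpool /tmp/zfs-testdev   # explicit opt-out
zpool get feature@block_cloning testpool
zpool destroy testpool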

My previous proposal was for 2.2.1 to get pushed out of Noble Proposed to Noble main. But that was before we found out that what was said about 2.2.1 was wrong. Now my proposal is to build 2.2.2, move it to Noble Proposed so Rick S. and I can test and verify it, then move it to Noble main.

Those go along with backporting the patch back through Focal. Yes, feature@block_cloning uncovered the bug and increases the likelihood that it occurs, but the underlying bug goes back to code from 2012. Focal is still in support.

Ubuntu Pro ESM is another matter; 14.04 is still in ESM and has 4 months left, but it is also affected by this bug/CVE.

We had also started a thread on this in the Noble Dev section: https://ubuntuforums.org/showthread.php?t=2492927&p=14167819

OpenZFS did release zfs-linux 2.2.2 last week, but I do not see any zfs-linux builds of it there yet.

Yes, we know it only happens during a cp (it does not affect rsync), but the OpenZFS patches seem to work on the ZFS end of things...

This summarizes what we had there in that bug report, into this one...

Revision history for this message
Mike Ferreira (mafoelffen) wrote :

I wondered why my old bug report has not gotten any attention. Being a duplicate of this bug didn't help, but at least someone from the security team marked it as a duplicate of this bug.

BUT... of the bug reports filed against zfs-linux, there are 55 bugs with not one single person assigned, that are either marked as New, or Confirmed just because multiple users were affected...

Has anyone else, besides me, wondered why they are not even getting triaged?

We do our due diligence to report these issues. I do my research and testing, try to find answers and workarounds... so that we "can" get past what is wrong. I keep trying to keep faith in this support system. We try to make a difference, and do things for the betterment of Ubuntu. It sometimes does not seem like we are being taken seriously, or being heard.

What other choices do we have? Compile it ourselves or add it to our own PPAs? For actively supported versions, this seems like it should just be dealt with, right? Especially for LTS.

Revision history for this message
Tony Phan (toekneefan) wrote :

@mafoelffen I'm sure the developers are hard at work with this and other issues.

Ubuntu LTS, I believe, does not yet have coreutils v9, so it is unlikely to be affected. This bug is actually extremely old (possibly going back almost two decades), but recent changes, such as a change to zfs_dmu_offset_next_sync and use of SEEK_HOLE in cp of coreutils increased the likelihood of this bug manifesting.

According to recent comments by OpenZFS contributors, this bug actually seems unrelated to block cloning, though they are keeping it off by default to be safe (https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574d30dc73) in case there are other unknown bugs/interactions. So 2.2.1 is actually a false patch, and 2.2.2 is the real one, grounded in more evidence and investigation of the bug.

Revision history for this message
Tuomas Vanhala (zuomas) wrote :

@toekneefan I know the workforce is always limited and I don't want to blame anyone. After all, I'm not paying for support.

However, silent data corruption in a file system that is supposed to first and foremost prevent silent data corruption is a serious defect, and I understand that people are worried about the apparent inactivity on this bug. I'm also using ZFS on FreeBSD and there the patch was rushed out as soon as it was available. So it would be great to get some kind of indication if this is being worked on (such as triaging the bug). If we know that shipping the patch is going to take very long, then at least affected users can make an informed decision if they are just going to live with the bug or start looking for alternative solutions.

Revision history for this message
Mike Ferreira (mafoelffen) wrote (last edit ):

@toekneefan -- Our tests of 2.2.1 from Noble Proposed showed that it was actually "enabled" by default, instead of being disabled like they said in its release notes... So the rpool installs as "active", and any other pools created with feature=default (or with it omitted) are created with it "enabled"... So the assumption that it was turned off turned out to be wrong.

The only way I got that feature to create new pools with feature@block_cloning=disabled, was to explicitly pass that in the creation statement.

The Canonical 'ubuntu-desktop-installer' team needs to know they need to add that option to the scripted ZFS installer... In my debug traces during the install, they are passing "features=default"... if that feature does need to be turned off.

I guess that only really matters if these patches for the bug work and prevent it once they reach us and are applied. That "is" why turning off that feature was done in the meantime.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

xnox, rincebrain came to the conclusion that the conditions for this bug have existed since 0.6.2: https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574d30dc73 and robn thinks the conditions for this bug was introduced in 2006: https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574d30dc73?permalink_comment_id=4778688#gistcomment-4778688 . (Both rincebrain and robn used appropriate waffle-words for something from ten years and fifteen years ago.)

Upstream made a new 2.1.14 release:

https://github.com/openzfs/zfs/releases/tag/zfs-2.1.14

This is only coming to light with new coreutils using SEEK_HOLE/SEEK_DATA -- before now these features were so rarely used the bug was able to hide pretty well.

I suggest we should backport https://github.com/openzfs/zfs/pull/15571/files through all our currently supported systems.

Thanks

Revision history for this message
Mike Ferreira (mafoelffen) wrote (last edit ):

@seth-arnold -- Thank you for those links. Those explain a lot of what is currently happening, and the changes coming down.

This was comforting from robn:
"If you can't get an updated ZFS (2.2.2, 2.1.14, or a patch from your vendor) then 'dmu_offset_next_sync=0' is the next best thing."

Anyone know how long it might be before it gets built by Launchpad? zfs-linux 2.2.2 was released last week.

Noble was slated for 2.2.1. Now it looks like it will be 2.2.2. Waiting for it to hit so Rick S and I can start testing it for Noble...

Now just sitting on my thumbs and chewing through my lip. In the meantime, I am thinking of moving Noble from my VM test cases to physical on my main laptop (Jammy ZFS-on-root), so I can test these on "metal" with 2.2.2 when it comes down, and test 2.1.14 on my other 4 machines.

Revision history for this message
Matthew Bradley (mwbradley) wrote :

Since this bug has been around since 2006, is there a possibility this will be backported to the 0.8.3-1ubuntu12.15 version of ZFS? It's the up-to-date version of ZFS installed on systems here with ubuntu 20.04.6.

OS: Ubuntu 20.04.6 LTS x86_64
Kernel: 5.4.0-167-generic

uname -a
Linux archive-box 5.4.0-167-generic #184-Ubuntu SMP Tue Oct 31 09:21:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message
Rob N (robn) wrote :

Hi all,

Patch for 0.8 (and others) available here: https://zfsonlinux.topicbox.com/groups/zfs-discuss/T12876116b8607cdb/lseek-glitch-dirty-dnode-patches-for-older-openzfs-zfs-on-linux

If you didn't want to take all of 2.2.2 or 2.1.14, specific patches for those series are:

https://github.com/robn/zfs/commits/dnode-dirty-data-2.2
https://github.com/robn/zfs/commits/dnode-dirty-data-2.1

Block cloning appears unrelated at this point, though there are multiple cloning-related fixes in 2.2.2, and more on master that (presumably) will land in 2.2.3.

Regarding cloning being "off by default" in 2.2.1, it will still appear "enabled" in the feature list because it is actually enabled in the disk structures. The new module option `zfs_bclone_enabled` is disabled (0) by default, and when disabled it rejects the syscalls that would cause cloning to happen. So e.g. `cp --reflink=always` would return `ENOTSUPP`.
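
For anyone wanting to check this on a running system, a small sketch (file names are placeholders and the exact error message from cp may differ):

# 0 means the cloning syscalls are rejected, even though the pool feature may show "enabled"
cat /sys/module/zfs/parameters/zfs_bclone_enabled
# with the option at 0, an explicit reflink copy on a ZFS dataset should fail instead of cloning
cp --reflink=always somefile somefile.clone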

Revision history for this message
Rob N (robn) wrote :

Oh, one other thing: the CVE is unlikely to be anything. No one knows who posted it, and the scenario it describes is extremely vague and at best, suggests the author didn't actually understand the issue.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

seth-arnold: yes, I do think we need to backport it all the way back. And since it is kernel code, it should go via the normal SRU process and be integrated into a kernel cycle and released over time, as releasing zfs-linux into the security pocket will not actually do much for most users, who are running the zfs module compiled with the kernel.

I will discuss with kernel cycle leads to see how to fit this in.

no longer affects: linux-hwe-6.2 (Ubuntu)
no longer affects: linux-hwe-6.5 (Ubuntu)
Changed in zfs-linux (Ubuntu Noble):
assignee: nobody → Dimitri John Ledkov (xnox)
Changed in zfs-linux (Ubuntu Mantic):
assignee: nobody → Dimitri John Ledkov (xnox)
Changed in zfs-linux (Ubuntu Lunar):
assignee: nobody → Dimitri John Ledkov (xnox)
Changed in zfs-linux (Ubuntu Jammy):
assignee: nobody → Dimitri John Ledkov (xnox)
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

This bug report will now be limited to things that are presumed to be fixed by 2.2.2 or 2.1.14, or by the dirty dnode patch backport alone.

description: updated
summary: - zfs block cloning file system corruption
+ Multiple data corruption issues in zfs
description: updated
Changed in zfs-linux (Ubuntu Noble):
status: Confirmed → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in zfs-linux (Ubuntu Bionic):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Focal):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Jammy):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Lunar):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Mantic):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Xenial):
status: New → Confirmed
Changed in zfs-linux (Ubuntu Mantic):
importance: Undecided → High
Changed in zfs-linux (Ubuntu Lunar):
importance: Undecided → High
Changed in zfs-linux (Ubuntu Jammy):
importance: Undecided → High
Changed in zfs-linux (Ubuntu Focal):
importance: Undecided → Medium
Changed in zfs-linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in zfs-linux (Ubuntu Xenial):
importance: Undecided → Low
Changed in zfs-linux (Ubuntu Lunar):
importance: High → Medium
description: updated
Changed in zfs-linux (Ubuntu Mantic):
status: Confirmed → In Progress
Revision history for this message
Mike Ferreira (mafoelffen) wrote (last edit ):

@xnox --

Thank you!!!

This is what I see in Noble right now:
>>>
mafoelffen@noble-d05:~$ apt list -a zfsutils-linux
Listing... Done
zfsutils-linux/noble-proposed 2.2.2-0ubuntu1 amd64
zfsutils-linux/noble,now 2.2.1-0ubuntu1 amd64 [installed]
>>>
So 2.2.1 got pushed to Noble main, with 2.2.2 in Noble Proposed (so far)...

Rick S and I can start testing/verifying that today from Noble Proposed.

Any idea on when it might hit the Snap 'ubuntu-desktop-installer' beta channel, so we can start testing it in the Noble installer?

Revision history for this message
FL140 (fl140) wrote :

@mafoelffen I also filed a bug report here, but it looks like it never showed up anywhere. Anyway, since the 2.2.1 hotfix never made it to mantic and I needed this fixed ASAP, I made a script for building the deb packages. So if you would also like to build your own until there is an official 2.2.2 hotfix from Ubuntu, have a look here: https://github.com/openzfs/zfs/issues/15586#issuecomment-1836806381
(There seems to be a little (wrong path) bug at the end in the final deb package install command, but this should be easy to fix by hand.)

Revision history for this message
FL140 (fl140) wrote :

BTW I have the 2.2.2 self-built custom package running here on a 23.10 Ubuntu installation and it looks good so far. I would also advise checking any possibly affected pools by running `zdb -ccc -vvv pool_name`. (Note: I don't know if the third -c makes any difference; I found nothing in the man page. Also run the check with the file systems NOT mounted.) Not only was the "bclone" bug fixed in 2.2.2, there is also a bug fix for ZFS on LUKS which can lead to corruption; this is also important. (Note: A similar bug has just been opened for 2.2.2, see https://github.com/openzfs/zfs/issues/15646; we need to watch this.)
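
A minimal sketch of that check, assuming a pool named 'tank' (run it with the pool's file systems unmounted, as noted above):

sudo zfs unmount -a
# traverse the pool and verify checksums; -ccc and -vvv mirror the command above
sudo zdb -ccc -vvv tank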

Revision history for this message
FL140 (fl140) wrote :

FYI the `zdb -ccc` command does NOT show you which files were affected, but it helped me to identify at least one zpool which got corrupted between 2.2.0 and 2.2.1; the source of that still has to be investigated further. It can't hurt to check that at least the pool is consistent (there can still be corrupted files, though).

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2040181 updating mantic to 2.2.0 final is in mantic-proposed, and this bug here is in mantic-unapproved.

Should we drop 2.2.0 from mantic-proposed in favor of this one here?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Oh, I missed this bit in https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2040181/comments/10 in the other bug:

"It is still worth it to release this sru, before staging the next one in proposed"

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Could https://github.com/openzfs/zfs/pull/15651 be picked up for this update still?

Revision history for this message
Mike Ferreira (mafoelffen) wrote (last edit ):

Rick and I have been testing the Proposed 2.2.2 packages in Noble since yesterday morning. <EDITED>

zfs-linux packages blew out Rick S's Noble system today. He had to revert back to 2.2.1 and add zfs-dkms back into his. I'll try to get him here to talk on that himself.

I'm still up and going fine on 2.2.2.

Waiting to see if anything of this helps with zfs-dkms 2.1.5-1ubuntu6~22.04.2 (https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2044630)... But I know that is (1) focal-lunar (some Mantic), and (2) most likely just a regression issue like the last update had with zfs-dkms 2.1.5-1ubuntu6~22.04.1 (https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2030100), where the make did not recognize that kernels after 5.19.0 are supported. It's not needed by most now, but some of us still need it for out-of-series kernel testing.

Revision history for this message
Mike Ferreira (mafoelffen) wrote :

Wait... I seem to be confused by this:
>>>
mafoelffen@noble-d05:~$ apt-cache show zfsutils-linux | grep 'Package:\|Breaks:'
Package: zfsutils-linux
Breaks: openrc, spl (<< 0.7.9-2), spl-dkms (<< 0.8.0~rc1), zfs-dkms (>> 2.2.2-0ubuntu1...), zfs-dkms (<< 2.2.2-0ubuntu1)
>>>
Does that mean, even though zfs-dkms is built from zfs-linux 2.2.2, that if installed it will break ZFS... and therefore no longer works?

On Rick's machine, 2.2.2 blew out access to all his pools. He had to revert to 2.2.1 and restore, plus add zfs-dkms back in, just to stay running.

My machine is still doing fine.

Should we start a new bug report against zfs-dkms for Noble on the above? That just doesn't sound logical somehow.

Revision history for this message
Steve Langasek (vorlon) wrote :

zfs-linux (2.2.2-0ubuntu1~23.10) mantic; urgency=medium

  * New upstream release. LP: #2044657
    - Address potential data corruption

I'm assuming the plan for fixing it on earlier releases does not involve updating to 2.2.2.

And this does not fall under a hardware enablement exception. AND there are packaging changes under debian/.

The diff for this new upstream release is 9kloc. I would expect a cherry-picked fix here for mantic, not a full upstream backport - just as will be required for releases prior to mantic.

Changed in zfs-linux (Ubuntu Mantic):
status: In Progress → Incomplete
Revision history for this message
Rick S (1fallen) wrote :

All good here, but curious about:
   *apt show zfsutils-linux |grep -e "Breaks"

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Breaks: openrc, spl (<< 0.7.9-2), spl-dkms (<< 0.8.0~rc1), zfs-dkms (>> 2.2.2-0ubuntu1...), zfs-dkms (<< 2.2.2-0ubuntu1)

Both live nicely here for me:
    * apt policy zfsutils-linux zfs-dkms
zfsutils-linux:
  Installed: 2.2.2-0ubuntu1
  Candidate: 2.2.2-0ubuntu1
  Version table:
 *** 2.2.2-0ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu noble/main amd64 Packages
        100 /var/lib/dpkg/status
zfs-dkms:
  Installed: 2.2.2-0ubuntu1
  Candidate: 2.2.2-0ubuntu1
  Version table:
 *** 2.2.2-0ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu noble/universe amd64 Packages
        500 http://us.archive.ubuntu.com/ubuntu noble/universe i386 Packages
        100 /var/lib/dpkg/status

Seems solid today with multiple boots, reboots and cold boots

Revision history for this message
Dimitri John Ledkov (xnox) wrote : Re: [Bug 2044657] Re: Multiple data corruption issues in zfs

On Fri, 8 Dec 2023 at 23:35, Steve Langasek <email address hidden> wrote:
>
> zfs-linux (2.2.2-0ubuntu1~23.10) mantic; urgency=medium
>
> * New upstream release. LP: #2044657
> - Address potential data corruption
>
> I'm assuming the plan for fixing it on earlier releases does not involve
> updating to 2.2.2.
>
> And this does not fall under a hardware enablement exception. AND there
> are packaging changes under debian/.
>
> The diff for this new upstream release is 9kloc. I would expect a
> cherry-picked fix here for mantic, not a full upstream backport - just
> as will be required for releases prior to mantic.

My plan was to do updates as follows:
- upgrade to 2.2.2 for mantic
- upgrade to 2.1.14 for jammy (to cover this issue, but also potential
straight line speculation fixes)
- cherry-pick of individual bug fix for focal and earlier

I also do not understand why I ended up maintaining any of zfs-linux
anymore btw. Especially the userspace side of things. I am happy to
SRU just the kernel code changes alone, and let foundations handle the
zfs userspace fixes.

Separately, do note that lxd snap builds and vendors the full multiple
zfs userspace tooling. So i'm not sure there is any stability value in
not taking the full point releases for the userspace tooling changes
in the classic ubuntu release - or try to untangle them.

Regards,

Dimitri.

Changed in zfs-linux (Ubuntu Noble):
status: Fix Committed → Fix Released
Revision history for this message
Geoff Nordli (geoffn) wrote :

Hi. Will there be a release for Jammy soon?

Revision history for this message
Milan Cvejic (milan-cvejic) wrote :

I am also interested if there will be release for Jammy soon?

Revision history for this message
Paul (pgjensen) wrote :

Sorry I can't tell, is the data corruption bug backported into mantic 2.2.0? Or is mantic still not protected?

Revision history for this message
Keeley Hoek (khoek) wrote :

@Paul no, it isn't.

Revision history for this message
FL140 (fl140) wrote :

IMHO it is alarming that there STILL is no official 2.2.2 package for Mantic, which is the official current release, for a problem which would have needed real attention and a hotfix. I built my own packages as I noticed the problem early, but what about all the users out there who have faith in their distribution providing critical updates in an acceptable time? And NO, one month is not acceptable for that kind of bug.

Revision history for this message
trackwitz (trackwitz) wrote :

I have to agree with FL140; it is absolutely shocking that there is not even a sign of activity here.
Not only is the bug still present in Mantic, but also in Jammy (22.04 LTS), which is the current long-term support release and supposed to be used in server environments.
That is absolutely unacceptable for a data-critical bug like this. Especially since the bugfix is a simple one-liner and was even backported to 2.1 as well as 0.8 by a zfs maintainer (robn) in this very thread.

If this is really the way canonical is handling its long-term “support”, then the only conclusion has to be to absolutely AVOID using Ubuntu in ANY productive environment.

Revision history for this message
Michael Baehr (notunix) wrote :

It looks like there is one person charged with maintaining this package, and we're still in the holidays. I agree this has taken too long and it sounds like there should be more resources assigned but I will wait for Dimitri to finish his work. The Ubuntu ZFS package has quite a few customizations, mostly to work with things like zsys and full disk encryption. I am sure he is working to get it just right.

For the time being, I've been running Satadru's "zfs-experimental" package for Mantic, but the missing zsys/fde support does mean I have to manually unlock my drive from the initramfs rescue shell. You can find it here: https://launchpad.net/~satadru-umich/+archive/ubuntu/zfs-experimental

Revision history for this message
Chris Halse Rogers (raof) wrote :

So, looking at 2.1.14 in lunar-proposed -
This is not bugfix-only (or even bugfix and hardware-enablement only - it includes, at least, changes to colourise the output of some commands). Also, I think this will introduce bug #2046082 into lunar and jammy?

I see we've accepted some ZFS micro-releases in the past. I'm not familiar with the circumstances surrounding those, but I'm uncomfortable with this one. It seems clear that zfs upstream do not have the same expectations of stable releases as we do, so I'm not confident that just taking an upstream microrelease will be acceptable.

I'm not saying a hard “no”, but what investigation was done to determine how much effort cherry-picks to jammy/lunar would be? What are the actual set of changes between 2.1.9 (/2.1.5 for jammy) and 2.1.14, and are any of those changes maybe unsuitable for an SRU?

We don't appear to have any special process documented for zfs - maybe we should? Regressions have extremely high potential impact but at the same time we *do* need to sometimes make significant updates for hardware enablement, and, to the best of my knowledge, unlike the hwe kernel the zfs changes are not opt-in.

Revision history for this message
Chris Halse Rogers (raof) wrote :

Additionally: my understanding is that the corruption fix is minimally invasive and should be easy to backport. Should we quickly do such a minimal backport to get that fix out quickly, and then have less pressure to handle the hardware enablement kernel quickly?

Revision history for this message
Matthew Bradley (mwbradley) wrote :

>Should we quickly do such a minimal backport to get that fix out quickly

Unequivocally, yes. It has been several weeks and users of ZFS across multiple releases are still exposed to a data corruption bug. The fix should go out asap, especially when it's only a single-line fix.

This bug has already been fixed in every other OS supporting ZFS. Even if the bug is rare on older versions of ZFS, it's extremely troubling to users when data corruption fixes aren't rolled out in a timely manor.

Revision history for this message
Chris Halse Rogers (raof) wrote :

This is a bit more difficult, because the primary way that users get zfs in Ubuntu is via the module built with the *kernel* image. My understanding is that fixing that requires a zfs-linux upload and then a further kernel upload to pull in the new zfs.

Timo Aaltonen (tjaalton)
Changed in zfs-linux (Ubuntu Lunar):
status: Confirmed → Won't Fix
description: updated
Changed in zfs-linux (Ubuntu Mantic):
status: Incomplete → In Progress
Changed in zfs-linux (Ubuntu Jammy):
status: Confirmed → In Progress
Changed in zfs-linux (Ubuntu Focal):
status: Confirmed → In Progress
Changed in zfs-linux (Ubuntu Lunar):
assignee: Dimitri John Ledkov (xnox) → nobody
Changed in zfs-linux (Ubuntu Focal):
assignee: nobody → Dimitri John Ledkov (xnox)
Changed in zfs-linux (Ubuntu Lunar):
importance: Medium → Undecided
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Chris,

> ... the primary way that users get zfs in Ubuntu is via the module built with the *kernel* image.
> ... fixing that requires a zfs-linux upload and then a further kernel upload to pull in the new zfs.

Yes, fixing the zfs modules shipped with the kernel packages requires uploads for both zfs-linux and the kernel packages.

Additionally, users can build zfs modules locally with the zfs-dkms package (from zfs-linux; similarly to the kernel package build).

This may be useful after the new zfs-linux is available but before a new kernel package (based on it) is available.

Hope this helps!

For reference,

$ lxc launch ubuntu:mantic --vm -c limits.cpu=2 -c limits.memory=2GiB -c security.secureboot=false mantic-vm-zfs
$ lxc shell mantic-vm-zfs

# modinfo zfs | head -n2
filename: /lib/modules/6.5.0-15-generic/kernel/zfs/zfs.ko.zst
version: 2.2.0-0ubuntu1~23.10

# dpkg -l | grep zfs
#

# apt-cache show zfs-dkms | grep -m1 Source:
Source: zfs-linux

# apt update && apt install --yes zfs-dkms
...
Setting up zfs-dkms (2.2.0-0ubuntu1~23.10.1) ...
Loading new zfs-2.2.0 DKMS files...
Building for 6.5.0-15-generic
Building initial module for 6.5.0-15-generic
Done.
...

# modinfo zfs | head -n2
filename: /lib/modules/6.5.0-15-generic/updates/dkms/zfs.ko.zst
version: 2.2.0-0ubuntu1~23.10.1

# modprobe zfs

# cat /sys/module/zfs/version
2.2.0-0ubuntu1~23.10.1

description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Reviewed zfs-linux on mantic/jammy/focal-unapproved for acceptance.

They all look good to me (notes below, in addition to usual checks),
but I'll ask for another reviewer to provide input too, as it's ZFS.
(But I did want to provide some assistance in this long-running SRU.)

Thanks,
Mauricio

...

mantic:

 The code change looks good and hasn't changed further upstream.
 (There are suggested /alternatives/ in Draft state [1], but the draft reassures that the original, merged PR is correct.)

 The [Test Plan] section provides both a specific reproducer and general testing (in many forms), which is reassuring for -proposed verification.

 P.S.: not very polished changelog/patch files, but combined they do provide the information to find the LP bug and upstream/commit, usually found via DEP-3 headers as Origin/Bug-Ubuntu.
 (Style isn't the most important thing in this case or at the moment.)

 $ git log --oneline zfs-2.2.2 -- module/zfs/dnode.c | grep -m1 'dnode_is_dirty: check dnode and its data for dirtiness'
 9b9b09f452a4 dnode_is_dirty: check dnode and its data for dirtiness

jammy:

 Likewise.

 Regarding the point release, 22.04.4 (scheduled for Feb 22 [2]),
 we are still before 'Release minus 14 (or 7) days' [3]
 (when '6. Coordinate with the SRU team' and '7. ... -updates pocket freeze' happen).

 $ git log --oneline zfs-2.1.14 -- module/zfs/dnode.c | grep -m1 'dnode_is_dirty: check dnode and its data for dirtiness'
 77b0c6f0403b dnode_is_dirty: check dnode and its data for dirtiness

focal:

 The code change is a backport not from upstream (0.8 no longer maintained, apparently),
 but comes from the original author of the patch (and noted with DEP-3 Origin:, great!),
 and looks equivalent (same code change, just in a different function and context).
 [4]

links:
[1] https://github.com/openzfs/zfs/pull/15615
[2] https://discourse.ubuntu.com/t/jammy-jellyfish-release-schedule/23906
[3] https://wiki.ubuntu.com/PointReleaseProcess
[4] https://github.com/robn/zfs/commit/f2f7f43a9bf4628328292f25b1663b873f271b1a

Revision history for this message
Charles Hedrick (hedrick) wrote :

For what it's worth, we've been running a file server for several weeks with the 2.1.14 .ko files in 22.04 with no problem. I didn't try to cherry pick the one fix as I'd prefer to have the whole 2.1.14.

Revision history for this message
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Tobias, or anyone else affected,

Accepted zfs-linux into mantic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/2.2.0-0ubuntu1~23.10.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-mantic to verification-done-mantic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-mantic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in zfs-linux (Ubuntu Mantic):
status: In Progress → Fix Committed
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

Hello Tobias, or anyone else affected,

Accepted zfs-linux into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/2.1.5-1ubuntu6~22.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in zfs-linux (Ubuntu Jammy):
status: In Progress → Fix Committed
Changed in zfs-linux (Ubuntu Focal):
status: In Progress → Fix Committed
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

Hello Tobias, or anyone else affected,

Accepted zfs-linux into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/0.8.3-1ubuntu12.17 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
trackwitz (trackwitz) wrote :

So, I did test the -proposed package for mantic on a clean install by creating a test pool and running the zhammer.sh linked in #16.


As expected on an unpatched system the error occurred during the first iteration:

[zhammer::1858] zhammer_1858_0 differed from zhammer_1858_538!
[zhammer::1858] Hexdump diff follows
--- zhammer_1858_0.hex 2024-02-03 12:44:07.478205144 +0000
+++ zhammer_1858_538.hex 2024-02-03 12:44:07.478205144 +0000
@@ -1,3 +1,3 @@
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
 00004000
[zhammer::1858] Uname: Linux zfstest 6.5.0-15-generic #15-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan 9 17:03:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
[zhammer::1858] ZFS userspace: zfs-2.2.0-0ubuntu1~23.10.1
[zhammer::1858] ZFS kernel: zfs-kmod-2.2.0-0ubuntu1~23.10
[zhammer::1858] Module: /lib/modules/6.5.0-15-generic/kernel/zfs/zfs.ko.zst
[zhammer::1858] Srcversion: 92158472E32FE6AEEEC7201
[zhammer::1858] SHA256: 177442f43f4c94537f8b003ab28ed33d00240c175e500370ad5bdd5c50234655
parallel: This job failed: zhammer /test 10000000 16k 10000 7


After enabling the -proposed repository, installing the updates and restarting the system, it looks like the userspace tools are now on the patched version (zfs-2.2.0-0ubuntu1~23.10.2), however the kernel module is still on the old version (without the .2) and, as expected, the bug is still reproducible:

[zhammer::1706] zhammer_1706_0 differed from zhammer_1706_1204!
[zhammer::1706] Hexdump diff follows
--- zhammer_1706_0.hex 2024-02-04 14:29:28.296850257 +0000
+++ zhammer_1706_1204.hex 2024-02-04 14:29:28.296850257 +0000
@@ -1,3 +1,3 @@
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
 00004000
[zhammer::1706] Uname: Linux zfstest 6.5.0-17-generic #17-Ubuntu SMP PREEMPT_DYNAMIC Thu Jan 11 14:01:59 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
[zhammer::1706] ZFS userspace: zfs-2.2.0-0ubuntu1~23.10.2
[zhammer::1706] ZFS kernel: zfs-kmod-2.2.0-0ubuntu1~23.10
[zhammer::1706] Module: /lib/modules/6.5.0-17-generic/kernel/zfs/zfs.ko.zst
[zhammer::1706] Srcversion: 92158472E32FE6AEEEC7201
[zhammer::1706] SHA256: 0f6a069f6c3045e7c86507d7c158691d4ace8c6785888579652236fbdf8c66c0
parallel: This job failed: zhammer /test 10000000 16k 10000 2

Only when explicitly using the zfs-dkms package instead of the built-in kernel module is the correct module loaded, and then the bug can't be triggered any more, even after 5 iterations (x10,000 files).

Therefore, I can conclude that the fix itself is working correctly; however, the package distributed in the -proposed repository does not include the correct kernel module. As the pre-bundled kernel module is the way most people are using ZFS on Ubuntu (instead of the dkms module), this fix also has to be introduced in the current kernel package to fix the problem.

Revision history for this message
FL140 (fl140) wrote :

@Timo I just saw the 2.2.0-0ubuntu1~23.10.2 announcement for mantic. IMHO cherry-picking is not a very good approach when it comes to 2.2.2. I followed the ZFS GitHub repo closely during the time of the bug in November and December, until 2.2.2 came out, and there were multiple fixes going into 2.2.2 which were important, IIRC.

So we have been waiting for an official package for Ubuntu for a very long time for this kind of severe bug, and we still don't end up with the intended fixes by the ZFS devs, who have given quite some thought to what to put into 2.2.2.

I would strongly suggest going with the 2.2.2 package, not just one commit. I have been running 2.2.2 on Ubuntu with a custom package for quite some time now (see earlier post) and everything looks good so far.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

@fl140 I have previously prepared upgrades to new point releases from the 2.1.x and 2.2.x series; all of those SRUs got rejected.

@mfo all dkms-based kernel modules that the kernel packages prebuild get picked up every kernel SRU cycle, including zfs, which will be picked up in a future kernel SRU cycle.

tags: added: verification-done-mantic verification-needed-focal verification-needed-jammy
Revision history for this message
FL140 (fl140) wrote :

@Dimitri If I see it correctly you have been assigned this bug, so I thought this should be your decision!? If not, that is bad; who is responsible then for the rejection of this? Whoever made that decision obviously didn't follow closely enough what was going on in the zfs GitHub repo regarding this bug and the multiple raised issues.

I honestly don't get how these important fixes, which ended up in 2.2.2, needed over 2 months and then didn't make it into a decent solution in Ubuntu. Almost every relevant distribution had this fixed fast and of course went with 2.2.2, trusting the ZFS devs to know what they are fixing. (Well, I know it all arose from a bug in the zfs repo in the first place, but that one was quite complex.)

So I can just ask everyone involved in this decision again to rethink it and go with 2.2.2 for mantic. Please don't waste 2 months with cherry-picking and brewing our own solutions when the house is burning on such a central part of the OS as the filesystem. This should have been a hotfix in the first place.

Revision history for this message
Jonathan Stucklen (stuckj) wrote :

Just updated my pool today and then remembered hearing about this bug. Guess I should have searched first. Looking forward to the proposed fix getting released. Thanks for the patch. :)

For now, I'm using the suggestion of setting `zfs_dmu_offset_next_sync=0` based on this: https://www.reddit.com/r/zfs/comments/1826lgs/psa_its_not_block_cloning_its_a_data_corruption.

Plopped the kernel param into `/etc/default/grub` and did an `update-grub` to make it persistent until the patch is out.
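
For anyone following the same route, a rough sketch of the grub side (the existing GRUB_CMDLINE_LINUX_DEFAULT value shown here is only an example):

# in /etc/default/grub, append the module parameter to the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash zfs.zfs_dmu_offset_next_sync=0"
sudo update-grub   # regenerate the grub config; the setting takes effect on the next boot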

Revision history for this message
Pavel Titov (ptitov) wrote :

To add to Jonathan's comment, a slightly easier way to do this (at least for someone not familiar with bootloaders) is to add `options zfs zfs_dmu_offset_next_sync=0` to /etc/modprobe.d/zfs.conf

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html
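
A small sketch of that approach plus a runtime check (assumes the zfs module is already loaded; if the module sits in your initramfs you may also want to run update-initramfs so the option applies at early boot):

echo "options zfs zfs_dmu_offset_next_sync=0" | sudo tee /etc/modprobe.d/zfs.conf
# the parameter appears to be runtime-writable as well, so it can be applied without a reboot
echo 0 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync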

On a side note, I'd like to second the concerns about cherry-picking. Given that the bug was originally introduced due to a combination of changes introduced in 2.1 and the pre-existing issue, I'm worried that backporting a fix can introduce its own issues and won't have the same degree of testing and community support as the upstream 2.2.2.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 2.2.0-0ubuntu1~23.10.2

---------------
zfs-linux (2.2.0-0ubuntu1~23.10.2) mantic; urgency=medium

  * d/p/9b9b09f452a469458451c221debfbab944e7f081.patch Cherry-pick
    "dnode_is_dirty: check dnode and its data for dirtiness" fix from
    zfs-2.2.2. LP: #2044657

 -- Dimitri John Ledkov <email address hidden> Tue, 30 Jan 2024 15:54:02 +0000

Changed in zfs-linux (Ubuntu Mantic):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for zfs-linux has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
trackwitz (trackwitz) wrote :

Verification done for jammy-proposed.

TEST CASE:
I did test the -proposed package on a clean install of jammy (without the HWE kernel) and a test pool.

Since you need a tool that uses lseek SEEK_HOLE/SEEK_DATA and the coreutils bundled with Ubuntu 22.04 are still at version 8.x, I manually downloaded the tarball of coreutils 9.1 from https://www.gnu.org/software/coreutils/ and installed the newly compiled version of "cp" that is used by the zhammer script.
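
For reference, a rough sketch of such a build (assuming the standard GNU mirror layout and that the usual build dependencies are installed):

wget https://ftp.gnu.org/gnu/coreutils/coreutils-9.1.tar.xz
tar xf coreutils-9.1.tar.xz && cd coreutils-9.1
./configure && make -j"$(nproc)"
# use the freshly built cp directly instead of installing it system-wide
./src/cp --version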

test@zfstest:/test$ cp --version
cp (GNU coreutils) 9.1

Then I used the zhammer.sh script linked in #16 to test for data corruption:
test@zfstest:~$ sudo parallel --lb --halt-on-error now,fail=1 ./zhammer.sh /test 10000000 16k 10000 ::: $(seq $(nproc))

...
As expected on an unpatched system I immediately got the error on the first run:

[zhammer::765410] zhammer_765410_0 differed from zhammer_765410_5930!
[zhammer::765410] Hexdump diff follows
--- zhammer_765410_0.hex 2024-02-15 15:27:22.370732491 +0000
+++ zhammer_765410_5930.hex 2024-02-15 15:27:22.370732491 +0000
@@ -1,3 +1,3 @@
-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
+00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
 00004000
[zhammer::765410] Uname: Linux zfstest 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 GNU/Linux
[zhammer::765410] ZFS userspace: zfs-2.1.5-1ubuntu6~22.04.2
[zhammer::765410] ZFS kernel: zfs-kmod-2.1.5-1ubuntu6~22.04.2
[zhammer::765410] Module: /lib/modules/5.15.0-94-generic/kernel/zfs/zfs.ko
[zhammer::765410] Srcversion: 5A94B4662A7A991696CC35F
[zhammer::765410] SHA256: d83e630d4e46280ba6b8bf922850899318ebd24dedd14ab5672574c4bd811ffc
[zhammer::765391] checking 10000 files at iteration 0
parallel: This job failed: ./zhammer.sh /test 10000000 16k 10000 5

...
Then I enabled the -proposed repository, installed the patched zfs-version and enabled the dkms kernel module for zfs.
After a reboot I re-ran the zhammer.sh and it didn't trigger the bug any more after 5 iterations (x10.000 files).

[zhammer::4265] Uname: Linux zfstest 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 GNU/Linux
[zhammer::4265] ZFS userspace: zfs-2.1.5-1ubuntu6~22.04.3
[zhammer::4265] ZFS kernel: zfs-kmod-2.1.5-1ubuntu6~22.04.3
[zhammer::4265] Module: /lib/modules/5.15.0-94-generic/updates/dkms/zfs.ko
[zhammer::4265] Srcversion: 9CA2EC22B1B594D0C432666
[zhammer::4265] SHA256: f78e84e7f995c15b715f6010ec939ca4202a73654bfafd24572d214ebfcb6364
[zhammer::4265] Work dir: /test
[zhammer::4265] Count: 10000000 files
[zhammer::4265] Block size: 16k
[zhammer::4265] Check every: 10000 files
[zhammer::4265] writing 10000 files at iteration 0
[...]
[zhammer::4253] writing 10000 files at iteration 40000
[zhammer::4248] writing 10000 files at iteration 40000
[zhammer::4246] writing 10000 files at iteration 40000
[zhammer::4256] writing 10000 files at iteration 40000
[zhammer::4252] writing 10000 files at iteration 40000
[zhammer::4280] writing 10000 files at iteration 40000
[zhammer::4271] writing 10000 files at iteration 40000
[zhammer::4265] writing 10000 files at iteration 40000
[zhammer::4246] checking 10000 files at iterati...


tags: added: verification-done-jammy
removed: verification-needed-jammy
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

I struggle to reproduce the issue on focal with v5.4 and 0.8.3-1ubuntu12.16 zfs

but also see no regressions with the proposed 12.17 update.

tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Mattias Heimlich (xyv41) wrote :

I don't get it. The latest ZFS version in Jammy seems to be 2.1.5 and not 2.1.14 nor 2.2.2. When will the committed version with the fix be available to the general public?

Revision history for this message
Charles Hedrick (hedrick) wrote :

It's likely to take a while. It's been committed but not released. They chose to cherry-pick the patch, so the version will still show as 2.1.5. A lot of us would prefer going to 2.1.14, but there are enough other changes that Ubuntu staff didn't want to do it. I'm considering building my own zfs module.

We chose Ubuntu for our file servers primarily because of their support for ZFS. As I look more closely at the quality of that support, I'm questioning the decision.

Revision history for this message
trackwitz (trackwitz) wrote :

@xyv41:
As mentioned in the last comment, the fix is only a cherry-pick so the version will go from 2.1.5-1ubuntu6~22.04.2 to 2.1.5-1ubuntu6~22.04.3 because upgrading to 2.1.14 would bear the risk that other bugs might get fixed, too.

Regarding the release date: for mantic it took about 7 days from verification until release. I don’t know if that's an automated process, but in that case the fix should go out in the next few days.
On the other hand, in post #73 it says that the SRU-Team has been unsubscribed from this bug-report, so maybe they don’t even know that the verification for jammy has been completed and the fix will just get stalled.

Additionally, even when (or if?) the committed package gets released, this only affects the userland tools as well as the dkms-built kernel modules. If you (like 99.9% of all Ubuntu users) use the pre-bundled kernel module, then this also needs an update of the kernel package as well.
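
A quick way to tell which of the two you are actually running (paths are illustrative; they match the transcripts earlier in this bug):

modinfo -n zfs                 # .../kernel/zfs/zfs.ko(.zst) = module shipped with the kernel package
                               # .../updates/dkms/zfs.ko(.zst) = module built by zfs-dkms
cat /sys/module/zfs/version    # version string of the module actually loaded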

To give you an idea: the fix for mantic was committed on February 2nd and released on the 12th. However, it is still not included in the current kernel package (6.5.0.17.19) nor in the -proposed package (6.5.0.25.25) which is yet to be released. So based on the recent update cycles I suspect it will not reach the end user until mid-March (for mantic!!!).

Therefore, I am expecting that it will be at least one more month (more likely 6 weeks) or so until the jammy fix reaches the end user. Then we will have a fix-found (December 1st) to fix-released time of around 5 months. Quite a result for a one-line hotfix.

Revision history for this message
FL140 (fl140) wrote :

I can only second @hedrick on his comment and, as already written above, recommend everyone to build 2.2.2 themselves. I have been running 2.2.2 stable for a couple of weeks on mantic. I still have to fix one notebook here which (most likely) got f**ked up by this: it reports no errors on a pool scrub but crashes with a ZFS kernel core dump on `zdb -vvv -ccc POOL_NAME`. The pool is marked as having errors, and after a restart the machine works as if nothing ever happened. So it is a hidden, ticking time bomb, as you don't know when the system will crash. Unfortunately I haven't been able to narrow down the problem to a specific file system on the pool yet.

But I can only recommend to go with the full 2.2.2 patch set which has been tested by the ZFS dev team and a couple of highly motivated folks involved in fixing the bug(s) back then.

Revision history for this message
Charles Hedrick (hedrick) wrote :

Since I was just referenced, note that I wouldn't favor putting 2.2 into Jammy. I would, however, prefer to see the latest 2.1.x.

2.2 is still seeing a substantial number of bug fixes. It will probably be ok by the release date of 24.04, but I still wouldn't change major versions for Jammy.

Revision history for this message
FL140 (fl140) wrote :

Sorry hedrick if I left a wrong impression. It looks like my comment was not clear enough: I was recommending the 2.2.2 version for systems which came with a 2.2.x package, e.g. 23.10 (mantic). There the bugs started to cause real problems (happening much more often) because a bunch of things came together (IIRC the enabled block cloning feature, new packages of tools like cp which used the new features, etc.) and it could be triggered much more easily. I wasn't following the older branches too closely, so no recommendations on those from my side, but IIRC the origin of the "clone feature bug" was much, much older.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 2.1.5-1ubuntu6~22.04.3

---------------
zfs-linux (2.1.5-1ubuntu6~22.04.3) jammy; urgency=medium

  * d/p/77b0c6f0403b2b7d145bf6c244b6acbc757ccdc9.patch Cherry-pick
    "dnode_is_dirty: check dnode and its data for dirtiness" fix from
    zfs-2.1.14. LP: #2044657

 -- Dimitri John Ledkov <email address hidden> Wed, 31 Jan 2024 14:32:34 +0000

Changed in zfs-linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.8.3-1ubuntu12.17

---------------
zfs-linux (0.8.3-1ubuntu12.17) focal; urgency=medium

  * Cherry-pick
    https://github.com/robn/zfs/commit/f2f7f43a9bf4628328292f25b1663b873f271b1a.patch
    backport of "dnode_is_dirty: check dnode and its data for dirtiness"
    by robn. LP: #2044657

 -- Dimitri John Ledkov <email address hidden> Wed, 31 Jan 2024 16:02:07 +0000

Changed in zfs-linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Keeley Hoek (khoek) wrote (last edit ):

FYI everyone, another corruption-with-zeroes bug was found (so this issue was not completely fixed): https://github.com/openzfs/zfs/issues/15933

The fix was committed last week, and is not yet part of a release: https://github.com/openzfs/zfs/pull/16019

Revision history for this message
Charles Hedrick (hedrick) wrote :

This issue is different. It's another problem with block cloning. But that's known to still be an issue; it's why block cloning is disabled by default in 2.2.1. This is only an issue for the Jammy HWE kernel, which has 2.2.0. No one should be shipping 2.2.0 with anything; it should be replaced with 2.2.1 at least.

Revision history for this message
Charles Hedrick (hedrick) wrote :

Actually, it may not be an issue even for Jammy HWE. My copy shows block cloning as an option that's off.
