ZFS unrecoverable error after upgrading from 20.04 to 22.04.1

Bug #1987190 reported by Luis Hernanz
This bug affects 6 people
Affects: zfs-linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have a server that has been running its data volume on ZFS under 20.04 without any problem. The volume uses ZFS encryption and a raidz1-0 configuration. I performed a scrub operation before the upgrade and it did not find any problem. After the reboot for the upgrade, I was greeted with the following message:

status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A

The volumes still do not show any checksum errors, but there are 5 zvols that are not accessible. zpool status displays a line similar to the one below for each of the five:

errors: Permanent errors have been detected in the following files:

        tank/data/data:<0x0>

I ran a scrub and it did not identify any problem; the error messages are no longer there, but the data is still not available. There are 10+ other zvols in the zpool that do not have any kind of problem. I have been unable to identify any correlation among the zvols that are failing.
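
For reference, these are the commands behind the checks above (pool and dataset names are the ones from this report; substitute your own):

```
zpool status -v tank        # lists the datasets flagged with "Permanent errors"
zpool scrub tank            # here it completed without finding anything
zfs mount tank/data/data    # affected filesystem datasets fail to mount (Input/output error)
```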

I have seen people reporting similar problems on GitHub after the 20.04 to 22.04 upgrade (see https://github.com/openzfs/zfs/issues/13763). I wonder how widespread the problem will be as more people upgrade to 22.04.

I will try to downgrade the version of zfs on the system and report back.

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: zfsutils-linux 2.1.4-0ubuntu0.1
ProcVersionSignature: Ubuntu 5.15.0-46.49-generic 5.15.39
Uname: Linux 5.15.0-46-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu82.1
Architecture: amd64
CasperMD5CheckResult: unknown
Date: Sat Aug 20 22:24:54 2022
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: zfs-linux
UpgradeStatus: Upgraded to jammy on 2022-08-20 (0 days ago)
modified.conffile..etc.sudoers.d.zfs: [inaccessible: [Errno 13] Permission denied: '/etc/sudoers.d/zfs']

Revision history for this message
Luis Hernanz (lhernanz) wrote :

I saw that downgrading the module was not a very realistic option, so I decided to follow a different route.

It seems that the problem is the same as the one described here https://github.com/openzfs/zfs/issues/13709. The solution that worked for me is described here: https://github.com/openzfs/zfs/issues/13709#issuecomment-1200430509.

It took a while to recover all the data because of the need for send/receive, but it seems it is all back.
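
Roughly, the send/receive route looks like this (dataset names are placeholders based on my zpool status output; the linked comment is the authoritative reference):

```
zfs snapshot -r tank/data/data@recovery
zfs send -Rw tank/data/data@recovery | zfs receive tank/data/data-new   # -w sends the data raw, i.e. still encrypted
# after verifying the new copy:
zfs destroy -r tank/data/data
zfs rename tank/data/data-new tank/data/data
```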

I would suggest that you consider including some wording about this in the release notes, or even blocking the upgrade for users who are using native ZFS encryption until this is solved.

Thanks

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in zfs-linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Walter (wdoekes) wrote :

So. I'm affected as well.

Luckily there is a patch already:
https://patch-diff.githubusercontent.com/raw/openzfs/zfs/pull/14161.patch

------------
HOW TO APPLY
------------

=== Step 1: get patch ===

(see zfs-dkms-2.1.4-fix-zero-mac-io-error.patch attachment)

=== Step 2: install zfs-dkms and patch it ===

# apt-get install zfs-dkms
(you may get questions about enrolling MOK keys: do as you're told)

# cd /usr/src/zfs-2.1.4

# dkms remove -m zfs -v 2.1.4 -k $(uname -r)

# dkms status
(should yield nothing)

# patch -p1 < /tmp/zfs-dkms-2.1.4-fix-zero-mac-io-error.patch

# dkms install -m zfs -v 2.1.4 -k $(uname -r)
(rebuilds the module again, with the patch this time)

# update-initramfs -uk $(uname -r)
(if you run zfs on the root fs, you'll probably want this)

# reboot
(and cross fingers)

=== Step 3: confirm that the module is loaded ===

# modinfo zfs | grep ^version:
version: 2.1.4-0ubuntu0.1+ossomacfix

(zfs mount should now work again, no "Input/Output error")
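
For a quick sanity check that the datasets are usable again (assuming the keys load from a prompt or key file):

```
zfs load-key -a                       # only needed if the keys were not loaded at boot
zfs mount -a
zfs list -o name,mounted,keystatus    # affected datasets should show mounted=yes, keystatus=available
```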

----------------
CLEANUP/THOUGHTS
----------------

- If you have 'zpool status -v' errors, you'll need to clean them with a 'zpool scrub <pool>'. If you got to this bug report _before_ trying to mount, you might be okay.

- I'm not sure yet, but I think you can mount/unmount datasets and that might correct the MAC everywhere. If it does, you can remove the dkms module once you've done this. (Needs testing; see the rough sketch after this list.)

- I haven't tested syncing data from (a patched) 22.04 (zfs-linux-2.1.4) to a 20.04 (zfs-linux-0.8.3) yet. I'm trying to be a little careful at first and see how it goes.

- I think this bug needs a HIGH severity label.
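
A rough, untested sketch of that remount idea for a data pool (not the root filesystem; "tank" is a placeholder):

```
# unmount children before parents, then mount everything in the pool again
zfs list -H -o name -t filesystem -r tank | tac | xargs -n1 zfs unmount
zfs mount -a
```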

Cheers,
Walter Doekes
OSSO B.V.

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "zfs-dkms-2.1.4-fix-zero-mac-io-error.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Revision history for this message
Walter (wdoekes) wrote :

... and, answering my own questions:

- both the pre-fix and the post-fix datasets are mountable on a 20.04 zfs-0.8.3 pool (zfs recv'd from an affected 22.04);

- after a mount/unmount, the MAC is fixed, and we have no need for the patched zfs module anymore.

Cheers,
Walter

Revision history for this message
LucidBrot (lucidbrot) wrote :

I have the same problem. It came up because I upgraded from Ubuntu 18.04 LTS to 22.04 LTS (via 20.04, then immediately upgrading again). I am mainly leaving my comment here to indicate that this is an issue that affects many users.

The GitHub issue at https://github.com/openzfs/zfs/issues/13709#issuecomment-1292880434 contains a workaround to make the system boot again after the zfs version upgrade, but imo this needs to be fixed urgently because:

* Not everyone will find that workaround
* Even after that workaround, old snapshots are still inaccessible
* The fix seems to be easy for the repo maintainers, since the issue has been fixed in the latest version of zfs already (as mentioned in https://github.com/openzfs/zfs/issues/13709#issuecomment-1433505102)

@Walter thank you for the "How to apply the patch" guidance. I will try it once I have more time, since I noticed I already have zfs-dkms installed, but also a leftover kmod from 2020, and I don't feel confident that I won't break anything by accident.
Does it still make sense to apply your patch, or should I rather figure out how to install zfs directly from source from GitHub? As far as I understand, dkms builds the kernel modules from source anyway?

Cheers,
lucid

Revision history for this message
Felix Gertz (g-ubuntu-8) wrote :

Affected as well, after upgrading from Ubuntu 20.04 to 22.04.
Also an encrypted dataset.
More than 1 TB on our production backup server is not mountable. Very funny.
Thankfully we have more backup servers (without ZFS), so I am chilled. :)

Will try the patch tomorrow and report if it is working for us too.

Just want to chime in to say this definitely needs HIGH severity!

Cheers,
Felix

Revision history for this message
Felix Gertz (g-ubuntu-8) wrote :

Applying the patch to zfs 2.1.5 helped! I am able to mount the dataset. Thank you very much for your description, @Walter!

Revision history for this message
Walter (wdoekes) wrote :

By the way, this is fixed in zfs-2.1.7:

$ git log -1 fa7d572a8a3298d446fc4f64a263c125c325b7c8
commit fa7d572a8a3298d446fc4f64a263c125c325b7c8
Author: Rich Ercolani <email address hidden>
Date: Tue Nov 15 17:44:12 2022 -0500

    Handle and detect #13709's unlock regression (#14161)

$ git tag --contains fa7d572a8a3298d446fc4f64a263c125c325b7c8 | sort -V | head -n1
zfs-2.1.7

Jammy is currently (still) at 2.1.5.
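
To see where a given machine stands, the packaged version and the loaded module can be checked with:

```
apt-cache policy zfsutils-linux   # version of the packaged userland tools
cat /sys/module/zfs/version       # version of the kernel module actually loaded
```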

Revision history for this message
Jordi de Wal (jdwal) wrote :

Hi,

Attached is an up-to-date patch for 2.1.5-1ubuntu6~22.04.1, to be used with Walter's steps in comment #4.

(editing in Launchpad is hard :-( )

Revision history for this message
Walter (wdoekes) wrote :

And, step 3 should be:

=== Step 3: confirm that the module is loaded ===

# cat /sys/module/zfs/version
2.1.5-1ubuntu6~22.04.1+ossomacfix

=== Summarizing for 2.1.5 ===
```
cd /tmp

wget https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1987190/+attachment/5704587/+files/zfs-dkms-2.1.5-1-fix-zero-mac-io-error.patch

apt-get install zfs-dkms

dkms remove -m zfs -v 2.1.5 -k $(uname -r)

dkms status

cd /usr/src/zfs-2.1.5

patch -p1 < /tmp/zfs-dkms-2.1.5-1-fix-zero-mac-io-error.patch

dkms install -m zfs -v 2.1.5 -k $(uname -r)

update-initramfs -uk $(uname -r)

#reboot
```

Revision history for this message
Richard Brooksby (rptb1) wrote :

This caused me a major headache and a lot of server downtime. I have written a detailed description of the situation and how I worked around it on zfs-discuss (https://zfsonlinux.topicbox.com/groups/zfs-discuss/Ta4f2dccd6172681c/partial-pool-loss-on-ubuntu-22-upgrade); it may help others. I have not applied this patch.
