zpool scrub malfunction after kernel upgrade

Bug #1860182 reported by Eric Brown
Affects              Status        Importance  Assigned to      Milestone
zfs-linux (Ubuntu)   Fix Released  Critical    Colin Ian King
  Bionic             Fix Released  Undecided   Unassigned

Bug Description

== SRU Request [BIONIC] ==

The HWE kernel on Bionic provides the zfs 0.8.1 driver, which includes an improved scrub; however, the progress stats reported by the kernel are incompatible with the 0.7.x zfs utilities.

== Fix ==

Use the new zfs 0.8.x pool_scan_stat_t extra fields to calculate
the scan progress when using zfs 0.8.x kernel drivers. Add detection of the kernel module version and use an approximation of the zfs 0.8.0 progress and rate reporting for newer kernels.

For 0.7.5 we can pass the larger 0.8.x pool_scan_stat_t to the 0.7.5
zfs driver without problems, ignore the new fields, and continue
to use the 0.7.5 rate calculations.
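
By way of illustration, here is a minimal C sketch of the kind of calculation described above. The struct mirrors a subset of pool_scan_stat_t, and the /sys/module/zfs/version read stands in for the kernel module version detection; the field names and the version check are illustrative assumptions, not the actual patch.

/*
 * Hypothetical sketch of the progress approximation described above.
 * Field names and the version check are assumptions for illustration.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t pss_to_examine;  /* total bytes to scan */
    uint64_t pss_examined;    /* bytes scanned (metadata walk in 0.8.x) */
    uint64_t pss_issued;      /* bytes issued/verified (0.8.x only) */
} scan_stat_t;

/* Crude detection of a 0.8.x zfs kernel module via sysfs. */
static int zfs_module_is_08x(void)
{
    char ver[64] = "";
    FILE *f = fopen("/sys/module/zfs/version", "r");

    if (f == NULL)
        return 0;
    if (fgets(ver, sizeof(ver), f) == NULL)
        ver[0] = '\0';
    fclose(f);
    return strncmp(ver, "0.8", 3) == 0;
}

/*
 * With a 0.8.x driver, base the percentage on bytes issued (phase 2),
 * roughly matching what the 0.8.x tools report. With a 0.7.x driver,
 * fall back to the old examined-based calculation.
 */
static double scan_pct_done(const scan_stat_t *ps)
{
    uint64_t done = zfs_module_is_08x() ? ps->pss_issued : ps->pss_examined;

    if (ps->pss_to_examine == 0)
        return 100.0;
    return 100.0 * (double)done / (double)ps->pss_to_examine;
}

int main(void)
{
    scan_stat_t ps = {
        .pss_to_examine = 2ULL << 40,   /* ~2T pool to scan        */
        .pss_examined   = 2ULL << 40,   /* metadata walk finished  */
        .pss_issued     = 512ULL << 30, /* scrub I/O partly done   */
    };

    printf("%.2f%% done\n", scan_pct_done(&ps));
    return 0;
}

The symptom reported below (the scanned figure hitting 100% within minutes while the scrub keeps running) is consistent with the old examined-based calculation being applied to the new two-phase counters.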

== Test ==

Install the HWE kernel on Bionic, create some large ZFS pools, and populate them with plenty of data. Then issue:

sudo zpool scrub poolname

and check the progress with:

sudo zpool status

Without the fix, the progress stats are incorrect. With the fix, the duration and rate stats are a fairly good approximation of the progress. Since the newer 0.8.x zfs now scans in two phases, the older zfs tools will only report accurate stats for phase 2 of the scan, to keep the output roughly compatible with the 0.7.x zfs utils.

== Regression Potential ==

This is a userspace reporting fix; it only affects the zpool status output while a scrub is running, so the impact of this fix is small and limited.

------------------------------------

I ran a zpool scrub prior to upgrading my 18.04 to the latest HWE kernel (5.3.0-26-generic #28~18.04.1-Ubuntu) and it ran properly:

eric@eric-8700K:~$ zpool status
  pool: storagepool1
 state: ONLINE
  scan: scrub repaired 1M in 4h21m with 0 errors on Fri Jan 17 07:01:24 2020
config:

 NAME STATE READ WRITE CKSUM
 storagepool1 ONLINE 0 0 0
   mirror-0 ONLINE 0 0 0
     ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M3YFRVJ3 ONLINE 0 0 0
     ata-ST2000DM001-1CH164_Z1E285A4 ONLINE 0 0 0
   mirror-1 ONLINE 0 0 0
     ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M1DSASHD ONLINE 0 0 0
     ata-ST2000DM006-2DM164_Z4ZA3ENE ONLINE 0 0 0

I ran zpool scrub after upgrading the kernel and rebooting, and now it no longer works properly. It appeared to finish in about 5 minutes but did not, and it reports that the scan is slow:

eric@eric-8700K:~$ sudo zpool status
  pool: storagepool1
 state: ONLINE
  scan: scrub in progress since Fri Jan 17 15:32:07 2020
 1.89T scanned out of 1.89T at 589M/s, (scan is slow, no estimated time)
 0B repaired, 100.00% done
config:

 NAME STATE READ WRITE CKSUM
 storagepool1 ONLINE 0 0 0
   mirror-0 ONLINE 0 0 0
     ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M3YFRVJ3 ONLINE 0 0 0
     ata-ST2000DM001-1CH164_Z1E285A4 ONLINE 0 0 0
   mirror-1 ONLINE 0 0 0
     ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M1DSASHD ONLINE 0 0 0
     ata-ST2000DM006-2DM164_Z4ZA3ENE ONLINE 0 0 0

errors: No known data errors

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: zfsutils-linux 0.7.5-1ubuntu16.7
ProcVersionSignature: Ubuntu 5.3.0-26.28~18.04.1-generic 5.3.13
Uname: Linux 5.3.0-26-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.9-0ubuntu7.9
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Fri Jan 17 16:22:01 2020
InstallationDate: Installed on 2018-03-07 (681 days ago)
InstallationMedia: Ubuntu 17.10 "Artful Aardvark" - Release amd64 (20180105.1)
SourcePackage: zfs-linux
UpgradeStatus: Upgraded to bionic on 2018-08-02 (533 days ago)
modified.conffile..etc.sudoers.d.zfs: [inaccessible: [Errno 13] Permission denied: '/etc/sudoers.d/zfs']

Changed in zfs-linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → High
status: New → Triaged
Revision history for this message
Eric Brown (eberic) wrote :

Note: rebooting and running the older kernel (4.15.0-74-generic #84-Ubuntu) results in the scrub proceeding at the expected rate.

Revision history for this message
Richard Laager (rlaager) wrote :

Your original scrub took just under 4.5 hours. Have you let the second scrub run anywhere near that long? If not, start there.

The new scrub code uses a two-phase approach. First it works through metadata determining what (on-disk) blocks to scrub. Second, it does the actual scrub. This allows ZFS to coalesce the blocks and do large, sequential reads in the second phase. This dramatically speeds up the total scrub time. In contrast, the original scrub code is doing a lot of small, random reads.
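
Roughly speaking, here is a toy C sketch of that two-phase idea (not the actual ZFS implementation): phase 1 gathers the on-disk offsets of the blocks to scrub, the offsets are sorted so neighbouring blocks coalesce, and phase 2 issues the reads in offset order.

/*
 * Toy illustration of the two-phase (sequential) scrub idea described
 * above -- not the actual ZFS code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int cmp_offset(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Phase 1: walk the metadata and record where the data blocks live.
     * In the real code this gathering is bounded by a RAM limit; when
     * the limit is hit, the blocks gathered so far are issued and the
     * walk resumes afterwards. */
    uint64_t offsets[] = { 900, 10, 520, 30, 20, 910, 500 };
    size_t n = sizeof(offsets) / sizeof(offsets[0]);

    /* Sort by on-disk offset so nearby blocks become large, sequential
     * reads instead of small, random ones. */
    qsort(offsets, n, sizeof(offsets[0]), cmp_offset);

    /* Phase 2: issue the reads in offset order and verify the data. */
    for (size_t i = 0; i < n; i++)
        printf("read block at offset %llu\n",
               (unsigned long long)offsets[i]);
    return 0;
}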

You might just be seeing the first phase completing in 5 minutes, but the second phase still needs to occur. Or, maybe it did part of the first phase but hit the RAM limit and needed to start the second phase.

If you've let it run for 4.5 hours and it's still showing that status, then I'd say something is wrong.

Revision history for this message
Eric Brown (eberic) wrote :

Okay, thank you. I will let the scrub run to completion on kernel 4.15, then reboot back into 5.3, scrub again, and give it time to see what happens. Even if it does end up completing, the text of zpool status was still incorrect: as it was running it said 5 mins remaining, then said it was 100% done, but also that the scan is slow (as above, and below).

scan: scrub in progress since Fri Jan 17 15:32:07 2020
 1.89T scanned out of 1.89T at 589M/s, (scan is slow, no estimated time)
 0B repaired, 100.00% done

Revision history for this message
Eric Brown (eberic) wrote :

Update: The scrub back on 4.15 went as expected. I rebooted with the 5.3 kernel and tried the scrub again. I saw the same behaviour: the scrub started with a very fast rate and a short time estimate (3-5 minutes), and reached 100% after about that time. However, it continued to say it was in progress, despite saying 100.00% done. It did not calculate a time estimate, saying the rate was slow; however, it did complete after 3h36m (faster than it used to on kernel 4.15).

eric@eric-8700K:~$ zpool status
  pool: storagepool1
 state: ONLINE
  scan: scrub repaired 0B in 3h36m with 0 errors on Sat Jan 18 14:50:16 2020
config:

 NAME STATE READ WRITE CKSUM
 storagepool1 ONLINE 0 0 0
   mirror-0 ONLINE 0 0 0
     ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M3YFRVJ3 ONLINE 0 0 0
     ata-ST2000DM001-1CH164_Z1E285A4 ONLINE 0 0 0
   mirror-1 ONLINE 0 0 0
     ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M1DSASHD ONLINE 0 0 0
     ata-ST2000DM006-2DM164_Z4ZA3ENE ONLINE 0 0 0

errors: No known data errors

I started a scrub again to see whether it was just the first time with this kernel, but again the % complete seems inaccurate, and the time remaining is definitely inaccurate (0h0m shortly after starting).

So at the very least it is clear that zpool status is not accurately reporting the scrub process on kernel 5.3.

Changed in zfs-linux (Ubuntu):
importance: High → Medium
Revision history for this message
Gerald Drouillard (gerald-drouillard) wrote :

I believe there was something on the zfs user list about a similar problem. Let the scrub run as long on the HWE kernel as it did on the other kernel and it looks like it will complete.

Revision history for this message
Eric Brown (eberic) wrote :

Yes, @gerald-drouillard, that's what I found. It will complete, but the progress and details reported by `zpool status` are incorrect.

Revision history for this message
Colin Ian King (colin-king) wrote :

OK, I'll look into this sometime this week. Thanks for the information.

Changed in zfs-linux (Ubuntu):
importance: Medium → High
status: Triaged → In Progress
Revision history for this message
Andrea Righi (arighi) wrote :

I'm trying to reproduce this issue, but the scrub always completes in a reasonable amount of time in my case and `zpool status` seems to report the correct statistics. I've tried with different pool sizes (all smaller than 256GB). Are you using any special configuration for the zpool?

I was wondering if this change can be related to this problem:
https://github.com/zfsonlinux/zfs/pull/7738/commits/a6f89b0168499335fb4f6a11e18ac1832e07cbd1

As a test, could you try to install zfs-dkms from this ppa?
https://launchpad.net/~arighi/+archive/ubuntu/zfs-linux

I've changed the code touched by commit a6f89b0168499335fb4f6a11e18ac1832e07cbd1 to be more similar to the code before the change, without reverting the fix.

description: updated
Revision history for this message
Colin Ian King (colin-king) wrote :

I've uploaded a fixed package; it will now proceed via the normal SRU process.

Revision history for this message
Eric Brown (eberic) wrote :

Thanks for identifying and fixing the bug! I'll test it when it lands.

Revision history for this message
Colin Ian King (colin-king) wrote :

@SRU folk, can this be expedited, as it affects ZFS scrubs with HWE kernels and is a relatively simple userspace fix?

Changed in zfs-linux (Ubuntu):
importance: High → Critical
Timo Aaltonen (tjaalton)
Changed in zfs-linux (Ubuntu Bionic):
status: New → Fix Committed
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (zfs-linux/0.7.5-1ubuntu16.9)

All autopkgtests for the newly accepted zfs-linux (0.7.5-1ubuntu16.9) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

lxd/3.0.3-0ubuntu1~18.04.1 (arm64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#zfs-linux

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Colin Ian King (colin-king) wrote :

The autopkgtests are working fine now after a re-trigger.

Revision history for this message
Colin Ian King (colin-king) wrote :

autopkgtest tests passed - all looks sane.

tags: added: verification-done-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.7.5-1ubuntu16.9

---------------
zfs-linux (0.7.5-1ubuntu16.9) bionic; urgency=medium

  * Fix scan rate reporting statistics (LP: #1860182)
    - 3322-add-scrub-compat-fixes.patch
    - 2200-add-zfs-0.6.x-ioctl-compat-shim.patch (refreshed)
    Use the zfs 8.x pool_scan_stat_t extra fields to calculate
    the scan progress when using zfs 8.x kernel drivers. For
    0.7.5 we can pass the larger 8.x port_scan_stat_t to 0.7.5
    zfs w/o problems and ignore these new fields and continue
    to use the 0.7.5 rate calculations.
    Refresh the ioctl compat shim to export the zfs module
    version for the scrub version check.

 -- Colin Ian King <email address hidden> Thu, 27 Feb 2020 17:09:12 +0000

Changed in zfs-linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for zfs-linux has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Eric Brown (eberic) wrote :

I just updated and the problem is fixed. Thanks everyone.

Changed in zfs-linux (Ubuntu):
status: In Progress → Fix Released