zsysd uses more than half of CPU and disk IO

Bug #1870461 reported by Ayaz Agayev on 2020-04-03
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
zsys (Ubuntu)
High
Didier Roche
Focal
High
Didier Roche

Bug Description

[Impact]
 * Infinite loop during GC phase can trigger (high CPU usage and disk IO). This has a no side-effect, just looping withouth any modifications.
 * The fix is to cover this case (a dataset that the GC can‘t destroy, put in a black list for the next loop state). The issue is that due to the order of if/else if, the else if "in blacklist" was never evaluated.
 * The new code is covered by a specific test and coverage confirms this is now run.

[Test Case]

 1. Create a dataset clone on an userdataset. This clone should have a snapshot or a snapshot/clone to not be automatically deletable.
 2. Run zsysctl service gc
-> no more infinite loop (the commands exits after a while with no timeout) with the fixed package.

[Regression Potential]
 * GC is a separate, timer-based process and don’t impact boot or other operations.
 * The 30 existing tests on the GC didn’t regress and we now have some additional ones around those use case.

---------

I was installed Ubuntu 20.04 with zfs yesterday, it was OK until a few updates, after that the zsysd was started draining CPU and disk IO. How can I fix it to back to normal? Is it possible to rollback recent system state? If Yes, just how?

NOTE: I was stopped and disabled zsysd service and removed zsysd.socket file, after that it worked normally for now. And there is this error occured when I am installing or removing a software package via apt:

ERROR couldn't connect to zsys daemon: connection error: desc = "transport: Error while dialing dial unix /run/zsysd.sock: connect: no such file or directory"

I think this is not a good approach

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: zsys 0.4.2
ProcVersionSignature: Ubuntu 5.4.0-21.25-generic 5.4.27
Uname: Linux 5.4.0-21-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu22
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Fri Apr 3 05:37:29 2020
InstallationDate: Installed on 2020-04-01 (1 days ago)
InstallationMedia: Ubuntu 20.04 LTS "Focal Fossa" - Alpha amd64 (20200330)
SourcePackage: zsys
UpgradeStatus: No upgrade log present (probably fresh install)

Ayaz Agayev (ayazgv) wrote :
Ayaz Agayev (ayazgv) on 2020-04-03
description: updated
description: updated
Ayaz Agayev (ayazgv) on 2020-04-03
description: updated
Didier Roche (didrocks) wrote :

Thanks for reporting this bug and sorry you have an issue with zsys.

It seems that zsys is going crazy in an infinite loop, I would be interested to know why:
1. can you try running manually "sudo zsysd -vv" and see paste the logs here? Do you have the spinning loops in that way as well?
2. if you don’t have the CPU going crazy, keep zsysd running as above and please run in another terminal "zsysctl service gc". Is the CPU going crazy now? If so, please attach the logs to this report.

Can you paste as well the result of:
- zpool list
- zfs list -t all

Disabling the socket and service is fine for now to prevent you going into this loop (but you won’t be able to revert as it’s zsys handling this). The error message just tell that it can’t save automatically the state which is expected as it’s zsysd doing it. This doesn’t hurt apart from not saving and not breaking your apt state.

Note that we have important (not zsys but zfs layout related) fixes in the incoming beta image (due today), I doubt though this is what is causing your issue.

Thanks in advance!

Ayaz Agayev (ayazgv) wrote :

Thanks for reply Didier!

1. I ran the sudo zsysd -vv and zfs list -t all, there are too many logs I attached them in logs.tar.gz file. After all commands there is no cpu load

2. zsysctl service gc command's result is:
ERROR couldn't connect to zsys daemon: connection error: desc = "transport: Error while dialing dial unix /run/zsysd.sock: connect: no such file or directory"

3. zpool list command's result is:
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 218G 23,8G 194G - - 2% 10% 1.00x ONLINE -

Also I moved the zsysd.socket file so I can restore this file if needed, I'm here to help to fix this issue :)

Ayaz Agayev (ayazgv) wrote :
  • logs Edit (48.4 KiB, application/x-tar)

I didn't see the logs file, re attaching it

Didier Roche (didrocks) wrote :

Excellent, thanks!

3. shows up that you are having a broken installation (because you installed in the bad period of previous days before beta). Today’s image is good, but before you reinstall (reverting to a good state is quite challenging, see my instructions on https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1867007/comments/6), let’s try to see why you have this zsysd CPU issue if you don’t mind!

1. All good then, apart from missing bpool due to what I explained above, the rest looks good and it can build the machines as expected. :)

2. Maybe I wasn’t clear: you need to have zsysd manually running (as the .socket unit is moved out) and I need the zsysd output, so what you need to do is:
- sudo ./zsysd -vv -> you well get the same output as before
- in another terminal, before zsysd exits (which happens after a minute of no activity), run "zsysctl service gc -vv"
-> you can paste both logs of zsysd and zsysctl commands and tell me if you see this crazy CPU looping while the gc is running.
Thanks!

Ayaz Agayev (ayazgv) wrote :
  • logs Edit (129.6 KiB, application/x-tar)

I ran these commands as you explained so yes CPU is going crazy while gc is running. I'm attaching the logs file to the comment and going to revert to good state as you explained in this link https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1867007/comments/6
After that I will tell here the result
Thanks!

Didier Roche (didrocks) wrote :

Great! I'll look on Monday.

But if you don't mind, please keep the broken state in case further testing is needed. As long as you keep zsys disabled, you should be fine!

Ayaz Agayev (ayazgv) wrote :

Is there any possibility for data loss?

Didier Roche (didrocks) wrote :

None, just don’t upgrade at worst before doing the operation on Monday :)

Ayaz Agayev (ayazgv) wrote :

Then, We are good to go :)

Didier Roche (didrocks) wrote :

Hum, I’m puzzled, the command ended correctly, right?
---
DEBUG [[c16b1021:d77ab6fb]] No more user states left for pass #2.
DEBUG [[c16b1021:d77ab6fb]] Unassociated user datasets GC
DEBUG Reset idle timeout to 1m0s
DEBUG Idle timeout expired
DEBUG Stopping daemon requested. Wait for active requests to close
DEBUG All connections closed
---

And no more trace, are you sure the CPU is still at 100% once the GC command ended?
Also, does zsysd exits after a one minute timeout after running this command? (like, no more work for it to do)

Ayaz Agayev (ayazgv) wrote :

Hey :),
Maybe I wasn't clear: the CPU working at 100 percent while GC and zsysd running. Then, after a minute both of them are stopped and the CPU functioning normally (at 1 to 5 percent) since them stopped

Ayaz Agayev (ayazgv) wrote :

Should I provide any additional information?

@ayazgv, your log file doesn't exhibit anything wrong and ends correctly.

Just to clarify is zsysd pegging your CPU for a long duration (tens of seconds or minutes) or for a short while?

It is not unexpected for the garbage collector to consume a fair but reasonable amount of CPU cycles to calculate the dependencies of the dataset and actually perform the removal operations. Besides, this operation is executed only once a day.

It'd also be interesting to know if the CPU is still running at 100% after the GC finished its job and until zsysd exits?

Didier Roche (didrocks) wrote :

Setting to incomplete until we get more feedback

Changed in zsys (Ubuntu):
status: New → Incomplete

I am experiencing a very similar, if not the same, issue.

From time to time, zsysd CPU and I/O usage spikes and stays high for at least 15 minutes. After that, my laptop starts overheating, and I manually restart zsysd using systemctl.

After this restart, a manual run of `zsysctl service gc -vv` cleans old snapshots as expected and takes a very short time, not even a minute. It seems that the previous run that caused the high CPU usage got stuck on some loop.

I am uploading the output of both `zpool` and `zfs list -t all` for the time being.

I have also added this drop-in to the zsysd unit to collect verbose logs for you when this issue reproduces on my side.

# /etc/systemd/system/zsysd.service.d/override.conf
[Service]
ExecStart=
ExecStart=/sbin/zsysd -vv

Is there anything else I could do to help you with this?

Didier Roche (didrocks) wrote :

Thanks for the head’s up baudlord!

I think there are 2 things:
- the high CPU usage on garbage collection
- you have a lot of datasets (probably related to docker, can you confirm?)

Definitively, your override.conf will be of help to have logs the next time you get this high CPU so that we can see what is exactly happening / when this (inifite?) loops occurs. We still can’t reproduce here. I think it’s related to the amount of datasets created by docker, but in any case, it will be good to fix the CPU issue.

FYI, on the docker side, we fixed it yesterday on groovy and expect to SRU it to 20.04 LTS in the coming days (https://bugs.launchpad.net/ubuntu/+source/docker.io/+bug/1879473). However, if possible, it would be good to have your high CPU logs before you test this fix, so that we can analyze and fix it, as probably other setup can create this.

Keep us posted, and thanks again!

Hello Didier:

I'm afraid I haven't been able to reproduce the bug last week. I'll see if I have any luck this week so I can provide you with logs.

It might be related to Docker, as last week I did not do any development with containers.

I'll try to reproduce this without updating this week.

Thank you for your work, Didier. :)

Didier Roche (didrocks) wrote :

No worry! Keep us posted :)

Didier Roche (didrocks) wrote :

I was able to reproduce it in a VM and a fix is in the way (committed in master now with tests)!

Changed in zsys (Ubuntu):
status: Incomplete → Triaged
Changed in zsys (Ubuntu):
importance: Undecided → High
assignee: nobody → Didier Roche (didrocks)

Awesome Didier!

I'm afraid I couldn't reproduce it on my side. Do you still need me to get the logs if I can reproduce it?

Didier Roche (didrocks) wrote :

Hey!

No, there is a high chance that the issue I spotted and yours were the same. However, once we SRU the fix (once this bug is closed), and you experience it again, do not hesitate to file another bug report!

Thanks again :)

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zsys - 0.5.0

---------------
zsys (0.5.0) groovy; urgency=medium

  [ Jean-Baptiste Lallement ]
  [ Didier Roche ]
  * Fix infinite GC loop (LP: #1870461)
  * Enhance timeout handling to avoid error rpc error: code = DeadlineExceeded
    desc = context deadline exceeded while the daemon is doing work
    (LP: #1875564)
  * Stop taking automated or manual snapshot when there is less than 20% of
    free disk space (LP: #1876334)
  * Enable trim support for upgrading users (LP: #1881540)
  * Only clean up previously linked user datasets when unlinked under USERDATA
    (LP: #1881538)
  * Strategy for deleted user datasets via a new hidden command called by
    userdel (LP: #1870058)
  * Get better auto snapshots message when integrated to apt (LP: #1875420)
  * Update LastUsed on shutdown via a new hidden command service call
    (LP: #1881536)
  * Prevent segfault immediately after install when zfs kernel module isn't
    loaded (LP: #1881541)
  * Don’t try to autosave gdm user (and in general non system user), even if
    systemd --user is started for them. (LP: #1881539)
  * Prevent apt printing errors when zsys is removed without purge
    (LP: #1881535)
  * Some tests enhancements:
    - new tests for all the above
    - allow setting a different local socket for debugging/tests purposes only
    - ascii order datasets in golden files
  * Typos and messages fixes. Direct prints are not prefixed with INFO
    anymore.
  * Refreshed po and readme with the above.

 -- Didier Roche <email address hidden> Mon, 01 Jun 2020 09:26:52 +0200

Changed in zsys (Ubuntu):
status: Triaged → Fix Released
Didier Roche (didrocks) on 2020-06-09
description: updated
Didier Roche (didrocks) on 2020-06-09
Changed in zsys (Ubuntu Focal):
importance: Undecided → High
assignee: nobody → Didier Roche (didrocks)
Didier Roche (didrocks) on 2020-06-09
description: updated

Hello Ayaz, or anyone else affected,

Accepted zsys into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zsys/0.4.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in zsys (Ubuntu Focal):
status: New → Fix Committed
tags: added: verification-needed verification-needed-focal

SRU verification for Focal:
I have reproduced the problem with zsys 0.4.5 in focal and have verified that the version of zsys 0.4.6 in -proposed fixes the issue.

Marking as verification-done

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zsys - 0.4.6

---------------
zsys (0.4.6) focal; urgency=medium

  [ Jean-Baptiste Lallement ]
  [ Didier Roche ]
  * Fix infinite GC loop (LP: #1870461)
  * Enhance timeout handling to avoid error rpc error: code = DeadlineExceeded
    desc = context deadline exceeded while the daemon is doing work
    (LP: #1875564)
  * Stop taking automated or manual snapshot when there is less than 20% of
    free disk space (LP: #1876334)
  * Enable trim support for upgrading users (LP: #1881540)
  * Only clean up previously linked user datasets when unlinked under USERDATA
    (LP: #1881538)
  * Strategy for deleted user datasets via a new hidden command called by
    userdel (LP: #1870058)
  * Get better auto snapshots message when integrated to apt (LP: #1875420)
  * Update LastUsed on shutdown via a new hidden command service call
    (LP: #1881536)
  * Prevent segfault immediately after install when zfs kernel module isn't
    loaded (LP: #1881541)
  * Don’t try to autosave gdm user (and in general non system user), even if
    systemd --user is started for them. (LP: #1881539)
  * Prevent apt printing errors when zsys is removed without purge
    (LP: #1881535)
  * Some tests enhancements:
    - new tests for all the above
    - allow setting a different local socket for debugging/tests purposes only
    - ascii order datasets in golden files
  * Typos and messages fixes. Direct prints are not prefixed with INFO
    anymore.
  * Refreshed po and readme with the above.

 -- Didier Roche <email address hidden> Mon, 01 Jun 2020 09:26:52 +0200

Changed in zsys (Ubuntu Focal):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for zsys has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers