Comment 9 for bug 1779156

Paride Legovini (paride) wrote :

Hi,

I hit this issue on Bionic, Disco, and Eoan. Our (server-team) Jenkins nodes are often filled with stale LXD containers, left behind because of "fails to destroy ZFS filesystem" errors.

Some thoughts and qualitative observations:

0. This is not a corner case; I see the problem all the time.

1. There is probably more than one issue involved here, even though we get similar error messages when trying to delete a container.

2. One issue is about mount namespaces: stray mounts prevent the container from being deleted. This can be worked around by entering the namespace and unmounting; the container can then be deleted. When this happens, retrying `lxc delete` doesn't help. This is described in [0]. I think newer versions of LXD are much less prone to ending up in this state.
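
A minimal sketch of that workaround, assuming a non-snap LXD 3.x install with a ZFS pool named "default"; the PID, container name and paths are placeholders, not taken from an actual failure:

    # find processes whose mount namespace still references the container's dataset
    grep -l 'containers/<x>' /proc/[0-9]*/mountinfo
    # enter the offending mount namespace and unmount the stray mount
    nsenter -t <pid> -m umount /var/lib/lxd/storage-pools/default/containers/<x>
    # the delete should now go through
    lxc delete <x>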

3. In other cases `lxc delete --force` fails with the "ZFS dataset is busy" error, but the deletion succeeds if it is retried immediately afterwards. In my case I don't even need to wait a single second: the second delete in `lxc delete --force <x>; lxc delete <x>` already succeeds. Stopping and deleting the container as separate operations also works.
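
A crude retry wrapper along these lines captures that manual workaround; the container name and retry count are placeholders:

    # retry the forced delete a few times; the second attempt usually succeeds
    for i in 1 2 3; do
        lxc delete --force <x> && break
        sleep 1
    done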

4. It has been suggested in [0] that LXD could retry the "delete" operation if it fails. stgraber wrote that LXD *already* retries the operation 20 times over 10 seconds, but the outcome is still a failure. It is not clear to me why retrying manually works while LXD's auto-retrying does not.
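
To see whether the dataset is actually still busy from ZFS's point of view, the destroy can also be attempted by hand; the dataset path below is only what a default LXD ZFS pool would use, adjust as needed:

    # list the leftover dataset and try to destroy it directly
    zfs list -r default/containers
    zfs destroy default/containers/<x>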

5. Some weeks ago the error message changed from "Failed to destroy ZFS filesystem: dataset is busy" to just "Failed to destroy ZFS filesystem:", with no further detail. I can't tell which specific upgrade triggered this change.

6. I see this problem with both file-backed and device-backed zpools.
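
For reference, a sketch of how the two pool flavours can be created, assuming LXD 3.x defaults (pool names, size and block device are placeholders):

    # file-backed (loop device) pool
    lxc storage create zpool-file zfs size=20GB
    # device-backed pool, on a spare block device
    lxc storage create zpool-dev zfs source=/dev/sdb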

7. I'm not sure system load plays a role: I often hit the problem on my lightly loaded laptop.

8. I don't have steps that reproduce the problem with 100% reliability, but I personally hit it more often than not. See also the next point.

9. In my experience a system can be in a "bad state" (the problem always happens) or in a "good state" (the problem never happens). When a system is in a "good state" we can `lxc delete` hundreds of containers with no errors. I can't tell what makes a system switch from a good state to a bad one, and I'm almost certain I have also seen systems switch from a bad state to a good one.
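
A quick way to tell which state a system is in is to churn through a few launch/delete cycles; this is only a sketch, and the image and container names are arbitrary:

    # on a "bad" system this loop errors out almost immediately,
    # on a "good" one it runs to completion
    for i in $(seq 1 10); do
        lxc launch ubuntu:18.04 "zfs-test-$i"
        lxc delete --force "zfs-test-$i"
    done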

10. The lxcfs package is not installed on the systems where I hit this issue.

That's it for the moment. Thanks for looking into this!

Paride

[0] https://github.com/lxc/lxd/issues/4656