during shutdown libvirt-guests gets stopped after file system unmount

Bug #1832859 reported by Erlend Slettevoll
This bug affects 1 person
Affects            Status        Importance   Assigned to   Milestone
lvm2               New           Unknown
libvirt (Ubuntu)   Incomplete    Undecided    Unassigned
lvm2 (Fedora)      In Progress   High
lvm2 (Ubuntu)      New           Undecided    Unassigned

Bug Description

When using automatic suspend at reboot/shutdown, it makes sense to store the suspend data on a separate partition to ensure there is always enough available space. However, this does not work, as the partition gets unmounted before or during libvirt suspend.

Steps to reproduce:

1. Use Ubuntu 18.04.2 LTS
2. Install libvirt + qemu-kvm
3. Start a guest
4. Set libvirt-guests to suspend at shutdown/reboot by editing /etc/default/libvirt-guests
5. Create an fstab entry that mounts a separate partition at /var/lib/libvirt/qemu/save, then run sudo mount /var/lib/libvirt/qemu/save to mount it (see the example after these steps).
6. Reboot
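
For illustration, steps 4 and 5 could look roughly like this (ON_SHUTDOWN/ON_BOOT match the values used later in this report, and the device name comes from the journal excerpt below; the filesystem type and mount options are assumptions):

  # /etc/default/libvirt-guests (relevant settings)
  ON_SHUTDOWN=suspend
  ON_BOOT=start

  # /etc/fstab entry for the separate save partition (fs type/options assumed)
  /dev/mapper/libvirt_lvm-suspenddata  /var/lib/libvirt/qemu/save  ext4  defaults  0  0

  $ sudo mount /var/lib/libvirt/qemu/save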

Expected result:
The guest suspend data would be written to /var/lib/libvirt/qemu/save, resulting in the data being stored on the partition specified in fstab. At boot, this partition would be mounted as specified in fstab and libvirt-guests would be able to read the data and restore the guests.

Actual result:
The partition gets unmounted before libvirt-guests suspends the guests, resulting in the data being stored on the partition containing the root file system. During boot, the empty partition gets mounted over the non-empty /var/lib/libvirt/qemu/save directory, resulting in libvirt-guests being unable to read the saved data.

As a side effect, the saved data uses up space on the root partition even though the directory appears empty.

Here are some of the relevant lines from the journal:

Jun 14 00:00:04 libvirt-host blkdeactivate[4343]: Deactivating block devices:
Jun 14 00:00:04 libvirt-host systemd[1]: Unmounted /var/lib/libvirt/qemu/save.
Jun 14 00:00:04 libvirt-host blkdeactivate[4343]: [UMOUNT]: unmounting libvirt_lvm-suspenddata (dm-3) mounted on /var/lib/libvirt/qemu/save... done

Jun 14 00:00:04 libvirt-host libvirt-guests.sh[4349]: Running guests on default URI: vps1, vps2, vps3
Jun 14 00:00:04 libvirt-host blkdeactivate[4343]: [MD]: deactivating raid1 device md1... done
Jun 14 00:00:05 libvirt-host libvirt-guests.sh[4349]: Suspending guests on default URI...
Jun 14 00:00:05 libvirt-host libvirt-guests.sh[4349]: Suspending vps1: ...
Jun 14 00:00:05 libvirt-host blkdeactivate[4343]: [LVM]: deactivating Volume Group libvirt_lvm... skipping

Jun 14 00:00:10 libvirt-host libvirt-guests.sh[4349]: Suspending vps1: 5.989 GiB
Jun 14 00:00:15 libvirt-host libvirt-guests.sh[4349]: Suspending vps1: ...
Jun 14 00:00:20 libvirt-host libvirt-guests.sh[4349]: Suspending vps1: ...
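
A convenient way to extract the relevant ordering from the previous boot's journal (assuming persistent journalling is enabled; the mount unit name is derived from the mount path via systemd's escaping rules, so adjust it if your path differs):

  $ journalctl -b -1 -o short-precise \
        -u libvirt-guests.service -u blk-availability.service \
        -u var-lib-libvirt-qemu-save.mount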

Revision history for this message
In , rmetrich (rmetrich-redhat-bugs) wrote :

Description of problem:

The blk-availability.service unit is activated automatically when multipathd is enabled, even if multipathd ends up not being used.
This leads the blk-availability service to unmount file systems too early, breaking unit ordering and causing shutdown issues for custom services that require certain mount points.

Version-Release number of selected component (if applicable):

device-mapper-1.02.149-10.el7_6.3.x86_64

How reproducible:

Always

Steps to Reproduce:

1. Enable multipathd even though there is no multipath device

  # yum -y install device-mapper-multipath
  # systemctl enable multipathd --now

2. Create a custom mount point "/data"

  # lvcreate -n data -L 1G rhel
  # mkfs.xfs /dev/rhel/data
  # mkdir /data
  # echo "/dev/mapper/rhel-data /data xfs defaults 0 0" >> /etc/fstab
  # mount /data

3. Create a custom service requiring mount point "/data"

  # cat > /etc/systemd/system/my.service << EOF
[Unit]
RequiresMountsFor=/data

[Service]
ExecStart=/bin/bash -c 'echo "STARTING"; mountpoint /data; true'
ExecStop=/bin/bash -c 'echo "STOPPING IN 5 SECONDS"; sleep 5; mountpoint /data; true'
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=default.target
EOF
  # systemctl daemon-reload
  # systemctl enable my.service --now

4. Set up persistent journal and reboot

  # mkdir -p /var/log/journal
  # systemctl restart systemd-journald
  # reboot

5. Check the previous boot's shutdown

  # journalctl -b -1 -o short-precise -u my.service -u data.mount -u blk-availability.service

Actual results:

-- Logs begin at Thu 2019-04-18 12:48:12 CEST, end at Thu 2019-04-18 13:35:50 CEST. --
Apr 18 13:31:46.933571 vm-blkavail7 systemd[1]: Started Availability of block devices.
Apr 18 13:31:48.452326 vm-blkavail7 systemd[1]: Mounting /data...
Apr 18 13:31:48.509633 vm-blkavail7 systemd[1]: Mounted /data.
Apr 18 13:31:48.856228 vm-blkavail7 systemd[1]: Starting my.service...
Apr 18 13:31:48.894419 vm-blkavail7 bash[2856]: STARTING
Apr 18 13:31:48.930270 vm-blkavail7 bash[2856]: /data is a mountpoint
Apr 18 13:31:48.979457 vm-blkavail7 systemd[1]: Started my.service.
Apr 18 13:35:02.544999 vm-blkavail7 systemd[1]: Stopping my.service...
Apr 18 13:35:02.547811 vm-blkavail7 systemd[1]: Stopping Availability of block devices...
Apr 18 13:35:02.639325 vm-blkavail7 bash[3393]: STOPPING IN 5 SECONDS
Apr 18 13:35:02.760043 vm-blkavail7 blkdeactivate[3395]: Deactivating block devices:
Apr 18 13:35:02.827170 vm-blkavail7 blkdeactivate[3395]: [SKIP]: unmount of rhel-swap (dm-1) mounted on [SWAP]
Apr 18 13:35:02.903924 vm-blkavail7 systemd[1]: Unmounted /data.
Apr 18 13:35:02.988073 vm-blkavail7 blkdeactivate[3395]: [UMOUNT]: unmounting rhel-data (dm-2) mounted on /data... done
Apr 18 13:35:02.988253 vm-blkavail7 blkdeactivate[3395]: [SKIP]: unmount of rhel-root (dm-0) mounted on /
Apr 18 13:35:03.083448 vm-blkavail7 systemd[1]: Stopped Availability of block devices.
Apr 18 13:35:07.693154 vm-blkavail7 bash[3393]: /data is not a mountpoint
Apr 18 13:35:07.696330 vm-blkavail7 systemd[1]: Stopped my.service.

--> We can see the following:
- blkdeactivate runs, unmounting /data, even though my.service is still running (hence the unexpected ...


Revision history for this message
In , prajnoha (prajnoha-redhat-bugs) wrote :

Normally, I'd add Before=local-fs-pre.target to blk-availability.service, so that on shutdown its ExecStop would run after all local mount points are unmounted.

The problem might be with all the dependencies like the iscsi, fcoe and rbdmap services, where we need to make sure that these are executed *after* blk-availability. So I need to find a proper target that we can hook onto so that it also fits all the dependencies. It's possible we need to create a completely new target so we can properly synchronize all the services on shutdown. I'll see what I can do...

Revision history for this message
In , rmetrich (rmetrich-redhat-bugs) wrote :

Indeed, I wasn't able to find a proper target; none exists.
I believe blk-availability itself needs to be modified to only deactivate non-local disks (hopefully there is a way to distinguish).

Paride Legovini (paride)
tags: added: server-triage-discuss
Paride Legovini (paride)
tags: removed: server-triage-discuss
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for the report.
One might first think to just add

Requires=local-fs.target
After=local-fs.target

But in fact it might need even more than that.
I wonder if (per [1]) the following might be even better.

Requires=sysinit.target
After=sysinit.target

Let me do some tests and then maybe suggest it upstream ...

[1]: https://www.freedesktop.org/software/systemd/man/bootup.html

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Requires would actually be wrong; there can be cases where these are not strictly required.
But After= should be fine.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This is reported for issues with suspend, using /var/lib/libvirt/qemu/save as the example.
But IMHO this would also be true for the more common case of guest shutdown (the default) with /var on an extra partition, since the default image paths are under /var.
I know this would not be an immediate crash, since the running guests would block the unmount as long as needed, but having this ordered more explicitly might be good for that case as well.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Repro:
$ sudo mkdir /var/test
$ echo "/var/test /var/lib/libvirt/qemu/save none bind 0 0" | sudo tee -a /etc/fstab
$ sudo sed -i -e 's/#ON_SHUTDOWN=shutdown/ON_SHUTDOWN=suspend/' /etc/default/libvirt-guests
$ sudo sed -i -e 's/#ON_BOOT=ignore/ON_BOOT=start/' /etc/default/libvirt-guests
$ sudo reboot
$ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=eoan
$ uvt-kvm create --password ubuntu eoan arch=amd64 release=eoan label=daily

It seemed rather clear from the report, but I wanted to try it myself.
First I tried with a bind mount as shown above, since that makes the setup even simpler (no extra disk needed).

But I found that this works:
[ 226.602702] libvirt-guests.sh[1727]: Running guests on default URI: eoan
[ 226.667247] libvirt-guests.sh[1727]: Suspending guests on default URI...
[ 226.696513] libvirt-guests.sh[1727]: Suspending eoan: ...
[ 242.906340] libvirt-guests.sh[1727]: Suspending eoan: 66.909 MiB
[ 243.910964] libvirt-guests.sh[1727]: Suspending eoan: done
...
[ OK ] Stopped target Local File Systems.

The shutdown took quite a few seconds, so on shutdown the ordering was not an issue.
Note: it even worked after I changed the setup to use an extra disk.

This is due to:
libvirt-guests.service
  After=libvirtd.service
which in turn is:
  After=local-fs.target

So this actually works fine on shutdown and restart for me.
Ordering seems correct.

I retried it a few times but it worked every time - the ordering was strict.

@Erlend - could you show the journal content of, e.g., a failing shutdown & startup?
It really should always be:
boot: local-fs.target -> libvirtd.service -> libvirt-guests.service
 => the FS should be up when needed.
shutdown: libvirt-guests.service -> libvirtd.service -> local-fs.target
 => the FS should go down only after it is no longer needed
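
A quick way to double-check that chain on a given system is to print the effective ordering properties of the involved units (a sketch; it simply dumps the After= lists and filters for the units of interest):

  $ systemctl show -p After libvirt-guests.service | tr ' ' '\n' | grep -E 'libvirtd|local-fs'
  $ systemctl show -p After libvirtd.service | tr ' ' '\n' | grep local-fs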

Changed in libvirt (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I did some retries with bigger guests (slower) and such.
But all worked.

Attached an example of such a reboot cycle (main console).
You can see the shutdown waiting about 2 minutes with the FS unmount for the libvirt-guests suspend to finish.
Then on reboot it clearly runs after the FS mounts are up and is therefore able to find and start the guest.

Revision history for this message
Erlend Slettevoll (erlendsl) wrote :

Thank you for looking into this. I have attached the journal of one shutdown and boot. It seems like during boot everything happens in the correct order. As you pointed out, this is expected due to the "After" parameter in the service files.

However, as you can see from the shutdown.txt file, blkdeactivate gets called before libvirt-guests. It is entirely possible that the issue is not due to libvirt-guests, but I'm struggling to understand what is causing this. The host is set up with software RAID, where root, boot and swap are on primary partitions and the remaining space is an LVM volume group used by the libvirt guests. The volume group also contains the volume mounted on /var/lib/libvirt/qemu/save.

The errors from suspending the second and third guests are probably due to the root file system being full, as it is not sized to hold all the suspend data.

Revision history for this message
Erlend Slettevoll (erlendsl) wrote :

(I wrote the comment before realizing that I could only attach one file. Journal.txt contains both the shutdown and boot sequences.)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Yeah, I think I found the issue in your log.
Consider [1] again and look at
  /lib/systemd/system/blk-availability.service

This is what calls blkdeactivate on shutdown.

It only has a
  WantedBy=sysinit.target
but no Before= ordering against anything.
It even has
  DefaultDependencies=no
which makes it start early, waiting only for the listed
After=lvm2-activation.service lvm2-lvmetad.service iscsi-shutdown.service iscsi.service iscsid.service fcoe.service

But in reverse that also means it waits on NOTHING before it starts "stopping", which is the call to blkdeactivate.
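
Pieced together from the directives quoted above (this is a reconstruction, not the complete file as shipped, and the exact content may differ between releases), the unit looks roughly like:

  # /lib/systemd/system/blk-availability.service (reconstructed excerpt; the
  # [Service] section, whose stop action calls blkdeactivate, is omitted here)
  [Unit]
  DefaultDependencies=no
  After=lvm2-activation.service lvm2-lvmetad.service iscsi-shutdown.service iscsi.service iscsid.service fcoe.service
  # note: no Before= ordering against anything

  [Install]
  WantedBy=sysinit.target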

IMHO that is a bug in lvm2, which should ensure that this is ordered somewhere sensible:
- closely after local-fs.target (on shutdown); I mean, the FS should be gone before deactivating devices, right?
- the startup ordering isn't important, as the start action just calls /bin/true

My (uneducated) suggestion would be to add
  Before=local-fs.target
to
 /lib/systemd/system/blk-availability.service
Then run
 $ systemctl daemon-reload

(this may then need another reboot, not sure).
From there I'd assume that the shutdown ordering (being the inverse) would order it AFTER local-fs, which IMHO is what we'd want.

Please give this a try and I'll add an LVM task here.

[1]: https://www.freedesktop.org/software/systemd/man/bootup.html
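
If you prefer not to edit the packaged unit file directly, the same ordering hint can be added as an override drop-in (a sketch of the same suggestion, not a tested or upstream change):

  $ sudo systemctl edit blk-availability.service
  # in the editor that opens, add (systemd stores this as
  # /etc/systemd/system/blk-availability.service.d/override.conf):
  [Unit]
  Before=local-fs.target

  $ sudo systemctl daemon-reload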

Revision history for this message
Erlend Slettevoll (erlendsl) wrote :

I can confirm that adding Before=local-fs.target to blk-availability.service solved the issue for me. Now everything happens in the correct order. I attached the shutdown journal for reference.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks, Erlend, for the log.
I'm glad the suggestion worked. Now we need to check in which scope the file is defined and ask for a change there. This should also make sure we are not missing any hidden requirement that would be thwarted by the change (I could imagine it then starting too late for other things, or even creating a dependency loop, as it has quite a few After= dependencies).

The file has been installed since late 2017:
 196 lvm2 (2.02.173-2) unstable; urgency=medium
...
 200 * Install and enable blk-availability service.

Its actual content comes from upstream, so that is the place to discuss it:
  scripts/blk_availability_systemd_red_hat.service.in

I have found that newer versions have added one dependency.
=> https://github.com/lvmteam/lvm2/issues/17
But that only ensures the system doesn't reach shutdown.target before this is done.
It does -not- ensure that devices aren't unmounted before whatever uses the FS on them is finished.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Changed in lvm2:
status: Unknown → New
Revision history for this message
In , rmetrich (rmetrich-redhat-bugs) wrote :

Hi Peter,

Could you explain why blk-availability is needed when using multipath or iscsi?
With systemd ordering dependencies in units, is that really needed?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Linked the related RH bug that was mentioned upstream https://bugzilla.redhat.com/show_bug.cgi?id=1701234

Revision history for this message
In , prajnoha (prajnoha-redhat-bugs) wrote :

(In reply to Renaud Métrich from comment #4)
> Hi Peter,
>
> Could you explain why blk-availability is needed when using multipath or
> iscsi?
> With systemd ordering dependencies in units, is that really needed?

It is still needed because otherwise there wouldn't be anything else to properly deactivate the stack. Even though the blk-availability.service with its blkdeactivate call is still not perfect, it's still better than nothing and better than letting systemd shoot down the devices on its own within its "last-resort" device deactivation loop that happens in the shutdown initramfs (at that point the iscsi/fcoe and all the other devices are already disconnected anyway, so anything on top can't be properly deactivated).

We've just received related report on github too (https://github.com/lvmteam/lvm2/issues/18).

I'm revisiting this problem now. The correct solution requires more patching - this part is very fragile at the moment (...easy to break other functionality).

Revision history for this message
In , prajnoha (prajnoha-redhat-bugs) wrote :

(In reply to Renaud Métrich from comment #3)
> I believe blk-availability itself needs to be modified to only deactivate
> non-local disks (hopefully there is a way to distinguish).

It's possible that we need to split blk-availability (and blkdeactivate) in two because of this... There is a way to distinguish, I hope (definitely for iscsi/fcoe), but there currently isn't a central authority to decide this, so it must be done manually (checking certain properties in sysfs "manually").
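
As an illustration only (my own rough heuristic, not anything blkdeactivate currently does), an iSCSI-backed SCSI disk can usually be told apart from a local one by the iSCSI session that shows up in its sysfs device path:

  # rough sketch: classify /dev/sd* disks as iSCSI-backed or (probably) local
  for dev in /sys/block/sd*; do
      if readlink -f "$dev/device" | grep -q '/session[0-9]'; then
          echo "$(basename "$dev"): iSCSI-backed (remote)"
      else
          echo "$(basename "$dev"): probably local"
      fi
  done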

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

It seems - somewhat expected - that the solution needs to be a bit bigger, due to further service dependencies that might create a loop. Until that solution is sorted out upstream, we should not implement another custom emergency fix that might cause more trouble.

Instead, while we wait for the final solution from upstream, I'd recommend that everybody affected evaluate their own system to see whether the modification suggested in comment #8 suffices as a workaround.

Revision history for this message
In , rmetrich (rmetrich-redhat-bugs) wrote :

I must be missing something. This service is used to deactivate "remote" block devices requiring the network, such as iscsi or fcoe.
Why aren't these services deactivating the block devices by themselves?
That way systemd won't kill everything abruptly.

Revision history for this message
In , prajnoha (prajnoha-redhat-bugs) wrote :

(In reply to Renaud Métrich from comment #7)
> I must be missing something. This service is used to deactivate "remote"
> block devices requiring the network, such as iscsi or fcoe.

Nope, ALL storage, remote as well as local, if possible. We need to look at the complete stack (e.g. device-mapper devices, which are layered on top of other layers, are set up locally).

> Why aren't these services deactivating the block devices by themselves?

Well, honestly, because nobody has ever solved that :)

At the beginning it probably wasn't that necessary, and if you just shut your system down and left the devices as they were (unattached, not deactivated), it wasn't such a problem. But now, with various caching layers, thin pools... it's getting quite important to deactivate the stack properly, in order to also properly flush any metadata or data.

Of course, we still need to account for the situation where there's a power outage and the machine is not backed by any other power source, so the machine gets shut down immediately (for that there are various checking and fixing mechanisms). But it's certainly better to avoid that situation, as you could still lose some data.

Systemd's loop in the shutdown initramfs is really the last-resort thing to execute, but we can't rely on it (it's just a loop over the device list with a limited iteration count; it doesn't look at the real nature of each layer in the stack).

Revision history for this message
In , rmetrich (rmetrich-redhat-bugs) wrote :

OK, then we need a "blk-availability-local" service and "blk-availability-remote" service and maybe associated targets, similar to "local-fs.target" and "remote-fs.target".
Probably this should be handled by the systemd package itself, typically by analyzing device properties when a device shows up in udev.

Changed in lvm2 (Fedora):
importance: Unknown → High
status: Unknown → Confirmed
Revision history for this message
In , prajnoha (prajnoha-redhat-bugs) wrote :

Based on the report here, this affects only setups with custom services/systemd units. Also, blk-availability/blkdeactivate has been in RHEL 7 since 7.0 and this seems to be the only report we have received so far (therefore, I don't expect many users to be affected by this issue).

Also, I think it's less risky to add the extra dependency as already described at https://access.redhat.com/solutions/4154611 than to split blk-availability / blkdeactivate into (at least) two parts running at different times. If we did that, we'd need to introduce a new synchronization point (like a systemd target) that other services would need to depend on (and that would require many more changes in various other components, which involves risk).

In the future, we'll try to cover this shutdown scenario more properly with the new Storage Instantiation Daemon (SID).

Revision history for this message
Yuri Weinstein (yuri-weinstein) wrote :

This seems like a new problem to me after the 6/24/20 update.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Yuri,
there was no recent change in regard to LVM handling in libvirt, or in lvm itself, that would obviously be related. Which Ubuntu release are you on, and between which package versions did you upgrade?

Changed in lvm2 (Fedora):
status: Confirmed → In Progress
