Container file system corruption on libvirtd restart

Bug #1680997 reported by Eugen Rieck on 2017-04-08
This bug affects 1 person
Affects: libvirt (Ubuntu)
Importance: High
Assigned to: Unassigned

Bug Description

The LXC driver for libvirt has a data corruption bug that has just cost me a MySQL server.

Steps to reproduce:
- (for visualization only) In virt-manager add a connection to local lxc://
- create an LXC container that has a loop-mounted image file, and start it
- (for visualization only) the container shows as running in virt-manager
- systemctl stop libvirtd ; sleep 2 ; sync ; systemctl start libvirtd
- (for visualization only) the container shows as shut off in virt-manager
- The container no longer responds to network requests and has no attachable console
- The loop mount no longer shows up in the host-side "mount" output
      BUT: losetup -a reveals that a loop device is still attached to the image file
      BUT: In reality this loop device is still mounted, and processes in the container still access the file system
      BUT: There is no way to unmount or free it - losetup -d exits without an error but does nothing
- restart the container (virsh -c lxc:// start name-of-container or via virt-manager)
      THIS SHOULD NOT BE ALLOWED
- The image file is now mounted twice and corruption starts creeping in
- Depending on how long this state persists (in terms of IO), the damage can be significant
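
The hidden state described in the steps above (a loop device that losetup -a still lists but that no longer appears in the host's mount table) can be spotted with a small check. The following is only a sketch: the function names are made up for illustration, and it assumes the classic `losetup -a` line shape `/dev/loop0: [2049]:131 (/path/to/image.raw)`, which may differ between util-linux versions. A loop device that is attached but not mounted is not proof of this bug, only a hint worth looking at.

```python
import re

# Assumption: classic `losetup -a` output, e.g.
#   /dev/loop0: [2049]:131 (/path/to/image.raw)
LOSETUP_LINE = re.compile(r"^(?P<dev>/dev/loop\d+):\s+\S+\s+\((?P<backing>[^)]*)\)")


def attached_loops(losetup_output):
    """Map loop device -> backing file, parsed from `losetup -a` text."""
    loops = {}
    for line in losetup_output.splitlines():
        match = LOSETUP_LINE.match(line.strip())
        if match:
            loops[match.group("dev")] = match.group("backing")
    return loops


def mounted_devices(mounts_text):
    """Set of source devices from /proc/mounts-style text (first column)."""
    devices = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields:
            devices.add(fields[0])
    return devices


def hidden_loops(losetup_output, mounts_text):
    """Loop devices that are attached to a file but not visible as mounted."""
    mounted = mounted_devices(mounts_text)
    return {
        dev: backing
        for dev, backing in attached_loops(losetup_output).items()
        if dev not in mounted
    }
```

On a real host the two inputs would come from running `losetup -a` and reading `/proc/mounts`; any device returned for a container image while the container is shown as shut off matches the situation described here.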

Once the problem is finally discovered, the only way to unstick the container is a reboot. This is the final nail in the coffin: the hidden instance syncs AFTER the new instance, effectively writing the past back over the present.

This can be quite nasty if a libvirt restart results from an unattended upgrade.

I do understand that libvirt/LXC is deprecated - this strikes me as a rather unsubtle way to push users to the newest incarnation, though.
In non-enterprise environments (read: SMB or NGO) virt-manager is often used as a "power user" tool, and those end users are unwilling, if not unable, to use different toolsets for containers and full-fledged VMs. And disabling unattended upgrades in such an environment is inviting trouble.

Eugen Rieck (w-eugen) on 2017-04-08
affects: udev (Ubuntu) → libvirt (Ubuntu)
Joshua Powers (powersj) on 2017-04-10
Changed in libvirt (Ubuntu):
status: New → Triaged
importance: Undecided → High
Serge Hallyn (serge-hallyn) wrote :

Hi,

this would be deemed a high priority bug for upstream libvirt, but Ubuntu has always, back to 2010, supported lxc, then lxd, instead of libvirt-lxc. (So it's not that libvirt-lxc is deprecated, rather it was never supported in Ubuntu)

Which version of Ubuntu are you using?

Can you reliably reproduce it? If you can give a recipe for "start with a clean ubuntu cloud VM image; set up a container "like this", do that, then it dies", then we may be able to nail down the cause and/or talk to upstream.

If at all possible to migrate you to using lxd containers, that would be ideal, but I assume you have control software written around libvirt's api making that untenable.

Eugen Rieck (w-eugen) wrote :

The steps outlined in the initial bug report reliably (100%) reproduce the problem for me on Ubuntu 16.04; I have tested this in different environments (1x AMD, ca. 10x Intel).
Here's the short way to get there:

- Install a basic Ubuntu 16.04 Server
- apt-get install virt-manager (installing the GUI pulls in the heavy lifting components)
- create a libvirt/lxc container with a definition along the lines of
<domain type='lxc'>
  <name>AnyName</name>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64'>exe</type>
    <init>/sbin/init</init>
  </os>
  <features>
    <privnet/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <filesystem type='file' accessmode='passthrough'>
      <driver type='loop' format='raw'/>
      <source file='/path/to/image.raw'/>
      <target dir='/'/>
    </filesystem>
    <interface type='bridge'>
      <mac address='00:16:3e:34:ea:4b'/>
      <source bridge='br1'/>
      <target dev='vnet2'/>
      <guest dev='eth0'/>
    </interface>
    <console type='pty' tty='/dev/pts/3'>
      <source path='/dev/pts/3'/>
      <target type='lxc' port='0'/>
      <alias name='console0'/>
    </console>
    <hostdev mode='capabilities' type='misc'>
      <source>
        <char>/dev/net/tun</char>
      </source>
    </hostdev>
  </devices>
</domain>

(I have experimented quite a lot, and it boils down to the loop-mounted file system)

- Start the container via virsh or virt-manager
- Restart libvirtd
- Examine state of the container in virsh or virt-manager vs. the state of the loop device via losetup

The important parts are:
- The container is shown as stopped
- The container doesn't reply to network requests or console connection requests (i.e. it seems truly dead)
- The loop device doesn't show up in host-side "mount | grep loop"

- libvirtd allows the container to be (re-)started, ending up with a double-mounted file system
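
As a stop-gap on the admin side, a pre-start check could refuse to start a container whose image file is still attached to a loop device. The following is a minimal sketch, not anything libvirt provides: the function names are hypothetical, and it only looks at `<filesystem type='file'>` entries like the one in the domain definition above.

```python
import xml.etree.ElementTree as ET


def loop_image_paths(domain_xml):
    """Return the image files used by file-backed <filesystem> entries."""
    root = ET.fromstring(domain_xml)
    paths = []
    for fs in root.findall("./devices/filesystem[@type='file']"):
        source = fs.find("source")
        if source is not None and source.get("file"):
            paths.append(source.get("file"))
    return paths


def images_already_attached(domain_xml, attached_backing_files):
    """Image files of this domain that already back some loop device.

    `attached_backing_files` is assumed to be the list of backing files
    currently reported by losetup on the host.
    """
    attached = set(attached_backing_files)
    return [path for path in loop_image_paths(domain_xml) if path in attached]
```

The XML would come from `virsh -c lxc:// dumpxml name-of-container` and the backing files from `losetup -a`; a non-empty result means starting the container would mount the same image a second time.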

Migrating to lxd is not feasible in many environments. In addition, I am totally aware (and not criticizing!) that libvirt-lxc was/is unsupported. For me the real bug is that this scenario is possible: if Ubuntu were to just exclude libvirt's lxc driver, that would not be really fine, but at least it would be fool-proof.

The blocker to lxd adoption is not on the admin side (me), but on the end user side: Virt-manager is the favorite toy for SMB/NGO local admins, typically run via XQuartz on a Mac or XMing on Windows.

Please let me know, if and when I can be of further help - I am willing to test and have quite a few testbeds at hand, where I can easily create throw-away containers and ruin them. Since I tripped over this, I migrated around to have one node running no containers at every single customer, just to do exactly that.

[...]
> Migrating to lxd is not feasible in many environments, in addition to that I am totally aware
> (and not criticizing!), that libvirt-lxc was/is unsupported.

Too bad, as lxd really is a great way to go in this case.
But I want to state that I really appreciate your understanding.

> For me the real bug is, that this scenario is possible: If Ubuntu were to just exclude
> libvirt's lxc driver, that would be not really fine, but at least fool-proof.

Nobody has been brave enough to do so yet, but there were discussions about doing exactly that last cycle, and there probably will be again in a few weeks for the next release.
We had other issues which ended up on the "well, but it is not meant to be supported" track; having a user/admin voice like yours in that is really helpful for those discussions.

> The blocker to lxd adoption is not on the admin side (me), but on the end user side:
> Virt-manager is the favorite toy for SMB/NGO local admins, typically run via XQuartz on a
> Mac or XMing on Windows.

Would that, in your opinion, be more about being a graphical clicky solution or just about being simple in general? If it is only the latter, then lxd has very much simplified the semantics and can manage remote LXDs - yet I fail to see to

> Please let me know, if and when I can be of further help
> [...]

I really think, as outlined by Serge, that the step where you could really help is filing a bug with upstream libvirt and linking it here.
That can be driven separately from the "do we drop lxc support" discussion.

Yet, given that upstream deprecated it, even their interest might diminish :-/

Note, the last I heard it was not deprecated by *upstream*, but by Red Hat.

Eugen Rieck (w-eugen) wrote :

Closed by upstream for insufficient data :-/
Setting incomplete here as well

Changed in libvirt (Ubuntu):
status: Triaged → Incomplete