nova delete lxc-instance umounts the wrong rootfs

Bug #971621 reported by Eric Dodemont
This bug affects 1 person
Affects                   Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)  Fix Released  High        Pádraig Brady
  Essex                   Fix Released  High        Pádraig Brady
nova (Ubuntu)             Fix Released  Undecided   Unassigned
  Precise                 Fix Released  Undecided   Unassigned

Bug Description

When I launch more than one LXC instance, and I try to delete one (the first one for example), the wrong rootfs is umounted and disconnected from its nbd device (the last one for example).

E.g.

Before:

name       status   nbd --> rootfs_path                      veth_if --> IP      cgroup
---------------------------------------------------------------------------------------
instance1  ACTIVE   nbd15 --> .../instance-00000001/rootfs   veth0 --> 10.0.0.2  V
instance2  ACTIVE   nbd14 --> .../instance-00000002/rootfs   veth1 --> 10.0.0.3  V
instance3  ACTIVE   nbd13 --> .../instance-00000003/rootfs   veth2 --> 10.0.0.4  V

nova delete instance1

After:

name       status   nbd --> rootfs_path                      veth_if --> IP      cgroup
---------------------------------------------------------------------------------------
instance1  SHUTOFF  nbd15 --> .../instance-00000001/rootfs   X --> X             X
instance2  ACTIVE   nbd14 --> .../instance-00000002/rootfs   veth1 --> 10.0.0.3  V
instance3  SHUTOFF  X --> X                                  veth2 --> 10.0.0.4  X

Specifications:

- Host OS: ubuntu precise beta2
- Guest OS: ubuntu precise beta2 cloud image (from http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-amd64.tar.gz)
- OpenStack version: 2012.1 (essex)
- Virtualization: LXC (LinuX Container)

Eric Dodemont (dodeeric)
description: updated
Eric Dodemont (dodeeric) wrote :

This problem happens with:

a) qcow2 root disk images (nbd block devices) ==> As described above

disk: qcow2 format

instance1 .../instance-00000001/disk --> nbd15 --> .../instance-00000001/rootfs
instance2 .../instance-00000002/disk --> nbd14 --> .../instance-00000002/rootfs
instance3 .../instance-00000003/disk --> nbd13 --> .../instance-00000003/rootfs

but also with:

b) raw root disk images (loop block devices)

nova.conf: use_cow_images=False

disk: raw format

instance1 .../instance-00000001/disk --> loop0 --> .../instance-00000001/rootfs
instance2 .../instance-00000002/disk --> loop1 --> .../instance-00000002/rootfs
instance3 .../instance-00000003/disk --> loop2 --> .../instance-00000003/rootfs

Eric Dodemont (dodeeric) wrote :

Traces:

nova delete i1 (uuid: 9b1... / veth0 / loop0 / instance-0000000a)

Name  IP   veth  loop  rootfs  uuid
--------------------------------------------
i1    4.3  0     0     i-a     9b1...
i2    4.4  1     1     i-b     af2...

NOK: loop1 is umounted/disconnected!!!

AMQP: req-d29...

user_uuid: cad... (dodeeric)
project_uuid: c1a... (lc2)

-----

OK (i-a / instance-0000000a):

2012-04-07 10:26:52 INFO nova.virt.libvirt.connection [req-d29c2ab4-cd57-4121-93c2-27e9d56cf1fb cadc85c18c6a4aee928c19653f48dd45 c1ab5bc8097b48278ef41db02ecf82eb] [instance: 9b1533e4-f973-43ef-a1b5-96ce52ae5db1] Deleting instance files /var/lib/nova/instances/instance-0000000a

NOK (loop1):

2012-04-07 10:26:52 DEBUG nova.utils [req-d29c2ab4-cd57-4121-93c2-27e9d56cf1fb cadc85c18c6a4aee928c19653f48dd45 c1ab5bc8097b48278ef41db02ecf82eb] Running cmd (subprocess): sudo nova-rootwrap umount /dev/loop1 from (pid=31501) execute /usr/lib/python2.7/dist-packages/nova/utils.py:219

2012-04-07 10:26:52 DEBUG nova.utils [req-d29c2ab4-cd57-4121-93c2-27e9d56cf1fb cadc85c18c6a4aee928c19653f48dd45 c1ab5bc8097b48278ef41db02ecf82eb] Running cmd (subprocess): sudo nova-rootwrap losetup --detach /dev/loop1 from (pid=31501) execute /usr/lib/python2.7/dist-packages/nova/utils.py:219

Eric Dodemont (dodeeric) wrote :

To summarize the bug: when you delete an LXC instance, nova always umounts the rootfs of the most recently created LXC instance!

Example:

We have three running LXC instances:

i1 ../instance-00000001/rootfs
i2 ../instance-00000002/rootfs
i3 ../instance-00000003/rootfs

If you run:

$ nova delete i3 ==> will umount ../instance-00000003/rootfs ==> OK

then:

$ nova delete i2 ==> will umount ../instance-00000003/rootfs ==> NOK

then:

$ nova delete i1 ==> will umount ../instance-00000003/rootfs ==> NOK

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Thierry Carrez (ttx)
tags: added: lxc
Changed in nova:
assignee: nobody → Muharem Hrnjadovic (al-maisan)
status: Confirmed → In Progress
Chuck Short (zulcss) wrote :

This should be an issue once the instances have been converted to uuids in nova/virt/libvirt/driver.py

Muharem Hrnjadovic (al-maisan) wrote : Re: [Bug 971621] Re: nova delete lxc-instance umounts the wrong rootfs

On 07/12/2012 03:15 PM, Chuck Short wrote:
> This should be an issue once the instances have been converted to uuids
> in nova/virt/libvirt/driver.py
Thanks for the hint, Chuck! Will look that way.

Changed in nova:
assignee: Muharem Hrnjadovic (al-maisan) → nobody
status: In Progress → Confirmed
Changed in nova:
assignee: nobody → Pádraig Brady (p-draigbrady)
Pádraig Brady (p-draigbrady) wrote :

Just getting to this now.

Hmm, this is quite awkward.
There are 3 cases to consider for correctly cleaning up the LXC mount dirs

1. Multiple LXC instances (the orig bug)
virt/driver.py stores a global object tracking the mount,
and so will always clean the last LXC instance created.

Now you could let the operating system maintain state and just
`umount /the/lxc/instance/mount/dir`.
Unfortunately a simple umount won't suffice because
depending on the mount method we may need to
`fusermount -u` or clean up nbd devices etc.
So we'd need to store some state as to the
type of mount we performed.
Now maybe things could be arranged so that `umount ...` always works,
though that seems invasive and hacky at first glance.

2. Now because these are long lived mounts, nova could be restarted,
requiring state info for the umount to be maintained outside of nova.
We'd have to be careful to create this atomically, i.e. as a transaction
around the mount operation.

3. If the system is restarted, then the mounts would be gone too,
and so the persisted mount information should be ignored.
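
Case 1 can be illustrated with a minimal sketch. This is not Nova's code: the class, the function names, the `/mnt/...` paths and the way devices are passed in are all hypothetical, chosen only to show why a single global object always points at the last instance that was set up.

```python
# Hypothetical sketch of the failure mode; names and paths are
# illustrative assumptions, not Nova's actual code.


class Mount:
    """Records which device backs which rootfs directory."""

    def __init__(self, device, mount_dir):
        self.device = device
        self.mount_dir = mount_dir


_container_mount = None  # one module-level slot shared by ALL instances


def setup_container(device, mount_dir):
    """Remember the mount, overwriting whatever was remembered before."""
    global _container_mount
    _container_mount = Mount(device, mount_dir)


def destroy_container():
    """Return the device that would be released: always the last one set up."""
    return _container_mount.device


setup_container("/dev/nbd15", "/mnt/i1/rootfs")
setup_container("/dev/nbd14", "/mnt/i2/rootfs")
setup_container("/dev/nbd13", "/mnt/i3/rootfs")

# Deleting the first instance should release nbd15, but the single
# global slot now refers to the third instance's device.
print(destroy_container())
```

Because `destroy_container()` takes no argument identifying the instance, nothing in the sketch can select the right device, which matches the behaviour reported above.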

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/10655

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/10655
Committed: http://github.com/openstack/nova/commit/a5184d5dbf67630dac3abb69b1678b60807cfce7
Submitter: Jenkins
Branch: master

commit a5184d5dbf67630dac3abb69b1678b60807cfce7
Author: Pádraig Brady <email address hidden>
Date: Wed Aug 1 14:26:54 2012 +0100

    fix unmounting of LXC containers

    There were two issues here.

    1. There was a global object stored for all instances,
    thus the last mounted instance was always unmounted.

    2. Even if there was only a single LXC instance in use,
    the global object would be lost on restart of Nova.

    Therefore we reset the internal state for the mount object,
    by passing in the mount point to destroy_container(),
    and querying the device in use for that mount point.

    Fixes bug: 971621
    Change-Id: I5442442f00d93f5e8b82f492d62918419db5cd3b
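
The approach described in the commit message (start from the mount point and ask the system which device backs it) might look roughly like the sketch below. It is an assumption-laden illustration, not the merged patch: the function names are hypothetical, the sample paths are made up, the mount table is passed in as text in the `/proc/mounts` layout so the example runs anywhere, and the `qemu-nbd -d` step is an assumed way to disconnect an nbd device. Only `umount` and `losetup --detach` appear in the logs earlier in this report.

```python
# Hypothetical sketch: derive the device from the mount point instead of
# from remembered state. Names are illustrative, not Nova's API.


def device_for_mount_point(mount_dir, mounts_text):
    """Return the device mounted at mount_dir, or None if nothing is mounted.

    mounts_text uses the /proc/mounts layout: device, mount point, type, ...
    """
    target = mount_dir.rstrip("/") or "/"
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == target:
            device = fields[0]  # keep the last match: later mounts shadow earlier ones
    return device


def teardown_commands(mount_dir, mounts_text):
    """List the commands needed to release mount_dir, chosen by device type."""
    device = device_for_mount_point(mount_dir, mounts_text)
    if device is None:
        return []  # nothing mounted there, e.g. after a host reboot
    commands = [["umount", device]]
    if device.startswith("/dev/nbd"):
        commands.append(["qemu-nbd", "-d", device])  # assumed nbd disconnect step
    elif device.startswith("/dev/loop"):
        commands.append(["losetup", "--detach", device])
    return commands


SAMPLE_MOUNTS = """\
/dev/loop0 /mnt/i1/rootfs ext4 rw 0 0
/dev/loop1 /mnt/i2/rootfs ext4 rw 0 0
"""

print(teardown_commands("/mnt/i1/rootfs", SAMPLE_MOUNTS))
```

Since the lookup is keyed on the mount point, it does not depend on any state held in the Nova process, so it would cover both issues listed in the commit message: several instances at once, and a Nova restart between mount and unmount.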

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/essex)

Fix proposed to branch: stable/essex
Review: https://review.openstack.org/10962

Changed in nova:
milestone: none → folsom-3
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/essex)

Reviewed: https://review.openstack.org/10962
Committed: http://github.com/openstack/nova/commit/272b98d718a68f5f714543ed2948d49ffe052ca5
Submitter: Jenkins
Branch: stable/essex

commit 272b98d718a68f5f714543ed2948d49ffe052ca5
Author: Pádraig Brady <email address hidden>
Date: Tue Aug 7 15:17:08 2012 +0100

    fix unmounting of LXC containers

    There were two issues here.

    1. There was a global object stored for all instances,
    thus the last mounted instance was always unmounted.

    2. Even if there was only a single LXC instance in use,
    the global object would be lost on restart of Nova.

    Therefore we reset the internal state for the mount object,
    by passing in the mount point to destroy_container(),
    and querying the device in use for that mount point.

    Fixes bug: 971621
    Change-Id: I5442442f00d93f5e8b82f492d62918419db5cd3b
    Cherry-picked: a5184d5dbf67630dac3abb69b1678b60807cfce7

Dave Walker (davewalker)
Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Precise):
status: New → Confirmed
Adam Gandelman (gandelman-a) wrote : Verification report.

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have been invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/10655
Stable review: https://review.openstack.org/10962

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Adam Gandelman (gandelman-a) wrote :

Test coverage log.

tags: added: verification-done
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.1.3+stable-20120827-4d2a4afe-0ubuntu1

---------------
nova (2012.1.3+stable-20120827-4d2a4afe-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot, fixes FTBFS in -proposed. (LP: #1041120)
  * Resynchronize with stable/essex (4d2a4afe):
    - [5d63601] Inappropriate exception handling on kvm live/block migration
      (LP: #917615)
    - [ae280ca] Deleted floating ips can cause instance delete to fail
      (LP: #1038266)

nova (2012.1.3+stable-20120824-86fb7362-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot. (LP: #1041120)
  * Dropped, superseded by new snapshot:
    - debian/patches/CVE-2012-3447.patch: [d9577ce]
    - debian/patches/CVE-2012-3371.patch: [25f5bd3]
    - debian/patches/CVE-2012-3360+3361.patch: [b0feaff]
  * Resynchronize with stable/essex (86fb7362):
    - [86fb736] Libvirt driver reports incorrect error when volume-detach fails
      (LP: #1029463)
    - [272b98d] nova delete lxc-instance umounts the wrong rootfs (LP: #971621)
    - [09217ab] Block storage connections are NOT restored on system reboot
      (LP: #1036902)
    - [d9577ce] CVE-2012-3361 not fully addressed (LP: #1031311)
    - [e8ef050] pycrypto is unused and the existing code is potentially insecure
      to use (LP: #1033178)
    - [3b4ac31] cannot umount guestfs (LP: #1013689)
    - [f8255f3] qpid_heartbeat setting in ineffective (LP: #1030430)
    - [413c641] Deallocation of fixed IP occurs before security group refresh
      leading to potential security issue in error / race conditions
      (LP: #1021352)
    - [219c5ca] Race condition in network/deallocate_for_instance() leads to
      security issue (LP: #1021340)
    - [f2bc403] cleanup_file_locks does not remove stale sentinel files
      (LP: #1018586)
    - [4c7d671] Deleting Flavor currently in use by instance creates error
      (LP: #994935)
    - [7e88e39] nova testsuite errors on newer versions of python-boto (e.g.
      2.5.2) (LP: #1027984)
    - [80d3026] NoMoreFloatingIps: Zero floating ips available after repeatedly
      creating and destroying instances over time (LP: #1017418)
    - [4d74631] Launching with source groups under load produces lazy load error
      (LP: #1018721)
    - [08e5128] API 'v1.1/{tenant_id}/os-hosts' does not return a list of hosts
      (LP: #1014925)
    - [801b94a] Restarting nova-compute removes ip packet filters (LP: #1027105)
    - [f6d1f55] instance live migration should create virtual_size disk image
      (LP: #977007)
    - [4b89b4f] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [6e873bc] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [7b215ed] Use default qemu-img cluster size in libvirt connection driver
    - [d3a87a2] Listing flavors with marker set returns 400 (LP: #956096)
    - [cf6a85a] nova-rootwrap hardcodes paths instead of using
      /sbin:/usr/sbin:/usr/bin:/bin (LP: #1013147)
    - [2efc87c] affinity filters don't work if scheduler_hints is None
      (LP: #1007573)
  ...


Changed in nova (Ubuntu Precise):
status: Confirmed → Fix Released
Clint Byrum (clint-fewbar) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-3 → 2012.2