OpenStack Compute (Nova)

cannot umount guestfs

Reported by Vasyl Khomenko on 2012-06-15
80
This bug affects 12 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Undecided
Pádraig Brady
Essex
High
Pádraig Brady
nova (Ubuntu)
Undecided
Unassigned
Precise
Undecided
Unassigned

Bug Description

I run nova under redhat 6.1
There is a problem with spawning instances:

If i use default libguestfs-1.7.17-17 - i get instances stucked in BUILD state. Proccess hangs on line
 sudo nova-rootwrap guestmount --rw -a /var/lib/nova/instances/instance-0000001f/disk -m /dev/sda1 /tmp/tmp1kv9F
I updated libguestfs to latest libguestfs-1.7.17-26 and instances can be spawned but ssh keys are not injected.
Here is a log:

2012-06-15 11:20:21 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap guestmount --rw -a /var/lib/nova/instances/instance-0000001f/disk -m /dev/sda1 /tmp/tmp1kv9Fb from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:27 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap mkdir -p /tmp/tmp1kv9Fb/root/.ssh from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:27 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap chown root /tmp/tmp1kv9Fb/root/.ssh from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:27 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap chmod 700 /tmp/tmp1kv9Fb/root/.ssh from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:27 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap tee -a /tmp/tmp1kv9Fb/root/.ssh/authorized_keys from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:27 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap cp /tmp/tmp1kv9Fb/etc/passwd /tmp/tmpCShB4Z from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:27 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap cp /tmp/tmp1kv9Fb/etc/shadow /tmp/tmp3KsCTU from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:27 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap cp /tmp/tmpCShB4Z /tmp/tmp1kv9Fb/etc/passwd from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:27 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap cp /tmp/tmp3KsCTU /tmp/tmp1kv9Fb/etc/shadow from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:28 WARNING nova.virt.libvirt.connection [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] [instance: ba13ce83-0573-49eb-bb1b-5557e41913fe] Ignoring error injecting data into image 840d18da-dba7-4ef9-8ec5-73348e232dbc ([Errno 16] Device or resource busy: '/tmp/tmp1kv9Fb')
2012-06-15 11:20:29 DEBUG nova.virt.libvirt.connection [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] [instance: ba13ce83-0573-49eb-bb1b-5557e41913fe] Instance is running from (pid=8594) spawn /usr/lib/python2.6/site-packages/nova/virt/libvirt/connection.py:923
2012-06-15 11:20:29 DEBUG nova.utils [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] Running cmd (subprocess): sudo nova-rootwrap tee /sys/class/net/vnet0/brport/hairpin_mode from (pid=8594) execute /usr/lib/python2.6/site-packages/nova/utils.py:219
2012-06-15 11:20:29 DEBUG nova.compute.manager [req-2b11c94d-d990-4e3e-aea4-4d878c07e005 ae3806d96146403d918895671ecda1a3 0256500b8311479da2fa3e45b80c1fa8] [instance: ba13ce83-0573-49eb-bb1b-5557e41913fe] Checking state from (pid=8594) _get_power_state /usr/lib/python2.6/site-packages/nova/compute/manager.py:260
2012-06-15 11:20:29 INFO nova.virt.libvirt.connection [-] [instance: ba13ce83-0573-49eb-bb1b-5557e41913fe] Instance spawned successfully.

After instance spawned, i see /tmp/tmp1kv9Fb still mounted and it can be unmounted manually.

Fix proposed to branch: master
Review: https://review.openstack.org/8604

Changed in nova:
assignee: nobody → Alessio Ababilov (aababilov)
status: New → In Progress

I don't think the fix is correct.

There is a general problem with Linux which is that you always have
to be prepared to retry the 'fusermount -u' program, since other
daemons (particularly from udev rules) jump into newly mounted
filesystems and keep them open.

In the newer mount-local API which should replace external guestmount
we provide a 'retry' flag to do this automatically.

Here's the retry logic used in umount-local:

https://github.com/libguestfs/libguestfs/blob/60a2f8706b305b4068df020598abee9566022e20/src/fuse.c#L1049

Unfortunately FUSE seems like a good idea, but in reality there are several problems.

Fix proposed to branch: master
Review: https://review.openstack.org/8608

Alessio Ababilov (aababilov) wrote :

Ok, let's repeat fusermount with `attempts` parameter:
https://review.openstack.org/#/c/8608/

Alessio Ababilov (aababilov) wrote :

attempts doesn't help in any case. Sometimes fusermount -u unmounts after the first attempt but the filesystem remains unsynchronized and the instance canot be accessed using RSA key.

However, sync works always. It would be better to leave it.

Pádraig Brady (p-draigbrady) wrote :

OK we've reproduced here I think on rh6.3
Or something very similar at least.

It seems that if you guestmount an image right after a _successful_ `fusermount -u`, the changes from the first mount can be discarded.

This mount, umount, mount, umount pattern can be seen with injected files are specified as they were implemented as a separate mount session to other injected stuff like admin_pass and .ssh/authorized_keys.

Multiple mounts is inefficient at least so we've tested the attached essex patch that merges the writes to a single mounted session. This was seen to avoid the issue and be much more efficient also.

Note we also tried adding a sync instead, but that had no effect.

Vasyl Khomenko (vkhomenko) wrote :

Hi. I applied your patch and it didn't help.. No changes. It still asks for password. Solution with "sync" works fine.
Do you need anything from me? logs?

Everyone seems to be testing different versions and reporting different bugs.

Firstly, libguestfs 1.7.17 & RHEL 6.1 are both obsolete. Please try libguestfs 1.16.19 and RHEL 6.3. There are preview packages available for RHEL 6.2 here:

http://people.redhat.com/~rjones/libguestfs-RHEL-6.3-preview/

or you can use RHEL 6.3 which was pushed to RHN yesterday.

Secondly, I cannot reproduce any of these bugs. I need a standalone script which would demonstrate the problem clearly.

Attached is my script which *does not* reproduce the problem, but could be a good starting point.

Joshua Harlow (harlowja) wrote :

I have also seen this, although its not that easy to reproduce. I think I even saw it using the packages in that REPO for 6.3, but its very transient....

Pádraig Brady (p-draigbrady) wrote :

OK Vasyl, thanks for the info.
Looks like my patch avoids a different so :(

I'm still very wary about doing the sync.
Hopefully as we dig into the related issue we can reproduce here,
we may find the cause of the possible RHEL 6.1 specific issue.

Alessio Ababilov (aababilov) wrote :

We have tested simple "sync" on another hardware with libguestfs 1.16.19 - the error is the same (successful umount but keys and passwords are not really updated).

The fix was to add sleep for 3 seconds before and after 'fusermount -u'. Very stupid, but it works.

Does fusermount -u actually unmount the filesystem? Does guestmount return?

Does the test (comment 9) work for you? Can you modify the test to make it fail?

We are really nowhere near a stage where we have even characterized this bug.

Pádraig Brady (p-draigbrady) wrote :

looked at this here a little today
sleep(3) before/after `fusermount -u` does avoid the issue.

We also noticed strange interactions with the cache manager, and setting this option also avoided our issue:

image_cache_manager_interval = 0

Pádraig Brady (p-draigbrady) wrote :

Actually image_cache_manager_interval = 0 has no effect.
Sorry for the noise.

Reviewed: https://review.openstack.org/8608
Committed: http://github.com/openstack/nova/commit/d28104350e0f3db191d0b6b1222f795a6026a842
Submitter: Jenkins
Branch: master

commit d28104350e0f3db191d0b6b1222f795a6026a842
Author: Alessio Ababilov <email address hidden>
Date: Fri Jun 15 18:33:26 2012 +0300

    do sync before fusermount to avoid busyness

    Fix bug #1013689

    Change-Id: I8ae786e105c08e1b5011a1b2a1dac048a710a1cb

Changed in nova:
status: In Progress → Fix Committed
Joshua Harlow (harlowja) wrote :

Would this also be useful,

Running udevadm settle??

libguestfs already does udevadm settle when it's required.

Step one is to analyze the bug. Why are there NO debug logs at all? Why has zero root cause analysis been done on this bug? ... and a fix [broken] already committed!

Thierry Carrez (ttx) on 2012-07-04
Changed in nova:
milestone: none → folsom-2
status: Fix Committed → Fix Released

It seems likely that we've got to the true reason for this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=835466#c9

Fix proposed to branch: master
Review: https://review.openstack.org/9592

Changed in nova:
assignee: Alessio Ababilov (aababilov) → Pádraig Brady (p-draigbrady)
status: Fix Released → In Progress

Reviewed: https://review.openstack.org/9592
Committed: http://github.com/openstack/nova/commit/3d1fa15d716635c75897bcc49536feae60460a6a
Submitter: Jenkins
Branch: master

commit 3d1fa15d716635c75897bcc49536feae60460a6a
Author: Pádraig Brady <email address hidden>
Date: Tue Jul 10 15:22:39 2012 +0100

    ensure libguestfs has completed before proceeding

    This is a more correct fix than the previous sync(1)
    which probably only changed timings, masking the real issue.

    The FUSE issue is referenced here:
    http://code.google.com/p/s3ql/issues/detail?id=159
    which in turn references this thread on the subject from 2006:
    http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/3903
    And another case from 2008:
    http://permalink.gmane.org/gmane.comp.file-systems.fuse.devel/6502

    Fixes bug: 1013689

    Change-Id: I36fd11d5f01562f65a6b6f07e759ea066490b067

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
status: Fix Committed → In Progress

Reviewed: https://review.openstack.org/9685
Committed: http://github.com/openstack/nova/commit/5eea8879b6f3268dcaba326193bef0ed75470bb7
Submitter: Jenkins
Branch: master

commit 5eea8879b6f3268dcaba326193bef0ed75470bb7
Author: Pádraig Brady <email address hidden>
Date: Wed Jul 11 23:45:28 2012 +0100

    ensure libguestfs mounts are cleaned up

    Make a few attempts to umount the libguestfs
    mounts, in case other processes are accessing
    the mounts (like udev etc.)

    If we still can't umount after 5 attempts
    (with average delay of about 1s between each),
    then initiate a lazy umount so that the mounts
    might be automatically cleaned up at some stage.
    We wait a further 10s after initiating the lazy
    umount, before raising an exception.

    Addresses the original issue in bug 1013689

    Change-Id: Ib5ede9f705c833825a19308c140f99c5bf3a776f

Changed in nova:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/9470
Committed: http://github.com/openstack/nova/commit/0d166cae0113cd28f9253ff12450d47b71fe0a40
Submitter: Jenkins
Branch: master

commit 0d166cae0113cd28f9253ff12450d47b71fe0a40
Author: Pádraig Brady <email address hidden>
Date: Mon Jun 25 13:52:46 2012 +0100

    only mount guest image once when injecting files

    Previously we could incur the costly guest mount operation twice,
    if injecting files with any of net, ssh keys, password, metadata.
    Instead now inject files in the same operation.

    Also cleanup the internal injection functions a little
    by removing the unused 'execute' parameter.

    Partly mitigates the impact of bug 1013689.
    Change-Id: Ib66d327e745593cf1e5897470750d609b2a58980

Reviewed: https://review.openstack.org/9618
Committed: http://github.com/openstack/nova/commit/1c1b858e919248d6e3fb11c1022532ba40e75844
Submitter: Jenkins
Branch: stable/essex

commit 1c1b858e919248d6e3fb11c1022532ba40e75844
Author: Pádraig Brady <email address hidden>
Date: Tue Jul 10 15:22:39 2012 +0100

    ensure libguestfs has completed before proceeding

    This avoids a FUSE issue referenced here:
    http://code.google.com/p/s3ql/issues/detail?id=159
    which in turn references this thread on the subject from 2006:
    http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/3903
    And another case from 2008:
    http://permalink.gmane.org/gmane.comp.file-systems.fuse.devel/6502

    The vast majority of this work was reproducing and classifying
    done by David Naori and Richard W.M. Jones at:
    https://bugzilla.redhat.com/show_bug.cgi?id=835466

    Fixes bug: 1013689
    Change-Id: I36fd11d5f01562f65a6b6f07e759ea066490b067

Reviewed: https://review.openstack.org/10510
Committed: http://github.com/openstack/nova/commit/3b4ac312b4ba7658f6ad4d2da674092c20e05aeb
Submitter: Jenkins
Branch: stable/essex

commit 3b4ac312b4ba7658f6ad4d2da674092c20e05aeb
Author: Pádraig Brady <email address hidden>
Date: Wed Jul 11 23:45:28 2012 +0100

    ensure libguestfs mounts are cleaned up

    Make a few attempts to umount the libguestfs
    mounts, in case other processes are accessing
    the mounts (like udev etc.)

    If we still can't umount after 5 attempts
    (with average delay of about 1s between each),
    then initiate a lazy umount so that the mounts
    might be automatically cleaned up at some stage.
    We wait a further 10s after initiating the lazy
    umount, before raising an exception.

    Addresses the original issue in bug 1013689

    Change-Id: Ib5ede9f705c833825a19308c140f99c5bf3a776f
    (cherry picked from commit 5eea8879b6f3268dcaba326193bef0ed75470bb7)

Dave Walker (davewalker) on 2012-08-24
Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Precise):
status: New → Confirmed

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have be invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please Note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/9470
Stable review: https://review.openstack.org/10510

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Adam Gandelman (gandelman-a) wrote :

Test coverage log.

tags: added: verification-done
Launchpad Janitor (janitor) wrote :
Download full text (5.4 KiB)

This bug was fixed in the package nova - 2012.1.3+stable-20120827-4d2a4afe-0ubuntu1

---------------
nova (2012.1.3+stable-20120827-4d2a4afe-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot, fixes FTBFS in -proposed. (LP: #1041120)
  * Resynchronize with stable/essex (4d2a4afe):
    - [5d63601] Inappropriate exception handling on kvm live/block migration
      (LP: #917615)
    - [ae280ca] Deleted floating ips can cause instance delete to fail
      (LP: #1038266)

nova (2012.1.3+stable-20120824-86fb7362-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot. (LP: #1041120)
  * Dropped, superseded by new snapshot:
    - debian/patches/CVE-2012-3447.patch: [d9577ce]
    - debian/patches/CVE-2012-3371.patch: [25f5bd3]
    - debian/patches/CVE-2012-3360+3361.patch: [b0feaff]
  * Resynchronize with stable/essex (86fb7362):
    - [86fb736] Libvirt driver reports incorrect error when volume-detach fails
      (LP: #1029463)
    - [272b98d] nova delete lxc-instance umounts the wrong rootfs (LP: #971621)
    - [09217ab] Block storage connections are NOT restored on system reboot
      (LP: #1036902)
    - [d9577ce] CVE-2012-3361 not fully addressed (LP: #1031311)
    - [e8ef050] pycrypto is unused and the existing code is potentially insecure
      to use (LP: #1033178)
    - [3b4ac31] cannot umount guestfs (LP: #1013689)
    - [f8255f3] qpid_heartbeat setting in ineffective (LP: #1030430)
    - [413c641] Deallocation of fixed IP occurs before security group refresh
      leading to potential security issue in error / race conditions
      (LP: #1021352)
    - [219c5ca] Race condition in network/deallocate_for_instance() leads to
      security issue (LP: #1021340)
    - [f2bc403] cleanup_file_locks does not remove stale sentinel files
      (LP: #1018586)
    - [4c7d671] Deleting Flavor currently in use by instance creates error
      (LP: #994935)
    - [7e88e39] nova testsuite errors on newer versions of python-boto (e.g.
      2.5.2) (LP: #1027984)
    - [80d3026] NoMoreFloatingIps: Zero floating ips available after repeatedly
      creating and destroying instances over time (LP: #1017418)
    - [4d74631] Launching with source groups under load produces lazy load error
      (LP: #1018721)
    - [08e5128] API 'v1.1/{tenant_id}/os-hosts' does not return a list of hosts
      (LP: #1014925)
    - [801b94a] Restarting nova-compute removes ip packet filters (LP: #1027105)
    - [f6d1f55] instance live migration should create virtual_size disk image
      (LP: #977007)
    - [4b89b4f] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [6e873bc] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [7b215ed] Use default qemu-img cluster size in libvirt connection driver
    - [d3a87a2] Listing flavors with marker set returns 400 (LP: #956096)
    - [cf6a85a] nova-rootwrap hardcodes paths instead of using
      /sbin:/usr/sbin:/usr/bin:/bin (LP: #1013147)
    - [2efc87c] affinity filters don't work if scheduler_hints is None
      (LP: #1007573)
  ...

Read more...

Changed in nova (Ubuntu Precise):
status: Confirmed → Fix Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Thierry Carrez (ttx) on 2012-09-26
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-09-27
Changed in nova:
milestone: folsom-2 → 2012.2
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.