Failed to spawn lxc instance due to nbd mount in use or rootfs busy

Bug #1491563 reported by Matt Riedemann
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

I have a devstack change up that runs tempest with nova + the lxc backend, and some instances fail on boot because the nbd device is in use or the rootfs mount is busy:

https://review.openstack.org/#/c/219448/

http://logs.openstack.org/48/219448/1/check/gate-tempest-dsvm-full/bc0fc2a/logs/screen-n-cpu.txt.gz?level=TRACE

It looks like we're leaking nbd devices and failing to tear down instances properly, e.g.:

2015-09-01 22:10:50.141 ERROR nova.virt.disk.api [req-7e5384a8-34f3-4e3a-98cd-b4bb544804a7 tempest-ServersAdminTestJSON-276960936 tempest-ServersAdminTestJSON-718419646] Failed to mount container filesystem '<nova.virt.disk.api._DiskImage object at 0x7f94ec58cd90>' on '/opt/stack/data/nova/instances/a9317161-ad31-4948-9c2d-3d4899718a78/rootfs':
--
Failed to mount filesystem: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf mount /dev/nbd4 /opt/stack/data/nova/instances/a9317161-ad31-4948-9c2d-3d4899718a78/rootfs
Exit code: 32
Stdout: u''
Stderr: u'mount: /dev/nbd4 already mounted or /opt/stack/data/nova/instances/a9317161-ad31-4948-9c2d-3d4899718a78/rootfs busy\n'
2015-09-01 22:10:50.142 ERROR nova.compute.manager [req-7e5384a8-34f3-4e3a-98cd-b4bb544804a7 tempest-ServersAdminTestJSON-276960936 tempest-ServersAdminTestJSON-718419646] [instance: a9317161-ad31-4948-9c2d-3d4899718a78] Instance failed to spawn
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] Traceback (most recent call last):
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] File "/opt/stack/new/nova/nova/compute/manager.py", line 2138, in _build_resources
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] yield resources
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] File "/opt/stack/new/nova/nova/compute/manager.py", line 2008, in _build_and_run_instance
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] block_device_info=block_device_info)
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2449, in spawn
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] block_device_info=block_device_info)
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4521, in _create_domain_and_network
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] block_device_info, disk_info):
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] return self.gen.next()
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4430, in _lxc_disk_handler
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] block_device_info, disk_info)
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4382, in _create_domain_setup_lxc
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] container_dir=container_dir)
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] File "/opt/stack/new/nova/nova/virt/disk/api.py", line 452, in setup_container
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] raise exception.NovaException(img.errors)
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] NovaException:
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] --
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] Failed to mount filesystem: Unexpected error while running command.
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] Command: sudo nova-rootwrap /etc/nova/rootwrap.conf mount /dev/nbd4 /opt/stack/data/nova/instances/a9317161-ad31-4948-9c2d-3d4899718a78/rootfs
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] Exit code: 32
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] Stdout: u''
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78] Stderr: u'mount: /dev/nbd4 already mounted or /opt/stack/data/nova/instances/a9317161-ad31-4948-9c2d-3d4899718a78/rootfs busy\n'
2015-09-01 22:10:50.142 4548 ERROR nova.compute.manager [instance: a9317161-ad31-4948-9c2d-3d4899718a78]

2015-09-01 22:10:50.025 DEBUG nova.virt.disk.mount.api [req-7e5384a8-34f3-4e3a-98cd-b4bb544804a7 tempest-ServersAdminTestJSON-276960936 tempest-ServersAdminTestJSON-718419646] Failed to mount filesystem: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf mount /dev/nbd4 /opt/stack/data/nova/instances/a9317161-ad31-4948-9c2d-3d4899718a78/rootfs
Exit code: 32
Stdout: u''
Stderr: u'mount: /dev/nbd4 already mounted or /opt/stack/data/nova/instances/a9317161-ad31-4948-9c2d-3d4899718a78/rootfs busy\n' mnt_dev /opt/stack/new/nova/nova/virt/disk/mount/api.py:237

etc
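
For anyone re-triaging this, a quick way to confirm the leak hypothesis is to list which nbd devices the kernel still considers connected. A minimal diagnostic sketch (not Nova code), assuming the nbd driver exposes /sys/block/nbdN/pid while a device is connected, which is the same check virt/disk/mount/nbd.py relies on to find free devices:

import glob

def connected_nbd_devices():
    """Return (device, pid) pairs for nbd devices the kernel thinks are in use."""
    busy = []
    for pid_path in glob.glob('/sys/block/nbd*/pid'):
        device = pid_path.split('/')[3]  # '/sys/block/nbd4/pid' -> 'nbd4'
        with open(pid_path) as f:
            busy.append((device, f.read().strip()))
    return busy

if __name__ == '__main__':
    for device, pid in connected_nbd_devices():
        print('/dev/%s is held by pid %s' % (device, pid))

Running that between tempest runs should show whether devices like /dev/nbd4 stay connected after their instances are deleted.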

Tags: libvirt lxc nbd
Matt Riedemann (mriedem)
tags: added: libvirt lxc nbd
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/219859

Revision history for this message
Matt Riedemann (mriedem) wrote :

Some notes from a failed delete in

http://logs.openstack.org/48/219448/1/check/gate-tempest-dsvm-full/bc0fc2a/logs/screen-n-cpu.txt.gz

Mounted /dev/nbd13 here:

2015-09-01 22:10:36.697 DEBUG oslo_concurrency.processutils [req-5a3718b5-5042-4bd3-b82f-6868097f6f47 tempest-AggregatesAdminTestJSON-1589136606 tempest-AggregatesAdminTestJSON-1584222786] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf mount /dev/nbd13 /opt/stack/data/nova/instances/87a8fc91-5d8b-474c-aedf-36bf8300282e/rootfs" returned: 0 in 0.066s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:260

Unmounted /dev/nbd13 here (should be from clean_lxc_namespace):

2015-09-01 22:10:42.569 DEBUG oslo_concurrency.processutils [req-5a3718b5-5042-4bd3-b82f-6868097f6f47 tempest-AggregatesAdminTestJSON-1589136606 tempest-AggregatesAdminTestJSON-1584222786] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf umount /dev/nbd13" returned: 0 in 0.555s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:260

Released the /dev/nbd13 device here (should be from _teardown_container):

2015-09-01 22:10:49.043 DEBUG oslo_concurrency.processutils [req-811e7df5-a137-41cb-8c3a-918712f61e46 tempest-AggregatesAdminTestJSON-1589136606 tempest-AggregatesAdminTestJSON-1584222786] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf qemu-nbd -d /dev/nbd13" returned: 0 in 0.103s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:260

Deleting the instance files fails here; is it still mounted?

2015-09-01 22:10:49.449 ERROR nova.virt.libvirt.driver [req-811e7df5-a137-41cb-8c3a-918712f61e46 tempest-AggregatesAdminTestJSON-1589136606 tempest-AggregatesAdminTestJSON-1584222786] [instance: 87a8fc91-5d8b-474c-aedf-36bf8300282e] Failed to cleanup directory /opt/stack/data/nova/instances/87a8fc91-5d8b-474c-aedf-36bf8300282e_del: [Errno 16] Device or resource busy: '/opt/stack/data/nova/instances/87a8fc91-5d8b-474c-aedf-36bf8300282e_del/rootfs'

destroy calls:
 _destroy calls:
  _teardown_container executes:
   qemu-nbd -d
 cleanup calls:
  delete_instance_files:
   This is what fails, saying the rootfs directory is busy.
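
If the EBUSY is just a transient window after teardown, one hedged mitigation sketch (illustrative only, not the Nova implementation; the attempt count and delay are invented values) is to retry the tree removal instead of failing on the first Errno 16:

import errno
import shutil
import time

def delete_instance_dir(path, attempts=5, delay=1.0):
    """Remove an instance directory, retrying while the kernel reports EBUSY."""
    for i in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError as e:
            # Errno 16 matches the 'Device or resource busy' in the log above.
            if e.errno != errno.EBUSY or i == attempts - 1:
                raise
            time.sleep(delay)

A retry only papers over the race, though; if rootfs is genuinely still mounted, rmtree would recurse into the mounted filesystem, so checking for a lingering mountpoint first would be safer.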

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://lists.nongnu.org/archive/html/qemu-devel/2011-12/msg00734.html seems to suggest that qemu-nbd is asynchronous, which could mean that qemu-nbd -d is not done by the time we try to delete the rootfs.
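
If that's right, a possible fix shape is to poll until the kernel actually drops the device after the disconnect. A sketch under that assumption (the timeout and interval are made-up values; the pid-file test mirrors the free-device check in virt/disk/mount/nbd.py):

import os
import time

def wait_for_nbd_release(device, timeout=10.0, interval=0.5):
    """Poll until /dev/<device> no longer looks connected, e.g. after
    'qemu-nbd -d /dev/nbd13' has returned but before the kernel settles."""
    pid_path = '/sys/block/%s/pid' % device
    deadline = time.time() + timeout
    while os.path.exists(pid_path):
        if time.time() >= deadline:
            return False
        time.sleep(interval)
    return True

The teardown path could call something like this right after running qemu-nbd -d and only then proceed to delete the instance files.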

Revision history for this message
Matt Riedemann (mriedem) wrote :

There is also quite a bit of history here:

https://bugs.launchpad.net/nova/+bug/973413

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
Mark Doffman (mjdoffma)
Changed in nova:
assignee: nobody → Mark Doffman (mjdoffma)
Revision history for this message
Mark Doffman (mjdoffma) wrote :

This one is confusing to me. I can't think of the exact problem that would cause the errors shown in this log.

Will add some extra logging and re-run to try and track it down.

The 'Failed to cleanup directory' messages seen when the instance directory is being deleted shouldn't be caused by a failure to shut down the container or remove the nbd device. The rootfs is only temporarily mounted in the instance directory and is unmounted from there once the container is running and the device is mounted in the container namespace (see clean_lxc_namespace in virt/disk/api.py). It's possible, I guess, that there is a race condition here (not accounted for) that means we are trying to delete the instance directory before the container is fully set up.

The '/dev/nbd4 already mounted' error shouldn't be related to a failure to disconnect previous nbd devices, since the code in virt/disk/mount/nbd.py (_find_unused) should only return free nbd devices or error out.
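
For reference, a hedged reconstruction of that free-device scan (the real _find_unused differs in its details, and 16 is just the nbd module's default device count, not necessarily what the gate nodes load):

import os

def find_unused_nbd(max_devices=16):
    """Pick an nbd device with no /sys/block pid file, i.e. not connected."""
    for n in range(max_devices):
        device = 'nbd%d' % n
        if not os.path.exists('/sys/block/%s/pid' % device):
            return '/dev/%s' % device
    raise RuntimeError('no free nbd devices')

Even if the scan itself is correct, there is a window between this check and the qemu-nbd -c that follows it, so two concurrent spawns could in principle both pick the same device.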

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/223125

Revision history for this message
Mark Doffman (mjdoffma) wrote :

Added logging to try to see what is happening in this directory that causes it to be busy.

Logs available at http://logs.openstack.org/48/219448/4/check/gate-tempest-dsvm-full/a3552a1/logs/screen-n-cpu.txt.gz.

Added 'lsof' and 'mountpoint' commands.

Unfortunately / Fortunately, neither shows any cause for the busy error. The directory is not a mountpoint and doesn't seem to have any processes with open files, at least not in the top level of the directory. Need to check whether I have the 'mountpoint' command pointed at the wrong place.

Perhaps 'instance_del/rootfs', rather than 'instance_del/', is the correct directory to check for a lingering mount?
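
That check is easy to script. A sketch of testing the rootfs subdirectory instead (the path is taken from the failed-delete logs above; os.path.ismount tests roughly what the mountpoint command does):

import os

def lingering_rootfs_mount(instance_del_dir):
    """True if the renamed *_del directory still has a mounted rootfs."""
    rootfs = os.path.join(instance_del_dir, 'rootfs')
    return os.path.isdir(rootfs) and os.path.ismount(rootfs)

print(lingering_rootfs_mount(
    '/opt/stack/data/nova/instances/87a8fc91-5d8b-474c-aedf-36bf8300282e_del'))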

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/219859
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b9e997617fd21e260bd532670b3e70a4f66dbf2a
Submitter: Jenkins
Branch: master

commit b9e997617fd21e260bd532670b3e70a4f66dbf2a
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 2 12:46:13 2015 -0700

    libvirt: add debug logging for lxc teardown paths

    There are races to setup and teardown lxc-backed instances using nbd
    mounts when running tempest which result in mount failing.

    This change adds some debug logging to help figure out what's going on
    to see where the race is happening.

    Related-Bug: #1491563

    Change-Id: Ief8edef0df8da1a6ee94aafa9ceaf96da5e3211b

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/223125
Reason: The DNM testing changes I had are abandoned so let's drop this.

Changed in nova:
importance: Undecided → High
assignee: Mark Doffman (mjdoffma) → nobody
Revision history for this message
Jay Pipes (jaypipes) wrote :

@johnthetubaguy: Any reason this is set to High? Seems like an LXC-specific bug, which, IMHO, should make it nothing higher than Medium...

Also, is the bug condition even reproducible any more?

Going to set the bug to Incomplete until we can get a reproduction on a modern Nova.

-jay

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired