_cleanup_failed_start aggressively removes local instance files when handling plug_vif failures

Bug #1660647 reported by Lee Yarwood
Affects                       Status         Importance   Assigned to    Milestone
OpenStack Compute (nova)      Fix Released   High         Lee Yarwood
  Newton                      Fix Released   High         Lee Yarwood

Bug Description

Description
===========
Iab5afdf1b5b8d107ea0e5895c24d50712e7dc7b1 [1] ensured that _cleanup_failed_start is always called if we encounter VIF plugging failures in _create_domain_and_network. However, this currently leads to any local instance files being removed, as cleanup is called with destroy_disks=True.

As such, any failure when resuming or restarting an instance will lead to these files being removed and the instance being left in an unbootable state. IMHO these files should only be removed when cleaning up after errors hit while initially spawning an instance.
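
To make the failure mode concrete, the following is a minimal, self-contained sketch of the pre-fix call pattern. It is illustrative only and not the actual nova code: the class and the stubbed plug_vifs/cleanup bodies are stand-ins for the LibvirtDriver behaviour described above.

# Illustrative sketch only -- not the real LibvirtDriver code. It shows
# why a single destroy_disks=True in the error path wipes the instance
# directory even when the caller was a reboot/resume rather than a spawn.
import os
import shutil


class VirtualInterfaceCreateException(Exception):
    """Stand-in for nova.exception.VirtualInterfaceCreateException."""


class SketchDriver(object):
    def __init__(self, instances_path):
        self.instances_path = instances_path

    def plug_vifs(self, instance_uuid):
        # Stand-in for real VIF plugging; pretend the plugging event timed out.
        raise VirtualInterfaceCreateException()

    def cleanup(self, instance_uuid, destroy_disks=False):
        if destroy_disks:
            # The destructive step: the whole per-instance directory
            # (disk, console.log, kernel, ramdisk) is removed.
            shutil.rmtree(os.path.join(self.instances_path, instance_uuid),
                          ignore_errors=True)

    def _cleanup_failed_start(self, instance_uuid):
        # Pre-fix behaviour: always destroy disks, regardless of caller.
        self.cleanup(instance_uuid, destroy_disks=True)

    def _create_domain_and_network(self, instance_uuid):
        try:
            self.plug_vifs(instance_uuid)
            # ... create and launch the libvirt domain ...
        except Exception:
            self._cleanup_failed_start(instance_uuid)
            raise

Because resume, hard_reboot and power_on all reach _create_domain_and_network, a single plug_vifs failure in any of those paths deletes files that the existing instance still needs.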

Steps to reproduce
==================
- Boot an instance using local disks
- Stop the instance
- Start the instance, causing a timeout or other failure during plug_vifs
- Attempt to start the instance again

Expected result
===============
The local instance files are left on the host when the failure occurs while an instance is rebooting or resuming.

Actual result
=============
The local instance files are removed from the host if _cleanup_failed_start is called.

Environment
===========
1. Exact version of OpenStack you are running. See the following
   list for all releases: http://docs.openstack.org/releases/

   $ pwd
   /opt/stack/nova
   $ git rev-parse HEAD
   42222969a21ee28ef4a68bd5ab1ec8a12c4ad126

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

$ nova boot --image cirros-0.3.4-x86_64-uec --flavor 1 test-boot
[..]
$ nova stop test-boot
$ ll ../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/
total 18596
-rw-rw-r--. 1 root root 16699 Jan 31 09:30 console.log
-rw-r--r--. 1 root root 10289152 Jan 31 09:30 disk
-rw-r--r--. 1 stack libvirtd 257 Jan 31 09:29 disk.info
-rw-rw-r--. 1 qemu qemu 4979632 Jan 31 09:29 kernel
-rw-rw-r--. 1 qemu qemu 3740163 Jan 31 09:29 ramdisk

I used the following change to artificially recreate a failure while plugging the VIFs:

$ git diff
diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 33e3157..248e960 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -5015,6 +5015,7 @@ class LibvirtDriver(driver.ComputeDriver):
         pause = bool(events)
         guest = None
         try:
+            raise exception.VirtualInterfaceCreateException()
             with self.virtapi.wait_for_instance_event(
                     instance, events, deadline=timeout,
                     error_callback=self._neutron_failed_callback):

$ nova start test-boot
Request to start server test-boot has been accepted.
$ nova list
+--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
| ID                                   | Name      | Status  | Task State | Power State | Networks                       |
+--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
| be6cb386-e005-4fb2-8332-7e0c375ee452 | test-boot | SHUTOFF | -          | Shutdown    | public=172.24.4.8, 2001:db8::9 |
+--------------------------------------+-----------+---------+------------+-------------+--------------------------------+
$ ll ../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/
ls: cannot access '../data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/': No such file or directory

Future attempts to start the instance will fail as a result:

$ nova start test-boot
Request to start server test-boot has been accepted.
$ vi ../logs/n-cpu.log
[..]
5353 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server Traceback (most recent call last):
5354 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
5355 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
5356 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
5357 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
5358 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
5359 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server result = func(ctxt, **new_args)
5360 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 75, in wrapped
5361 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server function_name, call_dict, binary)
5362 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
5363 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server self.force_reraise()
5364 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
5365 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
5366 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 66, in wrapped
5367 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server return f(self, context, *args, **kw)
5368 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 188, in decorated_function
5369 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server LOG.warning(msg, e, instance=instance)
5370 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
5371 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server self.force_reraise()
5372 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
5373 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
5374 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 157, in decorated_function
5375 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
5376 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/utils.py", line 685, in decorated_function
5377 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
5378 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 216, in decorated_function
5379 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server kwargs['instance'], e, sys.exc_info())
5380 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
5381 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server self.force_reraise()
5382 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
5383 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
5384 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 204, in decorated_function
5385 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
5386 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 2524, in start_instance
5387 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server self._power_on(context, instance)
5388 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 2494, in _power_on
5389 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server block_device_info)
5390 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2494, in power_on
5391 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server self._hard_reboot(context, instance, network_info, block_device_info)
5392 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2373, in _hard_reboot
5393 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server block_device_info)
5394 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 6931, in _get_instance_disk_info
5395 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server dk_size = int(os.path.getsize(path))
5396 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server File "/usr/lib64/python2.7/genericpath.py", line 57, in getsize
5397 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server return os.stat(filename).st_size
5398 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server OSError: [Errno 2] No such file or directory: '/opt/stack/data/nova/instances/be6cb386-e005-4fb2-8332-7e0c375ee452/disk'
5399 2017-01-31 09:35:59.117 TRACE oslo_messaging.rpc.server

Tags: libvirt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/427267

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress
Matt Riedemann (mriedem)
tags: added: ocata-rc-potential
Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/427267
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=67aa277b4ef623c9877b97bfd7952f0bb1d80a81
Submitter: Jenkins
Branch: master

commit 67aa277b4ef623c9877b97bfd7952f0bb1d80a81
Author: Lee Yarwood <email address hidden>
Date: Tue Jan 31 15:26:38 2017 +0000

    libvirt: Limit destroying disks during cleanup to spawn

    Iab5afdf1b5b now ensures that cleanup is always called when VIF plugging
    errors are encountered by _create_domain_and_network. At present cleanup
    is always called with destroy_disks=True leading to any local instance
    files being removed from the host.

    _create_domain_and_network itself has various callers such as resume and
    hard_reboot that assume these files will persist any such failures. As a
    result the removal of these files will leave instances in an unbootable
    state.

    In order to correct this an additional destroy_disks_on_failures kwarg
    is now provided to _create_domain_and_network and passed down into
    cleanup. This kwarg defaults to False and is only enabled when
    _create_domain_and_network is used to spawn a new instance.

    Closes-bug: #1660647
    Change-Id: I38c969690fedb71c5b5ec4418c1b0dd53df733ec
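
As a rough illustration of the approach described in the commit message above (simplified, not the actual patch; the kwarg name follows the commit message and the stubbed bodies are placeholders):

# Sketch of the fix's shape, not the actual nova patch: the new kwarg
# defaults to False so reboot/resume callers keep their local files,
# and only the spawn path opts in to destroying disks on failure.


class SketchDriver(object):
    def plug_vifs(self, instance_uuid):
        pass  # stand-in for real VIF plugging

    def cleanup(self, instance_uuid, destroy_disks=False):
        pass  # stand-in; removes the instance directory when asked to

    def _cleanup_failed_start(self, instance_uuid, destroy_disks):
        self.cleanup(instance_uuid, destroy_disks=destroy_disks)

    def _create_domain_and_network(self, instance_uuid,
                                   destroy_disks_on_failures=False):
        try:
            self.plug_vifs(instance_uuid)
            # ... create and launch the libvirt domain ...
        except Exception:
            self._cleanup_failed_start(
                instance_uuid, destroy_disks=destroy_disks_on_failures)
            raise

    def spawn(self, instance_uuid):
        # Brand new instance: nothing on disk worth preserving yet.
        self._create_domain_and_network(
            instance_uuid, destroy_disks_on_failures=True)

    def _hard_reboot(self, instance_uuid):
        # Existing instance: the default (False) keeps the disks intact.
        self._create_domain_and_network(instance_uuid)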

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/428671

Matt Riedemann (mriedem)
tags: added: libvirt
removed: ocata-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0rc1

This issue was fixed in the openstack/nova 15.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/428671
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9f8d15dcc1b4db9b1116d19b6af0f6cac15ff9fe
Submitter: Jenkins
Branch: stable/newton

commit 9f8d15dcc1b4db9b1116d19b6af0f6cac15ff9fe
Author: Lee Yarwood <email address hidden>
Date: Tue Jan 31 15:26:38 2017 +0000

    libvirt: Limit destroying disks during cleanup to spawn

    Iab5afdf1b5b now ensures that cleanup is always called when VIF plugging
    errors are encountered by _create_domain_and_network. At present cleanup
    is always called with destroy_disks=True leading to any local instance
    files being removed from the host.

    _create_domain_and_network itself has various callers such as resume and
    hard_reboot that assume these files will persist any such failures. As a
    result the removal of these files will leave instances in an unbootable
    state.

    In order to correct this an additional destroy_disks_on_failures kwarg
    is now provided to _create_domain_and_network and passed down into
    cleanup. This kwarg defaults to False and is only enabled when
    _create_domain_and_network is used to spawn a new instance.

    Closes-bug: #1660647
    Change-Id: I38c969690fedb71c5b5ec4418c1b0dd53df733ec
    (cherry picked from commit 67aa277b4ef623c9877b97bfd7952f0bb1d80a81)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.4

This issue was fixed in the openstack/nova 14.0.4 release.
