Graceful shutdown of nova-compute service fails

Bug #1438183 reported by Roman Podoliaka
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Dan Smith

Bug Description

nova-compute doesn't shut down gracefully on SIGTERM. For example, if SIGTERM arrives while a VM is booting, the boot fails with:

09:29:18 AUDIT nova.compute.manager [req-9cdbba9c-af3b-4845-9deb-c68bffe63d75 None] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] Starting instance...
09:29:18 INFO nova.openstack.common.service [-] Caught SIGTERM, exiting
...
09:29:37 INFO nova.compute.manager [-] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] VM Started (Lifecycle Event)
09:29:37 INFO nova.compute.manager [-] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] VM Paused (Lifecycle Event)
...
09:34:37 WARNING nova.virt.libvirt.driver [req-9cdbba9c-af3b-4845-9deb-c68bffe63d75 None] Timeout waiting for vif plugging callback for instance 7ea3e761-6b85-49db-8dcc-79f6f2286df8
09:34:37 INFO nova.compute.manager [-] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] VM Stopped (Lifecycle Event)
09:34:38 INFO nova.virt.libvirt.driver [req-9cdbba9c-af3b-4845-9deb-c68bffe63d75 None] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] Deleting instance files /var/lib/nova/instances/7ea3e761-6b85-49db-8dcc-79f6f2286df8
09:34:38 ERROR nova.compute.manager [req-9cdbba9c-af3b-4845-9deb-c68bffe63d75 None] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] Instance failed to spawn
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] Traceback (most recent call last):
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1773, in _spawn
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] block_device_info)
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2299, in spawn
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] block_device_info)
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3745, in _create_domain_and_network
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] raise exception.VirtualInterfaceCreateException()
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] VirtualInterfaceCreateException: Virtual Interface creation failed
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8]

Tags: compute
Roman Podoliaka (rpodolyaka) wrote :

OK, so my current understanding of the problem is described in this sequence diagram: http://goo.gl/QAfcKU

Basically, graceful shutdown for RPC servers like nova-compute works as follows: upon receiving SIGTERM, the RPC thread pool is resized to 0, no new RPC requests are accepted, and nova-compute waits until all RPC threads finish. At the same time, nova-compute relies on receiving a notification from nova-api that neutron has finished plugging in VIFs. So nova-compute is stuck waiting for a message it can no longer handle, and it exits after the graceful shutdown timeout (300s), leaving the VM in a paused state. After nova-compute is restarted, all VMs in a half-provisioned state are put into the ERROR state.
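
To make the sequence above concrete, here is a minimal sketch of the hang using plain Python threads. It is not nova code: `simulate_shutdown`, `on_rpc_notification` and the "plugged"/"timed_out" outcomes are hypothetical names chosen for illustration, and the 300s timeout is shortened so the example finishes quickly.

```python
import threading


def simulate_shutdown(wait_timeout=0.2):
    """Toy model: a worker waits for a notification that shutdown has blocked."""
    vif_plugged = threading.Event()
    state = {"accepting_rpc": True, "outcome": None}

    def on_rpc_notification():
        # Inbound messages are dropped once the service stops accepting RPC.
        if not state["accepting_rpc"]:
            return False
        vif_plugged.set()
        return True

    def spawn():
        # The worker blocks until the notification arrives or the wait expires.
        if vif_plugged.wait(wait_timeout):
            state["outcome"] = "plugged"
        else:
            state["outcome"] = "timed_out"

    worker = threading.Thread(target=spawn)
    worker.start()                      # instance boot is already in flight
    state["accepting_rpc"] = False      # SIGTERM: stop accepting RPC requests
    delivered = on_rpc_notification()   # the notification arrives too late
    worker.join()                       # shutdown waits for the worker anyway
    return delivered, state["outcome"]
```

In this model the notification is never delivered, so the worker can only finish by running out its full wait, which mirrors the five-minute gap between "Caught SIGTERM" and the vif plugging timeout in the log.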

Kashyap Chamarthy (kashyapc) wrote :

Roman, is the below a fair reproducer? If not, can you describe how you reproduce it?

I'm running a minimal DevStack environment with Neutron, all from today's git; my setup details are here[*].

(1) Boot a VM:

    $ nova boot --flavor 1 --key_name oskey1 --image cirros-0.3.3-x86_64-disk vm2

(2) Send SIGTERM right after invoking the above:

    $ kill -s SIGTERM `pidof -x nova-compute`

(3) Monitor n-cpu.log w/ debug logs

[. . .]
2015-03-30 13:35:09.073 DEBUG nova.virt.disk.vfs.guestfs [req-2a598f59-4777-4815-85e4-93b7bbcbdd8c None None] Setting up appliance for /home/stack/src/cloud/data/nova/instances/28e48e03-25a4-41db-bde4-330b4709e4d4/disk qcow2 from (pid=4573) setup /home/stack/src/cloud/nova/nova/virt/disk/vfs/guestfs.py:169
2015-03-30 13:35:13.366 INFO nova.openstack.common.service [req-fe2ee29c-be4b-4886-b2b0-731d9a5a97b0 None None] Caught SIGTERM, exiting

In this case, I don't see anything beyond the above, though.

[*] http://kashyapc.com/2015/03/17/minimal-devstack-with-openstack-neutron-networking/

tags: added: compute
Dan Smith (danms)
Changed in nova:
importance: Undecided → Critical
importance: Critical → Medium
status: New → Triaged
milestone: none → kilo-rc1
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/169056

Changed in nova:
assignee: nobody → Dan Smith (danms)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/169057

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/169056
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=274924461f537fd1b59eee7ad7978ed2d66d76a6
Submitter: Jenkins
Branch: master

commit 274924461f537fd1b59eee7ad7978ed2d66d76a6
Author: Dan Smith <email address hidden>
Date: Mon Mar 30 11:57:31 2015 -0700

    Cancel all waiting events during compute node shutdown

    Right now, if there are any threads waiting for an external event when
    we start a graceful shutdown of nova-compute, we will sever our RPC inbound
    connection and wait for those threads to complete. Since those threads will
    not complete until they receive something over RPC, we're waiting for
    something that will not come.

    This patch explicitly cancels those threads by delivering "failed" events
    to them, allowing them to complete their task as if the actual thing had failed.

    Change-Id: I609c3a9e636ead4b41bccaceee0daa2463569148
    Partial-Bug: #1438183
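
The idea in this commit message can be sketched roughly as follows. This is an illustration only, not the code from the patch: `PendingEvents`, `prepare`, `deliver`, `cancel_all`, `wait_for` and the event name used in the usage note are all hypothetical.

```python
import threading


class PendingEvents:
    """Hypothetical registry of events that worker threads are blocked on."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = {}  # event name -> waiter dict

    def prepare(self, name):
        """Register interest in an event and return the object to wait on."""
        waiter = {"ready": threading.Event(), "status": None}
        with self._lock:
            self._waiters[name] = waiter
        return waiter

    def deliver(self, name, status):
        """Wake the thread waiting on `name`; return False if nobody waits."""
        with self._lock:
            waiter = self._waiters.pop(name, None)
        if waiter is None:
            return False
        waiter["status"] = status
        waiter["ready"].set()
        return True

    def cancel_all(self):
        """On shutdown, deliver a "failed" status to every pending waiter."""
        with self._lock:
            names = list(self._waiters)
        for name in names:
            self.deliver(name, "failed")
        return names


def wait_for(waiter, timeout):
    """Block like a worker would; report "timeout" if nothing is delivered."""
    if not waiter["ready"].wait(timeout):
        return "timeout"
    return waiter["status"]
```

A thread blocked in `wait_for(events.prepare("vif-plugged"), 300)` returns "failed" as soon as `cancel_all()` runs, so it can follow its normal failure path instead of sitting out the full timeout.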

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/169057
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e030ba4eb029877e7e4f61c5a1006346b60e24f3
Submitter: Jenkins
Branch: master

commit e030ba4eb029877e7e4f61c5a1006346b60e24f3
Author: Dan Smith <email address hidden>
Date: Mon Mar 30 12:37:21 2015 -0700

    Prevent scheduling new external events when compute is shutdown

    During graceful shutdown, we should make sure that any threads that
    we are waiting for completion don't try to schedule new events after
    we have canceled all the inflight ones.

    This patch should ensure that:

    1. We don't wait on those
    2. We fast-fail with typical error behavior
    3. We still run the code they're expecting to run

    Change-Id: I5ac47935a9ef841000f0a8954d78a4a98644844e
    Closes-Bug: #1438183
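
One way to picture the three points in this commit message is a registry that, once shut down, hands out waiters that have already failed. This is a sketch under assumptions, not the patch itself: `EventRegistry`, `prepare` and `shutdown` are hypothetical names.

```python
import threading


class EventRegistry:
    """Hypothetical event registry that refuses new waiters after shutdown."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = {}  # replaced by None once shutdown has begun

    def prepare(self, name):
        """Register a waiter; after shutdown, return one that already failed."""
        waiter = {"ready": threading.Event(), "status": None}
        with self._lock:
            if self._waiters is None:
                # Fast-fail: the caller's usual wait returns immediately and
                # its normal error handling still runs.
                waiter["status"] = "failed"
                waiter["ready"].set()
                return waiter
            self._waiters[name] = waiter
        return waiter

    def shutdown(self):
        """Fail all in-flight waiters and stop tracking new ones."""
        with self._lock:
            waiters, self._waiters = self._waiters, None
        for waiter in (waiters or {}).values():
            waiter["status"] = "failed"
            waiter["ready"].set()
```

Because a late caller still receives a waiter object and goes through the same wait-then-check path, nothing is left to wait on during shutdown, yet the cleanup code written for the failure case is exercised as usual.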

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-rc1 → 2015.1.0