Graceful shutdown of nova-compute service fails

Bug #1433605 reported by Peter Zhurba
This bug affects 3 people
Affects             Status        Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Released  High        Roman Podoliaka
  6.0.x             Fix Released  High        Alex Ermolov
  6.1.x             Fix Released  High        Roman Podoliaka
  7.0.x             Won't Fix     Medium      MOS Nova
  8.0.x             Won't Fix     Medium      MOS Nova
  9.x               Won't Fix     Medium      MOS Nova

Bug Description

Upstream bug: https://bugs.launchpad.net/nova/+bug/1438183

When we try to stop the nova-compute service with the TERM (-15) signal while it is busy booting a VM, the service tries to complete the task according to https://blueprints.launchpad.net/nova/+spec/graceful-shutdown , so we expect the VM to boot successfully. But after some time (5 min) nova-compute exits unsuccessfully.

It was reproduced every time with the boot-vm script in the attachment on an Ubuntu environment deployed by Fuel:

  release: "6.0"
  api: "1.0"
  build_number: "58"
  build_id: "2014-12-26_14-25-46"

and on a custom 5.1 release.

initctl stop wasn't used because it kills the service with the KILL (-9) signal after 3 sec.

Revision history for this message
Peter Zhurba (pzhurba) wrote :
Changed in mos:
assignee: nobody → MOS Nova (mos-nova)
tags: added: nova
Changed in mos:
status: New → Confirmed
importance: Undecided → Medium
milestone: none → 6.1
summary: - Graceful shutdown of nova-compute service fails.
+ Graceful shutdown of nova-compute service fails
description: updated
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

We've got the following trace:
09:29:18 AUDIT nova.compute.manager [req-9cdbba9c-af3b-4845-9deb-c68bffe63d75 None] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] Starting instance...
09:29:18 INFO nova.openstack.common.service [-] Caught SIGTERM, exiting
...
09:29:37 INFO nova.compute.manager [-] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] VM Started (Lifecycle Event)
09:29:37 INFO nova.compute.manager [-] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] VM Paused (Lifecycle Event)
...
09:34:37 WARNING nova.virt.libvirt.driver [req-9cdbba9c-af3b-4845-9deb-c68bffe63d75 None] Timeout waiting for vif plugging callback for instance 7ea3e761-6b85-49db-8dcc-79f6f2286df8
09:34:37 INFO nova.compute.manager [-] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] VM Stopped (Lifecycle Event)
09:34:38 INFO nova.virt.libvirt.driver [req-9cdbba9c-af3b-4845-9deb-c68bffe63d75 None] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] Deleting instance files /var/lib/nova/instances/7ea3e761-6b85-49db-8dcc-79f6f2286df8
09:34:38 ERROR nova.compute.manager [req-9cdbba9c-af3b-4845-9deb-c68bffe63d75 None] [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] Instance failed to spawn
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] Traceback (most recent call last):
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1773, in _spawn
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] block_device_info)
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2299, in spawn
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] block_device_info)
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3745, in _create_domain_and_network
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] raise exception.VirtualInterfaceCreateException()
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8] VirtualInterfaceCreateException: Virtual Interface creation failed
09:34:38 TRACE nova.compute.manager [instance: 7ea3e761-6b85-49db-8dcc-79f6f2286df8]

tags: added: customer-found
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ok, so my current understanding of the problem is described in this sequence diagram - http://goo.gl/QAfcKU

Basically, graceful shutdown for RPC servers like nova-compute is implemented as follows: upon receiving SIGTERM, the RPC thread pool is resized to 0, no new RPC requests are accepted, and nova-compute waits until all RPC threads finish. At the same time, nova-compute relies on receiving a notification from nova-api that neutron has finished plugging in VIFs. So nova-compute is stuck waiting for a message it can no longer handle and exits after the graceful shutdown timeout (300s), leaving the VM in a paused state. After nova-compute is restarted, all VMs in a half-provisioned state are put into the ERROR state.
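
To make the sequence above concrete, here is a minimal toy model of it. It is only a sketch under assumptions: ToyRpcServer, boot_vm, simulate and the "vif-plugged" event name are hypothetical stand-ins, not nova or oslo.messaging code, and the timeout is shortened from 300s to a fraction of a second.

```python
import threading


class ToyRpcServer:
    """Hypothetical stand-in for the inbound RPC connection of nova-compute."""

    def __init__(self):
        self.accepting = True
        self._waiters = {}  # event name -> threading.Event

    def prepare_for_event(self, name):
        # Register interest in a notification that will arrive over RPC.
        waiter = threading.Event()
        self._waiters[name] = waiter
        return waiter

    def deliver(self, name):
        # Returns True if the message was handled, False if it was dropped.
        if not self.accepting:
            return False
        self._waiters[name].set()
        return True

    def stop(self):
        # What SIGTERM does in this model: stop handling inbound messages.
        self.accepting = False


def boot_vm(waiter, result, vif_timeout):
    # The VM is started paused; the boot only completes once the
    # notification about plugged VIFs arrives.
    if waiter.wait(vif_timeout):
        result["state"] = "ACTIVE"
    else:
        result["state"] = "ERROR"  # timed out waiting for the callback


def simulate(sigterm_before_callback, vif_timeout=0.2):
    server = ToyRpcServer()
    waiter = server.prepare_for_event("vif-plugged")
    result = {}
    worker = threading.Thread(target=boot_vm, args=(waiter, result, vif_timeout))
    worker.start()
    if sigterm_before_callback:
        server.stop()
    server.deliver("vif-plugged")  # the notification arrives either way
    worker.join()                  # "graceful" wait for the in-flight boot
    return result["state"]
```

In this model simulate(False) finishes the boot, while simulate(True) drops the notification, so the worker can only run into its timeout and fail.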

Revision history for this message
Jay Pipes (jaypipes) wrote :

Yes, Roman is correct about the root cause of this problem. A possible solution to this in Nova/oslo.service might be to do some sort of in-flight conntracking in the RPC service daemon. Basically, at the time of receipt of the SIGTERM, the service would know the in-flight operations and only accept RPC messages with a request ID matching one of the in-flight operations, along with a global timeout value?

Revision history for this message
Jay Pipes (jaypipes) wrote :

BTW, this should be added as a bug in upstream Nova.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Can messaging experts comment on https://bugs.launchpad.net/mos/+bug/1433605/comments/4 ? Looking at http://docs.openstack.org/developer/oslo.messaging/server.html I'm not sure if that's possible

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Dan Smith <email address hidden>
Review: https://review.fuel-infra.org/6226

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Dan Smith <email address hidden>
Review: https://review.fuel-infra.org/6227

Revision history for this message
OSCI Robot (oscirobot) wrote :

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/6226

Revision history for this message
OSCI Robot (oscirobot) wrote :

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/6227

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (openstack-ci/fuel-6.1/2014.2)

Reviewed: https://review.fuel-infra.org/6226
Submitter: Roman Podoliaka <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: 895090bcaf7c78fbe9d0455daafee6a8ffa50db9
Author: Dan Smith <email address hidden>
Date: Mon Apr 27 08:27:52 2015

Cancel all waiting events during compute node shutdown

Right now, if there are any threads waiting for an external event when
we start a graceful shutdown of nova-compute, we will sever our RPC inbound
connection and wait for those threads to complete. Since those threads will
not complete until they receive something over RPC, we're waiting for
something that will not come.

This patch explicitly cancels those threads by delivering "failed" events
to them, allowing them to complete their task as if the actual thing had failed.

Partial-Bug: #1433605

Change-Id: I609c3a9e636ead4b41bccaceee0daa2463569148
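
The idea in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code of the patch: EventWaiter, InstanceEvents, prepare and cancel_all are hypothetical names chosen for the example.

```python
import threading


class EventWaiter:
    """One thread blocks on wait(); another thread calls send() to wake it."""

    def __init__(self):
        self._ready = threading.Event()
        self.status = None

    def send(self, status):
        self.status = status
        self._ready.set()

    def wait(self, timeout=None):
        # Returns the delivered status, or None if the wait timed out.
        self._ready.wait(timeout)
        return self.status


class InstanceEvents:
    """Hypothetical registry of threads waiting for external events."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = {}  # (instance id, event name) -> EventWaiter

    def prepare(self, instance_id, name):
        waiter = EventWaiter()
        with self._lock:
            self._waiters[(instance_id, name)] = waiter
        return waiter

    def cancel_all(self):
        # Called when shutdown starts: nothing will arrive over RPC any
        # more, so wake every waiter with a "failed" event instead of
        # letting it block until its timeout.
        with self._lock:
            waiters, self._waiters = self._waiters, {}
        for waiter in waiters.values():
            waiter.send("failed")
```

A thread blocked in waiter.wait(timeout=300) returns "failed" as soon as cancel_all() runs, so it can take its usual error path right away.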

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Reviewed: https://review.fuel-infra.org/6227
Submitter: Roman Podoliaka <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: 8038494f69592905948686108c684a377acb1c8a
Author: Dan Smith <email address hidden>
Date: Mon Apr 27 08:27:52 2015

Prevent scheduling new external events when compute is shutdown

During graceful shutdown, we should make sure that any threads that
we are waiting for completion don't try to schedule new events after
we have canceled all the inflight ones.

This patch should ensure that:

1. We don't wait on those
2. We fast-fail with typical error behavior
3. We still run the code they're expecting to run

Closes-Bug: #1433605

Change-Id: I5ac47935a9ef841000f0a8954d78a4a98644844e
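
The three guarantees listed in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the patch itself: ShuttingDown, InstanceEvents, wait_for_event and the on_error callback are hypothetical names.

```python
import contextlib
import threading


class ShuttingDown(Exception):
    """Raised when a new event is requested after shutdown has begun."""


class InstanceEvents:
    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = {}  # becomes None once shutdown has begun

    def prepare(self, name):
        with self._lock:
            if self._waiters is None:
                raise ShuttingDown("no new events can be scheduled")
            waiter = threading.Event()
            self._waiters[name] = waiter
            return waiter

    def cancel_all(self):
        # Wake everything still waiting and refuse new registrations.
        # (Passing a "failed" status to the waiters is omitted here.)
        with self._lock:
            waiters, self._waiters = self._waiters, None
        for waiter in (waiters or {}).values():
            waiter.set()


@contextlib.contextmanager
def wait_for_event(events, name, timeout, on_error):
    try:
        waiter = events.prepare(name)
    except ShuttingDown:
        waiter = None
        on_error(name)        # 2. fast-fail through the usual error path
    yield                     # 3. the caller's code still runs
    if waiter is not None:    # 1. never wait on an unscheduled event
        waiter.wait(timeout)
```

After cancel_all(), a with wait_for_event(...) block reports the error through on_error, still executes its body, and returns without blocking.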

Revision history for this message
OSCI Robot (oscirobot) wrote :

Reviewed: https://review.fuel-infra.org/6226
Committed: https://review.fuel-infra.org/gitweb?p=openstack/nova.git;a=commitdiff;h=895090bcaf7c78fbe9d0455daafee6a8ffa50db9
Submitter: Roman Podoliaka
Branch: openstack-ci/fuel-6.1/2014.2

commit 895090bcaf7c78fbe9d0455daafee6a8ffa50db9
Author: Roman Podoliaka <email address hidden>

Cancel all waiting events during compute node shutdown

Right now, if there are any threads waiting for an external event when
we start a graceful shutdown of nova-compute, we will sever our RPC inbound
connection and wait for those threads to complete. Since those threads will
not complete until they receive something over RPC, we're waiting for
something that will not come.

This patch explicitly cancels those threads by delivering "failed" events
to them, allowing them to complete their task as if the actual thing had failed.

Partial-Bug: #1433605

Change-Id: I609c3a9e636ead4b41bccaceee0daa2463569148

Revision history for this message
OSCI Robot (oscirobot) wrote :

Reviewed: https://review.fuel-infra.org/6227
Committed: https://review.fuel-infra.org/gitweb?p=openstack/nova.git;a=commitdiff;h=8038494f69592905948686108c684a377acb1c8a
Submitter: Roman Podoliaka
Branch: openstack-ci/fuel-6.1/2014.2

commit 8038494f69592905948686108c684a377acb1c8a
Author: Roman Podoliaka <email address hidden>

Prevent scheduling new external events when compute is shutdown

During graceful shutdown, we should make sure that any threads that
we are waiting for completion don't try to schedule new events after
we have canceled all the inflight ones.

This patch should ensure that:

1. We don't wait on those
2. We fast-fail with typical error behavior
3. We still run the code they're expecting to run

Closes-Bug: #1433605

Change-Id: I5ac47935a9ef841000f0a8954d78a4a98644844e

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (openstack-ci/fuel-6.0-updates/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.0-updates/2014.2
Change author: Alex Ermolov <email address hidden>
Review: https://review.fuel-infra.org/6551

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (openstack-ci/fuel-6.0-updates/2014.2)

Reviewed: https://review.fuel-infra.org/6551
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-6.0-updates/2014.2

Commit: e11b61beafbb2f89a9ae0bfdb945e69bc15e27da
Author: Alex Ermolov <email address hidden>
Date: Tue May 12 09:46:34 2015

Cancel all waiting events during compute node shutdown

Right now, if there are any threads waiting for an external event when
we start a graceful shutdown of nova-compute, we will sever our RPC inbound
connection and wait for those threads to complete. Since those threads will
not complete until they receive something over RPC, we're waiting for
something that will not come.

This patch explicitly cancels those threads by delivering "failed" events
to them, allowing them to complete their task as if the actual thing had failed.

Partial-Bug: #1433605

Change-Id: I609c3a9e636ead4b41bccaceee0daa2463569148

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (openstack-ci/fuel-6.0-updates/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.0-updates/2014.2
Change author: Alex Ermolov <email address hidden>
Review: https://review.fuel-infra.org/6640

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (openstack-ci/fuel-6.0-updates/2014.2)

Reviewed: https://review.fuel-infra.org/6640
Submitter: Alex Ermolov <email address hidden>
Branch: openstack-ci/fuel-6.0-updates/2014.2

Commit: 272444e205cf5aa3173ce26a9ca0e30df932e210
Author: Alex Ermolov <email address hidden>
Date: Wed May 13 09:19:48 2015

Prevent scheduling new external events when compute is shutdown

During graceful shutdown, we should make sure that any threads that
we are waiting for completion don't try to schedule new events after
we have canceled all the inflight ones.

This patch should ensure that:

1. We don't wait on those
2. We fast-fail with typical error behavior
3. We still run the code they're expecting to run

Closes-Bug: #1433605

Change-Id: I5ac47935a9ef841000f0a8954d78a4a98644844e

Revision history for this message
Alexander Gubanov (ogubanov) wrote :

AFAIK, the community decided (https://review.openstack.org/#/c/169056/) to cancel waiting for external events when a graceful shutdown of nova-compute starts, so this is normal behavior: http://paste.mirantis.net/show/421/
I've verified it on MOS 6.1 (build 395).

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

There was a duplicate filed recently:

https://bugs.launchpad.net/mos/+bug/1545146

I'll respond here.

Sergey, I'm afraid the behavior you see is by design: if you want to restart/shut down nova-compute properly, you'll need to disable it first by means of nova service-disable to prevent scheduling of new VMs, wait until all in-flight boot requests are completed, and only then send a SIGTERM.

Please see the comments on this bug for details; the main problem is that we wait for an event from Neutron, but have already shut down the RPC server.
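
The order of operations described above (disable, drain, then SIGTERM) can be sketched as a small helper. This is only an outline under assumptions: drain_then_stop and its three callables are hypothetical; in a real deployment disable_service would wrap nova service-disable, and how in-flight boots are counted is left to the operator.

```python
import time


def drain_then_stop(disable_service, count_in_flight_boots, send_sigterm,
                    poll_interval=5.0, max_wait=600.0, sleep=time.sleep):
    """Disable scheduling, wait for in-flight boots, then stop the service.

    All three callables are supplied by the caller (hypothetical hooks).
    Returns True if SIGTERM was sent, False if draining took too long.
    """
    disable_service()                    # no new VMs get scheduled here
    waited = 0.0
    while count_in_flight_boots() > 0:   # boots that already started
        if waited >= max_wait:
            return False                 # give up; the service keeps running
        sleep(poll_interval)
        waited += poll_interval
    send_sigterm()                       # now nothing is waiting on RPC
    return True
```

The sleep parameter is injectable only so that the loop can be exercised without real delays.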

tags: added: area-nova
removed: nova
tags: added: docs
Revision history for this message
Sergey Arkhipov (sarkhipov) wrote :

Roman, according to the postinst there is no restart of nova-compute on package upgrade. And there is no mention in the OS docs[1] or MOS docs[2] of disabling nova-compute during an update.

So I have only 2 questions:
1. Is this disabling of nova-compute documented anywhere? Is it possible that customers will start to complain about this behavior?
2. Does updating the nova-compute package restart the service, or does it have to be restarted manually?

Thanks in advance!

[1] http://docs.openstack.org/openstack-ops/content/ops_upgrade-process.html
[2] https://docs.mirantis.com/openstack/fuel/fuel-7.0/maintenance-updates.html#mos70mu-how-to-update
