Status does not update to "Shutoff" when instance shuts down itself

Bug #1753550 reported by Raoul Hidalgo Charman on 2018-03-05
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Matt Riedemann
Queens
Medium
Matt Riedemann

Bug Description

For instances that shut down themselves, the status remains Active, even though the power state updates to Shutdown. If however the shutdown signal is sent via openstack it does change the status to "SHUTOFF".

After creating an instance, the states are:
+------------------------+---------+
| Field | Value |
+------------------------+---------+
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| status | ACTIVE |
+------------------------+---------+

Using openstack server stop results in:
+------------------------+----------+
| Field | Value |
+------------------------+----------+
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | stopped |
| status | SHUTOFF |
+------------------------+----------+

However logging into the instance and using the poweroff command results in:
+------------------------+----------+
| Field | Value |
+------------------------+----------+
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| status | ACTIVE |
+------------------------+----------+

This results in being unable to use the openstack server start command on it fails and returns:

# openstack server start test_shutdown
Cannot 'start' instance 8881bebb-efbd-45e6-a052-7d23b9b63222 while it is in vm_state active (HTTP 409) (Request-ID: req-e473415d-d6e0-4342-b1f6-267efa934dc0)

despite the virtual machine being powered off. You can work around this by running openstack server stop, and then openstack server start.

This is also an issue for external applications that check the status (and how I noticed it to begin with).

This on devstack with commit id: 5d2add74534719c5670b29152964a60e8f23b42b

Not sure if the following is useful, but
- Using hypervisor libvirt+qemu/kvm:
# virsh version
Compiled against library: libvirt 3.2.0
Using library: libvirt 3.2.0
Using API: QEMU 3.2.0
Running hypervisor: QEMU 2.9.0
- Using ephemeral storage with ex4
- Neutron with OpenVSwitch

melanie witt (melwitt) on 2018-03-20
tags: added: compute
melanie witt (melwitt) wrote :

There is a periodic task in nova-compute controlled by the config option [DEFAULT]sync_power_state_interval which defaults to 600 seconds (10 minutes). The periodic task will update the power_state in the database to reflect the power state returned by the virt driver. That's the power state update that you see.

Now, looking at that same periodic task, it's also supposed to be calling the compute "stop" API on the instance if the power state from the virt driver is found to be "shutdown". So once that occurs, you should see the instance with a vm_state of "stopped" and a status of SHUTOFF. Do you ever see that after the [DEFAULT]sync_power_state_interval of 10 minutes?

There should be a DEBUG log message in nova-compute like this when it finds a powered off instance:

  "Instance shutdown by itself. Calling the stop API"

and

  "Stopping instance; current vm_state: ..."

Do you see those log messages or do you find any ERROR log messages in nova-compute around them?

Changed in nova:
status: New → Incomplete
melanie witt (melwitt) wrote :

Here's where the code should be updating the instance power state to "shutdown" during the periodic task:

  https://github.com/openstack/nova/blob/658d468/nova/compute/manager.py#L7127

Here's where it should be calling the "stop" API:

  https://github.com/openstack/nova/blob/658d468/nova/compute/manager.py#L7153

and here's where it should be updated to vm_state STOPPED upon the "stop" API being called:

 https://github.com/openstack/nova/blob/658d468/nova/compute/manager.py#L2606

Raoul Hidalgo Charman (raoulhc) wrote :
Download full text (3.6 KiB)

I do get both those messages followed by an exception error message:

EndpointNotFound: Could not find requested endpoint for any of the following interfaces: ['internal', 'public']

With this traceback from oslo.messaging.server:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/opt/stack/nova/nova/exception_wrapper.py", line 76, in wrapped
    function_name, call_dict, binary)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/exception_wrapper.py", line 67, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/nova/nova/compute/manager.py", line 186, in decorated_function
    "Error: %s", e, instance=instance)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 156, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/utils.py", line 985, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 214, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 202, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 2616, in stop_instance
    do_stop_instance()
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
    return f(*args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 2598, in do_stop_instance
    "power_off.start")
  File "/opt/stack/nova/nova/compute/manager.py", line 1647, in _notify_about_instance_usage
    extra_usage_info=extra_usage_info, fault=fault)
  File "/opt/stack/nova/nova/compute/utils.py", line 325, in notify_about_instance_usage
    network_info, system_metadata, **extra_usage_info)
  File "/opt/stack/nova/nova/notifications/base.py", line 398, in info_from_instance
    context)
  File "/opt/stack/nova/nova/image/api.py", line 65, in generate_image_url
    return "%s/images/%s" % (next(glance.get_api_servers(con...

Read more...

Raoul Hidalgo Charman (raoulhc) wrote :

Also no it doesn't get updated eventually, I've left a server for a while and it remains like that. Having a look through the log files it gets the same error and traceback every time it's called.

melanie witt (melwitt) wrote :

Thank you for adding that trace -- it shows why the intended behavior to set the instance to STOPPED fails:

  "File "/opt/stack/nova/nova/compute/manager.py", line 2598, in do_stop_instance
    "power_off.start")"

It's failing to send the notification for beginning the "stop" routine, so it's not reaching any of the code that actually stops the instance and sets the status to STOPPED.

I'm not familiar with the error:

  "EndpointNotFound: Could not find requested endpoint for any of the following interfaces: ['internal', 'public']"

but I'll find out and update this.

Matt Riedemann (mriedem) wrote :

What's your [glance] config in nova.conf?

See the 'valid_interfaces' option here:

https://docs.openstack.org/nova/latest/configuration/config.html#glance

Matt Riedemann (mriedem) wrote :

Also, see https://github.com/openstack/nova/commit/62ef6cfcf01d84813f71d1e8252b86c170ee39f0 - do you have [service_user] configured in nova.conf? Do you have that fix? ^

melanie witt (melwitt) wrote :

Matt, in comment 0 they mentioned they're using devstack on commit id: 5d2add74534719c5670b29152964a60e8f23b42b so they should have the fix you mentioned.

Raoul, are you running glance in your devstack environment where you've reproduced this problem?

We looked deeper into the code that is failing and it's a call to keystone looking for the glance service in the service catalog during the attempt to send a notification (it falls back on this behavior if [glance]api_servers is not set in nova.conf).

Raoul Hidalgo Charman (raoulhc) wrote :

I've not a glance config section so it's all default options.
I've got that commit, and don't have service_user configured.

And yep, it's devstack all in one server set up.

melanie witt (melwitt) wrote :

Okay, I was able to reproduce the problem locally with devstack by:

  1. Create an instance
  2. virsh shutdown <domain>
  3. Wait for nova-compute to try to "stop" the instance and it fails to do that

I found that the "EndpointNotFound: Could not find requested endpoint for any of the following interfaces: ['internal', 'public']" error originates from a "exceptions.EmptyCatalog('The service catalog is empty.')" raised from keystoneauth1/access/service_catalog.py(362)endpoint_data_for(). That originates from the fact that the RequestContext.service_catalog is an empty list [] when we make the call.

The "sync power states" periodic task is the routine responsible for updating the instance power state and status to reflect the state reported by the virt driver, and our periodic tasks all run using an anonymous RequestContext we get by calling nova.context.get_admin_context(). This type of context is not going to have service_catalog populated, so we will fail if we try to do anything with the service_catalog on such a RequestContext.

Currently, it looks like the first thing we need to do is determine whether we really need to be trying to do anything with image urls or glance or the service catalog when we're sending notifications. gibi said he will start investigating that tomorrow.

For the moment, other than removing the dependency on glance image urls in notifications, it's not clear how we can fix the periodic task.

Matt Riedemann (mriedem) on 2018-03-20
Changed in nova:
status: Incomplete → Triaged
importance: Undecided → Medium
melanie witt (melwitt) wrote :

Setting this to High because any periodic task that will result in notifications being sent, will hit this problem if CONF.glance.api_servers is not set.

Changed in nova:
importance: Medium → High
status: Triaged → Confirmed
melanie witt (melwitt) wrote :

Midair collision. I was thinking between High and Medium, so Medium it is.

The current idea is to handle EndpointNotFound in the notifications code and set the image_ref_url to the image id.

Changed in nova:
importance: High → Medium
status: Confirmed → Triaged
Matt Riedemann (mriedem) wrote :

The problem is that the periodic doesn't have a token, and [glance]/api_servers isn't configured, so when the notification code is trying to build a payload, it's trying to get the image API endpoint URL to build a link to the image for the notification. Getting the endpoint URL without a token fails.

This is the code that matters:

https://github.com/openstack/nova/blob/7833ada4fd8e42a773cb6849449708fcb20cbc0c/nova/notifications/base.py#L397

https://github.com/openstack/nova/blob/7833ada4fd8e42a773cb6849449708fcb20cbc0c/nova/image/api.py#L65

https://github.com/openstack/nova/blob/7833ada4fd8e42a773cb6849449708fcb20cbc0c/nova/image/glance.py#L120

If [glance]/api_servers is configured, then we don't go through keystoneauth1 to get the service catalog from keystone.

So options seem to be:

1. Add the ability to specify an admin or service user credentials in nova.conf for the [glance] section which we can use during periodics. This is a bit heavy weight when we've never needed it before.

2. Handle the EndpointNotFound error when building the notification payload and just set image_ref_url to the image id. Yes this is technically a change in the payload, but it's better than breaking everything. Long-term, the versioned notification payload could drop image_ref_url and just pass the image id, which is probably more useful anyway since the consumer might not be able to even reach the URL that's in the payload. We can't change the legacy notification payload, but we can eventually deprecate that in favor of the versioned notifications.

Fix proposed to branch: master
Review: https://review.openstack.org/554703

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/554703
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6482165bb1f44f5c98d9361153d737c22c92112d
Submitter: Zuul
Branch: master

commit 6482165bb1f44f5c98d9361153d737c22c92112d
Author: Matt Riedemann <email address hidden>
Date: Tue Mar 20 17:15:42 2018 -0400

    Handle EndpointNotFound when building image_ref_url in notifications

    Change I4e755b9c66ec8bc3af0393e81cffd91c56064717 made the
    [glance]/api_servers option optional. If not set, we attempt
    to get the image service endpoint via keystoneauth adapter and
    the service catalog via the request context.

    Periodic tasks run without an actual token so there is no way
    to get the service catalog and the KSA adapter code to get the
    endpoint raises EndpointNotFound when trying to build the
    "image_ref_url" entry in the legacy instance notification payload.

    This change simply handles the EndpointNotFound and sets the
    image_ref_url to the instance.image_ref, which for non-volume-backed
    instances is the image id (for volume-backed instances it's an empty
    string).

    This doesn't affect versioned notifications since those don't use the
    "image_ref_url" entry in the payload that is created, they just have
    an "image_uuid" entry in the versioned notification payload which is
    populated via instance.image_ref.

    An upgrade impact release note is added in the case that some consuming
    service is actually relying on that legacy notification field being
    a URL and parsing it as such. The thinking here, however, is this is
    better than not sending the field at all, or setting it to None.

    Long-term this code all gets replaced with versioned notifications.

    Change-Id: Ie23a9c922776b028da0720c939846cba233ac472
    Closes-Bug: #1753550

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/554963
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e55a5f412bf6bc343eb8bb11348afe57c2012066
Submitter: Zuul
Branch: stable/queens

commit e55a5f412bf6bc343eb8bb11348afe57c2012066
Author: Matt Riedemann <email address hidden>
Date: Tue Mar 20 17:15:42 2018 -0400

    Handle EndpointNotFound when building image_ref_url in notifications

    Change I4e755b9c66ec8bc3af0393e81cffd91c56064717 made the
    [glance]/api_servers option optional. If not set, we attempt
    to get the image service endpoint via keystoneauth adapter and
    the service catalog via the request context.

    Periodic tasks run without an actual token so there is no way
    to get the service catalog and the KSA adapter code to get the
    endpoint raises EndpointNotFound when trying to build the
    "image_ref_url" entry in the legacy instance notification payload.

    This change simply handles the EndpointNotFound and sets the
    image_ref_url to the instance.image_ref, which for non-volume-backed
    instances is the image id (for volume-backed instances it's an empty
    string).

    This doesn't affect versioned notifications since those don't use the
    "image_ref_url" entry in the payload that is created, they just have
    an "image_uuid" entry in the versioned notification payload which is
    populated via instance.image_ref.

    An upgrade impact release note is added in the case that some consuming
    service is actually relying on that legacy notification field being
    a URL and parsing it as such. The thinking here, however, is this is
    better than not sending the field at all, or setting it to None.

    Long-term this code all gets replaced with versioned notifications.

    Change-Id: Ie23a9c922776b028da0720c939846cba233ac472
    Closes-Bug: #1753550
    (cherry picked from commit 6482165bb1f44f5c98d9361153d737c22c92112d)

This issue was fixed in the openstack/nova 17.0.2 release.

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers