Cannot run "openstack server list" with instance stuck in scheduling state

Bug #1775934 reported by Jim Rollenhagen
This bug affects 1 person
Affects                    Status        Importance  Assigned to      Milestone
OpenStack Compute (nova)   Fix Released  Low         Matt Riedemann
  Pike                     New           Undecided   Unassigned
  Queens                   In Progress   Undecided   melanie witt
  Rocky                    In Progress   Undecided   melanie witt
  Stein                    Fix Released  Undecided   melanie witt
  Train                    Fix Released  Undecided   melanie witt

Bug Description

Seeing this in Ocata at commit 125dd1f30fdaf50182256c56808a5199856383c7.

Running `openstack server list --project 9c28d07207a54c78848fd7b4f85779d5` results in a 500 error:

RESP BODY: {"computeFault": {"message": "Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.\n<class 'nova.exception.InstanceNotFound'>", "code": 500}}

Traceback in nova-api: http://paste.openstack.org/show/6YrSmjMSo0lIxyFjbPIz/

some data on the instance: http://paste.openstack.org/show/6PSa35HvdxZCQnVQ2sQU/

Looks like lazy-loading the flavor is failing because it's looking in the wrong database.
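
For illustration, here is a minimal, self-contained Python model of that failure mode (hypothetical names, not nova's actual code): when 'flavor' was not joined at list time and the request context is not pointed at the database that actually holds the instance, the lazy-load misses and surfaces as InstanceNotFound, which the API turns into the 500 above.

    # Simplified model of the lazy-load failure (hypothetical, not nova code).

    class InstanceNotFound(Exception):
        """Stand-in for nova.exception.InstanceNotFound."""

    # The instance (and its flavor, via instance_extra) really lives in cell0,
    # but an un-targeted API context ends up querying a different database.
    CELL0_DB = {"4f59f9f3-27a1-4340-9341-d162a3b66bc3": {"flavor": "m1.small"}}
    WRONG_DB = {}

    class Instance:
        def __init__(self, uuid, cached=None):
            self.uuid = uuid
            self._cache = dict(cached or {})   # fields joined/pre-loaded at list time

        def get_flavor(self, db):
            if "flavor" in self._cache:        # pre-loaded: no DB round trip needed
                return self._cache["flavor"]
            row = db.get(self.uuid)            # lazy-load: hits whatever DB the
            if row is None:                    # context points at
                raise InstanceNotFound(self.uuid)
            return row["flavor"]

    inst = Instance("4f59f9f3-27a1-4340-9341-d162a3b66bc3")
    print(inst.get_flavor(CELL0_DB))           # right database: 'm1.small'
    try:
        inst.get_flavor(WRONG_DB)              # wrong database: the bug's 500 error
    except InstanceNotFound as exc:
        print("API would return 500:", exc)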

Tags: cells
Revision history for this message
melanie witt (melwitt) wrote :

Same instance data plus instance_extra for the instance from penick: http://paste.openstack.org/show/u61jzSxy3y3k9ri496Ms/

^ Shows that the embedded flavor is present (I was concerned the instance was old and had not had its flavor migrated).

The failure of the flavor lazy-load at the REST API level (in nova/api/openstack/compute/servers.py) makes sense, in that the REST API in Ocata is not targeted to any cell, so lazy-loads like this would fail. HOWEVER, we should be pre-loading attributes like 'flavor' (a join with the instance_extra table) when we get_all instances in compute/api before this instance.get_flavor() call, so the flavor should already be cached and not looked up in the database.

But a lazy-load is definitely happening in the traceback. I haven't yet found how this is happening. Continuing to investigate.

Revision history for this message
melanie witt (melwitt) wrote :

This is where we will send 'flavor' as one of the attributes to pre-load when querying for instances in all cells:

            # merge our expected attrs with what the view builder needs for
            # showing details
            expected_attrs = self._view_builder.get_show_expected_attrs(
                                                                expected_attrs)

https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/api/openstack/compute/servers.py#L350
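
For context, a rough sketch of what that merge does (a simplified stand-in, not the actual view builder code): the attributes the detailed server view needs — 'flavor' among them, since it requires the instance_extra join — are unioned with whatever the caller already asked for, so they get joined when instances are listed instead of being lazy-loaded per instance later.

    # Simplified sketch of merging view-builder attrs into expected_attrs
    # (hypothetical helper body; the real logic lives in the servers view builder).

    def get_show_expected_attrs(expected_attrs=None):
        # Attributes the detailed "show" view is assumed to need; 'flavor' is the
        # one that matters for this bug.
        show_attrs = {"flavor", "info_cache", "metadata"}
        return sorted(show_attrs | set(expected_attrs or []))

    print(get_show_expected_attrs(["services"]))
    # ['flavor', 'info_cache', 'metadata', 'services']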

melanie witt (melwitt)
tags: added: cells
Revision history for this message
Matt Riedemann (mriedem) wrote :

Why would the vm and task state still be set when the instance is in cell0? Shouldn't it be in ERROR vm_state with task_state=None? Might indicate something about why the instance mapping isn't updated to have a cell mapping:

mysql> select * from nova_api.instance_mappings where instance_uuid='4f59f9f3-27a1-4340-9341-d162a3b66bc3'\G
*************************** 1. row ***************************
   created_at: 2018-06-08 03:54:22
   updated_at: NULL
           id: 18073
instance_uuid: 4f59f9f3-27a1-4340-9341-d162a3b66bc3
      cell_id: NULL
   project_id: 9c28d07207a54c78848fd7b4f85779d5
1 row in set (0.00 sec)

Because when it's put in cell0, the instance mapping should be updated to point at cell0.
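
As a quick way to spot instances stuck in this state, a hedged diagnostic sketch (plain Python over exported mapping rows, not a supported nova tool): any mapping that is older than the normal scheduling window but still has cell_id NULL is suspect.

    # Hypothetical diagnostic: flag instance mappings that never got a cell_id.
    # Input is a list of rows shaped like the nova_api.instance_mappings row above.
    from datetime import datetime, timedelta

    def stuck_mappings(rows, max_age=timedelta(minutes=10), now=None):
        now = now or datetime.utcnow()
        return [r for r in rows
                if r["cell_id"] is None and now - r["created_at"] > max_age]

    rows = [{
        "instance_uuid": "4f59f9f3-27a1-4340-9341-d162a3b66bc3",
        "cell_id": None,
        "created_at": datetime(2018, 6, 8, 3, 54, 22),
    }]
    for row in stuck_mappings(rows):
        print("never mapped to a cell:", row["instance_uuid"])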

Revision history for this message
melanie witt (melwitt) wrote :

I was just looking at this bug again and I think I have a theory about how this happened, based on code inspection and a recent IRC discussion with mriedem and dansmith about how instances end up in cell0 without their instance mapping cell_id set to cell0.

There was mention that if the _set_vm_state_and_notify call after instance.create() fails for any reason, we would never update the instance mapping and would also not delete the build request:

https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/conductor/manager.py#L849

Though, looking at set_vm_state_and_notify, I'm not sure how we could fail before the ERROR state update, unless it's possible for rpc.get_notifier to raise in the event of MQ trouble? Maybe we can ask gibi.

https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/scheduler/utils.py#L95

https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/rpc.py#L216

So, if set_vm_state_and_notify fails, we won't put the instance into error state, the instance mapping will have cell_id = NULL, and the build request will still be around for the same instance.

THEN, when a user does a server list, both build requests and instances are fetched from the databases (for an instance that failed set_vm_state_and_notify there will be one build request and one instance record in cell0). The API then tries to access the flavor attribute for each item in the list; the flavor does *not* exist on a build request, so the lazy-load of the flavor fails and the server list fails with the 500 error.
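
Roughly, the sequence being described is: create the instance record in cell0, set it to ERROR and notify, then point the instance mapping at cell0 and destroy the build request. A simplified model of that ordering (hypothetical objects, not the conductor code) shows how an exception from the notify step leaves the mapping NULL and the build request behind:

    # Simplified model of the ordering problem (hypothetical, not nova code).

    class MessagingTimeout(Exception):
        """Stand-in for an MQ failure raised while emitting the error notification."""

    def bury_in_cell0(instance, mapping, build_request, notify):
        instance["cell"] = "cell0"                   # instance record created in cell0
        notify(instance, "error")                    # if this raises ...
        mapping["cell_id"] = "cell0"                 # ... these two updates never run,
        build_request["destroyed"] = True            # leaving the state seen in this bug

    instance = {}
    mapping = {"cell_id": None}
    build_request = {"destroyed": False}

    def failing_notify(instance, state):
        raise MessagingTimeout()

    try:
        bury_in_cell0(instance, mapping, build_request, failing_notify)
    except MessagingTimeout:
        pass
    print(mapping, build_request)   # {'cell_id': None} {'destroyed': False}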

Revision history for this message
melanie witt (melwitt) wrote :

I'm going to rewrite my comment because I think the last paragraph I wrote is wrong.

I just did a bit of testing with functional tests and found that normally, we can lazy-load a flavor based on a build request just fine (assuming the context is targeted to the correct cell where the instance_extra for the instance record is). But, in Ocata, we aren't targeted to a cell for the instance.get_flavor() call in the nova/api/openstack/compute/views/servers.py code, so we will fail the lazy-load of the flavor.

The key part of this bug is that a build request is being included in the server list, and a build request will always require a lazy-load of the flavor because the table join on the flavor column only applies to the instance records in the server list, not to build requests.

There are two issues I see:

1. Under normal conditions, a build request will not co-exist with an instance record, and if we need to lazy-load a flavor in the REST API, that will fail because build requests don't have flavors. (Though maybe we could add that?)

2. We aren't handling the possibility of a _set_vm_state_and_notify failure that prevents update of the instance mapping and deletion of the build request.

Revision history for this message
melanie witt (melwitt) wrote :

After even more code examination, I find that even build requests should not need to lazy-load a flavor because the serialized instance object primitive is stored in the build request 'instance' field in the database. So when we look up build requests to show in a server list, they should all have flavor attributes ready to go, and not need to lazy-load. So I'm back to the drawing board as to why a lazy-load is happening in your environment.
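
That matches how build requests are stored: the whole instance object is serialized into the build request's 'instance' column, flavor included, so reconstructing it should not require a database round trip. A rough sketch of that round trip (JSON stands in for nova's real object serialization here):

    # Rough sketch: a build request carries a serialized instance, flavor embedded,
    # so deserializing it yields the flavor without any lazy-load.
    import json

    def make_build_request(instance):
        return {"instance": json.dumps(instance)}          # stored in the DB column

    def instance_from_build_request(build_request):
        return json.loads(build_request["instance"])       # flavor comes along for free

    br = make_build_request({"uuid": "4f59f9f3-27a1-4340-9341-d162a3b66bc3",
                             "flavor": {"name": "m1.small"}})
    print(instance_from_build_request(br)["flavor"]["name"])   # m1.small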

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Per the last comment (#8), it looks like the bug is legit.

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
melanie witt (melwitt) wrote :

A few months ago, I saw a duplicate of the root cause of this issue downstream [1] where a failure during _set_vm_state_and_notify (example: MessagingTimeout) prevents us from setting the cell_id for an InstanceMapping and destroying the BuildRequest.

The InstanceMapping cell_id part of it was solved (from an MQ failure perspective) by a patch that landed in Ussuri:

https://review.opendev.org/683730

and I think this is the root cause of the problem described in this bug.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1848737#c1

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/752279
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=efc35b1c5293c7c6c85f8cf9fd9d8cd8de71d1d5
Submitter: Zuul
Branch: stable/train

commit efc35b1c5293c7c6c85f8cf9fd9d8cd8de71d1d5
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 20 17:07:35 2019 -0400

    Sanity check instance mapping during scheduling

    mnaser reported a weird case where an instance was found
    in both cell0 (deleted there) and in cell1 (not deleted
    there but in error state from a failed build). It's unclear
    how this could happen besides some weird clustered rabbitmq
    issue where maybe the schedule and build request to conductor
    happens twice for the same instance and one picks a host and
    tries to build and the other fails during scheduling and is
    buried in cell0.

    To avoid a split brain situation like this, we add a sanity
    check in _bury_in_cell0 to make sure the instance mapping is
    not pointing at a cell when we go to update it to cell0.
    Similarly a check is added in the schedule_and_build_instances
    flow (the code is moved to a private method to make it easier
    to test).

    Worst case is this is unnecessary but doesn't hurt anything,
    best case is this helps avoid split brain clustered rabbit
    issues.

    Closes-Bug: #1775934

    Change-Id: I335113f0ec59516cb337d34b6fc9078ea202130f
    (cherry picked from commit 5b552518e1abdc63fb33c633661e30e4b2fe775e)
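
The sanity check described in the commit message boils down to: before pointing a mapping at cell0, refuse (or at least flag it) if the mapping already points at a real cell. A hedged sketch of that idea (illustrative only, not the merged patch itself):

    # Sketch of the sanity-check idea from the commit message above; the actual
    # change lives in nova/conductor/manager.py.

    def bury_in_cell0(mapping, cell0_id="cell0"):
        if mapping.get("cell_id") is not None:
            # Another worker already mapped this instance to a real cell; burying
            # it in cell0 now would create the split-brain case described above.
            raise RuntimeError("instance already mapped to %s, refusing to bury"
                               % mapping["cell_id"])
        mapping["cell_id"] = cell0_id
        return mapping

    print(bury_in_cell0({"cell_id": None}))        # OK: mapped to cell0
    # bury_in_cell0({"cell_id": "cell1"})          # would raise: already mapped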

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/756404

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/756404
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c895d3e6bca562225d70e8f81255f38970f7fcda
Submitter: Zuul
Branch: stable/stein

commit c895d3e6bca562225d70e8f81255f38970f7fcda
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 20 17:07:35 2019 -0400

    Sanity check instance mapping during scheduling

    mnaser reported a weird case where an instance was found
    in both cell0 (deleted there) and in cell1 (not deleted
    there but in error state from a failed build). It's unclear
    how this could happen besides some weird clustered rabbitmq
    issue where maybe the schedule and build request to conductor
    happens twice for the same instance and one picks a host and
    tries to build and the other fails during scheduling and is
    buried in cell0.

    To avoid a split brain situation like this, we add a sanity
    check in _bury_in_cell0 to make sure the instance mapping is
    not pointing at a cell when we go to update it to cell0.
    Similarly a check is added in the schedule_and_build_instances
    flow (the code is moved to a private method to make it easier
    to test).

    Worst case is this is unnecessary but doesn't hurt anything,
    best case is this helps avoid split brain clustered rabbit
    issues.

    Closes-Bug: #1775934

    Change-Id: I335113f0ec59516cb337d34b6fc9078ea202130f
    (cherry picked from commit 5b552518e1abdc63fb33c633661e30e4b2fe775e)
    (cherry picked from commit efc35b1c5293c7c6c85f8cf9fd9d8cd8de71d1d5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/762899

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/762903

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https://review.opendev.org/c/openstack/nova/+/762903
Reason: This branch transitioned to End of Life for this project; open patches need to be closed to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova stein-eol

This issue was fixed in the openstack/nova stein-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/762899
Reason: This branch transitioned to End of Life for this project; open patches need to be closed to be able to delete the branch.
