nova client servers.list crashes with bad marker

Bug #1773945 reported by Kevin de Berk on 2018-05-29
Affects                    Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)   Incomplete    Medium      Unassigned
python-novaclient          Fix Released  Medium      Matt Riedemann

Bug Description

We have a Python script that calls servers.list() on an instance of novaclient.v2.client.Client. Sometimes that call raises a "BadRequest: marker not found" exception:

Our call:

  client = nova_client.Client("2", session=some_session)
  client.servers.list()

Observed Stacktrace:

  File "/usr/lib/python2.7/site-packages/XXXX/XXXX.py", line 630, in XXXX:
    all_servers = self.nova.servers.list()
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 854, in list
    "servers")
  File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 257, in _list
    resp, body = self.api.client.get(url)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 304, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 83, in request
    raise exceptions.from_response(resp, body, url, method)
BadRequest: marker [6a91d602-ab6e-42e0-929e-5ec33df2ddef] not found (HTTP 400) (Request-ID: req-78827725-801d-4514-8cc8-e4b94f15c191)

Discussion:

We have a lot of stacks and sometimes create multiple stacks at the same time. We've noticed that the stacks with the mentioned UUIDs were created just before these errors occurred. It seems that when a newly created stack appears at a certain position in the server list, its UUID is used as a pagination marker, but the code that validates the marker does not recognize such instances.
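The suspected failure mode can be illustrated with a minimal, self-contained simulation. This is not nova code; all names and data here are hypothetical stand-ins for the instance_mappings table and the per-cell databases:

```python
# Minimal simulation of the suspected failure mode (hypothetical, not nova code).
# instance_mappings: uuid -> cell name, or None while scheduling is in flight.
instance_mappings = {
    "aaaa": "cell1",
    "6a91d602": None,  # newly created server; mapping not yet pointed at a cell
}
# cell_dbs: cell name -> set of instance uuids recorded in that cell
cell_dbs = {"cell1": {"aaaa"}}


class MarkerNotFound(Exception):
    pass


def get_marker_record(marker):
    """Validate a pagination marker roughly the way the server side does."""
    if marker not in instance_mappings:
        raise MarkerNotFound(marker)
    cell = instance_mappings[marker]
    # A None cell mapping means we look up the instance in no cell at all,
    # so the record is never found and the marker is rejected (HTTP 400).
    if marker not in cell_dbs.get(cell, set()):
        raise MarkerNotFound(marker)
    return marker
```

With this model, get_marker_record("aaaa") succeeds, while the not-yet-mapped UUID raises MarkerNotFound, matching the observed 400.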

Relevant versions:

- python-novaclient (9.1.0)
- nova (16.0.0)

Kevin de Berk (dberk) on 2018-05-29
description: updated
jichenjc (jichenjc) wrote :

Seems this is a python-novaclient issue?

affects: nova → python-novaclient
Changed in python-novaclient:
assignee: nobody → Surya Seetharaman (tssurya)

Takashi Natsume (natsume-takashi) wrote :

My guess is as follows:

In nova, the issue occurs if the record of the marker VM instance exists in the cell, but the 'cell_mapping' in the InstanceMapping of that instance is null (None).

https://github.com/openstack/nova/blob/f902e0d5d87fb05207e4a7aca73d185775d43df2/nova/compute/instance_list.py#L56-L73

--------------------------------------------------------------------------
    def get_marker_record(self, ctx, marker):
        try:
            im = objects.InstanceMapping.get_by_instance_uuid(ctx, marker)
        except exception.InstanceMappingNotFound:
            raise exception.MarkerNotFound(marker=marker)

        elevated = ctx.elevated(read_deleted='yes')
        with context.target_cell(elevated, im.cell_mapping) as cctx:
            try:
                # NOTE(danms): We query this with no columns_to_join()
                # as we're just getting values for the sort keys from
                # it and none of the valid sort keys are on joined
                # columns.
                db_inst = db.instance_get_by_uuid(cctx, marker,
                                                  columns_to_join=[])
            except exception.InstanceNotFound:
                raise exception.MarkerNotFound(marker=marker)  # <------ Here
        return db_inst
--------------------------------------------------------------------------

In nova-conductor, between *1 and *2, the record of the VM instance exists in the cell while the 'cell_mapping' in the InstanceMapping of that instance is still null (None).

*1: https://github.com/openstack/nova/blob/f902e0d5d87fb05207e4a7aca73d185775d43df2/nova/conductor/manager.py#L1172-L1175
*2: https://github.com/openstack/nova/blob/f902e0d5d87fb05207e4a7aca73d185775d43df2/nova/conductor/manager.py#L1236-L1239

tags: added: cells
affects: python-novaclient → nova
Changed in nova:
assignee: Surya Seetharaman (tssurya) → nobody
assignee: nobody → Surya Seetharaman (tssurya)
Surya Seetharaman (tssurya) wrote :

I also reproduced the same error when running nova list --marker "uuid" for a UUID whose InstanceMapping has a NULL cell_mapping. @natsume-takashi, looks like your guess is right.

Surya Seetharaman (tssurya) wrote :

However, a NULL cell_mapping should not be an issue: instances that have an instance record in the cell DB but a NULL cell_mapping in their instance_mapping (meaning instances which are not yet fully scheduled) will have build_requests, and as per (https://github.com/openstack/nova/blob/f902e0d5d87fb05207e4a7aca73d185775d43df2/nova/compute/api.py#L2402-L2408) the marker is set to None if it is found in the BuildRequest table. So I don't fully understand what the issue is. @dberk, could you please attach the n-api logs if possible?
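The API-side behavior described here, resetting the marker when it matches a build request, can be sketched roughly as follows. BuildRequest below is a simplified stand-in, not the real nova object:

```python
class BuildRequest:
    """Stand-in for nova's BuildRequest object (illustrative only)."""

    def __init__(self, instance_uuid):
        self.instance_uuid = instance_uuid


def reset_marker_if_build_request(build_requests, marker):
    """If the marker names a not-yet-scheduled instance, it lives in the
    BuildRequest table; drop it so the cell query starts from the top."""
    for br in build_requests:
        if br.instance_uuid == marker:
            return None
    return marker
```

Under this logic an unscheduled instance used as a marker should be harmless, which is why the NULL cell_mapping alone does not explain the 400.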

Kevin de Berk (dberk) wrote :

We don't have those log files anymore, sorry.

Matt Riedemann (mriedem) wrote :

I hit something similar last week in python-openstackclient when listing servers, I was getting a marker not found error even though I wasn't passing a marker, but something in the CLI (or nova python API binding code) was. I ran into this after rabbitmq crashed during scheduling a large number of instances in a single request, and I think I also noticed that there were instance_mappings records with no cell_id set.

I believe the reason the CLI was passing a marker to the server, even though I wasn't specifying a marker (or a limit) was because of this code in python-novaclient:

https://github.com/openstack/python-novaclient/blob/10.2.0/novaclient/v2/servers.py#L863

That logic says, on the first pass to the server, don't pass a marker or a limit, but if you get results back, keep hitting the server and this time use a marker from the last result set, and that's where I was getting the marker not found on my corrupted instances.
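That client-side paging loop can be sketched as follows. This is a simplified model of novaclient's internal loop, not its actual code; fetch_page is a hypothetical callable standing in for the GET /servers/detail request:

```python
def list_all_servers(fetch_page):
    """Simplified model of novaclient's paging loop after change 534222.

    fetch_page(marker) returns one page of server dicts; the real client
    sends marker/limit as query parameters on GET /servers/detail.
    """
    servers = []
    marker = None  # first request: no marker, no limit
    while True:
        page = fetch_page(marker)
        if not page:
            break
        servers.extend(page)
        # The id of the last result becomes the next marker; if that
        # instance has a corrupted instance_mapping, the next call 400s
        # even though the caller never passed a marker.
        marker = page[-1]["id"]
    return servers
```

The key point is that the marker is chosen by the client from the previous page, so a corrupted record anywhere at a page boundary poisons every listing.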

The default paging behavior of 'nova list' was changed recently, which might have introduced this regression:

https://review.openstack.org/#/c/534222/

Matt Riedemann (mriedem) wrote :

I never understood the need for change https://review.openstack.org/#/c/534222/ and think we should probably revert it. If users want to list all instances, they can run nova list --limit -1 to do that. The author of that change even said in the queens backport that it wasn't a problem for them anymore, so they weren't sure what changed:

https://review.openstack.org/#/c/543968/

"I'm not sure how this happened but I can't reproduce the problem now. It seems to work as expected with `--limit -1`."

Matt Riedemann (mriedem) wrote :
tags: added: api
Changed in python-novaclient:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) wrote :

For this code:

https://github.com/openstack/nova/blob/f902e0d5d87fb05207e4a7aca73d185775d43df2/nova/compute/instance_list.py#L63

If we find an instance mapping but no cell mapping for that instance mapping, then something blew up in the server and we likely need to log an error and return a 500 to the user since it's going to require operator intervention to cleanup the busted instance_mappings record.
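A defensive guard along these lines might look like the sketch below. This is not merged nova code; the exact exception type and the get_mapping parameter are assumptions made so the sketch is self-contained:

```python
def get_marker_record_checked(ctx, marker, get_mapping):
    """Sketch of the proposed server-side guard (not merged nova code).

    get_mapping(ctx, marker) stands in for
    objects.InstanceMapping.get_by_instance_uuid.
    """
    im = get_mapping(ctx, marker)
    if im.cell_mapping is None:
        # A mapping with no cell is internal corruption, not a bad request:
        # log it and surface a 500 so an operator can repair the record,
        # instead of blaming the user's marker with a 400.
        raise RuntimeError(
            "instance_mapping for %s has no cell_mapping; "
            "operator intervention required" % marker)
    return im
```

The design point is simply error classification: a 400 tells the user their input is wrong, while this situation is server-side data corruption.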

Reviewed: https://review.openstack.org/572539
Committed: https://git.openstack.org/cgit/openstack/python-novaclient/commit/?id=abaa86fd86d4beca0082ff1768d5306e5e86302e
Submitter: Zuul
Branch: master

commit abaa86fd86d4beca0082ff1768d5306e5e86302e
Author: Matt Riedemann <email address hidden>
Date: Tue Jun 5 19:57:49 2018 +0000

    Revert "Fix listing of instances above API max_limit"

    This reverts commit eff607ccef91d09052d58f6798f68d67404f51ce.

    There was no apparent need for the change being reverted since
    users can list all servers by specifying --limit=-1 when running
    the nova list command.

    The change introduced a problem whereby the first pass to
    list instances from the server would get up to
    [api]/max_limit (default 1000) results and then call again
    with a marker. If the last instance in the list (the marker)
    is corrupted in the instance_mappings table in the API DB
    by not having an associated cell mapping, listing instances
    will always fail with a MarkerNotFound error even though
    the CLI user is not passing a marker nor specifying
    --limit=-1. The corrupted instance mapping record resulting
    in the MarkerNotFound error is something else that should
    be fixed on the server side (and likely result in a 500) but
    the change in behavior of the CLI makes it always fail
    if you hit this even if you're not passing a marker.

    Change-Id: Ibb43f500a74733b85bd3242592d36985bfb45856
    Closes-Bug: #1773945

Changed in python-novaclient:
status: In Progress → Fix Released
Matt Riedemann (mriedem) wrote :

From the cells v2 meeting today about the nova side of this bug (comment 9):

(12:02:39 PM) dansmith: yeah, I think it's legit to log an error if the mapping is null so that the operator can fix it up
(12:02:46 PM) mriedem: and raise 500
(12:02:48 PM) mriedem: ?
(12:02:49 PM) dansmith: yeah
(12:02:50 PM) mriedem: ok
(12:02:58 PM) dansmith: because it's legit internal "we screwed up" error

Changed in nova:
status: New → Triaged
importance: Undecided → Medium

This issue was fixed in the openstack/python-novaclient 10.3.0 release.

Matt Riedemann (mriedem) on 2018-06-14
Changed in nova:
assignee: Surya Seetharaman (tssurya) → nobody
assignee: nobody → Matt Riedemann (mriedem)

Fix proposed to branch: master
Review: https://review.openstack.org/575556

Changed in nova:
status: Triaged → In Progress

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/575556
Reason: I'm going to drop this in favor of https://review.opendev.org/#/c/666438/ and https://review.opendev.org/#/c/655908/. There could still be unaccounted for cases where the build request is gone, the instance is in a cell and the instance mapping is not pointing at a cell (maybe the instance mapping update was dropped by the DB or something - these aren't atomic operations). We can try to shore up more of that flow, e.g. https://review.opendev.org/#/c/586713/ but I'm going to drop this in the meantime since it was really just for debug purposes.

Matt Riedemann (mriedem) wrote :

Marking this as incomplete since I'm not sure if we have an action to take.

Changed in nova:
status: In Progress → Incomplete
assignee: Matt Riedemann (mriedem) → nobody

Reviewed: https://review.opendev.org/666438
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ff029c1c3792f53865c6bdb3dce8d2c51b73ca7
Submitter: Zuul
Branch: master

commit 1ff029c1c3792f53865c6bdb3dce8d2c51b73ca7
Author: Matt Riedemann <email address hidden>
Date: Wed Jun 19 16:26:07 2019 -0400

    Delete InstanceMapping in conductor if BuildRequest is already deleted

    The BuildRequest represents a server in the API until the scheduler
    picks a host in a cell and we create the instance record in that cell
    and update the instance mapping to point at the cell. If the user
    deletes the BuildRequest before the instance record is created in a
    cell, the conductor schedule_and_build_instances method cleans up the
    resource allocations created by scheduler and then continues to the
    next instance (if it's a multi-create request). The point is the instance
    does not get created in a cell, the BuildRequest is gone, and the
    instance mapping is left pointing at no cell - effectively orphaned.

    Furthermore, the archive_deleted_rows command change
    I483701a55576c245d091ff086b32081b392f746e to cleanup instance mappings
    for archived instances will not catch and cleanup the orphan instance
    mapping because there never was an instance record to delete and archive
    (the BuildRequest was deleted before the instance record was created, and
    the BuildRequest is hard deleted so there is no archive).

    This change simply deletes the InstanceMapping record in case the
    BuildRequest is already gone by the time we finish scheduling and we
    do not create the instance record in any cell.

    Change-Id: Ia03577ae41f010b449e47ff5b69b432d74f8467b
    Related-Bug: #1784074
    Related-Bug: #1773945
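The cleanup this commit describes amounts to the following, shown here as a hedged sketch with stand-in objects rather than nova's real InstanceMapping:

```python
class InstanceMapping:
    """Stand-in for nova's InstanceMapping object (illustrative only)."""

    def __init__(self, instance_uuid, cell_mapping=None):
        self.instance_uuid = instance_uuid
        self.cell_mapping = cell_mapping
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


def cleanup_after_scheduling(mapping, build_request_deleted):
    """If the user deleted the BuildRequest before the instance record was
    created in a cell, no cell ever owned the instance: drop the mapping
    instead of leaving it orphaned with cell_mapping = None."""
    if build_request_deleted and mapping.cell_mapping is None:
        mapping.destroy()
    return mapping
```

This prevents the orphaned mapping from ever becoming a poisoned pagination marker in the first place.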
