Conflict when deleting allocations for an instance that hasn't finished building

Bug #1836754 reported by Artom Lifshitz
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Low
Unassigned

Bug Description

Description
===========

When deleting an instance that hasn't finished building, we'll sometimes get a 409 from placement as such:

Failed to delete allocations for consumer 6494d4d3-013e-478f-9ac1-37ca7a67b776. Error: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\\\\n\\\\n Inventory and/or allocations changed while attempting to allocate: Another thread concurrently updated the data. Please retry your update ", "code": "placement.concurrent_update", "request_id": "req-6dcd766b-f5d3-49fa-89f3-02e64079046a"}]}

Steps to reproduce
==================

1. Boot an instance
2. Don't wait for it to become active
3. Delete it immediately

Expected result
===============

The instance deletes successfully.

Actual result
=============

Nova bubbles up that error from Placement.

Logs & Configs
==============

This is being hit at a low rate in various CI tests, logstash query is here:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Inventory%20and%2For%20allocations%20changed%20while%20attempting%20to%20allocate%3A%20Another%20thread%20concurrently%20updated%20the%20data%5C%22%20AND%20filename%3A%5C%22job-output.txt%5C%22

Eric Fried (efried)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
tags: added: placement
Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: added: api gate-failure
Revision history for this message
Chris Dent (cdent) wrote :

Matt's logstash is happening because the host is being deleted which causes the api server to call delete_resource_provider (on the report client) with cascade=True, which deletes allocations but has no retry handling.

At least it seems that way, I'm rushing or I'd leave more info. Hope this is a clue.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

Yup, melwitt reported the same thing recently. We should be able to fix this with a simple retry in the local delete case if we get a 409 concurrent update generation conflict from placement.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

This rarely hits in CI because of the small window where we can fail here.

We GET the consumer allocations with a generation here:

https://github.com/openstack/nova/blob/149327a3abb12418cdf65316e7c1d4924767bfdf/nova/scheduler/client/report.py#L1986

Zero out the allocations and then PUT them back here:

https://github.com/openstack/nova/blob/149327a3abb12418cdf65316e7c1d4924767bfdf/nova/scheduler/client/report.py#L2010

So between the GET and PUT the consumer generation changed which is an extremely tight window.

I'm trying to think of a way to write a functional test to recreate this type of failure and it's hard since I'd have to not only wait until the scheduler claims resources to create the allocations I'd have to also block the server delete until after the GET. So something like:

1. create server
2. block on the claim_resources method
3. delete the server
4. block after the GET /allocations call
5. resume claim_resources to PUT allocations during scheduling
6. resume server delete to PUT empty allocations with a stale generation

Changed in nova:
importance: Medium → Low
Revision history for this message
Matt Riedemann (mriedem) wrote :

I can't come up with a good functional recreate test, this was my attempt:

http://paste.openstack.org/show/784002/

It's hacky enough that we're probably better off with just a simple unit test and a retry decorator on delete_allocation_for_instance.

Revision history for this message
Eric Fried (efried) wrote :

> a retry decorator on delete_allocation_for_instance

I would rather not put this at the report client level; it should be customizable per caller (e.g. via a "force" kwarg, or a different method altogether).

The forced delete can use a retry -- or just use the DELETE route, which is generation-agnostic.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/688802

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

This goes back to Stein because https://review.opendev.org/#/c/591597/ changed the method from using DELETE /allocations/{consumer_id} to the GET/PUT dance.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/688802
Reason: I'm going to drop this so someone else can work on it if they want.

Matt Riedemann (mriedem)
Changed in nova:
status: In Progress → Confirmed
assignee: Matt Riedemann (mriedem) → nobody
Lee Yarwood (lyarwood)
Changed in nova:
status: Confirmed → Incomplete
no longer affects: nova/stein
no longer affects: nova/train
Changed in nova:
status: Incomplete → In Progress
Revision history for this message
melanie witt (melwitt) wrote :

The uptick of appearance of this bug in the gate seems to correlate with the timing of the merging of consumer types in placement [1].

Looking through, the only thing that seems could be causing the increase in hits is this part that groups all of the consumer database writes during _set_allocations_for_consumer into a single database transaction [2]:

[...]

    # NOTE(melwitt): Group all of the database updates in a single transaction
    # so that updates get rolled back automatically in the event of a consumer
    # generation conflict.
    _set_allocations_for_consumer_same_transaction(
        context, consumer_uuid, data, allocation_data, want_version)

    req.response.status = 204
    req.response.content_type = None
    return req.response

@db_api.placement_context_manager.writer
def _set_allocations_for_consumer_same_transaction(context, consumer_uuid,
                                                   data, allocation_data,
                                                   want_version):

[...]

Is it possible this the transaction takes just long enough to increase the chances of the race to update a consumer?

[1] https://review.opendev.org/c/openstack/placement/+/679441
[2] https://github.com/openstack/placement/commit/37721325798dda64d96d174f9e316620639fc9c9#diff-8278dbebda82d59bb97c4b864896d53161108f8238ff6ab00b63b7757e7d9043R413-R425

Revision history for this message
Chris Dent (cdent) wrote :

The error message that's being provided basically says exactly what's supposed to happen here: retry the operation, looks like Matt had a solution along those lines and Eric had a refinement that was less complicated but more aggressive. Since the instance is supposed to be gone, I would think the aggressive form is probably fine.

If there's been an uptick of the conflict happening that does suggest that transaction has delayed visibility of change a bit. Am I right in recalling that these are instances that are being destroyed very soon after creation?

Revision history for this message
melanie witt (melwitt) wrote :

Eyeballing the logstash query [1], the majority look to occur in the DeleteServersTestJSON [2]. There is a test in there that deletes a server while it is building.

But it looks like there are cases where we fail to delete a server after it goes into VERIFY_RESIZE state [3][4]:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_delete_server.py", line 106, in test_delete_server_while_in_verify_resize_state
    waiters.wait_for_server_termination(self.client, server['id'])
  File "/opt/stack/tempest/tempest/common/waiters.py", line 124, in wait_for_server_termination
    raise lib_exc.DeleteErrorException(
tempest.lib.exceptions.DeleteErrorException: Resource %(resource_id)s failed to delete and is in ERROR status
Details: Server 2f13b96e-201b-49a1-b2e1-08c12b8527b1 failed to delete and is in ERROR status

In this case, I don't yet understand how the conflict is occurring.

[1] http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Unable%20to%20delete%20allocation%20for%20instance%5C%22%20AND%20message:%5C%22%5C%5C%5C%22code%5C%5C%5C%22:%20%5C%5C%5C%22placement.concurrent_update%5C%5C%5C%22%5C%22%20AND%20tags:%5C%22screen-n-api.txt%5C%22%20AND%20voting:1%20AND%20build_status:%5C%22FAILURE%5C%22&from=864000s
[2] https://github.com/openstack/tempest/blob/8e76271b5cf031901c29bea851eec8b44d1e40f3/tempest/api/compute/servers/test_delete_server.py#L28
[3] http://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_474/802056/12/check/nova-ceph-multistore/4747c0b/testr_results.html
[4] https://github.com/openstack/tempest/blob/8e76271b5cf031901c29bea851eec8b44d1e40f3/tempest/api/compute/servers/test_delete_server.py#L99

Revision history for this message
melanie witt (melwitt) wrote :

I lean toward the more aggressive approach for most if not all delete requests because IMHO a delete request from a user should never be rejected.

I updated Matt's WIP patch to address the existing comments about a month ago when I noticed this failure popping up on the consumer types set:

https://review.opendev.org/c/openstack/nova/+/688802/2

Based on the discussion on the review, in most cases in ^ it was appropriate to "force" a delete, with the exception being during reclaim of SOFT_DELETED instances if an operator has configured deferred deletes. Based on this, it seems like our prior state of DELETE /allocations was more appropriate than what we have currently since the vast majority of the time we should be DELETEing for our purposes.

An alternative approach brought up on the WIP patch review [1] was to just revert:

https://review.opendev.org/c/openstack/nova/+/591597

I'm going to upload a revert of ^ as an option and see what the consensus is among everyone on which way to go.

[1] https://review.opendev.org/c/openstack/nova/+/688802/2#message-8be9a96ce27a2bf1f1118bce6cc387096210ffda

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/806292

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/806293

Revision history for this message
melanie witt (melwitt) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "melanie witt <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/806293
Reason: Consensus was to use https://review.opendev.org/c/openstack/nova/+/688802

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "melanie witt <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/806292
Reason: Consensus was to use https://review.opendev.org/c/openstack/nova/+/688802

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/688802
Committed: https://opendev.org/openstack/nova/commit/c09d98dadb6cd69e294420ba7ecea0f9b9cfcd71
Submitter: "Zuul (22348)"
Branch: master

commit c09d98dadb6cd69e294420ba7ecea0f9b9cfcd71
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 15 15:49:55 2019 -0400

    Add force kwarg to delete_allocation_for_instance

    This adds a force kwarg to delete_allocation_for_instance which
    defaults to True because that was found to be the most common use case
    by a significant margin during implementation of this patch.
    In most cases, this method is called when we want to delete the
    allocations because they should be gone, e.g. server delete, failed
    build, or shelve offload. The alternative in these cases is the caller
    could trap the conflict error and retry but we might as well just force
    the delete in that case (it's cleaner).

    When force=True, it will DELETE the consumer allocations rather than
    GET and PUT with an empty allocations dict and the consumer generation
    which can result in a 409 conflict from Placement. For example, bug
    1836754 shows that in one tempest test that creates a server and then
    immediately deletes it, we can hit a very tight window where the method
    GETs the allocations and before it PUTs the empty allocations to remove
    them, something changes which results in a conflict and the server
    delete fails with a 409 error.

    It's worth noting that delete_allocation_for_instance used to just
    DELETE the allocations before Stein [1] when we started taking consumer
    generations into account. There was also a related mailing list thread
    [2].

    Closes-Bug: #1836754

    [1] I77f34788dd7ab8fdf60d668a4f76452e03cf9888
    [2] http://lists.openstack.org/pipermail/openstack-dev/2018-August/133374.html

    Change-Id: Ife3c7a5a95c5d707983ab33fd2fbfc1cfb72f676

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers