Conflict when deleting allocations for an instance that hasn't finished building

Bug #1836754 reported by Artom Lifshitz on 2019-07-16
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Low
Unassigned
Stein
Undecided
Unassigned
Train
Undecided
Unassigned

Bug Description

Description
===========

When deleting an instance that hasn't finished building, we'll sometimes get a 409 from placement as such:

Failed to delete allocations for consumer 6494d4d3-013e-478f-9ac1-37ca7a67b776. Error: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\\\\n\\\\n Inventory and/or allocations changed while attempting to allocate: Another thread concurrently updated the data. Please retry your update ", "code": "placement.concurrent_update", "request_id": "req-6dcd766b-f5d3-49fa-89f3-02e64079046a"}]}

Steps to reproduce
==================

1. Boot an instance
2. Don't wait for it to become active
3. Delete it immediately

Expected result
===============

The instance deletes successfully.

Actual result
=============

Nova bubbles up that error from Placement.

Logs & Configs
==============

This is being hit at a low rate in various CI tests, logstash query is here:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Inventory%20and%2For%20allocations%20changed%20while%20attempting%20to%20allocate%3A%20Another%20thread%20concurrently%20updated%20the%20data%5C%22%20AND%20filename%3A%5C%22job-output.txt%5C%22

Eric Fried (efried) on 2019-07-16
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
tags: added: placement
Chris Dent (cdent) wrote :

Matt's logstash is happening because the host is being deleted which causes the api server to call delete_resource_provider (on the report client) with cascade=True, which deletes allocations but has no retry handling.

At least it seems that way, I'm rushing or I'd leave more info. Hope this is a clue.

Matt Riedemann (mriedem) wrote :

Yup, melwitt reported the same thing recently. We should be able to fix this with a simple retry in the local delete case if we get a 409 concurrent update generation conflict from placement.

Matt Riedemann (mriedem) wrote :

This rarely hits in CI because of the small window where we can fail here.

We GET the consumer allocations with a generation here:

https://github.com/openstack/nova/blob/149327a3abb12418cdf65316e7c1d4924767bfdf/nova/scheduler/client/report.py#L1986

Zero out the allocations and then PUT them back here:

https://github.com/openstack/nova/blob/149327a3abb12418cdf65316e7c1d4924767bfdf/nova/scheduler/client/report.py#L2010

So between the GET and PUT the consumer generation changed which is an extremely tight window.

I'm trying to think of a way to write a functional test to recreate this type of failure and it's hard since I'd have to not only wait until the scheduler claims resources to create the allocations I'd have to also block the server delete until after the GET. So something like:

1. create server
2. block on the claim_resources method
3. delete the server
4. block after the GET /allocations call
5. resume claim_resources to PUT allocations during scheduling
6. resume server delete to PUT empty allocations with a stale generation

Changed in nova:
importance: Medium → Low
Matt Riedemann (mriedem) wrote :

I can't come up with a good functional recreate test, this was my attempt:

http://paste.openstack.org/show/784002/

It's hacky enough that we're probably better off with just a simple unit test and a retry decorator on delete_allocation_for_instance.

Eric Fried (efried) wrote :

> a retry decorator on delete_allocation_for_instance

I would rather not put this at the report client level; it should be customizable per caller (e.g. via a "force" kwarg, or a different method altogether).

The forced delete can use a retry -- or just use the DELETE route, which is generation-agnostic.

Fix proposed to branch: master
Review: https://review.opendev.org/688802

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Matt Riedemann (mriedem) wrote :

This goes back to Stein because https://review.opendev.org/#/c/591597/ changed the method from using DELETE /allocations/{consumer_id} to the GET/PUT dance.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/688802
Reason: I'm going to drop this so someone else can work on it if they want.

Matt Riedemann (mriedem) on 2019-12-17
Changed in nova:
status: In Progress → Confirmed
assignee: Matt Riedemann (mriedem) → nobody
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers