Deleting stuck build instance may leak allocations

Bug #1859496 reported by Alexandre arents
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Alexandre arents
Milestone: (none)

Bug Description

Description
===========

After control plane issues during instance creation,
an instance may stay stuck in the BUILD state.

Even after such instances are deleted, their placement allocations may remain,
and the compute host log complains that:
Instance eba20a0f-5856-4600-bcaa-7b758d04b5c5 has allocations against this compute host but is not found in the database.

Steps to reproduce
==================

On a fresh devstack master install:

1) Open a terminal that displays the contents of placement.allocations and nova_cell1.instances every second:
while true ; do date ; mysql -e "select * from placement.allocations" ; mysql -e "select * from nova_cell1.instances where deleted=0" ;sleep 1 ; done

2) Trigger a spawn of 50 instances and kill RabbitMQ after 5 seconds to simulate a control plane issue:
openstack server create --flavor m1.tiny --image cirros-0.4.0-x86_64-disk --nic net-id=private alex --min 50 --max 50 & sleep 5 ; sudo pkill rabbitmq-server

Note: To hit the bug, the goal is to get the instances allocated by the scheduler without giving the conductor time to create their entries in nova_cell1.instances.

You should see allocations appearing in placement.allocations:
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+------+
| created_at          | updated_at | id   | resource_provider_id | consumer_id                          | resource_class_id | used |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+------+
| 2020-01-13 11:02:51 | NULL       | 1727 |                    1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 |                 2 |    1 |
| 2020-01-13 11:02:51 | NULL       | 1728 |                    1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 |                 1 |  512 |
| 2020-01-13 11:02:51 | NULL       | 1729 |                    1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 |                 0 |    1 |
| 2020-01-13 11:02:51 | NULL       | 1730 |                    1 | 3cd1b8be-6997-452e-86e0-5013c9ab6bda |                 2 |    1 |
| 2020-01-13 11:02:51 | NULL       | 1731 |                    1 | 3cd1b8be-6997-452e-86e0-5013c9ab6bda |                 1 |  512 |
.....

All instances are stuck in BUILD at this stage.

3) Delete the instances:
openstack server list | awk '/m1.tiny/ {print $2}' | xargs openstack server delete
4) service rabbitmq-server start
5) openstack server list
    <displays nothing>
6) mysql -e "select count(*) from placement.allocations"
+----------+
| count(*) |
+----------+
|      150 |
+----------+
The allocations remain.
7) The nova-compute logs complain that:
Instance eba20a0f-5856-4600-bcaa-7b758d04b5c5 has allocations against this compute host but is not found in the database.
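
The leak check in steps 6 and 7 can also be approximated offline. The sketch below is illustrative only: `find_leaked_consumers` and the row dictionaries are hypothetical names, not nova or placement code, and the sample rows are the five shown in the allocations table above. It groups allocation rows by consumer_id and reports every consumer that has no matching undeleted row in nova_cell1.instances. The final count of 150 is consistent with 50 instances each holding three allocation rows, one per resource class, as the rows for consumer 8d0a42fe-922b-4c08-afe3-65d65893d355 suggest.

```python
from collections import defaultdict


def find_leaked_consumers(allocation_rows, live_instance_uuids):
    """Return {consumer_id: row_count} for consumers with no live instance.

    allocation_rows: iterable of dicts with at least a "consumer_id" key,
        mirroring rows of placement.allocations (hypothetical shape).
    live_instance_uuids: UUIDs of undeleted rows in nova_cell1.instances.
    """
    rows_per_consumer = defaultdict(int)
    for row in allocation_rows:
        rows_per_consumer[row["consumer_id"]] += 1
    live = set(live_instance_uuids)
    return {c: n for c, n in rows_per_consumer.items() if c not in live}


# Sample rows copied from the table in step 2 (the table is truncated there,
# so the second consumer only has two rows here).
allocation_rows = [
    {"consumer_id": "8d0a42fe-922b-4c08-afe3-65d65893d355", "resource_class_id": 2, "used": 1},
    {"consumer_id": "8d0a42fe-922b-4c08-afe3-65d65893d355", "resource_class_id": 1, "used": 512},
    {"consumer_id": "8d0a42fe-922b-4c08-afe3-65d65893d355", "resource_class_id": 0, "used": 1},
    {"consumer_id": "3cd1b8be-6997-452e-86e0-5013c9ab6bda", "resource_class_id": 2, "used": 1},
    {"consumer_id": "3cd1b8be-6997-452e-86e0-5013c9ab6bda", "resource_class_id": 1, "used": 512},
]

# After step 5 the instance list is empty, so every consumer counts as leaked.
leaked = find_leaked_consumers(allocation_rows, live_instance_uuids=[])
for consumer_id, row_count in leaked.items():
    print(f"{consumer_id}: {row_count} leaked allocation rows")
```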

Expected result
===============
Placement allocations of an instance must be cleaned up after its deletion.

Actual result
=============
Placement allocations of the deleted instances are leaked.
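
For a deployment that already holds leaked allocations, one possible manual cleanup is the placement REST API call `DELETE /allocations/{consumer_uuid}`. The sketch below is a hedged illustration, not part of the report or of the fix: the helper name `build_allocation_delete_requests`, the endpoint URL, and the token are placeholders, and the code only builds the requests without sending them, so each one can be reviewed first.

```python
import urllib.request


def build_allocation_delete_requests(placement_url, token, consumer_uuids):
    """Build one DELETE /allocations/{consumer_uuid} request per consumer.

    placement_url and token are assumptions: use the placement endpoint and
    a valid Keystone token from your own deployment.
    """
    base = placement_url.rstrip("/")
    requests = []
    for consumer_uuid in consumer_uuids:
        requests.append(
            urllib.request.Request(
                url=f"{base}/allocations/{consumer_uuid}",
                method="DELETE",
                headers={"X-Auth-Token": token},
            )
        )
    return requests


# Placeholder endpoint and token; the UUID is taken from the table in step 2.
pending = build_allocation_delete_requests(
    "http://placement.example/placement",
    "PLACEHOLDER_TOKEN",
    ["8d0a42fe-922b-4c08-afe3-65d65893d355"],
)
for request in pending:
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Only consumers confirmed to have no instance record should be passed in, since the call removes all allocations held by that consumer.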

Environment
===========
At least Stein through master seem to be affected.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/702368

Changed in nova:
assignee: nobody → Alexandre arents (aarents)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/702368
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f35930eef8fa27ee972e87366abb38596839fdba
Submitter: Zuul
Branch: master

commit f35930eef8fa27ee972e87366abb38596839fdba
Author: Alexandre Arents <email address hidden>
Date: Mon Jan 13 15:53:24 2020 +0000

    Avoid allocation leak when deleting instance stuck in BUILD

    During instance build, conductor claim resources to scheduler
    and create instance DB entry in cell.

    If for any reason conductor is not able to complete a build after
    instance claim (ex: AMQP issues, conductor restart before build completes)
    and in the mean time user requests deletion of its stuck instance in BUILD,
    nova api will delete build_request but let allocation in place resulting
    in a leak.

    The change proposes that nova api ensures allocation cleanup is made
    in case of ongoing/incomplete build.
    Note that because build did not reach a cell, compute is not able to heal
    allocation during its periodic update_available_resource task.
    Furthermore, it ensures that instance mapping is also queued for deletion.

    Change-Id: I4d3193d8401614311010ed0e055fcb3aaeeebaed
    Closes-Bug: #1859496
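
The behaviour described in the commit message can be pictured with a small in-memory model. This is a simplified sketch under stated assumptions, not the nova code: `delete_instance`, its parameters, and the plain sets and dicts standing in for the API database, the cell database, and placement are all hypothetical, and the VCPU/MEMORY_MB/DISK_GB amounts are assumed values for the m1.tiny flavor. The `cleanup_on_build_delete` flag switches between the old behaviour (allocation left behind) and the behaviour the change proposes.

```python
def delete_instance(uuid, build_requests, cell_instances, instance_mappings,
                    allocations, cleanup_on_build_delete=True):
    """Model of an API-side delete for an instance that may never reach a cell."""
    if uuid in cell_instances:
        # Normal path: the instance exists in a cell and is cleaned up there.
        cell_instances.discard(uuid)
        allocations.pop(uuid, None)
        return "deleted from cell"
    if uuid in build_requests:
        # Build never reached a cell: only the build request exists.
        build_requests.discard(uuid)
        if cleanup_on_build_delete:
            # Proposed behaviour: drop the allocation and queue the
            # instance mapping for deletion as well.
            allocations.pop(uuid, None)
            instance_mappings.setdefault(uuid, {})["queued_for_delete"] = True
        return "build request deleted"
    return "not found"


UUID = "eba20a0f-5856-4600-bcaa-7b758d04b5c5"


def fresh_state():
    """State after the scheduler claimed resources but before the cell DB entry exists."""
    return {
        "build_requests": {UUID},
        "cell_instances": set(),
        "instance_mappings": {UUID: {"queued_for_delete": False}},
        "allocations": {UUID: {"VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 1}},
    }


old = fresh_state()
delete_instance(UUID, cleanup_on_build_delete=False, **old)
new = fresh_state()
delete_instance(UUID, **new)

print("old behaviour leaks allocation:", UUID in old["allocations"])
print("new behaviour leaks allocation:", UUID in new["allocations"])
```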

Changed in nova:
status: In Progress → Fix Released