instance stuck in BUILD state if nova-compute is restarted

Bug #1833581 reported by Balazs Gibizer on 2019-06-20
This bug affects 2 people
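+--------------------------+------------+----------------+
| Affects                  | Importance | Assigned to    |
+--------------------------+------------+----------------+
| OpenStack Compute (nova) | Low        | Balazs Gibizer |
| Pike                     | Low        | Balazs Gibizer |
| Queens                   | Low        | Elod Illes     |
| Rocky                    | Low        | Balazs Gibizer |
| Stein                    | Low        | Balazs Gibizer |
| Train                    | Low        | Balazs Gibizer |
+--------------------------+------------+----------------+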

Bug Description

Description
===========
An instance gets stuck in the BUILD state indefinitely if the nova-compute service is restarted while the instance is being built. Even after instance_build_timeout expires, the instance is not put into the ERROR state.

Steps to reproduce
==================

1) Start 10 VMs in parallel to increase the chance of hitting the bug

$ for NUM in `seq 1 1 10`; do openstack server create --flavor c1 --image cirros-0.4.0-x86_64-disk --availability-zone nova:ubuntu vm$NUM & done

2) When the first instance reaches the BUILD state, restart the nova-compute service
$ sudo systemctl restart <email address hidden>

3) Observe the instance states after the compute service is up again.

Expected result
===============

All instances are either in ACTIVE or in ERROR state.

Actual result
=============
Some instances are stuck in the BUILD state.

Environment
===========

All-in-one devstack built from recent nova master 61558f274842b149044a14bbe7537b9f278035fd

Logs & Configs
==============

stack@ubuntu:~$ openstack server list
+--------------------------------------+------+--------+------------------------------------+--------------------------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------+--------+------------------------------------+--------------------------+-----------+
| 9ee76601-4a61-4682-86f1-743dac2b05e6 | vm3 | BUILD | | cirros-0.4.0-x86_64-disk | cirros256 |
| e459beae-ccb5-4781-b938-2dff68e33bf7 | vm9 | ACTIVE | public=2001:db8::181, 172.24.4.44 | cirros-0.4.0-x86_64-disk | cirros256 |
| 562f44db-cd51-4516-bce9-598bd29c6310 | vm10 | ERROR | public=2001:db8::3a1, 172.24.4.196 | cirros-0.4.0-x86_64-disk | cirros256 |
| 73f1e2c6-78a1-44c5-b178-7adcf9bf58a0 | vm5 | ERROR | public=2001:db8::21, 172.24.4.177 | cirros-0.4.0-x86_64-disk | cirros256 |
| 1b01acfc-b798-48f9-b808-6cfd0d5cd3fb | vm6 | ERROR | public=2001:db8::3e1, 172.24.4.20 | cirros-0.4.0-x86_64-disk | cirros256 |
| c709e3bf-9c71-4f64-bad3-e9e07e911f62 | vm7 | ERROR | public=2001:db8::231, 172.24.4.46 | cirros-0.4.0-x86_64-disk | cirros256 |
| 538d2534-98f1-4e11-9bbb-b4e74bab8c65 | vm4 | ERROR | public=2001:db8::3e9, 172.24.4.157 | cirros-0.4.0-x86_64-disk | cirros256 |
| ed74eb32-00fe-4f24-9379-c57c04ce9af1 | vm2 | ERROR | public=2001:db8::f5, 172.24.4.53 | cirros-0.4.0-x86_64-disk | cirros256 |
| 582b5356-4f3d-42ed-937e-966580303af0 | vm8 | ERROR | public=2001:db8::92, 172.24.4.16 | cirros-0.4.0-x86_64-disk | cirros256 |
| ae36ffca-e4d6-4353-8e7e-41db500a5e0d | vm1 | ERROR | public=2001:db8::1cf, 172.24.4.203 | cirros-0.4.0-x86_64-disk | cirros256 |
+--------------------------------------+------+--------+------------------------------------+--------------------------+-----------+

stack@ubuntu:~$ openstack server show 9ee76601-4a61-4682-86f1-743dac2b05e6
+-------------------------------------+-----------------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | instance-0000004c |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| config_drive | |
| created | 2019-06-19T02:30:16Z |
| flavor | cirros256 (c1) |
| hostId | |
| id | 9ee76601-4a61-4682-86f1-743dac2b05e6 |
| image | cirros-0.4.0-x86_64-disk (8b88f518-ab48-4859-8e8c-6988911ce9bd) |
| key_name | None |
| name | vm3 |
| progress | 0 |
| project_id | 2fc0b14ea1e041998f420ec85a89314d |
| properties | |
| status | BUILD |
| updated | 2019-06-19T02:30:18Z |
| user_id | 262d29f5f0c3445abbde89723b5f01ee |
| volumes_attached | |
+-------------------------------------+-----------------------------------------------------------------+
stack@ubuntu:~$

mysql> select uuid, host from instances where instances.uuid='9ee76601-4a61-4682-86f1-743dac2b05e6';
+--------------------------------------+------+
| uuid | host |
+--------------------------------------+------+
| 9ee76601-4a61-4682-86f1-743dac2b05e6 | NULL |
+--------------------------------------+------+
1 row in set (0.00 sec)

Logs for 9ee76601-4a61-4682-86f1-743dac2b05e6: http://paste.openstack.org/show/753228/
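
The symptom shown above (status BUILD, no host, and an old creation time) can be checked mechanically. The following is only a sketch: it assumes each server is available as a dict keyed by the field names printed by `openstack server show`, with a missing host represented as Python None. The helper name `looks_stuck` and the 3600 second threshold are hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def looks_stuck(server: Dict[str, Optional[str]], now: datetime,
                timeout_s: int) -> bool:
    """Return True for a server in BUILD with no host that is older than timeout_s."""
    created = datetime.strptime(
        str(server["created"]), "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (
        server["status"] == "BUILD"
        and server["OS-EXT-SRV-ATTR:host"] is None
        and now - created > timedelta(seconds=timeout_s)
    )


# Values copied from the `openstack server show` output for vm3.
vm3 = {
    "name": "vm3",
    "status": "BUILD",
    "OS-EXT-SRV-ATTR:host": None,
    "created": "2019-06-19T02:30:16Z",
}
# Hypothetical observation time, about 90 minutes after creation.
now = datetime(2019, 6, 19, 4, 0, 0, tzinfo=timezone.utc)

vm3_stuck = looks_stuck(vm3, now, timeout_s=3600)
# The same record in ACTIVE status is not flagged.
active_stuck = looks_stuck(dict(vm3, status="ACTIVE"), now, timeout_s=3600)
print(vm3_stuck, active_stuck)
```

A check like this only flags candidates; it cannot tell an interrupted build from one that is merely slow.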

Balazs Gibizer (balazs-gibizer) wrote :

Different instance states after the compute restart:

* ERROR: the instance already has instance.host set in the DB, so the compute startup detects it and pushes it to the ERROR state.
* ACTIVE: either the instance was already spawned successfully before the compute was stopped, or the build request was still in flight in AMQP when the compute stopped.
* BUILD: the build request reached the compute before it was stopped, but instance.host was not set because the instance_claim did not finish before the compute was stopped. When the compute starts again, it does not detect this instance because the instance is not assigned to its host.

There is a periodic job in the compute that ERRORs out instances according to the instance_build_timeout config[1]. However, it also checks only instances assigned to the compute host, so it does not push the stuck instance to ERROR.

[1]https://github.com/openstack/nova/blob/c18f7f47f628e266e5b69f4b9733a0f25ed4ffdd/nova/compute/manager.py#L1433
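
Why the timeout never fires for the stuck instance can be illustrated with a small model. This is a sketch and not nova's code: the `Instance` class, the `expired_builds` function, the "claimed" and "stuck" labels, and the 600 second timeout are all hypothetical. Only the idea that the check is filtered by compute host comes from the analysis above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record (not nova's object)."""
    uuid: str
    vm_state: str
    host: Optional[str]
    created_at: datetime


def expired_builds(instances: List[Instance], my_host: str,
                   timeout_s: int, now: datetime) -> List[Instance]:
    """Model of a build-timeout check that only looks at this host's instances."""
    deadline = now - timedelta(seconds=timeout_s)
    return [
        inst for inst in instances
        # The host filter hides the stuck instance, because its host is None.
        if inst.host == my_host
        and inst.vm_state == "building"
        and inst.created_at < deadline
    ]


created = datetime(2019, 6, 19, 2, 30, 16)
now = datetime(2019, 6, 19, 3, 30, 16)
instances = [
    Instance("claimed", "building", "ubuntu", created),  # claim finished
    Instance("stuck", "building", None, created),        # claim interrupted
]
found = expired_builds(instances, "ubuntu", timeout_s=600, now=now)
print([inst.uuid for inst in found])
```

In this model both instances are past the timeout, but only the one with a host is selected, which matches the behaviour described for the BUILD case.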

tags: added: compute
Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)

Fix proposed to branch: master
Review: https://review.opendev.org/666857

Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → Low

Related fix proposed to branch: master
Review: https://review.opendev.org/667397

Change abandoned by Balazs Gibizer (<email address hidden>) on branch: master
Review: https://review.opendev.org/667397

Change abandoned by Balazs Gibizer (<email address hidden>) on branch: master
Review: https://review.opendev.org/667396

Matt Riedemann (mriedem) wrote :

I hit similar issues where the instance was stuck in BUILD status without a host set. In my case the cause was an overloaded cell conductor, so I was getting MessagingTimeout errors during the build. I left the details in a comment on this change: https://review.opendev.org/#/c/667913/. Anyway, just another data point.

Changed in nova:
assignee: Balazs Gibizer (balazs-gibizer) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2019-08-01
Changed in nova:
assignee: Matt Riedemann (mriedem) → Balazs Gibizer (balazs-gibizer)

Reviewed: https://review.opendev.org/667913
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d2e0bd81df6a732f9c78df29538db89dda37b246
Submitter: Zuul
Branch: master

commit d2e0bd81df6a732f9c78df29538db89dda37b246
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 17:13:31 2019 +0200

    Functional reproduce for bug 1833581

    Change-Id: Id112098ef7603d0e514120ac9b7ed861dfa32bd3
    Related-Bug: #1833581

Matt Riedemann (mriedem) wrote :

This is extremely latent but I've marked it going back to at least queens since that's currently our oldest non-extended maintenance branch.

Reviewed: https://review.opendev.org/666857
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a1a735bc6efa40d8277c9fc5339f3b74f968b58e
Submitter: Zuul
Branch: master

commit a1a735bc6efa40d8277c9fc5339f3b74f968b58e
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 16:48:14 2019 +0200

    Error out interrupted builds

    If the compute service is restarted while build requests are
    executing the instance_claim or waiting for the COMPUTE_RESOURCE_SEMAPHORE
    then those instances will be stuck forever in BUILDING state. If the instance
    already finished instance_claim then instance.host is set and when the
    compute restarts the instance is put to ERROR state.

    This patch changes compute service startup to put instances into
    ERROR state if they a) are in the BUILDING state, and b) have
    allocations on the compute resource provider, but c) do not have
    instance.host set to that compute.

    Change-Id: I856a3032c83fc2f605d8c9b6e5aa3bcfa415f96a
    Closes-Bug: #1833581
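
The three conditions named in the commit message above can be sketched as a simple filter. This is an illustration under stated assumptions, not the code of the patch: the `Instance` class, the function name `error_out_interrupted_builds`, and the `allocated_uuids` set (standing in for the consumers that hold allocations on the compute resource provider) are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

BUILDING = "building"
ERROR = "error"


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record (not nova's object)."""
    uuid: str
    vm_state: str
    host: Optional[str]


def error_out_interrupted_builds(instances: Iterable[Instance],
                                 allocated_uuids: Set[str],
                                 my_host: str) -> List[str]:
    """Set matching instances to ERROR and return their UUIDs."""
    errored = []
    for inst in instances:
        if (inst.vm_state == BUILDING               # a) still building
                and inst.uuid in allocated_uuids    # b) has allocations here
                and inst.host != my_host):          # c) host not set to this compute
            inst.vm_state = ERROR
            errored.append(inst.uuid)
    return errored


instances = [
    Instance("vm3", BUILDING, None),      # interrupted before the claim finished
    Instance("vm9", "active", "ubuntu"),  # spawned before the restart
    Instance("other", BUILDING, None),    # no allocation on this provider
]
errored = error_out_interrupted_builds(instances, {"vm3", "vm9"}, "ubuntu")
print(errored)
```

Condition b) is what ties a host-less instance to a particular compute: without it, a restarting compute could not tell its own interrupted builds from builds that belong elsewhere.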

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/687216
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=06fd7c730172190d7bf7d52bc9062eecba8d7d27
Submitter: Zuul
Branch: stable/train

commit 06fd7c730172190d7bf7d52bc9062eecba8d7d27
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 16:48:14 2019 +0200

    Error out interrupted builds

    If the compute service is restarted while build requests are
    executing the instance_claim or waiting for the COMPUTE_RESOURCE_SEMAPHORE
    then those instances will be stuck forever in BUILDING state. If the instance
    already finished instance_claim then instance.host is set and when the
    compute restarts the instance is put to ERROR state.

    This patch changes compute service startup to put instances into
    ERROR state if they a) are in the BUILDING state, and b) have
    allocations on the compute resource provider, but c) do not have
    instance.host set to that compute.

    Change-Id: I856a3032c83fc2f605d8c9b6e5aa3bcfa415f96a
    Closes-Bug: #1833581
    (cherry picked from commit a1a735bc6efa40d8277c9fc5339f3b74f968b58e)

Reviewed: https://review.opendev.org/687534
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e4a5516098454775e1c5d5f631308bfa9abf7167
Submitter: Zuul
Branch: stable/stein

commit e4a5516098454775e1c5d5f631308bfa9abf7167
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 17:13:31 2019 +0200

    Functional reproduce for bug 1833581

    Change-Id: Id112098ef7603d0e514120ac9b7ed861dfa32bd3
    Related-Bug: #1833581
    (cherry picked from commit d2e0bd81df6a732f9c78df29538db89dda37b246)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/687535
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=530ad1ae8884e5c87761277bb64fdb17b286e968
Submitter: Zuul
Branch: stable/stein

commit 530ad1ae8884e5c87761277bb64fdb17b286e968
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 16:48:14 2019 +0200

    Error out interrupted builds

    If the compute service is restarted while build requests are
    executing the instance_claim or waiting for the COMPUTE_RESOURCE_SEMAPHORE
    then those instances will be stuck forever in BUILDING state. If the instance
    already finished instance_claim then instance.host is set and when the
    compute restarts the instance is put to ERROR state.

    This patch changes compute service startup to put instances into
    ERROR state if they a) are in the BUILDING state, and b) have
    allocations on the compute resource provider, but c) do not have
    instance.host set to that compute.

    Conflicts:
          nova/tests/unit/compute/test_compute_mgr.py

    Conflict due to Ia1b3ab0b66fdaf569f6c7a09510f208ee28725b2 is not in
    stable/stein

    Change-Id: I856a3032c83fc2f605d8c9b6e5aa3bcfa415f96a
    Closes-Bug: #1833581
    (cherry picked from commit a1a735bc6efa40d8277c9fc5339f3b74f968b58e)
    (cherry picked from commit 06fd7c730172190d7bf7d52bc9062eecba8d7d27)

Reviewed: https://review.opendev.org/687564
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=19ca978bd48be1990db5a09fadbd0eea58f9d6b7
Submitter: Zuul
Branch: stable/rocky

commit 19ca978bd48be1990db5a09fadbd0eea58f9d6b7
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 17:13:31 2019 +0200

    Functional reproduce for bug 1833581

    Change-Id: Id112098ef7603d0e514120ac9b7ed861dfa32bd3
    Related-Bug: #1833581
    (cherry picked from commit d2e0bd81df6a732f9c78df29538db89dda37b246)
    (cherry picked from commit 48d066a4193940815094c2ab8299db543aa514e5)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/687565
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=13bb7ed701121955ba015103c2e44429927e78d4
Submitter: Zuul
Branch: stable/rocky

commit 13bb7ed701121955ba015103c2e44429927e78d4
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 16:48:14 2019 +0200

    Error out interrupted builds

    If the compute service is restarted while build requests are
    executing the instance_claim or waiting for the COMPUTE_RESOURCE_SEMAPHORE
    then those instances will be stuck forever in BUILDING state. If the instance
    already finished instance_claim then instance.host is set and when the
    compute restarts the instance is put to ERROR state.

    This patch changes compute service startup to put instances into
    ERROR state if they a) are in the BUILDING state, and b) have
    allocations on the compute resource provider, but c) do not have
    instance.host set to that compute.

    Conflicts:
          nova/tests/unit/compute/test_compute_mgr.py
          nova/compute/manager.py

    Conflict due to Ia1b3ab0b66fdaf569f6c7a09510f208ee28725b2 and
    I020e7dc47efc79f8907b7bfb753ec779a8da69a1 is not in stable/rocky

    Change-Id: I856a3032c83fc2f605d8c9b6e5aa3bcfa415f96a
    Closes-Bug: #1833581
    (cherry picked from commit a1a735bc6efa40d8277c9fc5339f3b74f968b58e)
    (cherry picked from commit 06fd7c730172190d7bf7d52bc9062eecba8d7d27)
    (cherry picked from commit cb951cbcb246221e04a063cd7b5ae2e83ddfe6dd)

Reviewed: https://review.opendev.org/687877
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=23d65bcf1e82ecdd4eff1a14cc8c8c8eb473d036
Submitter: Zuul
Branch: stable/queens

commit 23d65bcf1e82ecdd4eff1a14cc8c8c8eb473d036
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 17:13:31 2019 +0200

    Functional reproduce for bug 1833581

    Conflicts:
          nova/tests/functional/compute/test_init_host.py

    Conflict is due to Iea283322124cb35fc0bc6d25f35548621e8c8c2f is missing
    from stable/queens

    Change-Id: Id112098ef7603d0e514120ac9b7ed861dfa32bd3
    Related-Bug: #1833581
    (cherry picked from commit d2e0bd81df6a732f9c78df29538db89dda37b246)
    (cherry picked from commit 48d066a4193940815094c2ab8299db543aa514e5)
    (cherry picked from commit 19ca978bd48be1990db5a09fadbd0eea58f9d6b7)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/687878
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4164b96de9f62fdc35a12adf514d767460187d55
Submitter: Zuul
Branch: stable/queens

commit 4164b96de9f62fdc35a12adf514d767460187d55
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 16:48:14 2019 +0200

    Error out interrupted builds

    If the compute service is restarted while build requests are
    executing the instance_claim or waiting for the COMPUTE_RESOURCE_SEMAPHORE
    then those instances will be stuck forever in BUILDING state. If the instance
    already finished instance_claim then instance.host is set and when the
    compute restarts the instance is put to ERROR state.

    This patch changes compute service startup to put instances into
    ERROR state if they a) are in the BUILDING state, and b) have
    allocations on the compute resource provider, but c) do not have
    instance.host set to that compute.

    Change-Id: I856a3032c83fc2f605d8c9b6e5aa3bcfa415f96a
    Closes-Bug: #1833581
    (cherry picked from commit a1a735bc6efa40d8277c9fc5339f3b74f968b58e)
    (cherry picked from commit 06fd7c730172190d7bf7d52bc9062eecba8d7d27)
    (cherry picked from commit cb951cbcb246221e04a063cd7b5ae2e83ddfe6dd)
    (cherry picked from commit 13bb7ed701121955ba015103c2e44429927e78d4)

This issue was fixed in the openstack/nova 20.0.1 release.

Reviewed: https://review.opendev.org/687917
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a84dbeab6245af38886c9da235872209e63a191e
Submitter: Zuul
Branch: stable/pike

commit a84dbeab6245af38886c9da235872209e63a191e
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 17:13:31 2019 +0200

    Functional reproduce for bug 1833581

    Conflicts:
          nova/tests/functional/compute/test_init_host.py
    Note: conflict is due to needed changes in Pike version of patch
    I107d842520c088b4859a3b36621ce6bd8e970475 (the missing last assert)

    Additional changes in test_init_host.py compared to Queens:
    * Notification handling is changed as
      Ie4676eed0039c927b35af7573f0b57fd762adbaa is not in Pike.

    Change-Id: Id112098ef7603d0e514120ac9b7ed861dfa32bd3
    Related-Bug: #1833581
    (cherry picked from commit d2e0bd81df6a732f9c78df29538db89dda37b246)
    (cherry picked from commit 48d066a4193940815094c2ab8299db543aa514e5)
    (cherry picked from commit 19ca978bd48be1990db5a09fadbd0eea58f9d6b7)
    (cherry picked from commit 23d65bcf1e82ecdd4eff1a14cc8c8c8eb473d036)

tags: added: in-stable-pike

Reviewed: https://review.opendev.org/687918
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e5892ed61b5f4f4f581384e245d5052e7bf840b2
Submitter: Zuul
Branch: stable/pike

commit e5892ed61b5f4f4f581384e245d5052e7bf840b2
Author: Balazs Gibizer <email address hidden>
Date: Fri Jun 21 16:48:14 2019 +0200

    Error out interrupted builds

    If the compute service is restarted while build requests are
    executing the instance_claim or waiting for the COMPUTE_RESOURCE_SEMAPHORE
    then those instances will be stuck forever in BUILDING state. If the instance
    already finished instance_claim then instance.host is set and when the
    compute restarts the instance is put to ERROR state.

    This patch changes compute service startup to put instances into
    ERROR state if they a) are in the BUILDING state, and b) have
    allocations on the compute resource provider, but c) do not have
    instance.host set to that compute.

    Note: changes in manager.py and test_compute_mgr.py compared to Queens:
    * the signature change of the get_allocations_for_resource_provider
      call is due to I7891b98f225f97ad47f189afb9110ef31c810717 is missing from
      stable/pike.
    * the VirtDriverNotReady exception does not exists in pike as
      Ib0ec1012b74e9a9e74c8879f3feed5f9332b711f is missing. In pike ironic
      returns an empty node list instead of raising an exception so the bugfix
      and the test is adapted accordingly.

    Change-Id: I856a3032c83fc2f605d8c9b6e5aa3bcfa415f96a
    Closes-Bug: #1833581
    (cherry picked from commit a1a735bc6efa40d8277c9fc5339f3b74f968b58e)
    (cherry picked from commit 06fd7c730172190d7bf7d52bc9062eecba8d7d27)
    (cherry picked from commit cb951cbcb246221e04a063cd7b5ae2e83ddfe6dd)
    (cherry picked from commit 13bb7ed701121955ba015103c2e44429927e78d4)
    (cherry picked from commit 4164b96de9f62fdc35a12adf514d767460187d55)

This issue was fixed in the openstack/nova 19.1.0 release.

This issue was fixed in the openstack/nova 18.3.0 release.
