instance stuck in BUILD state if nova-compute is restarted
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Low | Balazs Gibizer |
Pike | Fix Released | Low | Balazs Gibizer |
Queens | Fix Released | Low | Elod Illes |
Rocky | Fix Committed | Low | Balazs Gibizer |
Stein | Fix Committed | Low | Balazs Gibizer |
Train | Fix Committed | Low | Balazs Gibizer |
Bug Description
Description
===========
An instance stays stuck in BUILD state indefinitely if the nova-compute service is restarted in the meantime. Even after the instance_
Steps to reproduce
==================
1) Start 10 VMs in parallel to increase the chance of hitting the bug
$ for NUM in `seq 1 1 10`; do openstack server create --flavor c1 --image cirros-
2) When the first instance reaches the BUILD state, restart the nova-compute service
$ sudo systemctl restart <email address hidden>
3) Observe the instance states after the compute is up again.
Expected result
===============
Instances either in ACTIVE or in ERROR state.
Actual result
=============
Some instances are stuck in BUILD state.
Environment
===========
all in one devstack build from recent nova master 61558f274842b14
Logs & Configs
==============
stack@ubuntu:~$ openstack server list
+------
| ID | Name | Status | Networks | Image | Flavor |
+------
| 9ee76601-
| e459beae-
| 562f44db-
| 73f1e2c6-
| 1b01acfc-
| c709e3bf-
| 538d2534-
| ed74eb32-
| 582b5356-
| ae36ffca-
+------
stack@ubuntu:~$ openstack server show 9ee76601-
+------
| Field | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | building |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| config_drive | |
| created | 2019-06-
| flavor | cirros256 (c1) |
| hostId | |
| id | 9ee76601-
| image | cirros-
| key_name | None |
| name | vm3 |
| progress | 0 |
| project_id | 2fc0b14ea1e0419
| properties | |
| status | BUILD |
| updated | 2019-06-
| user_id | 262d29f5f0c3445
| volumes_attached | |
+------
stack@ubuntu:~$
mysql> select uuid, host from instances where instances.
+------
| uuid | host |
+------
| 9ee76601-
+------
1 row in set (0.00 sec)
Logs for 9ee76601-
Changed in nova: | |
assignee: | nobody → Balazs Gibizer (balazs-gibizer) |
Changed in nova: | |
importance: | Undecided → Low |
Changed in nova: | |
assignee: | Balazs Gibizer (balazs-gibizer) → Matt Riedemann (mriedem) |
Changed in nova: | |
assignee: | Matt Riedemann (mriedem) → Balazs Gibizer (balazs-gibizer) |
Different instance states after the compute restart:
* ERROR: the instance already has instance.host set in the db, so the compute startup detects it and pushes it to ERROR state.
* ACTIVE: either the instance was already spawned successfully before the compute was stopped, or the build request was still in flight in AMQP when the compute stopped.
* BUILD: the build request reached the compute before it was stopped, but instance.host wasn't set because the instance_claim did not finish before the compute was stopped. When the compute starts again, it does not detect this instance because the instance is not assigned to its host.
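The three outcomes above can be illustrated with a small model. This is a simplified sketch, not nova's code: `Instance` and `recover_on_startup` are hypothetical names, and the only behavior modeled is that the startup check looks at instances whose host matches the restarting compute.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    uuid: str
    vm_state: str          # "building", "active" or "error"
    host: Optional[str]    # None until the claim has set instance.host


def recover_on_startup(instances: List[Instance], my_host: str) -> List[Instance]:
    """Hypothetical stand-in for the compute startup check.

    Only instances already assigned to this host are inspected, so an
    instance whose host is still None is never touched.
    """
    for inst in instances:
        if inst.host != my_host:
            continue  # not assigned to this compute: invisible to the check
        if inst.vm_state == "building":
            inst.vm_state = "error"  # interrupted build is pushed to ERROR
    return instances


servers = [
    Instance("a", "building", "compute1"),  # host set before the restart
    Instance("b", "building", None),        # claim did not finish
    Instance("c", "active", "compute1"),    # spawned before the restart
]
recover_on_startup(servers, "compute1")
states = {s.uuid: s.vm_state for s in servers}
```

In this model instance `b` keeps the `building` state after the restart, which matches the stuck BUILD case described in the report.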
There is a periodic job in the compute that puts instances into ERROR according to the instance_build_timeout config [1]. However, it also checks only instances assigned to the compute host, so it does not push the stuck instance to ERROR.
[1] https://github.com/openstack/nova/blob/c18f7f47f628e266e5b69f4b9733a0f25ed4ffdd/nova/compute/manager.py#L1433
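The gap in the periodic timeout check can be sketched as follows. This is an illustrative model under stated assumptions, not the code at [1]: `Instance`, `find_timed_out_builds` and the `build_timeout_s` parameter are hypothetical names that stand in for a host-filtered query combined with the instance_build_timeout setting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Instance:
    uuid: str
    vm_state: str
    host: Optional[str]    # None if the claim never set instance.host
    created_at: datetime


def find_timed_out_builds(instances: List[Instance], my_host: str,
                          build_timeout_s: int, now: datetime) -> List[Instance]:
    """Hypothetical periodic check: building instances on this host
    that are older than the build timeout."""
    cutoff = now - timedelta(seconds=build_timeout_s)
    return [
        inst for inst in instances
        if inst.host == my_host            # host filter: host=None never matches
        and inst.vm_state == "building"
        and inst.created_at <= cutoff
    ]


now = datetime(2019, 6, 1, 12, 0, 0)   # arbitrary timestamp for the example
old = now - timedelta(hours=1)
servers = [
    Instance("assigned", "building", "compute1", old),
    Instance("stuck", "building", None, old),
]
timed_out = [i.uuid for i in find_timed_out_builds(servers, "compute1", 600, now)]
```

Both instances are equally old, but only the one assigned to `compute1` is returned. The instance without a host is skipped by the host filter, so in this model no timeout value would ever move it to ERROR.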