new instance gets stuck indefinitely at build state with task_state none
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Invalid | Undecided | Unassigned |
Bug Description
Description
===========
The nova-compute service is reported as up but does not do any work.
New instances that get scheduled on that compute node get stuck in the build state with task_state None,
and they do not go to the ERROR state even after the instance build timeout threshold is reached.
(openstack) server show 9299bee1-
+------
| Field | Value |
+------
| OS-DCF:diskConfig | AUTO |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | building |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| config_drive | |
| created | 2021-07-
| flavor | i1.mini (75253a8f-
| hostId | |
| id | 9299bee1-
| image | |
| key_name | Sia-KP |
| name | qwerty-17 |
| progress | 0 |
| project_id | c4a93f6c1c194bf
| properties | |
| status | BUILD |
| updated | 2021-07-
| user_id | 042131e0784b462
| volumes_attached | |
+------
I have two OpenStack setups (staging and production). This issue happens on both of them, but randomly on
different compute nodes. Both setups run the stable/ussuri release and were deployed using openstack-ansible.
There were no errors in the nova logs. After I enabled debug on the nova services, it caught my eye that on the corrupted
compute node the logs stopped some time before the problem occurred.
Compute service list while this issue is happening (CP-12 is the corrupted compute node):
(openstack) compute service list
+-----+
| ID | Binary | Host | Zone | Status | State | Updated At |
+-----+
| 7 | nova-conductor | SHN-CN-
| 34 | nova-scheduler | SHN-CN-
| 85 | nova-conductor | SHN-CN-
| 91 | nova-conductor | SHN-CN-
| 109 | nova-scheduler | SHN-CN-
| 157 | nova-scheduler | SHN-CN-
| 199 | nova-compute | SHN-CP-72 | nova | enabled | up | 2021-07-
.
.
.
| 232 | nova-compute | SHN-CP-18 | nova | enabled | up | 2021-07-
| 235 | nova-compute | SHN-CP-12 | nova | enabled | up | 2021-07-
| 238 | nova-compute | SHN-CP-20 | nova | enabled | up | 2021-07-
| 241 | nova-compute | SHN-CP-22 | nova | enabled | up | 2021-07-
+-----+
Restarting nova-compute resolves the issue until it happens again.
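A minimal sketch of that workaround, assuming root access on the affected compute node and that the service runs under the systemd unit `nova-compute.service` (the unit name used in the `journalctl` output in this report). The function name `restart_hung_compute` is hypothetical. It prints the last journal entry first, so the time at which logging stopped is recorded before the restart hides it.

```shell
# Hypothetical helper, meant to be run as root on the affected compute node.
# Prints the most recent nova-compute journal entry, then restarts the
# service. This only works around the hang; it does not fix the cause.
restart_hung_compute() {
    journalctl -u nova-compute.service -n 1 --no-pager
    systemctl restart nova-compute.service
}
```

Calling `restart_hung_compute` on the node is enough; it takes no arguments.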
Steps to reproduce
==================
- This happens only sometimes, not always.
- Create multiple instances to raise the probability of hitting this issue.
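The second step could be scripted roughly as follows. This is a sketch under assumptions: the `openstack` client is installed and credentials are loaded, and `IMAGE`, `NETWORK`, the `repro-` name prefix and the function name are hypothetical placeholders. Only the flavor `i1.mini` is taken from the output above.

```shell
# Hypothetical helper: boot COUNT instances one after another to raise the
# chance that one of them is scheduled on a hung compute node.
# IMAGE and NETWORK are placeholders and must be set by the caller.
create_repro_instances() {
    count="${1:-10}"
    i=1
    while [ "$i" -le "$count" ]; do
        openstack server create \
            --flavor i1.mini \
            --image "${IMAGE:?set IMAGE}" \
            --network "${NETWORK:?set NETWORK}" \
            "repro-$i"
        i=$((i + 1))
    done
}
```

For example, `IMAGE=my-image NETWORK=my-net create_repro_instances 20` would request twenty instances; any that stay in BUILD afterwards are candidates for this bug.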
Expected result
===============
Either the nova-compute service goes to the down state, or the instance goes to the ERROR state, or a warning or error appears in the nova logs.
Actual result
=============
Instances scheduled on the corrupted compute node (which, BTW, is picked randomly) get stuck indefinitely in the BUILD state
with task_state None.
Environment
===========
OSA deployment of stable/ussuri on Ubuntu, with install_
This problem started after I separated the RPC RabbitMQ cluster from the notify RabbitMQ cluster (not sure if this is related, but
that's when it started happening).
It is also worth mentioning that this issue happens on both of my setups.
Logs & Configs
==============
This is the log from just before the nova-compute service stopped logging:
https:/
This is the nova-compute log from when the instance got scheduled on the node:
# journalctl -u nova-compute.
-- Logs begin at Mon 2021-05-31 04:36:00 UTC, end at Sat 2021-07-17 16:23:38 UTC. --
Jul 17 11:49:41 SHN-CP-12 nova-compute[
Jul 17 11:49:41 SHN-CP-12 nova-compute[
Jul 17 11:49:41 SHN-CP-12 nova-compute[
Jul 17 11:49:41 SHN-CP-12 nova-compute[
Are those the only logs about the creation request req-05e8f6c5-ee92-4399-8bad-1184dc45214f on the compute host? If you can still reproduce this, try the following commands to confirm how far the instance build gets on the compute node before things freeze:
$ openstack server event list $instance
$ openstack server event show $instance $request-id
root@compute $ journalctl -u nova-compute.service -g $request-id
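When several instances are stuck, the first of these checks can be wrapped in a small helper. The sketch below assumes admin credentials (the `--all-projects` and `--host` filters of `openstack server list` need them); the function name `show_stuck_builds` is hypothetical.

```shell
# Hypothetical helper: list instances still in BUILD on one compute host and
# print the recorded instance actions for each of them.
show_stuck_builds() {
    host="$1"
    openstack server list --all-projects --host "$host" --status BUILD \
        -f value -c ID |
    while read -r instance; do
        echo "== $instance"
        # Keep the inner command from consuming the list of IDs on stdin.
        openstack server event list "$instance" < /dev/null
    done
}
```

For the node in this report the call would be `show_stuck_builds SHN-CP-12`. The request IDs it prints can then be passed to `openstack server event show` and used to search the nova-compute journal on that host.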