new instance gets stuck indefinitely at build state with task_state none

Bug #1936720 reported by Siavash Sardari
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Description
===========
The nova-compute service reports as up but does not do any work.

New instances that get scheduled on that compute node get stuck in the BUILD state with task_state None,
and they do not go to the ERROR state even after the instance build timeout threshold is reached.
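
A minimal sketch of why the timeout likely never fires, assuming the sweep driven by nova's instance_build_timeout option runs as a periodic task inside the same nova-compute process, so a hung service never executes it; the function and data layout below are illustrative, not nova's actual code:

import time

# Illustrative threshold; nova's instance_build_timeout defaults to 0, which
# disables the check entirely.
INSTANCE_BUILD_TIMEOUT = 600

def check_instance_build_time(instances, now=None):
    # Periodic sweep run by the compute service itself: anything still
    # 'building' past the threshold is flipped to 'error'. If nova-compute
    # is hung, this sweep never runs, which matches the behaviour above.
    now = now if now is not None else time.time()
    for inst in instances:
        still_building = inst['vm_state'] == 'building'
        if still_building and now - inst['created_at'] > INSTANCE_BUILD_TIMEOUT:
            inst['vm_state'] = 'error'  # nova would call instance.save() here
    return instances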

(openstack) server show 9299bee1-633d-4233-9f2b-9a7d1871d51b
+-------------------------------------+------------------------------------------------+
| Field | Value |
+-------------------------------------+------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | instance-0000bfb6 |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| config_drive | |
| created | 2021-07-17T11:49:35Z |
| flavor | i1.mini (75253a8f-eb7c-4473-9874-884a01a524a7) |
| hostId | |
| id | 9299bee1-633d-4233-9f2b-9a7d1871d51b |
| image | |
| key_name | Sia-KP |
| name | qwerty-17 |
| progress | 0 |
| project_id | c4a93f6c1c194bf78bd98ee0f4d51978 |
| properties | |
| status | BUILD |
| updated | 2021-07-17T11:49:41Z |
| user_id | 042131e0784b46218521eee7963022bf |
| volumes_attached | |
+-------------------------------------+------------------------------------------------+

I have two OpenStack setups (staging and production). This issue happens on both of them, but randomly on
different compute nodes. Both setups are on the stable/ussuri release and deployed using openstack-ansible.

There were no errors in the nova logs. I enabled debug on the nova services, and it caught my eye that on the affected
compute node the logging stopped some time before the problem occurs.

Compute service list while the issue is happening (CP-12 is the affected compute node):

(openstack) compute service list
+-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+
| 7 | nova-conductor | SHN-CN-61-nova-api-container-b11ef08e | internal | enabled | up | 2021-07-17T14:23:45.000000 |
| 34 | nova-scheduler | SHN-CN-61-nova-api-container-b11ef08e | internal | enabled | up | 2021-07-17T14:23:43.000000 |
| 85 | nova-conductor | SHN-CN-63-nova-api-container-e4f37374 | internal | enabled | up | 2021-07-17T14:23:41.000000 |
| 91 | nova-conductor | SHN-CN-62-nova-api-container-71ffd912 | internal | enabled | up | 2021-07-17T14:23:45.000000 |
| 109 | nova-scheduler | SHN-CN-63-nova-api-container-e4f37374 | internal | enabled | up | 2021-07-17T14:23:41.000000 |
| 157 | nova-scheduler | SHN-CN-62-nova-api-container-71ffd912 | internal | enabled | up | 2021-07-17T14:23:45.000000 |
| 199 | nova-compute | SHN-CP-72 | nova | enabled | up | 2021-07-17T14:23:41.000000 |
.
.
.
| 232 | nova-compute | SHN-CP-18 | nova | enabled | up | 2021-07-17T14:23:41.000000 |
| 235 | nova-compute | SHN-CP-12 | nova | enabled | up | 2021-07-17T14:23:41.000000 |
| 238 | nova-compute | SHN-CP-20 | nova | enabled | up | 2021-07-17T14:23:41.000000 |
| 241 | nova-compute | SHN-CP-22 | nova | enabled | up | 2021-07-17T14:23:41.000000 |
+-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+

Restarting nova-compute resolves the issue until it happens again.

Steps to reproduce
==================
- The issue does not happen every time; it occurs only intermittently.
- Create multiple instances to increase the probability of hitting the issue.

Expected result
===============
Either the nova-compute service should go to the down state, the instance should go to the ERROR state, or at least a warning or error should appear in the nova logs.

Actual result
=============
Instances that get scheduled on the affected compute node (which, by the way, happens randomly) are stuck indefinitely in the BUILD state
with task_state None.

Environment
===========
OSA deployment of stable/ussuri on Ubuntu, with install_method=source.
This problem started happening after I separated the RPC RabbitMQ cluster from the notify RabbitMQ cluster (not sure if this is
related, but that is when it started happening).
It is also worth mentioning that this issue happens on both of my setups.

Logs & Configs
==============
This is the log just before the nova-compute service stops logging:
https://paste.opendev.org/show/807547/

This is the nova-compute log when the instance gets scheduled on the node:

# journalctl -u nova-compute.service --since '2021-07-17 11:49:00' --until '2021-07-17 12:00:00' --no-pager
-- Logs begin at Mon 2021-05-31 04:36:00 UTC, end at Sat 2021-07-17 16:23:38 UTC. --
Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.086 3857910 DEBUG oslo_concurrency.lockutils [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] Lock "6a148ea3-6793-4e26-acb2-9dd1214a666d" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.000s inner /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/oslo_concurrency/lockutils.py:354
Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.105 3857910 DEBUG nova.compute.manager [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] [instance: 6a148ea3-6793-4e26-acb2-9dd1214a666d] Starting instance... _do_build_and_run_instance /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/nova/compute/manager.py:2173
Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.711 3857910 DEBUG oslo_concurrency.lockutils [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] Lock "9299bee1-633d-4233-9f2b-9a7d1871d51b" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.000s inner /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/oslo_concurrency/lockutils.py:354
Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.731 3857910 DEBUG nova.compute.manager [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] [instance: 9299bee1-633d-4233-9f2b-9a7d1871d51b] Starting instance... _do_build_and_run_instance /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/nova/compute/manager.py:2173

Lee Yarwood (lyarwood) wrote :

Are those the only logs about the creation request req-05e8f6c5-ee92-4399-8bad-1184dc45214f on the compute host? If you can still reproduce this, try using the following commands to confirm how far we get building the instance on the compute node before things freeze:

$ openstack server event list $instance
$ openstack server event show $instance $request_id

root@compute $ journalctl -u nova-compute.service --no-pager | grep $request_id

Changed in nova:
status: New → Incomplete
Siavash Sardari (siavash.sardari) wrote :

Actually the issue has happened twice since I posted this bug, but that was before I saw your comment. I'm sure it will happen again in the next couple of days and I will provide what you asked for.
BTW, pay attention to the journalctl time interval and the actual logs that I posted. I did not cut anything out: the nova-compute service receives the RPC call, starts _do_build_and_run_instance, and just hangs.
The same thing happens in https://paste.opendev.org/show/807547/; pay attention to the time interval and the timestamps of the logged output. In both cases, it looks like a lock is acquired but never released.
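
Below is a minimal sketch (not nova's actual code) of the failure mode the logs suggest: the build runs inside a per-instance lock, so if the code inside the lock blocks forever on a network call that is never answered, the "Lock ... acquired" debug line is the last thing logged for that instance and the lock is never released. The host, port, and request bytes are placeholders.

import socket

from oslo_concurrency import lockutils

@lockutils.synchronized('9299bee1-633d-4233-9f2b-9a7d1871d51b')
def locked_do_build_and_run_instance(api_host='203.0.113.10', api_port=8778):
    # With no socket timeout set, recv() blocks the calling thread (a
    # greenthread in nova) forever if the load balancer accepts the
    # connection but never sends a response, so the lock taken by the
    # decorator above is never released.
    sock = socket.create_connection((api_host, api_port))
    sock.sendall(b'GET /resource_providers HTTP/1.1\r\nHost: placement\r\n\r\n')
    return sock.recv(65536)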

Siavash Sardari (siavash.sardari) wrote :

The issue happened again.

(openstack) server show 3e204f98-b227-4ea0-b108-91d5e9c45694
+-------------------------------------+------------------------------------------------+
| Field | Value |
+-------------------------------------+------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | instance-0000c0e8 |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| config_drive | |
| created | 2021-07-24T10:07:21Z |
| flavor | i1.mini (75253a8f-eb7c-4473-9874-884a01a524a7) |
| hostId | |
| id | 3e204f98-b227-4ea0-b108-91d5e9c45694 |
| image | |
| key_name | Sia-KP |
| name | test-17 |
| progress | 0 |
| project_id | c4a93f6c1c194bf78bd98ee0f4d51978 |
| properties | |
| status | BUILD |
| updated | 2021-07-24T10:07:26Z |
| user_id | 042131e0784b46218521eee7963022bf |
| volumes_attached | |
+-------------------------------------+------------------------------------------------+
(openstack) server event list 3e204f98-b227-4ea0-b108-91d5e9c45694
+------------------------------------------+--------------------------------------+--------+----------------------------+
| Request ID | Server ID ...


Siavash Sardari (siavash.sardari) wrote :

Sorry, I ran the event show command with the arguments in the wrong order.

(openstack) server event show 3e204f98-b227-4ea0-b108-91d5e9c45694 req-aea42b8b-3baa-4e99-ae62-44ba845373e0
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| action | create |
| events | [{'event': 'compute__do_build_and_run_instance', 'start_time': '2021-07-24T10:07:26.000000', 'finish_time': None, 'result': None, 'traceback': None}] |
| instance_uuid | 3e204f98-b227-4ea0-b108-91d5e9c45694 |
| message | None |
| project_id | c4a93f6c1c194bf78bd98ee0f4d51978 |
| request_id | req-aea42b8b-3baa-4e99-ae62-44ba845373e0 |
| start_time | 2021-07-24T10:07:19.000000 |
| user_id | 042131e0784b46218521eee7963022bf |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+

And this is the nova-compute log at the moment it goes into a coma:
https://paste.opendev.org/show/807697/

Siavash Sardari (siavash.sardari) wrote :

Update: the root cause here was haproxy, which had reached about 4 billion tasks in its run queue. As a result, while the health checks of the nova services still passed, the nova-compute service was in a coma-like state because of unanswered HTTP calls made to update resource usage.
To me it is very strange that the nova-compute service hangs when HTTP calls are not answered; it should get a timeout or something.
To be honest, I don't have enough knowledge and time to dig further into this issue, and my problem is more or less solved, so for now I will change the status of this bug to Invalid. But if anyone is interested in digging deeper, I will be happy to help reproduce the issue.
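
As an illustration of the behaviour the reporter expected, here is a minimal sketch assuming the resource-usage update is an ordinary HTTP request; the URL, function name, and 30-second value are placeholders, not nova's actual client code. Giving the call an explicit timeout turns a silent hang into an exception that the service could log and retry:

import requests

def update_resource_usage(url='http://haproxy.example:8778/resource_providers'):
    try:
        # Without timeout=, requests waits for a response indefinitely; with
        # it, an unanswered call fails loudly after 30 seconds instead of
        # freezing the caller.
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.RequestException as exc:
        # A real service would log a warning and retry here rather than hang.
        print('resource usage update failed: %s' % exc)
        return None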

Changed in nova:
status: Incomplete → Invalid