Cannot create instances on Standard Dedicated Storage

Bug #1821841 reported by Juan Carlos Alonso
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Gerry Kopec

Bug Description

Title
-----

Cannot create instances on Standard Dedicated Storage on Bare Metal environment

Brief Description
-----------------

Cannot create instances on Standard Dedicated Storage on Bare Metal environment

controller-0:~$ openstack server list
+--------------------------------------+-------------+--------+----------+--------+----------+
| ID                                   | Name        | Status | Networks | Image  | Flavor   |
+--------------------------------------+-------------+--------+----------+--------+----------+
| 3240e3bd-8af2-429f-83e8-32927acfbd6d | vm-cirros-1 | ERROR  |          | cirros | f1.small |
+--------------------------------------+-------------+--------+----------+--------+----------+

controller-0:~$ openstack server show vm-cirros-1
{u'message': u'No valid host was found. ', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 1323, in schedule_and_build_instances\n instance_uuids, return_alternates=True)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 780, in _schedule_instances\n return_alternates=return_alternates)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 178, in call\n retry=self.retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/transport.py", line 128, in _send\n retry=retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send\n call_monitor_timeout, retry=retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 636, in _send\n raise result\n', u'created': u'2019-03-26T15:59:32Z'}

Severity
--------

Critical: System/Feature is not usable after the defect

Steps to Reproduce
------------------

$ openstack server create --flavor <flavor> --image <image> --nic <net-id> <vm-name>
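
For example, using the flavor, image, and server name from this report (network ID left as a placeholder):

$ openstack server create --flavor f1.small --image cirros --nic net-id=<net-id> vm-cirros-1
$ openstack server show vm-cirros-1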

Reproducibility
---------------

Reproducible (100%)

System Configuration
--------------------

Standard Dedicated Storage - Bare Metal

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Please check that the cirros image does not have the image property hypervisor_type='qemu'.
If so, delete it.
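
For reference, the property can be checked and, if present, removed with something like this (assuming the image is named cirros in glance):

$ openstack image show cirros -c properties
$ openstack image unset --property hypervisor_type cirros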

Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

The properties of cirros image used are:

direct_url='rbd://5117acff-103e-4847-8ac2-90e67252c131/images/678b9d81-1fe2-4c4a-84cf-9c8f0da71162/snap', os_hash_algo='sha512', os_hash_value='9a0beeef2d36cebb2ba1256b2e3ac12fc31bba3d49643719c73a150353a852b4a618264e0854e372ce5eaf34dee28caa9b8029e1743ec0ba5aaf8aa9b8f887ab',
os_hidden='False'

Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

Can you provide the logs from the nova-compute pods please.
Also, what is the branch/pull-time of this build (per template):
https://wiki.openstack.org/wiki/StarlingX/BugTemplate

Thanks
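
For reference, one way to pull those logs (pod names differ per system; add -c <container> if the pod has more than one container):

$ kubectl get pods -n openstack | grep nova-compute
$ kubectl logs -n openstack <nova-compute-pod-name>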

Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Issue faced on:

- AIO Duplex BM
- Standard Local Storage (2+2) BM and Virtual
- Standard Dedicated Storage (2+2+2) BM

ISO 20190328T013000Z
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190328T013000Z/outputs/iso/

Logs from collect below.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority given the issue is seen in sanity testing.

Changed in starlingx:
assignee: nobody → Gerry Kopec (gerry-kopec)
importance: Undecided → High
tags: added: stx.2019.05
Changed in starlingx:
status: New → Triaged
Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

Looking at a number of these situations, the root cause is that nova-placement-api does not respond to a request from nova-compute (or another nova pod, e.g. nova-scheduler).
If this happens during the nova-compute initial resource check (in nova.compute.manager.ComputeManager.pre_start_hook()), the process is effectively stuck forever and the hypervisor will never come up.
If this happens after the initial resource check is done, the hypervisor will come up, but the per-minute resource audit stalls forever. This can be seen in the updated_at timestamps in the compute_nodes table in the database. Some actions may work but others may stall, depending on placement API responsiveness.
$ date; kubectl exec -it -n openstack mariadb-server-0 -- bash -c "mysql --password=\$MYSQL_ROOT_PASSWORD --user=root nova -e 'select host,updated_at,vcpus_used from compute_nodes;'"

[root@controller-1 19.01(keystone_admin)]# openstack hypervisor list
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP         | State |
+----+---------------------+-----------------+-----------------+-------+
| 4  | compute-2           | QEMU            | 192.168.206.128 | up    |
| 7  | compute-0           | QEMU            | 192.168.206.217 | down  |
| 10 | compute-1           | QEMU            | 192.168.206.231 | down  |
+----+---------------------+-----------------+-----------------+-------+
[wrsroot@controller-1 gerry(keystone_admin)]$ date; kubectl exec -it -n openstack mariadb-server-0 -- bash -c "mysql --password=\$MYSQL_ROOT_PASSWORD --user=root nova -e 'select host,updated_at,vcpus_used from compute_nodes;'"
Sat Mar 30 01:09:14 UTC 2019
+-----------+---------------------+------------+
| host      | updated_at          | vcpus_used |
+-----------+---------------------+------------+
| compute-2 | 2019-03-30 01:08:21 | 0          |
| compute-0 | 2019-03-30 01:02:33 | 1          |
| compute-1 | 2019-03-30 01:02:33 | 0          |
+-----------+---------------------+------------+

Coincident with the stalls, you would see the eventlet error "cannot switch to a different thread" in the nova-placement-api logs:
nova-placement-api-6b946c744c-rnqk5.log:
2019-03-30 01:02:35.824693 Deprecated: Option "idle_timeout" from group "api_database" is deprecated. Use option "connection_recycle_time" from group "api_database".
2019-03-30 01:02:36.225945 Traceback (most recent call last):
2019-03-30 01:02:36.225994 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 460, in fire_timers
2019-03-30 01:02:36.226195 timer()
2019-03-30 01:02:36.226206 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 59, in __call__
2019-03-30 01:02:36.226254 cb(*args, **kw)
2019-03-30 01:02:36.226265 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
2019-03-30 01:02:36.226340 waiter.switch()
2019-03-30 01:02:36.226356 error: cannot switch to a different thread
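
A quick way to check for this error across placement pods (a sketch; the namespace and pod naming assume a standard stx-openstack deployment):

$ kubectl get pods -n openstack | grep placement
$ kubectl logs -n openstack <nova-placement-api-pod-name> | grep "cannot switch to a different thread"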

Note that there were changes related to eventlet in the nova master branch after the creation of stable/stein. See:
https://review.openstack.org/#/c/626952/
This change did appear in the nova docker images (which are currently based on the master branch) which we were using at the same time these placement errors/hypervisor down is...


Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

With the merging of the following commits and the addition of stable docker images to the StarlingX docker hub, we should be able to repeat the test and determine whether the issue is still there:

https://review.openstack.org/#/q/topic:stable-image-builds+(status:open+OR+status:merged)
https://review.openstack.org/#/c/650436/

Revision history for this message
Frank Miller (sensfan22) wrote :

This issue was addressed by rebasing the docker images to the Stein release, together with this commit to have the system application-upload command pull these docker images: https://review.openstack.org/#/c/650436/
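
For anyone re-testing, the new images are pulled when the stx-openstack application is uploaded and applied; roughly (the tarball name here is illustrative only):

$ system application-upload stx-openstack-1.0.tgz
$ system application-apply stx-openstack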

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

We have not seen this issue recently

Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Regarding the latest STX Sanity status, issues show up during installation or provisioning, so we have not been able to launch instances yet.

Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

Based on other sanity test results we're getting, I suspect you're not seeing the same issue as this bug. Can you please create a new LP bug to track the issue that you're seeing in comment #15. For details, please include the state of the openstack pods as well as logs (especially nova containers) from all hosts.
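
For reference, the pod state and host logs can be gathered with something like the following (collect is the StarlingX log-gathering tool; exact options may vary by release):

$ kubectl get pods -n openstack -o wide
$ sudo collect all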
