Cannot create instances on Standard Dedicated Storage

Bug #1821841 reported by Juan Carlos Alonso
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Gerry Kopec

Bug Description

Title
-----

Cannot create instances on Standard Dedicated Storage on Bare Metal environment

Brief Description
-----------------

Cannot create instances on Standard Dedicated Storage on Bare Metal environment

controller-0:~$ openstack server list
+--------------------------------------+-------------+--------+----------+--------+----------+
| ID                                   | Name        | Status | Networks | Image  | Flavor   |
+--------------------------------------+-------------+--------+----------+--------+----------+
| 3240e3bd-8af2-429f-83e8-32927acfbd6d | vm-cirros-1 | ERROR  |          | cirros | f1.small |
+--------------------------------------+-------------+--------+----------+--------+----------+

controller-0:~$ openstack server show vm-cirros-1
{u'message': u'No valid host was found. ', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 1323, in schedule_and_build_instances\n instance_uuids, return_alternates=True)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 780, in _schedule_instances\n return_alternates=return_alternates)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 178, in call\n retry=self.retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/transport.py", line 128, in _send\n retry=retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send\n call_monitor_timeout, retry=retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 636, in _send\n raise result\n', u'created': u'2019-03-26T15:59:32Z'}

Severity
--------

Critical: System/Feature is not usable after the defect

Steps to Reproduce
------------------

$ openstack server create --flavor <flavor> --image <image> --nic <net-id> <vm-name>
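
For example, using the flavor, image, and server name from this report (network ID left as a placeholder):

$ openstack server create --flavor f1.small --image cirros --nic net-id=<net-id> vm-cirros-1
$ openstack server show vm-cirros-1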

Reproducibility
---------------

Reproducible (100%)

System Configuration
--------------------

Standard Dedicated Storage - Bare Metal

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Please check that the cirros image does not have the image property hypervisor_type='qemu'.
If so, delete it.
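
For reference, the property can be checked and, if present, removed with something like this (assuming the image is named cirros in glance):

$ openstack image show cirros -c properties
$ openstack image unset --property hypervisor_type cirros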

Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

The properties of cirros image used are:

direct_url='rbd://5117acff-103e-4847-8ac2-90e67252c131/images/678b9d81-1fe2-4c4a-84cf-9c8f0da71162/snap', os_hash_algo='sha512', os_hash_value='9a0beeef2d36cebb2ba1256b2e3ac12fc31bba3d49643719c73a150353a852b4a618264e0854e372ce5eaf34dee28caa9b8029e1743ec0ba5aaf8aa9b8f887ab',
os_hidden='False'

Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

Can you provide the logs from the nova-compute pods please.
Also, what is the branch/pull-time of this build (per template):
https://wiki.openstack.org/wiki/StarlingX/BugTemplate

Thanks
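
For reference, one way to pull those logs (pod names differ per system; add -c <container> if the pod has more than one container):

$ kubectl get pods -n openstack | grep nova-compute
$ kubectl logs -n openstack <nova-compute-pod-name>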

Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Issue faced on:

- AIO Duplex BM
- Standard Local Storage (2+2) BM and Virtual
- Standard Dedicated Storage (2+2+2) BM

ISO 20190328T013000Z
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190328T013000Z/outputs/iso/

Logs from collect below.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority given the issue is seen in sanity testing.

Changed in starlingx:
assignee: nobody → Gerry Kopec (gerry-kopec)
importance: Undecided → High
tags: added: stx.2019.05
Changed in starlingx:
status: New → Triaged
Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

Looking at a number of these situations, the root cause is that nova-placement-api does not respond to a request from nova-compute (or another nova pod, e.g. nova-scheduler).
If this happens during the nova-compute initial resource check (in nova.compute.manager.ComputeManager.pre_start_hook()), the process is effectively stuck forever and the hypervisor will never come up.
If this happens after the initial resource check is done, the hypervisor will come up, but the per-minute resource audit stalls forever. This can be seen in the updated_at timestamps in the compute_nodes table in the database. Some actions may work but others may stall, depending on placement API responsiveness.
$ date; kubectl exec -it -n openstack mariadb-server-0 -- bash -c "mysql --password=\$MYSQL_ROOT_PASSWORD --user=root nova -e 'select host,updated_at,vcpus_used from compute_nodes;'"

[root@controller-1 19.01(keystone_admin)]# openstack hypervisor list
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP         | State |
+----+---------------------+-----------------+-----------------+-------+
| 4  | compute-2           | QEMU            | 192.168.206.128 | up    |
| 7  | compute-0           | QEMU            | 192.168.206.217 | down  |
| 10 | compute-1           | QEMU            | 192.168.206.231 | down  |
+----+---------------------+-----------------+-----------------+-------+
[wrsroot@controller-1 gerry(keystone_admin)]$ date; kubectl exec -it -n openstack mariadb-server-0 -- bash -c "mysql --password=\$MYSQL_ROOT_PASSWORD --user=root nova -e 'select host,updated_at,vcpus_used from compute_nodes;'"
Sat Mar 30 01:09:14 UTC 2019
+-----------+---------------------+------------+
| host      | updated_at          | vcpus_used |
+-----------+---------------------+------------+
| compute-2 | 2019-03-30 01:08:21 | 0          |
| compute-0 | 2019-03-30 01:02:33 | 1          |
| compute-1 | 2019-03-30 01:02:33 | 0          |
+-----------+---------------------+------------+

Coincident with the stalls, you would see the eventlet error "cannot switch to a different thread" in the nova-placement-api logs:
nova-placement-api-6b946c744c-rnqk5.log:
2019-03-30 01:02:35.824693 Deprecated: Option "idle_timeout" from group "api_database" is deprecated. Use option "connection_recycle_time" from group "api_database".
2019-03-30 01:02:36.225945 Traceback (most recent call last):
2019-03-30 01:02:36.225994 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 460, in fire_timers
2019-03-30 01:02:36.226195 timer()
2019-03-30 01:02:36.226206 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 59, in __call__
2019-03-30 01:02:36.226254 cb(*args, **kw)
2019-03-30 01:02:36.226265 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
2019-03-30 01:02:36.226340 waiter.switch()
2019-03-30 01:02:36.226356 error: cannot switch to a different thread
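
A quick way to check for this error across placement pods (a sketch; the namespace and pod naming assume a standard stx-openstack deployment):

$ kubectl get pods -n openstack | grep placement
$ kubectl logs -n openstack <nova-placement-api-pod-name> | grep "cannot switch to a different thread"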

Note that there were changes related to eventlet in the nova master branch after the creation of stable/stein. See:
https://review.openstack.org/#/c/626952/
This change did appear in the nova docker images (which are currently based on the master branch) which we were using at the same time these placement errors/hypervisor down is...


Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

With the merging of the following commits and the addition of stable docker images to the StarlingX docker hub, we should be able to repeat the test and determine whether the issue is still there:

https://review.openstack.org/#/q/topic:stable-image-builds+(status:open+OR+status:merged)
https://review.openstack.org/#/c/650436/

Revision history for this message
Frank Miller (sensfan22) wrote :

This issue was addressed by rebasing the docker images to the Stein release, together with this commit to have the system application-upload command pull these docker images: https://review.openstack.org/#/c/650436/
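
For anyone re-testing, the new images are pulled when the stx-openstack application is uploaded and applied; roughly (the tarball name here is illustrative only):

$ system application-upload stx-openstack-1.0.tgz
$ system application-apply stx-openstack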

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

We have not seen this issue recently

Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Regarding the latest STX Sanity status, issues show up during installation or provisioning, so we have not been able to launch instances yet.

Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

Based on other sanity test results we're getting, I suspect you're not seeing the same issue as this bug. Can you please create a new LP bug to track the issue that you're seeing in comment #15. For details, please include the state of the openstack pods as well as logs (especially nova containers) from all hosts.
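
For reference, the pod state and host logs can be gathered with something like the following (collect is the StarlingX log-gathering tool; exact options may vary by release):

$ kubectl get pods -n openstack -o wide
$ sudo collect all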
