AIO-DX: hypervisor is not up in 5 mins after unlocked standby controller becomes available

Bug #1829931 reported by Peng Peng on 2019-05-21
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Low
Bart Wensley

Bug Description

Brief Description
-----------------
After lock/unlock standby controller, standby controller shows down in hypervisor list after 15 mins.

10 mins after host-unlock, host became available. After another 5 mins, hypervisor was still showing down.

Severity
--------
Major

Steps to Reproduce
------------------

TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-18_06-36-50

Last Pass
---------
20190506T233000Z

Timestamp/Logs
--------------

[2019-05-21 13:37:51,407] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-1'
[2019-05-21 13:38:02,097] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------+
| action | none |
| administrative | locked |
| availability | online |

[2019-05-21 13:45:18,276] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 13:45:21,240] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 3 | controller-1 | QEMU | 192.168.206.4 | down |
| 7 | controller-0 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+

[2019-05-21 13:50:47,155] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-21 13:50:49,056] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-05-21 13:51:21,057] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 13:51:23,359] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 3 | controller-1 | QEMU | 192.168.206.4 | down |
| 7 | controller-0 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+

[2019-05-21 13:51:28,469] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 13:51:31,188] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 3 | controller-1 | QEMU | 192.168.206.4 | up |
| 7 | controller-0 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Yang Liu (yliu12) wrote :

Additional info:

# After unlock, controller-1 became available in system host-show at following time.
[2019-05-21 13:45:10,063] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-1'
+---------------------+-----------------------------------------------------------------------+
| Property | Value |
+---------------------+-----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

# It then took another 6 minutes for worker subfunction to be ready and nova hypervisor to be up.
And the complaint is against this extra 6 minutes, which seems to be fairly slow.

Ghada Khalil (gkhalil) on 2019-05-22
tags: added: stx.sanity
Al Bailey (albailey1974) wrote :

The 15 minutes are
host-unlock at 2019-05-21 13:37:51
hypervisors up at 2019-05-21 13:51:28

The daemon.log shows that the controller comes out of its reboot at
2019-05-21T13:41:16 systemd 219 running in system mode.

and gets through the puppet manifests etc..

2019-05-21T13:44:15.089 controller-1 sm[93229]: info Starting sm: OK

The first kubernetes activities start at around that time as well.
  "Started Kubernetes transient mount for"
Those continue until 2019-05-21T13:51:14

Note: the nameserver logs come out even later:
2019-05-21T13:55:23.000 controller-1 dnsmasq[275911]: info using nameserver 10.96.0.10#53 for domain cluster.local

It looks like this system took a very long time to fully come up.

Numan Waheed (nwaheed) on 2019-05-22
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) on 2019-05-22
tags: added: stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
Ghada Khalil (gkhalil) wrote :

Marking as release gating; needs further investigation and a better understanding of the frequency

tags: added: stx.2.0
Ghada Khalil (gkhalil) wrote :

@Peng, Does this issue happen on standard configs?

Peng Peng (ppeng) wrote :

Yes, it is also happened on standard lab WCP_99-103
Lab: WCP_99_103
Load: 2019-05-21_14-14-17

[2019-05-22 08:41:47,154] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-22 08:41:48,764] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 4 | compute-2 | QEMU | 192.168.206.84 | up |
| 7 | compute-1 | QEMU | 192.168.206.43 | up |
| 10 | compute-0 | QEMU | 192.168.206.123 | up |
+----+---------------------+-----------------+-----------------+-------+

[2019-05-22 08:49:27,127] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-22 08:49:27,127] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-22 09:10:29,994] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-22 09:12:52,140] 387 DEBUG MainThread ssh.expect :: Output:
Unable to establish connection to http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/dffd81c66f134c40bed770a1ff06a845/os-hypervisors/detail: HTTPConnectionPool(host='nova-api-proxy.openstack.svc.cluster.local', port=8774): Max retries exceeded with url: /v2.1/dffd81c66f134c40bed770a1ff06a845/os-hypervisors/detail (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4e0deefdd0>: Failed to establish a new connection: [Errno 110] Connection timed out',))
controller-1:~$

Bart Wensley (bartwensley) wrote :

This bug report is for the lock/unlock of the standby controller. Peng's note is for the reboot of an active controller, which is a completely different testcase. Recent sanity reports do not show issues with lock/unlock of the standby controller in a standard configuration.

Peng Peng (ppeng) wrote :

Please ignore #6, That should be a different issue, LP-1829432. Thanks Bart.

Bart Wensley (bartwensley) wrote :

A brief timeline from the logs in the defect (mostly what Al already found):
13:37:51 - controller-1 unlocked
13:41:16 - controller-1 finishes booting
13:44:06 - controller manifest finishes applying
13:44:14 - kubelet starts
13:44:47 - worker manifest finishes applying
13:51:14 - nova-compute container starts
13:51:28 - hypervisor up

The kubelet starts creating containers at 13:44:16. Creating all the containers takes a long time (about 7 minutes) because:
- It has to create 178 containers.
- There are dependencies between containers.
- These containers are all affined to two cores.

I took a brief look at the gaps between container creations and found one large gap - there was a 1m48s gap between dockerd launching the calico-node container before launching the next one. Maybe that is expected?

In any case, nothing jumps out here as a specific problem, other than a lot of containers being created, using only two cores.

Ghada Khalil (gkhalil) on 2019-06-05
Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
Peng Peng (ppeng) on 2019-06-05
summary: - Standby controller not up in hypervisor list in 15 mins after host-
- unlock
+ hypervisor is not up in 5 mins after unlocked Standby controller becomes
+ available
Peng Peng (ppeng) on 2019-06-05
description: updated
Ghada Khalil (gkhalil) on 2019-07-05
summary: - hypervisor is not up in 5 mins after unlocked Standby controller becomes
- available
+ AIO-DX: hypervisor is not up in 5 mins after unlocked standby controller
+ becomes available
Bart Wensley (bartwensley) wrote :

This is just a note that there are currently issues with extremely high platform CPU usage on AIO deployments - see https://bugs.launchpad.net/starlingx/+bug/1837426. Until that bug is resolved, no progress will be made on this issue.

Brent Rowsell (brent-rowsell) wrote :

How long does it take for the hypervisor to come up on the current load ?

Peng Peng (ppeng) wrote :
Download full text (4.2 KiB)

Issue was reproduced in IP_5-6 load: 2019-08-19_20-59-00
After 6 mins, hypervisor is up.

[2019-08-20 09:29:35,346] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-08-20 09:29:38,493] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 4 | controller-0 | QEMU | 192.168.205.3 | up |
| 6 | controller-1 | QEMU | 192.168.205.4 | up |
+----+---------------------+-----------------+---------------+-------+

[2019-08-20 09:32:29,503] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'

[2019-08-20 09:33:40,287] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-08-20 09:42:30,969] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-08-20 09:42:34,893] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 4 | controller-0 | QEMU | 192.168.205.3 | down |
| 6 | controller-1 | QEMU | 192.168.205.4 | up |
+----+---------------------+-----------------+---------------+-------+

[2019-08-20 09:47:28,606] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-08-20 09:47:31,561] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 4 | controller-0 | QEMU ...

Read more...

Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Peng Peng (ppeng) wrote :

Issue was not reproduced on train
2019-11-21_20-00-00
wcp_3-6

Ghada Khalil (gkhalil) wrote :

Changing to Low priority. Based on discussion with Frank (containers PL), this can still be an issue, but there are no feasible solutions to address it within a reasonable timeline. We'll keep it open for now to monitor for future occurrences.

Changed in starlingx:
importance: Medium → Low
tags: removed: stx.3.0 stx.retestneeded
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers