AIO-DX: hypervisor is not up in 5 mins after unlocked standby controller becomes available

Bug #1829931 reported by Peng Peng
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Won't Fix
Low
Frank Miller

Bug Description

Brief Description
-----------------
After lock/unlock standby controller, standby controller shows down in hypervisor list after 15 mins.

10 mins after host-unlock, host became available. After another 5 mins, hypervisor was still showing down.

Severity
--------
Major

Steps to Reproduce
------------------

TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-18_06-36-50

Last Pass
---------
20190506T233000Z

Timestamp/Logs
--------------

[2019-05-21 13:37:51,407] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-1'
[2019-05-21 13:38:02,097] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------+
| action | none |
| administrative | locked |
| availability | online |

[2019-05-21 13:45:18,276] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 13:45:21,240] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 3 | controller-1 | QEMU | 192.168.206.4 | down |
| 7 | controller-0 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+

[2019-05-21 13:50:47,155] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-21 13:50:49,056] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-05-21 13:51:21,057] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 13:51:23,359] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 3 | controller-1 | QEMU | 192.168.206.4 | down |
| 7 | controller-0 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+

[2019-05-21 13:51:28,469] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-21 13:51:31,188] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 3 | controller-1 | QEMU | 192.168.206.4 | up |
| 7 | controller-0 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Yang Liu (yliu12) wrote :

Additional info:

# After unlock, controller-1 became available in system host-show at following time.
[2019-05-21 13:45:10,063] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-1'
+---------------------+-----------------------------------------------------------------------+
| Property | Value |
+---------------------+-----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

# It then took another 6 minutes for worker subfunction to be ready and nova hypervisor to be up.
And the complaint is against this extra 6 minutes, which seems to be fairly slow.

Ghada Khalil (gkhalil)
tags: added: stx.sanity
Revision history for this message
Al Bailey (albailey1974) wrote :

The 15 minutes are
host-unlock at 2019-05-21 13:37:51
hypervisors up at 2019-05-21 13:51:28

The daemon.log shows that the controller comes out of its reboot at
2019-05-21T13:41:16 systemd 219 running in system mode.

and gets through the puppet manifests etc..

2019-05-21T13:44:15.089 controller-1 sm[93229]: info Starting sm: OK

The first kubernetes activities start at around that time as well.
  "Started Kubernetes transient mount for"
Those continue until 2019-05-21T13:51:14

Note: the nameserver logs come out even later:
2019-05-21T13:55:23.000 controller-1 dnsmasq[275911]: info using nameserver 10.96.0.10#53 for domain cluster.local

It looks like this system took a very long time to fully come up.

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil)
tags: added: stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; needs further investigation and a better understanding of the frequency

tags: added: stx.2.0
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, Does this issue happen on standard configs?

Revision history for this message
Peng Peng (ppeng) wrote :

Yes, it is also happened on standard lab WCP_99-103
Lab: WCP_99_103
Load: 2019-05-21_14-14-17

[2019-05-22 08:41:47,154] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-22 08:41:48,764] 387 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 4 | compute-2 | QEMU | 192.168.206.84 | up |
| 7 | compute-1 | QEMU | 192.168.206.43 | up |
| 10 | compute-0 | QEMU | 192.168.206.123 | up |
+----+---------------------+-----------------+-----------------+-------+

[2019-05-22 08:49:27,127] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-22 08:49:27,127] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-22 09:10:29,994] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-05-22 09:12:52,140] 387 DEBUG MainThread ssh.expect :: Output:
Unable to establish connection to http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/dffd81c66f134c40bed770a1ff06a845/os-hypervisors/detail: HTTPConnectionPool(host='nova-api-proxy.openstack.svc.cluster.local', port=8774): Max retries exceeded with url: /v2.1/dffd81c66f134c40bed770a1ff06a845/os-hypervisors/detail (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4e0deefdd0>: Failed to establish a new connection: [Errno 110] Connection timed out',))
controller-1:~$

Revision history for this message
Bart Wensley (bartwensley) wrote :

This bug report is for the lock/unlock of the standby controller. Peng's note is for the reboot of an active controller, which is a completely different testcase. Recent sanity reports do not show issues with lock/unlock of the standby controller in a standard configuration.

Revision history for this message
Peng Peng (ppeng) wrote :

Please ignore #6, That should be a different issue, LP-1829432. Thanks Bart.

Revision history for this message
Bart Wensley (bartwensley) wrote :

A brief timeline from the logs in the defect (mostly what Al already found):
13:37:51 - controller-1 unlocked
13:41:16 - controller-1 finishes booting
13:44:06 - controller manifest finishes applying
13:44:14 - kubelet starts
13:44:47 - worker manifest finishes applying
13:51:14 - nova-compute container starts
13:51:28 - hypervisor up

The kubelet starts creating containers at 13:44:16. Creating all the containers takes a long time (about 7 minutes) because:
- It has to create 178 containers.
- There are dependencies between containers.
- These containers are all affined to two cores.

I took a brief look at the gaps between container creations and found one large gap - there was a 1m48s gap between dockerd launching the calico-node container before launching the next one. Maybe that is expected?

In any case, nothing jumps out here as a specific problem, other than a lot of containers being created, using only two cores.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
Peng Peng (ppeng)
summary: - Standby controller not up in hypervisor list in 15 mins after host-
- unlock
+ hypervisor is not up in 5 mins after unlocked Standby controller becomes
+ available
Peng Peng (ppeng)
description: updated
Ghada Khalil (gkhalil)
summary: - hypervisor is not up in 5 mins after unlocked Standby controller becomes
- available
+ AIO-DX: hypervisor is not up in 5 mins after unlocked standby controller
+ becomes available
Revision history for this message
Bart Wensley (bartwensley) wrote :

This is just a note that there are currently issues with extremely high platform CPU usage on AIO deployments - see https://bugs.launchpad.net/starlingx/+bug/1837426. Until that bug is resolved, no progress will be made on this issue.

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

How long does it take for the hypervisor to come up on the current load ?

Revision history for this message
Peng Peng (ppeng) wrote :
Download full text (4.2 KiB)

Issue was reproduced in IP_5-6 load: 2019-08-19_20-59-00
After 6 mins, hypervisor is up.

[2019-08-20 09:29:35,346] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-08-20 09:29:38,493] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 4 | controller-0 | QEMU | 192.168.205.3 | up |
| 6 | controller-1 | QEMU | 192.168.205.4 | up |
+----+---------------------+-----------------+---------------+-------+

[2019-08-20 09:32:29,503] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'

[2019-08-20 09:33:40,287] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-08-20 09:42:30,969] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-08-20 09:42:34,893] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 4 | controller-0 | QEMU | 192.168.205.3 | down |
| 6 | controller-1 | QEMU | 192.168.205.4 | up |
+----+---------------------+-----------------+---------------+-------+

[2019-08-20 09:47:28,606] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2019-08-20 09:47:31,561] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 4 | controller-0 | QEMU ...

Read more...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was not reproduced on train
2019-11-21_20-00-00
wcp_3-6

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Changing to Low priority. Based on discussion with Frank (containers PL), this can still be an issue, but there are no feasible solutions to address it within a reasonable timeline. We'll keep it open for now to monitor for future occurrences.

Changed in starlingx:
importance: Medium → Low
tags: removed: stx.3.0 stx.retestneeded
Revision history for this message
Bart Wensley (bartwensley) wrote :

Assigning to Frank as any further work on this would be done by the containers team.

Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Frank Miller (sensfan22)
Revision history for this message
Ramaswamy Subramanian (rsubrama) wrote :

No progress on this bug for more than 2 years. Candidate for closure.

If there is no update, this issue is targeted to be closed as 'Won't Fix' in 2 weeks.

Revision history for this message
Ramaswamy Subramanian (rsubrama) wrote :

Changing the status to 'Won't Fix' as there is no activity.

Changed in starlingx:
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.