Multi-node system host-lock failed after swact during initial setup

Bug #1824994 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Teresa Ho

Bug Description

Brief Description
-----------------
During lab setup, a host-swact was performed from controller-0 to controller-1; a subsequent attempt to host-lock controller-0 failed.

Severity
--------
Critical

Steps to Reproduce
------------------
As in the brief description above.
TC-name: installation

Expected Behavior
------------------
controller-0 is locked successfully

Actual Behavior
----------------
controller-0 remains in unlocked status

Reproducibility
---------------
Reproducible
10/10

System Configuration
--------------------
Multi-node system

Lab-name: WCP_99-103

Branch/Pull Time/Commit
-----------------------
stx master as of 20190415T233001Z

Last Pass
---------
20190410T013000Z

Timestamp/Logs
--------------
[2019-04-16 07:48:12,039] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-04-16 07:49:35,183] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'
[2019-04-16 07:49:37,528] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

[2019-04-16 07:50:23,355] 1315 WARNING MainThread host_helper.wait_for_host_values:: controller-0 did not reach state(s) within 20s - {'administrative': 'locked'}
[2019-04-16 07:50:23,356] 423 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-04-16 07:50:23,356] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2019-04-16 07:50:24,598] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+-----------------------------------------------------------------------+
| Property | Value |
+---------------------+-----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |

[wrsroot@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
| 5 | compute-2 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-1 ~(keystone_admin)]$ date
Tue Apr 16 14:05:29 UTC 2019
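The `host_helper.wait_for_host_values` warning above comes from the test harness polling `host-show` until the host reaches the expected state. A minimal sketch of that kind of poll-until-state helper (the helper name and its exact signature here are assumptions, not the real harness code):

```python
import time

def wait_for_host_values(get_host, expected, timeout=20, interval=5):
    """Poll get_host() until every expected field matches, or time out.

    Returns True on success, False if the host never reached the
    expected state within `timeout` seconds (mirroring the 20s WARNING
    in the log above).
    """
    deadline = time.monotonic() + timeout
    while True:
        host = get_host()
        if all(host.get(k) == v for k, v in expected.items()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Simulated host-show output: the host stays 'unlocked', as in the bug.
stuck_host = {'administrative': 'unlocked', 'availability': 'available'}
ok = wait_for_host_values(lambda: stuck_host,
                          {'administrative': 'locked'},
                          timeout=0.2, interval=0.05)
print(ok)  # False: controller-0 never reached 'locked'
```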

Test Activity
-------------
installation

Tags: stx.2.0 stx.ha
Revision history for this message
Erich Cordoba (ericho) wrote :

We are hitting the same issue in a standard controller configuration (2+2).

Also, it is not possible to lock controller-0; after `system host-lock controller-0`, host-show always shows:

[wrsroot@controller-1 ~(keystone_admin)]$ system host-show controller-0 | grep administrative
| administrative | unlocked |

Revision history for this message
Frank Miller (sensfan22) wrote :

From the logs provided by Peng, nfv-vim is failing the lock due to task disable_host_services timing out:

2019-04-16T07:49:37.365 controller-1 VIM_Thread[110384] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=5766bad1-fc00-42bf-b7ee-534677538e95, host_name=controller-0, do_action=lock.
2019-04-16T07:49:37.365 controller-1 VIM_Thread[110384] INFO _host_state_enabled.py.33 Exiting state (enabled) for controller-0.
2019-04-16T07:49:37.366 controller-1 VIM_Thread[110384] INFO _host_state_disabling.py.27 Entering state (disabling) for controller-0.
2019-04-16T07:49:37.370 controller-1 VIM_Thread[110384] INFO _host_director.py.395 Notify other directors that the host controller-0 services are disabling.
2019-04-16T07:49:37.370 controller-1 VIM_Thread[110384] INFO _instance_director.py.1339 Host controller-0 services disabling.
2019-04-16T07:49:37.370 controller-1 VIM_Thread[110384] INFO _instance_director.py.495 No hypervisors available, can't evacuate instances off of host controller-0.
2019-04-16T07:49:37.373 controller-1 VIM_Thread[110384] INFO _host_director.py.408 Notify other directors that the host controller-0 services are disabled.
2019-04-16T07:49:37.374 controller-1 VIM_Thread[110384] INFO _instance_director.py.1392 Host controller-0 services disabled.
2019-04-16T07:49:37.374 controller-1 VIM_Thread[110384] DEBUG _instance_director.py.1408 Canceling previous host operation host-lock, before continuing with host operation host-lock for controller-0.
2019-04-16T07:49:37.374 controller-1 VIM_Thread[110384] INFO _instance_director.py.495 No hypervisors available, can't evacuate instances off of host controller-0.
2019-04-16T07:49:37.375 controller-1 VIM_Thread[110384] INFO _host.py.820 Host controller-0 FSM State-Change: prev_state=enabled, state=disabling, event=lock.
2019-04-16T07:49:37.377 controller-1 VIM_Thread[110384] INFO _host_director.py.475 Host controller-0 state change notification.
2019-04-16T07:49:37.379 controller-1 VIM_Thread[110384] DEBUG nfvi_infrastructure_api.py.2260 Host rest-api patch path: /nfvi-plugins/v1/hosts/5766bad1-fc00-42bf-b7ee-534677538e95.
2019-04-16T07:49:57.421 controller-1 VIM_Thread[110384] INFO _task_worker_pool.py.73 Timeout worker Infrastructure-Worker-0
2019-04-16T07:49:58.425 controller-1 VIM_Thread[110384] ERROR _task.py.200 Task(disable_host_services) work (taint_node) timed out, id=22...
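The failure mode in the log is a unit of task work (`taint_node`) handed to a worker and failed by the pool when it does not complete in time. A rough sketch of that pattern, assuming a generic worker pool rather than the VIM's actual `_task_worker_pool` implementation:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def run_task_work(fn, timeout, *args):
    """Run one unit of task work in a worker and fail it on timeout,
    roughly as the VIM's worker pool does for taint_node above."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return ('complete', future.result(timeout=timeout))
        except TimeoutError:
            return ('timed-out', None)

def taint_node_hangs():
    # Stands in for the taint request blocking on an unreachable
    # kube-apiserver (see the next comment for the root cause).
    time.sleep(0.5)
    return 'tainted'

print(run_task_work(taint_node_hangs, timeout=0.1))  # ('timed-out', None)
```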


Revision history for this message
Bart Wensley (bartwensley) wrote :

The VIM fails to lock controller-0 because it can't apply the NoExecute taint:

2019-04-16T07:49:58.727 controller-1 VIM_Thread[110384] ERROR Caught exception while trying to enable controller-0 kubernetes host services, error=MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90c8a44690>: Failed to establish a new connection: [Errno 111] Connection refused',)).
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 907, in enable_host_services
    future.result = (yield)
Exception: MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90c8a44690>: Failed to establish a new connection: [Errno 111] Connection refused',))
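The `MaxRetryError` above is urllib3 giving up after repeated refused connections to the apiserver address. The behavior can be sketched as a simple retry wrapper (illustrative only; the real retry policy lives inside urllib3's connection pool):

```python
def call_with_retries(connect, max_retries=3):
    """Retry a connection attempt, then give up with a
    MaxRetryError-style failure, mirroring how urllib3 surfaced the
    refused connection to 192.168.206.2:6443 above."""
    last_error = None
    for _attempt in range(1 + max_retries):
        try:
            return connect()
        except ConnectionRefusedError as exc:
            last_error = exc
    raise RuntimeError('Max retries exceeded: %s' % last_error)

def refused():
    # Stands in for the apiserver port with nothing listening on it.
    raise ConnectionRefusedError(111, 'Connection refused')

try:
    call_with_retries(refused)
except RuntimeError as e:
    print(e)  # Max retries exceeded: [Errno 111] Connection refused
```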

This appears to happen because the kube-apiserver pod is not reachable. Looking at the controller-1 networking.info file in the collect, I can see that although controller-1 is active, the floating cluster IP (192.168.206.2) is not on the management interface:

9: ens801f0.139@ens801f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 90:e2:ba:b0:dd:50 brd ff:ff:ff:ff:ff:ff
    inet 192.168.204.4/24 brd 192.168.204.255 scope global ens801f0.139:1
       valid_lft forever preferred_lft forever
    inet 192.168.206.4/24 brd 192.168.206.255 scope global ens801f0.139:5
       valid_lft forever preferred_lft forever
    inet 192.168.204.2/24 brd 192.168.204.255 scope global secondary ens801f0.139
       valid_lft forever preferred_lft forever
    inet 192.168.204.6/24 brd 192.168.204.255 scope global secondary ens801f0.139
       valid_lft forever preferred_lft forever
    inet 192.168.204.5/24 brd 192.168.204.255 scope global secondary ens801f0.139
       valid_lft forever preferred_lft forever
    inet6 fe80::92e2:baff:feb0:dd50/64 scope link
       valid_lft forever preferred_lft forever

Looking at the controller-0 networking.info file I can see that the floating cluster IP (192.168.206.2) is on the management interface:

9: ens801f0.139@ens801f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc noqueue state UP group default qlen 1000
    link/ether 90:e2:ba:b0:dc:2c brd ff:ff:ff:ff:ff:ff
    inet 192.168.204.3/24 brd 192.168.204.255 scope global ens801f0.139:1
       valid_lft forever preferred_lft forever
    inet 192.168.206.3/24 brd 192.168.206.255 scope global ens801f0.139:5
       valid_lft forever preferred_lft forever
    inet 192.168.206.2/24 scope global secondary ens801f0.139
       valid_lft forever preferred_lft forever
    inet6 fe80::92e2:baff:feb0:dc2c/64 scope link
       valid_lft forever preferred_lft forever

So in summary, when the swact was done, the floating cluster IP was not moved to controller-1. I can also see that the kubelet on controller-1 is unable to reach the kube-apiserver. So in addition to the cluster...
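The condition Bart identifies can be checked mechanically from `ip addr` output: is the floating cluster IP plumbed on the active controller? A minimal sketch (the helper name is an assumption, not a StarlingX utility):

```python
import re

def interface_has_ip(ip_addr_output, ip):
    """Return True if `ip` appears as an address in `ip addr` output."""
    return re.search(r'\binet6?\s+%s/' % re.escape(ip),
                     ip_addr_output) is not None

# Address lines from the two controllers' networking.info above.
controller_1 = """\
    inet 192.168.204.4/24 brd 192.168.204.255 scope global ens801f0.139:1
    inet 192.168.206.4/24 brd 192.168.206.255 scope global ens801f0.139:5
    inet 192.168.204.2/24 brd 192.168.204.255 scope global secondary ens801f0.139
"""
controller_0 = """\
    inet 192.168.206.3/24 brd 192.168.206.255 scope global ens801f0.139:5
    inet 192.168.206.2/24 scope global secondary ens801f0.139
"""
print(interface_has_ip(controller_1, '192.168.206.2'))  # False: missing on the active controller
print(interface_has_ip(controller_0, '192.168.206.2'))  # True: stranded on the old controller
```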


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority as basic node operations are failing.

This issue was introduced recently by: https://review.openstack.org/#/c/650981/

Changed in starlingx:
importance: Undecided → High
tags: added: stx.2.0 stx.ha
tags: added: stx.retestneeded
Changed in starlingx:
status: New → Triaged
assignee: nobody → Teresa Ho (teresaho)
summary: - 2+3 system host-lock failed after swact during lab setup
+ Multi-node system host-lock failed after swact during initial setup
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/653650

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/653650
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=efa7c7d482222739f89aca5ae25074187b31b74e
Submitter: Zuul
Branch: master

commit efa7c7d482222739f89aca5ae25074187b31b74e
Author: Teresa Ho <email address hidden>
Date: Wed Apr 17 23:12:14 2019 -0400

    Fix SM cluster_host_ip service

    The floating IP of the cluster-host interface was not transferred
    to the newly active controller after swact.
    The service cluster-host-ip should be monitored by SM regardless
    whether the management interface is shared with cluster-host interface.

    Closes-Bug: 1824994

    Change-Id: Ie2bc9012adf7da1e093ff697eda7ae1165643e8c
    Signed-off-by: Teresa Ho <email address hidden>
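Reading the commit message, the behavior change can be sketched as a predicate (assumed logic for illustration, not the actual SM code): before the fix, SM only monitored the cluster-host-ip service when the cluster-host interface was separate from management; after the fix it is monitored unconditionally, so the floating cluster IP follows the active controller on swact.

```python
def sm_should_monitor_cluster_host_ip(mgmt_shared_with_cluster_host,
                                      fixed=True):
    """Before the fix (fixed=False), the shared-interface case (the lab
    configuration in this bug) was skipped; after the fix, the service
    is always monitored by SM."""
    if fixed:
        return True
    return not mgmt_shared_with_cluster_host

# The failing configuration: management shared with cluster-host.
print(sm_should_monitor_cluster_host_ip(True, fixed=False))  # False: the bug
print(sm_should_monitor_cluster_host_ip(True, fixed=True))   # True: the fix
```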

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

This issue has not been seen recently.

tags: removed: stx.retestneeded