Multi-node system host-lock failed after swact during initial setup

Bug #1824994 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Teresa Ho

Bug Description

Brief Description
-----------------
During lab setup, a host-swact was performed from controller-0 to controller-1; a subsequent attempt to host-lock controller-0 failed.

Severity
--------
Critical

Steps to Reproduce
------------------
As in the brief description above.
TC-name: installation

Expected Behavior
------------------
controller-0 is locked successfully

Actual Behavior
----------------
controller-0 remains in unlocked status

Reproducibility
---------------
Reproducible
10/10

System Configuration
--------------------
Multi-node system

Lab-name: WCP_99-103

Branch/Pull Time/Commit
-----------------------
stx master as of 20190415T233001Z

Last Pass
---------
20190410T013000Z

Timestamp/Logs
--------------
[2019-04-16 07:48:12,039] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-04-16 07:49:35,183] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'
[2019-04-16 07:49:37,528] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

[2019-04-16 07:50:23,355] 1315 WARNING MainThread host_helper.wait_for_host_values:: controller-0 did not reach state(s) within 20s - {'administrative': 'locked'}
[2019-04-16 07:50:23,356] 423 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-04-16 07:50:23,356] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2019-04-16 07:50:24,598] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+-----------------------------------------------------------------------+
| Property | Value |
+---------------------+-----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |

[wrsroot@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
| 5 | compute-2 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-1 ~(keystone_admin)]$ date
Tue Apr 16 14:05:29 UTC 2019
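The `host_helper.wait_for_host_values` warning above comes from the test harness polling `host-show` until the host reaches the expected state. A minimal sketch of that kind of poll-until-state helper (the helper name and its exact signature here are assumptions, not the real harness code):

```python
import time

def wait_for_host_values(get_host, expected, timeout=20, interval=5):
    """Poll get_host() until every expected field matches, or time out.

    Returns True on success, False if the host never reached the
    expected state within `timeout` seconds (mirroring the 20s WARNING
    in the log above).
    """
    deadline = time.monotonic() + timeout
    while True:
        host = get_host()
        if all(host.get(k) == v for k, v in expected.items()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Simulated host-show output: the host stays 'unlocked', as in the bug.
stuck_host = {'administrative': 'unlocked', 'availability': 'available'}
ok = wait_for_host_values(lambda: stuck_host,
                          {'administrative': 'locked'},
                          timeout=0.2, interval=0.05)
print(ok)  # False: controller-0 never reached 'locked'
```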

Test Activity
-------------
installation

Tags: stx.2.0 stx.ha
Revision history for this message
Erich Cordoba (ericho) wrote :

We are hitting the same issue in a standard controller configuration (2+2).

Also, it is not possible to lock controller-0; after `system host-lock controller-0`, host-show always shows:

[wrsroot@controller-1 ~(keystone_admin)]$ system host-show controller-0 | grep administrative
| administrative | unlocked |

Revision history for this message
Frank Miller (sensfan22) wrote :

From the logs provided by Peng, nfv-vim is failing the lock due to task disable_host_services timing out:

2019-04-16T07:49:37.365 controller-1 VIM_Thread[110384] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=5766bad1-fc00-42bf-b7ee-534677538e95, host_name=controller-0, do_action=lock.
2019-04-16T07:49:37.365 controller-1 VIM_Thread[110384] INFO _host_state_enabled.py.33 Exiting state (enabled) for controller-0.
2019-04-16T07:49:37.366 controller-1 VIM_Thread[110384] INFO _host_state_disabling.py.27 Entering state (disabling) for controller-0.
2019-04-16T07:49:37.370 controller-1 VIM_Thread[110384] INFO _host_director.py.395 Notify other directors that the host controller-0 services are disabling.
2019-04-16T07:49:37.370 controller-1 VIM_Thread[110384] INFO _instance_director.py.1339 Host controller-0 services disabling.
2019-04-16T07:49:37.370 controller-1 VIM_Thread[110384] INFO _instance_director.py.495 No hypervisors available, can't evacuate instances off of host controller-0.
2019-04-16T07:49:37.373 controller-1 VIM_Thread[110384] INFO _host_director.py.408 Notify other directors that the host controller-0 services are disabled.
2019-04-16T07:49:37.374 controller-1 VIM_Thread[110384] INFO _instance_director.py.1392 Host controller-0 services disabled.
2019-04-16T07:49:37.374 controller-1 VIM_Thread[110384] DEBUG _instance_director.py.1408 Canceling previous host operation host-lock, before continuing with host operation host-lock for controller-0.
2019-04-16T07:49:37.374 controller-1 VIM_Thread[110384] INFO _instance_director.py.495 No hypervisors available, can't evacuate instances off of host controller-0.
2019-04-16T07:49:37.375 controller-1 VIM_Thread[110384] INFO _host.py.820 Host controller-0 FSM State-Change: prev_state=enabled, state=disabling, event=lock.
2019-04-16T07:49:37.377 controller-1 VIM_Thread[110384] INFO _host_director.py.475 Host controller-0 state change notification.
2019-04-16T07:49:37.379 controller-1 VIM_Thread[110384] DEBUG nfvi_infrastructure_api.py.2260 Host rest-api patch path: /nfvi-plugins/v1/hosts/5766bad1-fc00-42bf-b7ee-534677538e95.
2019-04-16T07:49:57.421 controller-1 VIM_Thread[110384] INFO _task_worker_pool.py.73 Timeout worker Infrastructure-Worker-0
2019-04-16T07:49:58.425 controller-1 VIM_Thread[110384] ERROR _task.py.200 Task(disable_host_services) work (taint_node) timed out, id=22...
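The failure mode in the log is a unit of task work (`taint_node`) handed to a worker and failed by the pool when it does not complete in time. A rough sketch of that pattern, assuming a generic worker pool rather than the VIM's actual `_task_worker_pool` implementation:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def run_task_work(fn, timeout, *args):
    """Run one unit of task work in a worker and fail it on timeout,
    roughly as the VIM's worker pool does for taint_node above."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return ('complete', future.result(timeout=timeout))
        except TimeoutError:
            return ('timed-out', None)

def taint_node_hangs():
    # Stands in for the taint request blocking on an unreachable
    # kube-apiserver (see the next comment for the root cause).
    time.sleep(0.5)
    return 'tainted'

print(run_task_work(taint_node_hangs, timeout=0.1))  # ('timed-out', None)
```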


Revision history for this message
Bart Wensley (bartwensley) wrote :

The VIM fails to lock controller-0 because it can't apply the NoExecute taint:

2019-04-16T07:49:58.727 controller-1 VIM_Thread[110384] ERROR Caught exception while trying to enable controller-0 kubernetes host services, error=MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90c8a44690>: Failed to establish a new connection: [Errno 111] Connection refused',)).
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 907, in enable_host_services
    future.result = (yield)
Exception: MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90c8a44690>: Failed to establish a new connection: [Errno 111] Connection refused',))
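The `MaxRetryError` above is urllib3 giving up after repeated refused connections to the apiserver address. The behavior can be sketched as a simple retry wrapper (illustrative only; the real retry policy lives inside urllib3's connection pool):

```python
def call_with_retries(connect, max_retries=3):
    """Retry a connection attempt, then give up with a
    MaxRetryError-style failure, mirroring how urllib3 surfaced the
    refused connection to 192.168.206.2:6443 above."""
    last_error = None
    for _attempt in range(1 + max_retries):
        try:
            return connect()
        except ConnectionRefusedError as exc:
            last_error = exc
    raise RuntimeError('Max retries exceeded: %s' % last_error)

def refused():
    # Stands in for the apiserver port with nothing listening on it.
    raise ConnectionRefusedError(111, 'Connection refused')

try:
    call_with_retries(refused)
except RuntimeError as e:
    print(e)  # Max retries exceeded: [Errno 111] Connection refused
```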

This appears to happen because the kube-apiserver pod is not reachable. Looking at the controller-1 networking.info file in the collect, I can see that although controller-1 is active, the floating cluster IP (192.168.206.2) is not on the management interface:

9: ens801f0.139@ens801f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 90:e2:ba:b0:dd:50 brd ff:ff:ff:ff:ff:ff
    inet 192.168.204.4/24 brd 192.168.204.255 scope global ens801f0.139:1
       valid_lft forever preferred_lft forever
    inet 192.168.206.4/24 brd 192.168.206.255 scope global ens801f0.139:5
       valid_lft forever preferred_lft forever
    inet 192.168.204.2/24 brd 192.168.204.255 scope global secondary ens801f0.139
       valid_lft forever preferred_lft forever
    inet 192.168.204.6/24 brd 192.168.204.255 scope global secondary ens801f0.139
       valid_lft forever preferred_lft forever
    inet 192.168.204.5/24 brd 192.168.204.255 scope global secondary ens801f0.139
       valid_lft forever preferred_lft forever
    inet6 fe80::92e2:baff:feb0:dd50/64 scope link
       valid_lft forever preferred_lft forever

Looking at the controller-0 networking.info file I can see that the floating cluster IP (192.168.206.2) is on the management interface:

9: ens801f0.139@ens801f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc noqueue state UP group default qlen 1000
    link/ether 90:e2:ba:b0:dc:2c brd ff:ff:ff:ff:ff:ff
    inet 192.168.204.3/24 brd 192.168.204.255 scope global ens801f0.139:1
       valid_lft forever preferred_lft forever
    inet 192.168.206.3/24 brd 192.168.206.255 scope global ens801f0.139:5
       valid_lft forever preferred_lft forever
    inet 192.168.206.2/24 scope global secondary ens801f0.139
       valid_lft forever preferred_lft forever
    inet6 fe80::92e2:baff:feb0:dc2c/64 scope link
       valid_lft forever preferred_lft forever

So in summary, when the swact was done, the floating cluster IP was not moved to controller-1. I can also see that the kubelet on controller-1 is unable to reach the kube-apiserver. So in addition to the cluster...
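The condition Bart identifies can be checked mechanically from `ip addr` output: is the floating cluster IP plumbed on the active controller? A minimal sketch (the helper name is an assumption, not a StarlingX utility):

```python
import re

def interface_has_ip(ip_addr_output, ip):
    """Return True if `ip` appears as an address in `ip addr` output."""
    return re.search(r'\binet6?\s+%s/' % re.escape(ip),
                     ip_addr_output) is not None

# Address lines from the two controllers' networking.info above.
controller_1 = """\
    inet 192.168.204.4/24 brd 192.168.204.255 scope global ens801f0.139:1
    inet 192.168.206.4/24 brd 192.168.206.255 scope global ens801f0.139:5
    inet 192.168.204.2/24 brd 192.168.204.255 scope global secondary ens801f0.139
"""
controller_0 = """\
    inet 192.168.206.3/24 brd 192.168.206.255 scope global ens801f0.139:5
    inet 192.168.206.2/24 scope global secondary ens801f0.139
"""
print(interface_has_ip(controller_1, '192.168.206.2'))  # False: missing on the active controller
print(interface_has_ip(controller_0, '192.168.206.2'))  # True: stranded on the old controller
```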


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority as basic node operations are failing.

This issue was introduced recently by: https://review.openstack.org/#/c/650981/

Changed in starlingx:
importance: Undecided → High
tags: added: stx.2.0 stx.ha
tags: added: stx.retestneeded
Changed in starlingx:
status: New → Triaged
assignee: nobody → Teresa Ho (teresaho)
summary: - 2+3 system host-lock failed after swact during lab setup
+ Multi-node system host-lock failed after swact during initial setup
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/653650

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/653650
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=efa7c7d482222739f89aca5ae25074187b31b74e
Submitter: Zuul
Branch: master

commit efa7c7d482222739f89aca5ae25074187b31b74e
Author: Teresa Ho <email address hidden>
Date: Wed Apr 17 23:12:14 2019 -0400

    Fix SM cluster_host_ip service

    The floating IP of the cluster-host interface was not transferred
    to the newly active controller after swact.
    The service cluster-host-ip should be monitored by SM regardless
    whether the management interface is shared with cluster-host interface.

    Closes-Bug: 1824994

    Change-Id: Ie2bc9012adf7da1e093ff697eda7ae1165643e8c
    Signed-off-by: Teresa Ho <email address hidden>
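Reading the commit message, the behavior change can be sketched as a predicate (assumed logic for illustration, not the actual SM code): before the fix, SM only monitored the cluster-host-ip service when the cluster-host interface was separate from management; after the fix it is monitored unconditionally, so the floating cluster IP follows the active controller on swact.

```python
def sm_should_monitor_cluster_host_ip(mgmt_shared_with_cluster_host,
                                      fixed=True):
    """Before the fix (fixed=False), the shared-interface case (the lab
    configuration in this bug) was skipped; after the fix, the service
    is always monitored by SM."""
    if fixed:
        return True
    return not mgmt_shared_with_cluster_host

# The failing configuration: management shared with cluster-host.
print(sm_should_monitor_cluster_host_ip(True, fixed=False))  # False: the bug
print(sm_should_monitor_cluster_host_ip(True, fixed=True))   # True: the fix
```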

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

This issue has not been seen recently.

tags: removed: stx.retestneeded