Backup & Restore: Nodes fail to unlock after restore

Bug #1844828 reported by Senthil Mukundakumar on 2019-09-20
This bug affects 1 person
Affects: StarlingX
Status: In Progress
Importance: High
Assigned to: Ovidiu Poncea

Bug Description

Brief Description
-----------------
In a regular system with storage, after completing the restore on the active controller and the remaining nodes, all nodes except the active controller failed to unlock.

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-1 | worker | locked | disabled | online |
| 3 | compute-2 | worker | locked | disabled | online |
| 4 | compute-3 | worker | locked | disabled | offline |
| 5 | controller-1 | controller | locked | disabled | online |
| 6 | storage-0 | storage | locked | disabled | online |
| 7 | storage-1 | storage | locked | disabled | online |
| 8 | compute-0 | worker | locked | disabled | online |
+----+--------------+-------------+----------------+-------------+--------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock controller-1
Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "configure_ihost" info: "<unknown>"

The standby controller and the other nodes have connectivity, but cannot be reached over ssh from the active controller:
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
PING controller-1 (192.168.204.4) 56(84) bytes of data.
64 bytes from controller-1 (192.168.204.4): icmp_seq=1 ttl=64 time=0.188 ms
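
For reference, a few checks that could narrow down the RPC timeout and the ssh failure; these commands are not from the original report, and the service names are assumptions based on a typical StarlingX load:

# Hypothetical diagnostics run from controller-0
sudo sm-dump | grep -i sysinv          # confirm the sysinv services are enabled-active
ps -ef | grep [s]ysinv-conductor       # confirm the conductor process is actually running
ssh -v sysadmin@controller-1 hostname  # verbose output shows where the ssh attempt stalls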

Severity
--------
Critical: Controller-1 failed to unlock

Steps to Reproduce
------------------
1. Create an environment for ansible remote host
2. Bring up the regular system with storage
3. Backup the system using ansible remotely
4. Re-install the controller with the same load
5. Restore the system using ansible remotely (see the command sketch after this list)
6. Unlock the active controller
7. Power on and PXE boot controller-1. Ceph OSDs on controller-1 will remain intact. Unlock controller-1
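
For steps 3 and 5, a minimal sketch of the remote invocation is given below; the playbook paths, inventory layout and extra-vars are assumptions based on the StarlingX backup & restore playbooks of that timeframe and may differ per load:

# Assumed playbook locations and variables; adjust to the actual environment
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
    -i inventory.yml -e "ansible_become_pass=<password> admin_password=<password>"
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
    -i inventory.yml -e "initial_backup_dir=/home/sysadmin backup_filename=<platform_backup>.tgz \
        ansible_become_pass=<password> admin_password=<password>"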

Expected Behavior
------------------
The controller-1 should become online and ready to get unlocked

Actual Behavior
----------------
Standby controller became online but failed to unlock

Reproducibility
---------------

System Configuration
--------------------
Regular system with storage

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-09-18_20-00-00"

Timestamp/Logs
--------------
2019-09-20 19:14:02.979
Since the active controller is not able to ssh to the other nodes, the collected logs might not contain data from those nodes.
collecting data from 8 host(s): controller-0 compute-1 compute-2 compute-3 controller-1 storage-0 storage-1 compute-0
collecting controller-0_20190920.191322 ... done (00:01:23 19M)
collecting compute-1_20190920.191322 ... failed to collect from compute-1 (reason:36:permission error)
collecting compute-2_20190920.191322 ... failed to collect from compute-2 (reason:36:permission error)
collecting compute-3_20190920.191322 ... failed to collect from compute-3 (reason:33:host unreachable)
collecting controller-1_20190920.191322 ... failed to collect from controller-1 (reason:36:permission error)
collecting storage-0_20190920.191322 ... failed to collect from storage-0 (reason:36:permission error)
collecting storage-1_20190920.191322 ... failed to collect from storage-1 (reason:36:permission error)
collecting compute-0_20190920.191322 ... failed to collect from compute-0 (reason:36:permission error)
creating all-nodes tarball /scratch/ALL_NODES_20190920.191322.tar ... done (00:01:34 19M)

Test Activity
-------------
Feature Testing

Senthil Mukundakumar (smukunda) wrote :
  • LOGS (18.9 MiB, application/x-tar)
Ghada Khalil (gkhalil) on 2019-09-23
tags: added: stx.update
Frank Miller (sensfan22) wrote :

The first unlock of controller-1 was done earlier than the timestamp listed by Senthil. From nfv-vim.log:

2019-09-20T18:19:46.407 controller-0 VIM_Thread[101227] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=fe491d66-c3e7-47e3-bf4c-d081ea0c9545, host_name=controller-1, do_action=unlock.

This seems to fail. Is it related to using https?

2019-09-20T18:20:45.612 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to disable controller-1 kubernetes host services, error=MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f129493d050>: Failed to establish a new connection: [Errno 113] No route to host',)).
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 950, in disable_host_services
    future.result = (yield)
Exception: MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f129493d050>: Failed to establish a new connection: [Errno 113] No route to host',))
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _host_director.py.464 Notify other directors that a host controller-1 abort is inprogress.
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _instance_director.py.1332 Canceling host operation host-disable for host controller-1.
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _host_director.py.464 Notify other directors that a host controller-1 abort is inprogress.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _host_director.py.421 Notify other directors that the host controller-1 is disabled.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _instance_director.py.1427 Host controller-1 disabled.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] DEBUG _host_tasks.py.276 Task (disable-host_controller-1) complete.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _host_state_disabling.py.81 Disable failed for controller-1.

Also, before the unlock attempt of controller-1, there are other similar errors in nfv-vim.log:

2019-09-20T15:47:06.114 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to query the state of the system, error=[OpenStack Exception: method=GET, url=http://localhost:2112/v1/systems, headers={'Content-Type': 'application/json', 'User-Agent': 'vim/1.0'}, body=None, reason=<urlopen error [Errno 111] Connection refused>]...

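The failures above point at the kube-apiserver endpoint 192.168.206.2:6443 being unreachable from the VIM. A quick way to confirm this from controller-0 (standard tooling, not from the original report) would be:

# Hypothetical reachability checks for the apiserver address seen in the logs
curl -k https://192.168.206.2:6443/healthz   # "ok" is expected when the apiserver is reachable
kubectl get nodes -o wide                    # fails the same way if the error is network-related
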

Frank Miller (sensfan22) wrote :

After reviewing logs with Bart, he pointed out that the VIM cannot communicate with the kubernetes API. From daemon.log he sees this error that should be investigated:

2019-09-20T15:22:54.923 localhost kubelet[128297]: info E0920 15:22:54.923002 128297 kubelet.go:2248] node "controller-0" not found
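
A hedged follow-up for this kubelet error, using standard kubernetes/systemd tooling (not taken from the attached logs), would be to check whether the kubelet ever managed to register the node:

# Hypothetical checks for kubelet node registration
sudo systemctl status kubelet                         # is the kubelet up or restart-looping?
sudo journalctl -u kubelet | grep -i "register node"  # registration attempts and errors
kubectl get node controller-0                         # NotFound until registration succeeds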

Yang Liu (yliu12) on 2019-09-25
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 as B&R is a feature deliverable for the release. This needs further investigation.

summary: - Backup & Restore: Nodes are failed to unlock after restore
+ Backup & Restore: Nodes fail to unlock after restore
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
importance: Medium → High
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
tags: added: stx.3.0
Changed in starlingx:
status: Triaged → In Progress
Ovidiu Poncea (ovidiu.poncea) wrote :

After looking over the logs and continuing Bart's analysis, it seems the IP kubernetes is trying to connect to is not present on any interface:

unable to register node "controller-0" with API server: Post https://192.168.206.2:6443/api/v1/nodes: dial tcp

ovidiu@repos:~/logs/lp1844828/controller-0_20190920.191322/var/extra$ grep "192.168.206.2" networking.info
    inet 192.168.206.103/24 brd 192.168.206.255 scope global ens3f0
192.168.206.2 dev ens3f0 ref 1 used 0/12596/1 probes 5 INCOMPLETE

The ARP entry for 192.168.206.2 is INCOMPLETE.
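
On a live system the same condition could be confirmed with standard iproute2 commands; this is an indicative sketch only, with the interface name ens3f0 taken from the collected networking.info above:

# Indicative live checks for the apiserver address from the logs
ip addr show ens3f0 | grep 192.168.206.2   # on a healthy system the active controller holds this address
ip neigh show 192.168.206.2                # INCOMPLETE matches the collected ARP entry above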

Ovidiu Poncea (ovidiu.poncea) wrote :

Given that there have been changes to the networking interface code, I also suggest a retest of this issue.
