Backup & Restore: Nodes fail to unlock after restore

Bug #1844828 reported by Senthil Mukundakumar on 2019-09-20
This bug affects 1 person
Affects: StarlingX
Status: In Progress
Importance: High
Assigned to: Ovidiu Poncea

Bug Description

Brief Description
-----------------
In a regular system with storage, after completing the restore on the active controller and the remaining nodes, all nodes except the active controller failed to unlock.

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-1 | worker | locked | disabled | online |
| 3 | compute-2 | worker | locked | disabled | online |
| 4 | compute-3 | worker | locked | disabled | offline |
| 5 | controller-1 | controller | locked | disabled | online |
| 6 | storage-0 | storage | locked | disabled | online |
| 7 | storage-1 | storage | locked | disabled | online |
| 8 | compute-0 | worker | locked | disabled | online |
+----+--------------+-------------+----------------+-------------+--------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock controller-1
Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "configure_ihost" info: "<unknown>"

The standby controller and the other nodes have connectivity, but cannot be reached over ssh from the active controller:
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
PING controller-1 (192.168.204.4) 56(84) bytes of data.
64 bytes from controller-1 (192.168.204.4): icmp_seq=1 ttl=64 time=0.188 ms
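
For reference, a few checks that could narrow down the RPC timeout and the ssh failure; these commands are not from the original report, and the service names are assumptions based on a typical StarlingX load:

# Hypothetical diagnostics run from controller-0
sudo sm-dump | grep -i sysinv          # confirm the sysinv services are enabled-active
ps -ef | grep [s]ysinv-conductor       # confirm the conductor process is actually running
ssh -v sysadmin@controller-1 hostname  # verbose output shows where the ssh attempt stalls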

Severity
--------
Critical: Controller-1 failed to unlock

Steps to Reproduce
------------------
1. Create an environment for ansible remote host
2. Bring up the regular system with storage
3. Backup the system using ansible remotely
4. Re-install the controller with the same load
5. Restore the system using ansible remotely (see the command sketch after this list)
6. Unlock the active controller
7. Power on and PXE boot controller-1. Ceph OSDs on controller-1 will remain intact. Unlock controller-1
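
For steps 3 and 5, a minimal sketch of the remote invocation is given below; the playbook paths, inventory layout and extra-vars are assumptions based on the StarlingX backup & restore playbooks of that timeframe and may differ per load:

# Assumed playbook locations and variables; adjust to the actual environment
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
    -i inventory.yml -e "ansible_become_pass=<password> admin_password=<password>"
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
    -i inventory.yml -e "initial_backup_dir=/home/sysadmin backup_filename=<platform_backup>.tgz \
        ansible_become_pass=<password> admin_password=<password>"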

Expected Behavior
------------------
The controller-1 should become online and ready to get unlocked

Actual Behavior
----------------
Standby controller became online but failed to unlock

Reproducibility
---------------

System Configuration
--------------------
Regular system with storage

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-09-18_20-00-00"

Timestamp/Logs
--------------
2019-09-20 19:14:02.979
Since the active controller is not able to ssh to the other nodes, the collected logs might not contain data from those nodes.
collecting data from 8 host(s): controller-0 compute-1 compute-2 compute-3 controller-1 storage-0 storage-1 compute-0
collecting controller-0_20190920.191322 ... done (00:01:23 19M)
collecting compute-1_20190920.191322 ... failed to collect from compute-1 (reason:36:permission error)
collecting compute-2_20190920.191322 ... failed to collect from compute-2 (reason:36:permission error)
collecting compute-3_20190920.191322 ... failed to collect from compute-3 (reason:33:host unreachable)
collecting controller-1_20190920.191322 ... failed to collect from controller-1 (reason:36:permission error)
collecting storage-0_20190920.191322 ... failed to collect from storage-0 (reason:36:permission error)
collecting storage-1_20190920.191322 ... failed to collect from storage-1 (reason:36:permission error)
collecting compute-0_20190920.191322 ... failed to collect from compute-0 (reason:36:permission error)
creating all-nodes tarball /scratch/ALL_NODES_20190920.191322.tar ... done (00:01:34 19M)

Test Activity
-------------
Feature Testing

Senthil Mukundakumar (smukunda) wrote :
  • LOGS (18.9 MiB, application/x-tar)
Ghada Khalil (gkhalil) on 2019-09-23
tags: added: stx.update
Frank Miller (sensfan22) wrote :

The first unlock of controller-1 was done earlier than the timestamp listed by Senthil. From nfv-vim.log:

2019-09-20T18:19:46.407 controller-0 VIM_Thread[101227] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=fe491d66-c3e7-47e3-bf4c-d081ea0c9545, host_name=controller-1, do_action=unlock.

This seems to fail. Is it related to using https?

2019-09-20T18:20:45.612 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to disable controller-1 kubernetes host services, error=MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f129493d050>: Failed to establish a new connection: [Errno 113] No route to host',)).
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 950, in disable_host_services
    future.result = (yield)
Exception: MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f129493d050>: Failed to establish a new connection: [Errno 113] No route to host',))
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _host_director.py.464 Notify other directors that a host controller-1 abort is inprogress.
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _instance_director.py.1332 Canceling host operation host-disable for host controller-1.
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _host_director.py.464 Notify other directors that a host controller-1 abort is inprogress.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _host_director.py.421 Notify other directors that the host controller-1 is disabled.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _instance_director.py.1427 Host controller-1 disabled.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] DEBUG _host_tasks.py.276 Task (disable-host_controller-1) complete.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _host_state_disabling.py.81 Disable failed for controller-1.

Also, before the unlock attempt of controller-1, there are other similar errors in nfv-vim.log:

2019-09-20T15:47:06.114 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to query the state of the system, error=[OpenStack Exception: method=GET, url=http://localhost:2112/v1/systems, headers={'Content-Type': 'application/json', 'User-Agent': 'vim/1.0'}, body=None, reason=<urlopen error [Errno 111] Connection refused>]...

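The failures above point at the kube-apiserver endpoint 192.168.206.2:6443 being unreachable from the VIM. A quick way to confirm this from controller-0 (standard tooling, not from the original report) would be:

# Hypothetical reachability checks for the apiserver address seen in the logs
curl -k https://192.168.206.2:6443/healthz   # "ok" is expected when the apiserver is reachable
kubectl get nodes -o wide                    # fails the same way if the error is network-related
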

Frank Miller (sensfan22) wrote :

After reviewing logs with Bart, he pointed out that the VIM cannot communicate with the kubernetes API. From daemon.log he sees this error that should be investigated:

2019-09-20T15:22:54.923 localhost kubelet[128297]: info E0920 15:22:54.923002 128297 kubelet.go:2248] node "controller-0" not found
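
A hedged follow-up for this kubelet error, using standard kubernetes/systemd tooling (not taken from the attached logs), would be to check whether the kubelet ever managed to register the node:

# Hypothetical checks for kubelet node registration
sudo systemctl status kubelet                         # is the kubelet up or restart-looping?
sudo journalctl -u kubelet | grep -i "register node"  # registration attempts and errors
kubectl get node controller-0                         # NotFound until registration succeeds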

Yang Liu (yliu12) on 2019-09-25
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 as B&R is a feature deliverable for the release. This needs further investigation.

summary: - Backup & Restore: Nodes are failed to unlock after restore
+ Backup & Restore: Nodes fail to unlock after restore
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
importance: Medium → High
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
tags: added: stx.3.0
Changed in starlingx:
status: Triaged → In Progress
Ovidiu Poncea (ovidiu.poncea) wrote :

After looking over the logs and continuing Bart's analysis, it seems the IP kubernetes is trying to connect to is not present on any interface:

unable to register node "controller-0" with API server: Post https://192.168.206.2:6443/api/v1/nodes: dial tcp

ovidiu@repos:~/logs/lp1844828/controller-0_20190920.191322/var/extra$ grep "192.168.206.2" networking.info
    inet 192.168.206.103/24 brd 192.168.206.255 scope global ens3f0
192.168.206.2 dev ens3f0 ref 1 used 0/12596/1 probes 5 INCOMPLETE

The ARP entry for 192.168.206.2 is INCOMPLETE.
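
On a live system the same condition could be confirmed with standard iproute2 commands; this is an indicative sketch only, with the interface name ens3f0 taken from the collected networking.info above:

# Indicative live checks for the apiserver address from the logs
ip addr show ens3f0 | grep 192.168.206.2   # on a healthy system the active controller holds this address
ip neigh show 192.168.206.2                # INCOMPLETE matches the collected ARP entry above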

Ovidiu Poncea (ovidiu.poncea) wrote :

Given that there have been changes to the networking interface code, I also suggest a retest of this issue.
