Backup & Restore: Nodes fail to unlock after restore

Bug #1844828 reported by Senthil Mukundakumar
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Ovidiu Poncea

Bug Description

Brief Description
-----------------
On a regular system with storage, after completing the restore on the active controller and the remaining nodes, every node except the active controller failed to unlock.

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-1    | worker      | locked         | disabled    | online       |
| 3  | compute-2    | worker      | locked         | disabled    | online       |
| 4  | compute-3    | worker      | locked         | disabled    | offline      |
| 5  | controller-1 | controller  | locked         | disabled    | online       |
| 6  | storage-0    | storage     | locked         | disabled    | online       |
| 7  | storage-1    | storage     | locked         | disabled    | online       |
| 8  | compute-0    | worker      | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock controller-1
Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "configure_ihost" info: "<unknown>"

The standby controller and the other nodes have network connectivity, but they cannot be reached over ssh from the active controller:
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
PING controller-1 (192.168.204.4) 56(84) bytes of data.
64 bytes from controller-1 (192.168.204.4): icmp_seq=1 ttl=64 time=0.188 ms
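
For a symptom like this (RPC timeout from sysinv plus ssh failing while ping works), a minimal set of checks on the active controller might look like the hedged sketch below; these are generic commands, not taken from this report, and the log path assumes the usual StarlingX layout.

# Is sysinv-conductor running on the active controller?
ps -ef | grep [s]ysinv-conductor

# Recent conductor activity (assumes the usual StarlingX log location)
sudo tail -n 50 /var/log/sysinv.log

# Distinguish "no route" from "connection refused" when ssh to the standby controller fails
ssh -o ConnectTimeout=5 sysadmin@controller-1 hostname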

Severity
--------
Critical: Controller-1 failed to unlock

Steps to Reproduce
------------------
1. Set up an environment for running Ansible against a remote host.
2. Bring up a regular system with storage.
3. Back up the system remotely using Ansible.
4. Re-install the controller with the same load.
5. Restore the system remotely using Ansible (see the sketch after this list).
6. Unlock the active controller.
7. Power on and PXE boot controller-1; the Ceph OSDs on controller-1 remain intact. Unlock controller-1.
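
For reference, steps 3 and 5 run roughly as below from the Ansible control host. This is a hedged sketch: the playbook names and variables follow typical StarlingX backup-and-restore usage and may differ per release, and <path-to-stx-playbooks>, <inventory-file>, <controller-0> and the backup tarball name are placeholders, not values from this report.

# Step 3: back up the system remotely (playbook path/vars per typical StarlingX B&R docs; verify for your release)
ansible-playbook <path-to-stx-playbooks>/backup.yml \
    -i <inventory-file> --limit <controller-0> \
    -e "ansible_become_pass=<sysadmin-password>"

# Step 5: after re-installing controller-0 with the same load, restore the platform remotely
ansible-playbook <path-to-stx-playbooks>/restore_platform.yml \
    -i <inventory-file> --limit <controller-0> \
    -e "initial_backup_dir=<backup-dir> backup_filename=<platform-backup-tarball>"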

Expected Behavior
------------------
Controller-1 should come online and be ready to unlock.

Actual Behavior
----------------
The standby controller came online but failed to unlock.

Reproducibility
---------------

System Configuration
--------------------
Regular system with storage

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-09-18_20-00-00"

Timestamp/Logs
--------------
2019-09-20 19:14:02.979
Because the active controller could not ssh to the other nodes, the collected logs may not contain data from them:
collecting data from 8 host(s): controller-0 compute-1 compute-2 compute-3 controller-1 storage-0 storage-1 compute-0
collecting controller-0_20190920.191322 ... done (00:01:23 19M)
collecting compute-1_20190920.191322 ... failed to collect from compute-1 (reason:36:permission error)
collecting compute-2_20190920.191322 ... failed to collect from compute-2 (reason:36:permission error)
collecting compute-3_20190920.191322 ... failed to collect from compute-3 (reason:33:host unreachable)
collecting controller-1_20190920.191322 ... failed to collect from controller-1 (reason:36:permission error)
collecting storage-0_20190920.191322 ... failed to collect from storage-0 (reason:36:permission error)
collecting storage-1_20190920.191322 ... failed to collect from storage-1 (reason:36:permission error)
collecting compute-0_20190920.191322 ... failed to collect from compute-0 (reason:36:permission error)
creating all-nodes tarball /scratch/ALL_NODES_20190920.191322.tar ... done (00:01:34 19M)

Test Activity
-------------
Feature Testing

Revision history for this message
Senthil Mukundakumar (smukunda) wrote :
  • LOGS (18.9 MiB, application/x-tar)
Ghada Khalil (gkhalil)
tags: added: stx.update
Revision history for this message
Frank Miller (sensfan22) wrote :

The first unlock of controller-1 was attempted earlier than the timestamp listed by Senthil. From the nfv-vim.log:

2019-09-20T18:19:46.407 controller-0 VIM_Thread[101227] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=fe491d66-c3e7-47e3-bf4c-d081ea0c9545, host_name=controller-1, do_action=unlock.

This seems to fail. Is it related to using https?

2019-09-20T18:20:45.612 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to disable controller-1 kubernetes host services, error=MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f129493d050>: Failed to establish a new connection: [Errno 113] No route to host',)).
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 950, in disable_host_services
    future.result = (yield)
Exception: MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f129493d050>: Failed to establish a new connection: [Errno 113] No route to host',))
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _host_director.py.464 Notify other directors that a host controller-1 abort is inprogress.
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _instance_director.py.1332 Canceling host operation host-disable for host controller-1.
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _host_director.py.464 Notify other directors that a host controller-1 abort is inprogress.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _host_director.py.421 Notify other directors that the host controller-1 is disabled.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _instance_director.py.1427 Host controller-1 disabled.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] DEBUG _host_tasks.py.276 Task (disable-host_controller-1) complete.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _host_state_disabling.py.81 Disable failed for controller-1.

Also, before the unlock attempt of controller-1, there are other similar errors in nfv-vim.log:

2019-09-20T15:47:06.114 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to query the state of the system, error=[OpenStack Exception: method=GET, url=http://localhost:2112/v1/systems, headers={'Content-Type': 'application/json', 'User-Agent': 'vim/1.0'}, body=None, reason=<urlopen error [Errno 111] Connection refused>]...
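
The two excerpts point at different endpoints: the kubernetes API on the cluster-host floating IP (192.168.206.2:6443) is unreachable, while a local platform service the VIM queries on port 2112 refuses connections. A hedged way to separate the two from controller-0 (generic commands; the addresses are taken from the log excerpts above):

# Can we reach the kube-apiserver on the cluster-host floating IP at all?
curl -k --connect-timeout 5 https://192.168.206.2:6443/healthz

# Is anything listening locally on port 2112 or the kube-apiserver port 6443?
ss -ltnp | grep -E ':2112|:6443'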


Revision history for this message
Frank Miller (sensfan22) wrote :

After reviewing the logs with Bart, he pointed out that the VIM cannot communicate with the kubernetes API. In daemon.log he sees this error that should be investigated:

2019-09-20T15:22:54.923 localhost kubelet[128297]: info E0920 15:22:54.923002 128297 kubelet.go:2248] node "controller-0" not found
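
A hedged follow-up on that kubelet error (generic kubectl/systemctl usage, not taken from the collected logs) is to check whether the node ever registered with the API server and whether kubelet is still retrying:

# Does the API server know about any nodes? (run on controller-0 with an admin kubeconfig)
kubectl get nodes -o wide

# Is kubelet running, and is it still failing to register?
systemctl status kubelet
sudo journalctl -u kubelet --since "1 hour ago" | grep -iE 'not found|register'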

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 as B&R is a feature deliverable for the release. This needs further investigation.

summary: - Backup & Restore: Nodes are failed to unlock after restore
+ Backup & Restore: Nodes fail to unlock after restore
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
importance: Medium → High
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
tags: added: stx.3.0
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

After looking over the logs and continuing Bart's analysis, it seems that the IP address kubernetes is trying to connect to is not present on any interface:

unable to register node "controller-0" with API server: Post https://192.168.206.2:6443/api/v1/nodes: dial tcp

ovidiu@repos:~/logs/lp1844828/controller-0_20190920.191322/var/extra$ grep "192.168.206.2" networking.info
    inet 192.168.206.103/24 brd 192.168.206.255 scope global ens3f0
192.168.206.2 dev ens3f0 ref 1 used 0/12596/1 probes 5 INCOMPLETE

The INCOMPLETE ARP entry is there.
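
A quick way to confirm this condition (generic iproute2 commands, offered as a hedged sketch) is to check whether the cluster-host floating IP is assigned to any local interface and what the neighbour table holds for it:

# Is 192.168.206.2 configured on any interface of this controller?
ip addr show | grep 192.168.206.2

# Neighbour (ARP) table entry; INCOMPLETE means resolution never succeeded
ip neigh show | grep 192.168.206.2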

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Given that there have been changes to the networking interface code, I also suggest retesting this issue.

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Senthil retested and the issue did not reproduce; marking it as Invalid.

Changed in starlingx:
status: In Progress → Invalid
tags: removed: stx.retestneeded