Backup & Restore: Nodes fail to unlock after restore
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| StarlingX | Invalid | High | Ovidiu Poncea | |
Bug Description
Brief Description
-----------------
In a regular system with storage, after completing the restore action on the active controller and the remaining nodes, all nodes except the active controller failed to unlock.
[sysadmin@
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-1    | worker      | locked         | disabled    | online       |
| 3  | compute-2    | worker      | locked         | disabled    | online       |
| 4  | compute-3    | worker      | locked         | disabled    | offline      |
| 5  | controller-1 | controller  | locked         | disabled    | online       |
| 6  | storage-0    | storage     | locked         | disabled    | online       |
| 7  | storage-1    | storage     | locked         | disabled    | online       |
| 8  | compute-0    | worker      | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@
Timeout while waiting on RPC response - topic: "sysinv.
The standby controller and the other nodes do have connectivity (they respond to ping), but ssh to them from the active controller fails:
sysadmin@
PING controller-1 (192.168.204.4) 56(84) bytes of data.
64 bytes from controller-1 (192.168.204.4): icmp_seq=1 ttl=64 time=0.188 ms
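A couple of generic checks from controller-0 (not taken from the original report) can help separate a management-network problem from a service problem; the hostnames are the ones shown above:

    # Verbose ssh shows where the connection stalls (TCP connect, banner, key exchange)
    ssh -v sysadmin@controller-1

    # The RPC timeout points at sysinv; check its service state on the active controller
    sudo sm-dump | grep -i sysinv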
Severity
--------
Critical: controller-1 failed to unlock.
Steps to Reproduce
------------------
1. Create an environment for the ansible remote host.
2. Bring up the regular system with storage.
3. Back up the system remotely using ansible (see the command sketch after this list).
4. Re-install the controller with the same load.
5. Restore the system remotely using ansible.
6. Unlock the active controller.
7. Power on and PXE boot controller-1; the Ceph OSDs on controller-1 will remain intact. Unlock controller-1.
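For reference, a minimal sketch of the remote backup and restore invocations behind steps 3 and 5, assuming the default StarlingX playbook locations; the inventory file, backup directory, and password placeholders are illustrative, not values from this report:

    # Step 3: back up the platform from the remote ansible host
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
        -i inventory.yml --limit controller-0 \
        -e "ansible_become_pass=<sysadmin-password> admin_password=<admin-password>"

    # Step 5: after re-installing the controller with the same load, restore the platform
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
        -i inventory.yml --limit controller-0 \
        -e "initial_backup_dir=/opt/platform-backup backup_filename=<backup>.tgz admin_password=<admin-password>"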
Expected Behavior
------------------
controller-1 should come online and be ready to unlock.
Actual Behavior
----------------
The standby controller came online but failed to unlock.
Reproducibility
---------------
System Configuration
--------------------
Regular system with storage
Branch/Pull Time/Commit
-----------------------
BUILD_ID=
Timestamp/Logs
--------------
2019-09-20 19:14:02.979
Since the active controller is not able to ssh to the other nodes, the collected logs might not contain data from those nodes.
collecting data from 8 host(s): controller-0 compute-1 compute-2 compute-3 controller-1 storage-0 storage-1 compute-0
collecting controller-
collecting compute-
collecting compute-
collecting compute-
collecting controller-
collecting storage-
collecting storage-
collecting compute-
creating all-nodes tarball /scratch/
Test Activity
-------------
Feature Testing
tags: added: stx.update
tags: added: stx.retestneeded
Changed in starlingx:
status: Triaged → In Progress
tags: removed: stx.retestneeded
The first unlock of controller-1 was done earlier than the timestamp listed by Senthil. From the nfv-vim.log:
2019-09-20T18:19:46.407 controller-0 VIM_Thread[101227] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=fe491d66-c3e7-47e3-bf4c-d081ea0c9545, host_name=controller-1, do_action=unlock.
This seems to fail. Is it related to using https?
2019-09-20T18:20:45.612 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to disable controller-1 kubernetes host services, error=MaxRetryError: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f129493d050>: Failed to establish a new connection: [Errno 113] No route to host',)).
python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 950, in disable_host_services
HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /api/v1/nodes/controller-1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f129493d050>: Failed to establish a new connection: [Errno 113] No route to host',))
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _host_director.py.464 Notify other directors that a host controller-1 abort is inprogress.
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _instance_director.py.1332 Canceling host operation host-disable for host controller-1.
2019-09-20T18:20:45.613 controller-0 VIM_Thread[101227] INFO _host_director.py.464 Notify other directors that a host controller-1 abort is inprogress.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _host_director.py.421 Notify other directors that the host controller-1 is disabled.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _instance_director.py.1427 Host controller-1 disabled.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] DEBUG _host_tasks.py.276 Task (disable-host_controller-1) complete.
2019-09-20T18:20:45.614 controller-0 VIM_Thread[101227] INFO _host_state_disabling.py.81 Disable failed for controller-1.
Traceback (most recent call last):
  File "/usr/lib64/
    future.result = (yield)
Exception: MaxRetryError: HTTPSConnection
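The exception is raised while opening the TCP connection ([Errno 113] No route to host), i.e. before any TLS handshake, so https itself is probably not the cause. A few hypothetical checks from controller-0 would confirm that; 192.168.206.2:6443 is the kubernetes API endpoint taken from the log above:

    # Is there a route toward the cluster-host network at all?
    ip route get 192.168.206.2

    # Is the kube-apiserver endpoint reachable?
    ping -c 3 192.168.206.2
    curl -k https://192.168.206.2:6443/healthz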
Also, before the unlock attempt of controller-1, similar errors appear in nfv-vim.log:
2019-09-20T15:47:06.114 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to query the state of the system, error=[OpenStack Exception: method=GET, url=http://localhost:2112/v1/systems, headers={'Content-Type': 'application/json', 'User-Agent': 'vim/1.0'}, body=None, reason=<urlopen error [Errno 111] Connection refused>]...
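This earlier failure mode is different: [Errno 111] Connection refused means the host was reachable but nothing was listening on localhost:2112 at that time. Two hypothetical checks to verify whether the service behind that port is up:

    # Is anything listening on port 2112?
    sudo ss -ltnp | grep 2112

    # Replay the request the VIM made
    curl -v http://localhost:2112/v1/systems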
2019-09- 20T15:47: 06.114 controller-0 VIM_Thread[101227] ERROR Caught exception while trying to query the state of the system, error=[OpenStack Exception: method=GET, url=http:// localhost: 2112/v1/ systems, headers= {'Content- Type': 'application/json', 'User-Agent': 'vim/1.0'}, body=None, reason=<urlopen error [Errno 111] Connection refused>]...