I have taken a look at the logs - here is the executive summary:

- Issue 1: Controller-0 fails to go active because the dcorch-patch-api-proxy service attempted to go active before the management-ip service. The dcorch-patch-api-proxy repeatedly fails to bind to the management IP (which doesn't exist yet), which eventually causes SM to shut down and restart all services.
- Issue 2: Controller-1 then fails to go active because the mtcAgent can't get the cluster IP.
- Issue 3: The application of the worker manifest on controller-0 causes containerd to be restarted, which also restarts dockerd and kubelet. Dockerd never comes back. I believe the restart of containerd was introduced with the kata feature (https://review.opendev.org/#/c/703266/).
- Issue 4: Controller-0 then fails to go active because the dc-iso-fs service does not come up. It does not appear that the OCF script was invoked.
- Issue 5: All services on controller-0 are shut down at this point, but it takes over 30 minutes before it is actually rebooted.
- Issue 6: Controller-1 then fails to go active because the management-ip service fails to go active. It cannot assign the floating management IP due to an IPv6 address collision (controller-0 still has it assigned and hasn't rebooted). Once controller-0 reboots, the floating management IP is assigned and we recover.

Recommendations:

- Issue 1: Should be fixed ASAP by adding proper service dependencies. There are probably other missing SM dependencies for the dcorch proxies. (A minimal reproduction of the bind failure is sketched at the start of the detailed notes below.)
- Issue 2: Requires investigation by a maintenance person.
- Issue 3: Multiple problems here - we don't want the worker manifest to be causing containerd/dockerd/kubelet restarts (if possible) and we need to understand why dockerd never recovers.
- Issue 4: Need an SM person to investigate why dc-iso-fs did not come up.
- Issue 5: Need a maintenance person to investigate why it took so long for controller-0 to actually reboot after all the services were shut down.
- Issue 6: Will be resolved by a fix for issue 5.

Detailed investigation notes follow.
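For reference on Issue 1: the "Cannot assign requested address" failure can be reproduced outside of SM and dcorch. The sketch below is hypothetical Python, not dcorch code; the address fd01:11::2 and port 25491 are taken from the proxy logs further down. Binding to an IP that is not yet configured on any local interface fails with errno 99 (EADDRNOTAVAIL), which is why the proxy cannot start until the management-ip service has run.

    # Minimal sketch, not dcorch code: reproduce the Errno 99 bind failure.
    # fd01:11::2 / 25491 come from the dcorch-patch-api-proxy logs below.
    import errno
    import socket

    def try_bind(addr, port):
        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
        try:
            # Fails with EADDRNOTAVAIL (errno 99) until the floating
            # management IP has actually been added to an interface.
            sock.bind((addr, port))
            print("bound to [%s]:%d" % (addr, port))
        except OSError as exc:
            if exc.errno != errno.EADDRNOTAVAIL:
                raise
            print("[Errno 99] Cannot assign requested address: %s" % addr)
        finally:
            sock.close()

    if __name__ == "__main__":
        try_bind("fd01:11::2", 25491)

Adding an SM dependency so the dcorch proxies run-after management-ip should make this failure mode impossible on a normal go-active.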
# Controllers powered on here
[2020-03-21 09:07:24,367] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 16257'
[2020-03-21 09:07:27,985] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 95973'

# SM starts here
| 2020-03-21T09:13:56.110 | 1 | node-scn | controller-0 | unknown-unknown | unlocked-unknown | customer action

# Docker is started by systemd
2020-03-21T09:14:26.974 [18225.00077] controller-0 pmond mon pmonHdlr.cpp (1142) register_process : Info : dockerd Registered (3406)

# dcorch-patch-api-proxy does not go enabled
| 2020-03-21T09:15:07.103 | 261 | service-scn | dcorch-patch-api-proxy | enabled-active | disabled | process (pid=22794) failed

# because it can't bind the address
2020-03-21 09:15:07.043 22794 INFO dcorch.api.proxy [req-31e362cf-26ba-4ecd-88e7-a19c91a81c27 - - - - -] Server on http://fd01:11::2:25491 with 2
2020-03-21 09:15:07.044 22794 ERROR oslo.service.wsgi [req-31e362cf-26ba-4ecd-88e7-a19c91a81c27 - - - - -] Could not bind to fd01:11::2:25491: error: [Errno 99] Cannot assign requested address
2020-03-21 09:15:07.044 22794 CRITICAL dcorch [req-31e362cf-26ba-4ecd-88e7-a19c91a81c27 - - - - -] Unhandled error: error: [Errno 99] Cannot assign requested address

# SM disables all services
| 2020-03-21T09:15:10.641 | 267 | service-group-scn | distributed-cloud-services | go-active-failed | disabling-failed | dcorch-patch-api-proxy(disabled, failed)
| 2020-03-21T09:15:10.642 | 268 | service-group-scn | vim-services | go-active | disabling |
| 2020-03-21T09:15:10.643 | 269 | service-group-scn | cloud-services | go-active | disabling |
| 2020-03-21T09:15:10.644 | 270 | service-group-scn | controller-services | go-active | disabling |
| 2020-03-21T09:15:10.644 | 271 | service-group-scn | oam-services | go-active | disabling |

# So controller-1 attempts to go active
| 2020-03-21T09:15:17.388 | 218 | service-group-scn | distributed-cloud-services | standby | go-active |
| 2020-03-21T09:15:17.389 | 219 | service-group-scn | vim-services | standby | go-active |
| 2020-03-21T09:15:17.389 | 220 | service-group-scn | cloud-services | standby | go-active |
| 2020-03-21T09:15:17.390 | 221 | service-group-scn | controller-services | standby | go-active |
| 2020-03-21T09:15:17.391 | 222 | service-group-scn | oam-services | standby | go-active |

# It fails because of the mtc-agent
| 2020-03-21T09:15:33.476 | 381 | service-scn | mtc-agent | enabled-active | disabled | process (pid=34192) failed
| 2020-03-21T09:15:35.849 | 386 | service-group-scn | controller-services | active | active-failed | mtc-agent(disabled, failed)

# The mtcAgent seems to be unable to get the cluster IP
2020-03-21T09:15:33.465 [34192.00081] controller-1 mtcAgent --- msgClass.cpp (1052) msgClassTx : Info : Creating vlan172 socket on port 2118 with address: fd01:11::4
2020-03-21T09:15:33.465 [34192.00082] controller-1 mtcAgent --- msgClass.cpp ( 342) getAddressFromInterface :Error : controller-1-cluster-host ip address resolution failed (err: Name or service not known)
2020-03-21T09:15:33.465 fmAPI.cpp(137): Connected to FM Manager.
2020-03-21T09:15:33.465 [34192.00083] controller-1 mtcAgent --- msgClass.cpp (1052) msgClassTx : Info : Creating vlan173 socket on port 2115 with address: fd01:11::4
2020-03-21T09:15:33.465 fmAlarmUtils.cpp(630): Sending FM clear alarm request: alarm_id (200.001), entity_id (host=controller-1)
2020-03-21T09:15:33.465 [34192.00084] controller-1 mtcAgent --- msgClass.cpp ( 342) getAddressFromInterface :Error : controller-1-cluster-host ip address resolution failed (err: Name or service not known)
2020-03-21T09:15:33.465 fmAlarmUtils.cpp(669): FM Response for clear alarm: (10), alarm_id (200.001), entity_id (host=controller-1)
2020-03-21T09:15:33.465 [34192.00085] controller-1 mtcAgent --- msgClass.cpp (1121) initSocket :Error : Failed to bind socket to address (22:Invalid argument)
2020-03-21T09:15:33.465 [34192.00086] controller-1 mtcAgent mtc mtcNodeCtrl.cpp (1154) _self_provision :Error : Socket initialization failed (rc:34)
2020-03-21T09:15:33.465 [34192.00087] controller-1 mtcAgent mtc mtcNodeCtrl.cpp (1247) daemon_service_run :Error : Failed to self provision active controller

# This causes SM to disable all services on controller-1
| 2020-03-21T09:15:37.498 | 387 | service-group-scn | distributed-cloud-services | active | disabling |
| 2020-03-21T09:15:37.499 | 388 | service-group-scn | vim-services | active | disabling |
| 2020-03-21T09:15:37.500 | 389 | service-group-scn | cloud-services | active | disabling |
| 2020-03-21T09:15:37.500 | 390 | service-group-scn | controller-services | active-failed | disabling-failed | mtc-agent(disabled, failed)
| 2020-03-21T09:15:37.501 | 391 | service-group-scn | oam-services | active | disabling |

# Back on controller-0 - application of the worker manifest restarts containerd
2020-03-21T09:15:44.062 Debug: 2020-03-21 09:15:44 +0000 Exec[enable-containerd](provider=posix): Executing '/usr/bin/systemctl enable containerd.service'
2020-03-21T09:15:44.064 Debug: 2020-03-21 09:15:44 +0000 Executing: '/usr/bin/systemctl enable containerd.service'
2020-03-21T09:15:44.114 Notice: 2020-03-21 09:15:44 +0000 /Stage[main]/Platform::Containerd::Config/Exec[enable-containerd]/returns: executed successfully
2020-03-21T09:15:44.115 Debug: 2020-03-21 09:15:44 +0000 /Stage[main]/Platform::Containerd::Config/Exec[enable-containerd]: The container Class[Platform::Containerd::Config] will propagate my refresh event
2020-03-21T09:15:44.118 Debug: 2020-03-21 09:15:44 +0000 Exec[restart-containerd](provider=posix): Executing '/usr/bin/systemctl restart containerd.service'
2020-03-21T09:15:44.120 Debug: 2020-03-21 09:15:44 +0000 Executing: '/usr/bin/systemctl restart containerd.service'

# This causes both docker and kubelet to be restarted
2020-03-21T09:15:44.219 controller-0 systemd[1]: info Stopped Kubernetes Kubelet Server.
2020-03-21T09:15:44.226 controller-0 systemd[1]: info Stopping Docker Application Container Engine...
2020-03-21T09:15:44.329 controller-0 systemd[1]: info Starting Docker Application Container Engine...
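For Issue 3, it would help to confirm exactly what pulls docker and kubelet into the containerd restart. The sketch below is a hypothetical diagnostic, not part of any existing StarlingX tooling; the unit names are assumed from the systemd logs above. It simply dumps the reverse dependency tree of containerd and the dependency-related properties of the affected units, which should show the systemd relationship (e.g. BindsTo/PartOf) that makes the puppet-driven restart propagate.

    # Diagnostic sketch (hypothetical, not part of any existing tool): show
    # which units systemd will stop/restart together with containerd, to
    # explain why "systemctl restart containerd.service" also takes down
    # dockerd and kubelet.
    import subprocess

    def show_restart_propagation():
        # Units that depend on containerd (reverse dependency tree).
        subprocess.run(
            ["systemctl", "list-dependencies", "--reverse",
             "containerd.service"],
            check=False)
        # Dependency-related properties of the affected units.
        for unit in ("docker.service", "kubelet.service"):
            subprocess.run(
                ["systemctl", "show", unit,
                 "--property=Requires,BindsTo,PartOf,After"],
                check=False)

    if __name__ == "__main__":
        show_restart_propagation()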
# This causes pmon to detect that dockerd has failed and pmon attempts to restart it
2020-03-21T09:15:44.233 [18225.00103] controller-0 pmond mon pmonHdlr.cpp ( 303) manage_process_failure :Error : dockerd failed (3406) (p:1 a:0)
2020-03-21T09:15:44.237 [18225.00104] controller-0 pmond com nodeUtil.cpp (1844) get_system_state : Info : systemctl reports host as 'starting'
2020-03-21T09:15:44.237 [18225.00105] controller-0 pmond mon pmonHdlr.cpp (1547) manage_alarm : Info : dockerd process has failed ; Auto recovery in progress.
2020-03-21T09:15:44.237 [18225.00106] controller-0 pmond mon pmonMsg.cpp ( 328) pmon_send_event : Info : controller-0 pmon log sent
2020-03-21T09:15:44.474 [18225.00107] controller-0 pmond mon pmonHdlr.cpp (1007) process_running : Info : dockerd process not running
2020-03-21T09:15:44.474 [18225.00108] controller-0 pmond mon pmonHdlr.cpp (1311) respawn_process : Info : dockerd Spawn (75467)
2020-03-21T09:15:49.974 [18225.00109] controller-0 pmond mon pmonFsm.cpp ( 594) pmon_passive_handler :Error : dockerd spawn timeout (75467)
2020-03-21T09:15:49.974 [18225.00110] controller-0 pmond mon pmonHdlr.cpp ( 946) kill_running_process : Warn : dockerd Killed (75467)
2020-03-21T09:15:49.974 [18225.00111] controller-0 pmond mon pmonHdlr.cpp ( 965) kill_running_child : Warn : dockerd start script still running (75467) ; killed
2020-03-21T09:15:56.474 [18225.00112] controller-0 pmond mon pmonHdlr.cpp (1007) process_running : Info : dockerd process not running
2020-03-21T09:15:56.475 [18225.00113] controller-0 pmond mon pmonHdlr.cpp (1311) respawn_process : Info : dockerd Spawn (79441)
2020-03-21T09:16:01.974 [18225.00114] controller-0 pmond mon pmonFsm.cpp ( 594) pmon_passive_handler :Error : dockerd spawn timeout (79441)

# Pmon continues to attempt to restart docker until controller-0 is rebooted
2020-03-21T09:26:07.474 [18225.00505] controller-0 pmond mon pmonHdlr.cpp (1007) process_running : Info : dockerd process not running
2020-03-21T09:26:07.475 [18225.00506] controller-0 pmond mon pmonHdlr.cpp (1311) respawn_process : Info : dockerd Spawn (422494)
2020-03-21T09:26:09.848 [18225.00507] controller-0 pmond mon pmonHdlr.cpp (1468) daemon_sigchld_hdlr : Info : dockerd spawn failed (rc:1:1) (2.373 secs)

# The systemd logs record docker being terminated and restarted (by pmon I assume)
2020-03-21T09:15:44.501 controller-0 dockerd[75127]: info time="2020-03-21T09:15:44.501832104Z" level=info msg="Processing signal 'terminated'"
2020-03-21T09:15:55.035 controller-0 systemd[1]: info Stopped Docker Application Container Engine.
2020-03-21T09:15:55.073 controller-0 systemd[1]: info Starting Docker Application Container Engine...
2020-03-21T09:15:56.480 controller-0 dockerd[78772]: info time="2020-03-21T09:15:56.480227424Z" level=info msg="Processing signal 'terminated'"

# At this point SM has disabled the services on controller-1, so SM on controller-0 attempts to re-enable the services
| 2020-03-21T09:20:11.565 | 349 | service-group-scn | distributed-cloud-services | disabled | go-active |
| 2020-03-21T09:20:11.565 | 350 | service-group-scn | vim-services | disabled | go-active |
| 2020-03-21T09:20:11.566 | 351 | service-group-scn | cloud-services | disabled | go-active |
| 2020-03-21T09:20:11.567 | 352 | service-group-scn | controller-services | disabled | go-active |
| 2020-03-21T09:20:11.567 | 353 | service-group-scn | oam-services | disabled | go-active |

# This time the controller-0 management-ip service assigns the floating IP
2020-03-21T09:20:12.025 controller-0 OCF_IPaddr2(management-ip)[357575]: info INFO: Adding inet6 address fd01:11::2/64 to device vlan172 (with preferred_lft forever)
2020-03-21T09:20:12.032 controller-0 OCF_IPaddr2(management-ip)[357575]: info INFO: Bringing device vlan172 up
2020-03-21T09:20:14.050 controller-0 OCF_IPaddr2(management-ip)[357575]: info INFO: /usr/libexec/heartbeat/send_ua -i 200 -c 7 fd01:11::2 64 vlan172

# The dcorch-patch-api-proxy fails to enable once, but then succeeds (after the floating IP was assigned)
| 2020-03-21T09:20:13.046 | 435 | service-scn | dcorch-patch-api-proxy | enabled-active | disabled | process (pid=358067) failed
| 2020-03-21T09:20:13.484 | 436 | service-scn | dcorch-patch-api-proxy | disabled | enabling | enabled-active state requested
| 2020-03-21T09:20:13.542 | 437 | service-scn | dcorch-patch-api-proxy | enabling | enabled-active | enable success

# However, dc-iso-fs and platform-export-fs fail to enable
| 2020-03-21T09:21:16.529 | 490 | service-scn | dc-iso-fs | enabling | disabling | enable failed
| 2020-03-21T09:21:36.518 | 491 | service-scn | platform-export-fs | enabling | disabling | enable failed

# I do not see any logs from the dc-iso-fs OCF script, so I don't know why it didn't go enabled (it appears SM did not call the start)

# SM on controller-0 does attempt to disable the management-ip service, but it doesn't look like it completes
| 2020-03-21T09:22:43.195 | 570 | service-scn | management-ip | enabled-active | disabling | disable state requested

# Puppet eventually fails the manifest apply because docker is not running
2020-03-21T09:26:09.860 Error: 2020-03-21 09:26:09 +0000 Systemd start for docker failed!

# I assume that is what causes controller-0 to reboot? I didn't find the linkage.

# I see the controller-0 services shut down here - for example here is SM
2020-03-21T09:26:15.000 controller-0 sm: debug time[972.073] log<2800> INFO: sm[18147]: sm_process.c(857): Shutting down.
2020-03-21T09:26:15.000 controller-0 sm: debug time[972.403] log<211> INFO: sm_ta[18177]: sm_task_affining_thread.c(145): Shutting down.
2020-03-21T09:26:15.000 controller-0 sm: debug time[972.403] log<212> INFO: sm_ta[18177]: sm_task_affining_thread.c(154): Shutdown complete.
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug time[976.376] log<7> INFO: sm-watchdog[17760]: sm_watchdog_process.c(221): Shutting down.
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug time[976.376] log<8> INFO: sm-watchdog[17760]: sm_watchdog_module.c(231): Finalizing module (/var/lib/sm/watchdog/modules/libsm_watchdog_nfs.so.1.0.0).
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug time[976.376] log<9> INFO: sm-watchdog[17760]: sm_watchdog_module.c(231): Finalizing module (/var/lib/sm/watchdog/modules/libsm_watchdog_nfs.so.1.0.0).
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug time[976.376] log<10> INFO: sm-watchdog[17760]: sm_watchdog_process.c(237): Shutdown complete.
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutting down.
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutdown complete.

# Here are the last logs in kern.log before the reboot
2020-03-21T09:26:12.343 controller-0 kernel: info [ 968.975201] c6xx 0000:da:00.0: qat_dev2 stopped 10 acceleration engines
2020-03-21T09:26:13.347 controller-0 kernel: info [ 969.976946] c6xx 0000:da:00.0: Resetting device qat_dev2
2020-03-21T09:26:13.347 controller-0 kernel: info [ 969.976949] c6xx 0000:da:00.0: Function level reset

# But I don't see controller-0 come up until here
2020-03-21T09:58:03.000 controller-0 kernel: info [ 0.000000] Initializing cgroup subsys cpuset
2020-03-21T09:58:03.000 controller-0 kernel: info [ 0.000000] Initializing cgroup subsys cpu
2020-03-21T09:58:03.000 controller-0 kernel: info [ 0.000000] Initializing cgroup subsys cpuacct

# Meanwhile, after SM is shut down on controller-0, controller-1 takes activity again
| 2020-03-21T09:26:14.337 | 594 | neighbor-scn | controller-0 | full | down | Neighbor dead
| 2020-03-21T09:26:14.337 | 595 | service-domain-scn | controller | backup | waiting | neighbor controller-0 ageout of leader
| 2020-03-21T09:26:14.636 | 596 | service-group-scn | distributed-cloud-services | standby | go-active |
| 2020-03-21T09:26:14.637 | 597 | service-group-scn | vim-services | standby | go-active |
| 2020-03-21T09:26:14.638 | 598 | service-group-scn | storage-monitoring-services | standby | go-active |
| 2020-03-21T09:26:14.639 | 599 | service-group-scn | patching-services | standby | go-active |
| 2020-03-21T09:26:14.639 | 600 | service-group-scn | cloud-services | standby | go-active |
| 2020-03-21T09:26:14.640 | 601 | service-group-scn | controller-services | standby | go-active |
| 2020-03-21T09:26:14.641 | 602 | service-group-scn | oam-services | standby | go-active |

# However, the management-ip can't be assigned
| 2020-03-21T09:26:16.968 | 675 | service-scn | management-ip | enabling | disabling | enable failed

# This is due to an address collision for the floating IP
2020-03-21T09:26:14.851 controller-1 OCF_IPaddr2(management-ip)[160972]: info INFO: Adding inet6 address fd01:11::2/64 to device vlan172 (with preferred_lft forever)
2020-03-21T09:26:14.855 controller-1 OCF_IPaddr2(management-ip)[160972]: info INFO: Bringing device vlan172 up
2020-03-21T09:26:15.865 controller-1 OCF_IPaddr2(management-ip)[160972]: err ERROR: IPv6 address collision fd01:11::2 [DAD]
2020-03-21T09:26:15.889 controller-1 OCF_IPaddr2(management-ip)[160972]: err ERROR: run_send_ua failed.

# This continues until controller-0 finally reboots, at which point controller-1 can use the floating IP
2020-03-21T09:56:21.673 controller-1 OCF_IPaddr2(management-ip)[2510974]: info INFO: Adding inet6 address fd01:11::2/64 to device vlan172 (with preferred_lft forever)
2020-03-21T09:56:21.677 controller-1 OCF_IPaddr2(management-ip)[2510974]: info INFO: Bringing device vlan172 up
2020-03-21T09:56:23.689 controller-1 OCF_IPaddr2(management-ip)[2510974]: info INFO: /usr/libexec/heartbeat/send_ua -i 200 -c 7 fd01:11::2 64 vlan172
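For Issue 6, the DAD collision can also be confirmed directly on controller-1: when IPv6 duplicate address detection fails, the kernel leaves the address flagged "dadfailed" on the interface. The sketch below is a hypothetical check, not part of the OCF_IPaddr2 agent; the address and device come from the logs above.

    # Diagnostic sketch (hypothetical, not part of the OCF_IPaddr2 agent):
    # after attempting to add the floating management IP, check whether IPv6
    # duplicate address detection (DAD) failed -- the kernel flags the
    # address "dadfailed" when another host (here, the not-yet-rebooted
    # controller-0) still answers for it.
    import subprocess

    def dad_failed(addr="fd01:11::2", dev="vlan172"):
        result = subprocess.run(
            ["ip", "-6", "addr", "show", "dev", dev],
            capture_output=True, text=True, check=False)
        for line in result.stdout.splitlines():
            if addr in line or "dadfailed" in line:
                print(line.strip())
        return "dadfailed" in result.stdout

    if __name__ == "__main__":
        print("collision" if dad_failed() else "no collision detected")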