Comment 1 for bug 1869195

Eric MacDonald (rocksolidmtce) wrote :

SM eventually rebooted controller-0 after it experienced a fatal service failure.

SM started enabling services on controller-0 at 09:20:11.
At 09:25:37 platform-export-fs reported a fatal failure, so SM disabled all services and initiated a reboot.

2020-03-21T09:25:37.000 controller-0 sm: debug time[934.142] log<2669> INFO: sm[18147]: sm_service_fsm.c(1032): Service (platform-export-fs) received event (disable-failed) was in the disabling-failed state and is now in the disabling-failed state, condition=fatal-failure.

2020-03-21T09:22:37.394 | 510 | service-group-scn | controller-services | go-active | go-active-failed | platform-export-fs(disabling, failed)

2020-03-21T09:25:39.824 | 639 | node-reboot | controller-0 | | | service group (controller-services) recovery from fatal condition escalated to a reboot.

2020-03-21T09:26:06.000 controller-0 sm: debug time[963.872] log<2715> INFO: sm[18147]: sm_service_api.c(300): Recovery of service (platform-export-fs) requested.
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2716> INFO: sm[18147]: sm_node_api.cpp(888): ***********************************************
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2717> INFO: sm[18147]: sm_node_api.cpp(889): ** Issuing a controlled reboot of the system **
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2718> INFO: sm[18147]: sm_node_api.cpp(890): ***********************************************

SM did not start again on controller-0 until 34 minutes later. The following are 4 consecutive sm-watchdog logs.

2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutting down.
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutdown complete.
2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Starting
2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Started.

The kernel log suggests the server hung in BIOS for roughly 30 minutes (09:26:13 to 09:58:03). Again, these are consecutive logs.

2020-03-21T09:26:13.347 controller-0 kernel: info [ 969.976949] c6xx 0000:da:00.0: Function level reset
2020-03-21T09:58:03.000 controller-0 kernel: info [ 0.000000] Initializing cgroup subsys cpuset
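
The gaps quoted above can be checked mechanically. The following is a minimal sketch, assuming each log line starts with a timestamp in the form shown in the excerpts; the helper names `parse_ts` and `find_gaps` and the 5 minute threshold are illustrative choices, not part of any StarlingX tooling.

```python
from datetime import datetime, timedelta


def parse_ts(line):
    """Parse the leading timestamp of a log line, e.g. 2020-03-21T09:26:13.347."""
    return datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%f")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous_ts, next_ts, gap) for consecutive lines further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between pasted excerpts
        ts = parse_ts(line)
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


# Log lines copied from the excerpts in this comment.
watchdog_log = [
    "2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutting down.",
    "2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutdown complete.",
    "2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Starting",
    "2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Started.",
]
kernel_log = [
    "2020-03-21T09:26:13.347 controller-0 kernel: info [ 969.976949] c6xx 0000:da:00.0: Function level reset",
    "2020-03-21T09:58:03.000 controller-0 kernel: info [ 0.000000] Initializing cgroup subsys cpuset",
]

for name, log in (("sm-watchdog", watchdog_log), ("kernel", kernel_log)):
    for before, after, gap in find_gaps(log):
        print(f"{name}: no logs from {before.time()} to {after.time()} ({gap})")
```

Run against the full sm and kernel logs rather than these excerpts, the same scan would flag any other silent periods longer than the chosen threshold.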