AIO-DX: controller took more than 30 minutes to reboot after all services shut down
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Invalid | Medium | Eric MacDonald |
Bug Description
Brief Description
-----------------
In an AIO-DX Distributed Cloud system controller, after powering off/on both system controller nodes, the ssh connection was lost for 50 minutes.
Investigation of the above issue (bug 1868604) revealed that one of the failures was that after all services on controller-0 were shut down, it took over 30 minutes before it was rebooted.
See bug 1868604 for the full analysis.
Severity
--------
Major
Steps to Reproduce
------------------
In Distributed Cloud, power off/on both (AIO-DX) system controller nodes, check ssh connection.
Expected Behavior
------------------
The ssh connection should resume within 5 minutes after the nodes boot up.
Actual Behavior
----------------
The ssh connection was only re-established after 50 minutes.
Reproducibility
---------------
Unknown - first time this is seen in sanity, will monitor
System Configuration
-------
DC system (AIO-DX system controller)
Lab-name: DC-3
Branch/Pull Time/Commit
-------
2020-03-20_00-10-00
Last Pass
---------
Last passed on same system with following load:
Load: 2020-03-14_04-10-00
Timestamp/Logs
--------------
See bug 1868604
Test Activity
-------------
Sanity
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
SM eventually rebooted controller-0 after it experienced a fatal service failure.
SM started enabling services on controller-0 at 09:20:11.
At 09:25:37 platform-export-fs reported a fatal failure, so SM disabled all services and initiated a reboot.
2020-03-21T09:25:37.000 controller-0 sm: debug time[934.142] log<2669> INFO: sm[18147]: sm_service_fsm.c(1032): Service (platform-export-fs) received event (disable-failed) was in the disabling-failed state and is now in the disabling-failed state, condition=fatal-failure.
2020-03-21T09:22:37.394 | 510 | service-group-scn | controller-services | go-active | go-active-failed | platform-export-fs(disabling, failed)
2020-03-21T09:25:39.824 | 639 | node-reboot | controller-0 | | | service group (controller-services) recovery from fatal condition escalated to a reboot.
2020-03-21T09:26:06.000 controller-0 sm: debug time[963.872] log<2715> INFO: sm[18147]: sm_service_api.c(300): Recovery of service (platform-export-fs) requested.
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2716> INFO: sm[18147]: sm_node_api.cpp(888): *************************************************
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2717> INFO: sm[18147]: sm_node_api.cpp(889): ** Issuing a controlled reboot of the system **
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2718> INFO: sm[18147]: sm_node_api.cpp(890): *************************************************
SM did not start again on controller-0 until 34 minutes later. The following are four back-to-back sm-watchdog logs:
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutting down.
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutdown complete.
2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Starting
2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Started.
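The ~34-minute gap can be verified by diffing the sm-watchdog timestamps directly. A minimal sketch (a hypothetical helper, not part of any StarlingX tooling), assuming the `YYYY-MM-DDTHH:MM:SS.mmm` timestamp format used in these logs:

```python
from datetime import datetime

# Hypothetical helper: minutes elapsed between two log timestamps
# in the "%Y-%m-%dT%H:%M:%S.%f" format seen in the sm-watchdog logs.
def log_gap_minutes(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

# "Shutdown complete." at 09:26:19 vs. "Starting" at 10:00:22
gap = log_gap_minutes("2020-03-21T09:26:19.000", "2020-03-21T10:00:22.000")
print(round(gap))  # → 34
```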
The kernel log shows that the server appeared to hang in the BIOS for about 30 minutes. Again, back-to-back logs:
2020-03-21T09:26:13.347 controller-0 kernel: info [ 969.976949] c6xx 0000:da:00.0: Function level reset
2020-03-21T09:58:03.000 controller-0 kernel: info [ 0.000000] Initializing cgroup subsys cpuset