AIO-DX: controller took more than 30 minutes to reboot after all services shut down

Bug #1869195 reported by Bart Wensley
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
In an AIO-DX Distributed Cloud system controller, after powering off/on both system controller nodes, the ssh connection was lost for 50 minutes.

Investigation of the above issue (bug 1868604) revealed that one of the failures was that, after all services on controller-0 were shut down, it took over 30 minutes before the node was rebooted.

See bug 1868604 for the full analysis.

Severity
--------
Major

Steps to Reproduce
------------------
In a Distributed Cloud system, power off/on both (AIO-DX) system controller nodes and check the ssh connection.

Expected Behavior
------------------
The ssh connection should resume within 5 minutes after the nodes boot up.

Actual Behavior
----------------
The ssh connection was re-established after 50 minutes.

Reproducibility
---------------
Unknown - this is the first time it has been seen in sanity; will monitor.

System Configuration
--------------------
DC system (AIO-DX system controller)

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-03-20_00-10-00

Last Pass
---------
Last passed on the same system with the following load:
Load: 2020-03-14_04-10-00

Timestamp/Logs
--------------
See bug 1868604

Test Activity
-------------
Sanity

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)

Ghada Khalil (gkhalil)
tags: added: stx.metal

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
Eric MacDonald (rocksolidmtce) wrote:

SM eventually rebooted controller-0 after it experienced a fatal service failure.

SM started enabling services on controller-0 at 09:20:11.
At 09:25:37, platform-export-fs reported a fatal failure, so SM disabled all services and initiated a reboot.

2020-03-21T09:25:37.000 controller-0 sm: debug time[934.142] log<2669> INFO: sm[18147]: sm_service_fsm.c(1032): Service (platform-export-fs) received event (disable-failed) was in the disabling-failed state and is now in the disabling-failed state, condition=fatal-failure.

2020-03-21T09:22:37.394 | 510 | service-group-scn | controller-services | go-active | go-active-failed | platform-export-fs(disabling, failed)

2020-03-21T09:25:39.824 | 639 | node-reboot | controller-0 | | | service group (controller-services) recovery from fatal condition escalated to a reboot.

2020-03-21T09:26:06.000 controller-0 sm: debug time[963.872] log<2715> INFO: sm[18147]: sm_service_api.c(300): Recovery of service (platform-export-fs) requested.
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2716> INFO: sm[18147]: sm_node_api.cpp(888): ***********************************************
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2717> INFO: sm[18147]: sm_node_api.cpp(889): ** Issuing a controlled reboot of the system **
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2718> INFO: sm[18147]: sm_node_api.cpp(890): ***********************************************

SM did not start again on controller-0 until 34 minutes later. The following are four back-to-back sm logs.

2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutting down.
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutdown complete.
2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Starting
2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Started.

The kernel log shows that the server seemed to hang in BIOS for roughly 30 minutes. Again, back-to-back logs:

2020-03-21T09:26:13.347 controller-0 kernel: info [ 969.976949] c6xx 0000:da:00.0: Function level reset
2020-03-21T09:58:03.000 controller-0 kernel: info [ 0.000000] Initializing cgroup subsys cpuset
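
As a quick sanity check, the gaps above can be confirmed from the quoted timestamps alone; the following is a minimal sketch (not part of the original analysis) using only Python's standard library, with the values copied from the log lines in this report:

from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

def gap_minutes(start, end):
    # Difference between two log timestamps, in minutes.
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

# sm-watchdog "Shutdown complete." -> "Starting"
print(gap_minutes("2020-03-21T09:26:19", "2020-03-21T10:00:22"))  # ~34.1 minutes

# last kernel log before the reset -> first kernel log after boot
print(gap_minutes("2020-03-21T09:26:13", "2020-03-21T09:58:03"))  # ~31.8 minutes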

Eric MacDonald (rocksolidmtce) wrote:

No relevant Controller-0 BMC SEL logs.

The server was AWOL for 30 minutes over the reset:
 - stuck shutdown?
 - stuck in BIOS?

There are zero logs for the missing 30 minutes.

Requesting that this issue be re-gated and re-assigned, or closed.

Eric MacDonald (rocksolidmtce) wrote:

Looks like a hardware issue.
Monitor and re-open if it occurs again.

Changed in starlingx:
status: Triaged → Invalid