AIO-DX: controller took more than 30 minutes to reboot after all services shut down

Bug #1869195 reported by Bart Wensley
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
In an AIO-DX Distributed Cloud system controller, after powering off/on both system controller nodes, the ssh connection was lost for 50 minutes.

Investigation of the above issue (bug 1868604) revealed that one of the failures was that, after all services on controller-0 were shut down, it took over 30 minutes before the node was rebooted.

See bug 1868604 for the full analysis.

Severity
--------
Major

Steps to Reproduce
------------------
In a Distributed Cloud system, power off/on both (AIO-DX) system controller nodes and check the ssh connection.

Expected Behavior
------------------
The ssh connection should resume within 5 minutes after the nodes boot up.

Actual Behavior
----------------
The ssh connection was re-established after 50 minutes.

Reproducibility
---------------
Unknown - this is the first time it has been seen in sanity; will monitor.

System Configuration
--------------------
DC system (AIO-DX system controller)

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-03-20_00-10-00

Last Pass
---------
Last passed on the same system with the following load:
Load: 2020-03-14_04-10-00

Timestamp/Logs
--------------
See bug 1868604

Test Activity
-------------
Sanity

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)

Ghada Khalil (gkhalil)
tags: added: stx.metal

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
Eric MacDonald (rocksolidmtce) wrote:

SM eventually rebooted controller-0 after it experienced a fatal service failure.

SM started enabling services on controller-0 at 09:20:11.
At 09:25:37, platform-export-fs reported a fatal failure, so SM disabled all services and initiated a reboot.

2020-03-21T09:25:37.000 controller-0 sm: debug time[934.142] log<2669> INFO: sm[18147]: sm_service_fsm.c(1032): Service (platform-export-fs) received event (disable-failed) was in the disabling-failed state and is now in the disabling-failed state, condition=fatal-failure.

2020-03-21T09:22:37.394 | 510 | service-group-scn | controller-services | go-active | go-active-failed | platform-export-fs(disabling, failed)

2020-03-21T09:25:39.824 | 639 | node-reboot | controller-0 | | | service group (controller-services) recovery from fatal condition escalated to a reboot.

2020-03-21T09:26:06.000 controller-0 sm: debug time[963.872] log<2715> INFO: sm[18147]: sm_service_api.c(300): Recovery of service (platform-export-fs) requested.
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2716> INFO: sm[18147]: sm_node_api.cpp(888): ***********************************************
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2717> INFO: sm[18147]: sm_node_api.cpp(889): ** Issuing a controlled reboot of the system **
2020-03-21T09:26:09.000 controller-0 sm: debug time[966.767] log<2718> INFO: sm[18147]: sm_node_api.cpp(890): ***********************************************

SM did not start again on controller-0 until 34 minutes later. The following are four back-to-back sm logs.

2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutting down.
2020-03-21T09:26:19.000 controller-0 sm-watchdog: debug Shutdown complete.
2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Starting
2020-03-21T10:00:22.000 controller-0 sm-watchdog: debug Started.

The kernel log shows that the server seemed to hang in BIOS for roughly 30 minutes. Again, back-to-back logs:

2020-03-21T09:26:13.347 controller-0 kernel: info [ 969.976949] c6xx 0000:da:00.0: Function level reset
2020-03-21T09:58:03.000 controller-0 kernel: info [ 0.000000] Initializing cgroup subsys cpuset
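
As a quick sanity check, the gaps above can be confirmed from the quoted timestamps alone; the following is a minimal sketch (not part of the original analysis) using only Python's standard library, with the values copied from the log lines in this report:

from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

def gap_minutes(start, end):
    # Difference between two log timestamps, in minutes.
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

# sm-watchdog "Shutdown complete." -> "Starting"
print(gap_minutes("2020-03-21T09:26:19", "2020-03-21T10:00:22"))  # ~34.1 minutes

# last kernel log before the reset -> first kernel log after boot
print(gap_minutes("2020-03-21T09:26:13", "2020-03-21T09:58:03"))  # ~31.8 minutes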

Eric MacDonald (rocksolidmtce) wrote:

No relevant Controller-0 BMC SEL logs.

The server was AWOL for 30 minutes over the reset:
 - stuck shutdown?
 - stuck in BIOS?

There are zero logs for the missing 30 minutes.

Requesting that this issue be re-gated and re-assigned, or closed.

Eric MacDonald (rocksolidmtce) wrote:

Looks like a hardware issue.
Monitor and re-open if it occurs again.

Changed in starlingx:
status: Triaged → Invalid