Distributed Cloud: dc-iso-fs service did not recover after power cycling both controllers

Bug #1869194 reported by Bart Wensley
Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Bin Qian
Milestone: (none)

Bug Description

Brief Description
-----------------
On an AIO-DX Distributed Cloud system controller, after powering off/on both system controller nodes, the ssh connection was lost for 50 minutes.

Investigation of the above issue (bug 1868604) revealed that one of the failures was that the dc-iso-fs service did not come up. It did not appear that the OCF script was invoked.

See bug 1868604 for the full analysis.

Severity
--------
Major

Steps to Reproduce
------------------
In Distributed Cloud, power off/on both (AIO-DX) system controller nodes, then check the ssh connection.

Expected Behavior
------------------
The ssh connection should resume within 5 minutes after the nodes boot up.

Actual Behavior
----------------
The ssh connection was only re-established after 50 minutes.

Reproducibility
---------------
Unknown - this is the first time it has been seen in sanity; will monitor.

System Configuration
--------------------
DC system (AIO-DX system controller)

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-03-20_00-10-00

Last Pass
---------
Last passed on the same system with the following load:
Load: 2020-03-14_04-10-00

Timestamp/Logs
--------------
See bug 1868604

Test Activity
-------------
Sanity

tags: added: stx.distcloud
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue w/ dead office recovery on DC

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
tags: added: stx.4.0
Revision history for this message
Dariush Eslimi (deslimi) wrote :

assigning to Bin for further investigation.

Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Bin Qian (bqian20)
Revision history for this message
Bin Qian (bqian20) wrote :

The OCF script for dc-iso-fs had been invoked, multiple times. The daemon-ocf.log had been archived (we should probably consider reducing some logging?).

2020-03-21T09:15:22.747 controller-1 OCF_Filesystem(dc-iso-fs)[27126]: info INFO: Running start for /opt/platform/iso on /www/pages/iso
2020-03-21T09:15:37.885 controller-1 OCF_Filesystem(dc-iso-fs)[48597]: info INFO: Running stop for /opt/platform/iso on /www/pages/iso
2020-03-21T09:15:37.902 controller-1 OCF_Filesystem(dc-iso-fs)[48597]: info INFO: Trying to unmount /www/pages/iso
2020-03-21T09:15:39.371 controller-1 OCF_Filesystem(dc-iso-fs)[48597]: info INFO: unmounted /www/pages/iso successfully
2020-03-21T09:56:30.200 controller-1 OCF_Filesystem(dc-iso-fs)[2522984]: info INFO: Running start for /opt/platform/iso on /www/pages/iso
2020-03-21T10:00:30.211 controller-1 OCF_Filesystem(dc-iso-fs)[2845934]: info INFO: Running stop for /opt/platform/iso on /www/pages/iso
2020-03-21T10:00:30.229 controller-1 OCF_Filesystem(dc-iso-fs)[2845934]: info INFO: Trying to unmount /www/pages/iso
2020-03-21T10:00:31.712 controller-1 OCF_Filesystem(dc-iso-fs)[2845934]: info INFO: unmounted /www/pages/iso successfully
2020-03-21T05:27:52.962 controller-1 OCF_Filesystem(dc-iso-fs)[3164666]: info INFO: Running start for /opt/platform/iso on /www/pages/iso

The archived daemon-ocf.log.6.gz on controller-0 is probably messed up; the timestamps are not in order:
2020-03-21T06:58:31.291 controller-0 OCF_Filesystem(dc-iso-fs)[116181]: info INFO: Running start for /opt/platform/iso on /www/pages/iso
2020-03-21T09:15:10.774 controller-0 OCF_Filesystem(dc-iso-fs)[23571]: info INFO: Running stop for /opt/platform/iso on /www/pages/iso
2020-03-21T10:00:55.577 controller-0 OCF_Filesystem(dc-iso-fs)[123119]: info INFO: Running start for /opt/platform/iso on /www/pages/iso
2020-03-21T05:27:33.111 controller-0 OCF_Filesystem(dc-iso-fs)[472667]: info INFO: Running stop for /opt/platform/iso on /www/pages/iso
2020-03-21T05:27:33.128 controller-0 OCF_Filesystem(dc-iso-fs)[472667]: info INFO: Trying to unmount /www/pages/iso
2020-03-21T05:27:34.465 controller-0 OCF_Filesystem(dc-iso-fs)[472667]: info INFO: unmounted /www/pages/iso successfully
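
(As a quick way to confirm the ordering problem, a minimal check over the archived file could look like the sketch below; the gzipped log path and the ISO-timestamp line format are assumptions taken from the excerpts above.)

import gzip
import re
import sys

# Check whether timestamps in an archived daemon-ocf log are monotonically
# increasing; path and line format are assumptions based on the excerpts above.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)")

def check_order(path):
    prev = None
    with gzip.open(path, "rt", errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            m = TS_RE.match(line)
            if not m:
                continue
            ts = m.group(1)
            # ISO-8601 timestamps compare correctly as plain strings
            if prev is not None and ts < prev:
                print("out of order at line %d: %s < %s" % (lineno, ts, prev))
            prev = ts

if __name__ == "__main__":
    check_order(sys.argv[1] if len(sys.argv) > 1 else "daemon-ocf.log.6.gz")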

Revision history for this message
Bin Qian (bqian20) wrote :

The go-enabled timeout (30s) seems insufficient in this case. Although it may be related to the mgmt IP, the enabling actions timed out consistently:
| 2020-03-21T09:20:16.508 | 463 | service-scn | dc-iso-fs | enabling-throttle | enabling | enabled-active state requested
| 2020-03-21T09:21:16.529 | 490 | service-scn | dc-iso-fs | enabling | disabling | enable failed

2020-03-21T09:20:16.000 controller-0 sm: debug time[613.439] log<1945> INFO: sm[18147]: sm_service_fsm.c(1032): Service (dc-iso-fs) received event (enable-throttle) was in the enabling-throttle state and is now in the enabling state.
2020-03-21T09:20:46.000 controller-0 sm: debug time[643.439] log<2044> INFO: sm[18147]: sm_service_enable.c(245): Action (enable) timeout with result (failed), state (unknown), status (unknown), and condition (unknown) for service (dc-iso-fs), reason_text=, exit_code=-65534.
2020-03-21T09:20:46.000 controller-0 sm: debug time[643.439] log<2045> INFO: sm[18147]: sm_service_action.c(345): Aborting service (dc-iso-fs) with kill signal, pid=362782.
2020-03-21T09:20:46.000 controller-0 sm: debug time[643.439] log<2046> INFO: sm[18147]: sm_service_enable.c(154): Max retires not met for action (enable) of service (dc-iso-fs), attempts=1.
2020-03-21T09:20:46.000 controller-0 sm: debug time[643.440] log<2047> INFO: sm[18147]: sm_service_enable.c(461): Started enable action (370970) for service (dc-iso-fs).
2020-03-21T09:21:16.000 controller-0 sm: debug time[673.457] log<2092> INFO: sm[18147]: sm_service_enable.c(245): Action (enable) timeout with result (failed), state (unknown), status (unknown), and condition (unknown) for service (dc-iso-fs), reason_text=, exit_code=-65534.
2020-03-21T09:21:16.000 controller-0 sm: debug time[673.457] log<2093> INFO: sm[18147]: sm_service_action.c(345): Aborting service (dc-iso-fs) with kill signal, pid=370970.
2020-03-21T09:21:16.000 controller-0 sm: debug time[673.457] log<2094> INFO: sm[18147]: sm_service_enable.c(149): Max retires met for action (enable) of service (dc-iso-fs), attempts=2.
2020-03-21T09:21:16.000 controller-0 sm: debug time[673.458] log<2095> INFO: sm[18147]: sm_service_disable.c(380): Started disable action (373984) for service (dc-iso-fs), flag (0).
2020-03-21T09:21:16.000 controller-0 sm: debug time[673.458] log<2096> INFO: sm[18147]: sm_service_fsm.c(1032): Service (dc-iso-fs) received event (enable-failed) was in the enabling state and is now in the disabling state.
2020-03-21T09:22:16.000 controller-0 sm: debug time[733.463] log<2181> INFO: sm[18147]: sm_service_disable.c(163): Action (disable) timeout with result (success), state (disabled), status (unknown), and condition (unknown) for service (dc-iso-fs), reason_text=, exit_code=-65534.
2020-03-21T09:22:16.000 controller-0 sm: debug time[733.463] log<2182> INFO: sm[18147]: sm_service_action.c(345): Aborting service (dc-iso-fs) with kill signal, pid=373984.
2020-03-21T09:22:16.000 controller-0 sm: debug time[733.463] log<2183> INFO: sm[18147]: sm_service_fsm.c(1032): Service (dc-iso-fs...
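
(For readability, the escalation visible in these sm logs, where each enable attempt is bounded by a ~30s timeout, at most two attempts are made, and the service is then disabled, can be summarized with the small sketch below. The timeout and retry values come from the log excerpt above, not from the actual SM source or configuration.)

import subprocess

# Toy model of the behaviour seen in the sm logs above: each enable attempt
# is given ~30 seconds, at most two attempts are made, and on failure the
# service transitions to disabling. Values are illustrative only.
ENABLE_TIMEOUT_S = 30
MAX_ENABLE_ATTEMPTS = 2

def try_enable(cmd):
    for attempt in range(1, MAX_ENABLE_ATTEMPTS + 1):
        try:
            subprocess.run(cmd, timeout=ENABLE_TIMEOUT_S, check=True)
            return "enabled-active"
        except subprocess.TimeoutExpired:
            print("enable attempt %d timed out after %ds" % (attempt, ENABLE_TIMEOUT_S))
        except subprocess.CalledProcessError as e:
            print("enable attempt %d failed, rc=%d" % (attempt, e.returncode))
    return "disabling"  # max retries met -> enable failed -> disable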


Revision history for this message
Bart Wensley (bartwensley) wrote :

Bin - some questions for you...

Why is the following log not generated:
INFO: Running start for /opt/platform/iso on /www/pages/iso

...for the dc-iso-fs enable done here?
2020-03-21T09:20:16.000 controller-0 sm: debug time[613.439] log<1945> INFO: sm[18147]: sm_service_fsm.c(1032): Service (dc-iso-fs) received event (enable-throttle) was in the enabling-throttle state and is now in the enabling state.

How is the fix for bug 715018 going to make any difference? The dc-iso-fs does not have any dependencies that would be impacted by that fix.

The dc-iso-fs timeouts seem to be shorter than the ones for other filesystems (e.g. helmrepository-fs) - is that a mistake?

Revision history for this message
Bin Qian (bqian20) wrote :

It was generated; the logs below can be found in daemon-ocf.log.6.gz:

2020-03-21T09:15:10.774 controller-0 OCF_Filesystem(dc-iso-fs)[23571]: info INFO: Running stop for /opt/platform/iso on /www/pages/iso
2020-03-21T10:00:55.577 controller-0 OCF_Filesystem(dc-iso-fs)[123119]: info INFO: Running start for /opt/platform/iso on /www/pages/iso

Revision history for this message
Bin Qian (bqian20) wrote :

dc-iso-fs has dependency chain below:
dc-iso-fs->platform-fs->drbd-platform->management-ip.
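
(For illustration only, the enable ordering implied by that chain; the parent map below simply restates the chain above and is not how SM actually stores its dependencies.)

# Illustrative only: the dependency chain above expressed as a parent map,
# to show the order in which the services must come up.
DEPENDS_ON = {
    "dc-iso-fs": "platform-fs",
    "platform-fs": "drbd-platform",
    "drbd-platform": "management-ip",
    "management-ip": None,
}

def enable_order(service):
    chain = []
    while service is not None:
        chain.append(service)
        service = DEPENDS_ON[service]
    return list(reversed(chain))  # deepest dependency first

print(enable_order("dc-iso-fs"))
# -> ['management-ip', 'drbd-platform', 'platform-fs', 'dc-iso-fs']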

Revision history for this message
Bin Qian (bqian20) wrote :

The expected timeout is calculated/estimated for each individual service. The timeout can differ from service to service, but if we do experience longer wait times, then we certainly need to adjust the timeout value.

Revision history for this message
Bart Wensley (bartwensley) wrote :

Bin:
> expected timeout is calculated/estimated for each individual service. Each service can be different in the timeout, but if we do experience longer wait time, then we sure need to adjust the timeout value.

The filesystem services are essentially all doing the same thing. I can't think of any reason the timeouts would need to be different.

Revision history for this message
Bart Wensley (bartwensley) wrote :

Bin:
> dc-iso-fs has dependency chain below:
> dc-iso-fs->platform-fs->drbd-platform->management-ip.

My point is that the fix for bug 715018 does not impact those dependencies, so why would you think it would impact this issue?

Revision history for this message
Bart Wensley (bartwensley) wrote :

Bin:
> it was generated, logs below can be found in daemon-ocf.log.6.gz
> 2020-03-21T09:15:10.774 controller-0 OCF_Filesystem(dc-iso-fs)[23571]: info INFO: Running stop for /opt/platform/iso on /www/pages/iso
> 2020-03-21T10:00:55.577 controller-0 OCF_Filesystem(dc-iso-fs)[123119]: info INFO: Running start for /opt/platform/iso on /www/pages/iso

Neither of those logs is associated with the dc-iso-fs service start issued at 09:20:16.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Lowering the priority and removing the stx.4.0 tag based on information provided by Bart Wensley who reported this issue. From Bart:
• The issue occurred as part of a long series of failures and was not the main trigger.
• The issue has not occurred since.
• Bin took a look and didn’t find any obvious root cause.

tags: removed: stx.4.0
Changed in starlingx:
importance: Medium → Low