DC: After both system controller nodes power off/on, ssh connection lost for 50 mins

Bug #1868604 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
In a Distributed Cloud system, after powering off/on both system controller nodes, the ssh connection was lost for 50 minutes.

Severity
--------
Major

Steps to Reproduce
------------------
In a Distributed Cloud system, power off/on both system controller nodes, then check the ssh connection to the system controllers.
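
As an illustration only (not the test framework's code), the connectivity check can be scripted as a retry loop that records how long port 22 on the system controller stays unreachable. The address below is the OAM IPv6 address seen in the logs further down, and the timings are placeholders; note this only checks TCP reachability, not a full login like the automation performs.

import socket
import time

# Illustrative sketch only: poll TCP port 22 on the system controller and
# report how long it takes to become reachable again after the power off/on.
TARGET = "2620:10a:a001:a103::1065"   # OAM IPv6 address from the logs below
DEADLINE_MIN = 60                     # give up after an hour
POLL_INTERVAL = 10                    # seconds between attempts

def ssh_port_open(host: str, port: int = 22, timeout: int = 10) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

start = time.time()
while time.time() - start < DEADLINE_MIN * 60:
    if ssh_port_open(TARGET):
        print("ssh reachable again after %.1f minutes" % ((time.time() - start) / 60))
        break
    time.sleep(POLL_INTERVAL)
else:
    print("ssh still unreachable after %d minutes" % DEADLINE_MIN)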

Expected Behavior
------------------
The ssh connection should resume shortly after the nodes boot up, e.g. within 5 minutes.

Actual Behavior
----------------
The ssh connection was only re-established after about 50 minutes.

Reproducibility
---------------
Unknown - first time this was seen in sanity; will monitor.

System Configuration
--------------------
DC system

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-03-20_00-10-00

Last Pass
---------
unknown

Timestamp/Logs
--------------
[2020-03-21 09:06:21,007] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOff -t 16257'
[2020-03-21 09:06:21,078] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2020-03-21 09:06:21,078] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOff -t 95973'
[2020-03-21 09:06:22,837] 436 DEBUG MainThread ssh.expect :: Output:
1

[2020-03-21 09:07:24,367] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 16257'
[2020-03-21 09:07:26,625] 436 DEBUG MainThread ssh.expect :: Output:
1
[2020-03-21 09:07:27,985] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 95973'
[2020-03-21 09:07:30,693] 436 DEBUG MainThread ssh.expect :: Output:
1

[2020-03-21 09:10:04,339] 314 DEBUG MainThread ssh.send :: Send '/usr/bin/ssh -o RSAAuthentication=no -o PubkeyAuthentication=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null sysadmin@2620:10a:a001:a103::1065'
[2020-03-21 09:11:04,464] 407 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ ', '.*assword\\:[ ]?$|assword for .*:[ ]?$', '.*\\(yes/no\\).*', 'svc-cgcsauto@yow-tuxlab2\\:(.*)\\$ '].
expect timeout.
[2020-03-21 09:11:04,464] 1294 INFO MainThread ssh.connect :: Unable to ssh to 2620:10a:a001:a103::1065

[2020-03-21 09:56:18,396] 314 DEBUG MainThread ssh.send :: Send '/usr/bin/ssh -o RSAAuthentication=no -o PubkeyAuthentication=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null sysadmin@2620:10a:a001:a103::1065'
[2020-03-21 09:56:21,568] 436 DEBUG MainThread ssh.expect :: Output:
command-line line 0: Unsupported option "rsaauthentication"
ssh: connect to host 2620:10a:a001:a103::1065 port 22: No route to host
svc-cgcsauto@yow-tuxlab2:~$
[2020-03-21 09:56:21,569] 1294 INFO MainThread ssh.connect :: Unable to ssh to 2620:10a:a001:a103::1065
[2020-03-21 09:56:21,569] 1310 INFO MainThread ssh.connect :: Retry in 10 seconds
[2020-03-21 09:56:31,579] 314 DEBUG MainThread ssh.send :: Send '/usr/bin/ssh -o RSAAuthentication=no -o PubkeyAuthentication=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null sysadmin@2620:10a:a001:a103::1065'
[2020-03-21 09:56:31,761] 436 DEBUG MainThread ssh.expect :: Output:
command-line line 0: Unsupported option "rsaauthentication"
Warning: Permanently added '2620:10a:a001:a103::1065' (ECDSA) to the list of known hosts.
Release 20.01
[... standard login warning banner omitted ...]

sysadmin@2620:10a:a001:a103::1065's password:
[2020-03-21 09:56:31,761] 314 DEBUG MainThread ssh.send :: Send 'Li69nux*'
[2020-03-21 09:56:32,034] 436 DEBUG MainThread ssh.expect :: Output:
Last login: Sat Mar 21 08:33:54 2020 from fd01:11::5
/etc/motd.d/00-header:


WARNING: Unauthorized access to this system is forbidden and will be
prosecuted by law. By accessing this system, you agree that your
actions may be monitored if unauthorized usage is suspected.

[?1034hcontroller-1:~$
[2020-03-21 09:56:32,034] 1288 INFO MainThread ssh.connect :: Successfully connected

Test Activity
-------------
Sanity

Revision history for this message
Yang Liu (yliu12) wrote :

After the DOR (dead office recovery), the following alarms appeared:

| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| f6a850a0-2dfd-411b-a2b3-1d9a68befffa | 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major | 2020-03-21T09:57:46.659447 |
| ee80d1af-b0fc-4bd8-9923-e0b16714a232 | 200.006 | controller-0 critical 'dockerd' process has failed and could not be auto-recovered gracefully. Auto-recovery progression by host reboot is required and in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. | host=controller-0.process=dockerd | critical | 2020-03-21T09:20:19.229356 |

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Bart to triage

tags: added: stx.distcloud
Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Yang Liu (yliu12) wrote :

Last passed on the same system with the following load:
Load: 2020-03-14_04-10-00

Revision history for this message
Bart Wensley (bartwensley) wrote :

I have taken a look at the logs - here is the executive summary:
- Issue 1: Controller-0 fails to go active because the dcorch-patch-api-proxy service attempted to go active before the management-ip service. The dcorch-patch-api-proxy repeatedly fails to bind to the management IP (which doesn't exist yet), which eventually causes SM to shut down and restart all services.
- Issue 2: Controller-1 then fails to go active because the mtcAgent can't get the cluster IP.
- Issue 3: The application of the worker manifest on controller-0 causes containerd to be restarted, which also restarts dockerd and kubelet. Dockerd never comes back. I believe the restart of containerd was introduced with the kata feature (https://review.opendev.org/#/c/703266/).
- Issue 4: Controller-0 then fails to go active because the dc-iso-fs service does not come up. It does not appear that the OCF script was invoked.
- Issue 5: All services on controller-0 are shut down at this point, but it takes over 30 minutes before it is actually rebooted.
- Issue 6: Controller-1 then fails to go active because the management-ip service fails to go active. It cannot assign the floating management IP due to an IPv6 address collision (controller-0 still has it assigned and hasn't rebooted). Once controller-0 reboots, the floating management IP is assigned and we recover.

Recommendations:
- Issue 1: Should be fixed ASAP by adding proper service dependencies. There are probably other missing SM dependencies for the dcorch proxies.
- Issue 2: Requires investigation from maintenance person.
- Issue 3: Multiple problems here - we don't want the worker manifest to be causing containerd/dockerd/kubelet restarts (if possible) and we need to understand why dockerd never recovers.
- Issue 4: Need an SM person to investigate why dc-iso-fs did not come up.
- Issue 5: Need a maintenance person to investigate why it took so long for controller-0 to actually reboot after all the services were shut down.
- Issue 6: Will be resolved by a fix for issue 5.
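
For Issue 6, the collision described above is ordinary IPv6 Duplicate Address Detection: when controller-1 adds the floating management IP while the not-yet-rebooted controller-0 still holds it, the kernel flags the new address as dadfailed. A rough sketch of how that condition can be observed is below; the interface name is a placeholder, it requires root, and it is not the management-ip OCF script.

import subprocess
import time

# Rough sketch only (not the SM management-ip OCF script). After the address
# is added, the kernel runs Duplicate Address Detection; if another node still
# answers for it, the address is flagged "dadfailed".
IFACE = "vlan166"            # hypothetical management interface name
FLOATING_IP = "fd01:11::2"   # floating management IP from the logs

subprocess.run(["ip", "-6", "addr", "add", FLOATING_IP + "/64", "dev", IFACE],
               check=False)
time.sleep(2)  # give DAD a moment to complete
out = subprocess.run(["ip", "-6", "addr", "show", "dev", IFACE],
                     capture_output=True, text=True).stdout
if "dadfailed" in out:
    print("duplicate address: %s is still held by another node" % FLOATING_IP)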

Detailed investigation notes...

# Controllers powered on here
[2020-03-21 09:07:24,367] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 16257'
[2020-03-21 09:07:27,985] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 95973'

# SM starts here
| 2020-03-21T09:13:56.110 | 1 | node-scn | controller-0 | unknown-unknown | unlocked-unknown | customer action

# Docker is started by systemd
2020-03-21T09:14:26.974 [18225.00077] controller-0 pmond mon pmonHdlr.cpp (1142) register_process : Info : dockerd Registered (3406)

# dcorch-patch-api-proxy does not go enabled
| 2020-03-21T09:15:07.103 | 261 | service-scn | dcorch-patch-api-proxy | enabled-active | disabled | process (pid=22794) failed

# because it can't bind the address
2020-03-21 09:15:07.043 22794 INFO dcorch.api.proxy [req-31e362cf-26ba-4ecd-88e7-a19c91a81c27 - - - - -] Server on http://fd01:11::2:25491 with 2
2020-03-21 09:15:07.044 22794 ERROR oslo.service.wsgi [req-31e362cf-26ba-4ecd-88e7-a19c...
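
The bind failure above is the expected kernel behaviour when a process tries to bind a socket to an address that is not yet configured on any local interface. A minimal sketch follows; the address and port are taken from the log line above, the rest is illustrative and is not dcorch code.

import errno
import socket

# Minimal illustration only: binding a socket to an IPv6 address that is not
# yet configured on any local interface fails with EADDRNOTAVAIL. This is what
# dcorch-patch-api-proxy hits when it starts before the management-ip service
# has brought up the floating management IP.
MGMT_IP = "fd01:11::2"   # address from the log line above
PORT = 25491             # port from the log line above

sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
try:
    sock.bind((MGMT_IP, PORT))
except OSError as exc:
    if exc.errno == errno.EADDRNOTAVAIL:
        print("cannot bind: %s is not assigned on this host yet" % MGMT_IP)
    else:
        raise
finally:
    sock.close()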

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/715018
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=92fc4bcc6063064a3386a2611c3bb5c2a111b70a
Submitter: Zuul
Branch: master

commit 92fc4bcc6063064a3386a2611c3bb5c2a111b70a
Author: Tao Liu <email address hidden>
Date: Wed Mar 25 14:00:09 2020 -0400

    Add SM enable dependency for dcorch-patch-api-proxy

    This update adds dcmanager-manager as dcorch-patch-api-proxy enable
    dependency.

    Test cases
    1. Swact few times and ensure the dcorch-patch-api-proxy is enabled
       after the dcmanager-manager is enabled.
    2. Power off/on both AIO controllers and ensure that
       dcorch-patch-api-proxy is enabled after the
       dcmanager-manager is enabled.

    Change-Id: Ief61bcbd973398acce4473c7cd429f03d34b5a98
    Partial-Bug: 1868604
    Signed-off-by: Tao Liu <email address hidden>
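
For context, the effect of an enable dependency is simply that SM will not try to enable a service until the service it depends on has gone enabled. A conceptual sketch of that ordering is below; this is not SM's implementation, and the dcmanager-manager -> management-ip edge is an assumption added for illustration only.

# Conceptual sketch only, not SM's implementation. An enable dependency means
# a service is only started once the service it depends on is enabled, so the
# proxy no longer tries to bind the management IP before it exists. The commit
# itself only adds the first edge below.
DEPENDS_ON = {
    "dcorch-patch-api-proxy": ["dcmanager-manager"],
    "dcmanager-manager": ["management-ip"],   # assumed edge, for illustration
    "management-ip": [],
}

def enable_order(services):
    """Return a start order in which every dependency is enabled first."""
    order, seen = [], set()

    def visit(svc):
        if svc in seen:
            return
        seen.add(svc)
        for dep in DEPENDS_ON.get(svc, []):
            visit(dep)
        order.append(svc)

    for svc in services:
        visit(svc)
    return order

print(enable_order(["dcorch-patch-api-proxy"]))
# -> ['management-ip', 'dcmanager-manager', 'dcorch-patch-api-proxy']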

Revision history for this message
Bart Wensley (bartwensley) wrote :

To make tracking the issues easier, I have created additional bugs:
- Issue 1: will be tracked under this bug
- Issue 2: created bug 1869192
- Issue 3: created bug 1869193
- Issue 4: created bug 1869194
- Issue 5: created bug 1869195

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/716146

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (f/centos8)

Reviewed: https://review.opendev.org/716146
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=76e00d4e3cd054614d69fdd17e2c877001a7c755
Submitter: Zuul
Branch: f/centos8

commit 92fc4bcc6063064a3386a2611c3bb5c2a111b70a
Author: Tao Liu <email address hidden>
Date: Wed Mar 25 14:00:09 2020 -0400

    Add SM enable dependency for dcorch-patch-api-proxy

    This update adds dcmanager-manager as dcorch-patch-api-proxy enable
    dependency.

    Test cases
    1. Swact few times and ensure the dcorch-patch-api-proxy is enabled
       after the dcmanager-manager is enabled.
    2. Power off/on both AIO controllers and ensure that
       dcorch-patch-api-proxy is enabled after the
       dcmanager-manager is enabled.

    Change-Id: Ief61bcbd973398acce4473c7cd429f03d34b5a98
    Partial-Bug: 1868604
    Signed-off-by: Tao Liu <email address hidden>

commit 26cc9d6cc0e80ecaa5de3318a50de2cdb150a4a8
Author: Angie Wang <email address hidden>
Date: Wed Feb 5 15:36:18 2020 -0500

    Reduce SM timeouts for sysinv-conductor

    SM runs ocf script to disable sysinv-conductor. Currently,
    the timeout for sysinv-conductor ocf script to forcibly
    terminate sysinv-conductor is 55s and the timeout for SM
    to kill sysinv-conductor ocf script is 60s. In the case
    that host-swact happens during application and k8s upgrade
    operations, it will take a long time to complete swact
    as these operations may spawn long-running processes or
    greenthreads which prevent the sysinv-conductor from
    shutting down before reaching the timeout to forcibly shutdown.
    In a controller swact scenario, this results in the system
    having almost no services running for almost a minute,
    while SM is waiting for sysinv-conductor to shut down.

    This commit updates the timeout for the sysinv-conductor ocf
    script to the default 15s and reduces the timeout for SM
    to kill the script to 20s.

    Tests conducted:
      - platform sanity
      - perform host-swact during stx-openstack upload/apply/remove
      - perform host-swact during kube upgrade operations
      - perform host-swact during platform-integ-apps apply
      - verified sysinv-conductor can be forcibly shut down after 15s
        when long-running processes are running

    Change-Id: I337154a140f6cec3d6ab953003bf355b4396249e
    Story: 2006781
    Task: 38480
    Signed-off-by: Angie Wang <email address hidden>
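
The timeout handling this commit tunes follows the usual "terminate, wait, then kill" pattern. Below is a generic sketch of that pattern, not the sysinv-conductor OCF script, using the 15s and 20s values from the commit message.

import signal
import subprocess

# Generic sketch only: ask the process to stop, then forcibly kill it if it
# has not exited within the timeout. The commit sets the OCF-level timeout to
# 15s; SM applies its own, slightly larger 20s timeout around the OCF script.

def stop_with_timeout(proc: subprocess.Popen, timeout: float = 15.0) -> None:
    proc.send_signal(signal.SIGTERM)   # polite shutdown request
    try:
        proc.wait(timeout=timeout)     # give the process `timeout` seconds
    except subprocess.TimeoutExpired:
        proc.kill()                    # SIGKILL if it is still running
        proc.wait()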

commit 220a4cf5e18ef64bfa71a78d45ff7f317385964a
Author: Bin Qian <email address hidden>
Date: Tue Feb 4 11:54:18 2020 -0500

    Verify upload to GitHub mirror with a new commit

    Change-Id: I74d53c136d14ba342a226c468eec97d44322cabd
    Signed-off-by: Bin Qian <email address hidden>

commit d4e5b00ddaa24517f27df0ae2fc2b5f229baa2b6
Author: Bin Qian <email address hidden>
Date: Tue Feb 4 10:35:22 2020 -0500

    Trigger upload job to sync GitHub

    There is no material change in this commit. Only to trigger
    the upload job to sync to GitHub
    Change-Id: I61d35d4ab7319de88034f86e46a6ffd62f0fd53b
    Signed-off-by: Bin Qian <email address hidden>

commit a76ed3bebdb7aaf56961fadf617e7b7c25da6d76
A...


tags: added: in-f-centos8
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Bart Wensley (bartwensley) wrote :

The fixes are released for bug 1869192 and bug 1869193 - this is now ready for re-test.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Verified on
Lab: WP_18_21
Load: 2020-04-18_17-54-33

tags: removed: stx.retestneeded