DC: After both system controller nodes power off/on, ssh connection lost for 50 mins

Bug #1868604 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
In a Distributed Cloud system, after powering off/on both system controller nodes, the ssh connection was lost for 50 minutes.

Severity
--------
Major

Steps to Reproduce
------------------
In a Distributed Cloud system, power off/on both system controller nodes, then check the ssh connection to the system controllers.
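
As an illustration only (not the test framework's code), the connectivity check can be scripted as a retry loop that records how long port 22 on the system controller stays unreachable. The address below is the OAM IPv6 address seen in the logs further down, and the timings are placeholders; note this only checks TCP reachability, not a full login like the automation performs.

import socket
import time

# Illustrative sketch only: poll TCP port 22 on the system controller and
# report how long it takes to become reachable again after the power off/on.
TARGET = "2620:10a:a001:a103::1065"   # OAM IPv6 address from the logs below
DEADLINE_MIN = 60                     # give up after an hour
POLL_INTERVAL = 10                    # seconds between attempts

def ssh_port_open(host: str, port: int = 22, timeout: int = 10) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

start = time.time()
while time.time() - start < DEADLINE_MIN * 60:
    if ssh_port_open(TARGET):
        print("ssh reachable again after %.1f minutes" % ((time.time() - start) / 60))
        break
    time.sleep(POLL_INTERVAL)
else:
    print("ssh still unreachable after %d minutes" % DEADLINE_MIN)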

Expected Behavior
------------------
The ssh connection should resume shortly after the nodes boot up, e.g. within 5 minutes.

Actual Behavior
----------------
The ssh connection was only re-established after about 50 minutes.

Reproducibility
---------------
Unknown - first time this was seen in sanity; will monitor.

System Configuration
--------------------
DC system

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-03-20_00-10-00

Last Pass
---------
unknown

Timestamp/Logs
--------------
[2020-03-21 09:06:21,007] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOff -t 16257'
[2020-03-21 09:06:21,078] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2020-03-21 09:06:21,078] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOff -t 95973'
[2020-03-21 09:06:22,837] 436 DEBUG MainThread ssh.expect :: Output:
1

[2020-03-21 09:07:24,367] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 16257'
[2020-03-21 09:07:26,625] 436 DEBUG MainThread ssh.expect :: Output:
1
[2020-03-21 09:07:27,985] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 95973'
[2020-03-21 09:07:30,693] 436 DEBUG MainThread ssh.expect :: Output:
1

[2020-03-21 09:10:04,339] 314 DEBUG MainThread ssh.send :: Send '/usr/bin/ssh -o RSAAuthentication=no -o PubkeyAuthentication=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null sysadmin@2620:10a:a001:a103::1065'
[2020-03-21 09:11:04,464] 407 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ ', '.*assword\\:[ ]?$|assword for .*:[ ]?$', '.*\\(yes/no\\).*', 'svc-cgcsauto@yow-tuxlab2\\:(.*)\\$ '].
expect timeout.
[2020-03-21 09:11:04,464] 1294 INFO MainThread ssh.connect :: Unable to ssh to 2620:10a:a001:a103::1065

[2020-03-21 09:56:18,396] 314 DEBUG MainThread ssh.send :: Send '/usr/bin/ssh -o RSAAuthentication=no -o PubkeyAuthentication=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null sysadmin@2620:10a:a001:a103::1065'
[2020-03-21 09:56:21,568] 436 DEBUG MainThread ssh.expect :: Output:
command-line line 0: Unsupported option "rsaauthentication"
ssh: connect to host 2620:10a:a001:a103::1065 port 22: No route to host
svc-cgcsauto@yow-tuxlab2:~$
[2020-03-21 09:56:21,569] 1294 INFO MainThread ssh.connect :: Unable to ssh to 2620:10a:a001:a103::1065
[2020-03-21 09:56:21,569] 1310 INFO MainThread ssh.connect :: Retry in 10 seconds
[2020-03-21 09:56:31,579] 314 DEBUG MainThread ssh.send :: Send '/usr/bin/ssh -o RSAAuthentication=no -o PubkeyAuthentication=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null sysadmin@2620:10a:a001:a103::1065'
[2020-03-21 09:56:31,761] 436 DEBUG MainThread ssh.expect :: Output:
command-line line 0: Unsupported option "rsaauthentication"
Warning: Permanently added '2620:10a:a001:a103::1065' (ECDSA) to the list of known hosts.
Release 20.01
[... standard login warning banner omitted ...]

sysadmin@2620:10a:a001:a103::1065's password:
[2020-03-21 09:56:31,761] 314 DEBUG MainThread ssh.send :: Send 'Li69nux*'
[2020-03-21 09:56:32,034] 436 DEBUG MainThread ssh.expect :: Output:
Last login: Sat Mar 21 08:33:54 2020 from fd01:11::5
/etc/motd.d/00-header:


WARNING: Unauthorized access to this system is forbidden and will be
prosecuted by law. By accessing this system, you agree that your
actions may be monitored if unauthorized usage is suspected.

[?1034hcontroller-1:~$
[2020-03-21 09:56:32,034] 1288 INFO MainThread ssh.connect :: Successfully connected

Test Activity
-------------
Sanity

Revision history for this message
Yang Liu (yliu12) wrote :

After the DOR (dead office recovery), the following alarms appeared:

| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| f6a850a0-2dfd-411b-a2b3-1d9a68befffa | 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major | 2020-03-21T09:57:46.659447 |
| ee80d1af-b0fc-4bd8-9923-e0b16714a232 | 200.006 | controller-0 critical 'dockerd' process has failed and could not be auto-recovered gracefully. Auto-recovery progression by host reboot is required and in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. | host=controller-0.process=dockerd | critical | 2020-03-21T09:20:19.229356 |

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Bart to triage

tags: added: stx.distcloud
Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Yang Liu (yliu12) wrote :

Last passed on the same system with the following load:
Load: 2020-03-14_04-10-00

Revision history for this message
Bart Wensley (bartwensley) wrote :

I have taken a look at the logs - here is the executive summary:
- Issue 1: Controller-0 fails to go active because the dcorch-patch-api-proxy service attempted to go active before the management-ip service. The dcorch-patch-api-proxy repeatedly fails to bind to the management IP (which doesn't exist yet), which eventually causes SM to shut down and restart all services.
- Issue 2: Controller-1 then fails to go active because the mtcAgent can't get the cluster IP.
- Issue 3: The application of the worker manifest on controller-0 causes containerd to be restarted, which also restarts dockerd and kubelet. Dockerd never comes back. I believe the restart of containerd was introduced with the kata feature (https://review.opendev.org/#/c/703266/).
- Issue 4: Controller-0 then fails to go active because the dc-iso-fs service does not come up. It does not appear that the OCF script was invoked.
- Issue 5: All services on controller-0 are shut down at this point, but it takes over 30 minutes before it is actually rebooted.
- Issue 6: Controller-1 then fails to go active because the management-ip service fails to go active. It cannot assign the floating management IP due to an IPv6 address collision (controller-0 still has it assigned and hasn't rebooted). Once controller-0 reboots, the floating management IP is assigned and we recover.

Recommendations:
- Issue 1: Should be fixed ASAP by adding proper service dependencies. There are probably other missing SM dependencies for the dcorch proxies.
- Issue 2: Requires investigation from maintenance person.
- Issue 3: Multiple problems here - we don't want the worker manifest to be causing containerd/dockerd/kubelet restarts (if possible) and we need to understand why dockerd never recovers.
- Issue 4: Need an SM person to investigate why dc-iso-fs did not come up.
- Issue 5: Need a maintenance person to investigate why it took so long for controller-0 to actually reboot after all the services were shut down.
- Issue 6: Will be resolved by a fix for issue 5.
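
For Issue 6, the collision described above is ordinary IPv6 Duplicate Address Detection: when controller-1 adds the floating management IP while the not-yet-rebooted controller-0 still holds it, the kernel flags the new address as dadfailed. A rough sketch of how that condition can be observed is below; the interface name is a placeholder, it requires root, and it is not the management-ip OCF script.

import subprocess
import time

# Rough sketch only (not the SM management-ip OCF script). After the address
# is added, the kernel runs Duplicate Address Detection; if another node still
# answers for it, the address is flagged "dadfailed".
IFACE = "vlan166"            # hypothetical management interface name
FLOATING_IP = "fd01:11::2"   # floating management IP from the logs

subprocess.run(["ip", "-6", "addr", "add", FLOATING_IP + "/64", "dev", IFACE],
               check=False)
time.sleep(2)  # give DAD a moment to complete
out = subprocess.run(["ip", "-6", "addr", "show", "dev", IFACE],
                     capture_output=True, text=True).stdout
if "dadfailed" in out:
    print("duplicate address: %s is still held by another node" % FLOATING_IP)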

Detailed investigation notes...

# Controllers powered on here
[2020-03-21 09:07:24,367] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 16257'
[2020-03-21 09:07:27,985] 314 DEBUG MainThread ssh.send :: Send '/folk/vlm/commandline/vlmTool turnOn -t 95973'

# SM starts here
| 2020-03-21T09:13:56.110 | 1 | node-scn | controller-0 | unknown-unknown | unlocked-unknown | customer action

# Docker is started by systemd
2020-03-21T09:14:26.974 [18225.00077] controller-0 pmond mon pmonHdlr.cpp (1142) register_process : Info : dockerd Registered (3406)

# dcorch-patch-api-proxy does not go enabled
| 2020-03-21T09:15:07.103 | 261 | service-scn | dcorch-patch-api-proxy | enabled-active | disabled | process (pid=22794) failed

# because it can't bind the address
2020-03-21 09:15:07.043 22794 INFO dcorch.api.proxy [req-31e362cf-26ba-4ecd-88e7-a19c91a81c27 - - - - -] Server on http://fd01:11::2:25491 with 2
2020-03-21 09:15:07.044 22794 ERROR oslo.service.wsgi [req-31e362cf-26ba-4ecd-88e7-a19c...
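
The bind failure above is the expected kernel behaviour when a process tries to bind a socket to an address that is not yet configured on any local interface. A minimal sketch follows; the address and port are taken from the log line above, the rest is illustrative and is not dcorch code.

import errno
import socket

# Minimal illustration only: binding a socket to an IPv6 address that is not
# yet configured on any local interface fails with EADDRNOTAVAIL. This is what
# dcorch-patch-api-proxy hits when it starts before the management-ip service
# has brought up the floating management IP.
MGMT_IP = "fd01:11::2"   # address from the log line above
PORT = 25491             # port from the log line above

sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
try:
    sock.bind((MGMT_IP, PORT))
except OSError as exc:
    if exc.errno == errno.EADDRNOTAVAIL:
        print("cannot bind: %s is not assigned on this host yet" % MGMT_IP)
    else:
        raise
finally:
    sock.close()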

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/715018
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=92fc4bcc6063064a3386a2611c3bb5c2a111b70a
Submitter: Zuul
Branch: master

commit 92fc4bcc6063064a3386a2611c3bb5c2a111b70a
Author: Tao Liu <email address hidden>
Date: Wed Mar 25 14:00:09 2020 -0400

    Add SM enable dependency for dcorch-patch-api-proxy

    This update adds dcmanager-manager as dcorch-patch-api-proxy enable
    dependency.

    Test cases
    1. Swact few times and ensure the dcorch-patch-api-proxy is enabled
       after the dcmanager-manager is enabled.
    2. Power off/on both AIO controllers and ensure that
       dcorch-patch-api-proxy is enabled after the
       dcmanager-manager is enabled.

    Change-Id: Ief61bcbd973398acce4473c7cd429f03d34b5a98
    Partial-Bug: 1868604
    Signed-off-by: Tao Liu <email address hidden>
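
For context, the effect of an enable dependency is simply that SM will not try to enable a service until the service it depends on has gone enabled. A conceptual sketch of that ordering is below; this is not SM's implementation, and the dcmanager-manager -> management-ip edge is an assumption added for illustration only.

# Conceptual sketch only, not SM's implementation. An enable dependency means
# a service is only started once the service it depends on is enabled, so the
# proxy no longer tries to bind the management IP before it exists. The commit
# itself only adds the first edge below.
DEPENDS_ON = {
    "dcorch-patch-api-proxy": ["dcmanager-manager"],
    "dcmanager-manager": ["management-ip"],   # assumed edge, for illustration
    "management-ip": [],
}

def enable_order(services):
    """Return a start order in which every dependency is enabled first."""
    order, seen = [], set()

    def visit(svc):
        if svc in seen:
            return
        seen.add(svc)
        for dep in DEPENDS_ON.get(svc, []):
            visit(dep)
        order.append(svc)

    for svc in services:
        visit(svc)
    return order

print(enable_order(["dcorch-patch-api-proxy"]))
# -> ['management-ip', 'dcmanager-manager', 'dcorch-patch-api-proxy']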

Revision history for this message
Bart Wensley (bartwensley) wrote :

To make tracking the issues easier, I have created additional bugs:
- Issue 1: will be tracked under this bug
- Issue 2: created bug 1869192
- Issue 3: created bug 1869193
- Issue 4: created bug 1869194
- Issue 5: created bug 1869195

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/716146

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (f/centos8)

Reviewed: https://review.opendev.org/716146
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=76e00d4e3cd054614d69fdd17e2c877001a7c755
Submitter: Zuul
Branch: f/centos8

commit 92fc4bcc6063064a3386a2611c3bb5c2a111b70a
Author: Tao Liu <email address hidden>
Date: Wed Mar 25 14:00:09 2020 -0400

    Add SM enable dependency for dcorch-patch-api-proxy

    This update adds dcmanager-manager as dcorch-patch-api-proxy enable
    dependency.

    Test cases
    1. Swact few times and ensure the dcorch-patch-api-proxy is enabled
       after the dcmanager-manager is enabled.
    2. Power off/on both AIO controllers and ensure that
       dcorch-patch-api-proxy is enabled after the
       dcmanager-manager is enabled.

    Change-Id: Ief61bcbd973398acce4473c7cd429f03d34b5a98
    Partial-Bug: 1868604
    Signed-off-by: Tao Liu <email address hidden>

commit 26cc9d6cc0e80ecaa5de3318a50de2cdb150a4a8
Author: Angie Wang <email address hidden>
Date: Wed Feb 5 15:36:18 2020 -0500

    Reduce SM timeouts for sysinv-conductor

    SM runs ocf script to disable sysinv-conductor. Currently,
    the timeout for sysinv-conductor ocf script to forcibly
    terminate sysinv-conductor is 55s and the timeout for SM
    to kill sysinv-conductor ocf script is 60s. In the case
    that host-swact happens during application and k8s upgrade
    operations, it will take a long time to complete swact
    as these operations may spawn long-running processes or
    greenthreads which prevent the sysinv-conductor from
    shutting down before reaching the timeout to forcibly shutdown.
    In a controller swact scenario, this results in the system
    having almost no services running for almost a minute,
    while SM is waiting for sysinv-conductor to shut down.

    This commit updates the timeout for the sysinv-conductor ocf
    script to the default 15s and reduces the timeout for SM
    to kill the script to 20s.

    Tests conducted:
      - platform sanity
      - perform host-swact during stx-openstack upload/apply/remove
      - perform host-swact during kube upgrade operations
      - perform host-swact during platform-integ-apps apply
      - verified sysinv-conductor can be forcibly shut down after 15s
        when long-running processes are running

    Change-Id: I337154a140f6cec3d6ab953003bf355b4396249e
    Story: 2006781
    Task: 38480
    Signed-off-by: Angie Wang <email address hidden>
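
The timeout handling this commit tunes follows the usual "terminate, wait, then kill" pattern. Below is a generic sketch of that pattern, not the sysinv-conductor OCF script, using the 15s and 20s values from the commit message.

import signal
import subprocess

# Generic sketch only: ask the process to stop, then forcibly kill it if it
# has not exited within the timeout. The commit sets the OCF-level timeout to
# 15s; SM applies its own, slightly larger 20s timeout around the OCF script.

def stop_with_timeout(proc: subprocess.Popen, timeout: float = 15.0) -> None:
    proc.send_signal(signal.SIGTERM)   # polite shutdown request
    try:
        proc.wait(timeout=timeout)     # give the process `timeout` seconds
    except subprocess.TimeoutExpired:
        proc.kill()                    # SIGKILL if it is still running
        proc.wait()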

commit 220a4cf5e18ef64bfa71a78d45ff7f317385964a
Author: Bin Qian <email address hidden>
Date: Tue Feb 4 11:54:18 2020 -0500

    Verify upload to GitHub mirror with a new commit

    Change-Id: I74d53c136d14ba342a226c468eec97d44322cabd
    Signed-off-by: Bin Qian <email address hidden>

commit d4e5b00ddaa24517f27df0ae2fc2b5f229baa2b6
Author: Bin Qian <email address hidden>
Date: Tue Feb 4 10:35:22 2020 -0500

    Trigger upload job to sync GitHub

    There is no material change in this commit. Only to trigger
    the upload job to sync to GitHub
    Change-Id: I61d35d4ab7319de88034f86e46a6ffd62f0fd53b
    Signed-off-by: Bin Qian <email address hidden>

commit a76ed3bebdb7aaf56961fadf617e7b7c25da6d76
A...


tags: added: in-f-centos8
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Bart Wensley (bartwensley) wrote :

The fixes are released for bug 1869192 and bug 1869193 - this is now ready for re-test.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Verified on
Lab: WP_18_21
Load: 2020-04-18_17-54-33

tags: removed: stx.retestneeded