AIO-DX: mtcAgent did not recover after power cycling both controllers

Bug #1869192 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald
Milestone: (none)

Bug Description

Brief Description
-----------------
On an AIO-DX Distributed Cloud system controller, after powering both system controller nodes off and on, the ssh connection was lost for 50 minutes.

Investigation of the above issue (bug 1868604) revealed that one of the failures was that controller-1 failed to go active because the mtcAgent could not get the cluster IP.

Severity
--------
Major

Steps to Reproduce
------------------
In Distributed Cloud, power both (AIO-DX) system controller nodes off and on, then check the ssh connection.

Expected Behavior
------------------
The ssh connection should resume within 5 minutes of the nodes booting up.

Actual Behavior
----------------
The ssh connection was only re-established after 50 minutes.

Reproducibility
---------------
Unknown - this is the first time it has been seen in sanity; will monitor.

System Configuration
--------------------
DC system (AIO-DX system controller)

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-03-20_00-10-00

Last Pass
---------
Last passed on the same system with the following load:
Load: 2020-03-14_04-10-00

Timestamp/Logs
--------------
See bug 1868604

Test Activity
-------------
Sanity

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
tags: added: stx.metal
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue w/ dead office recovery

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

SM selected controller-1 to be active following the DOR (Dead Office Recovery) event.
The mtcAgent (maintenance) process experienced two back-to-back socket initialization failures because the cluster host network interface (vlan173) was not yet configured.

Interestingly, the network config manifest was being applied while the mtcAgent was starting.
The manifest reported several instances of 'Facter: value for network_vlan173 is still nil' and then proceeded to configure the interface.

The daemon log also reported 'LDAP server unavailable' errors:

2020-03-21T09:15:28.000 controller-1 nslcd[19006]: err [7ed7ab] <group/member="snmpd"> no available LDAP server found: Server is unavailable: Resource temporarily unavailable
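
For context, this failure mode amounts to a socket bind against a cluster-host address that the network manifest has not yet assigned to the interface, which fails with EADDRNOTAVAIL. Below is a minimal C++ sketch of that behavior; the address and port are hypothetical and are not taken from this bug's logs:

    // Demonstrates the class of startup failure described above:
    // binding to an IP that is not yet configured on any local
    // interface fails with EADDRNOTAVAIL.
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstdio>
    #include <cstring>

    int main ()
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(2118);                          /* hypothetical port */
        inet_pton(AF_INET, "192.168.206.2", &addr.sin_addr);  /* hypothetical cluster IP */

        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        {
            /* expect EADDRNOTAVAIL until the manifest assigns the IP */
            fprintf(stderr, "bind failed: %s\n", strerror(errno));
        }
        close(sock);
        return 0;
    }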

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Code update implemented.

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/716634
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=082688230827e59ef9905aa03b9fea7f034dfb13
Submitter: Zuul
Branch: master

commit 082688230827e59ef9905aa03b9fea7f034dfb13
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 1 11:38:30 2020 -0400

    Add mtcAgent socket initialization failure retry handling.

    The main maintenance process (mtcAgent) exits on a process start-up
    socket initialization failure. SM restarts the failed process within
    seconds and will swact if the second restart also fails. From startup
    to swact can be as quick as 4 seconds. This is too short to handle a
    collision with a manifest.

    This update adds a number of socket initialization retries to extend
    the time the process has to resolve socket initialization failures by
    giving the colliding manifest time to complete between retries.

    The number of retries and the inter-retry wait time are calibrated to
    ensure that a persistently failing mtcAgent process exits in under 40
    seconds.

    This is to ensure that SM is able to detect and swact away from a
    persistently failing maintenance process while also giving the process
    a few tries to resolve on its own.

    Test Plan:

    PASS: Verify socket init failure thresholded retry handling
          with no, persistent and recovered failure conditions.
    PASS: Verify swact if socket init failure is persistent
    PASS: Verify no swact if socket failure recovers after first exit
    PASS: Verify no swact if socket failure recovers over init retry
    PASS: Verify an hour long soak of continuous socket open/close retry

    Change-Id: I3cb085145308f0e920324e22111f40bdeb12b444
    Closes-Bug: 1869192
    Signed-off-by: Eric MacDonald <email address hidden>
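
The retry handling the commit describes can be sketched as a simple thresholded loop. The following C++ outline uses hypothetical names and values (6 attempts with 5 second waits) and is not the actual mtcAgent code:

    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    /* Stand-in for mtcAgent's socket setup; returns 0 on success.
     * In the failure described in this bug, this step can fail while
     * the network manifest is still configuring the cluster interface. */
    static int socket_init (void)
    {
        /* ... open and bind the maintenance sockets here ... */
        return 0;
    }

    int main ()
    {
        const int MAX_ATTEMPTS     = 6;  /* hypothetical calibration */
        const int RETRY_WAIT_SECS  = 5;  /* hypothetical calibration */

        int rc = socket_init();
        for (int attempt = 1; rc != 0 && attempt < MAX_ATTEMPTS; ++attempt)
        {
            /* give a colliding manifest time to complete */
            sleep(RETRY_WAIT_SECS);
            rc = socket_init();
        }
        if (rc != 0)
        {
            /* still failing after all retries: exit so SM can
             * detect the persistent failure and swact */
            fprintf(stderr, "socket init failed %d times; exiting\n", MAX_ATTEMPTS);
            exit(EXIT_FAILURE);
        }
        /* normal mtcAgent startup continues */
        return 0;
    }

With those illustrative values the worst case is 6 init attempts plus 5 waits of 5 seconds, roughly 25 to 30 seconds, which stays under the 40 second bound the commit message calls out while still giving a colliding manifest several chances to finish.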

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729821

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/729821
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=51bd3f945544cb97da2ef2a0b12bdf2c5468514c
Submitter: Zuul
Branch: f/centos8

commit efbaf2cd0db97fc1b43ffbf2a8346eb12638a08f
Author: Sharath Kumar K <email address hidden>
Date: Mon May 4 10:24:30 2020 +0200

    De-branding in starlingx/metal: TIS -> StarlingX

    1. Rename TIS to StarlingX for .spec file

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch11 changes.

    Story: 2006387
    Task: 36207

    Change-Id: I52673924a8186afb7482d7ba7b601f4733268afb
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 24499f8f25a72abbb109a3a6494ad38744a1d147
Author: Ovidiu Poncea <email address hidden>
Date: Fri May 8 19:21:20 2020 +0300

    Fix wipedisk rootfs partition and bootloader removal

    This commit changes:
     o DD wipe is no longer executed on / or we lose access
       to system commands before the wipe is complete. Wiping
       the root beginning and end is not mandatory as on reinstall
       it is reformatted.
     o Partitions on rootfs, other than platform-backup, are
       removed.
     o Bootloader is removed so that boot from secondary devices
       can be done. This is useful at host reinstall. Without this
       change, boot hangs and manual intervention is needed.

    Change-Id: I1ab9f70d00a38568fc00063cdaa54ec3be48dc33
    Closes-Bug: 1877579
    Signed-off-by: Ovidiu Poncea <email address hidden>

commit a56b99c84693149a035ffe6594099f60db71584e
Author: Ovidiu Poncea <email address hidden>
Date: Sat May 2 11:47:44 2020 -0400

    Fix partition removal after wipe

    After the wipe step, partitions are not removed when installing the load.
    This commit fixes this.

    Also, on some systems with NVMe, udev doesn't correctly remove the device
    nodes for the deleted partitions from /dev/nvme*, causing them to be seen
    as non-block devices; this leads to failures when formatting or assigning LVM PVs.

    Change-Id: I3ab9f70d00a38568fc00063cdaa54ec3be48dc58
    Closes-Bug: 1876374
    Signed-off-by: Ovidiu Poncea <email address hidden>

commit ece0dd0ce5e36c461c93a5cc3b803fb3b5c5e59e
Author: Mihnea Saracin <email address hidden>
Date: Wed Apr 15 20:25:22 2020 +0300

    Persistent backup partition

    Add a backup partition that has
    the following characteristics:
    - It will never be deleted
      (not at install, reinstall, upgrade, nor B&R)
    - The partition will be 10G
    - It will be resizable at upgrades

    Story: 2007403
    Task: 39548
    Change-Id: I2ec9f70d00a38568fc00063cdaa54ec3be48dc47
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 84c720f4562bde3d06b245bca0b7ad41655d35f5
Author: Don Penney <email address hidden>
Date: Mon Apr 27 22:50:10 2020 -0400

    Drop copy of .cfg files from controller kickstarts

    In a boot from an ISO modified by update-iso.sh with a ks-addon, the
    ...


tags: added: in-f-centos8