AIO-DX: mtcAgent did not recover after power cycling both controllers

Bug #1869192 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald
Milestone: (none)

Bug Description

Brief Description
-----------------
On an AIO-DX Distributed Cloud system controller, after powering both system controller nodes off and on, the ssh connection was lost for 50 minutes.

Investigation of the above issue (bug 1868604) revealed that one of the failures was that controller-1 failed to go active because the mtcAgent could not get the cluster IP.

Severity
--------
Major

Steps to Reproduce
------------------
In Distributed Cloud, power both (AIO-DX) system controller nodes off and on, then check the ssh connection.

Expected Behavior
------------------
The ssh connection should resume within 5 minutes of the nodes booting up.

Actual Behavior
----------------
The ssh connection was only re-established after 50 minutes.

Reproducibility
---------------
Unknown - this is the first time it has been seen in sanity; will monitor.

System Configuration
--------------------
DC system (AIO-DX system controller)

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-03-20_00-10-00

Last Pass
---------
Last passed on the same system with the following load:
Load: 2020-03-14_04-10-00

Timestamp/Logs
--------------
See bug 1868604

Test Activity
-------------
Sanity

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
tags: added: stx.metal
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue w/ dead office recovery

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

SM selected controller-1 to be active following the DOR (Dead Office Recovery) event.
The mtcAgent (maintenance) process experienced two back-to-back socket initialization failures because the cluster host network interface (vlan173) was not yet configured.

Interestingly, the network config manifest was being applied while the mtcAgent was starting.
The manifest reported several instances of 'Facter: value for network_vlan173 is still nil' and then proceeded to configure the interface.

The daemon log also reported 'LDAP server unavailable' errors:

2020-03-21T09:15:28.000 controller-1 nslcd[19006]: err [7ed7ab] <group/member="snmpd"> no available LDAP server found: Server is unavailable: Resource temporarily unavailable
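
For context, this failure mode amounts to a socket bind against a cluster-host address that the network manifest has not yet assigned to the interface, which fails with EADDRNOTAVAIL. Below is a minimal C++ sketch of that behavior; the address and port are hypothetical and are not taken from this bug's logs:

    // Demonstrates the class of startup failure described above:
    // binding to an IP that is not yet configured on any local
    // interface fails with EADDRNOTAVAIL.
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstdio>
    #include <cstring>

    int main ()
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(2118);                          /* hypothetical port */
        inet_pton(AF_INET, "192.168.206.2", &addr.sin_addr);  /* hypothetical cluster IP */

        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        {
            /* expect EADDRNOTAVAIL until the manifest assigns the IP */
            fprintf(stderr, "bind failed: %s\n", strerror(errno));
        }
        close(sock);
        return 0;
    }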

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Code update implemented.

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/716634
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=082688230827e59ef9905aa03b9fea7f034dfb13
Submitter: Zuul
Branch: master

commit 082688230827e59ef9905aa03b9fea7f034dfb13
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 1 11:38:30 2020 -0400

    Add mtcAgent socket initialization failure retry handling.

    The main maintenance process (mtcAgent) exits on a process start-up
    socket initialization failure. SM restarts the failed process within
    seconds and will swact if the second restart also fails. From startup
    to swact can be as quick as 4 seconds. This is too short to handle a
    collision with a manifest.

    This update adds a number of socket initialization retries to extend
    the time the process has to resolve socket initialization failures by
    giving the colliding manifest time to complete between retries.

    The number of retries and the inter-retry wait time are calibrated to
    ensure that a persistently failing mtcAgent process exits in under 40
    seconds.

    This is to ensure that SM is able to detect and swact away from a
    persistently failing maintenance process while also giving the process
    a few tries to resolve on its own.

    Test Plan:

    PASS: Verify socket init failure thresholded retry handling
          with no, persistent and recovered failure conditions.
    PASS: Verify swact if socket init failure is persistent
    PASS: Verify no swact if socket failure recovers after first exit
    PASS: Verify no swact if socket failure recovers over init retry
    PASS: Verify an hour long soak of continuous socket open/close retry

    Change-Id: I3cb085145308f0e920324e22111f40bdeb12b444
    Closes-Bug: 1869192
    Signed-off-by: Eric MacDonald <email address hidden>
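
The retry handling the commit describes can be sketched as a simple thresholded loop. The following C++ outline uses hypothetical names and values (6 attempts with 5 second waits) and is not the actual mtcAgent code:

    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    /* Stand-in for mtcAgent's socket setup; returns 0 on success.
     * In the failure described in this bug, this step can fail while
     * the network manifest is still configuring the cluster interface. */
    static int socket_init (void)
    {
        /* ... open and bind the maintenance sockets here ... */
        return 0;
    }

    int main ()
    {
        const int MAX_ATTEMPTS     = 6;  /* hypothetical calibration */
        const int RETRY_WAIT_SECS  = 5;  /* hypothetical calibration */

        int rc = socket_init();
        for (int attempt = 1; rc != 0 && attempt < MAX_ATTEMPTS; ++attempt)
        {
            /* give a colliding manifest time to complete */
            sleep(RETRY_WAIT_SECS);
            rc = socket_init();
        }
        if (rc != 0)
        {
            /* still failing after all retries: exit so SM can
             * detect the persistent failure and swact */
            fprintf(stderr, "socket init failed %d times; exiting\n", MAX_ATTEMPTS);
            exit(EXIT_FAILURE);
        }
        /* normal mtcAgent startup continues */
        return 0;
    }

With those illustrative values the worst case is 6 init attempts plus 5 waits of 5 seconds, roughly 25 to 30 seconds, which stays under the 40 second bound the commit message calls out while still giving a colliding manifest several chances to finish.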

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729821

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/729821
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=51bd3f945544cb97da2ef2a0b12bdf2c5468514c
Submitter: Zuul
Branch: f/centos8

commit efbaf2cd0db97fc1b43ffbf2a8346eb12638a08f
Author: Sharath Kumar K <email address hidden>
Date: Mon May 4 10:24:30 2020 +0200

    De-branding in starlingx/metal: TIS -> StarlingX

    1. Rename TIS to StarlingX for .spec file

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch11 changes.

    Story: 2006387
    Task: 36207

    Change-Id: I52673924a8186afb7482d7ba7b601f4733268afb
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 24499f8f25a72abbb109a3a6494ad38744a1d147
Author: Ovidiu Poncea <email address hidden>
Date: Fri May 8 19:21:20 2020 +0300

    Fix wipedisk rootfs partition and bootloader removal

    This commit changes:
     o DD wipe is no longer executed on / or we lose access
       to system commands before the wipe is complete. Wiping
       the root beginning and end is not mandatory as on reinstall
       it is reformatted.
     o Partitions on rootfs, other than platform-backup, are
       removed.
     o Bootloader is removed so that boot from secondary devices
       can be done. This is useful at host reinstall. Without this
       change, boot hangs and manual intervention is needed.

    Change-Id: I1ab9f70d00a38568fc00063cdaa54ec3be48dc33
    Closes-Bug: 1877579
    Signed-off-by: Ovidiu Poncea <email address hidden>

commit a56b99c84693149a035ffe6594099f60db71584e
Author: Ovidiu Poncea <email address hidden>
Date: Sat May 2 11:47:44 2020 -0400

    Fix partition removal after wipe

    After the wipe step, partitions are not removed when installing the load.
    This commit fixes this.

    Also, on some systems with NVMe, udev doesn't correctly remove the device
    nodes for the deleted partitions from /dev/nvme*, causing them to be seen
    as non-block devices; this leads to failures when formatting or assigning LVM PVs.

    Change-Id: I3ab9f70d00a38568fc00063cdaa54ec3be48dc58
    Closes-Bug: 1876374
    Signed-off-by: Ovidiu Poncea <email address hidden>

commit ece0dd0ce5e36c461c93a5cc3b803fb3b5c5e59e
Author: Mihnea Saracin <email address hidden>
Date: Wed Apr 15 20:25:22 2020 +0300

    Persistent backup partition

    Add a backup partition that has
    the following characteristics:
    - It will never be deleted
      (not at install, reinstall, upgrade, nor B&R)
    - The partition will be 10G
    - It will be resizable at upgrades

    Story: 2007403
    Task: 39548
    Change-Id: I2ec9f70d00a38568fc00063cdaa54ec3be48dc47
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 84c720f4562bde3d06b245bca0b7ad41655d35f5
Author: Don Penney <email address hidden>
Date: Mon Apr 27 22:50:10 2020 -0400

    Drop copy of .cfg files from controller kickstarts

    In a boot from an ISO modified by update-iso.sh with a ks-addon, the
    ...


tags: added: in-f-centos8