After DOR, compute nodes came back in a degraded state

Bug #1956755 reported by Andre Kantek
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Andre Kantek

Bug Description

Brief Description
-----------------
After the DOR test, all the computes raise the alarm "compute is degraded due to the failure of its 'hbsClient' process".
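
As a hedged illustration (not part of the original report), the degraded state and the failed process can typically be confirmed on a StarlingX system with commands along these lines:

    # Illustrative checks only; exact alarm wording and host names are assumptions.
    fm alarm-list                # look for the process-failure alarm raised against the worker hosts
    system host-list             # worker availability is expected to show "degraded"
    ps -ef | grep hbsClient      # on the affected worker, check whether hbsClient is running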

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
1) Power off all the nodes
2) Wait for 60 seconds
3) Power on all the nodes

Expected Behavior
------------------
After the DOR test, all the hosts recover by themselves

Actual Behavior
----------------
The computes do not recover from the hbsClient process failure

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system (AIO-DX+worker)

Branch/Pull Time/Commit
-----------------------
"2021-12-12_21-12-25"

Last Pass
---------
"2021-12-12_21-12-25"

Test Activity
-------------
Regression Testing

Workaround
----------
Reboot worker node
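
As a rough sketch of that workaround from the active controller (the host name compute-0 and the lock/reboot/unlock sequence are assumptions, based on standard StarlingX host maintenance):

    # Lock, reboot and unlock the affected worker node.
    system host-lock compute-0
    system host-reboot compute-0
    system host-unlock compute-0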

Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: stx.7.0 / medium - specific issue after power reset; workaround exists. Sufficient to fix in the active branch.

tags: added: stx.7.0 stx.networking
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Andre Kantek (akantek)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/823789

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/823789
Committed: https://opendev.org/starlingx/stx-puppet/commit/3de5754b785049f914b70a432bccc15ea05822e3
Submitter: "Zuul (22348)"
Branch: master

commit 3de5754b785049f914b70a432bccc15ea05822e3
Author: Andre Fernando Zanella Kantek <email address hidden>
Date: Fri Jan 7 06:56:43 2022 -0500

    After DOR, compute nodes came back in a degraded state

    It was detected that, after a DOR (dead office recovery), the worker
    nodes were entering a degraded state. In the test scenario, an
    AIO-DX + 3 workers setup on IPv6, this was caused by the cluster
    interface missing after bring-up. In that configuration, the
    management interface was created on vlan186 and the cluster
    interface on vlan187, both attached to the PXE interface. Most of
    the time this triggers an extra reboot of the worker node that
    corrects the situation (if the host is unlocked), but occasionally
    the system remains in the error state.

    After reproduction tests, the root cause was traced to systemd's
    network service timeout (5 minutes) during the vlan186 DHCP
    request. During the DOR recovery, the controllers had not yet
    started the DHCP server at the time of vlan186's initial request
    (logs show it only became available after 10 minutes). Because
    systemd stops the service, vlan187 is never configured in the
    kernel (the interfaces are configured in alphabetical order), which
    generates the alarms.

    The correction adds a check to apply_network_config.sh (run during
    the worker manifest apply) that compares the vlan interface config
    files in /etc/sysconfig/network-scripts/ with the interfaces
    configured in the kernel and, for any that are missing, applies the
    configuration to create the device.

    Test Plan:
    PASS execution of DOR test with AIO-DX + 3 worker nodes
    PASS execution of power-off/power-on of a single worker node
    PASS execution of host lock/unlock

    Closes-Bug: 1956755

    Signed-off-by: Andre Fernando Zanella Kantek <email address hidden>
    Change-Id: Ia7517c67c47013069a622a30976150c2c1127aea
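
For illustration only, a minimal sketch of the kind of check described in the commit message (this is not the merged stx-puppet patch; the file naming pattern and the use of ifup are assumptions):

    #!/bin/bash
    # Sketch: compare vlan ifcfg files against interfaces known to the kernel
    # and re-apply the configuration for any vlan device that is missing.
    SCRIPT_DIR=/etc/sysconfig/network-scripts
    for cfg in "${SCRIPT_DIR}"/ifcfg-vlan*; do
        [ -e "${cfg}" ] || continue
        iface=$(basename "${cfg}")
        iface=${iface#ifcfg-}
        # If the kernel does not know the device, bring it up from its config file.
        if ! ip link show dev "${iface}" > /dev/null 2>&1; then
            logger -t apply_network_config "vlan ${iface} missing, re-applying configuration"
            ifup "${iface}"
        fi
    done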

Changed in starlingx:
status: In Progress → Fix Released