After DOR, compute nodes came back in a degraded state

Bug #1956755 reported by Andre Kantek
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Andre Kantek

Bug Description

Brief Description
-----------------
After the DOR test, all the computes raise the alarm "compute is degraded due to the failure of its 'hbsClient' process".
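
As a hedged illustration (not part of the original report), the degraded state and the failed process can typically be confirmed on a StarlingX system with commands along these lines:

    # Illustrative checks only; exact alarm wording and host names are assumptions.
    fm alarm-list                # look for the process-failure alarm raised against the worker hosts
    system host-list             # worker availability is expected to show "degraded"
    ps -ef | grep hbsClient      # on the affected worker, check whether hbsClient is running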

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
1) Power off all the nodes
2) Wait for 60 seconds
3) Power on all the nodes

Expected Behavior
------------------
After the DOR test, all the hosts recover by themselves

Actual Behavior
----------------
The computes do not recover from the hbsClient process failure

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system (AIO-DX+worker)

Branch/Pull Time/Commit
-----------------------
"2021-12-12_21-12-25"

Last Pass
---------
"2021-12-12_21-12-25"

Test Activity
-------------
Regression Testing

Workaround
----------
Reboot worker node
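
As a rough sketch of that workaround from the active controller (the host name compute-0 and the lock/reboot/unlock sequence are assumptions, based on standard StarlingX host maintenance):

    # Lock, reboot and unlock the affected worker node.
    system host-lock compute-0
    system host-reboot compute-0
    system host-unlock compute-0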

Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: stx.7.0 / medium - specific issue after power reset; workaround exists. Sufficient to fix in the active branch.

tags: added: stx.7.0 stx.networking
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Andre Kantek (akantek)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/823789

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/823789
Committed: https://opendev.org/starlingx/stx-puppet/commit/3de5754b785049f914b70a432bccc15ea05822e3
Submitter: "Zuul (22348)"
Branch: master

commit 3de5754b785049f914b70a432bccc15ea05822e3
Author: Andre Fernando Zanella Kantek <email address hidden>
Date: Fri Jan 7 06:56:43 2022 -0500

    After DOR, compute nodes came back in a degraded state

    It was detected that, after a DOR (dead office recovery), the worker
    nodes were entering a degraded state. In the test scenario, an
    AIO-DX + 3 workers setup on IPv6, this was caused by the cluster
    interface missing after bring-up. In that configuration, the
    management interface was created on vlan186 and the cluster
    interface on vlan187, both attached to the PXE interface. Most of
    the time this triggers an extra reboot of the worker node that
    corrects the situation (if the host is unlocked), but occasionally
    the system remains in the error state.

    After reproduction tests, the root cause was traced to systemd's
    network service timeout (5 minutes) during the vlan186 DHCP
    request. During the DOR recovery, the controllers had not yet
    started the DHCP server at the time of vlan186's initial request
    (logs show it only became available after 10 minutes). Because
    systemd stops the service, vlan187 is never configured in the
    kernel (the interfaces are configured in alphabetical order), which
    generates the alarms.

    The correction adds a check to apply_network_config.sh (run during
    the worker manifest apply) that compares the vlan interface config
    files in /etc/sysconfig/network-scripts/ with the interfaces
    configured in the kernel and, for any that are missing, applies the
    configuration to create the device.

    Test Plan:
    PASS execution of DOR test with AIO-DX + 3 worker nodes
    PASS execution of power-off/power-on of a single worker node
    PASS execution of host lock/unlock

    Closes-Bug: 1956755

    Signed-off-by: Andre Fernando Zanella Kantek <email address hidden>
    Change-Id: Ia7517c67c47013069a622a30976150c2c1127aea
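
For illustration only, a minimal sketch of the kind of check described in the commit message (this is not the merged stx-puppet patch; the file naming pattern and the use of ifup are assumptions):

    #!/bin/bash
    # Sketch: compare vlan ifcfg files against interfaces known to the kernel
    # and re-apply the configuration for any vlan device that is missing.
    SCRIPT_DIR=/etc/sysconfig/network-scripts
    for cfg in "${SCRIPT_DIR}"/ifcfg-vlan*; do
        [ -e "${cfg}" ] || continue
        iface=$(basename "${cfg}")
        iface=${iface#ifcfg-}
        # If the kernel does not know the device, bring it up from its config file.
        if ! ip link show dev "${iface}" > /dev/null 2>&1; then
            logger -t apply_network_config "vlan ${iface} missing, re-applying configuration"
            ifup "${iface}"
        fi
    done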

Changed in starlingx:
status: In Progress → Fix Released