Comment 3 for bug 1956755

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/823789
Committed: https://opendev.org/starlingx/stx-puppet/commit/3de5754b785049f914b70a432bccc15ea05822e3
Submitter: "Zuul (22348)"
Branch: master

commit 3de5754b785049f914b70a432bccc15ea05822e3
Author: Andre Fernando Zanella Kantek <email address hidden>
Date: Fri Jan 7 06:56:43 2022 -0500

    After DOR, compute nodes came back in a degraded state

    It was detected that after a DOR (dead office recovery), the worker
    nodes were entering a degraded state. In the test scenario, an
    AIO-DX + 3 workers system on IPv6, the cause was that the cluster
    interface was missing after the node came up. In this configuration,
    the management interface was created on vlan186 and the cluster
    interface on vlan187, with both attached to the PXE interface. Most
    of the time this triggers an extra reboot of the worker node that
    corrects the situation (if the host is unlocked), but occasionally
    the system remains in the error state.

    After reproduction tests, the root cause was traced to systemd's
    network service timeout (5 minutes) during the vlan186 DHCP request.
    During the DOR recovery, the controllers had not yet started the
    DHCP server at the moment of vlan186's initial request (logs show
    this happened after 10 minutes). Since systemd stops the service,
    vlan187 is never configured in the kernel (the interfaces are
    configured in alphabetical order), which generates the alarms.

    The correction adds a check to apply_network_config.sh (run during
    worker manifest apply) that compares the vlan interface config
    files in /etc/sysconfig/network-scripts/ with the interfaces
    configured in the kernel and, if one is not present, applies its
    configuration to create the device.

    Test Plan:
    PASS execution of DOR test with AIO-DX + 3 worker nodes
    PASS execution of power-off/power-on of single worker node
    PASS execution of host lock/unlock

    Closes-Bug: 1956755

    Signed-off-by: Andre Fernando Zanella Kantek <email address hidden>
    Change-Id: Ia7517c67c47013069a622a30976150c2c1127aea
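
The check described in the commit message (compare the vlan config files with the interfaces the kernel actually has) could look roughly like the sketch below. This is not the code from the merged change: the function name `missing_vlan_devices`, the `ifcfg-vlan*` file pattern, the use of `/sys/class/net` as the kernel's view, and the fallback to the file name when `DEVICE=` is absent are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from the actual stx-puppet patch.
# Prints the name of every vlan device that has an ifcfg file in CFG_DIR
# but no matching entry in NET_DIR (assumed here to be the kernel's
# interface list, normally /sys/class/net).
missing_vlan_devices() {
    cfg_dir="${1:-/etc/sysconfig/network-scripts}"
    net_dir="${2:-/sys/class/net}"

    for cfg in "$cfg_dir"/ifcfg-vlan*; do
        # An unmatched glob stays literal; skip anything that is not a file.
        [ -f "$cfg" ] || continue

        # Prefer the DEVICE= entry; fall back to the file name suffix.
        dev=$(sed -n 's/^DEVICE=//p' "$cfg" | head -n 1 | tr -d "\"'")
        [ -n "$dev" ] || dev="${cfg##*/ifcfg-}"

        # A device known to the kernel shows up as an entry under net_dir.
        [ -e "$net_dir/$dev" ] || printf '%s\n' "$dev"
    done
}
```

A caller could loop over the printed names and bring each device up, for example with `ifup`; which command the real script uses is not stated in the commit message, so that step is left out of the sketch.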