StarlingX

Bug #1956755
Comment #3

OpenStack Infra (hudson-openstack) wrote on 2022-01-10: Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/823789
Committed: https://opendev.org/starlingx/stx-puppet/commit/3de5754b785049f914b70a432bccc15ea05822e3
Submitter: "Zuul (22348)"
Branch: master

commit 3de5754b785049f914b70a432bccc15ea05822e3
Author: Andre Fernando Zanella Kantek <email address hidden>
Date: Fri Jan 7 06:56:43 2022 -0500

After DOR, compute nodes came back in a degraded state

    After a DOR (dead office recovery), worker nodes were found to come
    back in a degraded state. In the test scenario (AIO-DX + 3 workers,
    IPv6), the cause was a missing cluster interface after boot. In this
    configuration, the management interface was created on vlan186 and
    the cluster interface on vlan187, both attached to the PXE
    interface. Most of the time this triggers an extra reboot of the
    worker node that corrects the situation (if the host is unlocked),
    but occasionally the system remains in the error state.

    Reproduction tests traced the root cause to systemd's network
    service timeout (5 minutes) expiring during the vlan186 DHCP
    request. During DOR recovery, the controllers had not yet started
    the DHCP server when vlan186 made its initial request (logs show
    it started only after 10 minutes). Because systemd stops the
    service and the interfaces are configured in alphabetical order,
    vlan187 is never configured in the kernel, which raises the alarms.
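
    The failure sequence described above can be illustrated with a toy
    model. This is only a sketch: simulate_bringup and its "name:seconds"
    argument format are hypothetical, and it does not reproduce how
    systemd or the network scripts are actually implemented. It only
    shows why a timeout on the first interface in sorted order leaves
    every later interface unconfigured.

```shell
# Toy model (hypothetical helper): bring interfaces up one by one in
# sorted order, sharing a single time budget in seconds.
# Usage: simulate_bringup BUDGET name:seconds [name:seconds ...]
simulate_bringup() {
    local budget=$1
    shift
    local elapsed=0 stopped=0 spec name cost
    for spec in $(printf '%s\n' "$@" | sort); do
        name=${spec%%:*}
        cost=${spec##*:}
        if [ "$stopped" -eq 1 ]; then
            # The service was already stopped, nothing else is attempted.
            echo "$name: skipped"
        elif [ $((elapsed + cost)) -gt "$budget" ]; then
            # This interface exhausts the budget, the service is stopped.
            echo "$name: timed out"
            stopped=1
        else
            elapsed=$((elapsed + cost))
            echo "$name: up"
        fi
    done
}
```

    With a 300 second budget and a vlan186 request that would need 600
    seconds, "simulate_bringup 300 vlan187:1 vlan186:600" reports
    vlan186 as timed out and vlan187 as skipped.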

    The fix adds a check to apply_network_config.sh (run during worker
    manifest apply) that compares the vlan interface config files in
    /etc/sysconfig/network-scripts/ with the interfaces configured in
    the kernel and, for any interface that is not present, applies the
    configuration to create the device.
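
    The check described above might look roughly like the sketch below.
    It is not the code from the commit: the helper names, the
    ifcfg-vlan* file naming, the use of /sys/class/net to detect kernel
    devices, and the CFG_DIR, SYS_NET_DIR and DRY_RUN variables are all
    assumptions made for illustration.

```shell
# Sketch only. Both directories can be overridden so the logic can be
# tried on a scratch tree instead of a live host.
CFG_DIR="${CFG_DIR:-/etc/sysconfig/network-scripts}"
SYS_NET_DIR="${SYS_NET_DIR:-/sys/class/net}"

# Hypothetical helper: print every vlan interface that has a config
# file on disk but no matching device in the kernel.
missing_vlan_interfaces() {
    local cfg iface
    for cfg in "$CFG_DIR"/ifcfg-vlan*; do
        [ -e "$cfg" ] || continue          # glob matched nothing
        iface="${cfg##*/ifcfg-}"           # ifcfg-vlan187 -> vlan187
        if [ ! -e "$SYS_NET_DIR/$iface" ]; then
            echo "$iface"
        fi
    done
}

# Hypothetical helper: bring up each missing interface. With DRY_RUN=1
# the command is only printed.
restore_missing_vlans() {
    local iface
    for iface in $(missing_vlan_interfaces); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ifup $iface"
        else
            ifup "$iface"
        fi
    done
}
```

    Running restore_missing_vlans with DRY_RUN=1 first shows which
    interfaces would be brought up without touching the host.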

    Test Plan:
    PASS execution of DOR test with AIO-DX + 3 worker nodes
    PASS execution of power-off/power-on of single worker node
    PASS execution of host lock/unlock

Closes-Bug: 1956755

Signed-off-by: Andre Fernando Zanella Kantek <email address hidden>
Change-Id: Ia7517c67c47013069a622a30976150c2c1127aea
