commit 3de5754b785049f914b70a432bccc15ea05822e3
Author: Andre Fernando Zanella Kantek <email address hidden>
Date: Fri Jan 7 06:56:43 2022 -0500
After DOR, compute nodes came back in a degraded state
After a DOR (dead office recovery), worker nodes were found to enter a
degraded state. In the test scenario (AIO-DX + 3 workers on IPv6), the
cause was a cluster interface that was missing after the node came up.
In that configuration, the management interface was created on vlan186
and the cluster interface on vlan187, both attached to the PXE
interface. Most of the time this triggers an extra reboot of the worker
node that corrects the situation (if the host is unlocked), but
occasionally the system remains in the error state.
Reproduction tests traced the root cause to systemd's network service
timeout (5 minutes) expiring during the vlan186 DHCP request. During
the DOR, the controllers had not yet started the DHCP server at the
moment of vlan186's initial request (logs show this happened only
after 10 minutes). Because systemd stops the service and the interfaces
are configured in alphabetical order, vlan187 is never configured in
the kernel, which raises the alarms.
The fix adds a check to apply_network_config.sh (run during worker
manifest apply) that compares the vlan interface config files in
/etc/sysconfig/network-scripts/ with the interfaces configured in the
kernel and, if a device is not present, applies its configuration to
create it.
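The check described above can be pictured roughly as follows. This is a
minimal sketch, not the code added to apply_network_config.sh: it
assumes each vlan config file is named ifcfg-<device>, that a device
known to the kernel appears under /sys/class/net, and that ifup is the
tool used to apply a config. The function names and the CFG_DIR,
SYS_NET and IFUP_CMD variables are hypothetical, introduced only so the
logic can be tried in a sandbox.

```shell
#!/bin/bash
# Sketch only: NOT the actual change to apply_network_config.sh.
# CFG_DIR, SYS_NET and IFUP_CMD are hypothetical overrides for dry runs.
CFG_DIR="${CFG_DIR:-/etc/sysconfig/network-scripts}"
SYS_NET="${SYS_NET:-/sys/class/net}"

# Print every vlan device that has a config file but no kernel device.
# Assumes the file name (ifcfg-vlan187) carries the device name (vlan187).
list_missing_vlans() {
    local cfg dev
    for cfg in "$CFG_DIR"/ifcfg-vlan*; do
        [ -e "$cfg" ] || continue        # glob matched nothing
        dev="${cfg##*/ifcfg-}"           # strip directory and "ifcfg-" prefix
        [ -e "$SYS_NET/$dev" ] || echo "$dev"
    done
}

# Apply the configuration of each missing device to create it.
apply_missing_vlans() {
    local dev
    for dev in $(list_missing_vlans); do
        ${IFUP_CMD:-ifup} "$dev"
    done
}
```

Running apply_missing_vlans with IFUP_CMD="echo ifup" prints the
commands that would be run without touching the network.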
Test Plan:
PASS execution of DOR test with AIO-DX + 3 worker nodes
PASS execution of power-off/power-on of single worker node
PASS execution of host lock/unlock
Closes-Bug: 1956755
Signed-off-by: Andre Fernando Zanella Kantek <email address hidden>
Change-Id: Ia7517c67c47013069a622a30976150c2c1127aea
Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/823789
Committed: https://opendev.org/starlingx/stx-puppet/commit/3de5754b785049f914b70a432bccc15ea05822e3
Submitter: "Zuul (22348)"
Branch: master