After mgmt network reconfig, old mgmt is still used by "sw-patch query-hosts"

Bug #2060066 reported by Fabiano Correa Mercer
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Fabiano Correa Mercer

Bug Description

Brief Description
-----------------

After mgmt network reconfig, the command "sudo sw-patch query-hosts" still reports the old mgmt network IP.

```
[sysadmin@controller-0 ~(keystone_admin)]$ sudo sw-patch query-hosts
  Hostname IP Address Patch Current Reboot Required Release State
============ ====================== ============= =============== ======= =====
controller-0 fdff:719a:bf60:1021::3 Yes No 23.09 idle
```

Here "fdff:719a:bf60:1021::3" is an IP from the old mgmt network address pool. After an additional lock/unlock, the IP is updated correctly. It is not yet known whether this affects patch apply/remove operations (this will be verified and updated in the comments section).

In one reproduction of the issue, both old and new mgmt IPs were displayed at the same time:

```
 [sysadmin@controller-0 ~(keystone_admin)]$ sudo sw-patch query

Patch ID RR Release Patch State
======== == ======= ===========

[sysadmin@controller-0 ~(keystone_admin)]$ sudo sw-patch query-hosts

  Hostname IP Address Patch Current Reboot Required Release State
============ ====================== ============= =============== ======= =====
controller-0 fdff:719a:bf60:1020::3 Yes No 23.09 idle
controller-0 fdff:10:10:22::3 Yes No 23.09 idle

```
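The duplicate-entry case above can be checked for mechanically. The following is a minimal sketch, assuming the table layout shown in this report (a `====` separator line followed by whitespace-separated rows with the hostname first and the IP second). The function names `ips_by_hostname` and `stale_entries` are hypothetical helpers for illustration, not part of sw-patch, and the input is assumed to contain only the command output, without shell prompts.

```python
from collections import defaultdict


def ips_by_hostname(query_hosts_output: str) -> dict:
    """Map each hostname to the list of IPs reported for it."""
    hosts = defaultdict(list)
    in_rows = False
    for line in query_hosts_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("="):
            # Data rows begin after the ==== separator line.
            in_rows = True
            continue
        if in_rows:
            fields = stripped.split()
            if len(fields) >= 2:
                hosts[fields[0]].append(fields[1])
    return dict(hosts)


def stale_entries(query_hosts_output: str) -> dict:
    """Return only the hostnames that are listed with more than one IP."""
    return {
        host: ips
        for host, ips in ips_by_hostname(query_hosts_output).items()
        if len(ips) > 1
    }


# Sample copied from the output in this report.
SAMPLE = """\
  Hostname IP Address Patch Current Reboot Required Release State
============ ====================== ============= =============== ======= =====
controller-0 fdff:719a:bf60:1020::3 Yes No 23.09 idle
controller-0 fdff:10:10:22::3 Yes No 23.09 idle
"""

print(stale_entries(SAMPLE))
```

A hostname that appears in the result has more than one address registered with the patch controller, which is the symptom described here.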

patching.log contains several socket setup errors:

```
2023-12-14T18:42:28: sw-patch-controller-daemon[10260]: base.py(169): INFO: Unable to setup sockets. Waiting to retry
2023-12-14T18:42:33: sw-patch-agent[10246]: base.py(138): ERROR: Failed to setup socket
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cgcs_patch/base.py", line 132, in setup_socket
    sock_in = self.setup_socket_ipv6()
  File "/usr/lib/python3/dist-packages/cgcs_patch/base.py", line 107, in setup_socket_ipv6
    self.sock_out.bind((mgmt_ip, 0))
OSError: [Errno 99] Cannot assign requested address
2023-12-14T18:42:33: sw-patch-agent[10246]: base.py(169): INFO: Unable to setup sockets. Waiting to retry
2023-12-14T18:42:33: sw-patch-controller-daemon[10260]: base.py(138): ERROR: Failed to setup socket
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cgcs_patch/base.py", line 132, in setup_socket
    sock_in = self.setup_socket_ipv6()
  File "/usr/lib/python3/dist-packages/cgcs_patch/base.py", line 107, in setup_socket_ipv6
    self.sock_out.bind((mgmt_ip, 0))
OSError: [Errno 99] Cannot assign requested address
```
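For context, "Cannot assign requested address" (EADDRNOTAVAIL) is what `bind()` returns when the requested IP is not configured on any local interface, which is consistent with the services still holding the old mgmt IP. The sketch below reproduces the same class of error in isolation. It is not the cgcs_patch code: `try_bind` is a hypothetical helper, IPv4 is used instead of IPv6 for portability, and 192.0.2.1 (from the TEST-NET-1 documentation range) is assumed not to be configured on the host.

```python
import errno
import socket


def try_bind(ip: str, family: int = socket.AF_INET):
    """Return None if a UDP socket can bind to ip, else the errno raised."""
    sock = socket.socket(family, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the kernel pick any free port, as in the traceback above.
        sock.bind((ip, 0))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


# Loopback is configured locally, so the bind is expected to succeed.
print(try_bind("127.0.0.1"))

# An address that is not configured on any interface should fail the bind.
print(try_bind("192.0.2.1") == errno.EADDRNOTAVAIL)
```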

Severity
--------
<Major: System/Feature is usable but degraded>

Steps to Reproduce
------------------
1 - Install AIO-SX IPv4

2 - Reconfigure the mgmt network

3 - Check patching.log and the output of "sudo sw-patch query-hosts"

Expected Behavior
------------------
"sw-patch query-hosts" shows the new mgmt network IP and patching.log does not contain errors

Actual Behavior
----------------
The old mgmt IP is still used and socket errors appear in patching.log

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-SX IPv4

Branch/Pull Time/Commit
-----------------------
Any load after: 2023-12-13_19-00-28

Last Pass
---------
First time testing

Timestamp/Logs
--------------
See Description

Test Activity
-------------
Feature Testing

Workaround
-------------
Restart sw-patch services

Changed in starlingx:
assignee: nobody → Fabiano Correa Mercer (fcorream)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote:

Related to storyboard: https://storyboard.openstack.org/#!/story/2010722 which is an stx.10.0 feature

tags: added: stx.10.0 stx.networking
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/update/+/915181

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (master)

Change abandoned by "Fabiano Correa Mercer <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/914711
Reason: After talking with Matt, the idea is not to change /etc/hosts before the reboot; puppet must do it.

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/914710
Committed: https://opendev.org/starlingx/config/commit/4919bf72138e0a886637e58723aadff421e74adf
Submitter: "Zuul (22348)"
Branch: master

commit 4919bf72138e0a886637e58723aadff421e74adf
Author: Fabiano Correa Mercer <email address hidden>
Date: Thu Mar 28 11:50:36 2024 -0300

    Send the correct mgmt-IP to mtce

    After the management reconfiguration, it was not possible to apply a reboot-required
    patch because sysinv was sending the old mgmt IP address to mtce.
    Consequently, mtce wasn't creating the required file (/var/run/.node_locked) during
    the host-lock command.
    This file is essential for the sw-patch tool to proceed with the installation.

    Additionally, the management network reconfiguration runtime manifest can be executed
    prematurely if the MGMT_NETWORK_RECONFIGURATION_ONGOING flag is used.
    However, users might introduce other changes that could unintentionally trigger the
    runtime manifests before the host-unlock command.
    This could lead to unexpected keystone changes, potentially causing CLI blockage or
    system reboots.

    The MGMT_NETWORK_RECONFIGURATION_ONGOING flag is created when initiating management
    network reconfiguration commands and is intended to avoid updates to the dnsmasq
    files until system reboot.
    Changed to MGMT_NETWORK_RECONFIGURATION_UNLOCK because this flag is intended to
    guarantee keystone changes only occur during the unlock command.

    Tests done:
    IPv4 AIO-SX fresh install
    IPv4 AIO-DX with mgmt in vlan fresh install
    IPv4 DC with subcloud AIO-SX
    IPv4 AIO-SX mgmt reconfig and apply a reboot-required patch
    IPv4 subcloud AIO-SX mgmt reconfig and apply a reboot-required patch

    Partial-Bug: #2060066
    Story: 2010722
    Task: 49810

    Change-Id: I138d8e31edd60a41a4595cfb8bd2dc478bc01013

OpenStack Infra (hudson-openstack) wrote : Fix merged to update (master)

Reviewed: https://review.opendev.org/c/starlingx/update/+/915181
Committed: https://opendev.org/starlingx/update/commit/7fef84e36e0b0b1b56ec8c57fdd374e665824af8
Submitter: "Zuul (22348)"
Branch: master

commit 7fef84e36e0b0b1b56ec8c57fdd374e665824af8
Author: Fabiano Correa Mercer <email address hidden>
Date: Fri Apr 5 16:54:36 2024 -0300

    sw-patch-agent waits the new mgmt IP config

    During the management network reconfiguration, the system is restarted
    so that the controller_config script runs the puppet code and updates
    all services to use the new mgmt IP address.
    But the sw-patch services start before controller_config.
    When they start, they get the mgmt_ip using the Python socket lib, which
    uses the IP address from /etc/hosts.
    But /etc/hosts is not updated yet at that time, so they get the old
    management network IP.
    To fix this issue, the sw-patch services will wait for the puppet
    code to be applied to make sure /etc/hosts and the new management
    network IPs are installed in the system.

    Tests done:
    IPv4 AIO-SX fresh install
    IPv4 AIO-DX fresh install
    IPv4 DC with subcloud AIO-SX fresh install
    IPv4 AIO-SX mgmt reconfig and apply a non-reboot-required patch
    IPv4 AIO-SX mgmt reconfig and apply a reboot-required patch
    IPv4 subcloud AIO-SX mgmt reconfig and apply a non-reboot-required patch

    IPv4 subcloud AIO-SX mgmt reconfig and apply a reboot-required patch
         For this test, sw-patch was in a failed state after the reboot;
         this happens even without the mgmt reconfig and this fix

    Partial-Bug: #2060066
    Story: 2010722
    Task: 49827

    Depends-On: https://review.opendev.org/c/starlingx/config/+/914710
    Change-Id: Ie544425513ef4fede73b4b55770ad6857cdf7eed
    Signed-off-by: Fabiano Correa Mercer <email address hidden>
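
The ordering problem described in this commit message (the mgmt IP is resolved before /etc/hosts has been rewritten) can be illustrated with a small polling loop. This is only a sketch of the general wait-then-resolve idea, not the merged change: `wait_then_resolve`, its parameters, and the fake callbacks are hypothetical, and how the real services detect that the puppet manifests have been applied is not shown in this report.

```python
import time
from typing import Callable, Optional


def wait_then_resolve(
    config_applied: Callable[[], bool],
    resolve_mgmt_ip: Callable[[], str],
    interval: float = 5.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll config_applied() until it is true, then resolve the mgmt IP."""
    attempts = 0
    while not config_applied():
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError("configuration was not applied in time")
        sleep(interval)
    # Resolving only now avoids caching the pre-reconfiguration address.
    return resolve_mgmt_ip()


# Simulated boot: the configuration counts as applied from the third poll on.
state = {"polls": 0}


def fake_config_applied() -> bool:
    state["polls"] += 1
    return state["polls"] >= 3


def fake_resolve() -> str:
    # Stand-in for a hostname lookup that reads /etc/hosts.
    return "new-mgmt-ip" if state["polls"] >= 3 else "old-mgmt-ip"


print(wait_then_resolve(fake_config_applied, fake_resolve, sleep=lambda _s: None))
```

Passing `sleep` and the two callbacks as parameters keeps the loop testable without real delays; a real service would supply its own readiness check and lookup.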

Changed in starlingx:
status: In Progress → Fix Released