LDAP authentication issue after rehoming a DC AIO-DX+worker subcloud

Bug #2056560 reported by Steven Webster
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Steven Webster

Bug Description

Brief Description
-----------------
After rehoming a standard configuration DC subcloud to a new central controller, worker nodes are not able to contact the LDAP server.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
1. Rehome a standard configuration subcloud to a new central controller.
2. Run a sudo operation on a worker node of the subcloud.
3. Observe that the operation is delayed and reports the error:
   sudo: ldap_sasl_bind_s(): Can't contact LDAP server
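
To separate a plain connectivity failure from other sudo/LDAP problems, a quick TCP probe from the worker node can help. The following is a minimal sketch only: the function name is hypothetical, and it assumes the LDAP server listens on the standard LDAP port 389, which this report does not state for this deployment.

```python
import socket


def can_reach_ldap(host: str, port: int = 389, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout.

    Port 389 is the standard LDAP port; the port actually used by the
    system controller is an assumption here.
    """
    try:
        # create_connection resolves the host and handles IPv4 and IPv6.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and missing routes.
        return False
```

A False result on a worker node combined with a True result on a controller node of the same subcloud would be consistent with the missing-route behavior described in this report.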

Expected Behavior
-----------------
Worker nodes are able to contact the LDAP server after rehoming.

Actual Behavior
---------------
Worker nodes are not able to contact the LDAP server after rehoming.

Reproducibility
---------------
100%

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
Master, 03/08/2024 (March 8, 2024)

Workaround
----------
Install a route to the new central system controller on the worker node.
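
As a rough illustration of the workaround, the sketch below builds an `ip route add` command line for a worker node. The function name is hypothetical, and the destination network, gateway, and interface are placeholders supplied by the caller; the report does not give the actual values, so nothing here reflects a real deployment.

```python
import ipaddress


def route_add_command(destination: str, gateway: str, device: str) -> list[str]:
    """Build an `ip route add` argument list (does not execute it).

    destination: system controller network in CIDR form (placeholder).
    gateway:     next hop reachable from the worker node (placeholder).
    device:      outgoing interface name on the worker node (placeholder).
    """
    net = ipaddress.ip_network(destination, strict=True)
    gw = ipaddress.ip_address(gateway)
    if net.version != gw.version:
        raise ValueError("destination and gateway must use the same IP version")
    # Select the address family explicitly so IPv6 clouds are handled too.
    family = "-6" if net.version == 6 else "-4"
    return ["ip", family, "route", "add", str(net), "via", str(gw), "dev", device]
```

The returned list can be passed to `subprocess.run(..., check=True)` by an operator with root privileges. A route added this way does not persist across reboots.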

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/912261

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/912262

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/912261
Committed: https://opendev.org/starlingx/config/commit/f8d30588ade9469dbbd97bc4b2655b30c19da6bb
Submitter: "Zuul (22348)"
Branch: master

commit f8d30588ade9469dbbd97bc4b2655b30c19da6bb
Author: Steven Webster <email address hidden>
Date: Fri Mar 8 08:30:07 2024 -0500

    Fix LDAP issue for DC subcloud

    This commit fixes an LDAP authentication issue seen on worker nodes
    of a subcloud after a rehoming procedure was performed.

    There are two main parts:

    1. Since every host of a subcloud authenticates with the system
       controller, we need to reconfigure the LDAP URI across all nodes
       of the system when the system controller network changes (upon
       rehome). Currently, it is only being reconfigured on controller
       nodes.

    2. Currently, the system uses an SNAT rule to allow worker/storage
       nodes to authenticate with the system controller when the admin
       network is in use. This is because the admin network only exists
       between controller nodes of a distributed cloud. The SNAT rule
       is needed to allow traffic from the (private) management network
       of the subcloud over the admin network to the system controller
       and back again. If the admin network is _not_ being used,
       worker/storage nodes of the subcloud can authenticate with the
       system controller, but routes must be installed on the
       worker/storage nodes to facilitate this. It becomes tricky to
       manage in certain circumstances of rehoming/network config.
       This traffic really should be treated in the same way as that
       of the admin network.

    This commit addresses the above by:

    1. Reconfiguring the ldap_server config across all nodes upon
       system controller network changes.

    2. Generalizing the current admin network nat implementation to
       handle the management network as well.

    Test Plan:

    IPv4, IPv6 distributed clouds

    1. Rehome a subcloud to another system controller and back again
       (mgmt network)
    2. Update the subcloud to use the admin network (mgmt -> admin)
    3. Rehome the subcloud to another system controller and back again
       (admin network)
    4. Update the subcloud to use the mgmt network (admin -> mgmt)

    After each of the numbered steps, the following were performed:

    a. Ensure the system controller could become managed, online, in-sync
    b. Ensure the iptables SNAT rules were installed or updated
       appropriately on the subcloud controller nodes.
    c. Log into a worker node of the subcloud and ensure sudo commands
       could be issued without LDAP timeout.
    d. Log into a worker node with LDAP USER X via console and verify
       login succeeds

    In general, tcpdump was also used to ensure the SNAT translation was
    actually happening.

    Partial-Bug: #2056560

    Change-Id: Ia675a4ff3a2cba93e4ef62b27dba91802811e097
    Signed-off-by: Steven Webster <email address hidden>
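
Part 1 of the commit above, reconfiguring the LDAP URI on every node, can be pictured with a small sketch. This is not the code from the config change: the function names are hypothetical, and the `uri <value>` line format is an assumption chosen for illustration, since the report does not name the file or key that StarlingX rewrites. The bracket handling reflects the IPv6 case listed in the test plan.

```python
import ipaddress


def ldap_uri(address: str, scheme: str = "ldap") -> str:
    """Build an LDAP URI, wrapping IPv6 literals in brackets."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Not an IP literal: treat it as a hostname and use it unchanged.
        return f"{scheme}://{address}"
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"{scheme}://{host}"


def rewrite_uri_lines(conf_text: str, new_uri: str) -> str:
    """Replace the value of every `uri ...` line in a config text.

    The `uri` key is an assumed, illustrative format; comments and all
    other lines are passed through untouched.
    """
    out = []
    for line in conf_text.splitlines():
        if line.split(None, 1)[:1] == ["uri"]:
            out.append(f"uri {new_uri}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

The point of the fix is that this kind of rewrite has to run on worker and storage nodes as well as controllers whenever the system controller network changes.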

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/912262
Committed: https://opendev.org/starlingx/stx-puppet/commit/ff0782df3932b38136fd49a22d4e8509e611cd39
Submitter: "Zuul (22348)"
Branch: master

commit ff0782df3932b38136fd49a22d4e8509e611cd39
Author: Steven Webster <email address hidden>
Date: Fri Mar 8 08:38:02 2024 -0500

    Fix LDAP issue for DC subcloud

    This commit fixes an LDAP authentication issue seen on worker nodes
    of a subcloud after a rehoming procedure was performed.

    Currently, the system uses an SNAT rule to allow worker/storage nodes
    to authenticate with the system controller when the admin network is
    in use. This is because the admin network only exists between
    controller nodes of a distributed cloud. The SNAT rule is needed to
    allow traffic from the (private) management network of the subcloud
    over the admin network to the system controller and back again.
    If the admin network is _not_ being used, worker/storage nodes of
    the subcloud can authenticate with the system controller, but routes
    must be installed on the worker/storage nodes to facilitate this.
    It becomes tricky to manage in certain circumstances of rehoming.
    This traffic really should be treated in the same way as that of the
    admin network.

    This commit addresses the above by generalizing the current admin
    network nat implementation to handle the management network as well.

    Test Plan:

    IPv4, IPv6 distributed clouds

    1. Rehome a subcloud to another system controller and back again
       (mgmt network)
    2. Update the subcloud to use the admin network (mgmt -> admin)
    3. Rehome the subcloud to another system controller and back again
       (admin network)
    4. Update the subcloud to use the mgmt network (admin -> mgmt)

    After each of the numbered steps, the following were performed:

    a. Ensure the system controller could become managed, online, in-sync
    b. Ensure the iptables SNAT rules were installed or updated
       appropriately on the subcloud controller nodes.
    c. Log into a worker node of the subcloud and ensure sudo commands
       could be issued without LDAP timeout.

    In general, tcpdump was also used to ensure the SNAT translation was
    actually happening.

    Closes-Bug: #2056560
    Depends-On: https://review.opendev.org/c/starlingx/config/+/912261

    Change-Id: If583b8eec7a385fb9b38e3ff80d58f5d842fe944
    Signed-off-by: Steven Webster <email address hidden>
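
The SNAT approach described in the commit above can be sketched as a rule that rewrites the source address of traffic leaving the subcloud's private management network for the system controller. The example below only builds the command line and is not the stx-puppet implementation: the function name is hypothetical, and the networks, interface, and translated address are caller-supplied placeholders. The `iptables`/`ip6tables` options used (`-t nat`, `-A POSTROUTING`, `-s`, `-d`, `-o`, `-j SNAT --to-source`) are standard.

```python
import ipaddress


def snat_rule_command(source_net: str, dest_net: str,
                      out_iface: str, to_source: str) -> list[str]:
    """Build an SNAT rule for the nat table's POSTROUTING chain.

    source_net: subcloud management network (placeholder).
    dest_net:   system controller network (placeholder).
    out_iface:  controller interface facing the system controller (placeholder).
    to_source:  controller address substituted as the source (placeholder).
    """
    src = ipaddress.ip_network(source_net)
    dst = ipaddress.ip_network(dest_net)
    nat_ip = ipaddress.ip_address(to_source)
    if not (src.version == dst.version == nat_ip.version):
        raise ValueError("all addresses must use the same IP version")
    # IPv6 rules are managed by a separate binary.
    binary = "ip6tables" if src.version == 6 else "iptables"
    return [
        binary, "-t", "nat", "-A", "POSTROUTING",
        "-s", str(src), "-d", str(dst), "-o", out_iface,
        "-j", "SNAT", "--to-source", str(nat_ip),
    ]
```

With a rule of this shape on the subcloud controllers, worker and storage nodes need no route of their own to the system controller, because replies come back to the controller's translated address. Running tcpdump on the outgoing interface, as the test plan does, confirms that the translation is taking place.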

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0
Ghada Khalil (gkhalil)
summary: - LDAP authentication issue after rehoming DC subcloud
+ LDAP authentication issue after rehoming DC AIO-DX+worker subcloud
summary: - LDAP authentication issue after rehoming DC AIO-DX+worker subcloud
+ LDAP authentication issue after rehoming a DC AIO-DX+worker subcloud