stx-openstack: `clients` pod fails to initialize on stand-by controllers

Bug #2031058 reported by Luan Nunes Utimura
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Luan Nunes Utimura

Bug Description

Brief Description
-----------------
On systems with multiple controller nodes, `clients` pods fail to initialize on standby controllers because their working directories are missing on those nodes.

Severity
--------
Major.

Steps to Reproduce
------------------
On a system with multiple controller nodes:
1) Upload/apply stx-openstack;
2) Verify that `clients` pods fail to initialize on standby controllers.

Expected Behavior
------------------
All `clients` pods should be running.

Actual Behavior
----------------
Only the `clients` pod on the active controller is running.

Reproducibility
---------------
Reproducible.

System Configuration
--------------------
System with two or more controllers.

Branch/Pull Time/Commit
-----------------------
StarlingX (master)
StarlingX OpenStack (master)

Last Pass
---------
N/A.

Timestamp/Logs
--------------
```
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl -n openstack get pods | grep clients
clients-clients-controller-0-937646f6-pnq6c 1/1 Running 0 9m34s
clients-clients-controller-1-cab72f56-tn252 0/1 Init:0/2 0 9m34s

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl -n openstack describe pod/clients-clients-controller-1-cab72f56-tn252
  Warning FailedMount 2m (x12 over 10m) kubelet MountVolume.SetUp failed for volume "clients-working-directory" : hostPath type check failed: /var/opt/openstack is not a directory
  Warning FailedMount 83s (x3 over 8m11s) kubelet Unable to attach or mount volumes: unmounted volumes=[clients-working-directory], unattached volumes=[kube-api-access-kcr8l pod-tmp clients-bin clients-working-directory]: timed out waiting for the condition
```
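The `FailedMount` event above comes from the kubelet's hostPath type check: with type `Directory`, the kubelet refuses to mount a path that does not already exist on the host. A minimal way to confirm the root cause is to check for the path on the standby controller (a sketch; the host name is an assumption):

```shell
# On the standby controller (e.g. via `ssh controller-1`), check whether the
# hostPath exists; with type `Directory`, the kubelet will not create it:
#   test -d /var/opt/openstack && echo present || echo missing
# The same check, reproduced locally against a path known not to exist:
test -d /nonexistent/openstack-workdir && echo present || echo missing
```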

Test Activity
-------------
Developer Testing.

Workaround
----------
SSH to each standby controller and manually create the expected working directories.
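A sketch of that workaround, run from the active controller. The path comes from the `FailedMount` event; the `root:openstack` ownership and group-writable mode are assumptions based on the fix description, not confirmed defaults:

```shell
# On each standby controller (e.g. via `ssh controller-1`):
#   sudo mkdir -p /var/opt/openstack
#   sudo chown root:openstack /var/opt/openstack   # assumed ownership
#   sudo chmod 770 /var/opt/openstack              # assumed mode
# The same sequence, demonstrated in a local scratch directory:
WORKDIR="$(mktemp -d)/openstack"
mkdir -p "$WORKDIR"
chmod 770 "$WORKDIR"
stat -c '%a' "$WORKDIR"
```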

Changed in starlingx:
assignee: nobody → Luan Nunes Utimura (lutimura)
tags: added: stx.9.0 stx.distro.openstack
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/891200
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/dca8b7519244149e28b9dbfbef1e86ba8993942e
Submitter: "Zuul (22348)"
Branch: master

commit dca8b7519244149e28b9dbfbef1e86ba8993942e
Author: Luan Nunes Utimura <email address hidden>
Date: Fri Aug 11 09:09:38 2023 -0300

    clients: Fix dir. creation on standby controllers

    Recently, it has been observed that, on systems with multiple controller
    nodes, the `clients` pods are failing to initialize on standby
    controllers due to the absence of their respective working directories.

    In the past, this wasn't a problem because the working directory was
    originally mounted with type `DirectoryOrCreate`, that is, K8s was
    responsible for ensuring that this directory existed during `clients`
    pods initialization.

    However, the problem with this parameter is that it creates directories
    with `root:root` permissions, which isn't ideal for system setups
    involving multiple user accesses.

    At the time, we solved this problem by simply moving the working
    directory creation logic to the application's lifecycle code, as seen in
    [1].

    This turned out to have side effects on systems with multiple
    controller nodes, however, as not all lifecycle hooks run on standby
    controllers. Consequently, the working directories weren't being created
    on these nodes.

    Simply put, we can solve the pod initialization problem by mounting the
    directories with `DirectoryOrCreate` (again). However, we must ensure
    that these directories will have the right permissions when a host
    swacts, and that's exactly what this change is aimed at.

    This change also improves the code, by:
      * Replacing the `change_file_mode()` and `change_file_owner()` utility
        functions with `os` builtins;
      * Synchronizing LDAP groups with Linux groups.
          - In some scenarios, e.g., multiple "applies followed by removes",
            the `openstack` LDAP group was created with a different GID than
            the `openstack` Linux group, which caused issues with checking
            the clients' working directory permissions.

    [1] https://opendev.org/starlingx/openstack-armada-app/src/commit/b2e10bfc5f25b3a7d2ed4d4c29cc67bf1dea3bdd/python3-k8sapp-openstack/k8sapp_openstack/k8sapp_openstack/lifecycle/lifecycle_openstack.py#L310

    Test Plan (on AIO-DX):
    PASS - Build python3-k8sapp-openstack package
    PASS - Build stx-openstack-helm-fluxcd package
    PASS - Build stx-openstack helm charts
    PASS - Upload/apply stx-openstack
    PASS - Verify that all `clients` pods are running

    On active controller:
      PASS - Verify that the `clients` working directory has the right
             permissions

    On standby controller:
      PASS - Verify that the `clients` working directory *does not* have the
             right permissions

    PASS - Perform a host swact
    PASS - Verify that the `clients` working directory has the right
           permissions on the former standby controller
...


Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (f/antelope)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (f/antelope)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/896523
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/979572890aaaf2914a2ebb62ac35c7d2b0476bf0
Submitter: "Zuul (22348)"
Branch: f/antelope

commit 979572890aaaf2914a2ebb62ac35c7d2b0476bf0
Author: Luan Nunes Utimura <email address hidden>
Date: Fri Aug 11 09:09:38 2023 -0300

    clients: Fix dir. creation on standby controllers


tags: added: in-f-antelope
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
