Locking controller timed out waiting for helm-controller to terminate

Bug #2034610 reported by Joshua Reed
This bug affects 1 person
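+-----------+--------------+------------+-------------+-----------+
| Affects   | Status       | Importance | Assigned to | Milestone |
+-----------+--------------+------------+-------------+-----------+
| StarlingX | Fix Released | High       | Joshua Reed |           |
+-----------+--------------+------------+-------------+-----------+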

Bug Description

Brief Description
-----------------
Locking the standby controller timed out with the error 'Terminating pods on disabled host controller-0 timed out...'

Severity
-----------------
Major

Steps to Reproduce
-----------------

Manual:

Find the standby controller.
system host-list

Lock the standby controller. The command prints the host properties, with task set to Locking:
system host-lock controller-0

+------------------------+-------------------------------------------------------------------------+
| Property | Value |
+------------------------+-------------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| apparmor | disabled |
| availability | available |
| bm_ip | 2620:10a:a001:d48::25 |
| bm_type | dynamic |
| bm_username | sysadmin |
| boot_device | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:3:110:0 |
| capabilities | {'is_max_cpu_configurable': 'configurable', 'stor_function': 'monitor'} |
| clock_synchronization | ntp |
| config_applied | 525564f9-ae6d-4078-9d34-cdbf7fb431eb |
| config_status | None |
| config_target | 525564f9-ae6d-4078-9d34-cdbf7fb431eb |
| console | ttyS0,115200n8 |
| created_at | 2023-08-31T05:34:22.131670+00:00 |
| cstates_available | C1,C1E,C6,POLL |
| device_image_update | None |
| hostname | controller-0 |
| hw_settle | 0 |
| id | 1 |
| install_output | text |
| install_state | None |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| max_cpu_mhz_allowed | 3300 |
| max_cpu_mhz_configured | None |
| mgmt_ip | fdff:719a:bf60:2016::3 |
| mgmt_mac | b4:83:51:00:ae:f8 |
| min_cpu_mhz_allowed | 800 |
| operational | enabled |
| personality | controller |
| reboot_needed | False |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:3:110:0 |
| serialid | None |
| software_load | 23.09 |
| task | Locking |
| tboot | |
| ttys_dcd | False |
| updated_at | 2023-08-31T14:13:25.498051+00:00 |
| uptime | 23186 |
| uuid | f1b4280a-c929-44ba-8cfd-a7a00c7dae4a |
| vim_progress_status | services-enabled |
+------------------------+-------------------------------------------------------------------------+
[sysadmin@controller-1 ~(keystone_admin)]$

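Wait for the lock attempt to finish (roughly 14 minutes in this run, going by the updated_at timestamps). The lock on controller-0 fails: administrative stays unlocked and vim_progress_status reports the timeout.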
system host-show controller-0

+------------------------+-------------------------------------------------------------------------+
| Property | Value |
+------------------------+-------------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| apparmor | disabled |
| availability | available |
| bm_ip | 2620:10a:a001:d48::25 |
| bm_type | dynamic |
| bm_username | sysadmin |
| boot_device | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:3:110:0 |
| capabilities | {'is_max_cpu_configurable': 'configurable', 'stor_function': 'monitor', |
| | 'Personality': 'Controller-Standby'} |
| clock_synchronization | ntp |
| config_applied | 525564f9-ae6d-4078-9d34-cdbf7fb431eb |
| config_status | None |
| config_target | 525564f9-ae6d-4078-9d34-cdbf7fb431eb |
| console | ttyS0,115200n8 |
| created_at | 2023-08-31T05:34:22.131670+00:00 |
| cstates_available | C1,C1E,C6,POLL |
| device_image_update | None |
| hostname | controller-0 |
| hw_settle | 0 |
| id | 1 |
| install_output | text |
| install_state | None |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| max_cpu_mhz_allowed | 3300 |
| max_cpu_mhz_configured | None |
| mgmt_ip | fdff:719a:bf60:2016::3 |
| mgmt_mac | b4:83:51:00:ae:f8 |
| min_cpu_mhz_allowed | 800 |
| operational | enabled |
| personality | controller |
| reboot_needed | False |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:3:110:0 |
| serialid | None |
| software_load | 23.09 |
| task | |
| tboot | |
| ttys_dcd | False |
| updated_at | 2023-08-31T14:27:06.484889+00:00 |
| uptime | 23956 |
| uuid | f1b4280a-c929-44ba-8cfd-a7a00c7dae4a |
| vim_progress_status | Terminating pods on disabled host controller-0 timed out... |
+------------------------+-------------------------------------------------------------------------+
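The manual check above can be scripted. The following is a minimal sketch, not part of the original report: `get_prop` is a hypothetical helper that reads a `system host-show` table on standard input and prints the value of one property, so the `administrative` and `vim_progress_status` fields can be compared before and after the lock attempt.

```shell
#!/bin/sh
# get_prop NAME: read a "system host-show" table on stdin and print
# the value of property NAME. Hypothetical helper, for illustration only.
get_prop() {
    awk -F'|' -v key="$1" '
        NF >= 3 {
            k = $2; v = $3
            gsub(/^ +| +$/, "", k)   # trim padding around the property name
            gsub(/^ +| +$/, "", v)   # trim padding around the value
            if (k == key) print v
        }'
}

# Usage on the active controller (requires the StarlingX "system" CLI):
#   system host-show controller-0 | get_prop administrative
#   system host-show controller-0 | get_prop vim_progress_status
```

With the second table above as input, `get_prop administrative` would print `unlocked`, which is the failure signature described in this report.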

Expected Behavior
-----------------

The controller lock should succeed.

Actual Behavior
-----------------

The controller lock fails and the host reverts to the unlocked state.

Reproducibility
-----------------
Intermittent

System Configuration
-----------------
AIO-PLUS, AIO-DX, STANDARD with Storage.

Last Pass
-----------------

8/31/23 - Bug introduced by https://review.opendev.org/c/starlingx/ansible-playbooks/+/890987

Test Activity
-----------------

Sanity

Workaround
-----------------

1. kubectl edit deployment -n flux-helm helm-controller
2. Edit the following:
  - Add argument: "--graceful-shutdown-timeout=10s" to the command line.
  - Change the "terminationGracePeriodSeconds" option:
    - from: terminationGracePeriodSeconds: 600
    - to: terminationGracePeriodSeconds: 10
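The same two edits can be applied without an interactive editor. This is a sketch under stated assumptions, not the fix that was merged: it assumes helm-controller is the first container in the pod template and that the container already has an `args` list to append to (if the flags live under `command` instead, the second path must be adjusted). `build_patch` and the `APPLY` switch are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Non-interactive sketch of the workaround above, using a JSON patch.
NAMESPACE="flux-helm"
DEPLOYMENT="helm-controller"
GRACE_SECONDS=10

# build_patch SECONDS: emit a JSON patch that lowers the pod grace period
# and appends the matching shutdown-timeout argument.
# Assumption: container index 0 is helm-controller and it has an "args" list.
build_patch() {
    printf '[{"op":"replace","path":"/spec/template/spec/terminationGracePeriodSeconds","value":%s},' "$1"
    printf '{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--graceful-shutdown-timeout=%ss"}]\n' "$1"
}

PATCH=$(build_patch "$GRACE_SECONDS")
echo "$PATCH"

# Touch the cluster only when explicitly requested, so a plain run is a dry run.
if [ "${APPLY:-no}" = "yes" ]; then
    kubectl patch deployment "$DEPLOYMENT" -n "$NAMESPACE" --type=json -p "$PATCH"
fi
```

Running the script with no environment set only prints the patch; running it with `APPLY=yes` sends the patch to the cluster. Inspect the deployment first to confirm the container index and the `args` list match the assumptions.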

Joshua Reed (jreed7)
Changed in starlingx:
assignee: nobody → Joshua Reed (jreed7)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

Setting as high priority as this is causing a yellow sanity for stx: https://lists.starlingx.io/pipermail/starlingx-discuss/2023-September/014493.html

Changed in starlingx:
importance: Undecided → High
tags: added: stx.9.0 stx.apps stx.config
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/893978
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/e52e7e9d3caec8bda48ba043fcba58679ae7cd82
Submitter: "Zuul (22348)"
Branch: master

commit e52e7e9d3caec8bda48ba043fcba58679ae7cd82
Author: Joshua Reed <email address hidden>
Date: Wed Sep 6 12:52:59 2023 -0700

    Adjust FluxCD Helm Controller Pod Termination Timeouts.

    Previously, in v0.28.0 of helm-controller, the pod would
    terminate quickly. After the update to a higher version
    of FluxCD and thus v0.35.0 for helm-controller, that no
    longer happens. Instead the pod terminates rather slowly.
    As a result, during a "system host-lock" command, StarlingX
    times out waiting on pods to be evicted/terminated from
    the node that is being locked. The lock fails. This
    behavior causes sanity testing to fail.

    Corrective action is to provide an argument to the helm
    controller deployment spec and lower its termination
    grace period.

    Test Plan:
    1. Full AIO-SX installation. Verify helm controller installs
       properly.
    2. Full AIO-DX installation. Verify helm controller installs
       properly. Lock the standby controller, and verify that
       it locks in a reasonable amount of time.
    3. On AIO-DX, perform a swact. After swact, lock the
       opposite standby controller. Verify that the host locks
       in a reasonable amount of time.

    References:

    Helm Controller Release notes detailing behavioral changes:
    1. https://github.com/fluxcd/helm-controller/blob/v0.28.0/CHANGELOG.md
    2. https://github.com/fluxcd/helm-controller/blob/v0.28.1/CHANGELOG.md

    Closes-Bug: 2034610
    Change-Id: I03d1085a995155e12aa7312a5886e7f6ec8d7709
    Signed-off-by: Joshua Reed <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released