stx-application re-apply strategy requires some changes

Bug #1837750 reported by Frank Miller
This bug affects 5 people
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Tyler Smith

Bug Description

Brief Description
-----------------
This LP is tracking a few re-apply considerations:
1) When a host is unlocked, applications are re-applied regardless of whether any overrides have changed. The re-apply should occur only if an override has changed since the last apply.
2) If a re-apply is in progress and a user updates the config and triggers another re-apply (e.g., via host unlock), the re-apply in progress does not pick up the new change.
3) The re-apply triggered by a host unlock starts immediately after the unlock is issued and has to wait for the host to recover before it can complete. This results in a long re-apply time and can cause the re-apply to fail if the unlock does not complete.
4) No alarm is raised when a config change is made that requires a re-apply.

Severity
--------
Major

Steps to Reproduce (for #1):
------------------
Start with a system where the stx-openstack application is in applied state.
Lock/unlock the standby controller

Expected Behavior
------------------
The stx-openstack application should not be re-applied

Actual Behavior
----------------
The stx-openstack application is re-applied, and the re-apply takes 10+ minutes to complete because it has to wait for the host to finish unlocking and for its pods to recover.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
all configs

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20190720T013000Z"

Last Pass
---------
Did this test scenario pass previously? No

Timestamp/Logs
--------------
n/a

Test Activity
-------------

Ghada Khalil (gkhalil) wrote :

stx.2.0 gating / Medium priority as this is more of an optimization for the re-apply.

tags: added: stx.containers
tags: added: stx.2.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Tee Ngo (teewrs)
Frank Miller (sensfan22) wrote :

Attaching controller-0 logs

Frank Miller (sensfan22) wrote :

Adding controller-1 logs

Frank Miller (sensfan22) wrote :

Here is a timeline of the test done in the WCP113-121 lab.
Controller-1 was active and controller-0 was standby:

2019-07-23T19:52:37.061 [2780154.00918] controller-1 mtcAgent --- nodeBase.cpp ( 665) log_adminAction : Info : controller-0 Lock Action
2019-07-23T19:57:54.660 [2780154.00942] controller-1 mtcAgent --- nodeBase.cpp ( 675) log_adminAction : Info : controller-0 Unlock Action
2019-07-23T20:03:31.369 [2780154.00979] controller-1 mtcAgent |-| mtcNodeHdlrs.cpp (1231) enable_handler : Info : controller-0 got GOENABLED

2019-07-23 19:57:54.674 2780688 INFO sysinv.api.controllers.v1.host [-] Reapplying the stx-openstack app
2019-07-23 20:08:37.500 2779525 INFO sysinv.conductor.kube_app [-] Application stx-openstack (1.0-17-centos-stable-versioned) apply completed.

Based on the above logs, the re-apply started right after mtc began the unlock of the standby controller (19:57:54) and took just over 10 minutes to complete (20:08:37).
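The intervals in this timeline can be checked with a few lines of Python. This is only a sketch for reading the logs: the timestamps are copied from the log lines above, while the `parse` helper and the variable names are made up for illustration. It shows that the re-apply ran for roughly 643 seconds, and that more than half of that time passed before controller-0 reported GOENABLED.

```python
from datetime import datetime


def parse(ts):
    """Parse a log timestamp; mtcAgent uses a 'T' separator, sysinv a space."""
    return datetime.strptime(ts.replace("T", " "), "%Y-%m-%d %H:%M:%S.%f")


# Timestamps copied from the mtcAgent and sysinv log lines in this comment.
unlock = parse("2019-07-23T19:57:54.660")         # controller-0 Unlock Action
goenabled = parse("2019-07-23T20:03:31.369")      # controller-0 got GOENABLED
reapply_start = parse("2019-07-23 19:57:54.674")  # "Reapplying the stx-openstack app"
reapply_done = parse("2019-07-23 20:08:37.500")   # "apply completed"

# Delay between the unlock request and the start of the re-apply.
trigger_delay = (reapply_start - unlock).total_seconds()
# Time the host needed to come back after the unlock.
host_recovery = (goenabled - unlock).total_seconds()
# Total duration of the re-apply.
reapply_total = (reapply_done - reapply_start).total_seconds()
# Portion of the re-apply that ran after the host was enabled again.
after_enable = (reapply_done - goenabled).total_seconds()

print(f"re-apply started {trigger_delay:.3f} s after the unlock")
print(f"host recovery took {host_recovery:.3f} s")
print(f"re-apply took {reapply_total:.3f} s ({after_enable:.3f} s after GOENABLED)")
```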

Frank Miller (sensfan22)
summary: - stx-application is re-applied when host is unlocked when no override
- changes have been made since last apply
+ stx-application re-apply strategy requires some changes
Frank Miller (sensfan22)
description: updated
Changed in starlingx:
assignee: Tee Ngo (teewrs) → Tyler Smith (tyler.smith)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/677847

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fault (master)

Reviewed: https://review.opendev.org/677845
Committed: https://git.openstack.org/cgit/starlingx/fault/commit/?id=15463f2baa5d1c39cddde1466c98ced797409cad
Submitter: Zuul
Branch: master

commit 15463f2baa5d1c39cddde1466c98ced797409cad
Author: Tyler Smith <email address hidden>
Date: Wed Aug 21 17:15:34 2019 -0400

    Changes to stx-openstack application automatic re-apply behaviour

    This commit adds a "pending application reapply" alarm to fm,
    which will be raised when there has been a configuration change
    to nodes that affects the helm overrides.

    Partial-Bug: 1837750
    Change-Id: Iec5852a798eee51dacbc5ea5016e4c20d85b668c
    Signed-off-by: Tyler Smith <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/677847
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=b1895200a44986b1a447c28262bf22edeba5f652
Submitter: Zuul
Branch: master

commit b1895200a44986b1a447c28262bf22edeba5f652
Author: Tyler Smith <email address hidden>
Date: Wed Aug 21 18:07:34 2019 -0400

    Changes to stx-openstack application automatic re-apply behaviour

    The stx-openstack application is no longer automatically reapplied
    on node unlock. The new behaviour is handled with a reapply flag:

     - When a node is unlocked, or a runtime manifest is applied,
       overrides are regenerated and compared to
       their old values. If there is a difference a reapply flag is raised
       along with a warning alarm
     - A check was added to the kubernetes audit in the sysinv conductor
       to check if the reapply flag has been raised and to trigger a reapply
       if the system is in a stable state (no hosts currently
       locking/unlocking/booting)
     - This check is also done when a runtime manifest reports success

    Test cases:
    AIO-SX, AIO-DX, and Standard:
     - When a lock/unlock is done with no changes the application is
       not reapplied
     - When a lock/unlock is done after a config change is made the
       application waits until after the unlock and then triggers a reapply
    STANDARD
     - Enabled ceph-rgw chart and ensured that the application was reapplied upon
       config success (likewise for chart disable)
     - If there is a pending reapply, and the user triggers it before the
       system is stable the reapply flag and alarm are removed
     - Provisioning a new compute node and unlocking it for the
       first time triggers an application reapply after it comes online
     - App is reapplied when a node is deleted
     - Compute added without node labels and unlocked results in no reapply
     - Compute locked, labels applied, then unlocked results in a reapply
       pods launch on compute only when labels present (likewise for label removal)
     - Pending reapply flag and alarm persist over a controller swact

    Change-Id: I1ae9fdc2afcdf831cf0e7d96f8af14fcb5f6b579
    Closes-Bug: 1837750
    Depends-On: https://review.opendev.org/677845
    Signed-off-by: Tyler Smith <email address hidden>
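The reapply-flag behaviour described in the commit message above can be sketched roughly as follows. This is not the sysinv code: the `ReapplyTracker` class, its method names, and the host state strings are all hypothetical, and the alarm is modelled as a plain boolean rather than a real fm call. The sketch only illustrates the three ideas from the commit message: compare regenerated overrides to the old values, raise a flag and an alarm on a difference, and let a periodic audit trigger the reapply once no host is locking, unlocking, or booting.

```python
import copy
from dataclasses import dataclass, field

# Hypothetical state names for hosts that make the system "not stable".
TRANSITIONAL_STATES = {"locking", "unlocking", "booting"}


@dataclass
class ReapplyTracker:
    """Illustrative model of the reapply flag; not the sysinv implementation."""
    applied_overrides: dict = field(default_factory=dict)
    reapply_pending: bool = False
    alarm_raised: bool = False  # stands in for the "pending application reapply" alarm
    reapply_count: int = 0

    def on_config_event(self, regenerated):
        """Called on node unlock or when a runtime manifest is applied."""
        if regenerated != self.applied_overrides:
            # Overrides differ from the old values: flag it, do not reapply yet.
            self.reapply_pending = True
            self.alarm_raised = True
        return self.reapply_pending

    def audit(self, host_states, regenerated):
        """Periodic check: reapply only if flagged and the system is stable."""
        if not self.reapply_pending:
            return False
        if any(state in TRANSITIONAL_STATES for state in host_states.values()):
            return False  # a host is still locking/unlocking/booting
        self.apply(regenerated)
        return True

    def apply(self, overrides):
        """Apply (audit-triggered or user-triggered); clears flag and alarm."""
        self.applied_overrides = copy.deepcopy(overrides)
        self.reapply_pending = False
        self.alarm_raised = False
        self.reapply_count += 1


# Example: an unlock with a changed override defers the reapply until stable.
tracker = ReapplyTracker(applied_overrides={"nova": {"replicas": 2}})
changed = {"nova": {"replicas": 3}}
tracker.on_config_event(changed)
tracker.audit({"controller-0": "unlocking", "controller-1": "available"}, changed)
tracker.audit({"controller-0": "available", "controller-1": "available"}, changed)
print(tracker.reapply_count, tracker.reapply_pending)
```

Keeping the comparison separate from the reapply itself is what lets a lock/unlock with no override change pass through without touching the application, which is the first test case listed in the commit message.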

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fault (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/678232

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/678233

OpenStack Infra (hudson-openstack) wrote : Fix merged to fault (r/stx.2.0)

Reviewed: https://review.opendev.org/678232
Committed: https://git.openstack.org/cgit/starlingx/fault/commit/?id=203c223c435816e9a46a77d481e4965838572a18
Submitter: Zuul
Branch: r/stx.2.0

commit 203c223c435816e9a46a77d481e4965838572a18
Author: Tyler Smith <email address hidden>
Date: Wed Aug 21 17:15:34 2019 -0400

    Changes to stx-openstack application automatic re-apply behaviour

    This commit adds a "pending application reapply" alarm to fm,
    which will be raised when there has been a configuration change
    to nodes that affects the helm overrides.

    Partial-Bug: 1837750
    Change-Id: Iec5852a798eee51dacbc5ea5016e4c20d85b668c
    Signed-off-by: Tyler Smith <email address hidden>
    (cherry picked from commit 15463f2baa5d1c39cddde1466c98ced797409cad)

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.2.0)

Reviewed: https://review.opendev.org/678233
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=f69b1312949828f5852b06c42c2331e561df5c4b
Submitter: Zuul
Branch: r/stx.2.0

commit f69b1312949828f5852b06c42c2331e561df5c4b
Author: Tyler Smith <email address hidden>
Date: Wed Aug 21 18:07:34 2019 -0400

    Changes to stx-openstack application automatic re-apply behaviour

    The stx-openstack application is no longer automatically reapplied
    on node unlock. The new behaviour is handled with a reapply flag:

     - When a node is unlocked, or a runtime manifest is applied,
       overrides are regenerated and compared to
       their old values. If there is a difference a reapply flag is raised
       along with a warning alarm
     - A check was added to the kubernetes audit in the sysinv conductor
       to check if the reapply flag has been raised and to trigger a reapply
       if the system is in a stable state (no hosts currently
       locking/unlocking/booting)
     - This check is also done when a runtime manifest reports success

    Test cases:
    AIO-SX, AIO-DX, and Standard:
     - When a lock/unlock is done with no changes the application is
       not reapplied
     - When a lock/unlock is done after a config change is made the
       application waits until after the unlock and then triggers a reapply
    STANDARD
     - Enabled ceph-rgw chart and ensured that the application was reapplied upon
       config success (likewise for chart disable)
     - If there is a pending reapply, and the user triggers it before the
       system is stable the reapply flag and alarm are removed
     - Provisioning a new compute node and unlocking it for the
       first time triggers an application reapply after it comes online
     - App is reapplied when a node is deleted
     - Compute added without node labels and unlocked results in no reapply
     - Compute locked, labels applied, then unlocked results in a reapply
       pods launch on compute only when labels present (likewise for label removal)
     - Pending reapply flag and alarm persist over a controller swact

    Change-Id: I1ae9fdc2afcdf831cf0e7d96f8af14fcb5f6b579
    Closes-Bug: 1837750
    Depends-On: https://review.opendev.org/#/c/678232
    Signed-off-by: Tyler Smith <email address hidden>
    (cherry picked from commit b1895200a44986b1a447c28262bf22edeba5f652)

Ghada Khalil (gkhalil)
tags: added: in-r-stx20
Paulina Flores (paulina-flores) wrote :

Change tested today by replicating some of the test cases documented above on a Standard system. No problems were found and the application behaves as expected.

Paulina Flores (paulina-flores) wrote :

Build info:

OS="centos"
SW_VERSION="19.08"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.2.0"

JOB="STX_BUILD_2.0"
<email address hidden>"
BUILD_NUMBER="40"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-26 23:30:00 +0000"
