During upgrade activation, system controller swact and activation failed

Bug #1928135 reported by Jessica Castelino
Affects    | Status       | Importance | Assigned to       | Milestone
StarlingX  | Fix Released | Medium     | Jessica Castelino |

Bug Description

Brief Description
-----------------
While upgrading the central cloud of a DC system, activation failed. This was because there was an unexpected SWACT to controller-1.

The cause is the etcd upgrade script. Part of this script runs the etcd manifest, which triggers a reload/restart of the etcd service. Because this is done outside of sm, sm saw the process failure and triggered the SWACT.
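
The failure path described above can be sketched as a toy model. This is a minimal illustration, not sm's actual implementation: the class `MonitoredService`, its states, and the method `process_exited` are hypothetical names. Only the threshold of 2 is taken from the sm log line "has reached max failures (2)" in this report.

```python
from dataclasses import dataclass

# Threshold taken from the sm log in this report: "has reached max failures (2)".
MAX_FAILURES = 2


@dataclass
class MonitoredService:
    """Toy stand-in for a service tracked by a process monitor (hypothetical)."""

    name: str
    state: str = "enabled-active"
    failures: int = 0

    def process_exited(self) -> str:
        """The monitor sees the process exit without having requested it."""
        self.failures += 1
        if self.failures >= MAX_FAILURES:
            # The service is now considered failed; per the report, a SWACT follows.
            self.state = "failed"
        else:
            # The monitor re-enables the service on its own.
            self.state = "enabled-active"
        return self.state


etcd = MonitoredService("etcd")
# 1) puppet runs 'systemctl stop etcd' without involving the monitor
first = etcd.process_exited()
# 2) puppet runs 'systemctl restart etcd.service', ending the re-enabled process
second = etcd.process_exited()
print(first, second, etcd.failures)
```

Under these assumptions, one stop plus one restart issued outside the monitor is enough to exhaust the failure budget.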

Severity
--------
Major

Steps to Reproduce
------------------
1) Follow the upgrade procedure to upgrade the central cloud.
2) After installing and unlocking controller-0, SWACT to controller-1.
3) After the SWACT, issue the "upgrade activate" command.

Expected Behavior
------------------
There should be no SWACT during upgrade activation

Actual Behavior
----------------
System SWACTs during the upgrade activation

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DC

Branch/Pull Time/Commit
-----------------------
2021-05-05 20:02:01 -0400

Last Pass
---------
2021-05-04_15-04-46

Timestamp/Logs
--------------
ansible:
2021-05-07 01:54:05,834 p=662642 u=root | TASK [Applying puppet for enabling etcd security] ******************************

puppet:
2021-05-07T01:54:13.603 Debug: 2021-05-07 01:54:13 +0000 /Stage[main]/Etcd::Config/File[/etc/etcd/etcd.conf]: The container Class[Etcd::Config] will propagate my refresh event
2021-05-07T01:54:13.605 Debug: 2021-05-07 01:54:13 +0000 Class[Etcd::Config]: The container Stage[main] will propagate my refresh event
2021-05-07T01:54:13.607 Debug: 2021-05-07 01:54:13 +0000 Class[Etcd::Config]: The container Class[Etcd] will propagate my refresh event
2021-05-07T01:54:13.610 Info: 2021-05-07 01:54:13 +0000 Class[Etcd::Config]: Scheduling refresh of Class[Etcd::Service]
2021-05-07T01:54:13.612 Info: 2021-05-07 01:54:13 +0000 Class[Etcd::Service]: Scheduling refresh of Service[etcd]
2021-05-07T01:54:13.614 Debug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl is-active etcd'
2021-05-07T01:54:13.616 Debug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl is-enabled etcd'
2021-05-07T01:54:13.618 Debug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl stop etcd'
2021-05-07T01:54:13.654 Debug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl disable etcd'
2021-05-07T01:54:13.752 Notice: 2021-05-07 01:54:13 +0000 /Stage[main]/Etcd::Service/Service[etcd]/ensure: ensure changed 'running' to 'stopped'
2021-05-07T01:54:13.754 Debug: 2021-05-07 01:54:13 +0000 /Stage[main]/Etcd::Service/Service[etcd]: The container Class[Etcd::Service] will propagate my refresh event
2021-05-07T01:54:13.756 Debug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl is-active etcd'
2021-05-07T01:54:16.114 Debug: 2021-05-07 01:54:16 +0000 /Stage[main]/Etcd::Service/Service[etcd]: Skipping restart; service is not running
2021-05-07T01:54:16.117 Notice: 2021-05-07 01:54:16 +0000 /Stage[main]/Etcd::Service/Service[etcd]: Triggered 'refresh' from 1 events
2021-05-07T01:54:16.119 Debug: 2021-05-07 01:54:16 +0000 /Stage[main]/Etcd::Service/Service[etcd]: The container Class[Etcd::Service] will propagate my refresh event
2021-05-07T01:54:16.121 Debug: 2021-05-07 01:54:16 +0000 Class[Etcd::Service]: The container Stage[main] will propagate my refresh event
2021-05-07T01:54:16.123 Debug: 2021-05-07 01:54:16 +0000 Class[Etcd::Service]: The container Class[Etcd] will propagate my refresh event
2021-05-07T01:54:16.125 Debug: 2021-05-07 01:54:16 +0000 Class[Etcd]: The container Stage[main] will propagate my refresh event
2021-05-07T01:54:16.128 Debug: 2021-05-07 01:54:16 +0000 Exec[restart-etcd](provider=posix): Executing '/usr/bin/systemctl restart etcd.service'
2021-05-07T01:54:16.130 Debug: 2021-05-07 01:54:16 +0000 Executing: '/usr/bin/systemctl restart etcd.service'
2021-05-07T01:54:16.405 Notice: 2021-05-07 01:54:16 +0000 /Stage[main]/Platform::Etcd::Upgrade::Runtime/Exec[restart-etcd]/returns: executed successfully

sm:
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.178] log<912> INFO: sm[121740]: sm_service_fsm.c(1451): Service (etcd) process failure, pid=496018, exit_code=-65533.
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.178] log<913> INFO: sm[121740]: sm_service_fsm.c(1032): Service (etcd) received event (process-failure) was in the enabled-active state and is now in the disabled state.
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.276] log<914> INFO: sm[121740]: sm_service_enable.c(461): Started enable action (674049) for service (etcd).
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.276] log<915> INFO: sm[121740]: sm_service_fsm.c(1032): Service (etcd) received event (enable-throttle) was in the disabled state and is now in the enabling state.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.676] log<916> INFO: sm[121740]: sm_service_enable.c(363): Action (enable) completed with result (success), state (unknown), status (unknown), and condition (unknown) for service (etcd), reason_text=, exit_code=0.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.676] log<917> INFO: sm[121740]: sm_service_fsm.c(1032): Service (etcd) received event (enable-success) was in the enabling state and is now in the enabled-active state.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.687] log<918> INFO: sm[121740]: sm_service_fsm.c(1451): Service (etcd) process failure, pid=674070, exit_code=-65533.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.687] log<919> INFO: sm[121740]: sm_service_fsm.c(1032): Service (etcd) received event (process-failure) was in the enabled-active state and is now in the disabled state.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.849] log<920> INFO: sm[121740]: sm_service_disabled_state.c(223): Service (etcd) is failed and has reached max failures (2).

21-05-07T01:54:13.615 | 376 | service-scn | etcd | enabled-active | disabled | process (pid=496018) failed
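
When triaging a similar occurrence, the process-failure events can be pulled out of the sm log with a short script. The following is a sketch only: the regular expression is written against the four sm lines copied from this report, and the function name `process_failures` is an assumption. Other sm log formats may need a different pattern.

```python
import re

# Four sm log lines copied verbatim from this report.
SM_LOG = """\
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.178] log<912> INFO: sm[121740]: sm_service_fsm.c(1451): Service (etcd) process failure, pid=496018, exit_code=-65533.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.676] log<916> INFO: sm[121740]: sm_service_enable.c(363): Action (enable) completed with result (success), state (unknown), status (unknown), and condition (unknown) for service (etcd), reason_text=, exit_code=0.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.687] log<918> INFO: sm[121740]: sm_service_fsm.c(1451): Service (etcd) process failure, pid=674070, exit_code=-65533.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.849] log<920> INFO: sm[121740]: sm_service_disabled_state.c(223): Service (etcd) is failed and has reached max failures (2).
"""

# Matches only the "process failure" lines and captures timestamp, service, pid.
FAILURE_RE = re.compile(
    r"^(?P<ts>\S+) \S+ sm: .*Service \((?P<service>[^)]+)\) process failure, "
    r"pid=(?P<pid>\d+), exit_code=(?P<exit_code>-?\d+)\."
)


def process_failures(log_text):
    """Return (timestamp, service, pid) for each process-failure line."""
    events = []
    for line in log_text.splitlines():
        match = FAILURE_RE.match(line)
        if match:
            events.append((match["ts"], match["service"], int(match["pid"])))
    return events


failures = process_failures(SM_LOG)
for ts, service, pid in failures:
    print(ts, service, pid)
```

The two extracted timestamps (01:54:13 and 01:54:16) line up with the `systemctl stop etcd` and `systemctl restart etcd.service` entries in the puppet log above.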

Test Activity
-------------
Regression Testing

Workaround
----------
SWACT back to controller-0 and issue the "upgrade activate" command again.

Changed in starlingx:
assignee: nobody → Jessica Castelino (jcasteli)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/790815

Changed in starlingx:
status: New → In Progress

Ghada Khalil (gkhalil) wrote :

screening: stx.6.0 / medium - upgrade robustness fix; workaround exists

tags: added: stx.6.0 stx.update
Changed in starlingx:
importance: Undecided → Medium

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/790815
Committed: https://opendev.org/starlingx/stx-puppet/commit/0c16d288fbc483103b7ba5dad7782e97f59f4e17
Submitter: "Zuul (22348)"
Branch: master

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135
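
The idea behind a "safe restart" can be illustrated with a toy model. This is a conceptual sketch only and does not reproduce the change in this commit or sm's real interface: the class `Monitor` and the methods `on_process_exit` and `safe_restart` are hypothetical. It shows why a restart routed through the monitor is not counted as a process failure.

```python
class Monitor:
    """Toy process monitor; hypothetical, not sm's actual interface."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures = 0
        self.restarting = False

    def on_process_exit(self):
        # An exit during a restart the monitor itself started is expected.
        if self.restarting:
            return "expected"
        self.failures += 1
        return "failed" if self.failures >= self.max_failures else "recovered"

    def safe_restart(self, stop, start):
        """Restart through the monitor so the exit is not treated as a failure."""
        self.restarting = True
        try:
            stop()
            outcome = self.on_process_exit()
            start()
        finally:
            self.restarting = False
        return outcome


monitor = Monitor()
# stop/start are placeholders for whatever actually stops and starts the service.
outcome = monitor.safe_restart(stop=lambda: None, start=lambda: None)
print(outcome, monitor.failures)
```

In this model the failure counter stays at zero after the restart, so the max-failures threshold is never approached.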

Changed in starlingx:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300

    Enable kubelet support for pod pid limit

    Enable limiting the number of pids inside of pods.

    Add a default value to protect against a missing value.
    Default to 750 pids limit to align with service parameter default
    value for most resource consuming StarlingX optional app (openstack).
    In fact any value above service parameter minimum value is good for the
    default.

    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300

    Serialize updates to global_filter in the AIO manifest

    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update
    in parallel the lvm global_filter:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    And this generates some errors.

    We fix this by adding dependencies between the above classes
    in order to update the global_filter in a serial mode.

    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <email address hidden>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400

    Add SR-IOV rate-limit dependency

    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking. This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.

    This dependency does not exist for configuring the VF rate-limits
    however. There is a cha...

tags: added: in-f-centos8