Brief Description
-----------------
While upgrading the central cloud of a DC system, activation failed. This was because there was an unexpected SWACT to controller-1.
This is due to the etcd upgrade script. Part of this script runs the etcd manifest. This triggers a reload/restart of the etcd service. As this is done outside of the sm, sm saw the process failure and trigged the swact.
Severity
--------
Major
Steps to Reproduce
------------------
1) Follow upgrade procedure to upgrade Central cloud.
2) After installing and unlocking controller-0, swact to controller-1
3) After swact, issue the "upgrade activate" command
Expected Behavior
------------------
There should be no SWACT during upgrade activation
Actual Behavior
----------------
System SWACTs during the upgrade activation
Reproducibility
---------------
Intermittent
System Configuration
--------------------
DC
Branch/Pull Time/Commit
-----------------------
2021-05-05 20:02:01 -0400
Last Pass
---------
2021-05-04_15-04-46
Timestamp/Logs
--------------
ansible:
2021-05-07 01:54:05,834 p=662642 u=root | TASK [Applying puppet for enabling etcd security] ******************************
puppet:
2021-05-07T01:54:13.603 [0;36mDebug: 2021-05-07 01:54:13 +0000 /Stage[main]/Etcd::Config/File[/etc/etcd/etcd.conf]: The container Class[Etcd::Config] will propagate my refresh event[0m
2021-05-07T01:54:13.605 [0;36mDebug: 2021-05-07 01:54:13 +0000 Class[Etcd::Config]: The container Stage[main] will propagate my refresh event[0m
2021-05-07T01:54:13.607 [0;36mDebug: 2021-05-07 01:54:13 +0000 Class[Etcd::Config]: The container Class[Etcd] will propagate my refresh event[0m
2021-05-07T01:54:13.610 [0;32mInfo: 2021-05-07 01:54:13 +0000 Class[Etcd::Config]: Scheduling refresh of Class[Etcd::Service][0m
2021-05-07T01:54:13.612 [0;32mInfo: 2021-05-07 01:54:13 +0000 Class[Etcd::Service]: Scheduling refresh of Service[etcd][0m
2021-05-07T01:54:13.614 [0;36mDebug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl is-active etcd'[0m
2021-05-07T01:54:13.616 [0;36mDebug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl is-enabled etcd'[0m
2021-05-07T01:54:13.618 [0;36mDebug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl stop etcd'[0m
2021-05-07T01:54:13.654 [0;36mDebug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl disable etcd'[0m
2021-05-07T01:54:13.752 [mNotice: 2021-05-07 01:54:13 +0000 /Stage[main]/Etcd::Service/Service[etcd]/ensure: ensure changed 'running' to 'stopped'[0m
2021-05-07T01:54:13.754 [0;36mDebug: 2021-05-07 01:54:13 +0000 /Stage[main]/Etcd::Service/Service[etcd]: The container Class[Etcd::Service] will propagate my refresh event[0m
2021-05-07T01:54:13.756 [0;36mDebug: 2021-05-07 01:54:13 +0000 Executing: '/bin/systemctl is-active etcd'[0m
2021-05-07T01:54:16.114 [0;36mDebug: 2021-05-07 01:54:16 +0000 /Stage[main]/Etcd::Service/Service[etcd]: Skipping restart; service is not running[0m
2021-05-07T01:54:16.117 [mNotice: 2021-05-07 01:54:16 +0000 /Stage[main]/Etcd::Service/Service[etcd]: Triggered 'refresh' from 1 events[0m
2021-05-07T01:54:16.119 [0;36mDebug: 2021-05-07 01:54:16 +0000 /Stage[main]/Etcd::Service/Service[etcd]: The container Class[Etcd::Service] will propagate my refresh event[0m
2021-05-07T01:54:16.121 [0;36mDebug: 2021-05-07 01:54:16 +0000 Class[Etcd::Service]: The container Stage[main] will propagate my refresh event[0m
2021-05-07T01:54:16.123 [0;36mDebug: 2021-05-07 01:54:16 +0000 Class[Etcd::Service]: The container Class[Etcd] will propagate my refresh event[0m
2021-05-07T01:54:16.125 [0;36mDebug: 2021-05-07 01:54:16 +0000 Class[Etcd]: The container Stage[main] will propagate my refresh event[0m
2021-05-07T01:54:16.128 [0;36mDebug: 2021-05-07 01:54:16 +0000 Exec[restart-etcd](provider=posix): Executing '/usr/bin/systemctl restart etcd.service'[0m
2021-05-07T01:54:16.130 [0;36mDebug: 2021-05-07 01:54:16 +0000 Executing: '/usr/bin/systemctl restart etcd.service'[0m
2021-05-07T01:54:16.405 [mNotice: 2021-05-07 01:54:16 +0000 /Stage[main]/Platform::Etcd::Upgrade::Runtime/Exec[restart-etcd]/returns: executed successfully[0m
sm:
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.178] log<912> INFO: sm[121740]: sm_service_fsm.c(1451): Service (etcd) process failure, pid=496018, exit_code=-65533.
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.178] log<913> INFO: sm[121740]: sm_service_fsm.c(1032): Service (etcd) received event (process-failure) was in the enabled-active state and is now in the disabled state.
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.276] log<914> INFO: sm[121740]: sm_service_enable.c(461): Started enable action (674049) for service (etcd).
2021-05-07T01:54:13.000 controller-0 sm: debug time[2748.276] log<915> INFO: sm[121740]: sm_service_fsm.c(1032): Service (etcd) received event (enable-throttle) was in the disabled state and is now in the enabling state.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.676] log<916> INFO: sm[121740]: sm_service_enable.c(363): Action (enable) completed with result (success), state (unknown), status (unknown), and condition (unknown) for service (etcd), reason_text=, exit_code=0.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.676] log<917> INFO: sm[121740]: sm_service_fsm.c(1032): Service (etcd) received event (enable-success) was in the enabling state and is now in the enabled-active state.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.687] log<918> INFO: sm[121740]: sm_service_fsm.c(1451): Service (etcd) process failure, pid=674070, exit_code=-65533.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.687] log<919> INFO: sm[121740]: sm_service_fsm.c(1032): Service (etcd) received event (process-failure) was in the enabled-active state and is now in the disabled state.
2021-05-07T01:54:16.000 controller-0 sm: debug time[2750.849] log<920> INFO: sm[121740]: sm_service_disabled_state.c(223): Service (etcd) is failed and has reached max failures (2).
21-05-07T01:54:13.615 | 376 | service-scn | etcd | enabled-active | disabled | process (pid=496018) failed
Test Activity
-------------
Regression Testing
Workaround
----------
SWACT back to controller-0 and issue upgrade activation command again
Fix proposed to branch: master /review. opendev. org/c/starlingx /stx-puppet/ +/790815
Review: https:/