Invalid Patching Alarm on Simplex after unlocking

Bug #1847872 reported by Brent Rowsell
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
StarlingX | Fix Released | Medium | Don Penney |

Bug Description

Brief Description
-----------------
After deploying an AIO simplex, the following persistent alarm was present:

| 900.002 | Patch installation failed on the following hosts: controller-0 | host=controller | major | 2019-10-12T01:56:40.912593 |

I did not apply any patches.

Severity
--------
Major

Steps to Reproduce
------------------
Bootstrap and deploy system

Expected Behavior
------------------
No alarm

Actual Behavior
----------------
Alarm

Reproducibility
---------------
Not sure

System Configuration
--------------------
One node

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2019-10-11 17:33:33 -0400"

Last Pass
---------
Not sure

Timestamp/Logs
--------------
2019-10-12T01:56:41.000 controller-0 fmManager: info { "event_log_id" : "900.002", "reason_text" : "Patch installation failed on the following hosts: controller-0", "entity_instance_id" : "region=RegionOne.system=43cab8fe-8a08-4482-b921-b1eff2306457.host=controller", "severity" : "major", "state" : "set", "timestamp" : "2019-10-12 01:56:40.912593" }
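For triage, the JSON payload in an fmManager line like the one above can be pulled out programmatically. This is a minimal Python sketch, assuming the payload is the only `{ ... }` object on the line; the helper name `parse_fm_event` is illustrative and not part of StarlingX.

```python
import json


def parse_fm_event(line):
    """Return the JSON payload of an fmManager log line as a dict.

    Assumption: the payload runs from the first '{' to the last '}'.
    """
    start = line.index("{")
    end = line.rindex("}") + 1
    return json.loads(line[start:end])


# The log line from this report, split across string literals for readability.
sample = (
    '2019-10-12T01:56:41.000 controller-0 fmManager: info '
    '{ "event_log_id" : "900.002", "reason_text" : "Patch installation failed '
    'on the following hosts: controller-0", "entity_instance_id" : '
    '"region=RegionOne.system=43cab8fe-8a08-4482-b921-b1eff2306457.host=controller", '
    '"severity" : "major", "state" : "set", '
    '"timestamp" : "2019-10-12 01:56:40.912593" }'
)

event = parse_fm_event(sample)
print(event["event_log_id"], event["severity"], event["state"])
```

The `state` field distinguishes an alarm being set from one being cleared, which helps when checking whether the alarm is persistent.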

Test Activity
-------------
Developer testing

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Don to triage. Please contact Brent for the collect.

tags: added: stx.update
Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
Don Penney (dpenney) wrote :

The lighttpd server is disabled from systemd startup by the initial puppet apply on the controller. Previously, this occurred during config_controller, before the first reboot, but I believe it now happens only after the first reboot, when the controller is unlocked.

So now on the first reboot, we see systemd try to bring up lighttpd.

The sw-patch init script, however, on an AIO Simplex, has code to temporarily enable lighttpd so that it is able to check the software repository to look for patches to install (duplex does not do this, as it will attempt to talk to the active controller).

In Brent's lab, we see systemd starting lighttpd at the same time as sw-patch. The collision of sw-patch trying to temporarily start lighttpd alongside the systemd-launched lighttpd results in the server being killed:
https://opendev.org/starlingx/config-files/src/branch/master/lighttpd-config/files/lighttpd.init#L28

So when the patch-agent attempts to retrieve the install_uuid file from the lighttpd server for verification, ahead of checking for patches, it fails because the service sw-patch attempted to launch was actually killed. This failure gets flagged, resulting in the alarm seen.
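The verification step described above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the patch-agent's actual code: the function names, the URL, and the idea of passing the local UUID in as a string are all hypothetical. The point is only that an unreachable server is treated as a verification failure.

```python
import urllib.request


def fetch_install_uuid(url, timeout=5):
    """Fetch the install_uuid text over HTTP; return None if unreachable.

    URLError and socket timeouts are OSError subclasses, so a killed or
    missing web server ends up in the except branch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip()
    except OSError:
        return None


def verify_install_uuid(local_uuid, url, fetch=fetch_install_uuid):
    """Return True only if the served install_uuid matches the local one."""
    served = fetch(url)
    if served is None:
        # Server killed or never started: the failure that would be flagged.
        return False
    return served == local_uuid


# Stub fetchers stand in for the web server so the sketch runs offline.
url = "http://controller.invalid/install_uuid"  # hypothetical URL
server_down = verify_install_uuid("abc", url, fetch=lambda u: None)
server_up = verify_install_uuid("abc", url, fetch=lambda u: "abc")
print(server_down, server_up)
```

In the scenario from this report, the first case applies: the lighttpd instance that sw-patch tried to start was killed, so the fetch fails and the check reports a failure even though no patch is pending.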

One option would be to have ansible disable the lighttpd service. This would align with the previous behaviour under config_controller.

Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - would be nice to fix since it's a bogus alarm.

tags: added: stx.3.0 stx.config
removed: stx.update
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: Don Penney (dpenney) → Tee Ngo (teewrs)
Don Penney (dpenney) wrote :

Please note that this is not a "bogus" alarm, as such. This is a real issue that can impact patching and should be addressed. If there is an applied patch in the system that has not been installed, this mechanism in the init stage is intended to install it, and will fail.

Don Penney (dpenney)
Changed in starlingx:
assignee: Tee Ngo (teewrs) → Don Penney (dpenney)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/696410

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (master)

Fix proposed to branch: master
Review: https://review.opendev.org/696411

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/696410
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=4c146818a32da8a3406cb61dbc51093460d33bf7
Submitter: Zuul
Branch: master

commit 4c146818a32da8a3406cb61dbc51093460d33bf7
Author: Don Penney <email address hidden>
Date: Wed Nov 27 16:33:10 2019 -0500

    Add constants for controller config complete to tsconfig

    This update adds INITIAL_CONTROLLER_CONFIG_COMPLETE to the tsconfig
    shell variable subscript.

    Change-Id: I8740c468b002ae2d7da5ac4130d7517d3ea80261
    Partial-Bug: 1847872
    Signed-off-by: Don Penney <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to update (master)

Reviewed: https://review.opendev.org/696411
Committed: https://git.openstack.org/cgit/starlingx/update/commit/?id=bfe2fd5693353b7acb92413b0d19f1f12e8f21b0
Submitter: Zuul
Branch: master

commit bfe2fd5693353b7acb92413b0d19f1f12e8f21b0
Author: Don Penney <email address hidden>
Date: Wed Nov 27 16:35:21 2019 -0500

    Improved patching robustness on AIO-SX

    This update includes the following:
    - Ensure the sw-patch init script does nothing on AIO-SX before
    initial configuration is complete. This protects against misleading
    alarms indicating patch installation failure, stemming from
    pre-configuration networking issues.
    - Enhance install-local support to allow for patch installation after
    ansible bootstrap playbook is applied, but before system configuration
    has been completed.

    Change-Id: Id7a5b4f120449fd3f6e71c83167586dc66e8a91e
    Closes-Bug: 1847872
    Depends-On: https://review.opendev.org/696410
    Signed-off-by: Don Penney <email address hidden>
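
The first item in the commit message above amounts to a guard at the top of the init step. The real sw-patch init script is a shell script; the following is a Python sketch of the same idea, assuming the system mode is available as a string and that initial configuration completion is signalled by a flag file. The function name, the "simplex" value, and the flag file name are illustrative assumptions, not taken from the commit.

```python
import os
import tempfile


def should_run_patch_init(system_mode, config_complete_flag):
    """Decide whether the patch init step should do any work.

    On AIO-SX (assumed mode string "simplex"), skip until the
    initial-config flag file exists. Other modes are unaffected.
    """
    if system_mode == "simplex" and not os.path.exists(config_complete_flag):
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical flag file name, created here only for the demonstration.
    flag = os.path.join(tmp, ".initial_controller_config_complete")
    before = should_run_patch_init("simplex", flag)
    open(flag, "w").close()  # simulate initial configuration completing
    after = should_run_patch_init("simplex", flag)
    duplex = should_run_patch_init("duplex", os.path.join(tmp, "missing"))

print(before, after, duplex)
```

Skipping the step entirely before configuration completes avoids the lighttpd start-up collision described earlier in this report, since sw-patch no longer tries to start the server at that stage.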

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/696901

OpenStack Infra (hudson-openstack) wrote : Fix merged to update (r/stx.3.0)

Reviewed: https://review.opendev.org/696901
Committed: https://git.openstack.org/cgit/starlingx/update/commit/?id=2542c5539bab060830009d02cbb257cc8bf4a376
Submitter: Zuul
Branch: r/stx.3.0

commit 2542c5539bab060830009d02cbb257cc8bf4a376
Author: Don Penney <email address hidden>
Date: Wed Nov 27 16:35:21 2019 -0500

    Improved patching robustness on AIO-SX

    This update includes the following:
    - Ensure the sw-patch init script does nothing on AIO-SX before
    initial configuration is complete. This protects against misleading
    alarms indicating patch installation failure, stemming from
    pre-configuration networking issues.
    - Enhance install-local support to allow for patch installation after
    ansible bootstrap playbook is applied, but before system configuration
    has been completed.

    Change-Id: Id7a5b4f120449fd3f6e71c83167586dc66e8a91e
    Closes-Bug: 1847872
    Depends-On: https://review.opendev.org/696410
    Signed-off-by: Don Penney <email address hidden>
    (cherry picked from commit bfe2fd5693353b7acb92413b0d19f1f12e8f21b0)

Ghada Khalil (gkhalil)
tags: added: in-r-stx30