Periodic short-term subcloud out-of-sync alarm (platform shared resources)

Bug #2002171 reported by Gustavo Herzmann
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Gustavo Herzmann

Bug Description

Brief Description
-----------------
A DC deployment reports periodic subcloud out-of-sync alarms that recur almost exactly every 7 days, at the same time (to within a few seconds), with all subclouds involved.
The out-of-sync status clears in a few seconds.
All subclouds are affected.

It appears this is due to the fernet key rotation sync requests.

Severity
--------
Minor: System/Feature is usable with minor issue

Steps to Reproduce
------------------
No specific steps are needed: let the system run for more than 7 days, then check fm-event.log for 280.002 alarms.

Expected Behavior
-----------------
Since fernet key rotation is expected and requires no corrective action, it should not generate a major alarm.

Actual Behavior
---------------
Every time the fernet keys rotate, a major 280.002 alarm is raised.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Observed with the CentOS build 2021-05; later confirmed to also occur with the latest Debian build (2023-01-04 master).

Last Pass
---------
NA.

Timestamp/Logs
--------------
Example message from fm-event.log:
2023-01-05T12:48:08.865 controller-0 fmManager: info { "event_log_id" : "280.002", "reason_text" : "subcloud1 platform sync_status is out-of-sync", "entity_instance_id" : "region=RegionOne.system=d52b52ef-9c56-4565-9b9e-f5f0959604e6.subcloud=subcloud1.resource=platform", "severity" : "major", "state" : "set", "timestamp" : "2023-01-05 12:48:08.697005" }
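
To check a log for the weekly pattern described above, the JSON payload of each fm-event.log line can be parsed and the 280.002 "set" events grouped by entity. The following is a minimal sketch that assumes every event line has the shape of the example message; the function names are illustrative and not part of any StarlingX tool.

```python
import json
from datetime import datetime

ALARM_ID = "280.002"  # subcloud out-of-sync alarm ID from this report


def parse_fm_event(line):
    """Return the JSON payload of an fm-event.log line, or None if absent."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def out_of_sync_set_times(lines):
    """Collect timestamps of 280.002 'set' events, keyed by entity instance."""
    times = {}
    for line in lines:
        event = parse_fm_event(line)
        if not event or event.get("event_log_id") != ALARM_ID:
            continue
        if event.get("state") != "set":
            continue
        when = datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
        times.setdefault(event["entity_instance_id"], []).append(when)
    return times


def gaps(times):
    """Return the intervals between consecutive alarms for one entity."""
    ordered = sorted(times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

Passing the lines of fm-event.log to out_of_sync_set_times() and then each list to gaps() should show intervals close to 7 days if the alarms follow the rotation schedule.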

Test Activity
-------------
Normal use

Workaround
----------
NA.

Changed in starlingx:
status: New → In Progress
assignee: nobody → Gustavo Herzmann (gherzman)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/869501

Ghada Khalil (gkhalil)
tags: added: stx.distcloud stx.fault
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/869501
Committed: https://opendev.org/starlingx/distcloud/commit/18f54a44dd0af9471cdb719a31332e441fd9b539
Submitter: "Zuul (22348)"
Branch: master

commit 18f54a44dd0af9471cdb719a31332e441fd9b539
Author: Gustavo Herzmann <email address hidden>
Date: Fri Jan 6 15:41:00 2023 -0300

    Stop fernet key rotations from raising out-of-sync alarms

    Fernet key rotation is expected to occur periodically. Currently the
    280.002 out-of-sync alarm is raised every time the sync thread receives
    a fernet key rotation request.

    This commit makes the sync thread check the type of requests, setting
    the alarmable parameter to False if all the requests are due to a
    fernet key rotation.

    It also improves the sync function so it doesn't unnecessarily call
    the is_subcloud_enable() function by providing an early exit when
    there are no pending sync requests.

    Test Plan:
    1. PASS - Verify that the out-of-sync alarm is not raised when the
              fernet keys are rotated;
    2. PASS - Check that the initial sync still works as expected;
    3. PASS - Verify that identity sync due to user triggered identity
              resources change on the central cloud works as expected;
    4. PASS - Check that platform resources sync due to user triggered
              platform resources change on the central cloud works as expected;
    5. PASS - Trigger a fernet key sync and at the same time trigger a
              different sync request and verify that alarm gets raised.

    Closes-Bug: #2002171

    Signed-off-by: Gustavo Herzmann <email address hidden>
    Change-Id: I694d6c3791222739921cd0f5141f54791f847414
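
The logic described in the commit message can be sketched roughly as follows. This is an illustration only, not the distcloud code: SyncRequest, FERNET_KEYS, is_alarmable(), sync() and the callback parameters are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical resource-type label for fernet key rotation requests.
FERNET_KEYS = "fernet_repo"


@dataclass
class SyncRequest:
    resource_type: str


def is_alarmable(requests):
    """Return False only when every pending request is a fernet key rotation."""
    return any(r.resource_type != FERNET_KEYS for r in requests)


def sync(requests, is_subcloud_enabled, set_out_of_sync):
    """Process pending requests; return True if a sync was attempted."""
    if not requests:
        # Early exit: nothing pending, so skip the enabled check entirely.
        return False
    if not is_subcloud_enabled():
        return False
    # Mark the subcloud out-of-sync, raising the alarm only when at least
    # one request is something other than a fernet key rotation.
    set_out_of_sync(alarmable=is_alarmable(requests))
    # The actual sync work would happen here.
    return True
```

With this shape, a batch containing only fernet key rotations marks the subcloud out-of-sync without an alarm, while a mixed batch still raises it, which matches item 5 of the test plan.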

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.8.0