Sysinv uses stale data when checking configuration state

Bug #2007455 reported by Joshua Kraitberg
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Low
Joshua Kraitberg

Bug Description

Brief Description
-----------------
When checking if the current configuration matches
the latest configuration, the values used are passed
in. Because the are passed in, they can be stale.

"fm alarm-list" and "system health-query" produce conflicting outputs when data was stale.
No alarms present but health-query says out-of-date.

[sysadmin@controller-0 ~(keystone_admin)]$ system health-query
System Health:
All hosts are provisioned: [OK]
All hosts are unlocked/enabled: [OK]
All hosts have current configurations: [Fail]
Hosts with out of date configurations: controller-0
All hosts are patch current: [OK]
No alarms: [OK]
All kubernetes nodes are ready: [OK]
All kubernetes control plane pods are ready: [OK]

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
# No output

Severity
--------
Minor

Steps to Reproduce
------------------
Run bootstrap or restore

Expected Behavior
------------------
States match

Actual Behavior
----------------
States don't match

Reproducibility
---------------
100% on slower systems

System Configuration
--------------------
Any

Branch/Pull Time/Commit
-----------------------
Master

Last Pass
---------
N/A

Timestamp/Logs
--------------
fm-event.log:
# First raise
2023-02-13T16:19:34.066 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-0 Configuration is out-of-date. (applied: aa8fdf08-e266-4561-aa67-1f04cda46801 target: 7a4cf53a-d8df-4ffe-b783-59572272bd25)", "entity_instance_id" : "region=RegionOne.system=vbox.host=controller-0", "severity" : "major", "state" : "set", "timestamp" : "2023-02-13 16:19:33.946143" }
...
# First clear
2023-02-13T16:22:05.572 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-0 Configuration is out-of-date. (applied: aa8fdf08-e266-4561-aa67-1f04cda46801 target: 7a4cf53a-d8df-4ffe-b783-59572272bd25)", "entity_instance_id" : "region=RegionOne.system=vbox.host=controller-0", "severity" : "major", "state" : "clear", "timestamp" : "2023-02-13 16:22:05.571324" }
...
# Second raise
2023-02-13T16:22:28.664 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-0 Configuration is out-of-date. (applied: 7a4cf53a-d8df-4ffe-b783-59572272bd25 target: 645903ac-2a08-41b6-a0cd-5edaaa7860f0)", "entity_instance_id" : "region=RegionOne.system=vbox.host=controller-0", "severity" : "major", "state" : "set", "timestamp" : "2023-02-13 16:22:28.458440" }
...
# Second clear
2023-02-13T16:22:32.079 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-0 Configuration is out-of-date. (applied: 7a4cf53a-d8df-4ffe-b783-59572272bd25 target: 645903ac-2a08-41b6-a0cd-5edaaa7860f0)", "entity_instance_id" : "region=RegionOne.system=vbox.host=controller-0", "severity" : "major", "state" : "clear", "timestamp" : "2023-02-13 16:22:32.078597" }

sysinv.log:
# First raise
sysinv 2023-02-13 16:19:33.925 75883 INFO sysinv.conductor.manager [-] Setting config target of host 'controller-0' to '7a4cf53a-d8df-4ffe-b783-59572272bd25'.
sysinv 2023-02-13 16:19:33.945 75883 WARNING sysinv.conductor.manager [-] controller-0: iconfig out of date: target 7a4cf53a-d8df-4ffe-b783-59572272bd25, applied aa8fdf08-e266-4561-aa67-1f04cda46801
sysinv 2023-02-13 16:19:33.945 75883 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-0 config applied: aa8fdf08-e266-4561-aa67-1f04cda46801 vs. target: 7a4cf53a-d8df-4ffe-b783-59572272bd25.
...
# First clear
sysinv 2023-02-13 16:22:05.399 70253 INFO sysinv.agent.manager [-] Agent config applied 7a4cf53a-d8df-4ffe-b783-59572272bd25
sysinv 2023-02-13 16:22:05.433 75883 INFO sysinv.conductor.manager [-] _remove_config_from_reboot_config_list host: aa9c3c14-784e-4c3c-923a-583f82b888f2,config_uuid: 7a4cf53a-d8df-4ffe-b783-59572272bd25
sysinv 2023-02-13 16:22:05.461 75883 INFO sysinv.conductor.manager [-] _clear_config_from_reboot_config_list host:aa9c3c14-784e-4c3c-923a-583f82b888f2
sysinv 2023-02-13 16:22:05.470 75883 INFO sysinv.conductor.manager [-] SYS_I Clear system config alarm: controller-0 target config 7a4cf53a-d8df-4ffe-b783-59572272bd25
...
# Second raise
sysinv 2023-02-13 16:22:28.444 75883 INFO sysinv.conductor.manager [-] Setting config target of host 'controller-0' to '645903ac-2a08-41b6-a0cd-5edaaa7860f0'.
sysinv 2023-02-13 16:22:28.458 75883 WARNING sysinv.conductor.manager [-] controller-0: iconfig out of date: target 645903ac-2a08-41b6-a0cd-5edaaa7860f0, applied 7a4cf53a-d8df-4ffe-b783-59572272bd25
sysinv 2023-02-13 16:22:28.458 75883 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-0 config applied: 7a4cf53a-d8df-4ffe-b783-59572272bd25 vs. target: 645903ac-2a08-41b6-a0cd-5edaaa7860f0.
sysinv 2023-02-13 16:22:28.478 75883 INFO sysinv.conductor.manager [-] _config_update_hosts config_uuid=645903ac-2a08-41b6-a0cd-5edaaa7860f0
sysinv 2023-02-13 16:22:28.482 75883 INFO sysinv.conductor.manager [-] applying runtime manifest config_uuid=645903ac-2a08-41b6-a0cd-5edaaa7860f0, classes: ['openstack::keystone::endpoint::runtime', 'platform::firewall::runtime']
...
# Second clear of original target, erases second raise early ????
sysinv 2023-02-13 16:22:32.008 75883 INFO sysinv.conductor.manager [-] _clear_config_from_reboot_config_list host:aa9c3c14-784e-4c3c-923a-583f82b888f2
sysinv 2023-02-13 16:22:32.011 75883 INFO sysinv.conductor.manager [-] SYS_I Clear system config alarm: controller-0 target config 7a4cf53a-d8df-4ffe-b783-59572272bd25
...
# Correct second clear
sysinv 2023-02-13 16:27:07.422 70253 INFO sysinv.agent.manager [-] Runtime manifest apply completed for classes ['openstack::keystone::endpoint::runtime', 'platform::firewall::runtime'].
sysinv 2023-02-13 16:27:07.451 70253 INFO sysinv.agent.manager [-] Agent config applied 645903ac-2a08-41b6-a0cd-5edaaa7860f0
sysinv 2023-02-13 16:27:07.470 75883 INFO sysinv.conductor.manager [-] _remove_config_from_reboot_config_list host: aa9c3c14-784e-4c3c-923a-583f82b888f2,config_uuid: 645903ac-2a08-41b6-a0cd-5edaaa7860f0
sysinv 2023-02-13 16:27:07.491 75883 INFO sysinv.conductor.manager [-] _clear_config_from_reboot_config_list host:aa9c3c14-784e-4c3c-923a-583f82b888f2
sysinv 2023-02-13 16:27:07.495 75883 INFO sysinv.conductor.manager [-] SYS_I Clear system config alarm: controller-0 target config 645903ac-2a08-41b6-a0cd-5edaaa7860f0

Test Activity
-------------
Developer Testing

Workaround
----------
Wait and it'll fix itself eventually.

Changed in starlingx:
assignee: nobody → Joshua Kraitberg (jkraitbe-wr)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/874115

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/874115
Committed: https://opendev.org/starlingx/config/commit/378c68fdb39c50f6d00d9c0cd4626280046af2c8
Submitter: "Zuul (22348)"
Branch: master

commit 378c68fdb39c50f6d00d9c0cd4626280046af2c8
Author: Joshua Kraitberg <email address hidden>
Date: Wed Feb 15 18:51:30 2023 -0500

    Fix stale data when checking if configuration out-of-date

    When checking if the current configuration matches
    the latest configuration, the values used are passed
    in. Because the are passed in, they can be stale.

    An example of where this is also guaranteed to occur is:
    https://github.com/starlingx/config/blob/2400ae204e36a68c9627e3446ee7f9e4a194dd54/sysinv/sysinv/sysinv/sysinv/conductor/manager.py#L5931-L5935
    This is a long lived function that has a long interval between
    when it queries it's ihost reference and when it uses it
    to update the alarms.

    To mitigate this, the latest ihost values are now
    always fetched before doing any configuration checking.

    However, the underlying race condition still exists.
    To fix the race condition a much larger change is required.

    To verify is this defect has happened:
    sysadmin@controller-0:~$ grep 'system config alarm' /var/log/sysinv.log
    sysinv 2023-02-15 14:35:37.982 41066 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-0 config applied: 645903ac-2a08-41b6-a0cd-5edaaa7860f0 vs. target: 38f223d6-49eb-45d6-9fc5-b3ae40a8cea0.
    sysinv 2023-02-15 14:35:40.880 41066 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-0 config applied: 645903ac-2a08-41b6-a0cd-5edaaa7860f0 vs. target: 2ad1c950-7cd6-4d4d-a214-cafdd30456b7.
    sysinv 2023-02-15 14:42:17.444 70736 INFO sysinv.conductor.manager [-] SYS_I Clear system config alarm: controller-0 target config 2ad1c950-7cd6-4d4d-a214-cafdd30456b7
    sysinv 2023-02-15 14:44:17.989 87173 INFO sysinv.conductor.manager [-] SYS_I Clear system config alarm: controller-0 target config 2ad1c950-7cd6-4d4d-a214-cafdd30456b7 <--- This shouldn't be here

    If there are any consecutive clears then this defect has occurred.

    TEST PLAN
    PASS: Verify fixed after bootstrap AIO-SX
      * Apply fix before bootstrap, verify after unlock
    PASS: Verify fixed after restore AIO-SX
      * Apply fix before restore, verify after unlock
    PASS: Verify fixed after optimized restore AIO-SX
      * Apply fix before restore, verify after unlock

    Closes-Bug: 2007455
    Signed-off-by: Joshua Kraitberg <email address hidden>
    Change-Id: I8910f4dd4aaf88e904550207c13107ec72a0c09a

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.config
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.