250.001 config out-of-date alarm not cleared after system app applied

Bug #1864874 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Kristine Bujold

Bug Description

Brief Description
-----------------
On a regular system, after the stx-monitor app was applied, alarm "250.001 | compute-1 Configuration is out-of-date." was raised and never cleared.

Severity
--------
Major

Steps to Reproduce
------------------
apply stx-monitor app on a regular system
fm alarm-list

TC-name: stx_monitor/test_stx_monitor.py::test_stx_monitor

Expected Behavior
------------------
250.001 alarm should be cleared

Actual Behavior
----------------
250.001 alarm was not cleared

Reproducibility
---------------
Unknown - first time this is seen in sanity, will monitor

System Configuration
--------------------
regular node system

Lab-name: WCP_71-75

Branch/Pull Time/Commit
-----------------------
2020-02-25_17-07-51

Last Pass
---------
2020-02-24_20-23-53

Timestamp/Logs
--------------
[2020-02-26 06:12:15,056] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-02-26 06:12:16,050] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------+
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------+

[2020-02-26 06:12:17,800] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-02-26 06:12:18,997] 436 DEBUG MainThread ssh.expect :: Output:
+---------------------+---------+-------------------------------+---------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+---------------+----------+-----------+
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
+---------------------+---------+-------------------------------+---------------+----------+-----------+

[2020-02-26 06:30:41,588] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-02-26 06:30:42,945] 436 DEBUG MainThread ssh.expect :: Output:
+---------------------+---------+-------------------------------+------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+------------------+----------+-----------+
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-monitor | 1.0-1 | monitor-armada-manifest | stx-monitor.yaml | applied | completed |
+---------------------+---------+-------------------------------+------------------+----------+-----------+

[2020-02-26 06:30:43,050] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-02-26 06:30:44,554] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------+
| e4c56f6b-c049-4484-b413-7b08767abb6a | 250.001 | compute-1 Configuration is out-of-date. | host=compute-1 | major | 2020-02-26T06:30:26.259345 |
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------+
controller-1:~$

[2020-02-26 07:00:39,360] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-02-26 07:00:40,403] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+------------------------------------------+----------+----------------------------+
| 2516290a-5362-4738-8b11-ca208f29958e | 100.114 | NTP address 64:ff9b::9538:2f3c is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::9538:2f3c | minor | 2020-02-26T06:57:53.391971 |
| e5c39db4-7601-408a-82c4-25546e08badb | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2020-02-26T06:52:53.478787 |
| 35317497-516d-4033-b882-31a7623c4e13 | 100.114 | NTP address 64:ff9b::2d0f:a8c6 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::2d0f:a8c6 | minor | 2020-02-26T06:52:53.397687 |
| 9659760e-dfca-477a-87e9-0e90c95c7314 | 100.114 | NTP address 64:ff9b::2d4c:f4c1 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::2d4c:f4c1 | minor | 2020-02-26T06:52:53.395585 |
| 54475559-b086-4d85-83d0-6491cb7d3a99 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2020-02-26T06:52:52.861037 |
| 6b383510-4ce4-4ffa-95ec-012b144f291c | 100.114 | NTP address 64:ff9b::c06f:9072 is not a valid or a reachable NTP server. | host=controller-0.ntp=64:ff9b::c06f:9072 | minor | 2020-02-26T06:52:52.819062 |
| 2302f478-31d9-49fd-a2a1-e7343f8d123e | 100.114 | NTP address 64:ff9b::d18d:2816 is not a valid or a reachable NTP server. | host=controller-0.ntp=64:ff9b::d18d:2816 | minor | 2020-02-26T06:52:52.816217 |
| a58b525e-582c-42df-bf0c-9f201c9e8ee4 | 250.001 | compute-1 Configuration is out-of-date. | host=compute-1 | major | 2020-02-26T06:32:38.765580 |
+--------------------------------------+----------+----------------------------------------------------------------------------------------------------+------------------------------------------+----------+----------------------------+

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote : Re: 250.001 alarm not cleared after stx-monitor app applied on regular system
summary: - 250.001 alarm raised after stx-monitor app applied on regular system
+ 250.001 alarm not cleared after stx-monitor app applied on regular
+ system
tags: added: stx.retestneeded
Ghada Khalil (gkhalil)
description: updated
description: updated
tags: added: stx.config
Revision history for this message
Ghada Khalil (gkhalil) wrote :

It's not clear why an application apply would result in a configuration alarm. It may be the result of a puppet manifest apply failure; ideally, puppet should not be involved in a containerized application apply.

Asking the reporter to monitor for a re-occurrence, but a developer will also need to go through the logs to determine the root cause.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was reproduced again on
Lab: WCP_71_75
Load: 2020-03-02_04-10-00

Revision history for this message
Kevin Smith (kevin.smith.wrs) wrote :

Application apply/remove of stx-monitor requires a puppet manifest apply to modify the collectd port on all hosts. In this case, the puppet apply failed during the removal of the stx-monitor application. The two hosts in question show as "unprovisioned", which is the reason for the puppet failure and the alarms.

sysinv.log:
sysinv 2020-03-03 16:38:40.454 266891 INFO sysinv.conductor.manager [-] applying runtime manifest config_uuid=25b3918f-54d9-47f9-9c6a-15d0d9735050, classes: ['platform::collectd::restart']
sysinv 2020-03-03 16:38:40.467 266891 INFO sysinv.puppet.puppet [-] Updating hiera for host: controller-0 with config_uuid: 25b3918f-54d9-47f9-9c6a-15d0d9735050
sysinv 2020-03-03 16:38:42.086 266891 INFO sysinv.puppet.puppet [-] Updating hiera for host: compute-0 with config_uuid: 25b3918f-54d9-47f9-9c6a-15d0d9735050
sysinv 2020-03-03 16:38:43.657 266891 INFO sysinv.conductor.manager [-] Cannot regenerate the configuration for compute-1, the node is not ready. invprovision=unprovisioned
sysinv 2020-03-03 16:38:43.658 266891 INFO sysinv.conductor.manager [-] Cannot regenerate the configuration for compute-2, the node is not ready. invprovision=unprovisioned
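
The skip path in the log above can be sketched as follows. This is an illustrative model only, assuming the conductor gates runtime manifest application on the host's invprovision state; the function and field names here are hypothetical, not the actual sysinv conductor code.

```python
# Hypothetical sketch of the guard behind the "Cannot regenerate the
# configuration ... invprovision=unprovisioned" log lines. Names are
# illustrative assumptions, not the real sysinv implementation.
UNPROVISIONED = "unprovisioned"
PROVISIONING = "provisioning"
PROVISIONED = "provisioned"


def ready_for_config(host):
    """A host only receives runtime puppet manifests once it has started
    (or finished) provisioning."""
    return host["invprovision"] in (PROVISIONING, PROVISIONED)


def apply_runtime_manifest(hosts, classes):
    """Split hosts into updated vs skipped. A skipped host keeps a stale
    config_uuid, so its 250.001 out-of-date alarm never clears."""
    updated, skipped = [], []
    for host in hosts:
        if ready_for_config(host):
            updated.append(host["hostname"])
        else:
            # Mirrors the sysinv.log lines for compute-1 and compute-2:
            # the node is not ready, so its hiera data is not regenerated.
            skipped.append(host["hostname"])
    return updated, skipped
```

With compute-1 stuck at "unprovisioned", it lands in the skipped list while the controllers are updated, matching the log sequence above.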

Revision history for this message
John Kung (john-kung) wrote :

The provisioning state should have transitioned to 'provisioning' on the host-unlock. However, in this case, it was observed to be still 'unprovisioned'.

In sysinv/api/controllers/v1/host.py::stage_administrative_update(), the transition to PROVISIONED ('provisioned') could be made less restrictive, to also allow the transition from constants.UNPROVISIONED (not just the current constants.PROVISIONING), since the operational state is being set to 'enabled' (which can only be True if the host is unlocked-enabled, i.e. after provisioning).
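
The proposed relaxation can be sketched as a small state-transition helper. This is a minimal model of the suggestion, assuming constants and a standalone function modeled on the description; it is not the actual host.py code.

```python
# Illustrative sketch of the relaxed provision-state transition proposed
# above. Constant values and the function name are assumptions based on
# the bug discussion, not the real sysinv source.
UNPROVISIONED = "unprovisioned"
PROVISIONING = "provisioning"
PROVISIONED = "provisioned"
OPERATIONAL_ENABLED = "enabled"


def next_provision_state(invprovision, operational):
    # Before the fix: only PROVISIONING could advance to PROVISIONED, so
    # a host stuck at UNPROVISIONED never progressed and its stale
    # config_uuid kept the 250.001 alarm raised.
    # After the fix: an enabled host may also advance from UNPROVISIONED,
    # since 'enabled' implies unlocked-enabled, i.e. provisioning is done.
    if operational == OPERATIONAL_ENABLED and invprovision in (
            PROVISIONING, UNPROVISIONED):
        return PROVISIONED
    return invprovision
```

Under this model, a host that was missed at unlock time still reaches PROVISIONED once it goes enabled, so subsequent runtime manifests apply and the alarm can clear.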

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue with stuck alarms, resulting from the provisioning state not being set properly on the initial host unlock

tags: added: stx.4.0
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_71_75
Load: 2020-04-19_20-00-00

Log added at
https://files.starlingx.kube.cengn.ca/launchpad/1864874

Yang Liu (yliu12)
summary: - 250.001 alarm not cleared after stx-monitor app applied on regular
- system
+ 250.001 config out-of-date alarm not cleared after system app applied
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_71_75
Load: 2020-05-12_20-00-00

Log added on:
https://files.starlingx.kube.cengn.ca/launchpad/1864874

Revision history for this message
Difu Hu (difuhu) wrote :

Issue was reproduced on
Lab: DC-1
Load: 2020-05-15_20-00-00

Log added on:
https://files.starlingx.kube.cengn.ca/launchpad/1864874

Revision history for this message
Difu Hu (difuhu) wrote :

A similar issue occurred after the hello-kitty app was applied.
Lab: DC-1
Load: 2020-05-15_20-00-00

Log added on:
https://files.starlingx.kube.cengn.ca/launchpad/1864874

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Kristine to review the logs. This may have been addressed by other fixes in this area.

Changed in starlingx:
assignee: nobody → Kristine Bujold (kbujold)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739628

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/739628
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=5ee1dc120b6fc1fa8cc4011b8779fe4da525fd58
Submitter: Zuul
Branch: master

commit 5ee1dc120b6fc1fa8cc4011b8779fe4da525fd58
Author: Kristine Bujold <email address hidden>
Date: Mon Jul 6 20:55:48 2020 -0400

    Relax rules when transitioning to provisioned

    Relaxed the rules for stage_administrative_update() as per
    recommendation in the Launchpad.

    The provisioning state should have transitioned to 'provisioning' on
    the host-unlock; in this case it was observed to be still
    'unprovisioned'. This commit makes the transition to PROVISIONED
    ('provisioned') less restrictive, also allowing the transition from
    constants.UNPROVISIONED (not just the current constants.PROVISIONING),
    since the operational state is being set to 'enabled' (which can only
    be True if the host is unlocked-enabled, i.e. after provisioning).

    The transition to PROVISIONING should have occurred on the host-unlock
    operation, e.g. when the log "stage_administrative_update:
    provisioning" appears; that path checks for UNPROVISIONED or None, so
    this change also aligns the OPERATIONAL_ENABLED status with
    PROVISIONED.

    Closes-Bug: 1864874
    Change-Id: Iaf093046dd3c315b6f22007e81b2b3f468f3e629
    Signed-off-by: Kristine Bujold <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

No longer seen in recent sanity.

tags: removed: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per information from Yang Liu and the test team, this issue has not been seen since May. Moving to stx.5.0. The fix has merged in stx master, but there is no plan to port it to the r/stx.4.0 branch.

tags: added: stx.5.0
removed: stx.4.0