Configuration is out-of-date alarm was not cleared after SX host lock/unlock

Bug #1841620 reported by Peng Peng
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Stefan Dinescu

Bug Description

Brief Description
-----------------
On SX system, after host lock/unlock, 250.001 | controller-0 Configuration is out-of-date alarm raised and not cleared in 10 mins.
The alarm was cleared after another host lock/unlock.

Severity
--------
Major

Steps to Reproduce
------------------
host lock/unlock
check alarm-list

TC-name: test_lock_unlock_host.py::test_lock_unlock_host[controller]

Expected Behavior
------------------
no 250.001 alarm

Actual Behavior
----------------
250.001 not cleared

Reproducibility
---------------
Seen once

System Configuration
--------------------
One node system

Lab-name: SM-3

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-08-26_20-59-00

Last Pass
---------
2019-08-24_20-59-00

Timestamp/Logs
--------------
[2019-08-27 08:04:52,217] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-27 08:04:54,343] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| ea1dbc10-83aa-4e49-8c1c-43d639a997f9 | 100.114 | NTP address 2607:5300:60:33 is not a valid or a reachable NTP server. | host=controller-0.ntp=2607:5300:60:33 | minor | 2019-08-27T06:57:53.632882 |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+

[2019-08-27 08:04:59,419] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'

[2019-08-27 08:05:51,293] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-08-27 08:27:36,096] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-27 08:27:38,150] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| 06925597-8869-4ae6-a4b0-7b6ca3bbbe2f | 250.001 | controller-0 Configuration is out-of-date. | host=controller-0 | major | 2019-08-27T08:16:08.672238 |

[2019-08-27 08:37:30,775] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-27 08:37:32,723] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| 06925597-8869-4ae6-a4b0-7b6ca3bbbe2f | 250.001 | controller-0 Configuration is out-of-date. | host=controller-0 | major | 2019-08-27T08:16:08.672238 |

[2019-08-27 08:37:55,560] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'

[2019-08-27 08:38:45,705] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-08-27 08:53:55,523] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-27 08:53:57,643] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------+-------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------+-------------------------------------+----------+----------------------------+
| fbeb4356-2e96-4bf1-ab51-1bb7d5e69bdf | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-08-27T08:46:26.735548 |
| e68e85c9-97b6-432d-b50a-c406bed8b5d1 | 100.114 | NTP address 207.34.49.172 is not a valid or a reachable NTP server. | host=controller-0.ntp=207.34.49.172 | minor | 2019-08-27T08:46:26.694524 |
| 03ab993f-9150-4bff-b8cb-068b4d7a9140 | 100.114 | NTP address 192.95.27.155 is not a valid or a reachable NTP server. | host=controller-0.ntp=192.95.27.155 | minor | 2019-08-27T08:46:26.653520 |
| 98cc5502-6d02-4bdb-91d3-c2c44f7bb3c9 | 100.114 | NTP address 192.99.2.8 is not a valid or a reachable NTP server. | host=controller-0.ntp=192.99.2.8 | minor | 2019-08-27T08:46:26.612198 |
+--------------------------------------+----------+------------------------------------------------------------------------+-------------------------------------+----------+----------------------------+
controller-0:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Dariush Eslimi (deslimi)
Changed in starlingx:
assignee: nobody → Stefan Dinescu (stefandinescu)
importance: Undecided → Medium
Dariush Eslimi (deslimi)
tags: added: stx.3.0 stx.config
Changed in starlingx:
status: New → Triaged
Revision history for this message
Al Bailey (albailey1974) wrote :

The stuck alarm occurs near this log

2019-08-27 08:15:31.123 95305 INFO sysinv.agent.rpcapi [-] config_apply_runtime_manifest: fanout_cast: sending config 3ac0c536-57ad-4e49-ba37-6877ef030622 {'classes': ['openstack::keystone::endpoint::runtime', 'platform::firewall::runtime', 'platform::sysinv::runtime'], 'force': False, 'personalities': ['controller'], 'host_uuids': [u'94146b1f-8449-4cd8-888b-0cffa07293f8']} to agent
2019-08-27 08:15:31.132 95305 INFO sysinv.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2019-08-27 08:15:31.223 95305 INFO sysinv.conductor.manager [-] drbd-overview: pgsql-40.0, cgcs-20.0, extension-0.96875, patch-vault-0, etcd-4.8, dockerdistribution-16.0
2019-08-27 08:15:31.223 95305 INFO sysinv.conductor.manager [-] lvdisplay: pgsql-40.0, cgcs-20.0, extension-1.0, patch-vault-0, etcd-5.0, dockerdistribution-16.0
2019-08-27 08:15:31.235 95305 INFO sysinv.conductor.manager [-] SYS_I Clear system config alarm: controller-0 target config 328f0cd7-a654-4419-8320-18f619dc2ea4
2019-08-27 08:16:03.390 14322 INFO sysinv.agent.manager [-] get_ihost_by_macs rpc Timeout.
2019-08-27 08:16:03.399 14322 INFO sysinv.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2019-08-27 08:16:03.417 14322 WARNING sysinv.openstack.common.loopingcall [-] task run outlasted interval by 1.728777 sec

We can determine what area of the code is being run, but not why there is no puppet log for that time period.

Revision history for this message
Al Bailey (albailey1974) wrote :

Another possibility is that sysinv AIO-SX is sending this puppet apply command before the agent is ready to process the RPC commands.
This was the stuck alarm
2019-08-27 08:16:08.672 95305 WARNING sysinv.conductor.manager [req-53e08163-cced-45f8-a6fd-74ad84d09d1e None None] SYS_I Raise system config alarm: host controller-0 config applied: 328f0cd7-a654-4419-8320-18f619dc2ea4 vs. target: 3ac0c536-57ad-4e49-ba37-6877ef030622.

This might indicate why the rpc command was never handled.
2019-08-27 08:16:11.719 14322 INFO sysinv.agent.manager [-] Initial inventory completed host 94146b1f-8449-4cd8-888b-0cffa07293f8

Revision history for this message
Stefan Dinescu (stefandinescu) wrote :

The command is received by the agent, but before the agent inventory is complete, so the agent doesn't know the UUID of the host it is running on.

Thus, the agent receives the call, but doesn't know it is supposed to handle it and just ignores the request. So this is a timing issue. I am currently working on a possible fix for this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/679796

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/679796
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=86abe22b804af6ea569295265adb39c8d15d04f7
Submitter: Zuul
Branch: master

commit 86abe22b804af6ea569295265adb39c8d15d04f7
Author: Stefan Dinescu <email address hidden>
Date: Tue Sep 3 16:31:36 2019 +0300

    Retry puppet runtime apply in case of missing info

    There are cases where sysinv-agents gets runtime puppet
    apply requests before it gets the host_uuid from the
    sysinv-conductor.

    This causes the received request to be dropped because the
    agent thinks the command is not for itself.

    The fix for this is to expand the retry logic of the
    config_apply_runtime_manifest method to retry if any
    of the needed inventory info is needed.

    Tests done:
    - install of an SX setup and 10 lock-unlocks
    - install of a 2+2 setup and 2 lock-unlocks on each node

    Change-Id: I130e73881bd572398072aff7c46b3b7f165ca6b9
    Closes-bug: 1841620
    Signed-off-by: Stefan Dinescu <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Verified on wcp_68

cat /etc/build.info
###
### Wind River Titanium Cloud
### Release 19.10
###
### Wind River Systems, Inc.
###

SW_VERSION="19.10"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-09-18_20-00-00"
SRC_BUILD_ID="16"

JOB="TC_19.10_Build"
BUILD_BY="jenkins"
BUILD_NUMBER="16"
BUILD_HOST="yow-cgts4-lx.wrs.com"
BUILD_DATE="2019-09-18 20:01:51 -0400"

tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.