250.001 controller Configuration is out-of-date alarm not cleared after DX installation

Bug #1851874 reported by Peng Peng
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Medium       Austin Sun

Bug Description

Brief Description
-----------------
After a fresh DX installation, the alarm "250.001 | controller-0 Configuration is out-of-date." was raised and never cleared.

Severity
--------
Major

Steps to Reproduce
------------------
Install a DX system.

TC-name:

Expected Behavior
------------------
no 250.001 alarm

Actual Behavior
----------------

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two node system
IPv6

Lab-name: WCP_78-79

Branch/Pull Time/Commit
-----------------------
2019-11-04_20-00-00

Last Pass
---------
not sure

Timestamp/Logs
--------------
[2019-11-08 08:58:14,558] 311 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-11-08 08:58:16,572] 433 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| UUID                                 | Alarm ID | Reason Text                                                              | Entity ID             | Severity | Time Stamp                 |
+--------------------------------------+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| 3cba2cdb-fd95-4ab2-9939-4ba6b6db1f3a | 100.114  | NTP cannot reach external time source; syncing with peer controller only | host=controller-1.ntp | minor    | 2019-11-08T08:50:50.668608 |
| 3c61f744-a456-4f0d-b0bb-d7f51e7c0633 | 250.001  | controller-0 Configuration is out-of-date.                               | host=controller-0     | major    | 2019-11-08T08:10:46.354822 |
+--------------------------------------+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+
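
For anyone automating this check, the table printed by `fm alarm-list` can be scanned for the stale alarm. The following is a minimal sketch, not part of the original test framework; the helper names `parse_alarm_table` and `alarm_present` are hypothetical, and it assumes no cell contains a `|` character.

```python
SAMPLE = """\
+--------------------------------------+----------+--------------------------------------------+-----------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------+-----------------------+----------+----------------------------+
| 3cba2cdb-fd95-4ab2-9939-4ba6b6db1f3a | 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-1.ntp | minor | 2019-11-08T08:50:50.668608 |
| 3c61f744-a456-4f0d-b0bb-d7f51e7c0633 | 250.001 | controller-0 Configuration is out-of-date. | host=controller-0 | major | 2019-11-08T08:10:46.354822 |
+--------------------------------------+----------+--------------------------------------------+-----------------------+----------+----------------------------+
"""


def parse_alarm_table(text):
    """Turn the ASCII table into a list of dicts keyed by column header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and any surrounding log text
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first "|" line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def alarm_present(rows, alarm_id, entity_id):
    """Return True if an alarm with this ID is active for this entity."""
    return any(
        row.get("Alarm ID") == alarm_id and row.get("Entity ID") == entity_id
        for row in rows
    )


rows = parse_alarm_table(SAMPLE)
print(alarm_present(rows, "250.001", "host=controller-0"))
```

With the output captured above, the check reports that 250.001 is still active on controller-0, which is the failure described in this bug.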

Test Activity
-------------
installation

Revision history for this message
Peng Peng (ppeng) wrote :
summary: 250.001 controller Configuration is out-of-date alarm not cleared after
- hbsClient killed and recovered
+ DX installation
Peng Peng (ppeng)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority as this would result in blocking subsequent swact operations.

tags: added: stx.config
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.3.0
Yang Liu (yliu12)
tags: added: stx.retestneeded
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Austin Sun (sunausti)
Revision history for this message
Austin Sun (sunausti) wrote :

pengpeng:
  The summary was changed as follows:
  250.001 controller Configuration is out-of-date alarm not cleared after
  - hbsClient killed and recovered
  + DX installation

  Did you kill hbsClient manually and recover it manually?
  Could you share the whole test log?

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Austin Sun (sunausti) wrote :

From the sysinv log, the sequence is:
L501: config_target: a29dbaf3-d60d-402b-8ad6-6e8147dd416b
L510: config_target: f5d83753-c485-4221-b4aa-e986f5421d6d
L528: config_applied changes to a29dbaf3-d60d-402b-8ad6-6e8147dd416b.
L600~: controller-0 is unlocking and rebooting.
L992: config_target is a9c89b7e-51fe-40cb-8700-e22c83ed5fa9, but config_uuid=29c89b7e-51fe-40cb-8700-e22c83ed5fa9, because the current config_target (f5d83753-c485-4221-b4aa-e986f5421d6d) is still marked as reboot requested (although the reboot has already happened).
L1040: config_applied changes to f5d83753-c485-4221-b4aa-e986f5421d6d.
L1048: config_applied=f5d83753-c485-4221-b4aa-e986f5421d6d, and the target is a9c89b7e-51fe-40cb-8700-e22c83ed5fa9.
L1060: The agent reports config applied 29c89b7e-51fe-40cb-8700-e22c83ed5fa9.
However, config_applied (29c89b7e-51fe-40cb-8700-e22c83ed5fa9) is not equal to config_target (a9c89b7e-51fe-40cb-8700-e22c83ed5fa9), so the alarm continues to be reported. At this point the alarm should have been cleared.
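
The two UUIDs in the last step differ only in their first hex digit (2 versus a), which is a difference in the most significant bit alone. The sketch below illustrates why a plain equality check keeps the alarm raised. It assumes the reboot-requested flag is stored in the top bit of the 128-bit config UUID; that assumption is consistent with the values in the log but is not confirmed in this report, and the function and constant names are hypothetical.

```python
import uuid

# Assumption: the reboot-requested flag is the most significant bit of the UUID.
REBOOT_REQUIRED_FLAG = 1 << 127


def set_reboot_required(config_uuid):
    """Return the UUID string with the (assumed) reboot flag bit set."""
    value = int(uuid.UUID(config_uuid)) | REBOOT_REQUIRED_FLAG
    return str(uuid.UUID(int=value))


def clear_reboot_required(config_uuid):
    """Return the UUID string with the (assumed) reboot flag bit cleared."""
    value = int(uuid.UUID(config_uuid)) & ~REBOOT_REQUIRED_FLAG
    return str(uuid.UUID(int=value))


def is_out_of_date_strict(config_applied, config_target):
    """The failing behaviour: any difference keeps the alarm raised."""
    return config_applied != config_target


applied = "29c89b7e-51fe-40cb-8700-e22c83ed5fa9"  # reported by the agent
target = "a9c89b7e-51fe-40cb-8700-e22c83ed5fa9"   # stored as config_target

print(set_reboot_required(applied) == target)   # same config, flag aside
print(is_out_of_date_strict(applied, target))   # strict check still differs
```

Under this assumption the applied and target values identify the same configuration, and only the stale reboot flag makes the strict comparison fail.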

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/696030

Changed in starlingx:
status: Incomplete → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving unresolved medium priority bugs (< 100 days OR recently reproduced) from stx.3.0 to stx.4.0

tags: added: stx.4.0
removed: stx.3.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/696030
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=238b196d3a9d244ce6e86f0fc7a2d14c2b3878d7
Submitter: Zuul
Branch: master

commit 238b196d3a9d244ce6e86f0fc7a2d14c2b3878d7
Author: Sun Austin <email address hidden>
Date: Thu Nov 21 21:23:00 2019 +0800

    Add an additional comparison condition to the config out-of-date check

    This fixes the scenario below:
    Controller-0 is installed and some configs were applied. Config
    f5d83753-c485-4221-b4aa-e986f5421d6d is the target config and requests
    a reboot.
    After controller-0 is unlocked and rebooted, another request with uuid
    29c89b7e-51fe-40cb-8700-e22c83ed5fa9 arrives, and the config_target is
    set to a9c89b7e-51fe-40cb-8700-e22c83ed5fa9.
    Once 29c89b7e-51fe-40cb-8700-e22c83ed5fa9 is applied, the alarm should
    be cleared, but because 29c89b7e-51fe-40cb-8700-e22c83ed5fa9 is not the
    same as a9c89b7e-51fe-40cb-8700-e22c83ed5fa9, the alarm cannot be cleared.

    This fix keeps a tracking list of reboot configs and clears it once the
    config is applied. The out-of-date alarm is cleared if the reboot config
    tracking list is empty and config_applied with the reboot config flag set
    is equal to config_target.

    Closes-Bug: 1851874
    Change-Id: Iabeab338bc3fb4615cefcff9e4ae9402e4216321
    Signed-off-by: Sun Austin <email address hidden>
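
The rule described in the commit message above can be sketched roughly as follows. This is an illustration only, not the merged sysinv code: the class `HostConfigTracker`, its methods, and the placement of the reboot flag in the top bit of the UUID are all assumptions made for the example.

```python
import uuid

# Assumption: the reboot flag is the most significant bit of the config UUID.
REBOOT_REQUIRED_FLAG = 1 << 127


def with_reboot_flag(config_uuid):
    """Return the UUID string with the (assumed) reboot flag bit set."""
    value = int(uuid.UUID(config_uuid)) | REBOOT_REQUIRED_FLAG
    return str(uuid.UUID(int=value))


class HostConfigTracker:
    """Hypothetical per-host tracker for configs that requested a reboot."""

    def __init__(self):
        self.pending_reboot_configs = []

    def request_reboot_config(self, config_uuid):
        # Remember a config that needs a reboot before it counts as applied.
        self.pending_reboot_configs.append(config_uuid)

    def report_applied(self, config_applied):
        # Once the agent reports the config as applied, stop tracking it.
        if config_applied in self.pending_reboot_configs:
            self.pending_reboot_configs.remove(config_applied)

    def is_out_of_date(self, config_applied, config_target):
        if config_applied == config_target:
            return False
        # Additional condition: no reboot config is outstanding and the two
        # values differ only by the reboot flag.
        if (not self.pending_reboot_configs
                and with_reboot_flag(config_applied) == config_target):
            return False
        return True


tracker = HostConfigTracker()
tracker.request_reboot_config("f5d83753-c485-4221-b4aa-e986f5421d6d")
tracker.report_applied("f5d83753-c485-4221-b4aa-e986f5421d6d")
print(tracker.is_out_of_date("29c89b7e-51fe-40cb-8700-e22c83ed5fa9",
                             "a9c89b7e-51fe-40cb-8700-e22c83ed5fa9"))
```

In this sketch the host stays out-of-date while a reboot config is still tracked, and the flag-insensitive comparison only applies after the tracking list has been emptied.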

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Not seeing this issue recently

Peng Peng (ppeng)
tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/705837

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/705837
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=8ac6ec70cb8a787a274fd7227eb34d2b7bcd5f5b
Submitter: Zuul
Branch: f/centos8

commit 7995dd436954b92f1c4e3f760a7609af670c84c8
Author: Jessica Castelino <email address hidden>
Date: Mon Feb 3 12:07:26 2020 -0500

    Unit test cases for helm charts

    Test cases added for API endpoints used by:
     1. helm-override-delete
     2. helm-override-show
     3. helm-override-list
     4. helm-override-update
     5. helm-chart-attribute-modify

    Story: 2007082
    Task: 38012
    Change-Id: I86763496bb41084c006f2486702c3b15bde039d2
    Signed-off-by: Jessica Castelino <email address hidden>

commit 7e2fda010299f7305b630d6db97bbe1e169a38b1
Author: Angie Wang <email address hidden>
Date: Wed Jan 29 21:18:18 2020 -0500

    Finish kubernetes networking upgrade support

    The commit completes the RPC kube_upgrade_networking
    in sysinv-conductor to run ansible playbook
    upgrade-k8s-networking.yml to upgrade networking pods
    and also updates the networking upgrade function called
    as part of sysinv-conductor startup to provide a current
    kubernetes version when running the upgrade playbook.

    The second control plane upgrade can only be performed
    after the networking upgrade is done, fix the semantic
    check in sysinv api.

    Change-Id: I8dcf5a2baedfaefb0a7ca037eb47bf7cacd686f8
    Story: 2006781
    Task: 37584
    Depends-On: https://review.opendev.org/#/c/705310/
    Signed-off-by: Angie Wang <email address hidden>

commit 52c37a35d2cd62fa1cc1933765c76c1ba8616864
Author: Jerry Sun <email address hidden>
Date: Fri Jan 31 16:10:25 2020 -0500

    Add Unit Tests for Dex Sysinv Changes

    Add unit tests for the dex helm chart changes under the same story
    and task

    Story: 2006711
    Task: 37857

    Depends-On: https://review.opendev.org/#/c/705297/

    Change-Id: I3a0e1c490e56188adfbd614fd6ebb21bfdddf49e
    Signed-off-by: Jerry Sun <email address hidden>

commit 144587a6ac9fc48b9249be99abadd35dfa49e7a7
Author: Teresa Ho <email address hidden>
Date: Fri Jan 31 15:35:04 2020 -0500

    Tox tests for OIDC client helm overrides

    Added some tox tests for OIDC client helm overrides.

    Story: 2006711
    Task: 38481

    Change-Id: If4aeaf0010c7076d1d83bacd00d6fd0122d4ffad
    Signed-off-by: Teresa Ho <email address hidden>

commit 763ddeadd4e83af6cebf752d693ee3e7d3b005b1
Author: Thomas Gao <email address hidden>
Date: Wed Jan 29 16:30:40 2020 -0500

    Fixed errors in address deletion

    Allowed address deletion despite missing associated interface or host.

    Enabled relevant unit test.

    Closes-Bug: 1860186

    Change-Id: Ie6e6358aa75091e92914a8b581b4d5203a596f56
    Signed-off-by: Thomas Gao <email address hidden>

commit 61463608169e75601b8a4f9db7c98190788d6f6a
Author: Thomas Gao <email address hidden>
Date: Tue Jan 28 15:32:58 2020 -0500

    Fixed broken sysinv address get-all api call

    Removed unexpected keyword argument that caused the error....

tags: added: in-f-centos8