250.001 alarm raised and not clearing

Bug #1859845 reported by Wendy Mitchell
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Kristine Bujold
Milestone: (none listed)

Bug Description

Brief Description
-----------------
The alarm "Configuration is out-of-date" (250.001) does not clear, even after a lock/unlock operation on a single host (controller/worker).

Severity
--------
Minor

Steps to Reproduce
------------------
The test case creates flavors and launches instances on the controller/worker.
It reboots the host with the reboot -f command.
It waits for the reboot to succeed and for the host state to reach:
['controller-0'] have reached state(s): {'availability': ['available', 'degraded']}

Note: The alarm was raised @ 2020-01-15T16:06:29.734 and did not clear

The test then deletes the instances and flavors that were created and checks the alarms.

300 seconds later @ [2020-01-15 16:18:13,883] the 250.001 alarm is still there
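
The alarm check described above can be sketched as a polling loop with a timeout. This is a minimal illustration, not the test framework's actual code: `wait_for_alarm_clear` and the `list_alarm_ids` callable are hypothetical names, and the 300-second default only mirrors the interval mentioned in this report.

```python
import time


def wait_for_alarm_clear(list_alarm_ids, alarm_id="250.001",
                         timeout=300.0, interval=10.0):
    """Poll until alarm_id is no longer active or the timeout expires.

    list_alarm_ids is a hypothetical callable that returns the IDs of the
    currently active alarms (for example, by querying the fault manager).
    Returns True if the alarm cleared, False if it was still active when
    the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if alarm_id not in set(list_alarm_ids()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Simulated alarm source: the alarm stays active for two polls, then clears.
    responses = iter([["250.001"], ["250.001"], []])
    cleared = wait_for_alarm_clear(lambda: next(responses),
                                   timeout=1.0, interval=0.0)
    print("cleared" if cleared else "still active")
```

In the failure reported here, a loop of this kind would return False, because 250.001 stays in the active alarm list for the whole wait.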

Expected Behavior
------------------
The alarm "Configuration is out-of-date" should clear, but it does not.

Actual Behavior
----------------
The alarm "Configuration is out-of-date" (250.001) is never cleared (even after a lock/unlock operation).

Reproducibility
---------------
yes

(failed teardown in test
nova/test_evacuate_vms.py::TestOneHostAvail::test_reboot_only_host)

System Configuration
--------------------
Tested on a single-node system

Branch/Pull Time/Commit
-----------------------
20200111T023000Z

Last Pass
---------

Timestamp/Logs
--------------

see Fm-manager.log when the alarm first appears

2020-01-15T16:06:29.734 fmMsgServer.cpp(398): Raising Alarm/Log, (250.001) (host=controller-0)

2020-01-15T16:06:29.735 fmMsgServer.cpp(421): Alarm created/updated: (250.001) (host=controller-0) (3) (58e6c0f9-097a-4167-b586-61466fd3c934)

2020-01-15T16:06:29.735 fmMsgServer.cpp(430): Send response for create fault, uuid:(58e6c0f9-097a-4167-b586-61466fd3c934) (0)

see Fm-manager.log after lock/unlock attempt

2020-01-15T16:36:56.327 fmMsgServer.cpp(421): Alarm created/updated: (250.001) (host=controller-0) (3) (7d0feb43-50e4-4117-a4ef-2218da2e04f0)

2020-01-15T16:36:56.327 fmMsgServer.cpp(430): Send response for create fault, uuid:(7d0feb43-50e4-4117-a4ef-2218da2e04f0) (0)
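
The two "Alarm created/updated" entries quoted above can be compared mechanically. The following is a rough sketch that assumes the log line layout is exactly as pasted in this report; the regular expression and the `parse_alarm_updates` helper are illustrative and not part of any StarlingX tool. It extracts the alarm ID, host and UUID, and shows that the same alarm ID appears under a different UUID after the lock/unlock.

```python
import re
from datetime import datetime

# Pattern inferred from the lines pasted in this report. The third
# parenthesised field is numeric; its meaning is not stated in the log,
# so it is matched but not interpreted.
ALARM_RE = re.compile(
    r"^(?P<ts>\S+) fmMsgServer\.cpp\(\d+\): Alarm created/updated: "
    r"\((?P<alarm_id>[\d.]+)\) \(host=(?P<host>[^)]+)\) \(\d+\) "
    r"\((?P<uuid>[0-9a-f-]+)\)$"
)

LOG_LINES = [
    "2020-01-15T16:06:29.735 fmMsgServer.cpp(421): Alarm created/updated: "
    "(250.001) (host=controller-0) (3) (58e6c0f9-097a-4167-b586-61466fd3c934)",
    "2020-01-15T16:36:56.327 fmMsgServer.cpp(421): Alarm created/updated: "
    "(250.001) (host=controller-0) (3) (7d0feb43-50e4-4117-a4ef-2218da2e04f0)",
]


def parse_alarm_updates(lines):
    """Return one dict per 'Alarm created/updated' line that matches."""
    events = []
    for line in lines:
        match = ALARM_RE.match(line.strip())
        if match:
            events.append({
                "time": datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%f"),
                "alarm_id": match["alarm_id"],
                "host": match["host"],
                "uuid": match["uuid"],
            })
    return events


events = parse_alarm_updates(LOG_LINES)
first, second = events
gap = second["time"] - first["time"]
print(first["alarm_id"], first["host"], "re-created after", gap,
      "with a new UUID:", first["uuid"] != second["uuid"])
```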

Test Activity
-------------
LP retest

Workaround
----------

Wendy Mitchell (wmitchellwr) wrote :

lab: SM-2

Wendy Mitchell (wmitchellwr) wrote :
Wendy Mitchell (wmitchellwr) wrote :
Yang Liu (yliu12)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority -- stale alarm, but seems to be reproducible

tags: added: stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → yong hu (yhu6)
Ghada Khalil (gkhalil) wrote :

Yong, is there somebody from your team who can look at this? It's not clear whether this is related to the VMs launched before the reboot.

yong hu (yhu6) wrote :

Ghada, I will have someone run a similar test without launching VMs and see whether any alarms are raised. At the same time, please ask someone from the Flock team to check this issue too.

Ghada Khalil (gkhalil)
tags: added: stx.4.0
zhipeng liu (zhipengs) wrote :

Hi Ghada,

Do we have any update from the Flock team? Thanks!

Zhipeng

Ghada Khalil (gkhalil) wrote :

Hi Zhipeng, I don't know enough details about this issue to follow up. What follow-up is requested? It's not efficient to have two teams look at the same issue. What are the results of the investigation done by your team to date?

zhipeng liu (zhipengs) wrote :

Ghada,
We have not seen it. Ran will further check it and update.

Thanks!
Zhipeng

Ghada Khalil (gkhalil) wrote :

@Kristine Bujold, please review the logs for this config-out-of-date occurrence to see if it's already fixed by the code changes already made in this area. It would also be good to check with Wendy if the issue is still seen. This was reported back in Jan. Thanks.

Changed in starlingx:
assignee: yong hu (yhu6) → Kristine Bujold (kbujold)
Kristine Bujold (kbujold) wrote :

This error is different. It appears to be related to a failure while generating helm overrides. I do not think a lock/unlock would clear this type of error.

sysinv 2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task [-] Error during ConductorManager._conductor_audit: Command '['helm', 'install', '--dry-run', '--debug', '--values', '/tmp/tmptaAfuN', '--values', '/tmp/tmpUZG1RF', '/tmp/tmpo_h9t2']' returned non-zero exit status 1: CalledProcessError: Command '['helm', 'install', '--dry-run', '--debug', '--values', '/tmp/tmptaAfuN', '--values', '/tmp/tmpUZG1RF', '/tmp/tmpo_h9t2']' returned non-zero exit status 1
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task Traceback (most recent call last):
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/periodic_task.py", line 180, in run_periodic_tasks
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task task(self, context)
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 4836, in _conductor_audit
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task self._controller_config_active_apply(context)
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 4572, in _controller_config_active_apply
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task context, config_uuid, config_dict)
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 8266, in _config_apply_runtime_manifest
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task self.evaluate_app_reapply(context, app_name)
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 10446, in evaluate_app_reapply
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task armada_format=True, armada_chart_info=app.charts, combined=True)
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/helm/helm.py", line 42, in _wrapper
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task return func(self, *args, **kwargs)
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/helm/helm.py", line 578, in generate_helm_application_overrides
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task file_overrides=file_overrides)
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/helm/helm.py", line 450, in merge_overrides
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstack.common.periodic_task output = subprocess.check_output(cmd, env=env)
2020-01-15 16:37:08.073 88572 ERROR sysinv.openstac...
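
The traceback above ends in `subprocess.check_output(cmd, env=env)`, which raises `CalledProcessError` when the command exits with a non-zero status; the logged message carries only the exit status, not the command's own error text. The sketch below is not the sysinv code. It uses a stand-in command (a small Python one-liner instead of helm) and a hypothetical `run_and_report` helper to show how redirecting stderr into the captured output keeps the underlying error message available to the caller.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (ok, detail).

    stderr is merged into stdout so that, on failure, the command's own
    error text is available instead of only the exit status.
    """
    try:
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return True, output.decode(errors="replace")
    except subprocess.CalledProcessError as exc:
        text = exc.output.decode(errors="replace").strip()
        return False, "exit status %d: %s" % (exc.returncode, text)


# Stand-in for the failing command: writes a message to stderr and exits 1.
failing_cmd = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('render failed'); sys.exit(1)",
]

ok, detail = run_and_report(failing_cmd)
print(ok, detail)
```

A caller that catches the exception this way can log the detail and let its periodic audit continue, rather than failing on every cycle with an uninformative message.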


Kristine Bujold (kbujold) wrote :

I ran test nova/test_evacuate_vms.py::TestOneHostAvail::test_reboot_only_host on SM-2 and was not able to reproduce this issue.

Kristine Bujold (kbujold) wrote :

Sorry, I forgot to add the load to my comment:

Cengn master BUILD_ID="20200711T013416Z"

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil) wrote :

Closing as the issue is not reproducible

Changed in starlingx:
status: In Progress → Invalid
Ghada Khalil (gkhalil) wrote :

@Wendy, if similar issues are reported again in stx master, please open a new LP. There have been many changes since this was first reported. It's not likely that any future issues would be the same as the original issue.

Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded