alarm for https_enabled clears five minutes before config is updated

Bug #1812399 reported by Michel Thebeau [WIND]
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Low
Lin Shuicheng

Bug Description

Title
------
alarm for https_enabled clears five minutes before config is updated

Brief Description
-----------------

alarm for https_enabled clears five minutes before config is updated; fm alarm-list shows 250.001 alarm is absent promptly after running 'system modify --https_enabled true', but the system continues respond to http rather than https for several minutes.

Severity
--------
Minor

Steps to Reproduce
------------------
# system modify --https_enabled true; date;
# fm alarm-list | grep 250.001
# grep "\(Raise\|Clear\) system config alarm" /var/log/sysinv.log
# while true; do { grep "bind.*ssl" /etc/haproxy/haproxy.cfg && break; sleep 1; }; done; date

Expected Behavior
------------------
Alarm remains until cfg file us updated

Actual Behavior
----------------
Alarm clears promptly, but config file is not updated for another 5 minutes

Reproducibility
---------------
Reproducible

System Configuration
--------------------
standard controller

Branch/Pull Time/Commit
-----------------------
2019-01-16_20-18-01
starlingx/master

Timestamp/Logs
--------------
# system modify --https_enabled true; date;
...
| https_enabled | True |
...
Fri Jan 18 13:43:52 UTC 2019

grplist="$( echo "update https to.*https_enabled
req-6e6ec72b-a833-4a8e-8485-625e616d7bf3
281203db-dcb2-4843-94ca-6d0bbbbe188a
6f80f124-3d7d-4aa2-9ab9-18e4b9391bfd
Clear system config alarm
Raise system config alarm" | sed ':a;N;$!ba;s/\n/\\\|/g' )"

grep "$grplist" /var/log/sysinv.log"

2019-01-18 13:43:46.693 7253 INFO sysinv.api.controllers.v1.system [-] update https to {u'vswitch_type': u'ovs-dpdk', u'sdn_enabled': False, u'shared_services': u'[]', u'https_enabled': True, u'kubernetes_enabled': False, u'region_config': False}
2019-01-18 13:43:46.730 3994 INFO sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] _config_selfsigned_certificate mode=ssl file=/etc/ssl/private/self-signed-server-cert.pem
2019-01-18 13:43:46.734 3994 INFO sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] config_certificate signature=ssl_18300269710244239027
2019-01-18 13:43:46.789 3994 WARNING sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] controller-0: iconfig out of date: target 281203db-dcb2-4843-94ca-6d0bbbbe188a, applied 413b831f-52dc-4885-ab3c-4b87e3d6aa92
2019-01-18 13:43:46.789 3994 WARNING sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] SYS_I Raise system config alarm: host controller-0 config applied: 413b831f-52dc-4885-ab3c-4b87e3d6aa92 vs. target: 281203db-dcb2-4843-94ca-6d0bbbbe188a.
2019-01-18 13:43:46.906 3994 INFO sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] _config_update_hosts config_uuid=281203db-dcb2-4843-94ca-6d0bbbbe188a
2019-01-18 13:43:49.647 8394 INFO sysinv.agent.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] Agent config applied 281203db-dcb2-4843-94ca-6d0bbbbe188a
2019-01-18 13:43:49.746 3994 WARNING sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] controller-0: iconfig out of date: target 6f80f124-3d7d-4aa2-9ab9-18e4b9391bfd, applied 413b831f-52dc-4885-ab3c-4b87e3d6aa92
2019-01-18 13:43:49.747 3994 WARNING sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] SYS_I Raise system config alarm: host controller-0 config applied: 413b831f-52dc-4885-ab3c-4b87e3d6aa92 vs. target: 6f80f124-3d7d-4aa2-9ab9-18e4b9391bfd.
2019-01-18 13:43:49.810 3994 INFO sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] _config_update_hosts config_uuid=6f80f124-3d7d-4aa2-9ab9-18e4b9391bfd
2019-01-18 13:43:49.898 3994 INFO sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] SYS_I Clear system config alarm: controller-0
2019-01-18 13:43:52.814 8394 INFO sysinv.agent.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] config_apply_runtime_manifest: 6f80f124-3d7d-4aa2-9ab9-18e4b9391bfd {u'classes': [u'platform::haproxy::runtime', u'openstack::keystone::endpoint::runtime', u'openstack::horizon::runtime', u'openstack::nova::api::runtime', u'openstack::heat::engine::runtime'], u'force': False, u'personalities': [u'controller']} controller
2019-01-18 13:43:52.815 8394 INFO sysinv.agent.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] controller-active
2019-01-18 13:43:52.816 8394 INFO sysinv.agent.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] _apply_runtime_manifest with hieradata_path = '/opt/platform/puppet/19.01/hieradata'
2019-01-18 13:46:42.767 7252 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-01-18-13-46-42 patch
2019-01-18 13:46:42.768 7252 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_end. No changes from mtce/1.0.
2019-01-18 13:50:10.581 8394 INFO sysinv.agent.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] Agent config applied 6f80f124-3d7d-4aa2-9ab9-18e4b9391bfd
2019-01-18 13:50:10.679 3994 INFO sysinv.conductor.manager [req-6e6ec72b-a833-4a8e-8485-625e616d7bf3 admin admin] SYS_I Clear system config alarm: controller-0

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Minor issue; does not gate the upcoming release, but could be a good learning opportunity for someone ramping up on StarlingX.

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.config stx.fault
Changed in starlingx:
assignee: nobody → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Lin Shuicheng (shuicheng)
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Hi Michel, I cannot reproduce this issue with latest master code. In my test, config in haproxy.cfg will be updated first, then 250.001 alarm will be cleared after it. Could you help double check the issue with latest build? Thanks. I verified multi node 1+1, and duplex mode.
I run "watch -n5 "fm alarm-list | grep 250; grep "bind.*ssl" /etc/haproxy/haproxy.cfg"", and find cfg file is updated first, before the 250.001 log be cleared.

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Log for you reference:
cfg file be updated:
[wrsroot@controller-0 log(keystone_admin)]$ ls -lh /etc/haproxy/haproxy.cfg
-rw-r----- 1 root root 5.4K Mar 10 07:58 /etc/haproxy/haproxy.cfg
250.001 cleared:
2019-03-10 07:59:23.296 8596 INFO sysinv.conductor.manager [req-da29444c-b291-407e-b587-381a7974b7af admin admin] SYS_I Clear system config alarm: controller-0

Revision history for this message
Michel Thebeau [WIND] (mthebeau) wrote :

Acknowledged Shuicheng. I would offer the following observation based on a build date 2019-02-28_20-18-00:

controller-0# ls -ltr /etc/lighttpd/lighttpd.conf /etc/haproxy/haproxy.cfg
-rw-r----- 1 root root 14577 Mar 13 22:18 /etc/lighttpd/lighttpd.conf
-rw-r----- 1 root root 6695 Mar 13 22:22 /etc/haproxy/haproxy.cfg

controller-1# ls -ltr /etc/haproxy/haproxy.cfg /etc/lighttpd/lighttpd.conf
-rw-r----- 1 root root 14577 Mar 13 22:18 /etc/lighttpd/lighttpd.conf
-rw-r----- 1 root root 6695 Mar 13 22:18 /etc/haproxy/haproxy.cfg

Event history list:
2019-03-13T22:18:55.896458 clear 250.001 controller-1 Configuration is out-of-date. host=controller-1 major
2019-03-13T22:17:54.376838 set 250.001 controller-1 Configuration is out-of-date. host=controller-1 major
2019-03-13T22:17:54.334414 clear 250.001 controller-0 Configuration is out-of-date. host=controller-0 major
2019-03-13T22:17:54.292987 clear 250.001 controller-1 Configuration is out-of-date. host=controller-1 major
2019-03-13T22:17:53.965026 set 250.001 controller-0 Configuration is out-of-date. host=controller-0 major
2019-03-13T22:17:49.860134 set 250.001 controller-1 Configuration is out-of-date. host=controller-1 major
2019-03-13T22:17:49.711380 set 250.001 controller-0 Configuration is out-of-date. host=controller-0 major

Today I was busy-polling fm alarm-list and /etc/haproxy/haproxy.cfg on the active controller. I observed the condition against haproxy. I forgot to check lighttpd (horizon) until a few minutes into waiting for haproxy config to update. I observed no alarm against the active controller (controller-0). The event history shows I had about 5 seconds to catch it.

I'll look for a newer green build and do the check again.

Revision history for this message
Michel Thebeau [WIND] (mthebeau) wrote :

I'm getting a different behaviour on load 20190314T003000Z, but still the controller-0 config out-of-date alarm is cleared without a configuration change.

I waited 30 minutes for a change to either the lighttpd or haproxy configurations (neither changed on either controllers). The events looked like this which is consistent:

2019-03-15T14:48:00.314454 set 250.001 controller-1 Configuration is out-of-date. host=controller-1 major
2019-03-15T14:48:00.227059 clear 250.001 controller-0 Configuration is out-of-date. host=controller-0 major
2019-03-15T14:48:00.185738 clear 250.001 controller-1 Configuration is out-of-date. host=controller-1 major
2019-03-15T14:47:59.781890 set 250.001 controller-0 Configuration is out-of-date. host=controller-0 major
2019-03-15T14:47:55.415614 set 250.001 controller-1 Configuration is out-of-date. host=controller-1 major
2019-03-15T14:47:55.271132 set 250.001 controller-0 Configuration is out-of-date. host=controller-0 major

I stopped waiting at 15:24

I'm not sure if it is because I only configured up-to unlock of controller-1 and compute-0. I'll intend to redo the test after configuration up-to healthy ceph -s. For this load I had followed the new wiki for container deployment:
https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandard

After giving up as above, I tried lock and unlock of controller-1; the alarm cleared for controller-1 and it responds to https instead of http. controller-0 is configured as http and there is no alarm for controller-0. (haproxy is still http as well; only .2 responds to haproxy ports)

Add clear event for controller-1:
2019-03-15T15:30:56.504524 clear 250.001 controller-1 Configuration is out-of-date. host=controller-1 major

Swact controller-0; and haproxy response is https (controller-1) as expected.

I'll put that aside and repeat as noted "test after configuration up-to healthy ceph -s"

M

Revision history for this message
Michel Thebeau [WIND] (mthebeau) wrote :

Oh... I tried to swact back to controller-0 but:

$ system host-swact controller-1
controller-0 target Config 3c7c2b9d-af17-4c8a-a6ec-165ff69225b0 not yet applied. Apply target Config via Lock/Unlock prior to Swact

So it knows the configuration is pending, it's the alarm that's missing.

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Hi Michel, we are in the same page now. I have the same status as your now with container deployment image.
There is manifest apply failure in sysinv.log like below:

2019-03-28 06:53:36.235 11359 TRACE sysinv.puppet.common CalledProcessError: Command '['/usr/local/bin/puppet-manifest-apply.sh', '/opt/platform/puppet/19.01/hieradata', '192.168.204.3', 'controller', 'runtime', '/tmp/tmpaHW4GZ.yaml']' returned non-zero exit status 1
2019-03-28 06:53:36.235 11359 TRACE sysinv.puppet.common
2019-03-28 06:53:36.236 11359 ERROR sysinv.agent.manager [req-002edb2a-e88e-4b50-b56a-317b82354645 admin admin] failed to apply runtime manifest

2019-03-28 06:53:36.238 11359 TRACE sysinv.openstack.common.rpc.amqp SysinvException: Failed to execute runtime manifest for host 192.168.204.3

Here is a summary for this issue:
With latest container duplex deployment, "/etc/haproxy/haproxy.cfg" cannot be updated in both controller-0 and controller-1 by cmd "system modify --https_enabled true/false".
I am checking the issue now.

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

The manifest apply failure is fixed with below patch which is already merged:

commit 2336e855bd982fbf6762b7c883d6b2bd0d0653b3
Author: Teresa Ho <email address hidden>
Date: Fri Mar 8 12:11:52 2019 -0500

    Optional https for containerized openstack

Will have a try with latest code whether this issue still exist or not.

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Hi Michel,
With latest code, I successfully deployed duplex environment, and cannot reproduce the original issue "alarm for https_enabled clears five minutes before config is updated".
In my test, haproxy.cfg will be updated first, then alarm be cleared.
Please help test the issue again with latest master code.
Thanks.

Please notice upper patch is merged at " Tue Mar 26 18:43:36 2019 +0000".
So you need use build after the day to make sure the fix is included.

commit 617ca1de591bbed47c0af7834539f1e7ad220d17
Merge: 6f82495 2336e85
Author: Zuul <email address hidden>
Date: Tue Mar 26 18:43:36 2019 +0000

    Merge "Optional https for containerized openstack"

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Hi Michel,
Please double check the issue with latest build. Thanks.

Mark the issue as "Incomplete", since cannot reproduce it, and wait submitter's confirmation.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Michel Thebeau [WIND] (mthebeau) wrote :

Hi Shuicheng,

I retested with containers using load 20190404T013000Z. The configuration files for haproxy and lighttpd end up with timestamps that precede the alarm cleared event for the respective controllers. Neither did I notice a delay in the https protocol becoming enabled after the alarms cleared.

If there is no particular change set to refer to for the fix, then I agree to close the bug in whatever manner the subscribed users recommend.

M

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Hi Michel,
Thanks for the confirmation. I am not sure which patch fix the issue.
Let's just close it first, and check it later if it happen again.

Changed in starlingx:
status: Incomplete → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Adding the stx.2.0 release label as the fix is included in that release

tags: added: stx.2.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.