HAproxy is restarted every 5-10 minutes when using reactive charms with multiple network space bindings

Bug #1737776 reported by Tytus Kurek
Affects                    Status        Importance  Assigned to
Gnocchi Charm              Fix Released  High        James Page
OpenStack Designate Charm  Fix Released  High        Liam Young
charms.openstack           Fix Released  High        James Page

Bug Description

A designate application is running in clustered mode, deployed from the 17.11 charm release. The following is being observed:

root@juju-bd82de-28-lxd-13:~# cat /var/log/juju/unit-designate-hacluster-58.log | grep "Stopping haproxy" | tail
2017-12-12 13:50:24 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 13:55:49 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 14:07:24 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 14:12:30 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 14:22:56 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 14:27:46 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 14:37:22 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 14:43:07 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 14:54:02 DEBUG ha-relation-changed * Stopping haproxy haproxy
2017-12-12 15:00:02 DEBUG ha-relation-changed * Stopping haproxy haproxy

HAproxy is being restarted every 5-10 minutes. Attaching log files from all 3 units.

Revision history for this message
Tytus Kurek (tkurek) wrote :

This issue was originally reported in bug 1698814 and was claimed to be fixed in the designate charm 17.08 release.

Revision history for this message
Tytus Kurek (tkurek) wrote :

Perhaps the "ha-relation-changed" hook should not run at all. I've just captured the following Juju logs from another environment where the issue is also present:

# cat /var/log/juju/logsink.log | grep "got a relation units change" | grep unit-designate-hacluster-63
a5178681-c521-4519-8826-cfeed5baab69: unit-designate-hacluster-63 2017-12-18 11:22:31 DEBUG juju.worker.uniter.remotestate watcher.go:416 got a relation units change: {507 {map[designate/9:{3833}] []}}
a5178681-c521-4519-8826-cfeed5baab69: unit-designate-hacluster-63 2017-12-18 11:28:16 DEBUG juju.worker.uniter.remotestate watcher.go:416 got a relation units change: {507 {map[designate/9:{3834}] []}}
a5178681-c521-4519-8826-cfeed5baab69: unit-designate-hacluster-63 2017-12-18 11:38:38 DEBUG juju.worker.uniter.remotestate watcher.go:416 got a relation units change: {507 {map[designate/9:{3835}] []}}
a5178681-c521-4519-8826-cfeed5baab69: unit-designate-hacluster-63 2017-12-18 11:44:19 DEBUG juju.worker.uniter.remotestate watcher.go:416 got a relation units change: {507 {map[designate/9:{3836}] []}}
a5178681-c521-4519-8826-cfeed5baab69: unit-designate-hacluster-63 2017-12-18 11:59:35 DEBUG juju.worker.uniter.remotestate watcher.go:416 got a relation units change: {507 {map[designate/9:{3837}] []}}
a5178681-c521-4519-8826-cfeed5baab69: unit-designate-hacluster-63 2017-12-18 12:16:12 DEBUG juju.worker.uniter.remotestate watcher.go:416 got a relation units change: {507 {map[designate/9:{3838}] []}}
a5178681-c521-4519-8826-cfeed5baab69: unit-designate-hacluster-63 2017-12-18 12:18:37 DEBUG juju.worker.uniter.remotestate watcher.go:416 got a relation units change: {507 {map[designate/9:{3839}] []}}
a5178681-c521-4519-8826-cfeed5baab69: unit-designate-hacluster-63 2017-12-18 12:23:23 DEBUG juju.worker.uniter.remotestate watcher.go:416 got a relation units change: {507 {map[designate/9:{3840}] []}}

Revision history for this message
Liam Young (gnuoy) wrote :

I'll see if I can reproduce this.

Changed in charm-designate:
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

The string representations of the resource_params and resources dictionaries presented by designate keep changing order, which causes the ha-relation-changed hook to fire. Below is the raw relation data from a hacluster subordinate:

14:08:57
clones: '{''cl_res_designate_haproxy'': ''res_designate_haproxy''}'
corosync_bindiface: eth0
corosync_mcastport: "4440"
groups: '{''grp_designate_vips'': ''res_designate_ens3_vip''}'
init_services: '[''haproxy'', ''haproxy'']'
private-address: 10.5.0.8
res_designate_ens3_vip: ocf:heartbeat:IPaddr2
res_designate_haproxy: lsb:haproxy
resource_params: '{''res_designate_ens3_vip'': '' params ip="10.5.100.10" nic="ens3"
  cidr_netmask="255.255.0.0"'', ''res_designate_haproxy'': '' op monitor interval="5s"''}'
resources: '{''res_designate_ens3_vip'': ''ocf:heartbeat:IPaddr2'', ''res_designate_haproxy'':
  ''lsb:haproxy''}'

14:11:35
clones: '{''cl_res_designate_haproxy'': ''res_designate_haproxy''}'
corosync_bindiface: eth0
corosync_mcastport: "4440"
groups: '{''grp_designate_vips'': ''res_designate_ens3_vip''}'
init_services: '[''haproxy'', ''haproxy'']'
private-address: 10.5.0.8
res_designate_ens3_vip: ocf:heartbeat:IPaddr2
res_designate_haproxy: lsb:haproxy
resource_params: '{''res_designate_ens3_vip'': '' params ip="10.5.100.10" nic="ens3"
  cidr_netmask="255.255.0.0"'', ''res_designate_haproxy'': '' op monitor interval="5s"''}'
resources: '{''res_designate_ens3_vip'': ''ocf:heartbeat:IPaddr2'', ''res_designate_haproxy'':
  ''lsb:haproxy''}'

Changed in charm-designate:
status: New → Confirmed
Revision history for this message
Liam Young (gnuoy) wrote :

Argh, the second entry should have been:

14:11:35

clones: '{''cl_res_designate_haproxy'': ''res_designate_haproxy''}'
corosync_bindiface: eth0
corosync_mcastport: "4440"
groups: '{''grp_designate_vips'': ''res_designate_ens3_vip''}'
init_services: '[''haproxy'']'
private-address: 10.5.0.8
res_designate_ens3_vip: ocf:heartbeat:IPaddr2
res_designate_haproxy: lsb:haproxy
resource_params: '{''res_designate_haproxy'': '' op monitor interval="5s"'', ''res_designate_ens3_vip'':
  '' params ip="10.5.100.10" nic="ens3" cidr_netmask="255.255.0.0"''}'
resources: '{''res_designate_haproxy'': ''lsb:haproxy'', ''res_designate_ens3_vip'':
  ''ocf:heartbeat:IPaddr2''}'
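The instability above can be sketched as follows. Under Python 3 (before 3.7), dict iteration order depends on per-process hash randomization, so str() of the same dict contents can differ between hook invocations; serializing from sorted keys is stable. `stable_repr` is a hypothetical helper for illustration, not the charm's code:

```python
def stable_repr(d):
    """Render a dict as a string whose form depends only on its contents,
    not on insertion order or hash randomization."""
    items = ", ".join("%r: %r" % (k, d[k]) for k in sorted(d))
    return "{%s}" % items

# Two dicts with identical contents but different insertion order
a = {"res_designate_ens3_vip": "ocf:heartbeat:IPaddr2",
     "res_designate_haproxy": "lsb:haproxy"}
b = {"res_designate_haproxy": "lsb:haproxy",
     "res_designate_ens3_vip": "ocf:heartbeat:IPaddr2"}

# str(a) and str(b) may differ across processes, but the sorted form never does
assert stable_repr(a) == stable_repr(b)
```

Because the relation value was built from str() of an unsorted dict, two hook runs could publish different strings for identical data, and each difference fired ha-relation-changed again.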

Changed in charm-designate:
importance: Undecided → High
milestone: none → 18.02
Revision history for this message
Tytus Kurek (tkurek) wrote :

This bug is actually causing more problems than I initially thought. It looks like the Pacemaker resource monitor action (res_designate_haproxy:2_monitor_5000) fails when it runs at the same time haproxy is being restarted, causing the LRMD to terminate:

Jan 17 20:47:57 juju-a2afdd-25-lxd-16 crmd[12508]: notice: te_rsc_command: Initiating action 13: monitor res_designate_haproxy:2_monitor_5000 on juju-a2afdd-25-lxd-16 (local)
Jan 17 20:47:57 juju-a2afdd-25-lxd-16 lrmd[12507]: error: crm_abort: operation_finished: Triggered fatal assert at services_linux.c:269 : op->pid == pid
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 pacemakerd[13843]: error: child_waitpid: Managed process 12507 (lrmd) dumped core
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 pacemakerd[13843]: notice: pcmk_child_exit: Child process lrmd terminated with signal 6 (pid=12507, core=1)
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 pacemakerd[13843]: error: pcmk_process_exit: Child respawn count exceeded by lrmd
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: error: internal_ipc_get_reply: Server disconnected client lrmd while waiting for msg id 9691
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: notice: crm_ipc_send: Connection to lrmd closed: Transport endpoint is not connected (-107)
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: error: lrmd_send_command: Couldn't perform lrmd_rsc_info operation (timeout=0): -107: Transport endpoint is not connected (107)
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: error: lrmd_send_command: LRMD disconnected
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: notice: do_lrm_invoke: Not creating resource for a delete event: (null)
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: warning: qb_ipcs_event_sendv: new_event_notification (12508-30736-14): Broken pipe (32)
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: error: crm_ipc_read: Connection to lrmd failed
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: error: mainloop_gio_callback: Connection to lrmd[0x55b22e68bbb0] closed (I/O condition=25)
Jan 17 20:47:58 juju-a2afdd-25-lxd-16 crmd[12508]: crit: lrm_connection_destroy: LRM Connection failed

This results in the node going offline in "crm status".

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This bug has been subscribed to the Canonical "Field High" list.

Revision history for this message
Chris Gregan (cgregan) wrote :

@Jason
Do we see this with the -next charms? The OpenStack team believes they have landed fixes for this in 18.02.

Revision history for this message
James Page (james-page) wrote :

OK, I think I managed to grab a diff from a single-unit gnocchi deployment:

--- /home/ubuntu/haproxy.cfg 2018-02-15 18:21:16.449872354 +0000
+++ /etc/haproxy/haproxy.cfg 2018-02-15 18:21:24.301899854 +0000
@@ -29,17 +29,17 @@ listen stats
 frontend tcp-in_gnocchi-api_admin
     bind *:8041
     bind :::8041
- acl net_172.20.5.86 dst 172.20.5.86/255.255.255.0
- use_backend gnocchi-api_admin_172.20.5.86 if net_172.20.5.86
     acl net_172.20.2.69 dst 172.20.2.69/255.255.255.0
     use_backend gnocchi-api_admin_172.20.2.69 if net_172.20.2.69
+ acl net_172.20.5.86 dst 172.20.5.86/255.255.255.0
+ use_backend gnocchi-api_admin_172.20.5.86 if net_172.20.5.86
     default_backend gnocchi-api_admin_172.20.2.69

-backend gnocchi-api_admin_172.20.5.86
- balance leastconn
- server gnocchi-0 172.20.5.86:8031 check
-
 backend gnocchi-api_admin_172.20.2.69
     balance leastconn
     server gnocchi-0 172.20.2.69:8031 check

+backend gnocchi-api_admin_172.20.5.86
+ balance leastconn
+ server gnocchi-0 172.20.5.86:8031 check
+

This has the original fix for bug 1698814, but the order of the backends still keeps switching.

Changed in charm-gnocchi:
status: New → Confirmed
importance: Undecided → High
summary: HAproxy is restarted every 5-10 minutes when using clustered designate
- application from 17.11 charm release
+ application
summary: - HAproxy is restarted every 5-10 minutes when using clustered designate
- application
+ HAproxy is restarted every 5-10 minutes when using designate application
Revision history for this message
James Page (james-page) wrote : Re: HAproxy is restarted every 5-10 minutes when using designate application
Changed in charm-gnocchi:
assignee: nobody → James Page (james-page)
Changed in charms.openstack:
assignee: nobody → James Page (james-page)
Changed in charm-gnocchi:
milestone: none → 18.02
Changed in charms.openstack:
status: New → In Progress
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote :

Note that the combination of {clustered|single unit} + multiple network spaces causes this issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charms.openstack (master)

Reviewed: https://review.openstack.org/545101
Committed: https://git.openstack.org/cgit/openstack/charms.openstack/commit/?id=af9a92f189e0b5c515d2bcae6bdec60957d7bc73
Submitter: Zuul
Branch: master

commit af9a92f189e0b5c515d2bcae6bdec60957d7bc73
Author: James Page <email address hidden>
Date: Thu Feb 15 19:36:38 2018 +0000

    Further use of ordered dicts.

    For consistent rendering of the haproxy template under Python 3,
    ensure that ordered dictionaries are used so that dict iteration
    always results in the same template output.

    Change-Id: I119027810e7f7976c794ff7f34a902d256f8963d
    Closes-Bug: 1737776
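The idea behind the fix can be sketched like this (a minimal stand-in for the haproxy template loop, not the charms.openstack code itself; `render` and `ordered_backends` are hypothetical names): build the template context from an OrderedDict keyed in sorted order, so iterating it always yields the same rendered text.

```python
from collections import OrderedDict

def ordered_backends(backends):
    """Return backends in a deterministic, sorted order for rendering."""
    return OrderedDict(sorted(backends.items()))

def render(backends):
    # Stand-in for the template loop that emits haproxy backend stanzas.
    lines = []
    for net, server in ordered_backends(backends).items():
        lines.append("backend gnocchi-api_admin_%s" % net)
        lines.append("    server %s check" % server)
    return "\n".join(lines)

# Identical contents, different insertion order: output is now stable,
# so the rendered haproxy.cfg no longer flips between two orderings.
a = render({"172.20.5.86": "172.20.5.86:8031",
            "172.20.2.69": "172.20.2.69:8031"})
b = render({"172.20.2.69": "172.20.2.69:8031",
            "172.20.5.86": "172.20.5.86:8031"})
assert a == b
```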

Changed in charms.openstack:
status: In Progress → Fix Released
Revision history for this message
James Page (james-page) wrote : Re: HAproxy is restarted every 5-10 minutes when using designate application

I've pushed rebuilds for gnocchi and designate with the proposed fix to:

  cs:~james-page/gnocchi
  cs:~james-page/designate

Testing now.

Changed in charm-gnocchi:
status: Confirmed → In Progress
Revision history for this message
James Page (james-page) wrote :

The residual issue with the hacluster charm restarting things will be fixed by the latest commit to the hacluster interface, which switches any consuming charms to the new JSON-formatted keys; these always produce the same data on the relation, avoiding spurious hook executions.
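A sketch of why JSON-formatted keys help (assuming the interface serializes with sorted keys, as `json.dumps(..., sort_keys=True)` does): the serialized value is byte-identical for equal dicts, so the relation data only changes when the underlying content does.

```python
import json

def relation_value(d):
    """Serialize relation data deterministically (hypothetical helper)."""
    return json.dumps(d, sort_keys=True)

a = {"res_designate_haproxy": "lsb:haproxy",
     "res_designate_eth0_vip": "ocf:heartbeat:IPaddr2"}
b = {"res_designate_eth0_vip": "ocf:heartbeat:IPaddr2",
     "res_designate_haproxy": "lsb:haproxy"}

# Equal contents always serialize to the same bytes, so setting this value
# repeatedly does not register as a relation change.
assert relation_value(a) == relation_value(b)
```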

summary: - HAproxy is restarted every 5-10 minutes when using designate application
+ HAproxy is restarted every 5-10 minutes when using reactive charms with
+ multiple network space bindings
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Rebuilds of the gnocchi and designate charms have been published to cs:next; flagging fix-committed.

Changed in charm-designate:
status: Confirmed → In Progress
status: In Progress → Fix Committed
Changed in charm-gnocchi:
status: In Progress → Fix Committed
Revision history for this message
Tytus Kurek (tkurek) wrote :

To be honest, I don't see the issue resolved after upgrading the affected designate installation to the following version: cs:~openstack-charmers-next/designate-68

root@juju-baab69-29-lxd-8:~# cat /var/log/juju/unit-designate-hacluster-63.log | grep "Stopping haproxy" | tail -n 5
2018-02-26 14:40:09 DEBUG ha-relation-changed * Stopping haproxy haproxy
2018-02-26 14:40:34 DEBUG ha-relation-changed * Stopping haproxy haproxy
2018-02-26 14:43:51 DEBUG ha-relation-changed * Stopping haproxy haproxy
2018-02-26 14:48:51 DEBUG ha-relation-changed * Stopping haproxy haproxy
2018-02-26 14:54:07 DEBUG ha-relation-changed * Stopping haproxy haproxy

Attaching log files from all 3 units.

Revision history for this message
Tytus Kurek (tkurek) wrote :

OK, so it looks like there's one more issue causing the "ha-relation-changed" hook to run: "json_init_services" keeps being appended with "haproxy" values. Please take a look at the following attempts:

TRY 1:

root@juju-baab69-29-lxd-8:/var/lib/juju/agents/unit-designate-hacluster-63/charm# relation-get -r 507 - designate/9
clones: '{''cl_res_designate_haproxy'': ''res_designate_haproxy''}'
corosync_bindiface: eth0
corosync_mcastport: "4440"
groups: '{''grp_designate_vips'': ''res_designate_eth0_vip''}'
init_services: '[''haproxy'']'
json_clones: '{"cl_res_designate_haproxy": "res_designate_haproxy"}'
json_groups: '{"grp_designate_vips": "res_designate_eth0_vip"}'
json_init_services: '["haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
...
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy"]'
json_res_designate_eth0_vip: '"ocf:heartbeat:IPaddr2"'
json_res_designate_haproxy: '"lsb:haproxy"'
json_resource_params: '{"res_designate_eth0_vip": " params ip=\"100.86.0.11\" nic=\"eth0\"
  cidr_netmask=\"255.255.240.0\"", "res_designate_haproxy": " op monitor interval=\"5s\""}'
json_resources: '{"res_designate_eth0_vip": "ocf:heartbeat:IPaddr2", "res_designate_haproxy":
  "lsb:haproxy"}'
private-address: 100.107.2.78
res_designate_eth0_vip: ocf:heartbeat:IPaddr2
res_designate_haproxy: lsb:haproxy
resource_params: '{''res_designate_haproxy'': '' op monitor interval="5s"'', ''res_designate_eth0_vip'':
  '' params ip="100.86.0.11" nic="eth0" cidr_netmask="255.255.240.0"''}'
resources: '{''res_designate_haproxy'': ''lsb:haproxy'', ''res_designate_eth0_vip'':
  ''ocf:heartbeat:IPaddr2''}'

TRY 2:

root@juju-baab69-29-lxd-8:/var/lib/juju/agents/unit-designate-hacluster-63/charm# relation-get -r 507 - designate/9
clones: '{''cl_res_designate_haproxy'': ''res_designate_haproxy''}'
corosync_bindiface: eth0
corosync_mcastport: "4440"
groups: '{''grp_designate_vips'': ''res_designate_eth0_vip''}'
init_services: '[''haproxy'']'
json_clones: '{"cl_res_designate_haproxy": "res_designate_haproxy"}'
json_groups: '{"grp_designate_vips": "res_designate_eth0_vip"}'
json_init_services: '["haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
...
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy"]'
json_res_designate_eth0_vip: '"ocf:heartbeat:IPaddr2"'
json_res_designate_haproxy: '"lsb:haproxy"'
json_resource_params: '{"res_designate_eth0_vip": " params ip=\"100.86.0.11\" nic=\"eth0\"
  cidr_netmask=\"255.255.240.0\"", "res_designate_haproxy": " op monitor interval=\"5s\""}'
json_resources: '{"res_designate_eth0_vip": "ocf:heartbeat:IPaddr2", "res_designate_haproxy":
  "lsb:haproxy"}'
private-address: 100.107.2.78
res_designate_eth0_vip: ocf:heartbeat:IPaddr2
res_designate_haproxy: lsb:haproxy
resource_params: '{''res_designate_haproxy'': '' op ...


Revision history for this message
Tytus Kurek (tkurek) wrote :

I can confirm that I see the same on all four environments I own at the moment. The issue is also present on a fresh LXD-based installation. The "json_init_services" parameter keeps being appended with "haproxy" values.

Revision history for this message
James Page (james-page) wrote :

I think this is a bug in the hacluster interface.

Revision history for this message
Tytus Kurek (tkurek) wrote :

James,

Should I raise a separate bug report for the "json_init_services" issue?

Revision history for this message
Tytus Kurek (tkurek) wrote :

I have a quick patch for the "json_init_services" parameter being filled with repeated "haproxy" values:

$ diff charm-interface-hacluster/common.py charm-interface-hacluster/common_patched.py
320c320,322
<             self['init_services'].extend(resources)
---
>             for resource in resources:
>                 if resource not in self['init_services']:
>                     self['init_services'].append(resource)

but I still have the following concerns:

1) Should the "json_*" parameters be present at all?
2) Should the "cluster_connected" function from "charm-designate/reactive/designate_handlers.py" run continuously? That appears to be the actual root cause of the "haproxy" values accumulating.

I am happy to continue working on the patch, and happy to create a new bug report. Just let me know what the direction should be.
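The append-if-absent idea in the patch above can be shown standalone (hypothetical function; the real change lives in charm-interface-hacluster's common.py). Repeated calls must not grow the list, or every hook run changes the relation data and retriggers itself:

```python
def add_init_services(init_services, resources):
    """Add each resource's init service at most once, so the update
    is idempotent across repeated hook executions."""
    for resource in resources:
        if resource not in init_services:
            init_services.append(resource)
    return init_services

services = []
add_init_services(services, ["haproxy"])
add_init_services(services, ["haproxy"])  # a second hook run is a no-op
assert services == ["haproxy"]
```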

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I filed bug 1754149, which may be what Tytus is seeing in recent runs; I haven't seen it live, only in a CI run.

Ryan Beisner (1chb1n)
Changed in charm-designate:
status: Fix Committed → Fix Released
Changed in charm-gnocchi:
status: Fix Committed → Fix Released