Re-configuring resource parameters dysfunctional (cannot reconfigure VIPs)

Bug #1513889 reported by Peter Sabaini
This bug affects 11 people
Affects                              Status        Importance  Assigned to  Milestone
OpenStack HA Cluster Charm           In Progress   Medium      Unassigned
hacluster (Juju Charms Collection)   Invalid       Undecided   Unassigned

Bug Description

I've updated a resource parameter (vip_cidr) that had been set erroneously on the principal, but the parameter change didn't take effect. From a cursory reading of hacluster/hooks/hooks.py, configure_cluster() is deliberately designed that way: there is a marker file that, once present, makes the function exit early.

The cluster should support reconfiguration, or at least document the fact that it doesn't.
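
For illustration, the failure mode looks roughly like this (a sketch only; the application name and CIDR value are made up, and the exact juju syntax depends on the Juju version):

juju config mysql vip_cidr=24

# config-changed runs and settles without error, but the marker file makes
# configure_cluster() return early, so the pacemaker resources keep their old
# parameters:
juju run -u mysql-hacluster/0 -- crm configure show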

Revision history for this message
Giorgio Di Guardia (giorgiodg) wrote :

+1

Revision history for this message
Alvaro Uria (aluria) wrote :

This issue happens not only with the vip_cidr parameter, but also with vip.

On the other hand, when removing the relation (i.e. mysql - mysql-hacluster), the directory /var/lib/pacemaker does not get removed (as /var/lib/corosync does). Re-creating the relation will therefore reuse the old data.

James Page (james-page)
Changed in hacluster (Juju Charms Collection):
status: New → Invalid
tags: added: cpe cpe-sa
summary: - Re-configuring resource parameters dysfunctional
+ Re-configuring resource parameters dysfunctional (cannot reconfigure VIPs)
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

+1 on this

If you ever configure an incorrect VIP, you won't be able to change it afterwards. The relation data gets passed from the primary unit to the subordinate hacluster, but hacluster does not react properly to the config change.

http://paste.ubuntu.com/25096416/

Although VIPs normally do not change often, when they do it is impossible to change them without manual intervention. Also, config-changed hooks silently succeed, making you wonder what is wrong with the service.

It seems that the only workaround is to remove the relation between the primary application and the subordinate, wait until hacluster is cleaned up, remove the leftovers and the VIPs from the interfaces, and re-add the relation, as sketched below.
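
A sketch of that workaround, with illustrative application names (heat and its hacluster subordinate, matching the log below); the exact leftovers to remove will vary:

juju remove-relation heat hacluster-heat
# wait for the hacluster units' stop hooks to finish, then on each node:
#   - clean out leftover state such as /etc/corosync and /var/lib/pacemaker
#   - drop the old VIP from the interface if it is still assigned (ip addr del ...)
juju add-relation heat hacluster-heat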

After running remove-relation, the stop hook hangs for ~2 minutes after printing 'Purging configuration files for corosync':

(openstack-client) ubuntu@maas:~/bundles$ juju debug-log -i unit-hacluster-heat*
unit-hacluster-heat-1: 10:04:20 DEBUG unit.hacluster-heat/1.stop dpkg: warning: while removing corosync, directory '/etc/corosync' not empty so not removed
unit-hacluster-heat-1: 10:04:20 DEBUG unit.hacluster-heat/1.stop Processing triggers for man-db (2.7.5-1) ...
unit-hacluster-heat-0: 10:04:20 DEBUG unit.hacluster-heat/0.stop Removing corosync (2.3.5-3ubuntu1) ...
unit-hacluster-heat-2: 10:04:20 DEBUG unit.hacluster-heat/2.stop Removing corosync (2.3.5-3ubuntu1) ...
unit-hacluster-heat-0: 10:04:21 DEBUG unit.hacluster-heat/0.stop Purging configuration files for corosync (2.3.5-3ubuntu1) ...
unit-hacluster-heat-2: 10:04:21 DEBUG unit.hacluster-heat/2.stop Purging configuration files for corosync (2.3.5-3ubuntu1) ...
unit-hacluster-heat-0: 10:04:21 DEBUG unit.hacluster-heat/0.stop dpkg: warning: while removing corosync, directory '/etc/corosync' not empty so not removed
unit-hacluster-heat-0: 10:04:21 DEBUG unit.hacluster-heat/0.stop Processing triggers for man-db (2.7.5-1) ...
unit-hacluster-heat-2: 10:04:21 DEBUG unit.hacluster-heat/2.stop dpkg: warning: while removing corosync, directory '/etc/corosync' not empty so not removed
unit-hacluster-heat-2: 10:04:21 DEBUG unit.hacluster-heat/2.stop Processing triggers for man-db (2.7.5-1) ...
unit-hacluster-heat-0: 10:04:20 DEBUG unit.hacluster-heat/0.stop Purging configuration files for pacemaker (1.1.14-2ubuntu1.1) ...
unit-hacluster-heat-2: 10:04:20 DEBUG unit.hacluster-heat/2.stop Purging configuration files for pacemaker (1.1.14-2ubuntu1.1) ...
unit-hacluster-heat-1: 10:04:19 DEBUG unit.hacluster-heat/1.stop Purging configuration files for pacemaker (1.1.14-2ubuntu1.1) ...
unit-hacluster-heat-1: 10:04:19 DEBUG unit.hacluster-heat/1.stop Removing corosync (2.3.5-3ubuntu1) ...
unit-hacluster-heat-1: 10:04:20 DEBUG unit.hacluster-heat/1.stop Purging configuration files for corosync (2.3.5-3ubuntu1) ...
unit-hacluster-heat-0: 10:06:23 ERROR unit.hacluster-heat/0.juju-log Pacemaker is down. Please manually start it.
unit-hacluster-heat-0: 10:06:23 INFO juju.worker.uniter.operation ran "stop" hook
unit-hacluster-heat-2: 10:06:23 ERROR unit.hacluster-heat/2.juju-log Pacemaker is down. Please manually start it.
unit-hacluster-heat-1: 10:06:22 ERROR unit.hacluster-heat/1.juju-log Pacemaker is down. Please manually start it.
unit-hacluster-heat-1: 10:...


Changed in charm-hacluster:
status: New → Confirmed
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Also, in addition to the steps in comment #3, I had to manually run 'pkill -f corosync' on each node before I could properly `systemctl restart corosync`, or before the charm could report anything other than:

  hacluster-heat/7 blocked executing <addr> Pacemaker is down. Please manually start it.

It appears that even after the packages are purged, the related processes are not killed.
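
In other words, roughly the following on each affected node (a small sketch of the manual step above):

# the packages were purged but the daemons kept running, so kill them first
sudo pkill -f corosync
# then restart the service once corosync is installed again (e.g. after re-adding the relation)
sudo systemctl restart corosync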

tags: added: cpec
Ante Karamatić (ivoks)
tags: added: cpe-onsite
removed: cpe cpe-sa cpec
James Page (james-page)
Changed in charm-hacluster:
importance: Undecided → Medium
status: Confirmed → Triaged
Revision history for this message
Xav Paice (xavpaice) wrote :

We changed the VIP on an application recently and did not remove/re-add the hacluster relation. The new VIP (.43) was configured as an extra resource outside the resource group, and the old VIP (.42) was left alone:

vault-2# crm configure edit

node 1000: vault-1
node 1001: vault-2
node 1002: vault-3
primitive res_vault-ext_225dab8_vip IPaddr2 \
        params ip=172.27.84.42 \
        meta migration-threshold=INFINITY failure-timeout=5s \
        op monitor timeout=20s interval=10s depth=0
primitive res_vault-ext_8f67fe7_vip IPaddr2 \
        params ip=172.27.84.43 \
        meta migration-threshold=INFINITY failure-timeout=5s \
        op monitor timeout=20s interval=10s depth=0
group grp_vault-ext_vips res_vault-ext_225dab8_vip
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.18-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        cluster-name=debian \
        no-quorum-policy=stop \
        cluster-recheck-interval=60 \
        stonith-enabled=false \
        last-lrm-refresh=1631512275
rsc_defaults rsc-options: \
        resource-stickiness=100 \
        failure-timeout=180

I then deleted res_vault-ext_225dab8_vip and added res_vault-ext_8f67fe7_vip to grp_vault-ext_vips manually as a clean-up. The pacemaker status is good and the cluster operates correctly; however, the charm status is now "waiting", with "Resource: res_vault-ext_225dab8_vip not yet configured".
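
For reference, that manual clean-up can also be scripted (a sketch assuming crmsh and the resource names from the output above; the modgroup subcommand may not exist on older crmsh releases):

# move the new VIP into the resource group, then drop the old primitive
sudo crm configure modgroup grp_vault-ext_vips add res_vault-ext_8f67fe7_vip
sudo crm resource stop res_vault-ext_225dab8_vip
sudo crm configure delete res_vault-ext_225dab8_vip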

This prevents a clean-up even when the charm itself does not support removing resources that are no longer needed (i.e. manual removal is the only option), because, as pointed out by the author of LP:#1943422, the old resource is still present in the relation data:

juju run -u hacluster-vault/0 -- relation-get -r 300 - vault/0
corosync_mcastport: "4440"
egress-subnets: 172.27.84.62/32
ingress-address: 172.27.84.62
json_groups: '{"grp_vault-ext_vips": "res_vault-ext_225dab8_vip res_vault-ext_8f67fe7_vip"}'
json_resource_params: '{"res_vault-ext_225dab8_vip": " params ip=\"172.27.84.42\" meta
  migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\"
  interval=\"10s\" depth=\"0\"", "res_vault-ext_8f67fe7_vip": " params ip=\"172.27.84.43\" meta
  migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\"
  interval=\"10s\" depth=\"0\""}'
json_resources: '{"res_vault-ext_225dab8_vip": "ocf:heartbeat:IPaddr2", "res_vault-ext_8f67fe7_vip":
  "ocf:heartbeat:IPaddr2"}'

Revision history for this message
Joe Guo (guoqiao) wrote :

I tried to remove `res_vault-ext_225dab8_vip` from the relation data with relation-set, but it doesn't work.

The CLI commands I tried:

1) full command:

juju run -u hacluster-vault/0 -- relation-set -r 300 json_groups='{"grp_vault-ext_vips": "res_vault-ext_8f67fe7_vip"}' json_resource_params='{"res_vault-ext_8f67fe7_vip": " params ip=\"172.27.84.43\" meta migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\" interval=\"10s\" depth=\"0\""}' json_resources='{"res_vault-ext_8f67fe7_vip": "ocf:heartbeat:IPaddr2"}'

2) command with the --file option:

cat 0.yaml

json_groups: '{"grp_vault-ext_vips": "res_vault-ext_8f67fe7_vip"}'
json_resource_params: '{"res_vault-ext_8f67fe7_vip": " params ip=\"172.27.84.43\" meta migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\" interval=\"10s\" depth=\"0\""}'
json_resources: '{"res_vault-ext_8f67fe7_vip": "ocf:heartbeat:IPaddr2"}'

juju scp 0.yaml hacluster-vault/0:/tmp/
juju run -u hacluster-vault/0 -- relation-set -r 300 --file /tmp/0.yaml

3) modifying only a single key:

juju run -u hacluster-vault/0 -- relation-set -r 300 json_resources='{"res_vault-ext_8f67fe7_vip": "ocf:heartbeat:IPaddr2"}'

With any of the above methods there is no error, but also no effect: the relation data remains the same.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I was able to successfully delete old resources without removing the relations. Basically, relation-set needs to be used like this:

juju run -u glance/1 -- relation-set -r ha:43 json_delete_resources='["res_glance_ens3_vip","res_glance_242d562_vip"]'

Notice that I'm running relation-set on glance, because it is glance that writes the data into that relation.

So there is a json_delete_resources key that triggers the clean-up of VIPs in the hacluster charm. When VIPs are changed, most charms (like glance) replace the relation data with the new VIP data, which hacluster then applies; however, it does not clean up the old resources in pacemaker, so you end up with the new VIP plus the old ones you already had.

Some other charms, like placement, behave differently. Every time the VIP for placement is changed, the charm appends to the relation data instead of overriding it, so the relation-set ends up looking a bit like what Joe Guo posted above. It is therefore necessary to reset the relation data to only the VIPs you want, manually removing the old ones, in addition to passing json_delete_resources to clean the old resource out of pacemaker:

juju run -u placement/0 -- relation-set -r ha:48 json_groups='{"grp_placement_vips": "res_placement_df14adf_vip"}' json_resources='{"res_placement_df14adf_vip": "ocf:heartbeat:IPaddr2", "res_placement_haproxy": "lsb:haproxy"}' json_delete_resources='["res_placement_ens3_vip","res_placement_72e5569_vip"]'

The charms that append data to the relation (like placement) do so every time they write to the relation, even after the VIP config has been changed, so they must be storing the old VIPs somewhere. Our ongoing investigation points to the unit's sqlite database, but I still need to test further to devise a proper series of steps to clean that up.

tags: added: sts
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)
Changed in charm-hacluster:
status: Triaged → In Progress
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

The above change addresses the issue from the hacluster point of view. However, reactive charms such as placement and vault use the charm-interface-hacluster interface and therefore behave differently from classic charms. Classic charms always provide the VIP value from config in the relation data. Reactive charms using that interface append the reconfigured values and end up providing multiple VIPs in the relation data. The interface and/or the charms using it need a separate fix so that the hacluster charm always receives the correct data from the relation.

no longer affects: charm-placement
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-hacluster (master)

Change abandoned by "Rodrigo Barbieri <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-hacluster/+/818996
Reason: this is still a very nice improvement to have, but we do not currently have a meaningful demand or motivation for further working on it at the moment. Abandoning it
