Upgrade from Xenial to Bionic breaks pacemaker resource naming

Bug #1838528 reported by Alvaro Uria
This bug affects 3 people
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: High
Assigned to: Felipe Reyes
Milestone: 19.10

Bug Description

A cloud environment was upgraded from Xenial to Bionic using "juju upgrade-series" and by manually running "do-release-upgrade", following the Juju team playbook.

Containers have ethX interfaces, and the pacemaker VIP resource names on all the containers that use charm-hacluster are of the form "res_eth0_vip" (e.g. "res_nova_eth0_vip" below).

After the upgrade to Bionic, I had to fix the duplicated VIP manually by removing the resource that was not part of any group:
https://pastebin.ubuntu.com/p/pb7nKhS4Kg/

However, I expect to run into the same issue again, because the "ha" relation now shows a new resource name ("res_nova_d3367e9_vip").

I think charm-hacluster should either reuse the already-configured resource or reconfigure (rename) it to the new name.
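
For illustration, a manual version of that approach might look like the following. This is a sketch only, not something the charm does today: it assumes crmsh offers the "configure rename" subcommand on this release, and it uses the old and new resource names from the relation data below.

crm -w -F resource stop res_nova_eth0_vip        # rename requires the resource to be stopped
crm configure rename res_nova_eth0_vip res_nova_d3367e9_vip
crm configure show grp_nova_vips                 # verify the group now references the new name
crm resource start res_nova_d3367e9_vip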

~$ u=ncc-hacluster/5;r=ha; juju run --unit $u "relation-ids $r| xargs -I_@ sh -c 'relation-list -r _@|xargs -I_U sh -c \"relation-get -r _@ - _U |sed s,^,_U:, 2>&1\"'"
nova-cloud-controller/3:clones: '{''cl_nova_haproxy'': ''res_nova_haproxy''}'
nova-cloud-controller/3:clustered: "yes"
nova-cloud-controller/3:colocations: '{}'
nova-cloud-controller/3:corosync_bindiface: eth0
nova-cloud-controller/3:corosync_mcastport: "5404"
nova-cloud-controller/3:delete_resources: '[''vip_consoleauth'', ''res_nova_consoleauth'']'
nova-cloud-controller/3:egress-subnets: 10.28.2.52/32
nova-cloud-controller/3:groups: '{''grp_nova_vips'': ''res_nova_eth0_vip''}'
nova-cloud-controller/3:ingress-address: 10.28.2.52
nova-cloud-controller/3:init_services: '{}'
nova-cloud-controller/3:json_clones: '{"cl_nova_haproxy":"res_nova_haproxy"}'
nova-cloud-controller/3:json_delete_resources: '["vip_consoleauth","res_nova_consoleauth","res_nova_eth0_vip"]'
nova-cloud-controller/3:json_groups: '{"grp_nova_vips":"res_nova_d3367e9_vip"}'
nova-cloud-controller/3:json_init_services: '{"res_nova_haproxy":"haproxy"}'
nova-cloud-controller/3:json_resource_params: '{"res_nova_d3367e9_vip":"params ip=\"10.28.3.242\" op monitor
nova-cloud-controller/3: depth=\"0\" timeout=\"20s\" interval=\"10s\"","res_nova_haproxy":"op monitor interval=\"5s\""}'
nova-cloud-controller/3:json_resources: '{"res_nova_d3367e9_vip":"ocf:heartbeat:IPaddr2","res_nova_haproxy":"lsb:haproxy"}'
nova-cloud-controller/3:private-address: 10.28.2.52
nova-cloud-controller/3:resource_params: '{}'
nova-cloud-controller/3:resources: '{}'

Revision history for this message
Nick Niehoff (nniehoff) wrote :

This also occurs when upgrading charms.

Reproducer:

1. Deploy bundle:

http://paste.ubuntu.com/p/KRTgNsr9Kh/

2. Verify "crm_mon -Af -1" before continuing:

# crm_mon -Af -1
Last updated: Tue Oct 8 16:13:57 2019 Last change: Tue Oct 8 16:11:00 2019 by hacluster via crmd on juju-93c0af-default-0
Stack: corosync
Current DC: juju-93c0af-default-3 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ juju-93c0af-default-0 juju-93c0af-default-2 juju-93c0af-default-3 ]

 Resource Group: grp_ks_vips
     res_ks_ens3_vip (ocf::heartbeat:IPaddr2): Started juju-93c0af-default-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-93c0af-default-0 juju-93c0af-default-2 juju-93c0af-default-3 ]

Node Attributes:
* Node juju-93c0af-default-0:
* Node juju-93c0af-default-2:
* Node juju-93c0af-default-3:

Migration Summary:
* Node juju-93c0af-default-3:
* Node juju-93c0af-default-0:
* Node juju-93c0af-default-2:

3. Upgrade keystone

juju upgrade-charm keystone

4. Once complete, upgrade hacluster

juju upgrade-charm hacluster

5. The hacluster charm is now stuck in a waiting state: "Resource: res_ks_242d562_vip not yet configured"
6. Investigate "crm_mon -Af -1" and compare it against the relation data (see the sketch after the output below):

# crm_mon -Af -1
Last updated: Tue Oct 8 16:32:19 2019 Last change: Tue Oct 8 16:22:15 2019 by root via cibadmin on juju-93c0af-default-3
Stack: corosync
Current DC: juju-93c0af-default-3 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ juju-93c0af-default-0 juju-93c0af-default-2 juju-93c0af-default-3 ]

 Resource Group: grp_ks_vips
     res_ks_ens3_vip (ocf::heartbeat:IPaddr2): Started juju-93c0af-default-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-93c0af-default-0 juju-93c0af-default-2 juju-93c0af-default-3 ]

Node Attributes:
* Node juju-93c0af-default-0:
* Node juju-93c0af-default-2:
* Node juju-93c0af-default-3:

Migration Summary:
* Node juju-93c0af-default-3:
* Node juju-93c0af-default-0:
* Node juju-93c0af-default-2:
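
A quick way to confirm the mismatch is to compare what pacemaker has configured with what the principal charm now requests over the "ha" relation. This is a sketch only; the relation id is a placeholder, and unit and resource names follow the reproducer above:

crm configure show | grep _vip                   # run on one of the cluster nodes;
#   shows the old, interface-based primitive: res_ks_ens3_vip
juju run --unit keystone/0 'relation-ids ha'     # find the ha relation id
juju run --unit keystone/0 'relation-get -r ha:<id> - keystone/0' | grep json_groups
#   shows the new, hashed name: '{"grp_ks_vips":"res_ks_242d562_vip"}'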

Felipe Reyes (freyes)
tags: added: sts
Changed in charm-hacluster:
milestone: none → 19.10
Felipe Reyes (freyes)
Changed in charm-hacluster:
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Felipe Reyes (freyes) wrote :

keystone added the key "json_delete_resources", but there was an issue during the deletion:

2019-10-10 20:23:08 DEBUG juju-log ha:4: Deleting Resources
2019-10-10 20:23:09 DEBUG juju-log ha:4: Cleanuping and deleting resource res_ks_eth0_vip
2019-10-10 20:23:10 DEBUG ha-relation-changed Waiting for 3 replies from the CRMd... OK
2019-10-10 20:23:10 DEBUG ha-relation-changed Cleaning up res_ks_eth0_vip on juju-9a5dd5-0, removing fail-count-res_ks_eth0_vip
2019-10-10 20:23:10 DEBUG ha-relation-changed Cleaning up res_ks_eth0_vip on juju-9a5dd5-1, removing fail-count-res_ks_eth0_vip
2019-10-10 20:23:10 DEBUG ha-relation-changed Cleaning up res_ks_eth0_vip on juju-9a5dd5-2, removing fail-count-res_ks_eth0_vip
2019-10-10 20:23:11 DEBUG ha-relation-changed waiting for Stopping res_ks_eth0_vip to finish . done
2019-10-10 20:23:11 DEBUG ha-relation-changed ERROR: resource res_ks_eth0_vip is running, can't delete it

$ juju run --unit keystone/0 'relation-get -r ha:4 - keystone/0'
clones: '{''cl_ks_haproxy'': ''res_ks_haproxy''}'
corosync_bindiface: eth0
corosync_mcastport: "5434"
egress-subnets: 192.168.10.106/32
groups: '{''grp_ks_vips'': ''res_ks_eth0_vip''}'
ingress-address: 192.168.10.106
init_services: '{''res_ks_haproxy'': ''haproxy''}'
json_clones: '{"cl_ks_haproxy":"res_ks_haproxy"}'
json_delete_resources: '["res_ks_eth0_vip"]'
json_groups: '{"grp_ks_vips":"res_ks_e2590a7_vip"}'
json_init_services: '{"res_ks_haproxy":"haproxy"}'
json_resource_params: '{"res_ks_e2590a7_vip":"params ip=\"192.168.10.99\" op monitor
  depth=\"0\" timeout=\"20s\" interval=\"10s\"","res_ks_haproxy":"op monitor interval=\"5s\""}'
json_resources: '{"res_ks_e2590a7_vip":"ocf:heartbeat:IPaddr2","res_ks_haproxy":"lsb:haproxy"}'
private-address: 192.168.10.106
resource_params: '{''res_ks_eth0_vip'': ''params ip="192.168.10.99" cidr_netmask="255.255.255.0"
  nic="eth0"'', ''res_ks_haproxy'': ''op monitor interval="5s"''}'
resources: '{''res_ks_eth0_vip'': ''ocf:heartbeat:IPaddr2'', ''res_ks_haproxy'': ''lsb:haproxy''}'

Revision history for this message
Felipe Reyes (freyes) wrote :

The problem is that the resource is not being stopped before attempting the deletion.

root@juju-9a5dd5-1:/var/lib/juju/agents/unit-hacluster-0/charm# #crm resource cleanup res_ks_eth0_vip && crm -w -F resource stop res_ks_eth0_vip && crm -d -w -F configure delete res_ks_eth0_vip
root@juju-9a5dd5-1:/var/lib/juju/agents/unit-hacluster-0/charm# crm resource cleanup res_ks_eth0_vip && crm -d -w -F configure delete res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-0, removing fail-count-res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-1, removing fail-count-res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-2, removing fail-count-res_ks_eth0_vip
Waiting for 3 replies from the CRMd... OK
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
DEBUG: found pacemaker version: 1.1.14
DEBUG: Using crm_resource for agent discovery
DEBUG: resolve_references: res_ks_eth0_vip -> primitive:res_ks_eth0_vip
DEBUG: resolve_references: res_ks_haproxy -> primitive:res_ks_haproxy
waiting for Stopping res_ks_eth0_vip to finish . done
ERROR: resource res_ks_eth0_vip is running, can't delete it
root@juju-9a5dd5-1:/var/lib/juju/agents/unit-hacluster-0/charm# crm resource cleanup res_ks_eth0_vip && crm -w -F resource stop res_ks_eth0_vip && crm -d -w -F configure delete res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-0, removing fail-count-res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-1, removing fail-count-res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-2, removing fail-count-res_ks_eth0_vip
Waiting for 3 replies from the CRMd... OK
waiting for stop to finish . done
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
DEBUG: found pacemaker version: 1.1.14
DEBUG: Using crm_resource for agent discovery
DEBUG: resolve_references: res_ks_eth0_vip -> primitive:res_ks_eth0_vip
DEBUG: resolve_references: res_ks_haproxy -> primitive:res_ks_haproxy
DEBUG: remove object group:grp_ks_vips
DEBUG: remove object primitive:res_ks_eth0_vip
DEBUG: create configuration section rsc_defaults
DEBUG: Input: <cib crm_feature_set="3.0.10" validate-with="pacemaker-2.4" epoch="39" num_updates="4" admin_epoch="0" cib-last-written="Thu Oct 10 21:17:07 2019" update-origin="juju-9a5dd5-1" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="1002">
  <configuration><crm_config><cluster_property_set id="cib-bootstrap-options"><nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/><nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.14-70404b0"/><nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/><nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="debian"/><nvpair name="no-quorum-policy" value="stop" id="cib-bootstrap-options-no-quorum-policy"/><nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/><nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1570742227"/><nvpair name="cluster-recheck-interval" value="60" id="cib-bootstrap-options-cluster-recheck-interval"/></cluster_property_set></crm_confi...


Changed in charm-hacluster:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687987

Changed in charm-hacluster:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/687987
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=666055844e13b556ded97f4c92f3089e272507e8
Submitter: Zuul
Branch: master

commit 666055844e13b556ded97f4c92f3089e272507e8
Author: Felipe Reyes <email address hidden>
Date: Thu Oct 10 18:24:53 2019 -0300

    Stop resource before deleting it.

    Pacemaker will refuse to delete a resource that is running, so it always
    needs to be stopped before deleting it.

    Change-Id: I3c6acdef401e9ec18fedc65e9c77db4719fe60ec
    Closes-Bug: #1838528
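
For reference, the stop-before-delete sequence this change enforces is essentially the one shown working in the terminal session above (a sketch using the resource name from that session; the actual implementation is in the review linked above):

crm resource cleanup res_ks_eth0_vip             # clear any stale fail counts first
crm -w -F resource stop res_ks_eth0_vip          # -w waits for the stop to complete
crm -d -w -F configure delete res_ks_eth0_vip    # the delete only succeeds once the resource is stopped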

Changed in charm-hacluster:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-hacluster:
status: Fix Committed → Fix Released
Revision history for this message
Trent Lloyd (lathiat) wrote :

The fix for this bug does not work if you upgrade the principal charm *before* the hacluster charm. It only works if you upgrade hacluster first. See Bug #1866145 for more details.
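
In practice that means upgrading in this order (a sketch; application names follow the reproducer above):

juju upgrade-charm hacluster    # upgrade the hacluster subordinate first
juju upgrade-charm keystone     # then the principal charm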
