Upgrade from Xenial to Bionic breaks pacemaker resource naming

Bug #1838528 reported by Alvaro Uria
This bug affects 3 people
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: High
Assigned to: Felipe Reyes
Milestone: 19.10

Bug Description

A cloud environment was upgraded from Xenial to Bionic using "juju upgrade-series" and by manually running "do-release-upgrade", following the Juju team playbook.

Containers have ethX interfaces, and the pacemaker VIP resource names on all the containers that use charm-hacluster are of the form "res_eth0_vip" (e.g. "res_nova_eth0_vip" below).

After the upgrade to Bionic, I had to fix the duplicated VIP manually by removing the resource that was not part of any group:
https://pastebin.ubuntu.com/p/pb7nKhS4Kg/

However, I expect to run into the same issue again, because the "ha" relation now shows a new resource name ("res_nova_d3367e9_vip").

I think charm-hacluster should either reuse the already-configured resource or reconfigure (rename) it to the new name.
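
For illustration, a manual version of that approach might look like the following. This is a sketch only, not something the charm does today: it assumes crmsh offers the "configure rename" subcommand on this release, and it uses the old and new resource names from the relation data below.

crm -w -F resource stop res_nova_eth0_vip        # rename requires the resource to be stopped
crm configure rename res_nova_eth0_vip res_nova_d3367e9_vip
crm configure show grp_nova_vips                 # verify the group now references the new name
crm resource start res_nova_d3367e9_vip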

~$ u=ncc-hacluster/5;r=ha; juju run --unit $u "relation-ids $r| xargs -I_@ sh -c 'relation-list -r _@|xargs -I_U sh -c \"relation-get -r _@ - _U |sed s,^,_U:, 2>&1\"'"
nova-cloud-controller/3:clones: '{''cl_nova_haproxy'': ''res_nova_haproxy''}'
nova-cloud-controller/3:clustered: "yes"
nova-cloud-controller/3:colocations: '{}'
nova-cloud-controller/3:corosync_bindiface: eth0
nova-cloud-controller/3:corosync_mcastport: "5404"
nova-cloud-controller/3:delete_resources: '[''vip_consoleauth'', ''res_nova_consoleauth'']'
nova-cloud-controller/3:egress-subnets: 10.28.2.52/32
nova-cloud-controller/3:groups: '{''grp_nova_vips'': ''res_nova_eth0_vip''}'
nova-cloud-controller/3:ingress-address: 10.28.2.52
nova-cloud-controller/3:init_services: '{}'
nova-cloud-controller/3:json_clones: '{"cl_nova_haproxy":"res_nova_haproxy"}'
nova-cloud-controller/3:json_delete_resources: '["vip_consoleauth","res_nova_consoleauth","res_nova_eth0_vip"]'
nova-cloud-controller/3:json_groups: '{"grp_nova_vips":"res_nova_d3367e9_vip"}'
nova-cloud-controller/3:json_init_services: '{"res_nova_haproxy":"haproxy"}'
nova-cloud-controller/3:json_resource_params: '{"res_nova_d3367e9_vip":"params ip=\"10.28.3.242\" op monitor
nova-cloud-controller/3: depth=\"0\" timeout=\"20s\" interval=\"10s\"","res_nova_haproxy":"op monitor interval=\"5s\""}'
nova-cloud-controller/3:json_resources: '{"res_nova_d3367e9_vip":"ocf:heartbeat:IPaddr2","res_nova_haproxy":"lsb:haproxy"}'
nova-cloud-controller/3:private-address: 10.28.2.52
nova-cloud-controller/3:resource_params: '{}'
nova-cloud-controller/3:resources: '{}'

Revision history for this message
Nick Niehoff (nniehoff) wrote :

This also occurs when upgrading charms.

Reproducer:

1. Deploy bundle:

http://paste.ubuntu.com/p/KRTgNsr9Kh/

2. Verify "crm_mon -Af -1" before continuing:

# crm_mon -Af -1
Last updated: Tue Oct 8 16:13:57 2019 Last change: Tue Oct 8 16:11:00 2019 by hacluster via crmd on juju-93c0af-default-0
Stack: corosync
Current DC: juju-93c0af-default-3 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ juju-93c0af-default-0 juju-93c0af-default-2 juju-93c0af-default-3 ]

 Resource Group: grp_ks_vips
     res_ks_ens3_vip (ocf::heartbeat:IPaddr2): Started juju-93c0af-default-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-93c0af-default-0 juju-93c0af-default-2 juju-93c0af-default-3 ]

Node Attributes:
* Node juju-93c0af-default-0:
* Node juju-93c0af-default-2:
* Node juju-93c0af-default-3:

Migration Summary:
* Node juju-93c0af-default-3:
* Node juju-93c0af-default-0:
* Node juju-93c0af-default-2:

3. Upgrade keystone

juju upgrade-charm keystone

4. Once complete, upgrade hacluster

juju upgrade-charm hacluster

5. The hacluster charm is now stuck in a waiting state: "Resource: res_ks_242d562_vip not yet configured"
6. Investigate "crm_mon -Af -1" and compare it against the relation data (see the sketch after the output below):

# crm_mon -Af -1
Last updated: Tue Oct 8 16:32:19 2019 Last change: Tue Oct 8 16:22:15 2019 by root via cibadmin on juju-93c0af-default-3
Stack: corosync
Current DC: juju-93c0af-default-3 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ juju-93c0af-default-0 juju-93c0af-default-2 juju-93c0af-default-3 ]

 Resource Group: grp_ks_vips
     res_ks_ens3_vip (ocf::heartbeat:IPaddr2): Started juju-93c0af-default-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-93c0af-default-0 juju-93c0af-default-2 juju-93c0af-default-3 ]

Node Attributes:
* Node juju-93c0af-default-0:
* Node juju-93c0af-default-2:
* Node juju-93c0af-default-3:

Migration Summary:
* Node juju-93c0af-default-3:
* Node juju-93c0af-default-0:
* Node juju-93c0af-default-2:
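
A quick way to confirm the mismatch is to compare what pacemaker has configured with what the principal charm now requests over the "ha" relation. This is a sketch only; the relation id is a placeholder, and unit and resource names follow the reproducer above:

crm configure show | grep _vip                   # run on one of the cluster nodes;
#   shows the old, interface-based primitive: res_ks_ens3_vip
juju run --unit keystone/0 'relation-ids ha'     # find the ha relation id
juju run --unit keystone/0 'relation-get -r ha:<id> - keystone/0' | grep json_groups
#   shows the new, hashed name: '{"grp_ks_vips":"res_ks_242d562_vip"}'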

Felipe Reyes (freyes)
tags: added: sts
Changed in charm-hacluster:
milestone: none → 19.10
Felipe Reyes (freyes)
Changed in charm-hacluster:
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Felipe Reyes (freyes) wrote :

keystone added the key "json_delete_resources", but there was an issue during the deletion:

2019-10-10 20:23:08 DEBUG juju-log ha:4: Deleting Resources
2019-10-10 20:23:09 DEBUG juju-log ha:4: Cleanuping and deleting resource res_ks_eth0_vip
2019-10-10 20:23:10 DEBUG ha-relation-changed Waiting for 3 replies from the CRMd... OK
2019-10-10 20:23:10 DEBUG ha-relation-changed Cleaning up res_ks_eth0_vip on juju-9a5dd5-0, removing fail-count-res_ks_eth0_vip
2019-10-10 20:23:10 DEBUG ha-relation-changed Cleaning up res_ks_eth0_vip on juju-9a5dd5-1, removing fail-count-res_ks_eth0_vip
2019-10-10 20:23:10 DEBUG ha-relation-changed Cleaning up res_ks_eth0_vip on juju-9a5dd5-2, removing fail-count-res_ks_eth0_vip
2019-10-10 20:23:11 DEBUG ha-relation-changed waiting for Stopping res_ks_eth0_vip to finish . done
2019-10-10 20:23:11 DEBUG ha-relation-changed ERROR: resource res_ks_eth0_vip is running, can't delete it

$ juju run --unit keystone/0 'relation-get -r ha:4 - keystone/0'
clones: '{''cl_ks_haproxy'': ''res_ks_haproxy''}'
corosync_bindiface: eth0
corosync_mcastport: "5434"
egress-subnets: 192.168.10.106/32
groups: '{''grp_ks_vips'': ''res_ks_eth0_vip''}'
ingress-address: 192.168.10.106
init_services: '{''res_ks_haproxy'': ''haproxy''}'
json_clones: '{"cl_ks_haproxy":"res_ks_haproxy"}'
json_delete_resources: '["res_ks_eth0_vip"]'
json_groups: '{"grp_ks_vips":"res_ks_e2590a7_vip"}'
json_init_services: '{"res_ks_haproxy":"haproxy"}'
json_resource_params: '{"res_ks_e2590a7_vip":"params ip=\"192.168.10.99\" op monitor
  depth=\"0\" timeout=\"20s\" interval=\"10s\"","res_ks_haproxy":"op monitor interval=\"5s\""}'
json_resources: '{"res_ks_e2590a7_vip":"ocf:heartbeat:IPaddr2","res_ks_haproxy":"lsb:haproxy"}'
private-address: 192.168.10.106
resource_params: '{''res_ks_eth0_vip'': ''params ip="192.168.10.99" cidr_netmask="255.255.255.0"
  nic="eth0"'', ''res_ks_haproxy'': ''op monitor interval="5s"''}'
resources: '{''res_ks_eth0_vip'': ''ocf:heartbeat:IPaddr2'', ''res_ks_haproxy'': ''lsb:haproxy''}'

Revision history for this message
Felipe Reyes (freyes) wrote :

The problem is that the resource is not being stopped before attempting the deletion.

root@juju-9a5dd5-1:/var/lib/juju/agents/unit-hacluster-0/charm# #crm resource cleanup res_ks_eth0_vip && crm -w -F resource stop res_ks_eth0_vip && crm -d -w -F configure delete res_ks_eth0_vip
root@juju-9a5dd5-1:/var/lib/juju/agents/unit-hacluster-0/charm# crm resource cleanup res_ks_eth0_vip && crm -d -w -F configure delete res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-0, removing fail-count-res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-1, removing fail-count-res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-2, removing fail-count-res_ks_eth0_vip
Waiting for 3 replies from the CRMd... OK
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
DEBUG: found pacemaker version: 1.1.14
DEBUG: Using crm_resource for agent discovery
DEBUG: resolve_references: res_ks_eth0_vip -> primitive:res_ks_eth0_vip
DEBUG: resolve_references: res_ks_haproxy -> primitive:res_ks_haproxy
waiting for Stopping res_ks_eth0_vip to finish . done
ERROR: resource res_ks_eth0_vip is running, can't delete it
root@juju-9a5dd5-1:/var/lib/juju/agents/unit-hacluster-0/charm# crm resource cleanup res_ks_eth0_vip && crm -w -F resource stop res_ks_eth0_vip && crm -d -w -F configure delete res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-0, removing fail-count-res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-1, removing fail-count-res_ks_eth0_vip
Cleaning up res_ks_eth0_vip on juju-9a5dd5-2, removing fail-count-res_ks_eth0_vip
Waiting for 3 replies from the CRMd... OK
waiting for stop to finish . done
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
DEBUG: found pacemaker version: 1.1.14
DEBUG: Using crm_resource for agent discovery
DEBUG: resolve_references: res_ks_eth0_vip -> primitive:res_ks_eth0_vip
DEBUG: resolve_references: res_ks_haproxy -> primitive:res_ks_haproxy
DEBUG: remove object group:grp_ks_vips
DEBUG: remove object primitive:res_ks_eth0_vip
DEBUG: create configuration section rsc_defaults
DEBUG: Input: <cib crm_feature_set="3.0.10" validate-with="pacemaker-2.4" epoch="39" num_updates="4" admin_epoch="0" cib-last-written="Thu Oct 10 21:17:07 2019" update-origin="juju-9a5dd5-1" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="1002">
  <configuration><crm_config><cluster_property_set id="cib-bootstrap-options"><nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/><nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.14-70404b0"/><nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/><nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="debian"/><nvpair name="no-quorum-policy" value="stop" id="cib-bootstrap-options-no-quorum-policy"/><nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/><nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1570742227"/><nvpair name="cluster-recheck-interval" value="60" id="cib-bootstrap-options-cluster-recheck-interval"/></cluster_property_set></crm_confi...


Changed in charm-hacluster:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687987

Changed in charm-hacluster:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/687987
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=666055844e13b556ded97f4c92f3089e272507e8
Submitter: Zuul
Branch: master

commit 666055844e13b556ded97f4c92f3089e272507e8
Author: Felipe Reyes <email address hidden>
Date: Thu Oct 10 18:24:53 2019 -0300

    Stop resource before deleting it.

    Pacemaker will refuse to delete a resource that is running, so it always
    needs to be stopped before deleting it.

    Change-Id: I3c6acdef401e9ec18fedc65e9c77db4719fe60ec
    Closes-Bug: #1838528
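
For reference, the stop-before-delete sequence this change enforces is essentially the one shown working in the terminal session above (a sketch using the resource name from that session; the actual implementation is in the review linked above):

crm resource cleanup res_ks_eth0_vip             # clear any stale fail counts first
crm -w -F resource stop res_ks_eth0_vip          # -w waits for the stop to complete
crm -d -w -F configure delete res_ks_eth0_vip    # the delete only succeeds once the resource is stopped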

Changed in charm-hacluster:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-hacluster:
status: Fix Committed → Fix Released
Revision history for this message
Trent Lloyd (lathiat) wrote :

The fix for this bug does not work if you upgrade the principal charm *before* the hacluster charm. It only works if you upgrade hacluster first. See Bug #1866145 for more details.
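
In practice that means upgrading in this order (a sketch; application names follow the reproducer above):

juju upgrade-charm hacluster    # upgrade the hacluster subordinate first
juju upgrade-charm keystone     # then the principal charm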
