OpenStack HA Cluster Charm

Adding a 3rd hacluster unit frequently makes ha-relation-changed loop on crm node list

Bug #1424048 reported by JuanJo Ciarlante on 2015-02-20

This bug affects 4 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack HA Cluster Charm	Confirmed	High	Unassigned
	hacluster (Juju Charms Collection)	Invalid	High	Unassigned	Juju Charms Collection 17.01

Bug Description

FYI this happens when deploying HA openstack with 1501 release
charms, using hacluster in unicast mode. It's a staged deployment
where we 1st deploy all HA services with 2 units, relate them
(for OS service), and finally add the 3rd unit to all HA'd ones.

We're repeatedly seeing issues with hacluster not settling on the 3rd
unit (/2) - drilling down, found that the 3rd unit is carrying an
incomplete corosync.conf with "two_node: 1" and only 2 nodes there
for unicast, while the others are already running with the 3 nodes
setup: http://paste.ubuntu.com/10329322/
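The mismatch described above (a unit whose corosync.conf lists fewer nodes than the cluster expects, with "two_node: 1" still set) can be spotted with a small check. The following is only a sketch: the function names, the sample addresses and node IDs are hypothetical, and the parsing is a simple regex scan, not a full corosync.conf parser.

```python
import re

# Hypothetical two-node config, similar in shape to what the stale unit carried.
SAMPLE = """
quorum {
    provider: corosync_votequorum
    two_node: 1
}
nodelist {
    node {
        ring0_addr: 10.0.0.1
        nodeid: 1000
    }
    node {
        ring0_addr: 10.0.0.2
        nodeid: 1001
    }
}
"""


def summarize_corosync_conf(text):
    """Return (node_count, two_node) found in corosync.conf text."""
    # Count "node {" blocks; "nodelist {" does not match this pattern.
    node_count = len(re.findall(r"^\s*node\s*\{", text, flags=re.MULTILINE))
    match = re.search(r"^\s*two_node\s*:\s*(\d+)", text, flags=re.MULTILINE)
    two_node = bool(match and match.group(1) == "1")
    return node_count, two_node


def is_stale(text, expected_nodes):
    """True if the config disagrees with the expected cluster size."""
    node_count, two_node = summarize_corosync_conf(text)
    return node_count != expected_nodes or (two_node and expected_nodes != 2)
```

Running is_stale() against /etc/corosync/corosync.conf on each unit with the intended cluster size would flag a unit in the state reported here.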

Then the charm loops on 'crm node list', which never settles. Not even
a manual kill + restart of corosync and pacemaker works, as /2 can't join
the 3-node cluster (as expected by the other units).
Manually copying corosync.conf from /0 into /2 and restarting
corosync+pacemaker works, it can then succeed on 'crm node list',
and join the cluster.

JuanJo Ciarlante (jjo) wrote on 2015-02-20:

FYI this looks like a race, as we have the same (repeated) deployment
sometimes failing on different hacluster subordinates (keystone,
cinder, glance).

tags:

added: canonical-bootstack

Paul Gear (paulgear) wrote on 2015-02-23:

I've seen this several times during deploys. However, the loop on 'crm node list' is not a definitive indicator - there are some cases where pacemaker fails to start on a node which has a correct corosync.conf, and all that is needed is to restart pacemaker.

JuanJo Ciarlante (jjo) wrote on 2015-02-27:

FYI this has corosync_transport: unicast.

Changed our deployment sequence to deploy the 3 HA
units at once, then relate openstack services, and got
a different issue on some of them (the affected units
changed each time I redeployed; tried a couple of times):

$ juju run --timeout=10s --service=keystone 'sudo crm status 2>/dev/null|egrep Started:'
- Error: command timed out
  MachineId: 0/lxc/6
  Stdout: ""
  UnitId: keystone/0
- MachineId: 1/lxc/3
  Stdout: ' Started: [ juju-machine-1-lxc-3 juju-machine-2-lxc-3 ]

    '
  UnitId: keystone/1
- MachineId: 2/lxc/3
  Stdout: ' Started: [ juju-machine-1-lxc-3 juju-machine-2-lxc-3 ]

    '
  UnitId: keystone/2

Logging into the timed-out unit shows pacemaker not started and
hanode-relation-changed looping endlessly on a failing
crm node list; after starting pacemaker there, the hook could
complete OK: http://paste.ubuntu.com/10454923/
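For illustration only, a wait on 'crm node list' can be bounded by a timeout so that a unit in this state fails visibly instead of looping forever. This is a sketch and not the charm's code: the helper names, the timeout and the interval are assumptions.

```python
import subprocess
import time


def crm_node_list_ok():
    """Return True if `crm node list` exits with status 0."""
    try:
        result = subprocess.run(
            ["crm", "node", "list"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # crm is not installed on this machine.
        return False
    return result.returncode == 0


def wait_until(check, timeout=60.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would use wait_until(crm_node_list_ok) and, on False, log that pacemaker or corosync needs attention rather than retrying indefinitely.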

Billy Olsen (billy-olsen) wrote on 2015-02-27:

JuanJo or Paul, do you have any juju logs or syslogs you could attach for analysis? A sosreport works as well.

Billy Olsen (billy-olsen) wrote on 2015-02-27:

JuanJo, have you set the cluster_count config option in the hacluster charm? If not, when you deploy a 3-node cluster at once, try setting cluster_count to 3 as well, so that the charm waits to see 3 nodes before it starts the clustering process.
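The gating behaviour described here (wait until cluster_count nodes are visible before configuring the cluster) can be sketched as below. The function name and arguments are hypothetical and the addresses are placeholders; this is not taken from the charm source.

```python
def ready_to_cluster(peer_addresses, local_address, cluster_count):
    """True once this unit plus its known peers reach cluster_count."""
    # A set avoids double-counting if the local address also appears as a peer.
    members = set(peer_addresses) | {local_address}
    return len(members) >= cluster_count
```

With cluster_count=3, a unit that only sees one peer would keep waiting instead of writing a two-node configuration.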

James Page (james-page) wrote on 2015-03-10:

JuanJo

I think I've tracked this problem down to the following situation - I can repro on a 3 node cluster expansion:

1) charms deployed with hacluster with unicast/cluster_count=3 (which is right for an initial three node cluster)

bootstraps fine - cluster forms OK

2) juju add-unit <service>

Additional unit spins - corosync is unable to start up correctly, so crm node list just sits in the loop.

Looking at the debug output of corosync and the local corosync.conf, the new unit only has a partial node list of three nodes, one being itself, and this is confusing the votequorum function.

I tried, with what appears to be success, overriding the expected_votes calculation based on the nodelist with an explicit votequorum configuration (which was already done for multicast), and I can now expand the cluster reliably.
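To make the expected_votes point concrete: if expected votes are derived from the nodelist, a unit holding a partial nodelist computes a different quorum threshold than its peers. The sketch below assumes one vote per listed node and the usual majority rule (expected_votes // 2 + 1); the node names are placeholders and the helper names are hypothetical.

```python
def expected_votes_from_nodelist(nodelist):
    """Assumption: every listed node contributes exactly one vote."""
    return len(nodelist)


def quorum_threshold(expected_votes):
    """Simple majority of the expected votes."""
    return expected_votes // 2 + 1


# Existing units already list all four nodes; the new unit lists only three.
full_nodelist = ["node-0", "node-1", "node-2", "node-3"]
partial_nodelist = ["node-1", "node-2", "node-3"]

full_quorum = quorum_threshold(expected_votes_from_nodelist(full_nodelist))
partial_quorum = quorum_threshold(expected_votes_from_nodelist(partial_nodelist))
```

Here the two sides disagree on the threshold (3 versus 2), which is the kind of inconsistency an explicit expected_votes setting is meant to avoid.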

Changed in hacluster (Juju Charms Collection):
status:	New → Confirmed
importance:	Undecided → High

James Page (james-page) wrote on 2015-03-10:

Hmm - I think that this actually breaks depending on which unit is the current DC owner - the existing units have the new unit in their nodelist, so start to send data, and if the current owner is not in the list, then we get this situation.

Based on that, I still get breaks on new units joining the cluster.

James Page (james-page) wrote on 2015-03-10:

The workaround for now is to increase the cluster_count configuration in line with the target cluster size prior to adding units.

James Page (james-page) on 2015-03-10

Changed in hacluster (Juju Charms Collection):
milestone:	none → 15.04

James Page (james-page) on 2015-04-23

tags:	added: openstack
Changed in hacluster (Juju Charms Collection):
milestone:	15.04 → 15.07

James Page (james-page) on 2015-08-10

Changed in hacluster (Juju Charms Collection):
milestone:	15.07 → 15.10

James Page (james-page) on 2015-10-22

Changed in hacluster (Juju Charms Collection):
milestone:	15.10 → 16.01

James Page (james-page) on 2016-01-28

Changed in hacluster (Juju Charms Collection):
milestone:	16.01 → 16.04

James Page (james-page) on 2016-04-22

Changed in hacluster (Juju Charms Collection):
milestone:	16.04 → 16.07

Liam Young (gnuoy) on 2016-07-29

Changed in hacluster (Juju Charms Collection):
milestone:	16.07 → 16.10

James Page (james-page) on 2016-10-14

Changed in hacluster (Juju Charms Collection):
milestone:	16.10 → 17.01

James Page (james-page) on 2017-02-23

Changed in charm-hacluster:
importance:	Undecided → High
status:	New → Confirmed
Changed in hacluster (Juju Charms Collection):
status:	Confirmed → Invalid
