HA deploys unreliable, pacemaker dies

Bug #1421488 reported by Paul Gear
This bug affects 3 people
Affects                              Status        Importance  Assigned to   Milestone
OpenStack HA Cluster Charm           Fix Released  Undecided   Billy Olsen
hacluster (Juju Charms Collection)   Invalid       Undecided   Billy Olsen
keystone (Juju Charms Collection)    Invalid       Undecided   Unassigned

Bug Description

In fresh deploys of HA keystone, I'm getting unreliable behaviour. One of the most common problems is that pacemaker dies on the 3rd node. Symptoms of this are: higher load on the failing node, pacemaker down, and the hanode-relation-joined hook constantly running 'crm node list'. Trying "juju run --service keystone 'uname -a'" times out (presumably due to the hook still running).
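
Roughly, the checks that show these symptoms on the failing node look like the following (unit names and paths here are illustrative, not exact):

root@juju-machine-2-lxc-6:~# service pacemaker status        # pacemaker reported as stopped on the failing node
root@juju-machine-2-lxc-6:~# crm status                      # fails with a connection error (see output below)
root@juju-machine-2-lxc-6:~# ps aux | grep 'crm node list'   # the hanode-relation-joined hook's crm call still running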

Revision history for this message
Paul Gear (paulgear) wrote :

Here's the output of 'crm status' on all 3 nodes:

root@juju-machine-0-lxc-6:~# crm status
Last updated: Fri Feb 13 02:09:47 2015
Last change: Fri Feb 13 01:39:53 2015 via crmd on juju-machine-0-lxc-6
Stack: corosync
Current DC: juju-machine-1-lxc-3 (1000) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured

Online: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-0-lxc-6
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 ]

root@juju-machine-1-lxc-3:~# crm status
Last updated: Fri Feb 13 02:10:51 2015
Last change: Fri Feb 13 01:39:53 2015 via crmd on juju-machine-0-lxc-6
Stack: corosync
Current DC: juju-machine-1-lxc-3 (1000) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured

Online: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-0-lxc-6
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 ]

root@juju-machine-2-lxc-6:~# crm status
Could not establish cib_ro connection: Connection refused (111)
ERROR: crm_mon exited with code 107 and said: Connection to cluster failed: Transport endpoint is not connected

And, restarting pacemaker on node 2:

root@juju-machine-2-lxc-6:~# service pacemaker restart
Pacemaker Cluster Manager is already stopped[ OK ]
Starting Pacemaker Cluster Manager: [ OK ]
root@juju-machine-2-lxc-6:~# crm status
Last updated: Fri Feb 13 02:12:05 2015
Last change: Fri Feb 13 02:12:04 2015 via crmd on juju-machine-1-lxc-3
Stack: corosync
Current DC: juju-machine-1-lxc-3 (1000) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
4 Resources configured

Online: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 juju-machine-2-lxc-6 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-0-lxc-6
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 juju-machine-2-lxc-6 ]

tags: added: canonical-bootstack
Revision history for this message
Paul Gear (paulgear) wrote :

After fixing pacemaker on node 2, "juju run --service keystone 'uname -a'" works normally.

Revision history for this message
Paul Gear (paulgear) wrote :

On the next deploy, immediately afterwards and with no configuration changes, keystone units 0 and 1 had hook errors. The deployment script ran 'juju resolved --retry' on all errored units; this resolved unit 0 but not unit 1, which reports: hook failed: "identity-service-relation-changed" for nova-cloud-controller:identity-service. Unit 2's pacemaker is down again, and "juju run --service keystone 'uname -a'" hangs as before.

Attached are corosync & pacemaker diagnostics from each node and juju status of keystone service before restarting pacemaker on unit 2. I've also kept complete copies of /etc and /var/log from all nodes on this deploy, which can be provided if needed.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Liam and I have also seen this behavior. Pacemaker becomes wedged and can't communicate with corosync. There must be a timeout somewhere, because in our tests the issue resolves itself after roughly 20 minutes.
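
For anyone checking their own nodes, a rough way to see whether pacemaker has lost its connection to corosync (standard corosync/pacemaker tools; exact output will vary):

root@juju-machine-2-lxc-6:~# corosync-cfgtool -s      # corosync ring status
root@juju-machine-2-lxc-6:~# corosync-quorumtool -s   # corosync membership/quorum view
root@juju-machine-2-lxc-6:~# crm_mon -1               # pacemaker's view; refuses the connection when wedged, as above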

Marking as confirmed.

Changed in keystone (Juju Charms Collection):
status: New → Confirmed
status: Confirmed → Invalid
Changed in hacluster (Juju Charms Collection):
status: New → Confirmed
Revision history for this message
Billy Olsen (billy-olsen) wrote :

Moving to the hacluster charm, as this is not specific to keystone.

Changed in hacluster (Juju Charms Collection):
assignee: nobody → Billy Olsen (billy-olsen)
tags: added: backport-potential
Revision history for this message
Billy Olsen (billy-olsen) wrote :

@Paul - when deploying the hacluster, are you setting the cluster_count config option to the number of nodes you are deploying?
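
For reference, with Juju 1.x that would be something along these lines (the service name here is illustrative):

juju set keystone-hacluster cluster_count=3
juju get keystone-hacluster    # confirm the value the charm sees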

Revision history for this message
Paul Gear (paulgear) wrote :

@billy-olsen I'm no longer working on that project, but @brad-marshal reports that they are using cluster_count now and seeing more reliable results. So you can probably close this for now and we can revisit later if necessary.

Revision history for this message
Paul Gear (paulgear) wrote :

That should be @brad-marshall :-)

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Thanks for the info, Paul. I'd rather not close it, as I believe it's still a problem: it's too easy not to set cluster_count.

I think we should revisit restoring the stop/sleep/start logic for the corosync restart sequence, rather than the straight restart that exists in the 15.01 version of the charms. There's some indication that a scenario like this is why the sequence was originally stop/sleep/start.
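
Roughly, the difference is between a straight restart and something like the following; this is only a sketch of the idea, not the charm's actual code:

# straight restart (15.01 behaviour)
service corosync restart

# stop/sleep/start variant being discussed
service corosync stop
sleep 5
service corosync start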

I think we should also consider changing the cluster_count default to 3. The current default is 2, but we may want to align the default with typical deployment practice and have folks who deploy clusters of fewer than 3 nodes intentionally decrease the count. This may be somewhat contentious, since a value of 2 will allow a second unit to start forming a cluster; however, there are serious considerations to take into account when deploying a 2-node cluster.
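
As an aside on the 2-node case: with only two nodes, pacemaker cannot retain quorum after losing one, so such deployments typically need something like the following general pacemaker setting (this is not something the charm does today, just an illustration of the extra consideration involved):

crm configure property no-quorum-policy=ignore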

Revision history for this message
Billy Olsen (billy-olsen) wrote :

I no longer suspect the corosync restart sequence. I've checked the syslog of a recreated case and see several totem "FAILED TO RECEIVE" messages, which aren't really related to a stop/sleep/start sequence. Additionally, the failure occurred on the first startup of the cluster, so it cannot be the restart sequence triggering it, and cluster_count was 3 in this scenario, so that may not help either.
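
For anyone checking their own nodes, the messages can be found in syslog with something like:

root@juju-machine-2-lxc-6:~# grep 'FAILED TO RECEIVE' /var/log/syslog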

tags: removed: backport-potential
James Page (james-page)
Changed in charm-hacluster:
assignee: nobody → Billy Olsen (billy-olsen)
status: New → Confirmed
Changed in hacluster (Juju Charms Collection):
status: Confirmed → Invalid
Revision history for this message
James Page (james-page) wrote :

cluster_count now defaults to 3. As there have been no updates on this ticket since 2015, I'm going to close it as 'Fix Released' based on the comments in #11.

Changed in charm-hacluster:
status: Confirmed → Fix Released