HA deploys unreliable, pacemaker dies

Bug #1421488 reported by Paul Gear
This bug affects 3 people
Affects                              Status        Importance  Assigned to   Milestone
OpenStack HA Cluster Charm           Fix Released  Undecided   Billy Olsen
hacluster (Juju Charms Collection)   Invalid       Undecided   Billy Olsen
keystone (Juju Charms Collection)    Invalid       Undecided   Unassigned

Bug Description

In fresh deploys of HA keystone, I'm getting unreliable behaviour. One of the most common problems is that pacemaker dies on the 3rd node. Symptoms of this are: higher load on the failing node, pacemaker down, and the hanode-relation-joined hook constantly running 'crm node list'. Trying "juju run --service keystone 'uname -a'" times out (presumably due to the hook still running).
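
Roughly, the checks that show these symptoms on the failing node look like the following (unit names and paths here are illustrative, not exact):

root@juju-machine-2-lxc-6:~# service pacemaker status        # pacemaker reported as stopped on the failing node
root@juju-machine-2-lxc-6:~# crm status                      # fails with a connection error (see output below)
root@juju-machine-2-lxc-6:~# ps aux | grep 'crm node list'   # the hanode-relation-joined hook's crm call still running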

Revision history for this message
Paul Gear (paulgear) wrote :

Here's the output of 'crm status' on all 3 nodes:

root@juju-machine-0-lxc-6:~# crm status
Last updated: Fri Feb 13 02:09:47 2015
Last change: Fri Feb 13 01:39:53 2015 via crmd on juju-machine-0-lxc-6
Stack: corosync
Current DC: juju-machine-1-lxc-3 (1000) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured

Online: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-0-lxc-6
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 ]

root@juju-machine-1-lxc-3:~# crm status
Last updated: Fri Feb 13 02:10:51 2015
Last change: Fri Feb 13 01:39:53 2015 via crmd on juju-machine-0-lxc-6
Stack: corosync
Current DC: juju-machine-1-lxc-3 (1000) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured

Online: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-0-lxc-6
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 ]

root@juju-machine-2-lxc-6:~# crm status
Could not establish cib_ro connection: Connection refused (111)
ERROR: crm_mon exited with code 107 and said: Connection to cluster failed: Transport endpoint is not connected

And, restarting pacemaker on node 2:

root@juju-machine-2-lxc-6:~# service pacemaker restart
Pacemaker Cluster Manager is already stopped[ OK ]
Starting Pacemaker Cluster Manager: [ OK ]
root@juju-machine-2-lxc-6:~# crm status
Last updated: Fri Feb 13 02:12:05 2015
Last change: Fri Feb 13 02:12:04 2015 via crmd on juju-machine-1-lxc-3
Stack: corosync
Current DC: juju-machine-1-lxc-3 (1000) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
4 Resources configured

Online: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 juju-machine-2-lxc-6 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-0-lxc-6
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-0-lxc-6 juju-machine-1-lxc-3 juju-machine-2-lxc-6 ]

tags: added: canonical-bootstack
Revision history for this message
Paul Gear (paulgear) wrote :

After fixing pacemaker on node 2, "juju run --service keystone 'uname -a'" works normally.

Revision history for this message
Paul Gear (paulgear) wrote :

On the next deploy, immediately afterwards and with no configuration changes, keystone units 0 and 1 had hook errors. The deployment script ran 'juju resolved --retry' on all errored units; this resolved unit 0 but not unit 1, which reports: hook failed: "identity-service-relation-changed" for nova-cloud-controller:identity-service. Unit 2's pacemaker is down again, and "juju run --service keystone 'uname -a'" hangs as before.

Attached are corosync & pacemaker diagnostics from each node and juju status of keystone service before restarting pacemaker on unit 2. I've also kept complete copies of /etc and /var/log from all nodes on this deploy, which can be provided if needed.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Liam and I have also seen this behavior. Pacemaker becomes wedged and can't communicate with corosync. There must be a timeout somewhere, because in our tests the issue resolves itself after roughly 20 minutes.
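
For anyone checking their own nodes, a rough way to see whether pacemaker has lost its connection to corosync (standard corosync/pacemaker tools; exact output will vary):

root@juju-machine-2-lxc-6:~# corosync-cfgtool -s      # corosync ring status
root@juju-machine-2-lxc-6:~# corosync-quorumtool -s   # corosync membership/quorum view
root@juju-machine-2-lxc-6:~# crm_mon -1               # pacemaker's view; refuses the connection when wedged, as above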

Marking as confirmed.

Changed in keystone (Juju Charms Collection):
status: New → Confirmed
status: Confirmed → Invalid
Changed in hacluster (Juju Charms Collection):
status: New → Confirmed
Revision history for this message
Billy Olsen (billy-olsen) wrote :

Moving to the hacluster charm, as this is not specific to keystone.

Changed in hacluster (Juju Charms Collection):
assignee: nobody → Billy Olsen (billy-olsen)
tags: added: backport-potential
Revision history for this message
Billy Olsen (billy-olsen) wrote :

@Paul - when deploying the hacluster, are you setting the cluster_count config option to the number of nodes you are deploying?
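
For reference, with Juju 1.x that would be something along these lines (the service name here is illustrative):

juju set keystone-hacluster cluster_count=3
juju get keystone-hacluster    # confirm the value the charm sees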

Revision history for this message
Paul Gear (paulgear) wrote :

@billy-olsen I'm no longer working on that project, but @brad-marshal reports that they are using cluster_count now and seeing more reliable results. So you can probably close this for now and we can revisit later if necessary.

Revision history for this message
Paul Gear (paulgear) wrote :

That should be @brad-marshall :-)

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Thanks for the info, Paul. I'd rather not close it, as I believe it's still a problem: it's too easy not to set cluster_count.

I think we should revisit restoring the stop/sleep/start logic for the corosync restart sequence, rather than the straight restart that exists in the 15.01 version of the charms. There's some indication that a scenario like this is why the sequence was originally stop/sleep/start.
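
Roughly, the difference is between a straight restart and something like the following; this is only a sketch of the idea, not the charm's actual code:

# straight restart (15.01 behaviour)
service corosync restart

# stop/sleep/start variant being discussed
service corosync stop
sleep 5
service corosync start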

I think we should also consider changing the cluster_count default to 3. The current default is 2, but we may want to align the default with typical deployment practice and have folks who deploy clusters of fewer than 3 nodes intentionally decrease the count. This may be somewhat contentious, since a value of 2 will allow a second unit to start forming a cluster; however, there are serious considerations to take into account when deploying a 2-node cluster.
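
As an aside on the 2-node case: with only two nodes, pacemaker cannot retain quorum after losing one, so such deployments typically need something like the following general pacemaker setting (this is not something the charm does today, just an illustration of the extra consideration involved):

crm configure property no-quorum-policy=ignore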

Revision history for this message
Billy Olsen (billy-olsen) wrote :

I no longer suspect the corosync restart sequence. I've checked the syslog of a recreated case and see several totem "FAILED TO RECEIVE" messages, which aren't really related to a stop/sleep/start sequence. Additionally, the failure occurred on the first startup of the cluster, so it cannot be the restart sequence triggering it, and cluster_count was 3 in this scenario, so that may not help either.
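
For anyone checking their own nodes, the messages can be found in syslog with something like:

root@juju-machine-2-lxc-6:~# grep 'FAILED TO RECEIVE' /var/log/syslog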

tags: removed: backport-potential
James Page (james-page)
Changed in charm-hacluster:
assignee: nobody → Billy Olsen (billy-olsen)
status: New → Confirmed
Changed in hacluster (Juju Charms Collection):
status: Confirmed → Invalid
Revision history for this message
James Page (james-page) wrote :

cluster_count now defaults to 3. As there have been no updates on this ticket since 2015, I'm going to close it as 'Fix Released' based on the comments in #11.

Changed in charm-hacluster:
status: Confirmed → Fix Released