CRM resource grp_ks_vips not found

Bug #1418982 reported by Xiang Hui
This bug affects 2 people
Affects: keystone (Juju Charms Collection)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

Deploy 3 keystone + hacluster:

juju deploy cs:trusty/keystone
juju deploy cs:trusty/hacluster

add-unit..

juju set keystone vip=192.168.100.190
juju set hacluster corosync_transport=udpu

juju add-relation keystone hacluster
After keystone started, the cluster was in split brain.
keystone/1/2:
root@juju-machine-5-lxc-0:~# crm status
Online: [ juju-machine-4-lxc-0 juju-machine-5-lxc-0 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-4-lxc-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-4-lxc-0 juju-machine-5-lxc-0 ]
keystone/0:
root@juju-machine-1-lxc-0:~# crm status
Last updated: Fri Feb 6 08:18:26 2015
Last change: Fri Feb 6 07:17:28 2015 via crmd on juju-machine-1-lxc-0
Stack: corosync
Current DC: juju-machine-1-lxc-0 (1002) - partition WITHOUT quorum
Version: 1.1.10-42f2063
1 Nodes configured
2 Resources configured

Online: [ juju-machine-1-lxc-0 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-1-lxc-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-1-lxc-0 ]

Checked some logs, then restarted corosync on keystone/0; the cluster went back to normal:

root@juju-machine-5-lxc-0:~# crm status
Online: [ juju-machine-1-lxc-0 juju-machine-4-lxc-0 juju-machine-5-lxc-0 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-4-lxc-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-1-lxc-0 juju-machine-4-lxc-0 juju-machine-5-lxc-0 ]

juju remove-relation keystone hacluster
successfully

juju add-relation keystone hacluster
Then:
2015-02-06 10:41:41 INFO unit.keystone/3.juju-log cmd.go:247 cluster:0: Syncing known_hosts @ /home/juju_keystone/.ssh/known_hosts.
2015-02-06 10:41:41 DEBUG unit.keystone/3.juju-log cmd.go:247 cluster:0: Peer echo overrides: {}
2015-02-06 10:41:41 DEBUG unit.keystone/3.juju-log cmd.go:247 cluster:0: Peer echo whitelist: [u'admin_passwd', u'ssl-cert-master']
2015-02-06 10:41:42 INFO unit.keystone/3.juju-log cmd.go:247 cluster:0: Retrying 'is_crm_leader' 5 more times (delay=2)
2015-02-06 10:41:45 INFO unit.keystone/3.juju-log cmd.go:247 cluster:0: Retrying 'is_crm_leader' 4 more times (delay=4)
2015-02-06 10:41:50 INFO unit.keystone/3.juju-log cmd.go:247 cluster:0: Retrying 'is_crm_leader' 3 more times (delay=6)
2015-02-06 10:41:57 INFO unit.keystone/3.juju-log cmd.go:247 cluster:0: Retrying 'is_crm_leader' 2 more times (delay=8)
2015-02-06 10:42:05 INFO unit.keystone/3.juju-log cmd.go:247 cluster:0: Retrying 'is_crm_leader' 1 more times (delay=10)
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 Traceback (most recent call last):
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-keystone-3/charm/hooks/cluster-relation-changed", line 536, in <module>
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 main()
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-keystone-3/charm/hooks/cluster-relation-changed", line 530, in main
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 hooks.execute(sys.argv)
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-keystone-3/charm/hooks/charmhelpers/core/hookenv.py", line 544, in execute
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 self._hooks[hook_name]()
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-keystone-3/charm/hooks/charmhelpers/core/host.py", line 312, in wrapped_f
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 f(*args)
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-keystone-3/charm/hooks/cluster-relation-changed", line 374, in cluster_changed
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 if is_elected_leader(CLUSTER_RES) or is_ssl_cert_master():
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-keystone-3/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 73, in is_elected_leader
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 if not is_crm_leader(resource):
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-keystone-3/charm/hooks/charmhelpers/core/decorators.py", line 42, in _retry_on_exception_inner_2
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 return f(*args, **kwargs)
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-keystone-3/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 116, in is_crm_leader
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 raise CRMResourceNotFound("CRM resource %s not found" % (resource))
2015-02-06 10:42:16 INFO unit.keystone/3.cluster-relation-changed logger.go:40 charmhelpers.contrib.hahelpers.cluster.CRMResourceNotFound: CRM resource grp_ks_vips not found
2015-02-06 10:42:16 ERROR juju.worker.uniter uniter.go:608 hook "cluster-relation-changed" failed: exit status 1

Tags: openstack sts
Xiang Hui (xianghui)
Changed in keystone (Juju Charms Collection):
importance: Undecided → Critical
tags: added: cts openstack
Edward Hope-Morley (hopem) wrote :

Can you please provide the output of 'sudo crm status'?

Changed in keystone (Juju Charms Collection):
status: New → Incomplete
Xiang Hui (xianghui) wrote :
Xiang Hui (xianghui)
description: updated
Changed in keystone (Juju Charms Collection):
status: Incomplete → Triaged
Edward Hope-Morley (hopem) wrote :

Hui, can you please provide the output of 'sudo crm status'. The reason I ask is twofold:

1. It is possible that your corosync cluster is slow and grp_ks_vips is taking a while to appear/settle. This is exacerbated by corosync being repeatedly restarted during deploy (something we need to resolve). The error you are seeing is the consequence of a timeout after a number of retries to detect that this resource exists (in order to establish a 'leader').

2. You have said that you did initially manage to deploy successfully but, for the record, if the hacluster corosync_bindiface setting is not correct, the grp_ks_vips resource will not be created.
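The retry-then-fail behaviour described in point 1 can be sketched as follows. This is a minimal illustration and not the charmhelpers code: the decorator signature, the `sleep` parameter, and the stub `is_crm_leader` are assumptions made for the example. The linearly growing delay (2, 4, 6, 8, 10) and the five retries are taken from the unit log in the bug description.

```python
import time
from functools import wraps


class CRMResourceNotFound(Exception):
    """Stand-in for the exception named in the traceback (hypothetical copy)."""


def retry_on_exception(num_retries, base_delay=0, exc_type=Exception,
                       sleep=time.sleep):
    """Retry the wrapped call, waiting base_delay * attempt between tries.

    Assumed behaviour, modelled on the log lines in this report; the real
    helper may differ.
    """
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            retries_left = num_retries
            attempt = 1
            while True:
                try:
                    return f(*args, **kwargs)
                except exc_type:
                    if retries_left == 0:
                        # Out of retries: let the hook fail with the error.
                        raise
                    delay = base_delay * attempt
                    retries_left -= 1
                    attempt += 1
                    sleep(delay)
        return wrapper
    return decorator


# Record the delays instead of really sleeping, so the sketch runs instantly.
delays = []


@retry_on_exception(5, base_delay=2, exc_type=CRMResourceNotFound,
                    sleep=delays.append)
def is_crm_leader(resource):
    # Stub: pretend the resource never shows up in the cluster.
    raise CRMResourceNotFound("CRM resource %s not found" % resource)
```

If the resource appears during any of the retries the call returns normally; only when every attempt fails does the exception reach the hook, which is what the traceback in the description shows.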

Changed in keystone (Juju Charms Collection):
status: Triaged → Incomplete
status: Incomplete → Triaged
Edward Hope-Morley (hopem) wrote :

The corosync status output pasted above does show grp_ks_vips and there is a vip primitive for each node, but you also appear to have a split brain since not all nodes are reporting the same status. I am guessing that this is a result of you also testing bug 1379484.
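One way to check for the disagreement described here is to compare the "Online:" line that each unit reports. The sketch below assumes the `crm status` output has already been collected as text per unit; the function names are hypothetical, and the string matching is based only on the Pacemaker 1.1.10 output pasted in this report.

```python
import re

# Matches lines such as: Online: [ juju-machine-4-lxc-0 juju-machine-5-lxc-0 ]
ONLINE_RE = re.compile(r"^Online:\s*\[\s*(.*?)\s*\]\s*$", re.MULTILINE)


def online_nodes(crm_status_text):
    """Return the set of node names on the 'Online:' line, or an empty set."""
    match = ONLINE_RE.search(crm_status_text)
    if match is None:
        return frozenset()
    return frozenset(match.group(1).split())


def has_quorum(crm_status_text):
    """True if the DC line reports 'partition with quorum' (case-sensitive)."""
    return "partition with quorum" in crm_status_text


def membership_views(status_by_unit):
    """Group units by the set of online nodes that each one reports."""
    views = {}
    for unit, text in status_by_unit.items():
        views.setdefault(online_nodes(text), []).append(unit)
    return views


def looks_split(status_by_unit):
    """More than one distinct membership view suggests a split brain."""
    return len(membership_views(status_by_unit)) > 1


# Abbreviated output taken from the bug description.
statuses = {
    "keystone/0": (
        "Current DC: juju-machine-1-lxc-0 (1002) - partition WITHOUT quorum\n"
        "Online: [ juju-machine-1-lxc-0 ]\n"
    ),
    "keystone/1": "Online: [ juju-machine-4-lxc-0 juju-machine-5-lxc-0 ]\n",
}
print(looks_split(statuses))
```

This only compares membership views. It does not show which partition holds the VIP, so the full `crm status` output is still needed for diagnosis.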

Xiang Hui (xianghui) wrote :

The environment has been destroyed, so I am setting this to low priority until it appears again.

Changed in keystone (Juju Charms Collection):
importance: Critical → Low
Changed in keystone (Juju Charms Collection):
importance: Low → High
Adam Collard (adam-collard) wrote :

Bumping to High since I've just hit what appears to be the same issue.

3 keystone units, one failed.

Unit log from failed unit: http://paste.ubuntu.com/10722649/

crm status: http://paste.ubuntu.com/10722644/

Adam Collard (adam-collard) wrote :

Unit logs from the hacluster unit: http://paste.ubuntu.com/10722676/

Note the timing compared to the keystone logs above.

Edward Hope-Morley (hopem) wrote :

Adam, can you please confirm something for me. The crm status you provided above seems good, so I suspect that a 'juju resolved -r keystone/<unit>' would have brought the unit back to a good state. This kind of error is generally the result of a cluster that is frequently failing over, which is currently a likely consequence of repeated corosync restarts as it gets configured, something we are aiming to fix. Another possibility I have seen is that you have split brain in your cluster, which is not as easily resolvable. If you could confirm this, that would be appreciated. Thanks.

Seyeong Kim (seyeongkim) wrote :

I have the same problem.

crm status & juju status --format tabular

http://paste.ubuntu.com/10996853/

I get this error on the 4th node.

I tried "juju resolved --retry keystone/4"

but the same error popped up.

With 3 nodes it was OK.

I changed the cluster_count config to 4 on hacluster because I was testing the issue in https://bugs.launchpad.net/charms/+source/hacluster/+bug/1424048

Seyeong Kim (seyeongkim) wrote :

After "juju resolved --retry keystone/4", the crm status on all nodes changed:

#######################################################################
Last updated: Wed May 6 12:22:27 2015
Last change: Wed May 6 12:18:32 2015 via juju-openstack-machine-5 on juju-openstack-machine-3
Stack: corosync
Current DC: juju-openstack-machine-3 (1002) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured

Online: [ juju-openstack-machine-1 juju-openstack-machine-2 juju-openstack-machine-3 ]
OFFLINE: [ juju-openstack-machine-5 ]

#######################################################################

Seyeong Kim (seyeongkim) wrote :

About an hour later, the status changed:

http://pastebin.ubuntu.com/10997163/

I'm not sure this is helpful, but I'm pasting it anyway.

Felipe Reyes (freyes)
tags: added: sts
tags: removed: cts
James Page (james-page)
Changed in keystone (Juju Charms Collection):
importance: High → Medium
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for keystone (Juju Charms Collection) because there has been no activity for 60 days.]

Changed in keystone (Juju Charms Collection):
status: Incomplete → Expired