Removing lead unit from existing service deployment causes failure in related services due to incomplete contexts

Bug #1355848 reported by Gareth Woolridge
Affects                                    Status         Importance   Assigned to   Milestone
keystone (Juju Charms Collection)          Fix Released   High         Liam Young
percona-cluster (Juju Charms Collection)   Fix Released   High         Liam Young
rabbitmq-server (Juju Charms Collection)   Fix Released   High         Liam Young

Bug Description

We have an OpenStack deployment using Trusty/Icehouse and 3-node HA clusters for the core infra services.

The HA instances are shmooshed onto 3 physical MAAS-deployed nodes using LXC.

We recently needed to remove a percona unit, due to hardware issues on the parent physical node, so that it could be replaced with new hardware.

After removing a unit with juju remove-unit percona-cluster/1 and terminating the machine, we experienced DB-related issues, although percona, corosync, etc. looked OK.

Nova list returned no instances, and juju status failed because the bootstrap nodes of the deployed environments could not be reached: the neutron gateway had lost database access, which caused an outage for all deployed VMs/environments due to the loss of tenant networking.

We were able to confirm that nova.conf, neutron.conf, etc. on the infrastructure services were missing the sql_connection details, and the juju unit logs for nova-cloud-controller, nova-compute, quantum-gateway, glance, cinder, etc. showed:

2014-08-11 11:49:59 INFO juju-log shared-db:142: Missing required data: database_password database_host

Manually causing a config-changed hook to fire with a juju set did not resolve the issue. Ultimately, across all infrastructure services with a database relation, we removed the shared-db relation and then added it again (see the sketch below). After doing this and bouncing the neutron nodes, service was restored.
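For reference, a minimal sketch of that workaround using the Juju 1.x CLI. The service names and relation endpoints below are examples based on this deployment, not a prescribed recovery procedure; adjust them to the environment in question.

# For each infrastructure service that consumes the shared-db interface, drop and
# re-add the relation so the surviving percona-cluster units republish credentials:
juju remove-relation nova-cloud-controller percona-cluster
juju add-relation nova-cloud-controller:shared-db percona-cluster:shared-db

juju remove-relation glance percona-cluster
juju add-relation glance:shared-db percona-cluster:shared-db

# Repeat for cinder, quantum-gateway, etc., then restart the neutron agents on the
# gateway nodes so they pick up the restored sql_connection.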

Charms that support HA and clustering need to support scaling down as well as scaling up, without knock-on effects on dependent services.

Related branches

summary: - Removing unit from existing cluster causes shared-db failure to related
- services
+ Removing unit from existing percona cluster causes shared-db failure to
+ related services
description: updated
James Page (james-page)
Changed in charms:
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote : Re: Removing unit from existing percona cluster causes shared-db failure to related services

I see the issue; only the leader sets the username/password, and as we don't have hooks (yet) in Juju for when the leader of a service changes, we can't set it somewhere else easily when the existing leader is removed.

This will be easily fixed when the leader election features land in Juju; in the meantime we need to work something into the charms to deal with this.

I think the same problem will happen with any charm that provides credentials, so I'm raising bug tasks for rabbitmq and keystone as well.
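As a rough illustration of the pattern described above (a sketch only, not the charms' actual hook code; the leader check is a hypothetical helper and the relation keys are illustrative), credentials are only ever published by the unit that considers itself the leader, so relations joined or refreshed after that unit has been removed never receive them:

# Sketch of a shared-db-relation-joined hook (illustrative only):
if unit_is_elected_leader; then   # hypothetical check; the real charms decide this differently
    relation-set db_user="$DB_USER" db_password="$DB_PASSWORD" db_host="$VIP"
fi
# Non-leader units publish nothing, so once the old leader is gone the related
# services keep logging "Missing required data: database_password database_host"
# and render incomplete contexts (no sql_connection) into their configs.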

Changed in charms:
status: New → Triaged
affects: charms → rabbitmq-server (Juju Charms Collection)
no longer affects: keystone (Ubuntu)
Changed in keystone (Juju Charms Collection):
importance: Undecided → High
status: New → Triaged
Changed in percona-cluster (Juju Charms Collection):
status: New → Triaged
importance: Undecided → High
James Page (james-page)
summary: - Removing unit from existing percona cluster causes shared-db failure to
- related services
+ Removing lead unit from existing service deployment causes failure to
+ related services
summary: - Removing lead unit from existing service deployment causes failure to
- related services
+ Removing lead unit from existing service deployment causes failure in
+ related services due to incomplete contexts
Liam Young (gnuoy)
Changed in percona-cluster (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

Reproduced for rabbitmq-server: http://paste.ubuntu.com/8078466/

Liam Young (gnuoy)
Changed in rabbitmq-server (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

Reproduced for keystone: http://paste.ubuntu.com/8088061/

Changed in keystone (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
James Page (james-page)
Changed in percona-cluster (Juju Charms Collection):
status: Triaged → Fix Committed
Changed in rabbitmq-server (Juju Charms Collection):
status: Triaged → In Progress
Changed in keystone (Juju Charms Collection):
status: Triaged → In Progress
JuanJo Ciarlante (jjo)
tags: added: canonical-bootstack canonical-is
Revision history for this message
JuanJo Ciarlante (jjo) wrote :

FYI: did a deployment with this branch (actually with [1], which includes it); now seeing creds OK across the relations:
http://paste.ubuntu.com/8468270/

[1] lp:~jjo/charms/trusty/rabbitmq-server/gnuoy-lp1355848_jjo-lp1274947

tags: added: openstack
Changed in keystone (Juju Charms Collection):
status: In Progress → Fix Released
Changed in percona-cluster (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in rabbitmq-server (Juju Charms Collection):
status: In Progress → Fix Released