Removing penultimate unit from percona-cluster service renders service unusable

Bug #1514472 reported by Mario Splivalo
Affects                                    Status    Importance  Assigned to  Milestone
OpenStack Percona Cluster Charm            Triaged   Medium      Unassigned
percona-cluster (Juju Charms Collection)   Invalid   Medium      Unassigned

Bug Description

When a unit is removed from the percona-cluster service, the departing unit's mysql/percona process is not shut down. When the removed unit is the penultimate unit in the cluster, the remaining unit will lose quorum, switch to the 'Disconnected' state, and refuse any queries to the database.

Of course, adding new units will fail too.

The way to recover from this is to shut down mysql/percona on the remaining unit and then bootstrap it with:

# service mysql bootstrap-pxc

After the unit is up and running, it will allow both read and write queries. From there one can run 'juju add-unit' to add more units to the service.
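
For example, assuming the remaining unit is percona-cluster/0 (unit and application names here are illustrative and will differ per deployment):

juju ssh percona-cluster/0 'sudo service mysql bootstrap-pxc'
juju add-unit percona-cluster -n 2

Once bootstrapped, 'wsrep_local_state_comment' on the surviving unit should report 'Synced' again.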

This is unfixable at the moment, as juju does not support hooks that run before the -departed hooks. (See bug https://bugs.launchpad.net/juju-core/+bug/1417874)

The workaround is to stop mysql on the unit to be removed prior to running 'juju remove-unit'. That way the percona unit being stopped signals to the rest of the cluster (in this case, the only remaining node) that it is shutting down in a controlled manner, and the remaining unit continues to operate normally.
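
For example, if the unit to be removed were percona-cluster/1 (an illustrative name):

juju ssh percona-cluster/1 'sudo service mysql stop'
juju remove-unit percona-cluster/1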

Once an 'about-to-depart' (or similar) hook is implemented in juju, this bug can be fixed.

Revision history for this message
James Page (james-page) wrote :

Mario

Looking at the date on this bug report, I think this was a trusty install; I tried to reproduce on xenial, and the last remaining unit went into the 'Initialized' state, not 'Disconnected'. When I then re-added another two units, they did correctly cluster with the remaining unit, and it became the donor for the other two units.

I know there have been some improvements in this area between 5.5 and 5.6, so this might be a much better story on xenial now.

That said, we should probably shut down and purge PXC from any unit that is removed from a cluster; this is doable via the 'stop' hook, which is run on each unit as applications/services are destroyed.
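
A rough sketch of what such a stop hook could do (illustrative only, not the charm's current implementation; the exact package name depends on the series):

#!/bin/bash
# hooks/stop (sketch): shut percona down cleanly so the remaining nodes see a
# controlled departure rather than a suspected network partition, then purge
# the PXC packages from the departing machine.
set -e
service mysql stop || true
apt-get --yes purge percona-xtradb-cluster-server-5.6 || true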

Changed in percona-cluster (Juju Charms Collection):
status: New → Triaged
importance: Undecided → Medium
James Page (james-page)
Changed in charm-percona-cluster:
importance: Undecided → Medium
status: New → Triaged
Changed in percona-cluster (Juju Charms Collection):
status: Triaged → Invalid
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

Hi, James.

I'm sorry to resurrect this one, but the issue remains on xenial too.

I deployed a two-node percona-cluster. After the deployment settled down, I removed one of the units. That indeed left the remaining unit in a non-operational state:

mysql> show status like 'wsrep_local_state_comment';
+---------------------------+-------------+
| Variable_name             | Value       |
+---------------------------+-------------+
| wsrep_local_state_comment | Initialized |
+---------------------------+-------------+
1 row in set (0.00 sec)

mysql> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 1     |
+--------------------+-------+
1 row in set (0.00 sec)

mysql> select 1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
mysql>

The proper 'wsrep_local_state_comment' value should be 'Synced'. As can be seen, the remaining percona unit is non-operational, since user data cannot be queried.

With percona 5.6 it is easier to move the node back to an operational state:

mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
Query OK, 0 rows affected (0.00 sec)

mysql> select 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql> show status like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.00 sec)

mysql>

Again, this happened because when juju removed the unit it did not perform a controlled shutdown on the unit leaving the relation - it merely removed the unit (shutting down the machine). Because the departing unit was not cleanly shut down, it could not notify the remaining unit of its state, so the remaining unit has no idea what happened - from its perspective a network partition could have occurred. To protect data integrity, the remaining unit switched into a 'will not serve any data' mode.

However, the workaround for this is quite simple: before removing the unit, the operator should ssh into the unit that is to be removed and stop the mysqld service there. Once mysqld has shut down cleanly, juju can be used to remove the unit.
