remove HA etcd application in error state

Bug #1835537 reported by Ashley Lai
Affects: Etcd Charm | Status: Triaged | Importance: Medium | Assigned to: Unassigned

Bug Description

With a 3-unit HA etcd deployment, removing the application removed 2 units and left one in an error state.

etcd 3.1.10 error 1 etcd jujucharms 434 ubuntu
etcd/2* error idle 2/lxd/1 10.244.245.235 2379/tcp hook failed: "cluster-relation-broken"

2019-07-05 05:11:39 DEBUG cluster-relation-broken Traceback (most recent call last):
2019-07-05 05:11:39 DEBUG cluster-relation-broken File "/var/lib/juju/agents/unit-etcd-2/charm/hooks/cluster-relation-broken", line 22, in <module>
2019-07-05 05:11:39 DEBUG cluster-relation-broken main()
2019-07-05 05:11:39 DEBUG cluster-relation-broken File "/var/lib/juju/agents/unit-etcd-2/.venv/lib/python3.5/site-packages/charms/reactive/__init__.py", line 73, in main
2019-07-05 05:11:39 DEBUG cluster-relation-broken bus.dispatch(restricted=restricted_mode)
2019-07-05 05:11:39 DEBUG cluster-relation-broken File "/var/lib/juju/agents/unit-etcd-2/.venv/lib/python3.5/site-packages/charms/reactive/bus.py", line 379, in dispatch
2019-07-05 05:11:39 DEBUG cluster-relation-broken _invoke(hook_handlers)
2019-07-05 05:11:39 DEBUG cluster-relation-broken File "/var/lib/juju/agents/unit-etcd-2/.venv/lib/python3.5/site-packages/charms/reactive/bus.py", line 359, in _invoke
2019-07-05 05:11:39 DEBUG cluster-relation-broken handler.invoke()
2019-07-05 05:11:39 DEBUG cluster-relation-broken File "/var/lib/juju/agents/unit-etcd-2/.venv/lib/python3.5/site-packages/charms/reactive/bus.py", line 181, in invoke
2019-07-05 05:11:39 DEBUG cluster-relation-broken self._action(*args)
2019-07-05 05:11:39 DEBUG cluster-relation-broken File "/var/lib/juju/agents/unit-etcd-2/charm/reactive/etcd.py", line 519, in perform_self_unregistration
2019-07-05 05:11:39 DEBUG cluster-relation-broken etcdctl.unregister(members[unit_name]['unit_id'], leader_address)
2019-07-05 05:11:39 DEBUG cluster-relation-broken File "lib/etcdctl.py", line 75, in unregister
2019-07-05 05:11:39 DEBUG cluster-relation-broken return self.run(command)
2019-07-05 05:11:39 DEBUG cluster-relation-broken File "lib/etcdctl.py", line 160, in run
2019-07-05 05:11:39 DEBUG cluster-relation-broken raise EtcdCtl.CommandFailed() from e
2019-07-05 05:11:39 DEBUG cluster-relation-broken etcdctl.CommandFailed
2019-07-05 05:11:39 ERROR juju.worker.uniter.operation runhook.go:129 hook "cluster-relation-broken" failed: exit status 1
2019-07-05 05:11:39 DEBUG juju.machinelock machinelock.go:180 machine lock released for uniter (run relation-broken (3) hook)

Ashley Lai (alai)
summary: - remove etcd application in error state
+ remove HA etcd application in error state
Revision history for this message
George Kraft (cynerva) wrote :

We've never encountered this, but I see how it could happen. The last unit is trying to unregister itself from a cluster that no longer exists. That happens here: https://github.com/charmed-kubernetes/layer-etcd/blob/aca040b46ac80e97da8ea3135b46216cf6bb854c/reactive/etcd.py#L598
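The diagnosis above suggests the handler could be made defensive about this case. A minimal Python sketch of such a guard, assuming a hypothetical `safe_unregister` helper; this is illustrative only, not the charm's actual code or the committed fix:

```python
# Hypothetical sketch of a defensive self-unregistration step for the
# cluster-relation-broken handler. The helper name and arguments are
# illustrative; the real logic lives in reactive/etcd.py.

def safe_unregister(etcdctl, member_id, leader_address, is_last_unit):
    """Best-effort unregistration; skip or tolerate failure at teardown.

    Returns True if the member was unregistered, False otherwise.
    """
    if is_last_unit:
        # No peers remain: the cluster is being torn down entirely,
        # so there is nothing left to unregister from.
        return False
    try:
        etcdctl.unregister(member_id, leader_address)
        return True
    except Exception:
        # Mirrors catching EtcdCtl.CommandFailed in the real charm:
        # a failed unregister during relation-broken should not leave
        # the unit stuck in an error state.
        return False
```

With a guard like this, the last unit of a dying cluster would skip (or shrug off) the unregister call instead of failing the hook.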

Changed in charm-etcd:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Márton Kiss (marton-kiss) wrote :

I have the same issue in a customer environment, where I scaled the units out from 3 to 6, then first removed the two original non-leader units (etcd/0 and etcd/2), and finally tried to remove etcd/1, which is the leader.

In this case etcd shows 4 members (etcd/1, etcd/3, etcd/4, etcd/5), but the etcdctl member remove fails because it tries to use an endpoint that has already been removed (etcd/0).

This can cause a problem not just during full etcd removals, but also for day-2 operations where the etcd units must be relocated.

Revision history for this message
Márton Kiss (marton-kiss) wrote :

A very dirty workaround for the above was to run several resolve commands with the --no-retry option, to prevent the cluster-relation-broken hook from running:

$ juju resolve etcd/1 --no-retry

As a side effect, juju finally showed 3 etcd units; however, the member removal could not run because the hook used the wrong leader address, so etcdctl still reported 4 members. etcd/1 required manual removal:

snap.etcdctl --endpoint https://<new-leader-ip>:2379 member remove <etcd-1-membership-id>
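The member-ID lookup for this manual cleanup can be scripted. A hedged Python sketch that parses etcdctl (v2 API) `member list` output to find a member's ID by name; the helper names are illustrative and the line format assumed is the `<id>: name=... peerURLs=... clientURLs=... isLeader=...` shape etcd 3.x prints in v2 mode:

```python
import re

def parse_member_list(output):
    """Parse etcdctl (v2 API) `member list` output into {member_id: fields}.

    Each line is assumed to look like:
    8e9e05c52164694d: name=etcd1 peerURLs=https://... clientURLs=https://... isLeader=true
    """
    members = {}
    for line in output.strip().splitlines():
        member_id, rest = line.split(":", 1)
        # Collect the space-separated key=value fields after the ID.
        fields = dict(re.findall(r"(\w+)=(\S+)", rest))
        members[member_id.strip()] = fields
    return members

def find_member_id(members, name):
    """Return the member ID whose name matches, or None if absent."""
    for member_id, fields in members.items():
        if fields.get("name") == name:
            return member_id
    return None
```

Feeding this the output of `member list` from a live endpoint yields the ID to pass to `member remove`, instead of copying it out by hand.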
