Cluster Unhealthy after replacing a unit

Bug #2063100 reported by macchese
Affects: Etcd Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

ubuntu@juju:~$ juju status etcd
Model Controller Cloud/Region Version SLA Timestamp
openstack maas-controller1 maas-cloud/default 3.2.4 unsupported 10:35:09Z

App Version Status Scale Charm Channel Rev Exposed Message
etcd 3.4.22 blocked 3 etcd stable 760 no UnHealthy with 3 known peers

Unit Workload Agent Machine Public address Ports Message
etcd/7* blocked idle 7/lxd/26 192.168.70.93 2379/tcp UnHealthy with 3 known peers
etcd/8 blocked idle 8/lxd/27 192.168.70.92 2379/tcp UnHealthy with 3 known peers
etcd/10 waiting idle 3/lxd/47 192.168.70.155 Waiting to retry etcd registration

Machine State Address Inst id Base AZ Message
3 started 192.168.6.113 op4 ubuntu@22.04 default Deployed
3/lxd/47 started 192.168.70.155 juju-206dbb-3-lxd-47 ubuntu@22.04 default Container started
7 started 192.168.6.101 xen01 ubuntu@22.04 xensrv Deployed
7/lxd/26 started 192.168.70.93 juju-206dbb-7-lxd-26 ubuntu@22.04 xensrv Container started
8 started 192.168.6.110 op1 ubuntu@22.04 default Deployed
8/lxd/27 started 192.168.70.92 juju-206dbb-8-lxd-27 ubuntu@22.04 default Container started

I removed a unit using --force (without --force the removal did not complete) and then added one unit back (etcd/10).
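
For reference, the sequence was roughly the following (the number of the removed unit is a placeholder):

ubuntu@juju:~$ juju remove-unit etcd/<N>           # did not complete without --force
ubuntu@juju:~$ juju remove-unit etcd/<N> --force
ubuntu@juju:~$ juju add-unit etcd                  # came up as etcd/10
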
Looking at the new unit's (etcd/10) log:

ubuntu@juju:~$ juju debug-log --include etcd/10
unit-etcd-10: 10:28:28 INFO unit.etcd/10.juju-log Invoking reactive handler: reactive/etcd.py:440:register_node_with_leader
unit-etcd-10: 10:28:30 ERROR unit.etcd/10.juju-log ['/snap/bin/etcd.etcdctl', '--endpoint', 'https://192.168.70.93:2379', 'member', 'add', 'etcd10', 'https://192.168.70.155:2380']
unit-etcd-10: 10:28:30 ERROR unit.etcd/10.juju-log {'ETCDCTL_API': '2', 'ETCDCTL_CA_FILE': '/var/snap/etcd/common/ca.crt', 'ETCDCTL_CERT_FILE': '/var/snap/etcd/common/server.crt', 'ETCDCTL_KEY_FILE': '/var/snap/etcd/common/server.key'}
unit-etcd-10: 10:28:30 ERROR unit.etcd/10.juju-log b'client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://192.168.70.92:2379 has no leader\n; error #1: client: endpoint https://192.168.70.91:2379 exceeded header timeout\n; error #2: client: etcd member https://192.168.70.93:2379 has no leader\n\n'
unit-etcd-10: 10:28:30 ERROR unit.etcd/10.juju-log None
unit-etcd-10: 10:28:30 WARNING unit.etcd/10.juju-log Notice: Unit failed self registration
unit-etcd-10: 10:28:30 INFO unit.etcd/10.juju-log etcdctl.register failed, will retry
unit-etcd-10: 10:28:30 INFO unit.etcd/10.juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:80:joined:certificates
unit-etcd-10: 10:28:30 INFO unit.etcd/10.juju-log status-set: waiting: Waiting to retry etcd registration
unit-etcd-10: 10:28:31 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)
unit-etcd-10: 10:28:27 INFO unit.etcd/10.juju-log Reactive main running for hook update-status
unit-etcd-10: 10:28:28 INFO unit.etcd/10.juju-log Initializing Snap Layer
unit-etcd-10: 10:28:28 INFO unit.etcd/10.juju-log Initializing Leadership Layer (is follower)
unit-etcd-10: 10:28:28 INFO unit.etcd/10.juju-log Invoking reactive handler: reactive/tls_client.py:18:store_ca
unit-etcd-10: 10:28:28 INFO unit.etcd/10.juju-log Invoking reactive handler: reactive/tls_client.py:44:store_server
unit-etcd-10: 10:28:28 INFO unit.etcd/10.juju-log Invoking reactive handler: reactive/tls_client.py:71:store_client
unit-etcd-10: 10:28:28 INFO unit.etcd/10.juju-log Invoking reactive handler: reactive/etcd.py:144:set_app_version
unit-etcd-10: 10:28:28 INFO unit.etcd/10.juju-log Invoking reactive handler: reactive/etcd.py:158:prepare_tls_certificates
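
As a cross-check, the same etcdctl call the charm makes can be replayed by hand from one of the surviving units, reusing the endpoint and certificate paths from the environment dump above (a sketch only, not verified on this deployment):

ubuntu@juju:~$ juju ssh etcd/7
$ sudo -i
# export ETCDCTL_API=2 ETCDCTL_CA_FILE=/var/snap/etcd/common/ca.crt \
         ETCDCTL_CERT_FILE=/var/snap/etcd/common/server.crt \
         ETCDCTL_KEY_FILE=/var/snap/etcd/common/server.key
# /snap/bin/etcd.etcdctl --endpoint https://192.168.70.93:2379 cluster-health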

It seems to me that the removed unit is still registered in the cluster (client: endpoint https://192.168.70.91:2379 exceeded header timeout) and that the cluster has had no leader since I removed that unit.
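
To check whether that is the case, the member list can be queried directly from the same root shell, and a stale entry for the removed peer can be dropped by its member ID (the ID below is a placeholder taken from the member list output), assuming the remaining members can still reach quorum:

# /snap/bin/etcd.etcdctl --endpoint https://192.168.70.93:2379 member list
# /snap/bin/etcd.etcdctl --endpoint https://192.168.70.93:2379 member remove <id-of-192.168.70.91>

If the stale entry can be removed, etcd/10 should succeed the next time it retries its registration (the charm retries on update-status, as the log above shows).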

How can I recover?
