Upgrading etcd charm can result in lost data / destroyed cluster
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Etcd Charm | Fix Released | Critical | George Kraft | |
Bug Description
Overview:
When upgrading an existing cluster from 1.14 to 1.15, updating the etcd charm leaves no working etcd servers ('0 peers online'), along with data loss that destroys all cluster state, effectively wiping the entire cluster.
What I did:
Performed an upgrade from cs:~containers/
juju upgrade-charm etcd --path $(pwd)/etcd
What I expected:
I expected that after some time the cluster would update to the newest charm, and that the service would continue working with the data it had previously.
What happened instead:
All three etcd units went down with the error '0 peers online'. All attempts to bring them back online failed. Even attempting to restore the v3 snapshot I took immediately prior to attempting the upgrade failed.
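For context, cluster membership and health in this state can be checked with etcdctl's v3 API. This is a sketch only; the endpoint URL and TLS certificate paths below are placeholders and will differ per deployment:

```shell
# Use the etcd v3 API.
export ETCDCTL_API=3

# List known members and check endpoint health.
# Endpoint and cert paths are placeholders for this deployment.
etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/path/to/ca.crt \
        --cert=/path/to/client.crt \
        --key=/path/to/client.key \
        member list

etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/path/to/ca.crt \
        --cert=/path/to/client.crt \
        --key=/path/to/client.key \
        endpoint health
```

In the failure described above, both commands fail because no peer is reachable.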
Impact:
On one occasion, this bug resulted in the complete destruction of the cluster. On the second, if I had not been able to manually rebuild the etcd master, the entire cluster's data would have been lost and all pods, service maps, deployments, etc. would have been gone. This would have wiped the cluster clean and destroyed all the services running in it, just as it did the previous time.
Mitigation:
Because I encountered a similar issue previously, I immediately shut down the kubeapi services on the kubernetes-master units so that they would not start making changes to the kubernetes cluster.
I was able to restore the master from a manual backup I had made previously, and then manually edit /var/snap/
Once that was done, I was able to add two new units, and then manually add them to the cluster via etcdctl / manual snap configuration on each unit. I then had to destroy the old secondary units with juju remove-machine # --force to get them to go away.
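The recovery steps above correspond roughly to the following sketch. The snapshot path, member names, peer URLs, and data directory are all assumptions for illustration, and the restored data directory must be moved to wherever the snap expects it:

```shell
export ETCDCTL_API=3

# 1. Restore the v3 snapshot into a fresh data directory on the master
#    (all paths and URLs below are placeholders).
etcdctl snapshot restore /path/to/backup.db \
        --name etcd0 \
        --initial-cluster etcd0=https://10.0.0.1:2380 \
        --initial-advertise-peer-urls https://10.0.0.1:2380 \
        --data-dir /path/to/restored-data

# 2. After adding fresh units with juju, register each new member with the
#    running cluster before its etcd process first starts.
etcdctl member add etcd1 --peer-urls=https://10.0.0.2:2380

# 3. Force-remove the machines backing the old, broken secondary units.
juju remove-machine <machine-id> --force
```

Note that `etcdctl member add` prints the `ETCD_INITIAL_CLUSTER*` values the new member must start with; the new unit's configuration has to match them.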
Details which may be relevant:
1) Post upgrade, all the etcd units had new /var/snap/
2) The 'initial-
3) Attempting a restore of a v3 snapshot made immediately prior to the upgrade did not resolve the issue.
4) Manually copying the db file from the snapshot also did not resolve the issue.
5) No variation of initial-
6) This message appeared in the logs immediately following the attempted upgrade:
cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls
7) Prior to the upgrade, the master unit was using /var/snap/
description: updated
Changed in charm-etcd:
assignee: nobody → George Kraft (cynerva)
status: New → Triaged
Changed in charm-etcd:
status: Triaged → In Progress
Changed in charm-etcd:
milestone: none → 1.16+ck1
Changed in charm-etcd:
status: In Progress → Fix Committed
Changed in charm-etcd:
status: Fix Committed → Fix Released
Speculation:
Perhaps the charm did not take the existing configuration, or the location of the data, into account and generated a fresh config?