Activity log for bug #1843497

Date Who What changed Old value New value Message
2019-09-10 20:37:47 Jay Kuri bug added bug
2019-09-10 20:44:58 Jay Kuri bug added subscriber The Canonical Sysadmins
2019-09-10 20:45:42 Jay Kuri description updated; the new value (reproduced below) differs from the old only in noting that the upgrade was from 1.14 to 1.15.

Overview: When upgrading an existing cluster from 1.14 to 1.15, updating the etcd charm results in no working etcd servers (0 peers available) and data loss, destroying all cluster state and effectively wiping the entire cluster.

What I did: Performed an upgrade from cs:~containers/etcd-431 to cs:~containers/etcd-449 using the following command (from the local directory the etcd charm was in):

juju upgrade-charm etcd --path $(pwd)/etcd

What I expected: That after some time the cluster would update to the newest charm and the service would continue to work with the data it had previously.

What happened instead: All three etcd units went down with the error '0 peers online'. All attempts to bring them back online failed. Even attempting to reload the v3 snapshot I took immediately prior to attempting the upgrade failed.

Impact: On one occasion, this bug resulted in the complete destruction of the cluster. On the second, had I not been able to manually rebuild the etcd master, the entire cluster data would have been lost and all pods, servicemaps, deployments, etc. would have been gone. This would have wiped the cluster clean and destroyed all the services running in it, just as it did the previous time.

Mitigation: Because I had encountered a similar issue previously, I immediately shut down the kubeapi services on the kubernetes-master units so that they would not start making changes to the kubernetes cluster. I was able to restore the master from a manual backup I had made previously, and then manually edit /var/snap/etcd/common/etcd.conf.yml to bring the unit back online as a single-unit cluster. Once that was done, I was able to add two new units and then manually add them to the cluster via etcdctl / manual snap configuration on each unit. I then had to destroy the old secondary units with juju remove-machine # --force to get them to go away. (A rough command-level sketch of this recovery follows the details list below.)

Details which may be relevant:
1) Post upgrade, all the etcd units had new /var/snap/etcd/common/etcd.conf.yml files. None of these files had empty initial-cluster: entries.
2) The 'initial-cluster-token' was different on the upgraded master than it was on the slaves.
3) Attempting a restore of a v3 snapshot made immediately prior to the upgrade did not resolve the issue.
4) Manually copying the db file from the snapshot also did not resolve the issue.
5) No variation of initial-cluster-state, initial-cluster or force-new-cluster would allow the master unit to come online.
6) This message appeared in the logs immediately following the attempted upgrade: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls
7) Prior to the upgrade, the master unit was using /var/snap/etcd/current/etcd0.etcd/member as the root of the data. After the upgrade, it was using /var/snap/etcd/current/member.
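The following is a minimal, hypothetical sketch of the upgrade command and the recovery sequence described in the Mitigation section, not an exact transcript. It assumes a Charmed Kubernetes deployment where etcd and kube-apiserver run as snaps; the etcdctl invocation, certificate paths under /var/snap/etcd/common/, the snap.kube-apiserver.daemon service name, the member name etcd1, and the <new-unit-ip>/<machine-number> placeholders are all assumptions, not values taken from the report.

# 0) Before upgrading, take a v3 snapshot on the etcd leader (endpoint and cert paths are assumed).
juju ssh etcd/0 "sudo ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/var/snap/etcd/common/ca.crt \
    --cert=/var/snap/etcd/common/server.crt \
    --key=/var/snap/etcd/common/server.key \
    snapshot save /home/ubuntu/etcd-pre-upgrade.db"

# 1) The upgrade that triggered the bug (run from the directory containing the charm).
juju upgrade-charm etcd --path $(pwd)/etcd

# 2) Recovery: stop the API servers so nothing mutates cluster state while etcd is down
#    (assumed service name for the snap-based kube-apiserver on each kubernetes-master unit).
juju ssh kubernetes-master/0 "sudo systemctl stop snap.kube-apiserver.daemon"

# 3) On the etcd master: restore the data directory from the manual backup, edit
#    /var/snap/etcd/common/etcd.conf.yml so the unit starts as a single-member cluster
#    (initial-cluster listing only itself), then restart the snap.
juju ssh etcd/0 "sudo snap restart etcd"

# 4) Add two fresh units, register each with the surviving member via etcdctl, and align each
#    new unit's etcd.conf.yml before it starts (member name and peer URL are placeholders).
juju add-unit etcd -n 2
juju ssh etcd/0 "sudo ETCDCTL_API=3 etcdctl member add etcd1 --peer-urls=https://<new-unit-ip>:2380"

# 5) Force-remove the machines hosting the broken secondary units.
juju remove-machine <machine-number> --force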
2019-09-11 14:05:05 Dean Henrichsmeyer charm-etcd: importance Undecided Critical
2019-09-11 14:05:47 Dean Henrichsmeyer bug added subscriber Canonical Field Critical
2019-09-13 19:40:31 Tim Van Steenburgh charm-etcd: assignee George Kraft (cynerva)
2019-09-13 19:41:19 Tim Van Steenburgh charm-etcd: status New Triaged
2019-09-13 20:25:07 George Kraft charm-etcd: status Triaged In Progress
2019-09-25 20:03:19 Tim Van Steenburgh charm-etcd: milestone 1.16+ck1
2019-09-26 14:28:36 George Kraft charm-etcd: status In Progress Fix Committed
2019-10-04 20:14:00 Tim Van Steenburgh charm-etcd: status Fix Committed Fix Released