Upgrading etcd charm can result in lost data / destroyed cluster

Bug #1843497 reported by Jay Kuri on 2019-09-10
Affects Status Importance Assigned to Milestone
Etcd Charm
Critical
George Kraft

Bug Description

Overview:

When upgrading an existing cluster from 1.14 to 1.15, upgrading the etcd charm leaves no working etcd servers (0 peers available) and loses the etcd data. This destroys all cluster state, effectively wiping the entire cluster.

What I did:

Performed an upgrade from cs:~containers/etcd-431 to cs:~containers/etcd-449 using the following command (run from the local directory containing the etcd charm):

juju upgrade-charm etcd --path $(pwd)/etcd

What I expected:

I expected that after some time, the cluster would update to the newest charm, and the service would continue to work with the data it had previously.

What happened instead:

All three etcd units went down with the error '0 peers online'. All attempts to bring them back online failed. Even reloading the v3 snapshot I took immediately prior to the upgrade failed.

Impact:

On one occasion, this bug resulted in the complete destruction of the cluster. On the second, if I had not been able to manually rebuild the etcd master, the entire cluster data would have been lost and all pods, servicemaps, deployments, etc. would have been gone. This would have wiped the cluster clean and destroyed all the services that were running in it, just as it did in the previous one.

Mitigation:

Because I encountered a similar issue previously, I immediately shut down the kubeapi services on the kubernetes-master units so that they would not start making changes to the kubernetes cluster.

I was able to restore the master from a manual backup I had made previously, and then manually edit /var/snap/etcd/common/etcd.conf.yml to cause the unit to come back online in a single-unit cluster.

Once that was done, I was able to add two new units, and then manually add them to the cluster via etcdctl / manual snap configuration on each unit. I then had to destroy the old secondary units with juju remove-machine # --force to get them to go away.

Details which may be relevant:

1) Post upgrade, all the etcd units had new /var/snap/etcd/common/etcd.conf.yml files. None of these files had empty initial-cluster: entries.

2) The 'initial-cluster-token' was different on the upgraded master than it was on the slaves.

3) Attempting a restore of a v3 snapshot made immediately prior to the upgrade did not resolve the issue.

4) Manually copying the db file from the snapshot also did not resolve the issue.

5) No variation of initial-cluster-state, initial-cluster or force-new-cluster would allow the master unit to come online.

6) This message appeared in the logs immediately following the attempted upgrade:

   cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls

7) Prior to the upgrade, the master unit was using /var/snap/etcd/current/etcd0.etcd/member as the root of the data. After the upgrade, it was using /var/snap/etcd/current/member.

Jay Kuri (jk0ne) wrote :

Speculation:

Perhaps the charm did not take into account the config that existed, or the location of the data, and generated a fresh config?

Jay Kuri (jk0ne) on 2019-09-10
description: updated
Dean Henrichsmeyer (dean) wrote :

Subscribing ~field-critical for obvious reasons.

Changed in charm-etcd:
importance: Undecided → Critical
Changed in charm-etcd:
assignee: nobody → George Kraft (cynerva)
status: New → Triaged
George Kraft (cynerva) on 2019-09-13
Changed in charm-etcd:
status: Triaged → In Progress
Jay Kuri (jk0ne) wrote :

Note - in the initial bug report, this sentence:

1) Post upgrade, all the etcd units had new /var/snap/etcd/common/etcd.conf.yml files. None of these files had empty initial-cluster: entries.

Is incorrect. It should have been 'none of these files had non-empty initial-cluster entries'

George Kraft (cynerva) wrote :

Do you know if this cluster ran etcd 2.3 at any point in the past?

What version of etcd is it running now? (see Version column in `juju status` output)

Right now it looks like I can reproduce this if I start with etcd 2.3, migrate to etcd 3.0, and then upgrade the etcd charm from rev 431 to 449. Doing that put all my units in an "Errored with 0 known peers" state with symptoms similar to those described in the bug description. I have so far been unable to produce any errors if I start the deployment with etcd 3.0 or 3.2 instead.

George Kraft (cynerva) wrote :

For now I'm moving forward with the assumption that your clusters ran etcd 2.3 once in the past. Some key points of information:

1. Etcd 2.3 stores data in /var/snap/etcd/current/etcd0.etcd/
2. Etcd 3.x stores data in /var/snap/etcd/current/
3. If you upgrade from etcd 2.3 to etcd 3.0, then the snap generates a "migration config"[1] that includes an adjusted data-dir field to keep the data in /var/snap/etcd/current/etcd0.etcd/
4. Usually, the etcd charm does not regenerate its configuration, even on upgrade-charm, so the "migration config" continues to be used.
5. However, etcd-449 includes a PR[2] that causes the config to be regenerated. When that happens, the data-dir is changed to /var/snap/etcd/current/ but the data is not moved. As far as etcd is concerned, all data is lost.
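
A minimal sketch of how the mismatch described in points 1 to 5 could be detected before a config is regenerated. This is illustrative only and is not the charm's actual code: the function names are hypothetical, and preferring the legacy etcd0.etcd directory when both layouts contain a member directory is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_data_dir(snap_current: Path) -> Optional[Path]:
    """Return the directory under snap_current that holds a 'member' dir.

    The legacy etcd 2.3 layout (etcd0.etcd/) is checked first, on the
    assumption that it holds the real data if both layouts exist.
    """
    legacy = snap_current / "etcd0.etcd"  # layout kept by the migration config
    default = snap_current                # layout a regenerated config points at
    for candidate in (legacy, default):
        if (candidate / "member").is_dir():
            return candidate
    return None


def config_matches_data(snap_current: Path, configured_data_dir: Path) -> bool:
    """True if the configured data-dir is where the data actually lives.

    Also True when no data exists yet, since nothing can be orphaned.
    """
    actual = find_data_dir(snap_current)
    return actual is None or actual == configured_data_dir
```

With data in /var/snap/etcd/current/etcd0.etcd/member, a regenerated config whose data-dir is /var/snap/etcd/current/ would fail this check, which is the situation in point 5.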

This was a time bomb. The charm needs to be able to regenerate the etcd config as needed, but the etcd2->3 upgrade makes doing that a disaster. We haven't encountered this until now because of how rare it is for the etcd charm to actually regenerate its config.

I am still looking into solutions, but I think what needs to happen here is that the charm needs to detect this case and complete the migration such that it's no longer dependent on a special migration config to function.

[1]: https://github.com/juju-solutions/etcd-snaps/blob/d53089eb425db715c5514186cd5ee108a8671332/bin/snap-wrap.sh#L38-L77
[2]: https://github.com/charmed-kubernetes/layer-etcd/pull/158

Jay Kuri (jk0ne) wrote :

Hello!

Sorry for the delay. Yes, this cluster had been 2.3 at one point, though it was already on 3.0.17 when this upgrade began.

I agree regarding what needs to happen. The one thing I would point out is that there are existing deployments in this state, with a 3.0.17 DB that still lives in etcd0.etcd, so whatever solution is used needs to account for that state.

Also - I think the regeneration of the config is the core issue here, because the cluster token was lost as well, so even if the data had been moved, the cluster would not have recovered because the token was different. Perhaps the token could be generated from the juju identifiers somehow, so it is always consistent.
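
To illustrate the idea of a token derived from stable identifiers, here is a rough sketch. It assumes the model UUID and the application name are available and never change; the function name and the choice of a version 5 UUID are hypothetical, not something the charm does.

```python
import uuid


def derive_cluster_token(model_uuid: str, application: str) -> str:
    """Derive a repeatable token from identifiers that are assumed stable.

    uuid5 hashes the name inside the given namespace, so the same inputs
    always produce the same token, on any unit and after any regeneration.
    """
    namespace = uuid.UUID(model_uuid)
    return str(uuid.uuid5(namespace, application))
```

Because nothing random is involved, a unit that regenerates its config would arrive at the same token as its peers without needing to read it from anywhere.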

George Kraft (cynerva) wrote :

> the cluster token was lost as well

The charm is supposed to keep track of this by storing the token in leadership data. However, it looks like when a new etcd unit becomes leader, it generates a new token instead of re-using the existing one.
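
A minimal sketch of the intended behaviour, where a leader only generates a token when none has been stored yet. A plain dict stands in for leadership data, and the key name "cluster-token" and the function name are assumptions made for illustration, not the charm's real identifiers.

```python
import secrets
from typing import MutableMapping


def ensure_cluster_token(leader_data: MutableMapping[str, str]) -> str:
    """Return the stored token, creating one only if none exists.

    leader_data is a stand-in for the leadership data store.
    """
    token = leader_data.get("cluster-token")
    if not token:
        # First leader ever: generate a token and persist it.
        token = secrets.token_hex(16)
        leader_data["cluster-token"] = token
    # Any later leader reuses what is already stored.
    return token
```

The bug described above corresponds to skipping the lookup and always taking the generation branch when leadership changes.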

George Kraft (cynerva) wrote :

Fix PR: https://github.com/charmed-kubernetes/layer-etcd/pull/160
Doc PR: https://github.com/charmed-kubernetes/kubernetes-docs/pull/281

The fix makes it possible to upgrade the etcd charm after the 2.3->3.x migration without losing data, and it prevents newly elected leader units from clobbering the existing cluster-token.

Given that upgrading to etcd 449 will still be dangerous, and users may attempt to do so as part of upgrading through CK 1.15, we will add a special upgrade note for this case.

Changed in charm-etcd:
milestone: none → 1.16+ck1
George Kraft (cynerva) on 2019-09-26
Changed in charm-etcd:
status: In Progress → Fix Committed
Changed in charm-etcd:
status: Fix Committed → Fix Released