A failed deploy can cause permanent failure to deploy

Bug #1915466 reported by Jay Kuri
This bug affects 1 person
Affects: charm-k8s-discourse
Importance: Undecided
Assigned to: Unassigned

Bug Description

If a deployment or update fails, in certain cases the environment may be unable to recover and will wait permanently for the migrate lock to disappear.

Details:

During a deployment or upgrade, the image makes use of a redis lock to ensure only one unit is running the discourse database migration step. This lock is removed once the discourse migration step completes. If, for some reason, this step is interrupted by kubernetes (such as a pod kill or another scenario that halts the startup script), the lock is left in place, and from that point on no new unit can deploy.
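The lock handshake described above can be sketched as follows. This is an illustrative model only: a plain dict stands in for the real redis server (the actual image would use something like redis's atomic SET with the NX option), and the function names are hypothetical; the key name is the one mentioned in the workaround.

```python
# Illustrative model of the migrate-lock handshake. A dict stands in
# for redis; function names are made up for this sketch.

def try_acquire_migrate_lock(store, pod_name, key="discourse_migrate_lock"):
    """Atomically claim the lock; only the first caller succeeds."""
    if key in store:
        return False           # someone else is the migrator; we wait
    store[key] = pod_name      # record which pod holds the lock
    return True

def release_migrate_lock(store, key="discourse_migrate_lock"):
    """Called by the migrator once the migration step completes."""
    store.pop(key, None)

store = {}
assert try_acquire_migrate_lock(store, "pod-a") is True   # pod-a is migrator
assert try_acquire_migrate_lock(store, "pod-b") is False  # pod-b waits
# If pod-a is killed here, release_migrate_lock never runs and
# pod-b (and every future pod) waits forever -- this bug.
```

The failure mode is visible in the last comment: nothing in this scheme clears the lock if the holder dies before releasing it.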

Detection:

If you look at the pod logs and see messages like this on all of your deploying units:

Pod setup starting...
Migrate lock found, Migrate running on discourse-odm-kb-55fb58cc87-k6s6w, waiting 90s.
Migrate lock found, Migrate running on discourse-odm-kb-55fb58cc87-k6s6w, waiting 90s.

Look for the pod named in the log line. If that pod does not exist, you are affected by this bug.
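As a small convenience for the detection step above, the pod name can be pulled out of that log line with a regex. The pattern below is inferred from the example output shown; the function name is made up:

```python
import re

# Matches the "Migrate running on <pod>," part of the waiter log
# line shown above.
LOCK_LINE = re.compile(r"Migrate lock found, Migrate running on (\S+),")

def pod_named_in_log(line):
    """Return the pod name a waiter is blocked on, or None."""
    m = LOCK_LINE.search(line)
    return m.group(1) if m else None

line = "Migrate lock found, Migrate running on discourse-odm-kb-55fb58cc87-k6s6w, waiting 90s."
print(pod_named_in_log(line))  # discourse-odm-kb-55fb58cc87-k6s6w
```

If the pod this returns is not in the output of `kubectl get pods`, the lock is stale.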

Workaround:

Access the redis server that your deployment uses and remove the `discourse_migrate_lock` key. You will then need to restart your pods to trigger the migration.
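In code terms, the workaround amounts to a check-then-delete: confirm the holder pod is gone, then remove the key. A dict again stands in for redis and a set of names for the live pods (in practice you would use something like `redis-cli DEL discourse_migrate_lock` and `kubectl get pods`); all names here are illustrative:

```python
def clear_stale_migrate_lock(store, live_pods, key="discourse_migrate_lock"):
    """Remove the lock only if the pod it names no longer exists.

    Mirrors the manual workaround: never delete the key while the
    holder pod is still alive, since the migration may be running.
    """
    holder = store.get(key)
    if holder is None:
        return False              # no lock, nothing to do
    if holder in live_pods:
        return False              # migration may genuinely be running
    del store[key]                # holder is dead: lock is stale
    return True

store = {"discourse_migrate_lock": "discourse-odm-kb-55fb58cc87-k6s6w"}
assert clear_stale_migrate_lock(store, live_pods={"some-other-pod"}) is True
assert "discourse_migrate_lock" not in store
```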

Revision history for this message
Tom Haddon (mthaddon) wrote :

A few possible ways to address this:

- See if it's feasible for the charm to clean up the lock if an update or deployment fails (not sure we can be aware of all the failure modes and whether this is feasible though).
- Make sure if this lock does exist there's an easy mechanism to communicate that to charm operators (possibly we should set the charm to blocked status).
- Give operators an easy way to remove it, possibly via a charm action, and make it obvious that that's how to do it.

I think having a lock expiry is potentially dangerous because we're playing a guessing game as to how long that should be. Set it too low or too high and there are downsides. I'd be open to being convinced otherwise, but that's my initial instinct here.

Revision history for this message
Jay Kuri (jk0ne) wrote :

I take your point and will investigate how we might clean up the lock in the case of a failure.

I think adding lock expiry won't hurt, though; here's why:

Lock expiry should work with a long timeout (like 3 hours), because during a deployment the lock only matters when the pods first spin up.

The lock state provides two pieces of information:

1) Only one pod can create the lock, which means the pod that acquires it takes the role of migrator, while all other pods that try become waiters.

2) It tells the waiters that migration is complete and it is safe to continue booting up.

Once the migrator is self-selected, the others wait until the lock is removed. They cannot become migrators after they have started waiting.

So if we were to set the lock expiry to, say, 3 hours, this would protect the migrator role as needed. The migrator will remove the lock when it completes (if things are functioning properly), and if for some reason the migrator is killed and doesn't remove the lock, it will auto-clear after the 3 hours elapse.
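The expiring variant described above could look roughly like this. Again a dict models redis (the real equivalent would be redis's SET with the NX and EX options, which expire the key server-side), the 3-hour figure is the one proposed in this comment, and the function name is hypothetical:

```python
import time

LOCK_TTL = 3 * 60 * 60  # 3 hours, as proposed above

def try_acquire_with_expiry(store, pod_name, now=None,
                            key="discourse_migrate_lock", ttl=LOCK_TTL):
    """Claim the lock unless an unexpired lock already exists."""
    now = time.time() if now is None else now
    entry = store.get(key)
    if entry is not None:
        holder, expires_at = entry
        if now < expires_at:
            return False          # lock still valid: wait
        # lock expired: the old migrator died without cleaning up
    store[key] = (pod_name, now + ttl)
    return True

store = {}
assert try_acquire_with_expiry(store, "pod-a", now=0) is True
assert try_acquire_with_expiry(store, "pod-b", now=100) is False
# After the TTL elapses, the stale lock auto-clears:
assert try_acquire_with_expiry(store, "pod-b", now=LOCK_TTL + 1) is True
```

With redis doing the expiry, no second code path is needed: an abandoned lock simply stops existing after the TTL.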
