A failed deploy can cause permanent failure to deploy
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
charm-k8s-discourse | Confirmed | Low | Unassigned |
Bug Description
If a deployment or update fails, in certain cases the environment may be unable to recover and will wait permanently for the migrate lock to disappear.
Details:
During a deployment or upgrade, the image makes use of a Redis lock to ensure that only one unit runs the Discourse database migration step. This lock is removed once the migration step completes. If, for some reason, this step is interrupted by Kubernetes (such as a pod kill or another scenario that halts the startup script), the lock is left in place, and from that point on no new unit can deploy.
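To illustrate the failure mode, here is a minimal Python sketch of that locking pattern (not the image's actual startup code; the Redis host, key name, and migration command are assumptions):

```python
# Illustrative sketch of the lock flow described above.
# Host, key name, and migration command are assumptions.
import socket
import subprocess

import redis

r = redis.Redis(host="redis.example.com")

# Only one unit wins the lock and runs the migration.
if r.set("discourse_migrate_lock", socket.gethostname(), nx=True):
    subprocess.run(["rake", "db:migrate"], check=True)  # pod may be killed here
    r.delete("discourse_migrate_lock")  # never reached if the step is interrupted
else:
    holder = r.get("discourse_migrate_lock")
    print(f"Migrate lock found, Migrate running on {holder.decode()}")
```

Because the delete only runs on the happy path, any interruption between acquiring and releasing the lock leaves it in Redis indefinitely.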
Detection:
If you look at the pod logs and see messages like this on all of your deploying units:
```
Pod setup starting...
Migrate lock found, Migrate running on discourse-
Migrate lock found, Migrate running on discourse-
```
Look for the pod named in the log line. If that pod does not exist, you are affected by this bug.
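If you prefer to script that check, here is a hypothetical sketch using the Kubernetes Python client (the namespace and pod name are placeholders to fill in from your deployment):

```python
# Hypothetical check: is the pod named in the lock message still running?
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "discourse"  # assumed namespace
lock_holder = "discourse-"  # pod name taken from the log line above

pod_names = {p.metadata.name for p in v1.list_namespaced_pod(namespace).items}
if lock_holder not in pod_names:
    print("Lock holder is gone: you are affected by this bug.")
else:
    print("Migration may still be in progress.")
```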
Workaround:
Access the Redis server that your deployment makes use of and remove the `discourse_`-prefixed migrate lock key.
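For example, a minimal redis-py sketch (the host and exact key name are assumptions; list the candidate keys before deleting anything):

```python
# Hypothetical cleanup: find and remove the stale migrate lock.
import redis

r = redis.Redis(host="redis.example.com")  # assumed host

# Inspect discourse_-prefixed keys first rather than deleting blindly.
for key in r.scan_iter(match="discourse_*"):
    print(key)

# Once you have identified the migrate lock key, delete it, e.g.:
# r.delete("discourse_migrate_lock")  # assumed key name
```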
Changed in charm-k8s-discourse:
status: New → Confirmed
importance: Undecided → Low
A few possible ways to address this:
- See if it's feasible for the charm to clean up the lock if an update or deployment fails (though it's not clear we can be aware of all the failure modes).
- Make sure that, if this lock does exist, there's an easy mechanism to communicate that to charm operators (possibly we should set the charm to blocked status).
- Give operators an easy way to remove it, possibly via a charm action, and make it obvious that this is the supported recovery path (a sketch of such an action follows below).
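As a rough illustration of that last option, a hypothetical charm action using the ops framework (the action name, the `redis_host` config option, and the lock key name are all assumptions, not the charm's actual code):

```python
# Hypothetical "remove-migrate-lock" action for the Discourse charm.
import redis
from ops.charm import CharmBase
from ops.main import main

LOCK_KEY = "discourse_migrate_lock"  # assumed key name


class DiscourseCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on.remove_migrate_lock_action, self._on_remove_migrate_lock
        )

    def _on_remove_migrate_lock(self, event):
        # "redis_host" is an assumed config option for illustration.
        client = redis.Redis(host=self.config["redis_host"])
        removed = client.delete(LOCK_KEY)
        event.set_results({"removed": str(bool(removed))})


if __name__ == "__main__":
    main(DiscourseCharm)
```

Operators would then run something like `juju run-action discourse/0 remove-migrate-lock` to clear the stale lock.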
I think having a lock expiry is potentially dangerous because we're playing a guessing game as to how long that should be. Set it too low and the lock can expire while a slow migration is still running, allowing two migrations at once; set it too high and a stuck deployment still waits a long time to recover. I'd be open to being convinced otherwise, but that's my initial instinct here.