Comment 2 for bug 1915466

Jay Kuri (jk0ne) wrote :

I take your point and will investigate how we might clean up the lock in the case of a failure.

I think adding lock expiry won't hurt, though. Here's why:

Lock expiry should work with a long timeout (like 3 hours), because during a deployment the lock only matters when the pods first spin up.

The lock state provides two pieces of information:

1) Only one pod can create the lock, which means the one that gets the lock gets the role of migrator, while all other pods that try get the role of waiter.

2) It tells the waiters that migration is complete and it is safe to continue booting up.

Once the migrator is self-selected, the others take waiting status until the lock is removed. They cannot become migrators after they have started waiting.
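To make the role selection concrete, here is a minimal sketch of the idea, not our actual implementation. InMemoryLockStore, try_create and choose_role are hypothetical names, and the in-process store only stands in for whatever shared backend really holds the lock. The part that matters is that creating the lock is atomic, so exactly one caller can win.

```python
import threading


class InMemoryLockStore:
    """Hypothetical stand-in for the shared backend that holds the lock."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = set()

    def try_create(self, name):
        # Atomic create-if-absent: only the first caller gets True.
        with self._guard:
            if name in self._locks:
                return False
            self._locks.add(name)
            return True

    def exists(self, name):
        with self._guard:
            return name in self._locks

    def remove(self, name):
        # Called by the migrator once migration has finished.
        with self._guard:
            self._locks.discard(name)


def choose_role(store, name="migration"):
    # The pod that creates the lock is the migrator; every other pod waits.
    return "migrator" if store.try_create(name) else "waiter"
```

A waiter would then poll exists() until the migrator calls remove(), and only continue booting at that point.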

So suppose we set the lock expiry to, say, 3 hours. This would protect the migrator role as needed. The migrator will remove the lock when it completes (if things are functioning properly), and if for some reason the migrator is killed and doesn't remove the lock, it will auto-clear after the 3 hours elapse.
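As a rough illustration of the expiry behaviour, here is a single-process sketch under assumed names (ExpiringLock, LOCK_TTL_SECONDS and the injectable clock are all hypothetical). A real deployment would rely on the expiry feature of whatever backend stores the lock rather than on code like this.

```python
import time

LOCK_TTL_SECONDS = 3 * 60 * 60  # the 3 hour expiry suggested above


class ExpiringLock:
    """Hypothetical lock that clears itself once its TTL has passed."""

    def __init__(self, ttl=LOCK_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so the expiry can be tested
        self._expires_at = None

    def is_held(self):
        # A lock past its expiry time counts as removed.
        return self._expires_at is not None and self._clock() < self._expires_at

    def try_acquire(self):
        # Succeeds only if nobody holds an unexpired lock.
        if self.is_held():
            return False
        self._expires_at = self._clock() + self._ttl
        return True

    def release(self):
        # Normal path: the migrator removes the lock when it completes.
        self._expires_at = None
```

If the migrator is killed before calling release(), is_held() starts returning False once the TTL has passed, so the stale lock no longer blocks anyone.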