MAAS runs a number of database migrations when the maas-region-controller package is upgraded on a system. In HA mode it is currently possible to upgrade one instance of the region while another instance stays at the older version. It is also possible to back up the database, install a newer version of MAAS elsewhere, and restore the old database into it. Either case can cause hard-to-diagnose errors: an older version of MAAS may not be compatible with a database that has newer migrations applied, and a newer version of MAAS expects migrations to have been applied which haven't been.
We need the following checks to occur:
* On region start the region should make sure that only migrations it knows about have been applied. If unknown migrations have been applied, this should be logged and the region should fail to start. It's important that the systemd service fails, to make this very clear.
* If one region is upgraded in HA mode, the other regions should disconnect and log the mismatch error. Again, it's important that the systemd service fails, to make this very clear.
* If the regiond starts and notices some database migrations haven't been applied it should automatically apply them before starting.
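
The three checks above all come down to comparing the migrations the code knows about with those recorded in the database. A minimal Python sketch of that comparison follows. The names `MigrationState` and `classify` are made up for illustration, and how the applied set is read from the database is deliberately left out.

```python
from enum import Enum


class MigrationState(Enum):
    LEVEL = "level"    # database matches what this regiond expects
    BEHIND = "behind"  # some known migrations have not been applied yet
    AHEAD = "ahead"    # database has migrations this regiond does not know


def classify(known, applied):
    """Compare migrations known to the code with those applied to the database.

    Both arguments are iterables of hashable migration identifiers, for
    example (app, name) tuples. Unknown applied migrations take priority:
    a database that is ahead must not be touched by older code, even if
    some known migrations also happen to be missing.
    """
    known, applied = set(known), set(applied)
    if applied - known:
        return MigrationState.AHEAD
    if known - applied:
        return MigrationState.BEHIND
    return MigrationState.LEVEL
```

A regiond that sees `AHEAD` would log and fail to start, one that sees `BEHIND` would apply migrations first, and one that sees `LEVEL` would start normally.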
As a starting point for an implementation, can I suggest the following
approach?
- Agree on a database lock number, say K.
- During start-up of each regiond a single database connection is
  established, in which:
  - A SHARED lock is taken on K.
  - The migration state is checked against what is expected.
  - If there's a mismatch, the lock on K is upgraded to an EXCLUSIVE
    lock.
  - The migration state is checked again, then:
    - If the database is behind on migrations, migrations are applied,
      then the lock on K is downgraded to a SHARED lock (or the
      connection is dropped).
    - If the database is ahead on migrations, regiond logs an error,
      then EXITS.
    - If the database is now level on migrations the lock on K is
      downgraded to a SHARED lock (or the connection is dropped).
- In the main body of the application, as each new connection is opened,
  a SHARED lock is taken on K and it is not released until the
  connection is closed.
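
To make the start-up sequence above concrete, here is a rough Python sketch of the control flow only. Everything in it is an assumption for illustration: `start_up`, the `lock` object with its `acquire_shared`/`upgrade`/`downgrade` methods, and the `check_state`/`apply_migrations` callables are hypothetical names, not existing MAAS code. On PostgreSQL the lock object could plausibly be built on the advisory lock functions (`pg_advisory_lock_shared`, `pg_advisory_lock`, `pg_advisory_unlock`), but that mapping is not shown here.

```python
def start_up(lock, check_state, apply_migrations, log=print):
    """Run the proposed start-up sequence; return True if regiond may start.

    lock             -- hypothetical object wrapping lock K, with methods
                        acquire_shared(), upgrade() and downgrade()
    check_state      -- callable returning "level", "behind" or "ahead"
    apply_migrations -- callable that applies all pending migrations
    """
    lock.acquire_shared()
    if check_state() == "level":
        return True  # nothing to do; the SHARED lock stays held

    # Mismatch: wait for exclusive access before deciding anything.
    lock.upgrade()

    # Check again: another regiond may have migrated while we waited.
    state = check_state()
    if state == "ahead":
        log("database has migrations unknown to this regiond; not starting")
        # The caller is expected to exit; closing the connection
        # releases the lock.
        return False
    if state == "behind":
        apply_migrations()
    lock.downgrade()
    return True
```

The caller would turn a `False` result into a non-zero exit so that the systemd service visibly fails.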
I may have it wrong, but the expected behaviour is:
- Migrations will only ever be applied while holding an EXCLUSIVE lock
  on K.
- Acquisition of a SHARED lock blocks/fails until any EXCLUSIVE lock is
  released, and acquisition of an EXCLUSIVE lock blocks/fails until
  every SHARED lock held by another connection is released. All
  connections hold at least a SHARED lock on K at all times, hence:
  (a) Migrations cannot run while there are other connections using the
      database.
  (b) Other connections cannot use the database while migrations are
      being applied.
  i.e. normal run-time use of the database and the application of
  migrations are mutually exclusive.
- Given a regiond running at migration level M, a newly started regiond
  expecting M+1 will block until the former goes away, apply migrations,
  then complete start-up.
Consider an installation of MAAS with two region hosts, A and B. If
MAAS is updated and restarted on A, the regionds on A will wait until
those on B are stopped (presumably as part of the upgrade, but not
necessarily). One of those regionds on A will then win the exclusive
lock race and apply migrations while the others wait. Once it releases
that exclusive lock, all the regionds will finish starting up.

If B was, say, only rebooted without upgrading MAAS, the regionds
would find the database now to be ahead of their expectations, and
thus exit. Their absence would be noted by service tracking and
administrators would go and investigate.
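
The A/B scenario can be walked through with a toy model of lock K. This is only a single-threaded illustration of the shared/exclusive rules described earlier, not a real database lock: `LockK` and `scenario` are invented names, and where a real lock would block, the toy `try_*` methods simply return `False`.

```python
class LockK:
    """Toy, single-threaded model of a shared/exclusive lock."""

    def __init__(self):
        self.shared = set()    # holders of a SHARED lock
        self.exclusive = None  # holder of the EXCLUSIVE lock, if any

    def try_shared(self, who):
        # Refused while someone else holds the EXCLUSIVE lock.
        if self.exclusive not in (None, who):
            return False
        self.shared.add(who)
        return True

    def try_exclusive(self, who):
        # Refused while anyone else holds any lock on K.
        if self.exclusive not in (None, who) or self.shared - {who}:
            return False
        self.exclusive = who
        return True

    def downgrade(self, who):
        # Give up EXCLUSIVE but keep (or regain) SHARED.
        if self.exclusive == who:
            self.exclusive = None
            self.shared.add(who)

    def release(self, who):
        self.shared.discard(who)
        if self.exclusive == who:
            self.exclusive = None


def scenario():
    k = LockK()
    events = []
    k.try_shared("B")                    # B is running at level M
    k.try_shared("A")                    # upgraded A starts, sees a mismatch
    events.append(k.try_exclusive("A"))  # refused: B is still connected
    k.release("B")                       # B is stopped
    events.append(k.try_exclusive("A"))  # granted: A may apply migrations
    events.append(k.try_shared("B"))     # refused: B restarted mid-migration
    k.downgrade("A")                     # migrations done
    events.append(k.try_shared("B"))     # granted: B connects, finds the
                                         # database ahead, and would exit
    return events
```

Calling `scenario()` yields `[False, True, False, True]`, matching the order of events described above.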
As you can see, migrations would no longer be applied by packaging. This
makes good sense in a distributed system; a Debian package has a view
only of the local system.