MAAS

The region should verify database migrations on start

Bug #1644345 reported by Lee Trager on 2016-11-23

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
MAAS	Invalid	Wishlist	Unassigned

Bug Description

MAAS runs a number of database migrations when the maas-region-controller package is upgraded on a system. In HA mode it is currently possible to upgrade one instance of the region while keeping another instance at the older version. Another possibility is that the database is backed up, a newer version of MAAS is installed fresh, and the old database is restored. Either case can cause hard-to-diagnose errors: the older version may not be compatible with the newer database migrations, or the newer version of MAAS may expect migrations that have not been applied.

We need the following checks to occur:
* On region start, the region should make sure that only migrations it knows about have been applied. If unknown migrations have been applied, this should be logged and the region should fail to start. It's important that the systemd service fails, to make this very clear.
* If one region is upgraded in HA mode, the other regions should disconnect and log the mismatch error. Again, it's important that the systemd service fails, to make this very clear.
* If regiond starts and notices that some database migrations haven't been applied, it should automatically apply them before starting.
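
The first and third checks both come down to comparing the migrations this region's code knows about with the migrations recorded as applied in the database. A minimal sketch of that comparison in Python follows. The names `MigrationState` and `classify_migrations`, and the use of plain strings as migration identifiers, are hypothetical and not taken from MAAS.

```python
from enum import Enum


class MigrationState(Enum):
    LEVEL = "level"    # database matches what this region expects
    BEHIND = "behind"  # known migrations are missing from the database
    AHEAD = "ahead"    # database has migrations this region does not know


def classify_migrations(known, applied):
    """Compare migrations known to this region with those applied.

    Unknown applied migrations take priority: a region must not try
    to run against a schema it does not understand.
    """
    known, applied = set(known), set(applied)
    if applied - known:
        return MigrationState.AHEAD
    if known - applied:
        return MigrationState.BEHIND
    return MigrationState.LEVEL
```

Under this sketch, AHEAD would map to "log and fail to start", BEHIND to "apply migrations before starting", and LEVEL to a normal start.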

Tags:

Andres Rodriguez (andreserl) on 2016-11-23

Changed in maas:
importance:	Critical → Wishlist
milestone:	none → 2.2.0

Gavin Panella (allenap) wrote on 2016-11-25:

As a starting point for an implementation, can I suggest the following
approach?

- Agree on a database lock number, say K.

- During start-up of each regiond a single database connection is
  established, in which:

  - A SHARED lock is taken on K.

  - The migration state is checked against what is expected.

  - If there's a mismatch, the lock on K is upgraded to an EXCLUSIVE
    lock.

  - The migration state is checked again, then:

    - If the database is behind on migrations, migrations are applied,
      then the lock on K is downgraded to a SHARED lock (or the
      connection is dropped).

    - If the database is ahead on migrations, regiond logs an error,
      then EXITS.

    - If the database is now level on migrations the lock on K is
      downgraded to a SHARED lock (or the connection is dropped).

- In the main body of the application, as each new connection is opened,
  a SHARED lock is taken on K and it is not released until the
  connection is closed.
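
The start-up steps above could be sketched as follows. This assumes that "a database lock number" means a PostgreSQL session-level advisory lock, which the comment does not state explicitly. `LOCK_K`, `start_region`, `check_state` and `apply_migrations` are hypothetical names; `cursor` is any DB-API cursor using `%s` placeholders, and `check_state()` is assumed to return "level", "behind" or "ahead".

```python
LOCK_K = 1644345  # hypothetical lock number; any agreed integer would do


def start_region(cursor, check_state, apply_migrations):
    """Run the start-up check while holding lock K (a sketch only)."""
    # Every regiond takes the SHARED lock first.
    cursor.execute("SELECT pg_advisory_lock_shared(%s)", (LOCK_K,))
    if check_state() == "level":
        return "started"

    # Mismatch: also take the EXCLUSIVE lock. This waits until no
    # other session holds a SHARED lock on K.
    cursor.execute("SELECT pg_advisory_lock(%s)", (LOCK_K,))
    try:
        # Re-check: another regiond may have migrated while we waited.
        state = check_state()
        if state == "ahead":
            raise SystemExit("database has migrations unknown to this regiond")
        if state == "behind":
            apply_migrations()
    finally:
        # "Downgrade": drop the EXCLUSIVE lock, keep the SHARED one.
        cursor.execute("SELECT pg_advisory_unlock(%s)", (LOCK_K,))
    return "started"
```

One caveat with this sketch: if two regionds both hold the SHARED lock and both request the EXCLUSIVE lock, each waits on the other. PostgreSQL's deadlock detection would normally abort one of them, and a real implementation would need to handle that, for example by retrying.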

I may have it wrong, but the expected behaviour is:

- Migrations will only ever be applied while holding an EXCLUSIVE lock
  on K.

- Acquisition of a SHARED lock blocks/fails until an EXCLUSIVE lock is
  released. All connections hold at least a SHARED lock on K at all
  times, hence:

  (a) Migrations cannot run while there are other connections using the
      database.

  (b) Other connections cannot use the database while migrations are
      being applied.

  i.e. normal run-time use of the database and the application of
  migrations are mutually exclusive.

- Given a regiond running at migration level M, a newly started regiond
  expecting M+1 will block until the former goes away, apply migrations,
  then complete start-up.

  Consider an installation of MAAS with two region hosts, A and B. If
  MAAS is updated and restarted on A, the regionds on A will wait until
  those on B are stopped (presumably as part of the upgrade, but not
  necessarily). One of those regionds on A will then win the exclusive
  lock race and apply migrations while the others wait. Once it releases
  that exclusive lock all the regionds will finish starting up.

  If B was, say, only rebooted without upgrading MAAS, the regionds
  would find the database now to be ahead of their expectations, and
  thus exit. Their absence would be noted by service tracking and
  administrators would go and investigate.
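
The A and B scenario can be walked through with a toy in-memory model of lock K. This is only an illustration of the shared/exclusive rules described above, not database code. The class `LockK` and its methods are hypothetical, and the `try_*` methods return False where a real lock would block.

```python
class LockK:
    """Toy model of K: many SHARED holders, or one EXCLUSIVE holder."""

    def __init__(self):
        self.shared = set()
        self.exclusive = None

    def try_shared(self, who):
        # SHARED is refused while someone else holds EXCLUSIVE.
        if self.exclusive not in (None, who):
            return False
        self.shared.add(who)
        return True

    def try_exclusive(self, who):
        # EXCLUSIVE is refused while anyone else holds any lock on K.
        if self.exclusive not in (None, who) or self.shared - {who}:
            return False
        self.exclusive = who
        return True

    def release_exclusive(self, who):
        if self.exclusive == who:
            self.exclusive = None

    def release_all(self, who):
        # Models a regiond stopping and its connections closing.
        self.shared.discard(who)
        self.release_exclusive(who)


lock = LockK()
lock.try_shared("B")                    # B is running at level M
lock.try_shared("A")                    # upgraded A starts, expects M+1
blocked = not lock.try_exclusive("A")   # A must wait while B is connected
lock.release_all("B")                   # B is stopped
migrating = lock.try_exclusive("A")     # A can now apply migrations
b_locked_out = not lock.try_shared("B") # B cannot connect mid-migration
lock.release_exclusive("A")             # A downgrades back to SHARED
b_reconnects = lock.try_shared("B")     # B connects, then finds DB ahead
```

In the last step B obtains its SHARED lock, but its own migration check would then find the database ahead of its expectations and it would exit, as described above.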

As you can see, migrations would no longer be applied by packaging. This
makes good sense in a distributed system; a Debian package has a view
only of the local system.

Andres Rodriguez (andreserl) on 2017-06-20

Changed in maas:
milestone:	2.2.0 → 2.2.x

Andres Rodriguez (andreserl) on 2017-11-02

tags:	added: performance
tags:	added: ha
Changed in maas:
milestone:	2.2.x → next

Adam Collard (adam-collard) wrote on 2019-09-19:

This bug has not seen any activity in the last 6 months, so it is being automatically closed.

If you are still experiencing this issue, please feel free to re-open.

MAAS Team

Changed in maas:
status:	Triaged → Invalid

Björn Tillenius (bjornt) on 2021-08-24

Changed in maas:
milestone:	next → none
