MAAS

Comment 10 for bug 1817484

Andres Rodriguez (andreserl) wrote on 2019-02-25:

Hi guys.

Please correct me if I'm wrong as I don't know all the details, but as I understand the configuration of HA is as follows:

1. Pacemaker handles the fail-over of PostgreSQL as a resource, and also handles updating the DNS entry in MAAS as another resource.
2. Once pacemaker has confirmed the failover of the DB, pacemaker immediately runs the DNS updating resource.
3. Pacemaker never ensures that MAAS has fully recovered from the fail-over before attempting to run a DNS update.

From that perspective, it seems to me that since pacemaker is handling the fail-over of the database, and MAAS may not yet have fully recovered from that fail-over, pacemaker shouldn't attempt to run the DNS update until it can ensure that MAAS is up, running and fully recovered.

I would recommend that either your DNS update resource/script ensures that MAAS is fully connected before attempting to update the DNS record *or* you add a new pacemaker resource that ensures MAAS has fully reconnected in the event of a fail-over.
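As a rough illustration of the first option, here is a minimal sketch of a wrapper that polls a readiness check before running the DNS update. Everything in it is an assumption for illustration only: `maas_is_ready` and `update_dns` are hypothetical callables you would have to supply (for example, a check that regiond answers a request that needs the database, and your existing DNS update logic), and the timeout/interval values are arbitrary. It is not how MAAS or your resource agent actually works.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # A probe that raises (e.g. connection refused) counts as "not ready".
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


def update_dns_after_failover(
    maas_is_ready: Callable[[], bool],  # hypothetical readiness check
    update_dns: Callable[[], None],     # hypothetical DNS update step
    **wait_kwargs,
) -> bool:
    """Run update_dns() only once maas_is_ready() reports True."""
    if not wait_until_ready(maas_is_ready, **wait_kwargs):
        # MAAS never became ready; let the caller report the failure
        # to pacemaker instead of updating DNS against a broken MAAS.
        return False
    update_dns()
    return True
```

The point of returning False on timeout is that the resource can then fail cleanly and be retried by pacemaker, rather than issuing the DNS update while MAAS is still reconnecting to the database.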

On the other hand, what seems to be an issue on the MAAS side is that it is accepting API requests while there are DB connection issues. MAAS shouldn't be allowing API requests until it can ensure it is fully connected to the database. However, this is separate from the fact that pacemaker shouldn't really assume it can execute a resource against a service (MAAS) that is not tracked by pacemaker but depends on one that is (PostgreSQL).

Again, we shouldn't be making the assumption that because pacemaker has recovered the database, we can execute actions in MAAS when we cannot even ensure that MAAS itself has recovered from the fail-over (especially when regiond is not even managed by pacemaker, so pacemaker cannot ensure it has fully recovered).