vault service not restarted to use new db after db migration

Bug #2047764 reported by Shunde Zhang
This bug affects 1 person
Affects                Status    Importance  Assigned to  Milestone
OpenStack Charm Guide  New       Undecided   Unassigned
vault-charm            Triaged   Medium      Unassigned

Bug Description

The issue arose after migrating vault's db from percona to innodb.

In old deployments percona is used as the storage backend of vault, but new deployments use innodb, so old percona databases need to be migrated to innodb as per [1].
The symptom of this issue is that, after migration, the vault service is still connected to the old db (percona) even though the config file has been changed to use the new db (innodb), because the vault service is not restarted after the config file change.

In some environments the old db is not stopped and keeps running (in case a reversion is needed), so because of this issue vault keeps running happily against the old db and the user is not aware of it.

To reproduce this issue, first deploy one vault unit and one percona unit as per [2].
Then unseal vault.
Next, deploy the new innodb cluster (3 units) and perform the db migration as per [1].
The resulting environment may look like [4]; the vault 1.7/stable charm is used.
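
For concreteness, a rough sketch of these steps (the channel, application names, and relation endpoints here are assumptions based on [1] and [2], not quoted from them):

juju deploy --channel 1.7/stable vault
juju deploy percona-cluster
juju add-relation vault:shared-db percona-cluster:shared-db
# unseal vault per [2], then bring up the new backend:
juju deploy -n 3 mysql-innodb-cluster
juju deploy mysql-router vault-mysql-router
juju add-relation vault-mysql-router:db-router mysql-innodb-cluster:db-router
# ...and follow the migration procedure in [1]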

During the migration process it seems vault is never restarted.
After disconnecting vault from the old db with "juju remove-relation vault:shared-db percona-cluster:shared-db", I can see that the vault service is still running and connected to the old db, although juju status shows vault in a blocked state with the message "'shared-db' or 'db' missing". At this point the vault config is unchanged and still points at the old db.
Then, after relating vault to the new db with "juju add-relation vault:shared-db vault-mysql-router:shared-db", I can see that the vault config file is changed to use the new db. But the vault service is again not restarted to pick up the new config.
netstat output shows vault still connecting to the old db.
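
To confirm which database vault is actually using, something like the following can be run on the vault unit (the config path assumes the snap-based install the charm performs, so treat it as an assumption):

grep -A3 'storage "mysql"' /var/snap/vault/common/vault.hcl   # db in the config file
sudo netstat -tnp | grep vault                                # db vault is connected to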

In the juju log it seems start_vault is triggered, but somehow the vault service is still not restarted [3].

The fix is easy: just restart vault after the db migration and unseal it.
But this behaviour doesn't look right and seems like a bug in the charm.
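
A sketch of that manual fix, assuming the systemd service is named 'vault' (as the charm code quoted later suggests) and the API listens on the default local address:

juju ssh vault/0
sudo systemctl restart vault
export VAULT_ADDR='http://127.0.0.1:8200'
vault operator unseal   # repeat until enough key shares have been entered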

[1] https://docs.openstack.org/charm-guide/latest/project/procedures/percona-series-upgrade-to-focal.html
[2] https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/ussuri/app-vault.html
[3] https://pastebin.ubuntu.com/p/6cvDtfFrK6/
[4] https://paste.ubuntu.com/p/G73V74pcTK/

Tags: sts
Alex Kavanagh (ajkavanagh) wrote :

start_vault(...) looks like this (imports added below for context; the exact import lines are an assumption, not quoted from the charm):

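from charms.reactive import when, when_not, set_flag, clear_flag
from charmhelpers.core.hookenv import config
from charmhelpers.core.host import service_running
import tenacity
# 'vault' is the charm's own helper library (import path elided here)
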
@when_not("is-update-status-hook")
@when('configured')
@when_not('started')
def start_vault():
    # start or restart vault
    vault.opportunistic_restart()

    @tenacity.retry(wait=tenacity.wait_exponential(multiplier=1, max=10),
                    stop=tenacity.stop_after_attempt(10),
                    retry=tenacity.retry_if_result(lambda b: not b))
    def _check_vault_running():
        return service_running('vault')

    if _check_vault_running():
        set_flag('started')
        clear_flag('failed.to.start')
        if config('totally-unsecure-auto-unlock'):
            vault.prepare_vault()
    else:
        set_flag('failed.to.start')

It's probable that opportunistic_restart() is trying very hard not to restart the vault unit, as that would seal it. This is because opportunistic_restart() -> can_restart() won't return True if the vault unit is unsealed, which it probably is if the vault service is running.

A solution is to pause the unit (using the pause action), then un-pause it (using the resume action) and then unseal the unit. This will cause the unit to use the newly configured db.
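
As a sketch (the unit name and Juju 2.x action syntax are assumptions):

juju run-action vault/0 pause --wait
juju run-action vault/0 resume --wait
# vault restarts against the new db config but comes up sealed, so:
juju ssh vault/0
export VAULT_ADDR='http://127.0.0.1:8200'
vault operator unseal   # repeat with enough key shares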

As to solving this in the charm, it should probably stop the service when the shared-db relation is broken; but obviously, removing shared-db causes the charm to fall back to raft as the consensus system. Hmm

Alex Kavanagh (ajkavanagh) wrote :

Triaging the charm bug as medium; what we really need to do is update the charm guide/docs for vault to highlight the procedure for changing the database.

Changed in vault-charm:
importance: Undecided → Medium
status: New → Triaged
Shunde Zhang (shunde-zhang) wrote :

Hi Alex,

Thank you for looking into this.

After running "juju remove-relation vault:shared-db percona-cluster:shared-db", I can see vault in blocked state with message "'shared-db' or 'db' missing" in Juju status. But in the system, vault service is still running, and vault config file is not modified to remove db config. This looks like inconsistency between Juju and the system, and creates some confusion to users. Maybe vault service should be stopped and db config should be removed from config file if db relation is removed? Also if shared-db relation is missing, vault service won't be started by Juju, so it won't fall back to raft.

Thanks!

Alex Kavanagh (ajkavanagh) wrote :

The raft backend is only supported on the 1.8 channel or newer, where it is the default. From the README for the charm:

> If no databases are related, vault will be auto configured to use
> its embedded raft storage backend for storage and HA.
> Note that raft storage is only supported in Vault 1.8/stable or newer
> (see `channel` in charm config).

Thus, with no psql or mysql backend related, a 1.8 or newer charm would default to raft.

> Maybe vault service should be stopped and db config should be removed from config file if db relation is removed?

Yes, this is what I said in #1. My concern is that a newer charm may revert to raft; however, if the service is stopped across all units, then re-adding a different shared-db would not be a problem.
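
Something like the following, assuming Juju 2.x "juju run" syntax and a systemd service named 'vault':

juju run --application vault 'systemctl stop vault'
juju remove-relation vault:shared-db percona-cluster:shared-db
juju add-relation vault:shared-db vault-mysql-router:shared-db
# once the units come back up against the new backend, unseal each one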
