"chain" and "ca" values sometimes not shared due to incorrect return value of is_unit_paused_set()

Bug #1962777 reported by Paul Goins
This bug affects 2 people
Affects: vault-charm
Status: Confirmed
Importance: Undecided
Assigned to: dongdong tao
Milestone: (none)

Bug Description

This may be a shared issue between both vault and charmhelpers.

I have a 3 unit vault cluster in use by an OpenStack cloud. Certificates have recently been updated, but on at least one of the relationships we're seeing that "chain" and "ca" values aren't being provided, and thus the client (octavia) isn't able to set up its certs appropriately.

I traced this to reactive/vault_handlers.py, publish_ca_info(). I then cross-checked logs, and very clearly observed this log message: "The Vault unit is paused, passing on publishing ca info."

That branch is directly dependent on the return value of the is_unit_paused_set() function from charmhelpers.contrib.openstack.utils.
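
To make the dependency concrete, here is a minimal sketch of the guard pattern described above. It is not the charm's actual code: `publish_ca_info_sketch`, `is_paused`, `publish`, and `log` are hypothetical stand-ins, with `is_paused` playing the role of is_unit_paused_set(). Only the log message is taken from the observed logs.

```python
def publish_ca_info_sketch(is_paused, publish, log=print):
    """Hypothetical stand-in for the guard in publish_ca_info().

    is_paused: callable standing in for is_unit_paused_set()
    publish:   callable standing in for whatever shares "chain" and "ca"
    Returns True if publish() was called, False if it was skipped.
    """
    if is_paused():
        # Same message as seen in the unit logs.
        log("The Vault unit is paused, passing on publishing ca info.")
        return False
    publish()
    return True
```

With this shape, a stale paused flag is enough to suppress publishing even when the unit reports as active.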

"juju status" shows all 3 vault units as active and running:

$ juju status vault | grep ^vault/
vault/0 active idle 15 <REDACTED> 8200/tcp Unit is ready (active: true, mlock: enabled)
vault/1 active idle 16 <REDACTED> 8200/tcp Unit is ready (active: false, mlock: enabled)
vault/2* active idle 17 <REDACTED> 8200/tcp Unit is ready (active: false, mlock: enabled)

However, vault/1 and vault/2 both have the unit-paused flag set in the state DB. Example:

ubuntu@vault-2:~$ sudo su -
root@vault-2:~# cd /var/lib/juju/agents/unit-vault-*/charm/
root@vault-2:/var/lib/juju/agents/unit-vault-1/charm# python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pprint
>>> import sqlite3
>>> conn = sqlite3.connect('.unit-state.db')
>>> pprint.pprint(conn.execute("""select key, data from kv where key like "%unit-paused%" order by key asc""").fetchall())
[('unit-paused', 'true')]
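
The same check can be wrapped in a small helper for use across several units. This is a sketch only: `read_unit_paused` is a hypothetical name, and it assumes the `data` column holds a JSON-encoded value, which the 'true' string above suggests.

```python
import json
import sqlite3


def read_unit_paused(db_path):
    """Return the decoded 'unit-paused' value from the kv table.

    Returns None if there is no such row. Assumes (not confirmed here)
    that the data column is JSON-encoded.
    """
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT data FROM kv WHERE key = ?", ("unit-paused",)
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return None
    return json.loads(row[0])
```

Run from the charm directory, `read_unit_paused('.unit-state.db')` would be expected to return True on an affected unit.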

My instinct here was that pausing and resuming the vault units might resolve this. However, in this particular situation it does not: the pause action makes an embedded call to charms.reactive.main() immediately before it completes, and because the pause logic has already run by then, we end up hitting the publish_ca_info reactive handler. Pastebin of the traceback: https://pastebin.ubuntu.com/p/KgMwMZrDXm/

Paul Goins (vultaire) wrote :

Based on a review of the charmhelpers pause_unit implementation, I suspect that if services are indeed running normally, we can work around this manually by clearing the incorrect unit-paused record with the SQL command "DELETE FROM kv WHERE key='unit-paused'".

Paul Goins (vultaire) wrote :

Confirmed that I could work around this via:

* Stopping the jujud worker for vault
* Performing SQL edit against $CHARM_DIR/.unit-state.db: update kv set value='false' where key='unit-paused'
* Resuming the jujud worker for vault

Jose Guedez (jfguedez) wrote :

Hit this issue today, with a relation to the `ceph-dashboard` charm. Took a while to figure out the problem, with so many layers of abstraction involved between the two charms.

In the end the db surgery from Paul's comment helped unblock the charm. There's a small typo in the SQL query; it should be:

sqlite> update kv set data='false' where key='unit-paused';

This can be validated with the following:

sqlite> select * from kv where key = 'unit-paused';
unit-paused|false
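
For anyone scripting this across units, the corrected update could be sketched in Python as follows. `clear_unit_paused` is a hypothetical helper name; it uses the `data` column as in the corrected query, and it does not stop or resume the jujud worker, so those steps from Paul's comment still need to be done around it.

```python
import sqlite3


def clear_unit_paused(db_path):
    """Set the 'unit-paused' record to 'false'; return rows changed.

    Sketch only: run it while the unit's jujud worker is stopped,
    as in the manual workaround.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE kv SET data = 'false' WHERE key = 'unit-paused'"
            )
        return cur.rowcount
    finally:
        conn.close()
```

A return value of 0 means no unit-paused row existed, so nothing was changed.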

Changed in vault-charm:
status: New → Confirmed
dongdong tao (taodd)
Changed in vault-charm:
assignee: nobody → dongdong tao (taodd)