on leader-change, charms that require secrets-storage use token from old leader

Bug #1849323 reported by Rodrigo Barbieri
This bug affects 5 people
Affects                          Status        Importance  Assigned to         Milestone
Ceph OSD Charm                   Fix Released  High        Edward Hope-Morley
Charm Helpers                    Fix Released  High        Edward Hope-Morley
OpenStack Barbican-Vault Charm   Fix Released  High        Unassigned
OpenStack Nova Compute Charm     Fix Released  High        Edward Hope-Morley
OpenStack Swift Storage Charm    Fix Released  High        Edward Hope-Morley

Bug Description

In a deployment using vault, every time the refresh-secrets action is issued, the tokens are refreshed and the leader sends the new tokens through relation data.

If vault is deployed in HA, then upon switching vault leaders (say the new leader is vault/2 and the old leader is vault/0), the old token remains in the relation data between the units that require secrets-storage (barbican-vault, ceph-osd, ...) and the old leader (vault/0). The new leader (vault/2) issues new tokens on the refresh-secrets action and provides them through its relation (vault/2 <=> barbican-vault, ceph-osd), but the requiring units read the old tokens from the relation data of the old leader (vault/0 <=> barbican-vault, ceph-osd). This causes the exception linked below.

The tokens should be read from the new leader (vault/2) instead. As a workaround, if the leader is switched back to vault/0, the problem goes away until the vault leader is changed again.

Steps to reproduce:

1) Force vault leadership to the lowest-numbered unit (vault/0)
2) Issue new tokens
3) Units will grab tokens from vault/0 and everything will work fine
4) Force vault leadership to change to a higher-numbered unit (vault/2)
5) Issue new tokens
6) Units will still grab tokens from the lowest-numbered unit (vault/0) and will fail to authenticate

This happens because, in both reactive and classic charms, the related unit loops through the vault units in ascending order and grabs the token from the first unit that has one (a sketch of this pattern follows the links below).

reactive charms: https://github.com/juju-solutions/charms.reactive/blob/1ff9c476693d834bc0fcc284ff1a89302d084c91/charms/reactive/endpoints.py#L748

classic charms: https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/vaultlocker.py#L44
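
For illustration, a minimal sketch of that first-unit-wins pattern, written against the classic (charm-helpers) hook environment; the function and relation names are assumptions and this is not the actual vaultlocker.py code:

    import json

    from charmhelpers.core import hookenv


    def first_vault_credentials(relation_name='secrets-storage'):
        """Return (vault_url, token) from the first vault unit that has one.

        Per the description above, related units are walked in ascending
        order, so a stale token left behind by a lower-numbered ex-leader
        shadows the new leader's token.
        """
        local = hookenv.local_unit()
        for rid in hookenv.relation_ids(relation_name):
            for unit in hookenv.related_units(rid):
                data = hookenv.relation_get(unit=unit, rid=rid)
                vault_url = data.get('vault_url')
                token = data.get('{}_token'.format(local))
                if vault_url and token:
                    # First match wins, regardless of which vault unit is
                    # currently the leader.
                    return json.loads(vault_url), json.loads(token)
        return None, None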

Sample stacktrace: https://pastebin.ubuntu.com/p/65RHKKWj6q/

Tags: sts
Changed in charm-barbican-vault:
assignee: nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
status: New → In Progress
Changed in charm-helpers:
assignee: nobody → Tiago Pasqualini da Silva (tiago.pasqualini)
tags: added: sts
description: updated
Revision history for this message
Tiago Pasqualini da Silva (tiago.pasqualini) wrote :

This issue also happens on classic charms. Whenever the vault leader changes to a higher-numbered unit, the lower-numbered one still has higher priority, so the charm gets the token from the wrong unit.

In summary, the first unit to have a token is chosen, but whenever we run the 'refresh-secrets' action we need to get the information from the vault leader.

https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/vaultlocker.py#L44

Revision history for this message
Tiago Pasqualini da Silva (tiago.pasqualini) wrote :
description: updated
Changed in charm-barbican-vault:
assignee: Rodrigo Barbieri (rodrigo-barbieri2010) → Tiago Pasqualini da Silva (tiago.pasqualini)
Changed in charm-helpers:
status: New → In Progress
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I have done a test to double-check how the vault charm (and interface-vault-kv) currently behaves when the vault leader switches. Output is here: https://paste.ubuntu.com/p/gRmGCGvHDt/

What I see is: 3 units, 1 leader (vault/2), and only the leader presenting the tokens.

Then I power off vault/2; the leader switches to vault/1 and I see: 3 units, 1 leader (now vault/1), and vault/1 presenting the same settings that were previously set by vault/2.

Then I power on vault/2 again and what I see is that vault/0 still presents nothing, vault/1 and vault/2 present identical settings, and there are no errors.

This is true for both ceph-osd and barbican-vault, i.e. a classic and a reactive charm, or put another way, one using [1] and one using [2].

[1] https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/vaultlocker.py
[2] https://github.com/openstack-charmers/charm-interface-vault-kv/

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I also note that the barbican-vault charm used the same token twice successfully:

2019-11-29 16:30:38 INFO juju-log secrets-storage:53: Retrieving secret-id from vault (http://10.5.100.15:8200)
2019-11-29 16:57:05 INFO juju-log secrets-storage:53: Retrieving secret-id from vault (http://10.5.100.15:8200)

So this shows that the leader will always present the current tokens, but non-leader units can still have old values; e.g. if I then ran refresh-secrets, vault/1 would present the new values and vault/2 would remain with the old. The solution currently proposed in [1] is to clear the settings from a unit if leader-changed fires and the unit is not the leader. That is fine, but it still means eventual consistency, which is a problem for charms that don't implement the use-once policy when there is a mix of old and new tokens.

[1] https://github.com/openstack-charmers/charm-interface-vault-kv/pull/8

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi Ed,

I think you got something backwards, if I understood your comment #3 correctly. When you said "vault/1 is presenting the same settings that were previously set by vault/2", I can see in your pastebin that this is not the case.

Line 28, before power off and vault/2 is leader: ceph-osd/0_token: '"s.9aSiyapUKfFHbpBqzaNGfYd0"'
Line 50, after power off and vault/1 is leader: ceph-osd/0_token: '"s.dWjKyjxR3JXzr67qpKCC6EOm"'

I'm not sure if you have re-run the command in line 38 (the sqlite command), but you should have seen a different token there upon re-running that command.

vault/0 didn't present any tokens because the vault charm does not currently propagate the tokens to the other vault units. Also, the leadership change from vault/2 to vault/1 does not hit the bug, since vault/1 < vault/2, so the charms will pick up the change. There are no timestamps in your pastebin, but in comment #4, when you said the "barbican-vault charm used the same token twice successfully", that makes sense, as the tokens were refreshed on leader change and read successfully by the barbican-vault units as part of the vault/2 => vault/1 transition.

If the change were the opposite, from vault/1 to vault/2, the charms would still try to use the token provided by vault/1, despite vault/2 being the leader and providing new tokens.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

The problem I see with the "last_token" logic in charm-helpers is that, given that the loop always starts at vault/0, a leadership transition vault/0 => vault/1 => vault/2 will cause the code to always raise an exception, as the comparison will always evaluate to true:

token (vault/0's) != last_token (vault/1's)
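
A hedged sketch of that failure path, assuming a simplified version of the consumer-side flow; this is not the real vaultlocker.py code, and retrieve_secret_id stands in for the hvac unwrap call that fails on a stale or already-consumed token:

    from charmhelpers.core import unitdata


    def refresh_secret_id(vault_url, token, retrieve_secret_id):
        """Unwrap a new secret-id whenever the relation presents a 'new' token.

        After a vault/0 => vault/1 => vault/2 leadership chain, the token read
        from relation data is still vault/0's stale one while 'last-token'
        holds vault/1's, so the inequality below is always true, the unwrap is
        always attempted with an invalid token, and the hook errors out on
        every run.
        """
        kv = unitdata.kv()
        if token != kv.get('last-token'):
            kv.set('last-token', token)
            # Raises (e.g. hvac.exceptions.InvalidRequest) when the token was
            # already consumed or was issued by a previous leader.
            secret_id = retrieve_secret_id(url=vault_url, token=token)
            kv.set('secret-id', secret_id)
            kv.flush()
        return kv.get('secret-id')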

So the solutions are:

1) Clear the data from old leaders (currently proposed): this works, but it has timing issues and will then need 'juju resolved --no-retry' to fix.

2) Have the leader propagate the new tokens to the other vault units before setting relation data. This needs to be implemented carefully so as not to cause a timing issue: even if the tokens are propagated to the other vault units, if vault/2 is the first to set the relation data while vault/0 still has stale data, it will cause errors.

The "last_token" logic can also be implemented in the interface to help mitigate the timing issue.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

OK, I just re-tested the "clear non-leader data" solution and got mixed results. I triggered leader changes by stopping the jujud vault service and pausing hacluster on each vault unit.

1) For some reason, not every leader change triggered a change of tokens. So in some instances the leader did not broadcast anything (or it did, and the clear method overwrote it very quickly). This is still unclear to me.

2) My nova-compute unit got into "waiting" status while changing leaders: as per #1, the old leader cleared its relation data and the new leader did not broadcast new tokens.

I avoided running the refresh-secrets action, as this is something operators shouldn't have to run manually to fix things.

It is clear that just "clearing non-leader data" introduces a few issues, as there are other gaps throughout the vault charm code that are not handled properly under those circumstances.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

One additional detail with the "clear non-leader data" approach is that the original error persists if the lowest-numbered unit never comes back to clear its stale relation data. The same goes for the "propagation" approach; however, there the receiver needs the "last_token" logic to mitigate this issue, and the corner case I mentioned in comment #6 becomes unlikely (though still possible) because vault/0 should have the same token as vault/1.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi Rodrigo, well spotted; the tokens were indeed changing on leader switch, hence why it was working - I just misread the values. That at least makes sense now.

Regarding your testing, the issue here is that the consuming charms are not currently tolerant of eventual consistency, irrespective of whether they use the interface or the charm-helpers code. That, and, as you point out, the last_token logic is not sufficient. So I think the fix needs to go not on the provider (vault) side but on the consumer/requires side, making it tolerant of invalid tokens: it should record which tokens it has tried regardless of success and then move on to try any others that are presented until one works. That takes advantage of the fact that vault currently guarantees that at least one token will be valid and, crucially, it means we don't need to introduce logic on the provider side that would result in extra hook noise (e.g. settings propagation or relation clearing) and would still be eventually consistent, so it would not actually resolve the issue entirely. I am therefore going to add all charms that will need updating to this bug.
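
A minimal sketch of that consumer-side approach, with assumed names (not the final charm-helpers/interface code):

    def secret_id_from_any_token(presented, already_tried, retrieve_secret_id):
        """Try every (vault_url, token) pair presented on the relation.

        presented: list of (vault_url, token) tuples, one per vault unit.
        already_tried: set of tokens previously attempted, persisted in the
        unit's local kv so a consumed one-shot token is never retried.
        """
        for vault_url, token in presented:
            if token in already_tried:
                continue
            already_tried.add(token)
            try:
                return retrieve_secret_id(url=vault_url, token=token)
            except Exception:
                # Stale token from an old leader; vault guarantees at least
                # one presented token is valid, so move on to the next one.
                continue
        return None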

affects: charm-swift-proxy → charm-swift-storage
Changed in charm-ceph-osd:
milestone: none → 20.01
Changed in charm-nova-compute:
milestone: none → 20.01
Changed in charm-swift-storage:
milestone: none → 20.01
Changed in charm-ceph-osd:
assignee: nobody → Edward Hope-Morley (hopem)
Changed in charm-swift-storage:
assignee: nobody → Edward Hope-Morley (hopem)
Changed in charm-nova-compute:
assignee: nobody → Edward Hope-Morley (hopem)
Changed in charm-barbican-vault:
assignee: Tiago Pasqualini da Silva (tiago.pasqualini) → Edward Hope-Morley (hopem)
status: In Progress → New
Revision history for this message
Edward Hope-Morley (hopem) wrote :
Changed in charm-helpers:
assignee: Tiago Pasqualini da Silva (tiago.pasqualini) → Edward Hope-Morley (hopem)
Revision history for this message
Edward Hope-Morley (hopem) wrote :
Changed in charm-helpers:
importance: Undecided → High
Changed in charm-barbican-vault:
importance: Undecided → High
Changed in charm-ceph-osd:
importance: Undecided → High
Changed in charm-nova-compute:
importance: Undecided → High
Changed in charm-swift-storage:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-barbican-vault (master)

Fix proposed to branch: master
Review: https://review.opendev.org/697691

Changed in charm-barbican-vault:
status: New → In Progress
Changed in charm-helpers:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)

Fix proposed to branch: master
Review: https://review.opendev.org/698240

Changed in charm-ceph-osd:
status: New → In Progress
Changed in charm-nova-compute:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (master)

Fix proposed to branch: master
Review: https://review.opendev.org/698241

Changed in charm-swift-storage:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-swift-storage (master)

Fix proposed to branch: master
Review: https://review.opendev.org/698242

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-swift-storage (master)

Reviewed: https://review.opendev.org/698242
Committed: https://git.openstack.org/cgit/openstack/charm-swift-storage/commit/?id=60dd2f0189c9eeb5b35fa2663cd17bc64d9f68d3
Submitter: Zuul
Branch: master

commit 60dd2f0189c9eeb5b35fa2663cd17bc64d9f68d3
Author: Edward Hope-Morley <email address hidden>
Date: Tue Dec 10 13:39:40 2019 +0000

    Charmhelpers sync to get vaultlocker fixes

    Also gate checking vault context completing on whether
    dependencies are installed.

    Change-Id: Ib424abe608081da21207db262fb82362f23fe6ca
    Closes-Bug: #1849323

Changed in charm-swift-storage:
status: In Progress → Fix Committed
Changed in charm-ceph-osd:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/698240
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=d6dc3c794be2a64f97278d6f9b8f7f6d54a0ada9
Submitter: Zuul
Branch: master

commit d6dc3c794be2a64f97278d6f9b8f7f6d54a0ada9
Author: Edward Hope-Morley <email address hidden>
Date: Tue Dec 10 13:37:17 2019 +0000

    Charmhelpers sync to get vaultlocker fixes

    Also gate checking vault context completing on whether
    dependencies are installed.

    Change-Id: I6c89944960f592300921fbd455c6d1d8c4b9b2a2
    Closes-Bug: #1849323

Changed in charm-nova-compute:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (master)

Reviewed: https://review.opendev.org/698241
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=3e67cc5387bf0d30f264178c6606830fff13cfc6
Submitter: Zuul
Branch: master

commit 3e67cc5387bf0d30f264178c6606830fff13cfc6
Author: Edward Hope-Morley <email address hidden>
Date: Tue Dec 10 13:38:51 2019 +0000

    Charmhelpers sync to get vaultlocker fixes

    Also gates checking vaultlocker status until it is installed

    Change-Id: I07f92132b0340b538ee472887c7fd0e0cc911453
    Closes-Bug: #1849323

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-barbican-vault (master)

Reviewed: https://review.opendev.org/697691
Committed: https://git.openstack.org/cgit/openstack/charm-barbican-vault/commit/?id=4a1517d2fbde1b2531388d9abf5ad7af1255ae69
Submitter: Zuul
Branch: master

commit 4a1517d2fbde1b2531388d9abf5ad7af1255ae69
Author: Edward Hope-Morley <email address hidden>
Date: Fri Dec 6 16:29:22 2019 +0000

    Support trying all available tokens

    Since the relation with vault maybe contain more than one
    token and we have no way to know if they are valid, we
    support trying them all until we get a good one and raise
    the error only if we tried them all unsuccessfully and
    have no current secret-id otherwise return current
    secret-id.

    Change-Id: I2ee5ffe5d53e874efb3fabc6a880bf95b00a44f9
    Partial-Bug: #1849323
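
For reference, a hedged sketch of the behaviour the commit message above describes; the names are illustrative, not the charm's actual get_secret_id implementation:

    def get_secret_id(url, tokens, current_secret_id, retrieve_secret_id):
        """Try every token on the relation; fall back to the current secret-id.

        Raise only when every token fails and there is no current secret-id.
        """
        last_error = None
        for token in tokens:
            try:
                return retrieve_secret_id(url, token)
            except Exception as exc:  # e.g. hvac.exceptions.InvalidRequest
                last_error = exc
        if current_secret_id:
            return current_secret_id
        if last_error is not None:
            raise last_error
        raise RuntimeError('no tokens presented on the relation')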

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Changed the barbican-vault charm status to "Fix Committed", as for some reason Launchpad did not set it automatically.

Changed in charm-barbican-vault:
status: In Progress → Fix Committed
Changed in charm-ceph-osd:
milestone: 20.01 → 20.02
Changed in charm-nova-compute:
milestone: 20.01 → 20.02
Changed in charm-swift-storage:
milestone: 20.01 → 20.02
Changed in charm-barbican-vault:
status: Fix Committed → Fix Released
Changed in charm-ceph-osd:
status: Fix Committed → Fix Released
Changed in charm-nova-compute:
status: Fix Committed → Fix Released
Changed in charm-swift-storage:
status: Fix Committed → Fix Released
James Page (james-page)
Changed in charm-helpers:
status: Fix Committed → Fix Released
Revision history for this message
Hybrid512 (walid-moghrabi) wrote :

This is not fixed; I re-deployed a fresh cluster today and still have issues with the secrets-storage-relation-changed hook.

Here is the error from the debug session:

root@juju-c041c2-2-lxd-9:/var/lib/juju/agents/unit-barbican-vault-3/charm# ./hooks/secrets-storage-relation-changed
lib/charm/vault_utils.py:32: DeprecationWarning: Call to deprecated function '_post'. This method will be removed in version '0.8.0' Please use the 'post' method on the 'hvac.adapters' class moving forward.
  response = client._post('/v1/sys/wrapping/unwrap')
Traceback (most recent call last):
  File "./hooks/secrets-storage-relation-changed", line 22, in <module>
    main()
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-barbican-vault-3/charm/reactive/barbican_vault_handlers.py", line 94, in plugin_info_barbican_publish
    secret_id = get_secret_id(secrets_storage, current_secret_id)
  File "/var/lib/juju/agents/unit-barbican-vault-3/charm/reactive/barbican_vault_handlers.py", line 59, in get_secret_id
    secret_id = vault_utils.retrieve_secret_id(url, token)
  File "lib/charm/vault_utils.py", line 32, in retrieve_secret_id
    response = client._post('/v1/sys/wrapping/unwrap')
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/hvac/utils.py", line 174, in new_func
    return method(*args, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/hvac/v1/__init__.py", line 2579, in _post
    return self._adapter.post(*args, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 107, in post
    return self.request('post', url, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 304, in request
    utils.raise_for_error(response.status_code, text, errors=errors)
  File "/var/lib/juju/agents/unit-barbican-vault-3/.venv/lib/python3.8/site-packages/hvac/utils.py", line 32, in raise_for_error
    raise exceptions.InvalidRequest(message, errors=errors)
hvac.exceptions.InvalidRequest: wrapping token is not valid or does not exist

I'm deploying focal-ussuri and my barbican-vault charm release is:

barbican-vault 3.0.1 active 3 barbican-vault jujucharms 18 ubuntu

Changed in charm-barbican-vault:
status: Fix Released → New
assignee: Edward Hope-Morley (hopem) → nobody
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Discussed offline: Hybrid512 may be hitting lp:1871981 instead

Changed in charm-barbican-vault:
status: New → Fix Released