barbican-vault is failing in hook failed: "secrets-storage-relation-changed"

Bug #1871981 reported by Wouter van Bommel
44
This bug affects 7 people
Affects Status Importance Assigned to Milestone
OpenStack Barbican-Vault Charm
Fix Released
Critical
Unassigned

Bug Description

The charm is failing in the hook hook failed: "secrets-storage-relation-changed". Logging on the unit show's it's having problems with: hvac.exceptions.InvalidRequest: wrapping token is not valid or does not exist
but no clues are available on how to resolve this issue.

Please find attached some more extensive logging.

Juju version = 2.7.5 (was 2.5.8 before, upgrade made no difference (as expected))
charm version = stable/19.07

Revision history for this message
Wouter van Bommel (woutervb) wrote :

As an addition, vault charm is a stable/19.04

Revision history for this message
Wouter van Bommel (woutervb) wrote :
Revision history for this message
Ashley Lai (alai) wrote :

I'm hitting the same error.

openstack-origin: cloud:bionic-stein

Charm version:
cs:barbican-vault-13
cs:barbican-28
cs:vault-37

2020-04-24 03:36:46 ERROR juju-log secrets-storage:58: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-barbican-vault-0/charm/reactive/barbican_vault_handlers.py", line 94, in plugin_info_barbican_publish
    secret_id = get_secret_id(secrets_storage, current_secret_id)
  File "/var/lib/juju/agents/unit-barbican-vault-0/charm/reactive/barbican_vault_handlers.py", line 59, in get_secret_id
    secret_id = vault_utils.retrieve_secret_id(url, token)
  File "lib/charm/vault_utils.py", line 25, in retrieve_secret_id
    response = client._post('/v1/sys/wrapping/unwrap')
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/hvac/utils.py", line 174, in new_func
    return method(*args, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/hvac/v1/__init__.py", line 2579, in _post
    return self._adapter.post(*args, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/hvac/adapters.py", line 107, in post
    return self.request('post', url, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/hvac/adapters.py", line 336, in request
    response = super(JSONAdapter, self).request(*args, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/hvac/adapters.py", line 304, in request
    utils.raise_for_error(response.status_code, text, errors=errors)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.6/site-packages/hvac/utils.py", line 32, in raise_for_error
    raise exceptions.InvalidRequest(message, errors=errors)
hvac.exceptions.InvalidRequest: wrapping token is not valid or does not exist

Revision history for this message
Ashley Lai (alai) wrote :

Subscribed to field high as it is blocking our deployment.

Note that we have seen this issue multiple times on intermediate CA vault deployment.

David Ames (thedac)
Changed in charm-barbican-vault:
status: New → Confirmed
importance: Undecided → High
importance: High → Critical
assignee: nobody → David Ames (thedac)
Revision history for this message
David Ames (thedac) wrote :

This affected all versions of the charm as the root cause is a change in the hvac.Client post response type from an object to a dictionary.

Fix on the way.

As a supplementary note. The tokens generated are one time use. The exception, "hvac.exceptions.InvalidRequest: wrapping token is not valid or does not exist" was seen on subsequent hook runs.

For future information (after the response type change fix lands), if one sees the hvac.exceptions.InvalidRequest exception, one can refresh the one time use tokens with the following:

juju run-action --wait vault/0 refresh-secrets

Revision history for this message
David Ames (thedac) wrote :
Changed in charm-barbican-vault:
status: Confirmed → In Progress
David Ames (thedac)
Changed in charm-barbican-vault:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-barbican-vault (stable/20.02)

Fix proposed to branch: stable/20.02
Review: https://review.opendev.org/723617

Revision history for this message
David Ames (thedac) wrote :
Revision history for this message
David Ames (thedac) wrote :

Re opening this one as it turns out my initial fix was too simplistic:

https://github.com/hvac/hvac/pull/537/files#diff-b8514c4f6ceea573e955553e996f5e04R336-R343

Changed in charm-barbican-vault:
status: Fix Committed → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-barbican-vault (master)

Fix proposed to branch: master
Review: https://review.opendev.org/724402

Changed in charm-barbican-vault:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-barbican-vault (stable/20.02)

Change abandoned by David Ames (<email address hidden>) on branch: stable/20.02
Review: https://review.opendev.org/723617
Reason: Better solution going to master: Change-Id: Ia8517a75e7bc20f751eca83aaa84728cc62edee6

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-barbican-vault (master)

Reviewed: https://review.opendev.org/724402
Committed: https://git.openstack.org/cgit/openstack/charm-barbican-vault/commit/?id=f6546dda33636bca1a94cbce7736b350f64ab74a
Submitter: Zuul
Branch: master

commit f6546dda33636bca1a94cbce7736b350f64ab74a
Author: David Ames <email address hidden>
Date: Wed Apr 29 12:14:08 2020 -0700

    More robust hvac.Client post response handling

    Pin the version of hvac for a consistent response type. Use the adpater
    to get response code and json data.

    Closes Bug: #1871981
    Change-Id: Ia8517a75e7bc20f751eca83aaa84728cc62edee6

Changed in charm-barbican-vault:
status: In Progress → Fix Committed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I'm marking this Fix released as it landed in the last cycle. I suspect that the automation around it broke because it went committed -> triaged -> inprogress -> committed.

Changed in charm-barbican-vault:
status: Fix Committed → Fix Released
Revision history for this message
Joe Guo (guoqiao) wrote :

I hit this bug today while doing charm upgrades.
The upgrade path is 19.07->19.10->20.02->20.05.
However, the bugfix is only released to 20.05.
I tried to cherry-pick it to 19.07 and it conflicts.
So we cancelled the charm upgrade.
It will be very helpful if we can have backports for bugfix like this.

Revision history for this message
Joe Guo (guoqiao) wrote :

Hi OpenStack team,

I'll add more details about how this bug blocked charm upgrades.

1). As BootStack charm upgrade policy, we have to upgrade charms per release. For example, in my situation, cloud is at 19.07, to upgrade to 20.05, we have to perform 3 upgrades:

- 19.07 -> 19.10
- 19.10 -> 20.02
- 20.02 -> 20.05

2). In theory, we can not leave an important charm (e.g.: barbican-vault) in error and just upgrade it to latest. We have to fix the error and move on to next release. So the real upgrade process for us will be:

19.07 -> 19.07 + bugfix --> 19.10 + bugfix --> 20.02 + bugfix -> 20.05

3). The bugfix commit (f6546dda33636bca1a94cbce7736b350f64ab74a) is merged into master, and released to 20.05, but not backported to old releases/branches 19.07/19.10/20.02.

4). I tried to clone the charm barbican-vault repo, cherry-pick above commit to 19.07/19.10/20.02 and rebuild the charm myself. But cherry-pick failed with conflicts.

5). Although the conflicts could be easy to resolve in git, but since our maintenance window is in limited timeslot, after investigation to figure out above situation, time is not enough to finish all 3 upgrades. And we didn't test the cherry-pick which is also risky. So after talk with my manager, we decided to cancel the upgrades.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

tldr; My suggestion in this case is option #2 below.

--

Thank you for the added information. First I want to confirm again that we are aware of the gap between the current release flow, and the operational goal of stepping through individual charm upgrades to catch up to the stable release.

Ultimately, this bug fix will only be released to the 20.05 (latest stable charm) in the charm store under the current release model. The charm store has exactly 1 stable namespace (ie. cs:barbican-vault == cs:~openstack-charmers/barbican-vault == git branch stable/20.05).

When the charm store/hub and the related tooling grows track/channel support, we plan to revisit that flow. But to be clear, that will not be for some time yet and there are currently no plans to re-release an updated 19.07, 19.10, or 20.02 stable set of charms with this bug's patch applied.

I see two paths forward:

1) Locally-patch older stable charm revisions and step-upgrade through them. There is a natural risk in doing so. As highlighted in previous comments, the patch may not pick cleanly back to older stable branches. Additionally, the patched fork will not have gone through the rigor and regression testing that our released revisions go through. Other unexpected issues may arise during or after the charm upgrade.

2) Upgrade the charm to the latest revision where the bug is fixed. This charm upgrade path is not tested. The charm revisions are only tested N+1 during release, which means there is an assumed risk here as well. But with this option, there are fewer hops in getting the system to a resolved state. Be aware that manual operator intervention may be necessary to bring the charm and the workload back to a ready/steady state after the skip-revision upgrade. Please document all cs:urls and "repo-info" file git hash contents along the way, in the event that additional assistance is needed.

To be clear, the above isn't intended to change the standing recommendation of N+1 charm upgrades. But, when blocked in a case such as this, I think it is reasonable to do a controlled skip-revision charm upgrade to the latest revision after testing that jump in a non-production environment.

Revision history for this message
Dean Henrichsmeyer (dean) wrote :

Let me clarify. Upgrade to the latest stable revision where the bug is fixed and engineering will 100% support you if and when issues arise as a result of that path.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We seem to have hit this again on focal testing yesteray. At 16:20 barbican-vault and all other openstack services were Ready; vault was unlocked and all setup.

At 17:20, we started running secrets-storage-relation-changed for an unknown reason, and in that we hit:

2020-09-02 17:20:23 INFO juju-log secrets-storage:48: Retrieving secret-id from vault (http://vault-internal.prodymcprodface.solutionsqa:8200)
2020-09-02 17:20:23 DEBUG jujuc server.go:211 running hook tool "juju-log"
2020-09-02 17:20:23 ERROR juju-log secrets-storage:48: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-barbican-vault-0/charm/reactive/barbican_vault_handlers.py", line 94, in plugin_info_barbican_publish
    secret_id = get_secret_id(secrets_storage, current_secret_id)
  File "/var/lib/juju/agents/unit-barbican-vault-0/charm/reactive/barbican_vault_handlers.py", line 59, in get_secret_id
    secret_id = vault_utils.retrieve_secret_id(url, token)
  File "lib/charm/vault_utils.py", line 32, in retrieve_secret_id
    response = client._post('/v1/sys/wrapping/unwrap')
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/hvac/utils.py", line 174, in new_func
    return method(*args, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/hvac/v1/__init__.py", line 2579, in _post
    return self._adapter.post(*args, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 107, in post
    return self.request('post', url, **kwargs)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 304, in request
    utils.raise_for_error(response.status_code, text, errors=errors)
  File "/var/lib/juju/agents/unit-barbican-vault-0/.venv/lib/python3.8/site-packages/hvac/utils.py", line 32, in raise_for_error
    raise exceptions.InvalidRequest(message, errors=errors)
hvac.exceptions.InvalidRequest: wrapping token is not valid or does not exist

This was with cs:~openstack-charmers-next/barbican-vault-50

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Reproduced today on cs:barbican-vault-16 with the same traceback as above.

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Also, worth noting that I had a slightly different reproducer:

1) I had a bionic-ussuri cloud
2) The cloud itself was in deploying state, meaning a lot of things were happening in parallel.

I will collect the juju crashdump in case I'll hit this again.

Revision history for this message
David Ames (thedac) wrote :

Since "hvac.exceptions.InvalidRequest: wrapping token is not valid or does not exist" is a generic error for a one time use token I suspect this bug will never die. Though the original bug was resolved, reopening as from around comment #14 new instances of the generic error have occurred.

Changed in charm-barbican-vault:
status: Fix Released → Confirmed
Revision history for this message
Rony Zeidan (ronynov) wrote :

Hello,
we are facing this issue too. We hit this while in 19.07. we upgraded to the latest version cs:barbican-vault-16 but still same behavior. This is a bionic-stein cloud.
Any idea on how to solve thi.

Revision history for this message
Zachary Zehring (zzehring) wrote :

I encountered this issue on 19.10 in an 3-unit HA Vault deployment. Checking the vault relation data that's provided to barbican-vault units, the non-leader vaults also are providing a token which are the old tokens from when the leader was different. To resolve, the following steps were taken (vault/1 is the leader):

1. refresh secrets using vault lead
juju run-action --wait vault/leader refresh-secrets

2. unset all but leader barbican-vault/X_token using relation-set
juju run -u vault/0,vault/2 -- relation-set -r secrets-storage:402 barbican-vault/0_token='' barbican-vault/1_token='' barbican-vault/2_token=''

3. juju resolve --no-retry on the units until they correctly have the tokens set (check what's currently set by using "relation-get -r secrets-storage - vault/1")

4. repeat until you get to the point where barbican-vault picks up the correct token
NOTE: when refreshing secrets, barbican-vault does not get the new token until resolve --no-retry is run, as it will try with the old token. Also, the tokens are only good for an hour, so the above must be done within an hour of refreshing secrets.

Liam Young (gnuoy)
Changed in charm-barbican-vault:
assignee: David Ames (thedac) → Liam Young (gnuoy)
Revision history for this message
James Page (james-page) wrote :

Seems like there are two potential issues here:

1) deployments taking some time to complete and wrapping tokens expiring before they are used.

2) leadership changing in vault and consuming charms still attempting to access using the older tokens provided by the previous leader(s).

I think this all points to the fact that we need to be more resilient to these situations in the client code.

If 1) is encountered then the unit should go into a blocked state OR we should add relation semantics for the consuming unit to request a new wrapping token.

If 2) is encountered then two potential routes - when units become non-leader they could clear any previously set tokens - this will generate quite a bit of hook execution. OR clients could look at all presented tokens and deal with any exceptions of this type that are raised. Slightly awkward as its not a specific exception that is raised.

Revision history for this message
Liam Young (gnuoy) wrote :

I think a big part of the issue was barbican-vault looking in the wrong place for a current approle_secret_id. This is fixed by:

https://review.opendev.org/c/openstack/charm-barbican-vault/+/764470

Changed in charm-barbican-vault:
status: Confirmed → Fix Committed
assignee: Liam Young (gnuoy) → nobody
Revision history for this message
João Pedro Seara (jpseara) wrote :

Thank you very much for the workaround Zachary. It didn't work * quite * like that, so I had to make some extra commands.

Here's the process:

Assuming vault/1 is the leader.

1. Get relation ID:

juju run --unit vault/1 "relation-ids secrets"
secrets:131

2. Confirm relation "secrets:131" is actually related to barbican-vault and not any other charm:

juju run --unit vault/1 -- 'relation-get -r secrets:131 - vault/1'

# You should have outputs like 'role_id' and 'token', all related to barbican-vault

3. Let's clear everything in vault:

juju run --unit vault/0,vault/2 -- relation-set -r secrets:131 barbican-vault/0_role_id='' barbican-vault/1_role_id='' barbican-vault/2_role_id=''
juju run --unit vault/0,vault/1,vault/2 -- relation-set -r secrets:131 barbican-vault/0_token='' barbican-vault/1_token='' barbican-vault/2_token=''

# Please note that the first command is only applied to the non-leader vaults. The second is applied to all. This is on purpose.

4. Refresh the secrets:

juju run-action --wait vault/1 refresh-secrets

5. Run:

juju resolve --no-retry barbican-vault/0
juju resolve --no-retry barbican-vault/1
juju resolve --no-retry barbican-vault/2
juju status barbican-vault

6. Repeat step 5 till all active.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Marking as fix-released based on:

https://review.opendev.org/plugins/gitiles/openstack/charm-barbican-vault/+/26ba0535efe16f50aca7ffad5496ee9b344a7d7d

charm-barbican-vault$ git branch --contains=26ba0535efe16f50aca7ffad5496ee9b344a7d7d
* master
  stable/21.01

Changed in charm-barbican-vault:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.