secrets-storage-relation-changed hook remains in error state

Bug #1821209 reported by John George
This bug affects 2 people
Affects        Status    Importance   Assigned to   Milestone
vault-charm    Invalid   High         James Page    none

Bug Description

Deploying vault with an FCB bundle results in secrets-storage-relation-changed hook errors on the ceph-osd and nova-compute units.

We see that the url in /etc/vaultlocker/vaultlocker.conf is not updated with a vault unit IP or its VIP.
The vault VIP set in the bundle is 10.22.130.32

The juju-crashdump is located at:
http://people.canonical.com/~jog/juju-crashdump-126b95e2-aa73-4827-927b-712aeb564fc1.tar.xz

The bundle is at:
https://people.canonical.com/~jog/1821209

Juju status:
https://people.canonical.com/~jog/1821209/juju_status_vault_hook_errors.status

cat /etc/vaultlocker/vaultlocker.conf
[vault]
url = http://10.5.0.13:8200
approle = e256bf3b-fb28-b1d6-f2fb-3adc8339d3ad
secret_id = 9428ad25-7b4a-442f-8f20-f23be0575146
backend = secret

Revision history for this message
John George (jog) wrote :

There is also a proxy in use at this site. So any http/https connection requirements are being added to juju's no-proxy model-config.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Please provide a record of the expected CIDRs on each space. I need to interpret the unit IP addresses, network space CIDRs, and VIPs together in order to get the full picture (i.e. to understand the logical networks in play).

Take note too that one of the nova-cloud-controller units is failing to install packages via apt, which may be indicative of network config or connectivity issues.
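
For anyone cross-checking, a VIP can be tested against each space's CIDR with the Python standard library. This is a minimal sketch, not project tooling; the space and VIP values below are taken from a later comment on this bug and are used purely as an example.

import ipaddress

# Example values taken from a later comment on this bug.
spaces = {
    'oam-space': ['100.100.184.0/23'],
    'external-space': ['100.100.186.0/25'],
}
vip = ipaddress.ip_address('100.100.184.58')

for name, cidrs in spaces.items():
    for cidr in cidrs:
        if vip in ipaddress.ip_network(cidr):
            print('{} is within {} ({})'.format(vip, name, cidr))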

Changed in vault-charm:
importance: Undecided → High
assignee: nobody → Ryan Beisner (1chb1n)
milestone: none → 19.04
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thank you for your work on this.

Absent a full set of spaces/CIDR details, the best we can tell is that the VIP assigned to vault was not within the network range of the declared network space binding.

A fresh redeploy is underway. If issues continue, please re-raise this bug and include new logs, including network space CIDR details.

Changed in vault-charm:
status: New → Incomplete
Revision history for this message
John George (jog) wrote :

Network config used by FCB to configure spaces in MAAS.

Changed in vault-charm:
status: Incomplete → New
Revision history for this message
Ryan Beisner (1chb1n) wrote :

The bug is set back to 'new' status. Does that mean that you hit it again? If so, new logs, please & thanks.

Changed in vault-charm:
status: New → Incomplete
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

As an additional observation / question, why is a new deploy being done with Xenial instead of Bionic?

Revision history for this message
John George (jog) wrote :

Xenial was a customer requirement.

David Ames (thedac)
Changed in vault-charm:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in vault-charm:
milestone: 19.07 → 19.10
Revision history for this message
Andrea Ieri (aieri) wrote :

I've just hit this, and I bet it's a race condition because over multiple redeploys of the same cloud it only happened once, and only on some units.

$ juju spaces
Space Subnets
ceph-access-space 100.100.187.0/24
ceph-replica-space 100.100.186.128/25
external-space 100.100.186.0/25
oam-space 100.100.184.0/23
overlay-space 100.100.188.0/26
                    100.100.188.128/26
                    100.100.188.64/26
                    100.127.2.0/24

$ juju config vault vip
100.100.184.58

$ juju run --timeout=10s -a ceph-osd -- sudo grep url /etc/vaultlocker/vaultlocker.conf | grep url | sort | uniq -c
ERROR timed out waiting for result from: unit ceph-osd/1
      4 url = http://100.100.184.58:8200
      1 url = http://10.5.0.13:8200

$ juju run --timeout=10s -a nova-compute-kvm -- sudo grep url /etc/vaultlocker/vaultlocker.conf | grep url | sort | uniq -c
ERROR timed out waiting for result from: unit nova-compute-kvm/3
     16 url = http://100.100.184.58:8200
     11 url = http://10.5.0.13:8200

10.5.0.13 does not fall within any local network as far as I can see, not even lxdbr0.

The cloud is Bionic Queens, 19.07 charms.

Revision history for this message
Andrea Ieri (aieri) wrote :

Sorry, wrong unit log... this one does show the error message:
hvac.exceptions.InvalidRequest: wrapping token is not valid or does not exist

Changed in vault-charm:
status: Incomplete → New
Revision history for this message
Andrea Ieri (aieri) wrote :

As a workaround, removing and re-adding the relation worked for me.

Revision history for this message
James Page (james-page) wrote :

I'm highly suspicious of the 10.5.0.13 IP address - it is the same address as in the original bug report, and it's unclear where it is actually coming from.

Revision history for this message
James Page (james-page) wrote :

Note that the code that retrieves the secret_id (which is generating the InvalidRequest exception) does not use the vaultlocker configuration file - it directly uses the URL and response-wrapped token presented on the relation.
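
For context, here is a rough sketch of that retrieval path, assuming the hvac Python client; the function name is illustrative and this is not the charm's exact code. The vault URL and one-shot wrapping token come straight off the relation, and vaultlocker.conf is never consulted at this point.

import hvac

def retrieve_secret_id(vault_url, wrapped_token):
    # Authenticate with the one-shot wrapping token presented on the relation.
    client = hvac.Client(url=vault_url, token=wrapped_token)
    # Unwrap it to obtain the secret_id. If the token has already been used
    # or its TTL has passed, vault raises hvac.exceptions.InvalidRequest:
    # "wrapping token is not valid or does not exist".
    response = client.sys.unwrap()
    return response['data']['secret_id']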

Revision history for this message
James Page (james-page) wrote :

url = http://10.5.0.13:8200

is a red herring - this is the example URL shipped in the packaged vaultlocker.conf. It indicates that the unit has failed to retrieve a secret_id using the token it was provided, rather than being the root cause of the problem.

Revision history for this message
James Page (james-page) wrote :

I have a suspicion that the token is expiring before it gets used; the TTL is set to 1hr, which means the consuming charm must have a hook execution within that time frame or the token will no longer be valid. Removing and re-adding the relation causes the tokens to be re-issued, and because that happens outside the initial deployment there are fewer hook executions going on, so it all happens in under 1hr.

This is a particular issue on baremetal with multiple principal charms and lots of subordinates.
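
To make the lifecycle concrete, here is a hedged sketch of the issuing side, again assuming the hvac client; the role handling is illustrative and this is not the charm's exact code. Each consuming unit gets its own response-wrapped AppRole secret_id, and only the wrapping token (with its 1hr TTL) crosses the relation.

import hvac

def issue_wrapped_secret_id(vault_url, vault_token, role_name):
    client = hvac.Client(url=vault_url, token=vault_token)
    # Generate a secret_id for the unit's AppRole, response-wrapped with a
    # one hour TTL.
    wrapped = client.write('auth/approle/role/{}/secret-id'.format(role_name),
                           wrap_ttl='1h')
    # Only this wrapping token is placed on the relation; the consuming unit
    # must unwrap it within the hour, in a hook execution, or it expires.
    return wrapped['wrap_info']['token']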

Revision history for this message
James Page (james-page) wrote :

@aieri are you using totally-unsecure-auto-unlock? This will cause vault to be ready to start issuing tokens very early in the model deployment, which might exacerbate this issue.

Switching to automating the unseal of vault at the end of deployment would avoid this issue; I'm loath to increase the TTL above 1hr, as doing so starts to compromise the overall security posture of the deployment.
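
For reference, automating the unseal after the deployment settles could look something like the following; this is a minimal sketch assuming the hvac client and that the unseal keys are held by the operator's own tooling, not something the charm does.

import hvac

def unseal_vault(vault_url, unseal_keys):
    client = hvac.Client(url=vault_url)
    for key in unseal_keys:
        if not client.sys.is_sealed():
            break
        client.sys.submit_unseal_key(key)
    return not client.sys.is_sealed()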

Changed in vault-charm:
status: New → Incomplete
assignee: Ryan Beisner (1chb1n) → James Page (james-page)
James Page (james-page)
Changed in vault-charm:
milestone: 19.10 → none
Revision history for this message
Andrea Ieri (aieri) wrote :

@james-page, no, totally-unsecure-auto-unlock is set to false.

I encountered the issue after redeploying this cloud without killing the controllers. I don't know if that's actually related, but I'm experiencing general juju sluggishness, so even though 1h is a long time, your theory might actually apply here.

I'm going to redeploy again (including the controllers) and see if it happens again.

Revision history for this message
Andrea Ieri (aieri) wrote :

@james-page are the tokens relation-specific or global? One thing that surprised me was that re-adding the specific [vault:secrets, swift-storage-zone2:secrets-storage] relation resolved the issue with the other apps as well, but after reading your analysis it starts to make sense.

Revision history for this message
James Page (james-page) wrote :

The tokens used to wrap the secret are specific to each unit that needs one.

Revision history for this message
James Page (james-page) wrote :

They expire after 1hr and are one-shot - i.e. once they have been used, they don't work any longer.
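
To illustrate the one-shot behaviour (a hedged sketch with the hvac client, not charm code): the first unwrap succeeds, and any further attempt, or an attempt after the 1hr TTL, fails with the exception quoted earlier in this bug.

import hvac

def demonstrate_one_shot(vault_url, wrapping_token):
    client = hvac.Client(url=vault_url, token=wrapping_token)
    secret_id = client.sys.unwrap()['data']['secret_id']  # first use succeeds
    try:
        client.sys.unwrap()  # second use of the same token
    except hvac.exceptions.InvalidRequest:
        # "wrapping token is not valid or does not exist" - the same error
        # appears when the 1hr TTL has passed before first use.
        pass
    return secret_id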

John George (jog)
description: updated
Revision history for this message
James Page (james-page) wrote :

As detailed in #17, still waiting for feedback from @aieri on whether the general sluggishness of the deployment environment caused the TTLs on the tokens to expire before the secret_ids could be retrieved.

Revision history for this message
James Page (james-page) wrote :

@aieri - this bug still has a field-critical subscription - it would be good to get it moving forward, or to close it out as the TTL limit being hit due to slowness in the environment.

Revision history for this message
James Page (james-page) wrote :

This field-crit bug has not had any updates in nearly 2 months - @aieri, is this still an issue? I'm pretty sure it's a slow deployment hitting the TTL on the token used to retrieve the secret_id.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Closing this bug -- Please reset to NEW and add new logs and evidence if the issue persists. Thank you.

Changed in vault-charm:
status: Incomplete → Invalid