VaultKVContext context returning incomplete

Bug #1884312 reported by David Ames
This bug affects 2 people
Affects                        Status   Importance  Assigned to  Milestone
Ceph OSD Charm                 Triaged  High        Unassigned
Charm Helpers                  Invalid  Undecided   Unassigned
OpenStack Nova Compute Charm   Triaged  High        Unassigned
OpenStack Swift Storage Charm  Triaged  High        Unassigned

Bug Description

The VaultKVContext context is marking itself incomplete.

The environment:
CMR (cross-model relations)
vault in HA in one model
nova-compute with encrypt=True in another model
barbican-vault and nova-compute related to the secrets-storage relation via CMR
barbican-vault successfully completes its secrets-storage relation

Inspecting one of the nova-compute nodes:
Data, including the token, is on the relation from only one vault node, the leader. The other vault nodes send only their IP information.

In the nova-compute juju log we see the following:

2020-06-19 19:26:22 DEBUG juju-log secrets-storage:233: Encryption requested but vault relation not complete
2020-06-19 19:26:34 INFO juju-log secrets-storage:233: vault relation's interface, secrets-storage, is related but has no units in the relation.

Unfortunately, the VaultKVContext is effectively a black box: it only reports complete or incomplete, with very little logging [0]. So it is unclear whether the problem is a failure with tokens like bug 1849323 [1], or whether only one vault unit sends its data.

[0] https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/vaultlocker.py#L39
[1] https://bugs.launchpad.net/charm-barbican-vault/+bug/1849323

Revision history for this message
David Ames (thedac) wrote :

Initial thoughts and triage

I am not convinced that CMR is the problem.
I also think the "vault relation's interface, secrets-storage, is related but has no units in the relation." message may just be the assess-status function's generic interpretation of the incomplete context.

The root problem is VaultKVContext returning incomplete.

TRIAGE:

Since there has been recent work stemming from LP bug #1849323 [1][0], that is the first place we should check.

Set up an environment with vault and nova-compute (encrypt=True) related over secrets-storage.
Use the refresh-secrets action of vault to kick the secrets-storage relation.
Confirm or rule out bad tokens.
Confirm or rule out missing data from non-leader vault units.

If this does not produce results, set up the CMR environment.

[0] https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/vaultlocker.py#L39
[1] https://bugs.launchpad.net/charm-barbican-vault/+bug/1849323

Jeff Hillman (jhillman)
tags: added: cpe-onsite
Revision history for this message
Jeff Hillman (jhillman) wrote :

Subscribed field-critical; this is blocking a deployment and subsequent handover.

Revision history for this message
David Ames (thedac) wrote :

A simple test with everything in the same model did not display the problem.
http://paste.ubuntu.com/p/7shF7xQgGW/

@Jeff so we don't waste too much time:

What version of OpenStack?
What version of Ubuntu?
What version of the charms?

Revision history for this message
David Ames (thedac) wrote :

"good news"/bad news we can reproduce it. It is in fact, a CMR bug. And it is in effect a Not Implmented problem.

A fairly simple deploy scenario using func tests for vault and for nova-compute in separate models:

juju offer zaza-ead92f1ccacb.vault:secrets
Application "vault" endpoints [secrets] available at "admin/zaza-ead92f1ccacb.vault"
juju consume admin/zaza-ead92f1ccacb.vault remote-vault
juju add-relation remote-vault nova-compute

nova-compute has state Incomplete relations: vault

Because this is a test env I can post the relation data:

root@juju-ae180b-zaza-4cf922063f1a-5:/var/lib/juju/agents/unit-nova-compute-0/charm# relation-get -r secrets-storage:27 - remote-vault/0
egress-subnets: 10.5.0.61/32
ingress-address: 10.5.0.61
private-address: 10.5.0.61
vault_url: '"http://10.5.150.39:8200"'
root@juju-ae180b-zaza-4cf922063f1a-5:/var/lib/juju/agents/unit-nova-compute-0/charm# relation-get -r secrets-storage:27 - remote-vault/1
egress-subnets: 10.5.0.72/32
ingress-address: 10.5.0.72
private-address: 10.5.0.72
vault_url: '"http://10.5.150.39:8200"'
root@juju-ae180b-zaza-4cf922063f1a-5:/var/lib/juju/agents/unit-nova-compute-0/charm# relation-get -r secrets-storage:27 - remote-vault/2
egress-subnets: 10.5.0.25/32
ingress-address: 10.5.0.25
private-address: 10.5.0.25
remote-814c01e71f384b588d930775258f7fda/0_role_id: '"ace609a5-d182-aeff-4bdc-53763d500982"'
remote-814c01e71f384b588d930775258f7fda/0_token: '"s.UHS92pcut8MExrjPKcggVvPk"'
vault_url: '"http://10.5.150.39:8200"'

Note the unit id "remote-814c01e71f384b588d930775258f7fda"

However, the vaultlocker code is checking for the local unit name [0], i.e. nova-compute/0:
        for relation_id in hookenv.relation_ids(self.interfaces[0]):
            for unit in hookenv.related_units(relation_id):
                data = hookenv.relation_get(unit=unit,
                                            rid=relation_id)
                vault_url = data.get('vault_url')
                role_id = data.get('{}_role_id'.format(hookenv.local_unit()))
                token = data.get('{}_token'.format(hookenv.local_unit()))

I suspect, but have not yet proven, that barbican-vault is also broken regardless of its workload status, as the interface does the same as above [1].

[0] https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/vaultlocker.py#L64
[1] https://github.com/openstack-charmers/charm-interface-vault-kv/blob/master/requires.py#L72
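To illustrate the mismatch, here is a minimal sketch with hypothetical values modeled on the relation data pasted above (not the actual charm code): the provider keys the credentials by the anonymized remote unit name, while the consumer builds its lookup keys from its own local unit name, so the lookup comes back empty.

# Sketch only: why the charm-helpers lookup misses under CMR.
# Values are illustrative, based on the relation-get output above.
relation_data = {
    'vault_url': '"http://10.5.150.39:8200"',
    'remote-814c01e71f384b588d930775258f7fda/0_role_id': '"ace609a5-..."',
    'remote-814c01e71f384b588d930775258f7fda/0_token': '"s.UHS9..."',
}

local_unit = 'nova-compute/0'  # what hookenv.local_unit() returns on the consumer

# The context builds its keys from the local unit name...
role_id = relation_data.get('{}_role_id'.format(local_unit))  # -> None
token = relation_data.get('{}_token'.format(local_unit))      # -> None

# ...so both come back None and the context marks itself incomplete, even
# though credentials for this unit exist under the anonymized 'remote-.../0' key.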

Changed in charm-helpers:
status: New → Confirmed
Changed in charm-nova-compute:
status: New → Confirmed
Changed in charm-helpers:
importance: Undecided → Critical
Changed in charm-nova-compute:
importance: Undecided → Critical
milestone: none → 20.08
Revision history for this message
Stuart Bishop (stub) wrote :

The consuming end is (deliberately) anonymized and the offering end does not know details about the consumer, which is why the offering end sees the remote-814c01e71f384b588d930775258f7fda/0 unit ID. The PostgreSQL interface has the same issue. A workaround there is to reverse the relation, offering the client and consuming the server instead of vice versa. Or, better yet, drop the anonymization in Juju, as the rationale for it likely no longer applies.

Revision history for this message
Jeff Hillman (jhillman) wrote :

@stub I tried the workaround, but it is basically giving me the same results.

I created the offer: 'juju offer nova-compute:secrets-storage'

I then related the offer into the vault model ...

'juju add-relation foundations-maas:admin/openstack.nova-compute vault:secrets'

I see nova-compute executing hooks and then going back to waiting with an incomplete relation.

Then inside of the nova-compute unit logs it says the same message of there being no units available.

Then, doing the same methods as thedac earlier, running ...

'juju run --unit nova-compute/0 -- relation-get -r secrets-storage:254 - remote-<some-uuid>/0'

I see that only the /0 unit has the roles/secrets, and the /1 and /2 units have just the IP addresses.

So, same idea. Gathering logs now and will upload them shortly.

Revision history for this message
Jeff Hillman (jhillman) wrote :

Relevant data should start after June 22 13:00

Revision history for this message
Jeff Hillman (jhillman) wrote :

Some command output to help show how I did the reverse CMR:

$ juju offers -m openstack
Offer           User   Relation id  Status  Endpoint         Interface  Role      Ingress subnets
barbican-vault  admin  255          joined  secrets-storage  vault-kv   requirer
nova-compute    admin  254          joined  secrets-storage  vault-kv   requirer

$ juju show-offer nova-compute
Store URL Access Description Endpoint Interface Role
foundations-maas admin/openstack.nova-compute admin OpenStack Compute, codenamed Nova, is a cloud secrets-storage vault-kv requirer
                                                        computing fabric controller. In addition to
                                                        its "native" API (the OpenStack API), it also
                                                        supports the Amazon EC2 API. . This charm...

$ juju status -m vault | head
Model Controller Cloud/Region Version SLA Timestamp
vault foundations-maas maas_cloud 2.7.6 unsupported 13:53:06Z

SAAS Status Store URL
barbican-vault active foundations-maas admin/openstack.barbican-vault
graylog active foundations-maas admin/lma.graylog
nagios active foundations-maas admin/lma.nagios
nova-compute waiting foundations-maas admin/openstack.nova-compute
prometheus active foundations-maas admin/lma.prometheus-target

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thank you for your work on this issue.

The nova-compute charm is not currently validated for CMR scenarios. As thedac indicated, it is a NotImplemented feature in the charm. We expect there will be issues as you've found until the charm gets a distinct CMR development and testing effort. This is in our backlog, but not in the current roadmap for 20.10.

In the meantime, we would welcome and support a contribution by other community members or teams, starting with the specification process and by attaching a sample/sanitized bundle to this bug.

https://docs.openstack.org/charm-guide/latest/feature-specification.html

Ryan Beisner (1chb1n)
Changed in charm-nova-compute:
importance: Critical → Wishlist
Ryan Beisner (1chb1n)
tags: added: cross-model
Changed in charm-helpers:
importance: Critical → Wishlist
Revision history for this message
Jeff Hillman (jhillman) wrote :

Removing field-critical and subscribing field-high per managerial direction.

Revision history for this message
James Page (james-page) wrote :

The vault charm does have some support via its interface for CMR:

https://github.com/openstack-charmers/charm-interface-vault-kv/blame/master/provides.py#L72

reactive charms should be setting the unit_name key automatically:

https://github.com/openstack-charmers/charm-interface-vault-kv/blob/master/requires.py#L54

However, I can't see equivalent changes in nova-compute, swift-storage, or ceph-osd, which means they will not see the response correctly in a CMR deployment.

Changed in charm-ceph-osd:
status: New → Triaged
Changed in charm-nova-compute:
status: Confirmed → Triaged
Changed in charm-swift-storage:
status: New → Triaged
Changed in charm-helpers:
status: Confirmed → Invalid
Changed in charm-swift-storage:
importance: Undecided → Wishlist
Changed in charm-ceph-osd:
importance: Undecided → Wishlist
milestone: none → 20.08
Changed in charm-swift-storage:
milestone: none → 20.08
Revision history for this message
James Page (james-page) wrote :

tl;dr: the feature gap is all on the consuming side - each charm needs to present hookenv.local_unit() using the 'unit_name' key.
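A rough sketch of what that could look like in a classic charm's secrets-storage relation-joined hook (the hook name and wiring are illustrative, and it assumes the vault/interface side consumes the 'unit_name' key as described above):

# Illustrative sketch only: a classic (non-reactive) charm advertising its
# real unit name on the secrets-storage relation, so the provider can key
# the role_id/token correctly even when CMR anonymizes the unit id.
from charmhelpers.core import hookenv


def secrets_storage_relation_joined(relation_id=None):
    hookenv.relation_set(
        relation_id=relation_id,
        relation_settings={
            # Present hookenv.local_unit() explicitly; under CMR the remote
            # side otherwise only sees an anonymized 'remote-<uuid>/N' id.
            'unit_name': hookenv.local_unit(),
        },
    )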

Revision history for this message
James Page (james-page) wrote :

No charmhelpers change needed - it's doing the right thing, as the context is incomplete from the consuming side of the relation.

Revision history for this message
Jeff Hillman (jhillman) wrote :

In this same environment, I added a vault local to the openstack model, just for secrets-storage, and I'm seeing the same behavior.

To be clear, I now have two vaults: one in its own model just to hand out certificates, and a second one, in the openstack model, just for secrets-storage.

Nova-compute will NOT fully relate to the vault in the openstack model; it is giving the same messages as before, that there are no units in the relation.

When I run "juju run --unit nova-compute/0 'relation-list -r secrets-storage:267'" I see the 3 vault units.

When I run "juju run --unit nova-compute/0 'relation-get -r secrets-storage:267 - vault/0'" i see the egress-subnet, the ingress subnet and the role_id/token.

When I run this on the other 2 vault units, I only see the egress/ingress subnet information.

Nova-compute is sitting at 'Incomplete relation: vault'

Also, again like before with CMR and secrets-storage, barbican-vault is sitting at 'Unit is Ready' with this in-model vault deploy.

Revision history for this message
James Page (james-page) wrote :

barbican-vault == reactive
nova-compute == classic

So, given the state of the codebases, comment #15 reflects my comment in #12.

Revision history for this message
James Page (james-page) wrote :

hmm or maybe not...

Revision history for this message
Jeff Hillman (jhillman) wrote :

nova-compute/0 unit logs from openstack model

Revision history for this message
Jeff Hillman (jhillman) wrote :

vault/0 (leader) unit logs from openstack model

Revision history for this message
Jeff Hillman (jhillman) wrote :

Re-escalating to field-critical since the proposed workaround doesn't work.

Ryan Beisner (1chb1n)
Changed in charm-helpers:
importance: Wishlist → High
Changed in charm-ceph-osd:
importance: Wishlist → High
Changed in charm-nova-compute:
importance: Wishlist → High
Changed in charm-swift-storage:
importance: Wishlist → High
Changed in charm-helpers:
importance: High → Undecided
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Two questions:

1. Re: my comment #10, can you please post a sanitized bundle that can be used to describe this topology and config? We need to narrow our reproduction scope to match what you're deploying.

2. Was the model as-reported formerly part of a multi-model deployment? Or, is this reproduced with a clean deploy?

Thank you.

Revision history for this message
Jeff Hillman (jhillman) wrote :

Sanitized bundle being uploaded.

RE: question 2, the openstack model is clean and is pointing to the local vault relations. The vault model has not been touched, it just isn't being referenced the same way.

Revision history for this message
Jeff Hillman (jhillman) wrote :

sanitized bundle

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Jeff, here's what I'm seeing in my local deployment. This is with the attached focal-ussuri bundle; I know you're using bionic-stein, but I'm not sure the OpenStack release matters. All services are active, but it seems as if I'm getting relation-get data similar to what you were seeing: only getting role_id and token from the vault leader.

$ juju status --format short nova-compute vault

- nova-compute/0: 10.5.0.21 (agent:idle, workload:active)
  - ovn-chassis/0: 10.5.0.21 (agent:idle, workload:active)
- nova-compute/1: 10.5.0.184 (agent:idle, workload:active)
  - ovn-chassis/2: 10.5.0.184 (agent:idle, workload:active)
- nova-compute/2: 10.5.0.164 (agent:idle, workload:active)
  - ovn-chassis/1: 10.5.0.164 (agent:idle, workload:active)
- vault/0: 10.5.0.8 (agent:idle, workload:active) 8200/tcp
  - vault-mysql-router/1: 10.5.0.8 (agent:idle, workload:active)
- vault/1: 10.5.0.29 (agent:idle, workload:active) 8200/tcp
  - vault-mysql-router/2: 10.5.0.29 (agent:idle, workload:active)
- vault/2: 10.5.0.185 (agent:idle, workload:active) 8200/tcp
  - vault-mysql-router/0: 10.5.0.185 (agent:idle, workload:active)

$ juju run --unit nova-compute/0 'relation-get -r secrets-storage:63 - vault/0'
egress-subnets: 10.5.0.8/32
ingress-address: 10.5.0.8
private-address: 10.5.0.8
vault_url: '"http://10.5.0.8:8200"'

$ juju run --unit nova-compute/0 'relation-get -r secrets-storage:63 - vault/1'
egress-subnets: 10.5.0.29/32
ingress-address: 10.5.0.29
nova-compute/0_role_id: '"ef15b909-efb8-607c-dc34-77ff094e4fc9"'
nova-compute/0_token: '"s.TfCNHSO8lzDUgkZ7jFiNN6el"'
nova-compute/1_role_id: '"65063d25-5e31-a6c4-0d67-9fc7b2f0d136"'
nova-compute/1_token: '"s.j2cO4M12GavhyXQjKsekA8fB"'
nova-compute/2_role_id: '"57b9af31-2003-6950-d824-1803e304bec2"'
nova-compute/2_token: '"s.KJWpuoXbpTbOJOewJdXjZWqh"'
private-address: 10.5.0.29
vault_url: '"http://10.5.0.29:8200"'

$ juju run --unit nova-compute/0 'relation-get -r secrets-storage:63 - vault/2'
egress-subnets: 10.5.0.185/32
ingress-address: 10.5.0.185
private-address: 10.5.0.185
vault_url: '"http://10.5.0.185:8200"'

$ juju run --unit vault/1 'is-leader'
True

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Were you seeing vault_url in all relation-gets?

Is it possible there are any left-over CMR relations in the deployment that might be blocking? I wonder if juju export-bundle would shed light on that:
juju export-bundle > /tmp/exported-bundle.yaml

Revision history for this message
Jeff Hillman (jhillman) wrote :

That particular model no longer exists. I am currently trying a completely local vault, with no CMR for vault in this model, as a test.

However, previously I uploaded the vault and nova-compute unit logs from this model, perhaps it is in there ... ?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I didn't see much in the logs other than the nova-compute logs showing the vault secrets-storage relation never getting any units. It feels like something residual from CMR might be causing that. Your fresh non-CMR redeploy should prove or disprove that theory. Please let me know how the redeploy goes.

Revision history for this message
Jeff Hillman (jhillman) wrote :

I completely took CMR (for vault) out of the picture and added a local vault in the same model, and am still getting an incomplete relation to vault for nova-compute in that model.

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

I did some tests to reproduce the bug and see if I hit the same issue as Jeff.

1) I deployed with CMR: one model with vault, one model with nova-compute
2) When relating vault:secrets to nova-compute, the relation does not complete. Bug reproduced.
3) I removed the CMR relationship
4) I deployed a unit of vault in the same model as nova-compute
5) I related vault:secrets and nova-compute:secrets-storage. It took about a minute for them to complete the relation, but they did. I did not see the bug when they are both deployed in the same model.

Perhaps there is some residual data at the controller level that is causing your issue, Jeff? If all else fails, it could be worth redeploying the controller and the models.

Revision history for this message
Jeff Hillman (jhillman) wrote :

Thanks, Camille. I recently did just that and tore down the whole environment. Will know soon, once the openstack model is up, whether the issue is still present.

Revision history for this message
David Ames (thedac) wrote :

@Jeff,

We have been running down a different barbican-vault bug that *might* give us insight into this bug.

Please see James Page's discussion of the required spaces bindings and validate that your bundle is binding them accordingly, to rule this out.

https://bugs.launchpad.net/charm-barbican-vault/+bug/1886424

Revision history for this message
Jeff Hillman (jhillman) wrote :

I have a "" rule for vault, baribican vault, nova-compute and everyone else. We should know by EOD if my vault in the same model still exists.

Revision history for this message
Liam Young (gnuoy) wrote :

Encryption at rest via cross-model relations should be fixed by these:

interface-vault-kv https://github.com/openstack-charmers/charm-interface-vault-kv/pull/11
ceph-osd https://review.opendev.org/740545
vault https://review.opendev.org/740546

Revision history for this message
Liam Young (gnuoy) wrote :

Those patches fix ceph-osd connected to vault over a cross-model relation.

Revision history for this message
Jeff Hillman (jhillman) wrote :

I was able to get vault working in the same model by having ALL bindings point to one network.

I had secrets-storage and secrets all on the same binding for all charms, but this didn't appear to work. Simplifying the bindings to just one solved the issue. Not sure if this is a bug or not, but it does work in-model.

James Page (james-page)
Changed in charm-nova-compute:
milestone: 20.08 → none
Changed in charm-swift-storage:
milestone: 20.08 → none
Changed in charm-ceph-osd:
milestone: 20.08 → none