Secrets do not behave as expected

Bug #2037120 reported by Pedro Guimarães
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Ian Booth

Bug Description

According to multiple conversations with Juju team, the expected behavior is:

https://chat.charmhub.io/charmhub/pl/pa7d6mcpdigo3pb89c6kh8wfge
https://chat.charmhub.io/charmhub/pl/mnxr9u4dapd8xre6cq5emw9h3e

Juju team explained the following behavior is expected for secrets:
1) Update a given secret value with "secret-set" (which creates revision N for this discussion)
2) secret-changed is thrown
3) In secret-changed: get the old value of the secret:
    event.secret.get_content()
4) Get the new value of the secret:
    event.secret.peek_content()
5) Run any activities that need the secrets (e.g. changing a password in the database)
6) Commit back to Juju that revision N-1 is obsolete
    event.secret.get_content(refresh=True)
7) Finish the event: doFlush will be called from the juju agent side

I am currently running on Juju built from 3.3 commit: 230e203a72617a348072b896056054b3761203a8 (Sep, 20th) compiled for debug.

I can say this is not at all the behavior we've found.

First, in the secret-changed, I can run PDB and check the values of each of the calls above:
https://pastebin.ubuntu.com/p/Y3bkt23tsS/

And we can see they are all the same. Therefore, the flow described above is not functional.

---------------------------------------------------------

Looking into Juju code, what I can see happening is the following:

In the agent managing the charm:
1) jujuc is called with "secret-set" to set a new value (new: revision N, old: revision N-1) [1]
2) UpdateSecret calls "update" [2]
3) update method registers a new update in secretChangeRecorder's pendingUpdates [3]

Once the event is finished, the agent calls doFlush [4], which will iterate over each secret present in pendingUpdates and create a new secret with revision=N in the SecretBackend. Then, doFlush calls its API back to the controller with UpdateSecret request [5,6].

Then, UpdateSecrets is called in the controller, which then calls "updateSecrets"[7]. That method will eventually call: markObsoleteRevisionOps, which will mark the revision N-1 as "obsolete".

---------------------------------------------------------

This issue can be confirmed as follows:
Set Juju to log all transactions to its own mongodb:
juju model-config -m controller logging-config="<root>=DEBUG;juju.state.txn=TRACE;juju.mgo=TRACE;juju.mgo.txn=DEBUG"

While the event that issued a "secret-set" is finishing (i.e. running doFlush), then Juju controller will run several queries to its database. Filter the controller, logs, for example with:

$ kubectl logs -n controller-test-k8s-localhost controller-0 -c api-server -f | grep "mgo" | grep -i "document unmarshaled" | grep -i secret

It is possible to see that, at that moment, Juju controller will request an update in revision N-1 with "obsolete"=True.

---------------------------------------------------------

Based on:
https://chat.charmhub.io/charmhub/pl/jddb4wtoepd1pk8f1uguykjyor

Secret labels are not being used in this charm.

---------------------------------------------------------
REPRODUCER

$ git clone https://github.com/canonical/postgresql-k8s-operator/ -b DPE-2568_relation_secrets
$ tox -e database-relation-integration-juju3 -- --keep-models

# Attach the debug-hooks and run a password change action:
$ juju run postgresql-k8s/leader set-password

---------------------------------------------------------

[1] https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/worker/uniter/runner/jujuc/secret-set.go#L74
[2] https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/worker/uniter/runner/context/context.go#L947
[3] https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/worker/uniter/runner/context/secrets.go#L74C1-L75C1
[4] https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/worker/uniter/runner/context/context.go#L1543
[5] https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/worker/uniter/runner/context/context.go#L1589
[6] https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/api/agent/uniter/unit.go#L1029
[7] https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/apiserver/facades/agent/secretsmanager/secrets.go#L325
[8] https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/state/secrets.go#L2148C18-L2148C41

Revision history for this message
Pedro Guimarães (pguimaraes) wrote (last edit ):

To provide some more context:
* We are using app-scoped secrets
* Each secret represents the password that is used by the database and shared between the units
* There is just one model

The key part here is:
https://github.com/juju/juju/blob/3e561add5940a510f785c83076b2bcc6994db103/apiserver/facades/agent/secretsmanager/secrets.go#L704-L725

Removing it and I can see that, within secret-changed event, the "operator-password" respects the:
1) secret.get_content() # returns old value
2) secret.peek_content() # returns new value
3) secret.get_content(refresh=True) # returns new value
4) secret.get_content() # after, refresh, it returns the new value

Here: https://pastebin.ubuntu.com/p/Fbp88TDSSP/
The "operator-password" gets updated.

Also, I have ran the same change (i.e. "set-password" action) multiple times successfully now. That means the charm can also recover the up-to-date secret once secret-changed event is over.

I have proposed a PR with the fix: https://github.com/juju/juju/pull/16316

Revision history for this message
Harry Pidcock (hpidcock) wrote :

We believe this is behaving as it was designed, but I think this highlights some gaps in either the design or documentation.

Our thoughts here is that the consumer of the secret in your case is the owner of the secret. Which since it set the secret is assumed to always want the current secret. Aka read what you wrote.

Once wallyworld is back, we'll have some internal discussions to go over this use case to understand what we can do to improve this workflow.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote (last edit ):

Hi @hpidcock, it is an app-scoped secret, but indeed set by the leader unit. It is set across all replicas of the database.

However, app-scoped should affect all units equal.

The main concern here is to always have the new secret committed before doing any actual changes, which need to happen with the old and new secrets (i.e. at the secret-changed). These changes happen on every unit equally, including the leader unit.

The goal is to avoid the following risk in the same hook:
1) leader generates new secret
2) leader updates database with new password
<ERROR, at some point later: secret is never committed>
3) leader retries the hook, generates a new secret but fails (2) as the DB has a password that now has been forgotten.

Therefore, we need app-scoped secrets to also apply secret-changed to the leader.

I did not test it, but I'd guess the same applies to unit-scoped secrets: the unit may need to run at a point right after secret is updated, with both old and new secrets.

Revision history for this message
Ian Booth (wallyworld) wrote :

The thinking was what Harry said - as the owner of the secret, you want to read what you wrote. But the highlighted use case does seem to show that this thinking might be flawed in some cases.

The patch in the proposed PR does remove the owner special case, but I don't think it's fully sufficient. Extra logic would also be needed in the hook context so that within a hook where a secret is updated, secret-get [--peek] also behaves as expected. And if secret-get --refresh is called in the hook, we'd need to think about how to ensure that behaviour is implemented, since we'd need to potentially add an extra parameter to the flush call when the hook commits to record that the latest revision should be used.

Revision history for this message
Ian Booth (wallyworld) wrote :

Note also that the steps outlined in the description describes (IIANM) a scenario where all the following happen inside a secret changed hook:

get the old value of the secret:
    event.secret.get_content()

get the new value of the secret:
    event.secret.peek_content()

run any activities that need the secrets (e.g. changing a password in the database)

commit back to Juju that revision N-1 is obsolete
    event.secret.get_content(refresh=True)

finish the event: doFlush will be called from the juju agent side

--

The "commit back to juju that rev N-1 is obsolete" step (get with refresh=True)...

this will currently fail with an error for the secret owner since the (current) design is that owners always read the last value written. Also note that being the owner of the secret, the full metadata and latest value is pre-cached in the hook and fetched from that context without a server side call being needed, so the --refresh is irrelevant even without the error being raised.

So some thought and work is needed if we want to have the get behaviour as provided to the secret owner be the same as that of just a "normal" (non-owner) consumer. We also have to consider cases where get is called in the same hook as that which just did any update, as opposed to doing the get in the resulting secret changed hook.
Changing behaviour will/might break existing charms that always expect to owners read the latest value. Maybe that's ok as this is still all relatively new.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote (last edit ):

Hi @wallyworld, thanks for taking a look into this.
My sole point here is who should access the old & new secrets at secret-changed.

All units that receive secret-changed should have access to both old & new secrets, and it should be up to that unit to decide what to do.

I agree that secrets should be handled in a two-staged hooks: (1st hook) to update the secret with Juju; and (2nd - secret-changed hook) to update the services, files, filesystem, etc; that depends on it: do it on the secret-changed. I don't think we should change the actual services/files in other events.

That way, we can safely resolve any situation where updating a secret fails and we need to retry the same event.

I think that is all the PR above does: allows all units to see the secret's old and new values, even if it is the owner of the secret.

What are the next steps here?

Revision history for this message
Ian Booth (wallyworld) wrote :

The patch in the PR is necessary but not sufficient if we want to fully address the issue, since the in-hook behaviour also needs to obey the same semantics. We will discuss the issue and assuming it is agreed we want to proceed, we'll do a patch this week to get into the next 3.1.7 release.

Revision history for this message
Ian Booth (wallyworld) wrote :

To prototype if this is feasible or not, I have a work in progress branch which removes the "always read latest revision" behaviour for secret owners. If a unit creates a unit owned secret, or a unit leader creates an app owned secret, all consumers, whether or not the owner (or a peer unit for app secrets) have the same revision tracking behaviour - you only get the latest revision with --peek or --refresh.

The logic around how labels are managed for secret owners vs secret consumers also needed tweaking. There's no user visible changes, just internal updates to make it work. Essentially, the label for a secret owner is always stored with the secret metadata, not the consumed revision tracking record. The same label is used when notifying the owner of expiry, rotation etc, as well as when requesting the tracked revision value with just the label and no URI. So there were a couple of internal fixes here.

The in-hook behaviour is interesting. Consider the case where a secret is updated in a hook, and then a get is done in that same hook. The same --peek and --refresh behaviour is needed. --peek is easy enough - use the cached/pending value. Without --peek, read the tracked value from the backend. But --refresh is harder as we would need to plumb through the update to the tracked revision when the hook is committed. This is not yet done in my prototype.

However,

can you clarify the failure case you are guarding against. You wrote above

1) leader generates new secret
2) leader updates database with new password
<ERROR, at some point later: secret is never committed>
3) leader retries the hook, generates a new secret but fails (2) as the DB has a password that now has been forgotten.

Even with this proposed change, will we guard against that failure? Say the latest revision is 2. The hook creates a new revision 3 (not committed yet). DB is updated to use revision 3 in the same hook. Hook commit runs but fails. The revision 3 is still "lost" but is use on the DB. Units will still only see the last tracked revision (eg 2). So the issue remains.

And,

To make sure we are on the same page - the reason it was implemented the way it was was so that all peer units of an application were guaranteed to read a consistent (the same) value. With this change, each peer unit now separately tracks what revision it reads. If leadership changes, all of a sudden the new leader may start using a different secret value, unless --refresh is always used; but this is the current behaviour now - --refresh is done implicitly.

So you can see there's a bit to consider as to what the desired behaviour should be.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Hi Ian, I am not proposing to update the DB in the same (random) hook we create the new secret. I really think this should happen in two stages: a hook generates a secret; secret-changed updates any internals of the workload.

My point is exactly that all units need to see the secret's old / new values only at secret-changed for that. Therefore, leader can generate the new secret and, if pushed successfully, all units receive a secret-changed and update their own workloads.

Some scenarios for app-scoped secrets:
1) Leader fails in random-hook:

Leader: (1 random hook) ----- generate secret (rev 2) ----- ERROR ----> random hook is abandoned and retried, no unit receives secret-changed in this case for the rev 2, only for rev 3.

2) Unit loses track of -changed:

Leader: (1 - generate secret) ---> secret-changed, v2 ---> gen. secret ---> secret-changed, v3
Unit: ---> no -changed received, not aware of v2 --> receives v3, DB is using v1: v1 should be accessible and v3 should be available by --peek; --refresh should jump to v3 directly

So, I think the important part here is --refresh should jump all the revisions to the latest value; whereas get_content() should return the newest secret with "obsolete=True" for that said unit.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

For unit-scoped secrets, I guess the logic stays the same. I did not have as much exposure to it as app-scoped secrets.

Revision history for this message
Ian Booth (wallyworld) wrote :

To clarify a point in your comment:

"--refresh should jump all the revisions to the latest value"....

Are you proposing that the leader runs "secret-get --refresh", then that updates the tracked revision for *all* peer units? All peer units track the same revision number, and the leader, via --refresh, can update the tracked revision to the latest for all units? And each unit can use --peek if it wants to see the latest? Does your scenario 2 below contradict that? Or at least indicate a unit owned secret should be used in that case?

With your scenario 1, the current behaviour still supports that - if an error occurs in the hook generating a new rev (rev 2 in your example), no secret changed hook is fired.

With your scenario 2, that can indeed happen if a unit agent is down when the rev 2 changed hook is fired, and then it wakes up before the changed hook for rev 3 is fired, so it sees that rev 3 when it does a secret-get. If you want that unit to still access the v1 it last used, then that means that we do indeed want each peer unit to individually track their own revision; the notion that all peer units see the same revision is then not what we would want. But, if we want each unit to track a secret separately, then that suggests we'd use a unit owned secret, not an app owned one. So, we come back to for app owned secrets, all peer units track the same revision, and only the leader can do a --refresh to update the pointer. The leader is responsible for using the secret changed event to --peek at the latest revision to update the workload. If successful, it does a --refresh to expose that new revision to all other peer units next time they do a secret-get; unit that time, the other peers still see the old revision. Am I interpreting your intent correctly?

Revision history for this message
Ian Booth (wallyworld) wrote :

The other thing to consider if we move away from the current "read what you wrote" approach is coordination between peer units on what the tracked revision is.

Consider the case where there's 3 peer units (1 leader, 2 minions). And we want the minion (non-leader) units to only use the tracked revision as set by the leader unit (the leader unit uses --refresh to update it). The leader creates a new revision, and all 3 peers subsequently get a secret changed event. A minion unit sees the event first. So a secret-get at that point still only gets rev X-1. Then the leader unit gets the changed event and updates the tracked revision. And finally the other minion unit sees the changed event, but now secret-get returns rev X. There's no (Juju modelled) coordination between peers to ensure each peer uses the same tracked revision. This currently would need to be done out of band using (peer) relation data, which is sub optimal because it then tries to mimic what juju does. The current behaviour of each peer unit reads the latest revision doesn't have that problem. But it does suffer from the issue you've highlighted where it's possible to update the workload to use a new secret value and an error prevents that new revision from being stored in the model.

Maybe we could do something like only emit the secret changed events to non-leader peer units after the leader has done a --refresh to update the tracked revision. So:

1 leader creates new revision in some hook
2 only leader gets secret changed hook; non-leaders do not see it yet
3 leader runs secret changed hook to update workload using --peek to get the latest rev to use; if all good, leader runs secret-get --refresh
4 now non-leaders get the secret changed event; for them secret-get now returns the latest revision

Until step 3, in some random hook, non-leaders can still run secret-get --peek if they want to.

All the above is for app owned secrets (unit owned no change to behaviour).

Changed in juju:
status: New → Triaged
Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Hi @wallyworld, so I did some further testing with Juju main branch (not the PR I had proposed before). What I can see is:

juju status: https://pastebin.ubuntu.com/p/mdTTCXM79h/

Using an app-scoped secret. On secret-changed event, I try to get the different revisions from:
1) Leader unit: https://pastebin.ubuntu.com/p/RY7yJbsKGz/
2) Non-leader unit: https://pastebin.ubuntu.com/p/t3t42JsHSJ/

You can see that both return exactly the same values for each of the commands:
* secret-get <ID>
* secret-get <ID> --peek
* secret-get <ID> --refresh
And that is the latest value.

I also did the same test by creating an unit-scoped secret and, from that unit at secret-changed, trying to use each of the commands above.
All of the commands returned the same latest value.

Currently, I cannot see the difference between the commands above.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote (last edit ):

Answering your previous questions. I cannot see how you will provide access to secret revisions N and N-1 in the same hook, without one of the either:
1) to allow users to pick their own revision numbers, then each unit memorizes the latest unit number they saw
2) to have a secretRevision collection entry per secret, per consumer instead of one main entry per secret as today. That will allow different units track different older versions of the secret.

> Are you proposing that the leader runs "secret-get --refresh", then that updates the tracked revision for *all* peer units?

No. Whenever owner does secret-set, then the secret-changed event should be thrown to every unit.

> All peer units track the same revision number, and the leader, via --refresh, can update the tracked revision to the latest for all units?

I don't think this will work. If we want each unit to be able to see "latest-revision" and "last-known-revision" versions at secret-changed, then we must account that "last-known-revision" may differ between units.
So, secret-set by the owner sets a new value and throws secret-changed to all the remaining units. As you said on your next question, each unit recovers that new secret by running: "secret-get <ID> --peek", then each unit can run --refresh, which should only update its own revision track (from this moment onwards, "last-known-revision" == "latest-revision" until next secret update).

Revision history for this message
Pedro Guimarães (pguimaraes) wrote (last edit ):

As it is today, the safest way for me to manage the admin password of databases is:

* To have the leader defining the latest password in an app-scoped secret
* Each unit having a unit-scoped secret with its own currently used admin password

When a new password is set, then the following happens:

1) Leader unit runs the set-password action, generates a new random password and saves it in the app-scoped secret using "secret-set <app-scoped-secret-id>"
2) The app-scoped secret update issues a secret-changed: each unit, including the leader, receives the latest value of the new admin password
3) Each unit loads its own unit-scoped secret, containing the currently working password
4) Each unit unlocks the DB with that password and updates the password of their database
5) Each unit, after confirming the password has been successfully updated, save the new app-scoped password in its own unit-scoped secret

The following secret-remove (app-scoped-secret) is ignored.
The following secret-changed (unit-scoped-secret) should be used to check that unit and app-scoped secrets match (hence, that unit did not lost several updates running before). If they do not match, then run the steps above.
The following secret-remove (unit-scoped-secret) is ignored.

A similar logic would be used in the case of a secret shared across relation to manage the admin password.

Revision history for this message
Ian Booth (wallyworld) wrote :

Given the interesting interactions here, I think it will be easier to discuss f2f in Riga and key stakeholders can all be in the same room. I'll set up a session.

Revision history for this message
Ian Booth (wallyworld) wrote (last edit ):

To summaries the discussion outcome:

- the secret owner will be treated just like any other consumer - we will remove the read what you just wrote behaviour
- all peer units of an application will individually track their own secret revision
- the guidance is that a new secret revision is created in a particular hook; the new secret revision is not used to update the workload until the secret changed hook arrives
- in the secret changed hook, the usual peek and refresh semantics apply

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Hi @wallyworld, I am not sure, but maybe your change will also affect: https://bugs.launchpad.net/juju/+bug/2042594

Revision history for this message
Ian Booth (wallyworld) wrote :

Yes, the above should be fixed with the fix for this bug.

Changed in juju:
milestone: none → 3.1.7
assignee: nobody → Ian Booth (wallyworld)
importance: Undecided → High
status: Triaged → In Progress
Revision history for this message
Ian Booth (wallyworld) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
tags: added: canonical-data-platform-eng
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.