Fernet token management should be done outside of juju

Bug #1849519 reported by Liam Young
This bug affects 6 people
Affects: OpenStack Keystone Charm
Status: Fix Released
Importance: High
Assigned to: Alex Kavanagh
Milestone: 21.01

Bug Description

Fernet keys are rotated and distributed to all keystone units by the keystone leader using the leader-settings mechanism. If there is an issue with juju then the leader may perform the rotation but fail to distribute the keys. Over time this will cause the non-leaders to reject the leader's tokens and vice versa.

I think it makes sense to move the key distribution to a mechanism that does not involve juju.
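
For reference, the leader-settings flow is roughly of this shape (a simplified sketch; the helper names and key handling here are illustrative, not the charm's actual code):

```python
# Hedged sketch: how fernet keys might be shared via Juju leader settings.
# Assumes charmhelpers is available; the function names are illustrative.
import json
import os

from charmhelpers.core.hookenv import is_leader, leader_get, leader_set

KEY_REPOSITORY = '/etc/keystone/fernet-keys'


def publish_keys():
    """On the leader: read the key repository and push it to leader settings."""
    keys = {}
    for name in os.listdir(KEY_REPOSITORY):
        with open(os.path.join(KEY_REPOSITORY, name)) as f:
            keys[name] = f.read()
    leader_set({'fernet-keys': json.dumps(keys)})


def write_keys():
    """On every unit: materialise whatever the leader last published."""
    blob = leader_get('fernet-keys')
    if not blob:
        return
    for name, contents in json.loads(blob).items():
        with open(os.path.join(KEY_REPOSITORY, name), 'w') as f:
            f.write(contents)


if __name__ == '__main__':
    if is_leader():
        publish_keys()
    write_keys()
```

The weak point is the leader_set/leader_get round trip: if juju cannot deliver the updated leader settings, the non-leaders keep serving stale keys.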

Tags: seg
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Just adding some thoughts that we generated during our call:

1. The fernet key rotation has to take place on only ONE unit; currently this is the Juju leader, but if juju is taken out of the picture, then determining which unit should do the rotation might be more complicated. [1]
2. An approach that was discussed is to use rsync over ssh between the 'leader' and the other units (a rough sketch follows this list).
3. juju is still (obviously) used to configure the jobs that do the fernet key rotation.

[1] - I think we can use juju to 'set', via the config, which unit is the leader, and the script just dumbly follows this. If the leader changes, then the config will be re-written, etc.; i.e. no additional method of determining the leader is needed.
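
A rough sketch of the rsync-over-ssh idea from point 2 (the peer hostnames, ssh user and key path are all placeholders, nothing the charm currently provides):

```python
# Hedged sketch of point 2: push the key repository from the designated
# 'leader' to the other keystone units with rsync over ssh.
import subprocess

KEY_REPOSITORY = '/etc/keystone/fernet-keys/'
PEERS = ['keystone-1.internal', 'keystone-2.internal']  # placeholder hostnames
SSH_OPTS = 'ssh -i /etc/keystone/fernet-sync-key -o StrictHostKeyChecking=yes'


def sync_keys_to_peers():
    for peer in PEERS:
        subprocess.check_call([
            'rsync', '-a', '--delete', '-e', SSH_OPTS,
            KEY_REPOSITORY,
            'keystone@{}:{}'.format(peer, KEY_REPOSITORY),
        ])


if __name__ == '__main__':
    sync_keys_to_peers()
```

Per point 1, only the unit nominated (e.g. via config) as the 'leader' would run this.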

Changed in charm-keystone:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Angelos Kolaitis (aggkolaitis) wrote :

We actually had issues on our production OpenStack (~ 1800 units) because of Fernet keys going out of sync between the keystone leader and the rest of the nodes. (we have 3 clustered keystone units running on lxd containers)

Initially, we observed that "Unable to retrieve images/users/projects/etc." messages were showing on Horizon for all users, and users were being logged out a short while afterwards. Furthermore, the openstack CLI worked only about 50% of the time.

Upon inspection on the keystone nodes, we found a huge number of logs like this:

```
$ sudo journalctl -f
...
Sep 12 09:53:51 juju-980f92-92-lxd-31 (keystone.common.wsgi)[1619]: 2019-09-12 09:53:51,443 WARNING This is not a recognized Fernet token gAAAAA[REDACTED]brKCLbbEvFAbZnE_44rxzBwBlYwk=
...
Sep 12 10:07:28 juju-980f92-92-lxd-31 (keystone.common.wsgi)[1620]: 2019-09-12 10:07:28,632 WARNING Authorization failed. The request you have made requires authentication. from REDACTED_IP
Sep 12 10:07:28 juju-980f92-92-lxd-31 (keystone.common.wsgi)[1618]: 2019-09-12 10:07:28,871 WARNING Authorization failed. The request you have made requires authentication. from REDACTED_IP
```

As it turned out, the fernet keys between our three keystone hosts had been out of sync for an extended period of time. Manually performing the key synchronization did the trick and restored the service.

However, the issue occurred again a few days later, and this time we had to take more serious action:

The second time, the cron job existed on a non-leader keystone node. Furthermore, simply running the cron job did not help, as the Juju agent had crashed. We had to manually restart the Juju agent for keystone on the machine in order for the sync to resume properly.

We have not had any further issues since then, but I believe this issue is important and should be treated as such. The main problem was that there was no indication of the issue, and we had to search through the system logs in order to identify it.

Finally, we have found that under heavy load Juju may have trouble communicating with the agents, and there is no clear-cut way to know this, except manually connecting to the machines and looking at the logs. This means that if the sync happens during heavy Juju load, the leader may fail to properly send the keys to the rest of the nodes.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Setting to high due to report from Aggelos in #2

Changed in charm-keystone:
importance: Wishlist → High
Revision history for this message
Adam Dyess (addyess) wrote :

I encountered an issue where the fernet tokens wouldn't sync because I never saw the leader-settings-changed event fire.

I manually sync'd the files and keystone functionality was immediately restored.

Revision history for this message
Andrea Ieri (aieri) wrote :

subscribed field-high. The workaround Adam mentioned had already been applied to this cloud, which hints at it not being permanent.

Revision history for this message
Xav Paice (xavpaice) wrote :

Regardless of whether the scripts are managed in a Juju environment or not, the rotate script runs via cron every few minutes. It checks the age of the keys and rotates them if needed (roughly the check sketched below). Unfortunately, on the cloud Adam and I are looking at, the check doesn't trigger a rotation, so the keys are not rotated at all and therefore are never synced.

The workaround mentioned only lasts as long as the keys - so if the keys expire after 24h, the workaround needs to be repeated every 24h to maintain service, otherwise we end up with intermittent Keystone failures.
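
For illustration, the age check described above is roughly of this shape (the rotation interval, paths and keystone-manage invocation are assumptions, not the charm's exact cron job):

```python
# Hedged sketch of a cron-driven rotation check: rotate only when the most
# recently written key is older than the rotation interval.
import os
import subprocess
import time

KEY_REPOSITORY = '/etc/keystone/fernet-keys'
ROTATION_INTERVAL = 24 * 3600  # placeholder: rotate daily


def newest_key_age():
    """Age in seconds of the most recently written key file."""
    mtimes = [
        os.path.getmtime(os.path.join(KEY_REPOSITORY, name))
        for name in os.listdir(KEY_REPOSITORY)
    ]
    return time.time() - max(mtimes)


def maybe_rotate():
    if newest_key_age() >= ROTATION_INTERVAL:
        subprocess.check_call([
            'keystone-manage', 'fernet_rotate',
            '--keystone-user', 'keystone',
            '--keystone-group', 'keystone',
        ])


if __name__ == '__main__':
    maybe_rotate()
```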

Revision history for this message
Adam Dyess (addyess) wrote :

Here's a workaround Python script, installed as a cron job on the Juju host.
Run it as the juju user and it will keep the fernet tokens in sync until this bug is fixed:

https://paste.ubuntu.com/p/Tfbvrsrw3y/
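
(The paste above is the actual workaround; purely as illustration of the idea, here is a much-simplified sketch that tars up the leader's key repository and unpacks it on the other units. Unit names and staging paths are placeholders.)

```python
# Hypothetical, simplified sketch of a cron-driven sync run from the Juju
# client host. The real workaround is the script in the paste linked above.
import subprocess

LEADER = 'keystone/0'                      # placeholder leader unit
FOLLOWERS = ['keystone/1', 'keystone/2']   # placeholder follower units
STAGING_TARBALL = '/tmp/fernet-keys.tgz'   # staging path on the client host


def sync():
    # Read the key repository on the leader as root and capture it locally
    # (the repository is not readable by the default ssh user).
    with open(STAGING_TARBALL, 'wb') as out:
        subprocess.check_call(
            ['juju', 'ssh', LEADER,
             'sudo tar -czf - -C /etc/keystone fernet-keys'],
            stdout=out)
    for unit in FOLLOWERS:
        # Push the tarball to each follower and unpack it into place as root.
        subprocess.check_call(
            ['juju', 'scp', STAGING_TARBALL,
             '{}:/tmp/fernet-keys.tgz'.format(unit)])
        subprocess.check_call(
            ['juju', 'ssh', unit,
             'sudo tar -xzf /tmp/fernet-keys.tgz -C /etc/keystone'])


if __name__ == '__main__':
    sync()
```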

Changed in charm-keystone:
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
Adam Dyess (addyess) wrote :

As an update, the work-around above from 2020-10-28 is holding; however, it leaves artifacts on the non-leader keystone units. These artifacts are:

 $ juju run -a keystone ls /etc/keystone/fernet-keys
- Stdout: |
    0
    20
    21
  UnitId: keystone/0
- Stdout: |
    0
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    9
  UnitId: keystone/1
- Stdout: |
    0
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    9
  UnitId: keystone/2

So, I'm still checking in on these units. :/ Maybe the work-around can be improved to remove the old tokens before expanding the tar file into the non-leader units.
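
That improvement could look something like this on a non-leader unit (paths are illustrative; this is not the current workaround script):

```python
# Sketch of the suggested improvement: clear out stale key files before
# unpacking the leader's tarball, so old key indices (9-19 in the listing
# above) don't linger on non-leader units.
import os
import tarfile

KEY_REPOSITORY = '/etc/keystone/fernet-keys'
KEY_TARBALL = '/tmp/fernet-keys.tgz'  # placeholder for the synced archive


def replace_keys():
    # Remove every existing key file first...
    for name in os.listdir(KEY_REPOSITORY):
        os.remove(os.path.join(KEY_REPOSITORY, name))
    # ...then expand the leader's current set in its place.
    with tarfile.open(KEY_TARBALL) as tar:
        tar.extractall(KEY_REPOSITORY)


if __name__ == '__main__':
    replace_keys()
```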

Felipe Reyes (freyes)
tags: added: seg
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-keystone (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/762645

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I've done a patch (see https://bugs.launchpad.net/charm-keystone/+bug/1849519/comments/9) which makes the key rotation and sync across units more robust, particularly when it comes to recovering from sync problems. E.g. if any of the hooks fail but the units are then recovered, the units should re-sync within 5 minutes and all end up with the same set of keys.

I hope that this resolves the field-high designation of this bug, and perhaps it can be reduced to a field-medium?

However, fully resolving this issue requires completely decoupling the distribution of the keys from juju and handing it to another, secure, distribution system. This is a bigger piece of work, but one candidate is to use vault to store the current set of keys and ensure that every keystone unit (via a cron job) always has those keys synced. I think this will need to be a roadmap item.
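
As a sketch of what that roadmap item might look like, each keystone unit could pull the current key set from vault on a cron schedule. The hvac usage below assumes a hypothetical secret path, mount point and token; none of this exists in the charm today:

```python
# Hedged sketch of the vault idea: every unit fetches the current fernet key
# set from a kv-v2 secret and writes it into the key repository.
import os

import hvac

KEY_REPOSITORY = '/etc/keystone/fernet-keys'


def sync_keys_from_vault():
    client = hvac.Client(url='https://vault.internal:8200',
                         token=os.environ['VAULT_TOKEN'])
    secret = client.secrets.kv.v2.read_secret_version(
        mount_point='charm-keystone',   # placeholder mount point
        path='fernet-keys',             # placeholder secret path
    )
    for name, contents in secret['data']['data'].items():
        with open(os.path.join(KEY_REPOSITORY, name), 'w') as f:
            f.write(contents)


if __name__ == '__main__':
    sync_keys_from_vault()
```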

Changed in charm-keystone:
assignee: Alex Kavanagh (ajkavanagh) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-keystone (master)

Reviewed: https://review.opendev.org/762645
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=c7e34558c498497f81923feeff5a64cf96b192d9
Submitter: Zuul
Branch: master

commit c7e34558c498497f81923feeff5a64cf96b192d9
Author: Alex Kavanagh <email address hidden>
Date: Fri Nov 13 11:31:50 2020 +0000

    Make Fernet key distribution more robust

    The related bug indicated that the Fernet keys could get out of sync
    between the leader and non-leader units. This patchset assumes that
    hooks fail, or that units are off-line when the rotation occurs. Thus
    it tries hard to ensure that the keys are in sync. It still uses juju
    to 'send' the keys from the leader to the subordinate units, so in that
    sense, it is not a fix to the related bug, but it does make it more
    robust.

    Change-Id: Id40a3ccbe565bd742e3fdbd5190deb6b21204a82
    Related-Bug: #1849519

Changed in charm-keystone:
status: Triaged → Fix Committed
assignee: nobody → Alex Kavanagh (ajkavanagh)
Changed in charm-keystone:
milestone: none → 21.01
David Ames (thedac)
Changed in charm-keystone:
status: Fix Committed → Fix Released