Deployments with high churn are susceptible to false positives with token validation

Bug #1816927 reported by Lance Bragstad
This bug affects 1 person
Affects                        Status        Importance  Assigned to
OpenStack Identity (keystone)  Fix Released  Low         Pavlo Shchelokovskyy
OpenStack-Ansible              Fix Released  Undecided   Unassigned

Bug Description

The implementation for fernet tokens relies on symmetric encryption. This underpinning requires that each keystone API node "share" the same key repository, specifically in deployments where keystone servers need to validate tokens issued by one another (e.g., a cluster of keystone servers behind an HA proxy).

Without getting into too much detail, each key repository consists of a set of files on disk. The name of each file is crucial because it denotes the type of key it holds (documented extensively in [0]). Each file name is an integer. The file with the highest index is the primary key, which is used to encrypt new tokens. The file with the lowest index, 0, is the staged key, and it is always promoted to primary key on the next rotation. Every other key in the repository is a secondary key and is only used to decrypt tokens. Each key on disk goes through a lifecycle: it starts as a staged key, is promoted to primary key, and is eventually demoted to a secondary key.

Note that keystone does *not* handle key distribution between API servers. We recommend this be done using configuration management. The documentation suggests rsync as one possible utility to keep key repositories in sync.
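
To make the naming scheme concrete, here is a minimal Python sketch that derives each key's role purely from its file name and shows how the set of indices changes on rotation. The function names are hypothetical and this is not keystone's code; it also ignores the pruning of old secondary keys that a real repository performs.

```python
def classify_keys(file_names):
    """Map each key index to its role, based only on the integer file name."""
    indices = sorted(int(name) for name in file_names)
    primary = indices[-1]  # highest index encrypts new tokens
    roles = {}
    for index in indices:
        if index == 0:
            roles[index] = "staged"      # promoted on the next rotation
        elif index == primary:
            roles[index] = "primary"
        else:
            roles[index] = "secondary"   # decrypt only
    return roles


def rotate(indices):
    """Return the indices present after one rotation.

    The staged key (0) is renamed to max + 1 and becomes the new primary
    key, and a fresh staged key is created as 0.
    """
    new_primary = max(indices) + 1
    return sorted(set(indices) | {new_primary})


roles_before = classify_keys(["0", "1", "2"])
indices_after = rotate([0, 1, 2])
roles_after = classify_keys(str(i) for i in indices_after)
```

Note that after rotation the old primary key (2 in this example) is still present and becomes a secondary key, while the key contents that used to live in file 0 now live in file 3.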

I'm opening this bug because it was brought to our attention that, in deployments with high churn or high token load across a cluster of keystone nodes, keystone servers may respond with a 401 Invalid Fernet token.

The issue is that in the process of key rotation, the staged key is promoted to be the primary key. As soon as this happens, any subsequent requests to create tokens will use the new primary key to encrypt the token. It is assumed all other API servers have a copy of this key, because it's the staged key and also valid as a secondary key. A token encrypted with the new primary key should be validatable on other API servers if they have a copy of the staged key, which has the same key contents as the new primary key on the API server that initiated the key rotation. The rsync implementation deletes the contents of the key repository and rebuilds it, alphanumerically. This results in the staged key always being written by rsync first, because its file name is 0. The primary key is always written last, because its file name is the highest index in the key repository.

The failure requires a specific sequence of events:

- a token is created after key rotation, but before key distribution
- key distribution is invoked using a mechanism like rsync
- token validation is performed on the API server getting its key repository built by rsync
- the token is validated before the new primary key is written to the key repository by rsync, and fails validation because the key repository doesn't contain the key used to encrypt the token

A subsequent request to validate the token should succeed if rsync completes successfully.
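
The window described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the actual rsync or keystone behavior: key contents are represented by made-up labels ("A" to "D"), a repository is a dict of index to label, and the hypothetical `simulate_sync` function checks after every file write whether a token encrypted with a given key could be decrypted on the target node.

```python
def simulate_sync(source, target, write_order, token_key):
    """Copy key files one at a time and record whether the token validates.

    Returns a list of (step, can_validate) pairs: one entry for the state
    before the sync starts, then one entry after each file is written.
    """
    target = dict(target)
    checks = [("before sync", token_key in target.values())]
    for index in write_order:
        target[index] = source[index]
        checks.append((index, token_key in target.values()))
    return checks


# Source node after rotation: old staged key "C" is now primary (index 3),
# and "D" is the freshly created staged key.
source = {0: "D", 1: "A", 2: "B", 3: "C"}
# Target node before distribution: "C" is still its staged key.
stale_target = {0: "C", 1: "A", 2: "B"}

# Repository wiped, then rebuilt in ascending file-name order.
wiped_ascending = simulate_sync(source, {}, sorted(source), "C")

# Files overwritten in place, highest index first, nothing deleted up front.
in_place_descending = simulate_sync(
    source, stale_target, sorted(source, reverse=True), "C"
)
```

In the first scenario the token encrypted with "C" cannot be validated until the last file (index 3) is written. In the second scenario "C" is present on the target at every step, first as the old staged key and then as the new primary key.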

pas-ha brought this to the #openstack-keystone channel as an issue that was affecting an internal CI/CD deployment that has a lot of churn [1].

[0] https://docs.openstack.org/keystone/latest/admin/fernet-token-faq.html#what-are-the-different-types-of-keys
[1] http://eavesdrop.openstack.org/irclogs/%23openstack-keystone/%23openstack-keystone.2019-02-20.log.html#t2019-02-20T20:11:12

tags: added: fernet
summary: - Deployments with high churn as susceptible to false positives with token
- validation
+ Deployments with high churn are susceptible to false positives with
+ token validation
description: updated
Revision history for this message
Lance Bragstad (lbragstad) wrote :

Cloudnull and I discussed possible solutions in IRC [0]. There is a patch up, but it likely needs to get verified against the OSA project directly [1].

[0] http://eavesdrop.openstack.org/irclogs/%23openstack-ansible/%23openstack-ansible.2019-02-21.log.html#t2019-02-21T02:03:38
[1] https://review.openstack.org/#/c/638327/

Revision history for this message
Colleen Murphy (krinkle) wrote :

> The rsync implementation deletes the contents of the key repository and rebuilds it

rsync supports an incredibly extensive number of behavioral options, so I don't think it's fair to say rsync in general is to blame for this, but instead that we need to find the right combination of flags to recommend setting so that the primary key isn't deleted too early.

Revision history for this message
Lance Bragstad (lbragstad) wrote :

Right, I should have clarified the rsync command being used in those details [0].

Digging through the rsync man page turns up several solutions, one of which already made its way into review as a comment (using --exclude 0 and --delete-after).

[0] http://git.openstack.org/cgit/openstack/openstack-ansible-os_keystone/tree/tasks/keystone_fernet_keys_distribute.yml?id=b40f4b5e1aa2f26384a35424c27cb0a363538ebf#n20

Revision history for this message
Lance Bragstad (lbragstad) wrote :

Verifying this for keystone since we don't go into much detail about the ordering of keys during a transfer between hosts.

That would be a good addition to our documentation.

Changed in keystone:
status: New → Triaged
importance: Undecided → Low
tags: added: docu
tags: added: documentation
removed: docu
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible-os_keystone (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/639234

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible-os_keystone (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/639235

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible-os_keystone (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/639236

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible-os_keystone (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/639237

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible-os_keystone (master)

Reviewed: https://review.openstack.org/638327
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-os_keystone/commit/?id=28a0c5abbf654ff8b625edc0c12af50a3def2429
Submitter: Zuul
Branch: master

commit 28a0c5abbf654ff8b625edc0c12af50a3def2429
Author: Kevin Carter <email address hidden>
Date: Wed Feb 20 21:43:35 2019 -0600

    Correct fernet token sync race condition

    The fernet token rotation is subject to a race condition when using
    aggressive rotation in a high volume, high traffic, high capacity cloud.
    This change addresses the potential race condition by converting our
    fernet token sync method from rsync to scp and by sorting the fernet
    keys in reverse version ordering. This will ensure that the key with
    the highest index is always synchronized first and will ensure that
    the underlying file structure of a given target node always remains
    intact during a sync operation.

    Related-Bug: 1816927
    Change-Id: I9087d953f7dabe04a2ad19af6121dae71544e5b2
    Signed-off-by: Kevin Carter <email address hidden>
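
As an illustration of the "reverse version ordering" mentioned in the commit message, the following Python sketch orders key file names so that the highest index comes first. It is not the code from the OpenStack-Ansible role; `distribution_order` is a hypothetical helper, and it assumes every file name in the repository is a plain integer.

```python
def distribution_order(file_names):
    """Return key file names with the highest index (the primary key) first."""
    return sorted(file_names, key=int, reverse=True)


names = ["0", "1", "2", "9", "10"]

numeric_order = distribution_order(names)
# A plain reverse string sort compares character by character and
# misplaces "10" once the repository holds a two-digit index.
lexical_order = sorted(names, reverse=True)
```

Sorting by integer value matters because a purely alphanumeric sort would put "9" ahead of "10", so the primary key would no longer be the first file copied.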

Changed in openstack-ansible:
status: New → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible-os_keystone (stable/queens)

Reviewed: https://review.openstack.org/639235
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-os_keystone/commit/?id=e476286de2b5a3c402b00ad1550e579b7a5c6882
Submitter: Zuul
Branch: stable/queens

commit e476286de2b5a3c402b00ad1550e579b7a5c6882
Author: Kevin Carter <email address hidden>
Date: Wed Feb 20 21:43:35 2019 -0600

    Correct fernet token sync race condition

    The fernet token rotation is subject to a race condition when using
    aggressive rotation in a high volume, high traffic, high capacity cloud.
    This change addresses the potential race condition by converting our
    fernet token sync method from rsync to scp and by sorting the fernet
    keys in reverse version ordering. This will ensure that the key with
    the highest index is always synchronized first and will ensure that
    the underlying file structure of a given target node always remains
    intact during a sync operation.

    Related-Bug: 1816927
    Change-Id: I9087d953f7dabe04a2ad19af6121dae71544e5b2
    Signed-off-by: Kevin Carter <email address hidden>
    (cherry picked from commit 28a0c5abbf654ff8b625edc0c12af50a3def2429)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible-os_keystone (stable/ocata)

Reviewed: https://review.openstack.org/639237
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-os_keystone/commit/?id=f0b11d7a87ad86f59de973a16e7732ea12700edf
Submitter: Zuul
Branch: stable/ocata

commit f0b11d7a87ad86f59de973a16e7732ea12700edf
Author: Kevin Carter <email address hidden>
Date: Wed Feb 20 21:43:35 2019 -0600

    Correct fernet token sync race condition

    The fernet token rotation is subject to a race condition when using
    aggressive rotation in a high volume, high traffic, high capacity cloud.
    This change addresses the potential race condition by converting our
    fernet token sync method from rsync to scp and by sorting the fernet
    keys in reverse version ordering. This will ensure that the key with
    the highest index is always synchronized first and will ensure that
    the underlying file structure of a given target node always remains
    intact during a sync operation.

    Related-Bug: 1816927
    Change-Id: I9087d953f7dabe04a2ad19af6121dae71544e5b2
    Signed-off-by: Kevin Carter <email address hidden>
    (cherry picked from commit 28a0c5abbf654ff8b625edc0c12af50a3def2429)

tags: added: in-stable-ocata
tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible-os_keystone (stable/pike)

Reviewed: https://review.openstack.org/639236
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-os_keystone/commit/?id=5d47236c891dafd786552a91520dd0bc95d5d4a2
Submitter: Zuul
Branch: stable/pike

commit 5d47236c891dafd786552a91520dd0bc95d5d4a2
Author: Kevin Carter <email address hidden>
Date: Wed Feb 20 21:43:35 2019 -0600

    Correct fernet token sync race condition

    The fernet token rotation is subject to a race condition when using
    aggressive rotation in a high volume, high traffic, high capacity cloud.
    This change addresses the potential race condition by converting our
    fernet token sync method from rsync to scp and by sorting the fernet
    keys in reverse version ordering. This will ensure that the key with
    the highest index is always synchronized first and will ensure that
    the underlying file structure of a given target node always remains
    intact during a sync operation.

    Related-Bug: 1816927
    Change-Id: I9087d953f7dabe04a2ad19af6121dae71544e5b2
    Signed-off-by: Kevin Carter <email address hidden>
    (cherry picked from commit 28a0c5abbf654ff8b625edc0c12af50a3def2429)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible-os_keystone (stable/rocky)

Reviewed: https://review.openstack.org/639234
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-os_keystone/commit/?id=3a3f062cb66a4d7b046d9fd49842cdbd6fdcd738
Submitter: Zuul
Branch: stable/rocky

commit 3a3f062cb66a4d7b046d9fd49842cdbd6fdcd738
Author: Kevin Carter <email address hidden>
Date: Wed Feb 20 21:43:35 2019 -0600

    Correct fernet token sync race condition

    The fernet token rotation is subject to a race condition when using
    aggressive rotation in a high volume, high traffic, high capacity cloud.
    This change addresses the potential race condition by converting our
    fernet token sync method from rsync to scp and by sorting the fernet
    keys in reverse version ordering. This will ensure that the key with
    the highest index is always synchronized first and will ensure that
    the underlying file structure of a given target node always remains
    intact during a sync operation.

    Related-Bug: 1816927
    Change-Id: I9087d953f7dabe04a2ad19af6121dae71544e5b2
    Signed-off-by: Kevin Carter <email address hidden>
    (cherry picked from commit 28a0c5abbf654ff8b625edc0c12af50a3def2429)

Changed in keystone:
assignee: nobody → Pavlo Shchelokovskyy (pshchelo)
status: Triaged → In Progress
Colleen Murphy (krinkle)
Changed in keystone:
milestone: none → stein-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to keystone (master)

Reviewed: https://review.openstack.org/638397
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=261eeaa19bb4c9e9ea89fac28e473fa44c4a55de
Submitter: Zuul
Branch: master

commit 261eeaa19bb4c9e9ea89fac28e473fa44c4a55de
Author: Pavlo Shchelokovskyy <email address hidden>
Date: Thu Feb 21 13:06:10 2019 +0200

    Add hint for order of keys during distribution

    If the new primary key is not the first to be distributed after fernet
    key rotation, there may be a small time window during the key
    distribution when tokens issued by the node where fernet rotation was
    performed can not be validated on the node where keys are being
    distributed to.

    Change-Id: I34b5cadd12815ee95c71d8c163504390a9e5e343
    Closes-Bug: #1816927

Changed in keystone:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/keystone 15.0.0.0rc1

This issue was fixed in the openstack/keystone 15.0.0.0rc1 release candidate.

Changed in openstack-ansible:
status: Fix Committed → Fix Released