Fernet keys need to be kept active until allow_expired_window, not token_expiration, to allow long-running jobs and service tokens to work correctly

Bug #1987466 reported by Trent Lloyd
Affects: OpenStack Keystone Charm
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Fernet tokens need to be decryptable until the end of "allow_expired_window", not just "token_expiration".

The reason is validation of long-running jobs. A long-running job such as a cinder retype, which live-migrates a volume from one storage backend to another (e.g. from an HDD Ceph pool to a flash Ceph pool), can take many hours, or in a recent case over a week, to complete.

When the migration completes, the user's original token is used to make the API calls that "finalise" the migration and persist it to the database, as described in Bug #1986886.

For this to work, we need to be able to validate the token even after it has passed the default 1-hour token_expiration window. OpenStack allows for this by having the service send its own "service token", which is currently valid and non-expired, along with the user's otherwise-valid but expired token. As long as the user's token is within "allow_expired_window" (default: 2 days) *and* is accompanied by a valid, non-expired service token, the operation is still validated.
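
A minimal sketch of this flow against keystone's v3 token validation API, which accepts an allow_expired query parameter for exactly this case (the endpoint URL here is hypothetical, and in a real deployment keystonemiddleware performs this call on the service's behalf):

    import requests

    KEYSTONE_URL = "https://keystone.example.com:5000"  # hypothetical endpoint

    def validate_with_service_token(service_token, user_token):
        """Validate a user token that is past token_expiration."""
        # allow_expired=1 asks keystone to accept an expired subject token,
        # provided it is still inside allow_expired_window.
        resp = requests.get(
            KEYSTONE_URL + "/v3/auth/tokens",
            params={"allow_expired": 1},
            headers={
                "X-Auth-Token": service_token,   # the service's fresh token
                "X-Subject-Token": user_token,   # the user's expired token
            },
        )
        # 200 -> validated; anything else -> cannot be validated (for
        # example, the Fernet key that encrypted the token is gone).
        return resp.status_code == 200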

When tokens were persisted in the database this worked fine; with Fernet, however, tokens are encrypted with keys that are rotated and eventually deleted. Currently the charm, as per the original spec (https://specs.openstack.org/openstack/charm-specs/specs/rocky/implemented/keystone-fernet-tokens.html), automatically calculates how long those keys are kept based on the token_expiration window.

This means that after roughly 3 hours tokens can no longer be decrypted, and thus cannot be validated, even if they are still inside the allow_expired_window.
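
A back-of-the-envelope illustration of that gap, assuming keystone's documented rotation/max_active_keys relationship and that a key survives roughly max_active_keys rotations before deletion (constants are documentation defaults, not values read from a deployment):

    TOKEN_EXPIRATION = 3600            # keystone default: 1 hour
    ALLOW_EXPIRED_WINDOW = 2 * 86400   # keystone default: 2 days
    MAX_ACTIVE_KEYS = 3                # charm default (fernet-max-active-keys)

    # Today the charm derives the rotation interval from token_expiration
    # (keystone docs: max_active_keys = token_expiration/rotation + 2).
    rotation = TOKEN_EXPIRATION // (MAX_ACTIVE_KEYS - 2)        # 3600s
    # A key survives roughly max_active_keys rotations before deletion.
    key_lifetime = MAX_ACTIVE_KEYS * rotation                   # ~3 hours

    # What the expired-token flow actually needs keys to cover:
    needed = TOKEN_EXPIRATION + ALLOW_EXPIRED_WINDOW            # ~2 days
    print(key_lifetime >= needed)   # False: ~3h of key life vs ~49h needed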

We need to change this calculation to depend on allow_expired_window instead, and also expose allow_expired_window as a configurable option, since even the default 2 days would have left this 7-day migration of a 10TB volume failing. It is a particularly bad failure case: the database is left pointing at the old volume, which is not deleted, so if the VM is stopped and started again the data rolls back days, weeks or even months to the state it had when the migration finished but failed to be persisted, as per Bug #1986886.
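
A minimal sketch of the proposed calculation, under the same key-lifetime assumptions as above; treating allow_expired_window as a charm config option is the proposal here, not something the charm exposes today:

    def fernet_rotation_interval(token_expiration, allow_expired_window,
                                 max_active_keys):
        """Pick a rotation interval so every key outlives the full window
        in which an expired token may still be validated."""
        # Same max_active_keys relationship as above, but driven by
        # allow_expired_window rather than token_expiration alone.
        lifetime_needed = token_expiration + allow_expired_window
        return lifetime_needed // (max_active_keys - 2)

    # Defaults: (3600 + 172800) // (3 - 2) -> rotate roughly every 2 days.
    # Making allow_expired_window configurable then lets operators raise it
    # (e.g. to 14 days) to cover week-long retype migrations.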

Changed in charm-keystone:
status: New → Triaged
importance: Undecided → High