Keystone is restarting due to stale primary key

Bug #1895723 reported by Tomas Stodulka
32
This bug affects 7 people
Affects Status Importance Assigned to Milestone
OpenStack Identity (keystone)
Invalid
Undecided
Unassigned
kolla-ansible
Fix Released
High
Mark Goddard
Train
Fix Released
High
Radosław Piliszek
Ussuri
Fix Released
High
Radosław Piliszek
Victoria
Fix Released
High
Mark Goddard

Bug Description

After restart of keystone's container, it keeps restarting. I found only this error in docker logs keystone:
Running command: '/usr/bin/keystone-startup.sh -DFOREGROUND'
+ exec /usr/bin/keystone-startup.sh -DFOREGROUND
+ set -o errexit
+ set -o pipefail
+ TOKEN_DIR=/etc/keystone/fernet-keys
+ n=0
+ '[' '!' -f /etc/keystone/fernet-keys/0 ']'
++ ls -1 /etc/keystone/fernet-keys
++ sort -hr
++ head -n 1
+ TOKEN_PRIMARY=5
++ date +%s
++ date +%s -r /etc/keystone/fernet-keys/5
+ TOKEN_AGE=589164
+ '[' 589164 -gt 86400 ']'
+ echo 'ERROR: Primary token 5 is stale.'
+ exit 1

Workaround is change expiration from 86400 to 864000 in /etc/kolla/keystone/keystone-startup.sh:

# Compare if it's older than fernet_token_expiry and run key rotation if needed
if [ "${TOKEN_AGE}" -gt "864000" ]; then
    echo "ERROR: Primary token ${TOKEN_PRIMARY} is stale."
    exit 1
fi

Regarding the comment in code, It should also run rotation of primary key. But this part is missing, it only throws an exception as mentioned. Or I would like to ask, why the primary key wasn't rotated automatically when it was needed.

I am using 2 weeks old deployment of Ussuri, deployd by kolla-ansible on CentOS8.

Revision history for this message
Tomas Stodulka (xstodu07) wrote :
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I believe it was fixed already but let Michał see.

Changed in keystone:
status: New → Invalid
Changed in kolla-ansible:
assignee: nobody → Michal Nasiadka (mnasiadka)
importance: Undecided → High
Revision history for this message
Tomas Stodulka (xstodu07) wrote :

Ok, thank you. :)

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Well, I would first like to understand, how can we end up with stale token with crontab set for every third day with default retention period.

Tomas - can you share /etc/kolla/keystone-fernet/crontab?

Was it only restart of keystone, or did you shut it down for like a week?

Revision history for this message
Tomas Stodulka (xstodu07) wrote :

Sure, here it is:
# cat /etc/kolla/keystone-fernet/crontab
PATH=/var/lib/kolla/venv/bin:/usr/bin:/bin

0 0 * * 0 /usr/bin/fernet-rotate.sh
0 0 * * 3 /usr/bin/fernet-rotate.sh

It was only restart of keystone container, when I wanted to apply changes in ../kolla/keystone/keystone.conf (corn.allowed_origin), but it had nothing to do with the problem, I also revert the change back.

Revision history for this message
Tomas Stodulka (xstodu07) wrote :

And in Kibana I havn't had any errors or warnings, regarding this issue for last 2 weeks (only some warnings due to gnocchi on controller).

Revision history for this message
Tomas Stodulka (xstodu07) wrote :

Now I found strange thing. If crontab executing fernet-rotate.sh, script content is follow:
#!/bin/bash
set -o errexit
set -o pipefail keystone-manage --config-file /etc/keystone/keystone.conf fernet_rotate --keystone-user keystone --keystone-group keystone /usr/bin/fernet-push.sh

So it also execute fernet-push.sh, but the script is doing nothing, only with this:
#!/bin/bash
set -o errexit
set -o pipefail

Shouldn't it sends newly generated token somehow?

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Unless it's not a single node installation - it should have rsync commands to other hosts: https://github.com/openstack/kolla-ansible/blob/stable/ussuri/ansible/roles/keystone/templates/fernet-push.sh.j2

Revision history for this message
Tomas Stodulka (xstodu07) wrote :

yeah, I am using multinode of 1 compute, 1 controller. So I should have it. I run the command under ansible in site.yml easily as:

  hosts: localhost
  gather_facts: true
  tasks:
    - name: test
      vars:
        host: "{{item}}"
      debug:
        msg: "/usr/bin/rsync -az -e 'ssh -i /var/lib/keystone/.ssh/id_rsa -p {{ hostvars[host]['keystone_ssh_port'] }} -F /var/lib/keystone/.ssh/config' --delete /etc/keystone/fernet-keys/ keystone@{{ 'api' | kolla_address(host) | put_address_in_context('url') }}:/etc/keystone/fernet-keys"
      with_items: "{{ groups['keystone'] }}"

and I got this string:

/usr/bin/rsync -az -e 'ssh -i /var/lib/keystone/.ssh/id_rsa -p 8023 -F /var/lib/keystone/.ssh/config' --delete /etc/keystone/fernet-keys/ keystone@10.0.200.11:/etc/keystone/fernet-keys

Do you think that this is the root cause? I paste it in fernet-push.sh and restart the container.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

If you have one controller, then it really does not need to synchronize with anyone else, hence no code for that part.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

So, in summary, it failed to keep fernet keys current for itself. Well, that's interesting. I wonder if we just have it working only for multi-controllers scenarios due to sync happening there, i.e. A gets them current because it produces them, sends to B and B sends to A.

Revision history for this message
Tomas Stodulka (xstodu07) wrote :

#10 > I thought that. :)

#11 > It could be as you are saying. I will be in lab next week, and I could try some deeper investigation in my deployment.

Changed in kolla-ansible:
status: New → In Progress
Revision history for this message
joel waddell (maasmaaster) wrote :

i'm also running into this issue. I'm running a 3 controller 3 compute cluster, when keystone is restarted it gets stuck in a restart loop with the same error found in the docker logs above. The only way I get get keystone to start up is the change /etc/kolla/keystone/keystone-startup.sh token age to 8640000 (yes, 100 days because my deployment is older than Tomas' deployment). I'm running most recent version of master, so I picked up the keystone changes from about 28 days ago. I manually ran the farnet-rotate.sh script the cronjob runs and it seemed to complete ok. Each control node keystone and keystone_fernet container has 3 keys in /etc/keystone/farnet-keys folder so it appears the rsync is running ok.

```
(keystone-fernet)[root@ddc-control-3 /]# ls -ahl /etc/keystone/fernet-keys/
total 20K
drwxrwx--- 2 keystone keystone 4.0K Sep 25 12:32 .
drwxr-xr-x 1 keystone keystone 4.0K Sep 25 13:08 ..
-rw------- 1 keystone keystone 44 Sep 25 12:32 0
-rw------- 1 keystone keystone 44 Aug 27 14:23 1
-rw------- 1 keystone keystone 44 Aug 27 14:23 2

```

it just seems like keystone is using one of the older keys even know a newer one is present.

on a side note, when I pulled the updated git repo (from 350 to 380) I pip installed ./kolla and ./kolla-ansible then I ran kolla-ansible deploy to see if the keystone changes in the newer master 380 would fix the issue. has not for me. Not sure if I should be running deploy or upgrade so that could be part of the issue of why keystone changes did not seem to fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/754085
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=d78673e77fc376274192a8f7b547ce90d6bdc151
Submitter: Zuul
Branch: master

commit d78673e77fc376274192a8f7b547ce90d6bdc151
Author: Michal Nasiadka <email address hidden>
Date: Thu Sep 24 15:16:55 2020 +0200

    Fix keystone-startup.sh

    keystone-startup.sh is using fernet_token_expiry instead of
    fernet_key_rotation_interval - which effects in restart loop of keystone
    containers - when restarted after 2-3 days.

    Closes-Bug: #1895723

    Change-Id: Ifff77af3d25d9dc659fff34f2ae3c6f2670df0f4

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/754392

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/754393

Revision history for this message
joel waddell (maasmaaster) wrote :

update, saw that 5 mins after my post that there was an update to the keystone-startup.sh. I pulled and picked up the change and ran deploy again. Still having issue where keystone container is restarting with same error message from logs. I had to add an extra 0 at the end of 249200

```
if [ "${TOKEN_AGE}" -gt "2592000" ]; then
    echo "ERROR: Primary token ${TOKEN_PRIMARY} is stale."
    exit 1
fi
```

Revision history for this message
joel waddell (maasmaaster) wrote :

I noticed that on 2 of my control nodes (nodes 2 and 3)

```
cat /var/spool/cron/crontabs/root/fernet-cron
PATH=/var/lib/kolla/venv/bin:/usr/bin:/bin
```
shows nothing for a job but on control node 1

```
cat /var/spool/cron/crontabs/root/fernet-cron
PATH=/var/lib/kolla/venv/bin:/usr/bin:/bin

0 0 * * 3 /usr/bin/fernet-rotate.sh
```

Could this symptom the be the cause to outdated keys?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/754393
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=ecd899c9a856de153a892d513c4406bcb73b57b1
Submitter: Zuul
Branch: stable/train

commit ecd899c9a856de153a892d513c4406bcb73b57b1
Author: Michal Nasiadka <email address hidden>
Date: Thu Sep 24 15:16:55 2020 +0200

    Fix keystone-startup.sh

    keystone-startup.sh is using fernet_token_expiry instead of
    fernet_key_rotation_interval - which effects in restart loop of keystone
    containers - when restarted after 2-3 days.

    Closes-Bug: #1895723

    Change-Id: Ifff77af3d25d9dc659fff34f2ae3c6f2670df0f4
    (cherry picked from commit d78673e77fc376274192a8f7b547ce90d6bdc151)

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

@Joel: yes - generally with 3 controllers - and default token expiry settings - you should have the crontab non-empty on 2 of them (with different day to rotate them).

Revision history for this message
joel waddell (maasmaaster) wrote :

Ok, I think right now only one of my nodes has the cronjob on it, I manually rotated the fernet running the script that the cron job runs and that fixed the issue for me because it caused a newer token to become primary. I'm not sure why only 1/3 nodes has a non-empty cron but that might be what caused my keys to become too dated.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/754392
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=7e3d3d8bc71b55701f5e1929573d2fda0835c3f9
Submitter: Zuul
Branch: stable/ussuri

commit 7e3d3d8bc71b55701f5e1929573d2fda0835c3f9
Author: Michal Nasiadka <email address hidden>
Date: Thu Sep 24 15:16:55 2020 +0200

    Fix keystone-startup.sh

    keystone-startup.sh is using fernet_token_expiry instead of
    fernet_key_rotation_interval - which effects in restart loop of keystone
    containers - when restarted after 2-3 days.

    Closes-Bug: #1895723

    Change-Id: Ifff77af3d25d9dc659fff34f2ae3c6f2670df0f4
    (cherry picked from commit d78673e77fc376274192a8f7b547ce90d6bdc151)

Revision history for this message
joel waddell (maasmaaster) wrote :

So I left this over the weekend and decided to restart keystone on one of my control nodes. Keystone got stuck in a restart loop because of primary fernet key is outdated. I found out that the keystone_fernet container did you fire off it's cronjob to rotate the keys because all of the keys in the keystone containers are the same keys as friday. Running the script that is in the cron job works just fine but it seems the actual cronjob might be the issue. This bug is kinda similar to this bug https://bugs.launchpad.net/kolla-ansible/+bug/1850711 also.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/759210

Changed in kolla-ansible:
assignee: Michal Nasiadka (mnasiadka) → Mark Goddard (mgoddard)
status: Triaged → In Progress
Revision history for this message
Joris Vleminckx (jvleminc) wrote :

I ran Kolla-Ansible Train, updating an older Train Openstack and the keystone container (which already has the fix mentioned here (https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=ecd899c9a856de153a892d513c4406bcb73b57b1)) keeps on restarting.
A brand new installation ends without issues.

--
+ TOKEN_PRIMARY=1
++ date +%s
++ date +%s -r /etc/keystone/fernet-keys/1
+ TOKEN_AGE=9495029
+ '[' 9495029 -gt 259200 ']'
+ echo 'ERROR: Primary token 1 is stale.'
+ exit 1
ERROR: Primary token 1 is stale.
--

Revision history for this message
Joris Vleminckx (jvleminc) wrote :

With the proposed fix (of #24) the installation ends well in all cases. :-)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/759210
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=ba8c27f554682e1f2720fad1bff5cfa1b35017f2
Submitter: Zuul
Branch: master

commit ba8c27f554682e1f2720fad1bff5cfa1b35017f2
Author: Mark Goddard <email address hidden>
Date: Thu Oct 22 09:18:32 2020 +0100

    Fix keystone-startup.sh - remove Fernet key age check

    Currently we check the age of the primary Fernet key on Keystone
    startup, and fail if it is older than the rotation interval. While this
    may seem sensible, there are various reasons why the key may be older
    than this:

    * if the rotation interval is not a factor of the number of seconds in a
      week, the rotation schedule will be lumpy, with the last rotation
      being up to twice the nominal rotation interval
    * if a keystone host is unavailable at its scheduled rotation time,
      rotation will not happen. This may happen multiple times

    We could do several things to avoid this issue:

    1. remove the check on the age of the key
    2. multiply the rotation interval by some factor to determine the
       allowed key age

    This change goes for the more simple option 1. It also cleans up some
    terminology in the keystone-startup.sh script.

    Closes-Bug: #1895723

    Change-Id: I2c35f59ae9449cb1646e402e0a9f28ad61f918a8

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/759560

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/759561

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/759560
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=8e56a1a7975d7030241cfb8476ae54c9550498a0
Submitter: Zuul
Branch: stable/ussuri

commit 8e56a1a7975d7030241cfb8476ae54c9550498a0
Author: Mark Goddard <email address hidden>
Date: Thu Oct 22 09:18:32 2020 +0100

    Fix keystone-startup.sh - remove Fernet key age check

    Currently we check the age of the primary Fernet key on Keystone
    startup, and fail if it is older than the rotation interval. While this
    may seem sensible, there are various reasons why the key may be older
    than this:

    * if the rotation interval is not a factor of the number of seconds in a
      week, the rotation schedule will be lumpy, with the last rotation
      being up to twice the nominal rotation interval
    * if a keystone host is unavailable at its scheduled rotation time,
      rotation will not happen. This may happen multiple times

    We could do several things to avoid this issue:

    1. remove the check on the age of the key
    2. multiply the rotation interval by some factor to determine the
       allowed key age

    This change goes for the more simple option 1. It also cleans up some
    terminology in the keystone-startup.sh script.

    Closes-Bug: #1895723

    Change-Id: I2c35f59ae9449cb1646e402e0a9f28ad61f918a8
    (cherry picked from commit ba8c27f554682e1f2720fad1bff5cfa1b35017f2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/759561
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=5ede5d9db30e8aa1748b9a6e71bb452aa16833a8
Submitter: Zuul
Branch: stable/train

commit 5ede5d9db30e8aa1748b9a6e71bb452aa16833a8
Author: Mark Goddard <email address hidden>
Date: Thu Oct 22 09:18:32 2020 +0100

    Fix keystone-startup.sh - remove Fernet key age check

    Currently we check the age of the primary Fernet key on Keystone
    startup, and fail if it is older than the rotation interval. While this
    may seem sensible, there are various reasons why the key may be older
    than this:

    * if the rotation interval is not a factor of the number of seconds in a
      week, the rotation schedule will be lumpy, with the last rotation
      being up to twice the nominal rotation interval
    * if a keystone host is unavailable at its scheduled rotation time,
      rotation will not happen. This may happen multiple times

    We could do several things to avoid this issue:

    1. remove the check on the age of the key
    2. multiply the rotation interval by some factor to determine the
       allowed key age

    This change goes for the more simple option 1. It also cleans up some
    terminology in the keystone-startup.sh script.

    Closes-Bug: #1895723

    Change-Id: I2c35f59ae9449cb1646e402e0a9f28ad61f918a8
    (cherry picked from commit ba8c27f554682e1f2720fad1bff5cfa1b35017f2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 10.2.0

This issue was fixed in the openstack/kolla-ansible 10.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.3.0

This issue was fixed in the openstack/kolla-ansible 9.3.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.