Bug #1895723 “Keystone is restarting due to stale primary key” : Bugs : OpenStack Identity (keystone)

Revision history for this message

Tomas Stodulka (xstodu07) wrote on 2020-09-15:

#1

keystone.log Edit (234.3 KiB, text/plain)

Revision history for this message

Radosław Piliszek (yoctozepto) wrote on 2020-09-16:

#2

I believe it was fixed already but let Michał see.

Changed in keystone:
status:	New → Invalid
Changed in kolla-ansible:
assignee:	nobody → Michal Nasiadka (mnasiadka)
importance:	Undecided → High

Revision history for this message

Tomas Stodulka (xstodu07) wrote on 2020-09-16:

#3

Ok, thank you. :)

Revision history for this message

Michal Nasiadka (mnasiadka) wrote on 2020-09-16:

#4

Well, I would first like to understand, how can we end up with stale token with crontab set for every third day with default retention period.

Tomas - can you share /etc/kolla/keystone-fernet/crontab?

Was it only restart of keystone, or did you shut it down for like a week?

Revision history for this message

Tomas Stodulka (xstodu07) wrote on 2020-09-16:

#5

Sure, here it is:
# cat /etc/kolla/keystone-fernet/crontab
PATH=/var/lib/kolla/venv/bin:/usr/bin:/bin

0 0 * * 0 /usr/bin/fernet-rotate.sh
0 0 * * 3 /usr/bin/fernet-rotate.sh

It was only restart of keystone container, when I wanted to apply changes in ../kolla/keystone/keystone.conf (corn.allowed_origin), but it had nothing to do with the problem, I also revert the change back.

Revision history for this message

Tomas Stodulka (xstodu07) wrote on 2020-09-16:

#6

And in Kibana I havn't had any errors or warnings, regarding this issue for last 2 weeks (only some warnings due to gnocchi on controller).

Revision history for this message

Tomas Stodulka (xstodu07) wrote on 2020-09-16:

#7

Now I found strange thing. If crontab executing fernet-rotate.sh, script content is follow:
#!/bin/bash
set -o errexit
set -o pipefail keystone-manage --config-file /etc/keystone/keystone.conf fernet_rotate --keystone-user keystone --keystone-group keystone /usr/bin/fernet-push.sh

So it also execute fernet-push.sh, but the script is doing nothing, only with this:
#!/bin/bash
set -o errexit
set -o pipefail

Shouldn't it sends newly generated token somehow?

Revision history for this message

Michal Nasiadka (mnasiadka) wrote on 2020-09-16:

#8

Unless it's not a single node installation - it should have rsync commands to other hosts: https://github.com/openstack/kolla-ansible/blob/stable/ussuri/ansible/roles/keystone/templates/fernet-push.sh.j2

Revision history for this message

Tomas Stodulka (xstodu07) wrote on 2020-09-16:

#9

yeah, I am using multinode of 1 compute, 1 controller. So I should have it. I run the command under ansible in site.yml easily as:

  hosts: localhost
  gather_facts: true
  tasks:
    - name: test
      vars:
        host: "{{item}}"
      debug:
        msg: "/usr/bin/rsync -az -e 'ssh -i /var/lib/keystone/.ssh/id_rsa -p {{ hostvars[host]['keystone_ssh_port'] }} -F /var/lib/keystone/.ssh/config' --delete /etc/keystone/fernet-keys/ keystone@{{ 'api' | kolla_address(host) | put_address_in_context('url') }}:/etc/keystone/fernet-keys"
      with_items: "{{ groups['keystone'] }}"

and I got this string:

/usr/bin/rsync -az -e 'ssh -i /var/lib/keystone/.ssh/id_rsa -p 8023 -F /var/lib/keystone/.ssh/config' --delete /etc/keystone/fernet-keys/ keystone@10.0.200.11:/etc/keystone/fernet-keys

Do you think that this is the root cause? I paste it in fernet-push.sh and restart the container.

Revision history for this message

Radosław Piliszek (yoctozepto) wrote on 2020-09-17:

#10

If you have one controller, then it really does not need to synchronize with anyone else, hence no code for that part.

Revision history for this message

Radosław Piliszek (yoctozepto) wrote on 2020-09-17:

#11

So, in summary, it failed to keep fernet keys current for itself. Well, that's interesting. I wonder if we just have it working only for multi-controllers scenarios due to sync happening there, i.e. A gets them current because it produces them, sends to B and B sends to A.

Revision history for this message

Tomas Stodulka (xstodu07) wrote on 2020-09-17:

#12

#10 > I thought that. :)

#11 > It could be as you are saying. I will be in lab next week, and I could try some deeper investigation in my deployment.

OpenStack Infra (hudson-openstack) on 2020-09-24

Changed in kolla-ansible:
status:	New → In Progress

Revision history for this message

joel waddell (maasmaaster) wrote on 2020-09-25:

#13

i'm also running into this issue. I'm running a 3 controller 3 compute cluster, when keystone is restarted it gets stuck in a restart loop with the same error found in the docker logs above. The only way I get get keystone to start up is the change /etc/kolla/keystone/keystone-startup.sh token age to 8640000 (yes, 100 days because my deployment is older than Tomas' deployment). I'm running most recent version of master, so I picked up the keystone changes from about 28 days ago. I manually ran the farnet-rotate.sh script the cronjob runs and it seemed to complete ok. Each control node keystone and keystone_fernet container has 3 keys in /etc/keystone/farnet-keys folder so it appears the rsync is running ok.

```
(keystone-fernet)[root@ddc-control-3 /]# ls -ahl /etc/keystone/fernet-keys/
total 20K
drwxrwx--- 2 keystone keystone 4.0K Sep 25 12:32 .
drwxr-xr-x 1 keystone keystone 4.0K Sep 25 13:08 ..
-rw------- 1 keystone keystone 44 Sep 25 12:32 0
-rw------- 1 keystone keystone 44 Aug 27 14:23 1
-rw------- 1 keystone keystone 44 Aug 27 14:23 2

```

it just seems like keystone is using one of the older keys even know a newer one is present.

on a side note, when I pulled the updated git repo (from 350 to 380) I pip installed ./kolla and ./kolla-ansible then I ran kolla-ansible deploy to see if the keystone changes in the newer master 380 would fix the issue. has not for me. Not sure if I should be running deploy or upgrade so that could be part of the issue of why keystone changes did not seem to fix.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-25: Fix merged to kolla-ansible (master)

#14

Reviewed: https://review.opendev.org/754085
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=d78673e77fc376274192a8f7b547ce90d6bdc151
Submitter: Zuul
Branch: master

commit d78673e77fc376274192a8f7b547ce90d6bdc151
Author: Michal Nasiadka <email address hidden>
Date: Thu Sep 24 15:16:55 2020 +0200

Fix keystone-startup.sh

    keystone-startup.sh is using fernet_token_expiry instead of
    fernet_key_rotation_interval - which effects in restart loop of keystone
    containers - when restarted after 2-3 days.

Closes-Bug: #1895723

Change-Id: Ifff77af3d25d9dc659fff34f2ae3c6f2670df0f4

Changed in kolla-ansible:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-25: Fix proposed to kolla-ansible (stable/ussuri)

#15

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/754392

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-25: Fix proposed to kolla-ansible (stable/train)

#16

Fix proposed to branch: stable/train
Review: https://review.opendev.org/754393

Revision history for this message

joel waddell (maasmaaster) wrote on 2020-09-25:

#17

update, saw that 5 mins after my post that there was an update to the keystone-startup.sh. I pulled and picked up the change and ran deploy again. Still having issue where keystone container is restarting with same error message from logs. I had to add an extra 0 at the end of 249200

```
if [ "${TOKEN_AGE}" -gt "2592000" ]; then
echo "ERROR: Primary token ${TOKEN_PRIMARY} is stale."
exit 1
fi
```

Revision history for this message

joel waddell (maasmaaster) wrote on 2020-09-25:

#18

I noticed that on 2 of my control nodes (nodes 2 and 3)

```
cat /var/spool/cron/crontabs/root/fernet-cron
PATH=/var/lib/kolla/venv/bin:/usr/bin:/bin
```
shows nothing for a job but on control node 1

```
cat /var/spool/cron/crontabs/root/fernet-cron
PATH=/var/lib/kolla/venv/bin:/usr/bin:/bin

0 0 * * 3 /usr/bin/fernet-rotate.sh
```

Could this symptom the be the cause to outdated keys?

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-25: Fix merged to kolla-ansible (stable/train)

#19

Reviewed: https://review.opendev.org/754393
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=ecd899c9a856de153a892d513c4406bcb73b57b1
Submitter: Zuul
Branch: stable/train

commit ecd899c9a856de153a892d513c4406bcb73b57b1
Author: Michal Nasiadka <email address hidden>
Date: Thu Sep 24 15:16:55 2020 +0200

Fix keystone-startup.sh

    keystone-startup.sh is using fernet_token_expiry instead of
    fernet_key_rotation_interval - which effects in restart loop of keystone
    containers - when restarted after 2-3 days.

Closes-Bug: #1895723

Change-Id: Ifff77af3d25d9dc659fff34f2ae3c6f2670df0f4
(cherry picked from commit d78673e77fc376274192a8f7b547ce90d6bdc151)

Revision history for this message

Michal Nasiadka (mnasiadka) wrote on 2020-09-25:

#20

@Joel: yes - generally with 3 controllers - and default token expiry settings - you should have the crontab non-empty on 2 of them (with different day to rotate them).

Revision history for this message

joel waddell (maasmaaster) wrote on 2020-09-25:

#21

Ok, I think right now only one of my nodes has the cronjob on it, I manually rotated the fernet running the script that the cron job runs and that fixed the issue for me because it caused a newer token to become primary. I'm not sure why only 1/3 nodes has a non-empty cron but that might be what caused my keys to become too dated.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-09-28: Fix merged to kolla-ansible (stable/ussuri)

#22

Reviewed: https://review.opendev.org/754392
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=7e3d3d8bc71b55701f5e1929573d2fda0835c3f9
Submitter: Zuul
Branch: stable/ussuri

commit 7e3d3d8bc71b55701f5e1929573d2fda0835c3f9
Author: Michal Nasiadka <email address hidden>
Date: Thu Sep 24 15:16:55 2020 +0200

Fix keystone-startup.sh

    keystone-startup.sh is using fernet_token_expiry instead of
    fernet_key_rotation_interval - which effects in restart loop of keystone
    containers - when restarted after 2-3 days.

Closes-Bug: #1895723

Change-Id: Ifff77af3d25d9dc659fff34f2ae3c6f2670df0f4
(cherry picked from commit d78673e77fc376274192a8f7b547ce90d6bdc151)

Revision history for this message

joel waddell (maasmaaster) wrote on 2020-09-28:

#23

So I left this over the weekend and decided to restart keystone on one of my control nodes. Keystone got stuck in a restart loop because of primary fernet key is outdated. I found out that the keystone_fernet container did you fire off it's cronjob to rotate the keys because all of the keys in the keystone containers are the same keys as friday. Running the script that is in the cron job works just fine but it seems the actual cronjob might be the issue. This bug is kinda similar to this bug https://bugs.launchpad.net/kolla-ansible/+bug/1850711 also.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-22: Fix proposed to kolla-ansible (master)

#24

Fix proposed to branch: master
Review: https://review.opendev.org/759210

Changed in kolla-ansible:
assignee:	Michal Nasiadka (mnasiadka) → Mark Goddard (mgoddard)
status:	Triaged → In Progress

Revision history for this message

Joris Vleminckx (jvleminc) wrote on 2020-10-22:

#25

I ran Kolla-Ansible Train, updating an older Train Openstack and the keystone container (which already has the fix mentioned here (https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=ecd899c9a856de153a892d513c4406bcb73b57b1)) keeps on restarting.
A brand new installation ends without issues.

--
+ TOKEN_PRIMARY=1
++ date +%s
++ date +%s -r /etc/keystone/fernet-keys/1
+ TOKEN_AGE=9495029
+ '[' 9495029 -gt 259200 ']'
+ echo 'ERROR: Primary token 1 is stale.'
+ exit 1
ERROR: Primary token 1 is stale.
--

Revision history for this message

Joris Vleminckx (jvleminc) wrote on 2020-10-23:

#26

With the proposed fix (of #24) the installation ends well in all cases. :-)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-23: Fix merged to kolla-ansible (master)

#27

Reviewed: https://review.opendev.org/759210
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=ba8c27f554682e1f2720fad1bff5cfa1b35017f2
Submitter: Zuul
Branch: master

commit ba8c27f554682e1f2720fad1bff5cfa1b35017f2
Author: Mark Goddard <email address hidden>
Date: Thu Oct 22 09:18:32 2020 +0100

Fix keystone-startup.sh - remove Fernet key age check

    Currently we check the age of the primary Fernet key on Keystone
    startup, and fail if it is older than the rotation interval. While this
    may seem sensible, there are various reasons why the key may be older
    than this:

    * if the rotation interval is not a factor of the number of seconds in a
      week, the rotation schedule will be lumpy, with the last rotation
      being up to twice the nominal rotation interval
    * if a keystone host is unavailable at its scheduled rotation time,
      rotation will not happen. This may happen multiple times

We could do several things to avoid this issue:

    1. remove the check on the age of the key
    2. multiply the rotation interval by some factor to determine the
       allowed key age

This change goes for the more simple option 1. It also cleans up some
terminology in the keystone-startup.sh script.

Closes-Bug: #1895723

Change-Id: I2c35f59ae9449cb1646e402e0a9f28ad61f918a8

Changed in kolla-ansible:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-24: Fix proposed to kolla-ansible (stable/ussuri)

#28

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/759560

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-24: Fix proposed to kolla-ansible (stable/train)

#29

Fix proposed to branch: stable/train
Review: https://review.opendev.org/759561

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-24: Fix merged to kolla-ansible (stable/ussuri)

#30

Reviewed: https://review.opendev.org/759560
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=8e56a1a7975d7030241cfb8476ae54c9550498a0
Submitter: Zuul
Branch: stable/ussuri

commit 8e56a1a7975d7030241cfb8476ae54c9550498a0
Author: Mark Goddard <email address hidden>
Date: Thu Oct 22 09:18:32 2020 +0100

Fix keystone-startup.sh - remove Fernet key age check

    Currently we check the age of the primary Fernet key on Keystone
    startup, and fail if it is older than the rotation interval. While this
    may seem sensible, there are various reasons why the key may be older
    than this:

    * if the rotation interval is not a factor of the number of seconds in a
      week, the rotation schedule will be lumpy, with the last rotation
      being up to twice the nominal rotation interval
    * if a keystone host is unavailable at its scheduled rotation time,
      rotation will not happen. This may happen multiple times

We could do several things to avoid this issue:

    1. remove the check on the age of the key
    2. multiply the rotation interval by some factor to determine the
       allowed key age

This change goes for the more simple option 1. It also cleans up some
terminology in the keystone-startup.sh script.

Closes-Bug: #1895723

Change-Id: I2c35f59ae9449cb1646e402e0a9f28ad61f918a8
(cherry picked from commit ba8c27f554682e1f2720fad1bff5cfa1b35017f2)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-10-24: Fix merged to kolla-ansible (stable/train)

#31

Reviewed: https://review.opendev.org/759561
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=5ede5d9db30e8aa1748b9a6e71bb452aa16833a8
Submitter: Zuul
Branch: stable/train

commit 5ede5d9db30e8aa1748b9a6e71bb452aa16833a8
Author: Mark Goddard <email address hidden>
Date: Thu Oct 22 09:18:32 2020 +0100

Fix keystone-startup.sh - remove Fernet key age check

    Currently we check the age of the primary Fernet key on Keystone
    startup, and fail if it is older than the rotation interval. While this
    may seem sensible, there are various reasons why the key may be older
    than this:

    * if the rotation interval is not a factor of the number of seconds in a
      week, the rotation schedule will be lumpy, with the last rotation
      being up to twice the nominal rotation interval
    * if a keystone host is unavailable at its scheduled rotation time,
      rotation will not happen. This may happen multiple times

We could do several things to avoid this issue:

    1. remove the check on the age of the key
    2. multiply the rotation interval by some factor to determine the
       allowed key age

This change goes for the more simple option 1. It also cleans up some
terminology in the keystone-startup.sh script.

Closes-Bug: #1895723

Change-Id: I2c35f59ae9449cb1646e402e0a9f28ad61f918a8
(cherry picked from commit ba8c27f554682e1f2720fad1bff5cfa1b35017f2)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-01-07: Fix included in openstack/kolla-ansible 10.2.0

#32

This issue was fixed in the openstack/kolla-ansible 10.2.0 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-01-07: Fix included in openstack/kolla-ansible 9.3.0

#34

This issue was fixed in the openstack/kolla-ansible 9.3.0 release.

OpenStack Identity (keystone)

Keystone is restarting due to stale primary key

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

	Status	Importance	Assigned to	Milestone
OpenStack Identity (keystone)	Invalid	Undecided	Unassigned
kolla-ansible	Fix Released	High	Mark Goddard	kolla-ansible 11.0.0 "victoria"
Train	Fix Released	High	Radosław Piliszek	kolla-ansible 9.3.0 "Train"
Ussuri	Fix Released	High	Radosław Piliszek	kolla-ansible 10.2.0 "ussuri"
Victoria	Fix Released	High	Mark Goddard	kolla-ansible 11.0.0 "victoria"