Mistral fails to maintain a keystone session while deploying an overcloud

Bug #1761050 reported by Emilien Macchi
This bug affects 3 people
Affects   Status       Importance  Assigned to  Milestone
Mistral   In Progress  High        Unassigned
tripleo   Incomplete   High        Unassigned

Bug Description

When deploying a containerized overcloud from a containerized undercloud in an OVB environment, the overcloud gets deployed, but the Mistral executor fails when Zaqar tries to post the message on the queue:

https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/var/log/containers/mistral/executor.log.txt.gz#_2018-04-04_01_56_12_631

https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/var/log/containers/zaqar/zaqar.log.txt.gz#_2018-04-04_01_56_10_669

https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/var/log/containers/keystone/keystone.log.txt.gz#_2018-04-04_01_56_10_548

This may be irrelevant, but this message pops up in our logs:
Loaded 2 Fernet keys from /etc/keystone/fernet-keys, but `[fernet_tokens] max_active_keys = 5`; perhaps there have not been enough key rotations to reach `max_active_keys` yet?

Therefore, the overcloud fails to finish the deployment.
I've been working on aligning the mistral/zaqar configurations:
https://trello.com/c/GMYssQ9b/44-align-openstack-configs-with-tht-and-instack-undercloud
But it didn't help, so now I'm wondering about key rotations etc.
Note that we haven't hit this bug in multinode jobs, maybe because those jobs are faster than OVB? Do we have some sort of expiration?

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

I don't think it's a token provider configuration issue, since we only have 2 keys by default. And the deployment seems to start and go forward for quite a while: https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/home/jenkins/overcloud_deploy.log.txt.gz

From there I can see that it gets up to step 5 and in the end fails with this exception: No JSON object could be decoded

which I guess comes from the Mistral client.

At some point in the zaqar logs I can see that it fails with "authorization failed":
https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/var/log/containers/zaqar/zaqar.log.txt.gz#_2018-04-04_01_56_10_669

which gets reflected in the mistral executor logs here https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/var/log/containers/mistral/executor.log.txt.gz#_2018-04-04_01_56_12_627

which is what Emilien reported.

I think the issue is that mistral (server) is not refreshing the token that zaqar is using. The token works for a while and expires after an hour (which is what we configure). The log timings back up this theory:

The deploy starts at 0:55
https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2018-04-04_00_55_54

And we see the error at 1:55
https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2018-04-04_01_55_56
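
The timing argument can be checked with a little arithmetic. This is only a sketch: the two timestamps are copied from the log anchors linked above, and the one-hour lifetime is the configured expiration mentioned in this comment, taken here as an assumption rather than read from any config file.

```python
from datetime import datetime

# Timestamps copied from the overcloud_deploy.log anchors linked above.
FMT = "%Y-%m-%d_%H_%M_%S"
deploy_start = datetime.strptime("2018-04-04_00_55_54", FMT)
first_error = datetime.strptime("2018-04-04_01_55_56", FMT)

# Assumption: the token lifetime is the one hour mentioned in this comment.
TOKEN_LIFETIME_SECONDS = 3600

elapsed = (first_error - deploy_start).total_seconds()
overshoot = elapsed - TOKEN_LIFETIME_SECONDS
print(f"elapsed: {elapsed:.0f}s, past the token lifetime by {overshoot:.0f}s")
```

The error shows up within seconds of the one-hour mark, which is what you would expect if a token obtained at the start of the deploy were reused without being refreshed.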

So ultimately it seems to me that it's an issue with how mistral creates the client (in a way that doesn't refresh the keystone tokens). This should have been handled already though, and it does seem to me that mistral is using sessions correctly (as far as I can tell). Are we using an old mistral container?

This would usually have been handled by the session object from keystoneauth1, which I thought was being used in zaqar.

Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Going through the zaqarclient codebase, it seems to me that they don't use keystone sessions to do HTTP requests. Instead, they use the session to get the token explicitly, and further down, in their "transport drivers", they build another object to do the requests. In the case of the HTTP driver, they create another requests session and put the headers explicitly there. This is where they do authentication:

https://github.com/openstack/python-zaqarclient/blob/master/zaqarclient/auth/keystone.py#L175

And they seem to use a deprecated method (get_token), however, that code does seem to trigger reauthentication (deep into the rabbithole). So... I'm not sure what the issue is. But I think it's either in zaqarclient or mistral.
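
To make the suspected failure mode concrete, here is a toy model of the difference between copying a token into request headers once and checking it before every request. It is a sketch only: `FakeKeystone`, `StaticTokenClient`, and `RefreshingClient` are hypothetical names invented for illustration, the clock is simulated, and none of this is the actual zaqarclient, mistral, or keystoneauth1 code.

```python
class FakeKeystone:
    """Hypothetical stand-in for an identity service issuing expiring tokens."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.now = 0      # simulated clock, in seconds
        self.issued = 0   # counter used to make token ids distinct

    def issue_token(self):
        self.issued += 1
        return {"id": f"token-{self.issued}",
                "expires_at": self.now + self.lifetime}

    def status_for(self, token):
        # 201 while the token is still valid, 401 once it has expired.
        return 201 if self.now < token["expires_at"] else 401


class StaticTokenClient:
    """Fetches a token once and reuses it for every request."""

    def __init__(self, keystone):
        self.keystone = keystone
        self.token = keystone.issue_token()

    def post(self):
        return self.keystone.status_for(self.token)


class RefreshingClient(StaticTokenClient):
    """Re-authenticates before a request if the cached token has expired."""

    def post(self):
        if self.keystone.now >= self.token["expires_at"]:
            self.token = self.keystone.issue_token()
        return self.keystone.status_for(self.token)


keystone = FakeKeystone(lifetime=3600)
static = StaticTokenClient(keystone)
refreshing = RefreshingClient(keystone)

before_expiry = (static.post(), refreshing.post())
keystone.now = 3602  # just past the one-hour lifetime
after_expiry = (static.post(), refreshing.post())
print(before_expiry, after_expiry)
```

Both clients succeed before the token expires. Afterwards only the refreshing client keeps working, which matches the pattern of an authorization failure appearing roughly an hour into the deployment.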

summary: - Failures to get tokens when undercloud is containerized
+ Mistral or Zaqar fail to maintain a keystone session while deploying an
+ overcloud
tags: added: tech-debt
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/558922

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/558922
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=02cacfd53a9cf168d7e1967faef3211333800eb6
Submitter: Zuul
Branch: master

commit 02cacfd53a9cf168d7e1967faef3211333800eb6
Author: Emilien Macchi <email address hidden>
Date: Wed Apr 4 13:42:41 2018 -0700

    undercloud: increase token expiration time

    We did it in the past (3 years ago!) in instack-undercloud:
    https://github.com/openstack/instack-undercloud/commit/43e792c6844d4a7081b718d7f89b0c40f5cfb708
    in the context of: https://bugzilla.redhat.com/show_bug.cgi?id=1235908

    This time, we have the same problem when the undercloud is
    containerized.
    This patch is actually setting parity with keystone config from
    instack-undercloud, but also raising an actual issue that will be
    addressed this cycle.

    In the meantime, let's increase the token expiration so we can move
    forward with testing the containerized undercloud.

    Change-Id: Iceaaf53fae44b5bcda9f6517f163939ba6be3d49
    Related-Bug: #1761050

Dougal Matthews (d0ugal)
Changed in mistral:
importance: Undecided → High
status: New → Incomplete
status: Incomplete → New
Dougal Matthews (d0ugal)
tags: added: workflows
Dougal Matthews (d0ugal)
Changed in mistral:
status: New → Triaged
milestone: none → rocky-1
Dougal Matthews (d0ugal) wrote : Re: Mistral or Zaqar fail to maintain a keystone session while deploying an overcloud

therve told me about some testing related to this bug. It seems you can replicate the same issue with the heat actions, so this is likely a bug specific to Mistral.

tags: added: tripleo
Dougal Matthews (d0ugal) wrote :
Changed in tripleo:
milestone: rocky-1 → rocky-3
Thomas Herve (therve)
no longer affects: zaqar
summary: - Mistral or Zaqar fail to maintain a keystone session while deploying an
+ Mistral fails to maintain a keystone session while deploying an
overcloud
Dougal Matthews (d0ugal) wrote :

Added the security-hardening tag, as we now require a 4-hour Keystone token, which has security disadvantages.
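
As a rough illustration of the trade-off, the sketch below parses a hypothetical keystone.conf fragment and checks whether a deployment of a given length would finish before the token expires. `[token] expiration` is keystone's option for token lifetime in seconds, but the fragment itself is an assumption: 14400 is just the 4 hours mentioned above, not a value quoted from the merged patch, and `token_outlives` is a made-up helper.

```python
import configparser

# Hypothetical keystone.conf fragment; 14400 s is the 4 hours mentioned above.
KEYSTONE_CONF = """
[token]
expiration = 14400
"""

parser = configparser.ConfigParser()
parser.read_string(KEYSTONE_CONF)
expiration = parser.getint("token", "expiration")


def token_outlives(deploy_seconds, token_seconds=expiration):
    """Return True if a token issued at deploy start is still valid at the end."""
    return deploy_seconds < token_seconds


one_hour_deploy = token_outlives(3602)       # the run from this bug report
five_hour_deploy = token_outlives(5 * 3600)  # a hypothetical longer deployment
print(expiration, one_hour_deploy, five_hour_deploy)
```

A longer lifetime only moves the point of failure: any deployment that runs past the configured expiration would hit the same error, while every issued token stays usable for longer.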

tags: added: security-hardening
Brad P. Crochet (brad-9)
Changed in mistral:
assignee: nobody → Brad P. Crochet (brad-9)
Dougal Matthews (d0ugal)
Changed in mistral:
milestone: rocky-1 → rocky-2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to mistral (master)

Fix proposed to branch: master
Review: https://review.openstack.org/572448

Changed in mistral:
status: Triaged → In Progress
Dougal Matthews (d0ugal)
Changed in mistral:
milestone: rocky-2 → rocky-3
OpenStack Infra (hudson-openstack) wrote : Change abandoned on mistral (master)

Change abandoned by Brad P. Crochet (<email address hidden>) on branch: master
Review: https://review.openstack.org/572448
Reason: Will see if Change-Id: Ia2a19a3fcd8808475a16d4d439e085e62a00dfdc works better

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
assignee: nobody → Brad P. Crochet (brad-9)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Dougal Matthews (<email address hidden>) on branch: master
Review: https://review.openstack.org/462056
Reason: Abandoning in favor of...
https://review.openstack.org/#/c/572448/
https://review.openstack.org/#/c/585904/

Dougal Matthews (d0ugal)
Changed in mistral:
milestone: rocky-3 → rocky-rc1
Dougal Matthews (d0ugal)
Changed in mistral:
milestone: rocky-rc1 → rocky-rc2
Dougal Matthews (d0ugal)
Changed in mistral:
milestone: rocky-rc2 → stein-1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/585904
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=0f17aecbe9a45744341fa4ca1a498de4f57ed531
Submitter: Zuul
Branch: master

commit 0f17aecbe9a45744341fa4ca1a498de4f57ed531
Author: Brad P. Crochet <email address hidden>
Date: Wed Jul 25 18:40:43 2018 -0400

    Use keystone group for loading auth params

    This really needs to come from mistral-extra.

    Change-Id: I8c1ff35df46347c2f247f74720942f9884908449
    Partial-Bug: #1595084
    Partial-Bug: #1761050

Changed in tripleo:
milestone: stein-1 → stein-2
Dougal Matthews (d0ugal)
Changed in mistral:
milestone: stein-1 → stein-2
Brad P. Crochet (brad-9)
Changed in tripleo:
assignee: Brad P. Crochet (brad-9) → nobody
Changed in mistral:
assignee: Brad P. Crochet (brad-9) → nobody
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in mistral:
milestone: stein-2 → stein-3
Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in mistral:
milestone: stein-3 → train-1
Changed in tripleo:
milestone: stein-rc1 → train-1
OpenStack Infra (hudson-openstack) wrote : Change abandoned on mistral (master)

Change abandoned by Brad P. Crochet (<email address hidden>) on branch: master
Review: https://review.opendev.org/572448

Changed in tripleo:
milestone: train-1 → train-2
kobig (kobi.ginon) wrote :

Hi @d0ugal, @alex-schultz,
I wonder why this change was abandoned (unless I misunderstood)?
I'm seeing the same errors in the same scenario, with a containerized undercloud
on the Rocky version.
The scenario mainly reproduces on large deployments (more than 30 blades).
Can you suggest a solution?

Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Changed in mistral:
milestone: train-1 → ussuri-1
Changed in mistral:
milestone: ussuri-1 → ussuri-2
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
Renat Akhmerov (rakhmerov) wrote :

Hi, we're looking for volunteers to take this bug, ideally from the TripleO team. Is anybody interested? The Mistral core team likely won't be able to take it in Ussuri 2.

Changed in mistral:
milestone: ussuri-2 → ussuri-3
milestone: ussuri-3 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
Changed in mistral:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
Changed in mistral:
milestone: ussuri-3 → ussuri-rc1
Changed in mistral:
milestone: ussuri-rc1 → ussuri-rc2
Changed in mistral:
milestone: ussuri-rc2 → none
milestone: none → victoria-1
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in mistral:
milestone: victoria-1 → wallaby-1
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Marios Andreou (marios-b) wrote :

This is an automated action. Bug status has been set to 'Incomplete' and target milestone has been removed due to inactivity. If you disagree please re-set these values and reach out to us on freenode #tripleo

Changed in tripleo:
milestone: wallaby-3 → none
status: In Progress → Incomplete