Master promotion: error creating the default Deployment Plan overcloud

Bug #1733345 reported by Attila Darazs on 2017-11-20
Affects    Importance    Assigned to
Mistral    Critical      Unassigned
tripleo    Critical      wes hayutin

Bug Description

Periodic promotion jobs on master started failing with:

2017-11-20 14:31:34 | RuntimeError: ERROR error creating the default Deployment Plan overcloud Check the create_default_deployment_plan execution in Mistral with openstack workflow execution list Mistral execution ID: e0752a69-209f-4f2b-a118-230860a2a4de

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset006-master/6a4192c/undercloud/home/jenkins/undercloud_install.log.txt.gz#_2017-11-20_14_31_34

In mistral, the following error can be seen:

2017-11-20 14:31:29.932 32344 INFO workflow_trace [req-c13767cd-fa5c-4546-b3d0-1522e8b5ae9c 0782da8c4d764d47a2b8833f61b8d641 4238c32d43b142718113bdf1c0bac8fa - default default] Workflow 'tripleo.plan_management.v1.create_deployment_plan' [RUNNING -> ERROR, msg=Failure caused by error in tasks: notify_zaqar

[..]

 ZaqarAction.queue_post failed: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset006-master/6a4192c/undercloud/var/log/mistral/engine.log.txt.gz#_2017-11-20_14_31_29_932

Attila Darazs (adarazs) on 2017-11-20
summary: - Master promotion: ERROR error creating the default Deployment Plan
- overcloud
+ Master promotion: error creating the default Deployment Plan overcloud
Changed in tripleo:
importance: Undecided → Critical
tags: added: workflows
Adriano Petrich (apetrich) wrote :

On better investigation I think that the patch on comment #2 is not related.

It might be related to Swift. Well before that error, we start getting unauthorized errors from Swift:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset006-master/6a4192c/undercloud/var/log/mistral/engine.log.txt.gz#_2017-11-20_14_31_20_215

It fails for a HEAD request and later for a CreateContainerAction, so the deployment cannot work after that.

Dougal Matthews (d0ugal) wrote :

As Adriano spotted, I also think the Swift error is important.

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset006-master/6a4192c/undercloud/var/log/mistral/executor.log.txt.gz#_2017-11-20_14_31_20_140

It is almost like the user has no access to Zaqar or Swift now for some reason.

Dougal Matthews (d0ugal) wrote :

Could it be because we don't use sessions when creating the clients? https://github.com/openstack/mistral/blob/master/mistral/actions/openstack/actions.py

Brad P. Crochet (brad-9) wrote :

@d0ugal I think so, yes. Working on a patch now.

Changed in tripleo:
assignee: nobody → Brad P. Crochet (brad-9)
status: Triaged → In Progress
Dougal Matthews (d0ugal) wrote :

We also need to make sure we update the recent addition to tripleo-common: https://github.com/openstack/tripleo-common/commit/e862155531ea91bdda6897fe1ad8b4fd230885db

Changed in mistral:
status: New → Confirmed
importance: Undecided → High

Reviewed: https://review.openstack.org/521611
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=bec878eb89ff608069e4ed585f0d605a71c01d39
Submitter: Zuul
Branch: master

commit bec878eb89ff608069e4ed585f0d605a71c01d39
Author: Brad P. Crochet <email address hidden>
Date: Tue Nov 21 12:32:55 2017 -0500

    Switch zaqarclient and swiftclient to use a session

    Use a keystone session with zaqar and swift clients.

    Change-Id: I1be34d903b2785205c1f240095e52a63de795b8e
    Closes-Bug: #1733345

Changed in mistral:
status: Confirmed → Fix Released
Emilien Macchi (emilienm) wrote :

I'm closing the bug since the code has merged in Mistral; we now need a promotion.

tags: removed: alert
Changed in tripleo:
status: In Progress → Fix Released
Changed in tripleo:
status: Fix Released → In Progress
wes hayutin (weshayutin) on 2017-11-22
tags: added: alert

Fix proposed to branch: master
Review: https://review.openstack.org/522296

Changed in tripleo:
assignee: Brad P. Crochet (brad-9) → Adriano Petrich (apetrich)
Changed in tripleo:
assignee: Adriano Petrich (apetrich) → wes hayutin (weshayutin)

Reviewed: https://review.openstack.org/522296
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=75020f80e7856d184da8ffab7090d1a0566c0c37
Submitter: Zuul
Branch: master

commit 75020f80e7856d184da8ffab7090d1a0566c0c37
Author: Adriano Petrich <email address hidden>
Date: Wed Nov 22 16:42:16 2017 +0000

    Switch to use sessions on zaqar, nova and swift

    Tripleo-common clients were not using session. Add keystone sessions to
    some clients.

    Change-Id: Ic9cd67cce307d489a514254d3f3200a058a009a2
    Closes-Bug: #1733345

Changed in tripleo:
status: In Progress → Fix Released
Dougal Matthews (d0ugal) wrote :

Re-opened as we are still seeing this error. Neither of these patches resolved it.

Changed in tripleo:
status: Fix Released → In Progress
Dougal Matthews (d0ugal) wrote :

I have spent some time in an environment that reproduces this, and I've not learned very much. However, I have managed to rule out a few things...

- Every OpenStack client is broken, so this isn't specific to Swift or Zaqar. I tested this with "mistral run-action" on a number of actions, including "mistral.workflows_list"; they all fail with an auth error.
- I have tested the credentials in /etc/mistral/mistral.conf by exporting them in my bash session and using the CLI tools. Everything works as expected.
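For anyone repeating the credential check in the second bullet, here is a minimal sketch of turning the service credentials into shell exports. It is not taken from this report: the [keystone_authtoken] section name, the option names, the option-to-variable mapping, and every value in SAMPLE_CONF are assumptions for illustration, so adjust them to whatever the real /etc/mistral/mistral.conf contains.

```python
import configparser
import shlex

# Hypothetical sample standing in for /etc/mistral/mistral.conf.
# Section name, option names and values are assumptions, not from the bug report.
SAMPLE_CONF = """
[keystone_authtoken]
auth_url = http://192.0.2.1:35357
username = mistral
password = secret
project_name = service
user_domain_name = Default
project_domain_name = Default
"""

# Assumed mapping from config options to the OS_* variables the CLI tools read.
OPTION_TO_ENV = {
    "auth_url": "OS_AUTH_URL",
    "username": "OS_USERNAME",
    "password": "OS_PASSWORD",
    "project_name": "OS_PROJECT_NAME",
    "user_domain_name": "OS_USER_DOMAIN_NAME",
    "project_domain_name": "OS_PROJECT_DOMAIN_NAME",
}


def credentials_env(conf_text, section="keystone_authtoken"):
    """Return a dict of OS_* variables for the options present in the section."""
    # Interpolation is disabled so '%' characters in values are kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return {
        env_name: parser.get(section, option)
        for option, env_name in OPTION_TO_ENV.items()
        if parser.has_option(section, option)
    }


env = credentials_env(SAMPLE_CONF)
for name in sorted(env):
    # shlex.quote keeps values with spaces or shell metacharacters safe to paste.
    print(f"export {name}={shlex.quote(env[name])}")
```

Pasting the printed lines into a shell and then running a CLI command reproduces the manual test described above. If the CLI works with these values while Mistral actions still fail, the stored credentials are unlikely to be the problem.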

So it seems something is going wrong in Mistral, but I can't figure out why. Nothing has changed in the code, so is something different in the environment?

We could do with some input from somebody with better keystone knowledge than me (that isn't a high bar). I suspect they could help us figure out where things have gone wrong.

Thomas Herve (therve) wrote :

https://review.openstack.org/#/c/518244/ touches logging values, in particular the token, and this is used (wrongly) in mistral context: https://github.com/openstack/mistral/blob/master/mistral/context.py#L63 . We should use to_dict() instead, maybe filling what values could be used.
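To illustrate why this would break every client at once, here is a small sketch using a stand-in class. StandInContext and rebuild_context are hypothetical names and are not the oslo.context or Mistral code; masking the token with "***" is an assumption about how a log-safe view might hide it. The point is only that a context rebuilt from logging values no longer carries a usable token, while one rebuilt from to_dict() does.

```python
class StandInContext:
    """Hypothetical stand-in for a request context; not the oslo.context class."""

    def __init__(self, user_id, project_id, auth_token):
        self.user_id = user_id
        self.project_id = project_id
        self.auth_token = auth_token

    def to_dict(self):
        # Full serialization, including the token needed to authenticate later.
        return {
            "user_id": self.user_id,
            "project_id": self.project_id,
            "auth_token": self.auth_token,
        }

    def get_logging_values(self):
        # Log-safe view: same keys, but the token is masked (assumed behaviour).
        values = self.to_dict()
        values["auth_token"] = "***" if self.auth_token else None
        return values


def rebuild_context(serialized):
    """Recreate a context on the receiving side from a serialized dict."""
    return StandInContext(
        serialized["user_id"],
        serialized["project_id"],
        serialized["auth_token"],
    )


ctx = StandInContext("user-1", "project-1", "real-token")

# Serializing through the logging view loses the token.
broken = rebuild_context(ctx.get_logging_values())
# Serializing through to_dict() preserves it.
fixed = rebuild_context(ctx.to_dict())

print("via get_logging_values:", broken.auth_token)
print("via to_dict:", fixed.auth_token)
```

Any client built from the first context would send an invalid token. That would be consistent with the 401 responses seen from both Swift and Zaqar, and with valid credentials in mistral.conf making no difference.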

Reviewed: https://review.openstack.org/522822
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=a944cdb98e9c026d04cf08011172622890a9b51f
Submitter: Zuul
Branch: master

commit a944cdb98e9c026d04cf08011172622890a9b51f
Author: Thomas Herve <email address hidden>
Date: Fri Nov 24 14:52:42 2017 +0100

    Don't use oslo context get_logging_values

    get_logging_values has been changed recently to not pass the token
    anymore, call to_dict as expected.

    Change-Id: I3a7f1293a4d0082274af270f86b5c732d898f8bc
    Closes-Bug: #1733345

Dougal Matthews (d0ugal) on 2017-11-27
Changed in tripleo:
status: In Progress → Fix Released
Changed in mistral:
importance: High → Critical

This issue was fixed in the openstack/tripleo-common 8.2.0 release.

Reviewed: https://review.openstack.org/523096
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=0b39b5c8dfeadee47a5aa82bd605de532defcb4c
Submitter: Zuul
Branch: stable/pike

commit 0b39b5c8dfeadee47a5aa82bd605de532defcb4c
Author: Thomas Herve <email address hidden>
Date: Fri Nov 24 14:52:42 2017 +0100

    Don't use oslo context get_logging_values

    get_logging_values has been changed recently to not pass the token
    anymore, call to_dict as expected.

    Change-Id: I3a7f1293a4d0082274af270f86b5c732d898f8bc
    Closes-Bug: #1733345
    (cherry picked from commit a944cdb98e9c026d04cf08011172622890a9b51f)

tags: added: in-stable-pike

This issue was fixed in the openstack/mistral 6.0.0.0b2 development milestone.

This issue was fixed in the openstack/mistral 5.2.1 release.
