Heat fails to re-authenticate when faced with authentication failure during stack operations

Bug #1306294 reported by Clint Byrum
This bug affects 4 people
Affects         Status        Importance  Assigned to   Milestone
OpenStack Heat  Fix Released  Medium      Steven Hardy
tripleo         Fix Released  High        Unassigned

Bug Description

We recently created a stack which tried to create 41 servers all at once using nova baremetal. Exactly 1 hour after the first server went 'CREATE_IN_PROGRESS', we received a 401 error from Nova because our token had expired. Heat should re-authenticate with keystone and continue working, but instead it failed the stack creation.

Changed in heat:
milestone: none → juno-1
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

Most likely we need to discard the stack-create supplied token very early and create a trust token to perform the create. Then the token expiry can be stored in the context, and a new trust token can be created when within some threshold of the expiry time
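
The threshold idea in the comment above can be sketched roughly as follows. This is only an illustration, not Heat's code: the dict-based context, the `issue_token` callable, and the 300-second default are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_about_to_expire(expires_at, threshold_seconds, now=None):
    """Return True when the token expires within threshold_seconds of now."""
    now = now or datetime.now(timezone.utc)
    return expires_at - now <= timedelta(seconds=threshold_seconds)


def current_token(context, issue_token, threshold_seconds=300, now=None):
    """Return a usable token, replacing the stored one if it is nearly expired.

    context is a plain dict with 'token' and 'expires_at' keys (hypothetical).
    issue_token is a callable returning (token, expires_at); it stands in for
    whatever would request a new trust-scoped token from keystone.
    """
    if is_about_to_expire(context["expires_at"], threshold_seconds, now):
        context["token"], context["expires_at"] = issue_token()
    return context["token"]
```

Keeping the expiry next to the token means the check costs nothing until the threshold is reached, so a new token is requested only when one is actually needed.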

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93799

Changed in tripleo:
assignee: nobody → Gregory Haynes (greghaynes)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/93848

Changed in tripleo:
assignee: Gregory Haynes (greghaynes) → Robert Collins (lifeless)
Steven Hardy (shardy)
Changed in heat:
assignee: nobody → Steven Hardy (shardy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (master)

Reviewed: https://review.openstack.org/93799
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=0bd3fa87d6bfe857acb67fbfb97270a218da74a6
Submitter: Jenkins
Branch: master

commit 0bd3fa87d6bfe857acb67fbfb97270a218da74a6
Author: Gregory Haynes <email address hidden>
Date: Thu May 15 13:34:32 2014 -0700

    Error out when complete waitcondition fails

    When we curl to the complete waitcondition url we don't consider heat
    returning an error as reason to exit with error. This prevents us from
    attempting to re-run the configuration and thus we never signal heat
    again.

    Change-Id: I915825d76ec889bc09b81ad93f70b34d909262b0
    Closes-bug: #1306294
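
The behaviour this commit describes (treat an error response from the waitcondition URL as a failure so the caller can retry) might look like the sketch below. The real element uses curl in a shell script; this Python version, the `signal_complete` name, the POST method, and the JSON body are assumptions made purely for illustration.

```python
import urllib.error
import urllib.request


def signal_complete(url, body=b"{}", opener=urllib.request.urlopen):
    """POST to a completion URL; return True only if the server accepted it.

    urlopen raises HTTPError for 4xx/5xx responses and URLError for
    connection problems; both are reported here as failure.
    """
    request = urllib.request.Request(url, data=body, method="POST")
    try:
        with opener(request) as response:
            return 200 <= response.status < 300
    except urllib.error.URLError:
        # HTTPError is a subclass of URLError, so a 401 lands here too.
        return False
```

A caller would exit non-zero when this returns False, which lets the configuration run again and signal heat on a later attempt.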

Changed in tripleo:
status: In Progress → Fix Released
Changed in tripleo:
status: Fix Released → Triaged
Revision history for this message
Steven Hardy (shardy) wrote :

I'm looking into solving the heat part of this by re-authenticating via the stored trust_id.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/96222

Changed in tripleo:
status: Triaged → In Progress
Changed in heat:
status: Triaged → In Progress
Revision history for this message
Steven Hardy (shardy) wrote :

Related patch proposed to keystoneclient: https://review.openstack.org/#/c/96298/

I'm still working on the heat part of the fix, but hit a keystone bug which has delayed things a little.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (master)

Reviewed: https://review.openstack.org/93848
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=8c84cf691ea8c9f6100cfcbf7d1942fdf1384f79
Submitter: Jenkins
Branch: master

commit 8c84cf691ea8c9f6100cfcbf7d1942fdf1384f79
Author: Robert Collins <email address hidden>
Date: Fri May 16 16:06:36 2014 +1200

    Workaround Heat not handling token expiry

    Heat cannot handle a token expiring mid-deploy. Until the heat bug is
    fixed we can workaround it by upping the token expiry time. The value
    of 4 hours is higher than the time we need to deploy a 30 node cloud
    of real hardware, and less than the 5 hour upper maximum suggested by
    Morgan Fainberg, Keystone core.

    Change-Id: Ib079f8880b3672ac442804557baca954e8af87ff
    Partial-Bug: #1306294
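
For reference, keystone reads the token lifetime in seconds from the `expiration` option in the `[token]` section of keystone.conf, and the default of 3600 matches the one-hour failure in the bug description. The sketch below only works out the 4-hour value and renders the resulting fragment; how tripleo-image-elements actually writes the file is not shown in this report.

```python
import configparser
import io

TOKEN_LIFETIME_HOURS = 4  # value chosen in the commit message above

config = configparser.ConfigParser()
# keystone expects the lifetime in seconds: 4 * 60 * 60 = 14400.
config["token"] = {"expiration": str(TOKEN_LIFETIME_HOURS * 60 * 60)}

buffer = io.StringIO()
config.write(buffer)
fragment = buffer.getvalue()  # text of the [token] section
```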

Thierry Carrez (ttx)
Changed in heat:
milestone: juno-1 → juno-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Steven Hardy (<email address hidden>) on branch: master
Review: https://review.openstack.org/96222
Reason: Abandoning as I have another patch series in progress which will solve this

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/99728

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/99728
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=0b91f35c014728afbe53dcb20e710578ab37d401
Submitter: Jenkins
Branch: master

commit 0b91f35c014728afbe53dcb20e710578ab37d401
Author: Steven Hardy <email address hidden>
Date: Wed Jun 11 18:11:12 2014 +0100

    parser.Stack add stored_context

    Adds a new stored_context method, which may allow for some cleanup
    in service.py, and also enable switching from the request context
    to the stored context when needed, e.g when the current context
    auth_token is about to expire.

    Change-Id: I7b797719036238c424905a5d2555dc00e40f4e89
    Partial-Bug: #1306294

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/100365

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/100366

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/100367

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/100700

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/100701

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/99730
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=049f251d5c602810057f9e2bd79b379e456e6928
Submitter: Jenkins
Branch: master

commit 049f251d5c602810057f9e2bd79b379e456e6928
Author: Steven Hardy <email address hidden>
Date: Thu Jun 12 14:42:12 2014 +0100

    parser.Stack add use_stored_context option

    Add an option to switch to a context based on the user_creds data
    at constructor time, which allows a cleaner interface for selecting
    the stored context than creating it before creating the Stack object.

    Partial-Bug: #1306294

    Change-Id: I44df4e142dfbba1e70199b0d8f14c910f028a0f1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/99731
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=c6a1db424340ecee228e78f4993acf3cf937dc60
Submitter: Jenkins
Branch: master

commit c6a1db424340ecee228e78f4993acf3cf937dc60
Author: Steven Hardy <email address hidden>
Date: Thu Jun 12 15:15:38 2014 +0100

    Convert service.py to use_stored_context

    Related-Bug: #1306294

    Change-Id: Ia75126ad77e6cad187a1953c2fb43eceb50299f2

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/100365
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=d5095309cb6bb4cef279b25fd87c7d0a3eff53e4
Submitter: Jenkins
Branch: master

commit d5095309cb6bb4cef279b25fd87c7d0a3eff53e4
Author: Steven Hardy <email address hidden>
Date: Mon Jun 16 18:37:27 2014 +0100

    Remove test_autoscaling _stub_validate

    The _stub_validate is stubbing the wrong thing, it stubs the top-level
    validate which hides the fact that the image validation fails for all
    delete operations, where the validation always fails because the call
    to glance is not correctly stubbed.

    Partial-Bug: #1306294

    Change-Id: Ibe3480f5ff358c00a4ad3c2df9b6cd7d15bda71d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/100366
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=4320178066565608351c37a720922cbd36ac1f03
Submitter: Jenkins
Branch: master

commit 4320178066565608351c37a720922cbd36ac1f03
Author: Steven Hardy <email address hidden>
Date: Mon Jun 16 19:08:49 2014 +0100

    test_autoscaling refactor suspend/resume stubbing

    Rework to encapsulate the suspend/resume stubbing in functions and add a
    stub of the image validation, which is required because the properties
    get revalidated when we create the events associated with the state
    transition (ref bug #1324102).

    With the current keystoneclient mocking this is not visible in the
    tests because the error creating the client to do the validation is
    obscured by the fact that the Event constructor tolerates a validation
    error and the error creating the client is just treated as an unexpected
    validation error by the CustomConstraint code.

    Moving to a model where clients.keystone() is consistently mocked
    exposes the fact that we will try to connect to glance on suspend when
    the event is created if the validation is not correctly stubbed.

    Partial-Bug: #1306294

    Change-Id: I001d3a432397d4cdfaa62145228aad7eaf051b98

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/100367
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=f9d9fae2337354f37e948429a08584e7bc2d3ee0
Submitter: Jenkins
Branch: master

commit f9d9fae2337354f37e948429a08584e7bc2d3ee0
Author: Steven Hardy <email address hidden>
Date: Mon Jun 16 18:50:08 2014 +0100

    tests add stub_keystoneclient to base test class

    Add convenience function to allow easy stubbing of keystoneclient
    in all tests where explicit control and verification is not required.
    This is a precursor to making all clients use the auth_token property
    from keystoneclient (instead of using the context auth_token in
    preference) such that trust tokens can be refreshed before expiry.

    Partial-Bug: #1306294

    Change-Id: I9ba4595b8750ff769e76972cc30b55a68253e76d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/100700
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e39f80d25de12895410ce7d0715e55eeb8ca0965
Submitter: Jenkins
Branch: master

commit e39f80d25de12895410ce7d0715e55eeb8ca0965
Author: Steven Hardy <email address hidden>
Date: Tue Jun 17 16:43:13 2014 +0100

    engine.clients always use keystoneclient auth_token

    Always get the token from keystoneclient, even if there is one in
    the request context, as if the context contains a username and
    password, there is the possibility the keystoneclient auth_token
    may be a reissued token which should be preferred to the one in
    the context which may have expired.

    Change-Id: Icbce9ae7a8ff7cad5eadec3de4f69b977949c265
    Partial-Bug: #1306294

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/100701
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=a6a4920b8d68a29c7546a41b922be1612c1f6895
Submitter: Jenkins
Branch: master

commit a6a4920b8d68a29c7546a41b922be1612c1f6895
Author: Steven Hardy <email address hidden>
Date: Tue Jun 17 17:05:45 2014 +0100

    Don't set context auth_token in heat_keystoneclient

    Now that all clients refer directly to the auth_token in the auth_ref
    there's no need to set the auth_token in the context when consuming
    a trust.

    Change-Id: I2fe58de7b950a65672b490f1769c88dedda3895b
    Partial-Bug: #1306294

Revision history for this message
Steven Hardy (shardy) wrote :

Still working on the final part of this fix, which should be ready early in J3, it's not going to make J2 though unfortunately, so bumping the milestone target.

Changed in heat:
milestone: juno-2 → juno-3
Revision history for this message
Steven Hardy (shardy) wrote :

To capture a discussion re this bug with Robert at the tripleo meetup:

1. TripleO is not yet using trusts, so we need to get them using deferred_auth_method=trusts before we can switch to the trusts stored context, and the easiest way to do that is probably by fixing bug #1286157

2. Switching to the existing (password) stored context will only be possible if username/password are provided in the create request context, which will only be enforced by heat if one of the resources requiring deferred auth are used in the tripleo templates (not clear to me atm if they are or not):

https://github.com/openstack/heat/blob/master/heat/engine/service.py#L482

3. There is a desire for a config option which forces heat to switch to the stored context immediately on any create/update - this will probably (at least initially) be disabled by default to avoid reintroducing the overhead of always getting a new token, ref bug #1324102. It may be necessary in future for convergence agents etc to use the stored context, but for now it's desirable to default to reusing the request token where possible.

Remaining question is whether we allow the option to switch to the stored context during a create, if the global switch isn't set and the token expires? My expectation is that we will, or at least that it will be possible via some setting of the config file option?

Steven Hardy (shardy)
Changed in heat:
milestone: juno-3 → juno-rc1
Revision history for this message
Steven Hardy (shardy) wrote :

So, I've been working on and thinking about this recently, and a related discussion has happened on the ML:

http://lists.openstack.org/pipermail/openstack-dev/2014-September/045585.html

Summary:

1. Heat is not the only service which has this problem
2. Many deployers are already increasing token expiry to work around this issue
3. There's no good solution atm for services to do long-running operations via token auth
4. Passing username/password is "not acceptable", which obviously we do but should not rely on to solve this problem
5. Switching to a trust to defeat token expiry is probably/possibly not legit, or at the least a bit yucky and not something to do by default.
6. Long-lived "service tokens" have been mentioned, but atm I don't think really solve this problem
7. Users passing a trust in via a header may possibly provide a future solution, modulo some stuff we'd need to add to trusts ref the ML.

Based on the thread, I'm not really sure how to proceed. I'm hesitant to propose a quick-fix for the heat side of this (which would involve either using the trust or username/password to circumvent token expiry), as the consensus of the discussion seems to be:

1. Stopping doing stuff when credentials expire is probably reasonable and the right thing to do at a service level
2. Clients can work around the problem potentially by allowing retry logic to request a new token if the client uses sessions initialized with username/password (this is not helpful to a service implementor as we're expected to work with only a token)
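
The client-side retry mentioned in point 2 could be sketched as below. It is a generic illustration, not keystoneclient's session code: the `Unauthorized` exception, `call_with_reauth`, and the `get_token(force_new=...)` callable are hypothetical names, and the approach only works when the caller holds credentials (such as a username and password) that can mint a new token.

```python
class Unauthorized(Exception):
    """Stand-in for an HTTP 401 raised by a service client (hypothetical)."""


def call_with_reauth(operation, get_token, max_attempts=2):
    """Run operation(token); on Unauthorized, fetch a new token and retry.

    get_token(force_new=bool) returns a token string; force_new=True asks
    for a newly issued token instead of a cached one.
    """
    for attempt in range(max_attempts):
        token = get_token(force_new=attempt > 0)
        try:
            return operation(token)
        except Unauthorized:
            # Give up once the final attempt has also been rejected.
            if attempt == max_attempts - 1:
                raise
```

As the comment notes, this does not help a service that is handed only a token, because it has nothing to re-authenticate with.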

Robert, can you provide feedback (or weigh in on the ML) as to the priority of the heat part of this from a TripleO perspective? Is doing nothing in heat right now and living with the extended token expiry an option?

Revision history for this message
Steven Hardy (shardy) wrote :

btw the alternative to doing nothing is probably implement "config option which forces heat to switch to the stored context immediately on any create/update" from #25 above, defaulted to off, which would probably "solve" the problem provided TripleO is passing a username/password into heat.

Revision history for this message
Steven Hardy (shardy) wrote :

Ok no feedback on this and no consensus on the ML thread referenced above on the right way to handle this, so I think our only option is to defer any action until K and continue the discussion with the keystone folks about the right way to handle this.

Changed in heat:
milestone: juno-rc1 → next
milestone: next → kilo-1
Steven Hardy (shardy)
tags: added: tripleo
Angus Salkeld (asalkeld)
Changed in heat:
milestone: kilo-1 → kilo-2
Steven Hardy (shardy)
Changed in heat:
milestone: kilo-2 → kilo-3
Angus Salkeld (asalkeld)
Changed in heat:
milestone: kilo-3 → next
Angus Salkeld (asalkeld)
Changed in heat:
importance: High → Medium
Changed in tripleo:
assignee: Robert Collins (lifeless) → nobody
Changed in heat:
status: In Progress → Triaged
Changed in heat:
assignee: Steven Hardy (shardy) → Oleksii Chuprykov (ochuprykov)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/235516

Changed in heat:
assignee: Oleksii Chuprykov (ochuprykov) → Steven Hardy (shardy)
Changed in heat:
assignee: Steven Hardy (shardy) → Oleksii Chuprykov (ochuprykov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Steven Hardy (<email address hidden>) on branch: master
Review: https://review.openstack.org/235516
Reason: Obsoleted by https://review.openstack.org/#/c/226384/

Changed in heat:
assignee: Oleksii Chuprykov (ochuprykov) → Steven Hardy (shardy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/226384
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=33bf717bbcdbdf5db47dc54f10ed1e33fdb0e5c9
Submitter: Jenkins
Branch: master

commit 33bf717bbcdbdf5db47dc54f10ed1e33fdb0e5c9
Author: Oleksii Chuprykov <email address hidden>
Date: Tue Nov 10 17:15:59 2015 +0200

    Reauthenticate on token expiration

    Add new config option reauthentication_auth_plugin
    for enforce usage of trusts_auth_plugin for making
    authenticated requests.
    Also add stale_token_duration option for determing
    time in seconds before token expiration for considering
    such token as about to expire.
    Try to get new token if old token is about to expire.
    Add reauthentication for stack create/update actions.

    Change-Id: Id7d6692128964f4ef35762d43ef2738df0d83f4b
    Co-Authored-By: Steven Hardy <email address hidden>
    Closes-Bug: #1306294

Changed in heat:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/252313

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/heat 6.0.0.0b1

This issue was fixed in the openstack/heat 6.0.0.0b1 development milestone.

Changed in heat:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/liberty)

Reviewed: https://review.openstack.org/252313
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=4d4846e192f51263a3c5849b4fb3dcaffcfa73af
Submitter: Jenkins
Branch: stable/liberty

commit 4d4846e192f51263a3c5849b4fb3dcaffcfa73af
Author: Oleksii Chuprykov <email address hidden>
Date: Tue Nov 10 17:15:59 2015 +0200

    Reauthenticate on token expiration

    Add new config option reauthentication_auth_plugin
    for enforce usage of trusts_auth_plugin for making
    authenticated requests.
    Also add stale_token_duration option for determing
    time in seconds before token expiration for considering
    such token as about to expire.
    Try to get new token if old token is about to expire.
    Add reauthentication for stack create/update actions.

    Change-Id: Id7d6692128964f4ef35762d43ef2738df0d83f4b
    Co-Authored-By: Steven Hardy <email address hidden>
    Closes-Bug: #1306294
    (cherry picked from commit 33bf717bbcdbdf5db47dc54f10ed1e33fdb0e5c9)

tags: added: in-stable-liberty
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/heat 5.0.1

This issue was fixed in the openstack/heat 5.0.1 release.

Revision history for this message
Brent Eagles (beagles) wrote :

If heat is handling this, is there anything left to do on tripleo?

Revision history for this message
Brent Eagles (beagles) wrote :

Added to https://etherpad.openstack.org/p/tripleo-bug-cleanup-2016 to address if we get together for a sweep.

Revision history for this message
Emilien Macchi (emilienm) wrote :

This bug is > 365 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in tripleo:
status: In Progress → Incomplete
Revision history for this message
Ben Nemec (bnemec) wrote :

According to the bug status, heat fixed this. Updating the tripleo bug to match.

Changed in tripleo:
status: Incomplete → Fix Released