missing credential stops upgrade from running

Bug #1700434 reported by Ian Booth
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Andrew Wilkins

Bug Description

A controller gets stuck upgrading if a model has a missing credential.
In this case, the model for which the credential is missing is dying.

2017-06-26 01:41:10 DEBUG juju.worker.dependency engine.go:499 "upgrade-steps-flag" manifold worker stopped: "upgrade-steps-gate" not running: dependency not available
2017-06-26 01:41:10 DEBUG juju.worker.dependency engine.go:499 "upgrade-steps-runner" manifold worker stopped: "agent" not running: dependency not available
2017-06-26 01:41:10 DEBUG juju.worker.dependency engine.go:485 "upgrade-steps-gate" manifold worker started
2017-06-26 01:41:10 DEBUG juju.worker.dependency engine.go:499 "upgrade-steps-runner" manifold worker stopped: <nil>
2017-06-26 01:41:10 DEBUG juju.worker.dependency engine.go:499 "upgrade-steps-runner" manifold worker stopped: "api-caller" not running: dependency not available
2017-06-26 01:41:10 DEBUG juju.worker.dependency engine.go:485 "upgrade-steps-flag" manifold worker started
2017-06-26 01:41:37 DEBUG juju.worker.dependency engine.go:485 "upgrade-steps-runner" manifold worker started
2017-06-26 01:41:37 INFO juju.worker.upgradesteps worker.go:250 checking that upgrade can proceed
2017-06-26 01:41:37 INFO juju.upgrade preupgradesteps.go:32 updating distro-info
2017-06-26 01:41:38 DEBUG juju.worker.dependency engine.go:499 "migration-fortress" manifold worker stopped: "upgrade-steps-flag" not set: dependency not available
2017-06-26 01:41:45 INFO juju.worker.upgradesteps worker.go:259 signalling that this controller is ready for upgrade
2017-06-26 01:41:45 INFO juju.worker.upgradesteps worker.go:267 waiting for other controllers to be ready for upgrade
2017-06-26 01:41:45 INFO juju.worker.upgradesteps worker.go:286 finished waiting - all controllers are ready to run upgrade steps
2017-06-26 01:41:45 INFO juju.worker.upgradesteps worker.go:345 starting upgrade from 2.1.3 to 2.2.1 for "machine-0"
2017-06-26 01:43:49 ERROR juju.worker.upgradesteps worker.go:375 upgrade from 2.1.3 to 2.2.1 for "machine-0" failed (will retry): cloud credential "azure/someuser@external/MicrosoftAzure12345" not found
2017-06-26 01:49:50 DEBUG juju.worker.dependency engine.go:499 "upgrade-steps-runner" manifold worker stopped: <nil>

Andrew Wilkins (axwalk) wrote :

Having a model that's Alive or Dying but without credentials is a problem: that will prevent destroying the model's machines and such. Also, the Azure provider in particular will not be able to call the "Environ.Destroy" method, and so the model's resource group won't be deleted. It may be that we'll want to allow that, but we'll need to be able to make it clear that the model is disabled.

The other part is that we probably shouldn't block the entire controller upgrade because of one model. An alternative would be to do the environ upgrades later, when the model workers start. We would have to gate any Environ-related operations on the environ being upgraded.

Casey Marshall (cmars) wrote :

I believe this happened because we currently do not check whether a credential is in use before removing it with RevokeCredential. I've confirmed that RevokeCredential is used in the GUI to remove credentials directly. This can easily put affected models in a "stuck dying" state.

An incremental improvement here would be to add a "force" flag to RevokeCredential. If not set, the method will return an error if the credential is in use by any models, which can then be relayed back to the user as a prompt to clean up models.

Forcing a revoke would still leave models in a stuck-dying state, so we'll still want to explore the other possible improvements above. Updating credentials such that the provider loses access to the required resources could also leave a model in such a state.

Tim Penhey (thumper) wrote :

Either that or we just don't allow the removal of credentials from a model while it is not dead.

Tim Penhey (thumper) wrote :

I wonder if there is call for something like "abandon-model", which tells juju to just stop trying to deal with it. The user takes responsibility for taking down the machines.

juju could then just clean up the DB, and remove references to it.

Ian Booth (wallyworld) wrote :
Changed in juju:
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released