Machines are killed if mongo fails
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | Critical | Ian Booth | |
| 1.18 | | Undecided | Unassigned | |
Bug Description
If Juju loses connection with mongo, the provisioner can interpret that as being the same as all machines being deleted. This is bad because it then goes and kills all running instances because it thinks they shouldn't be there anymore.
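The failure mode described above can be illustrated with a minimal sketch. This is not Juju's actual provisioner code; the function names, the instance IDs, and the assumption that a failed read is collapsed into an empty machine list are all hypothetical, chosen only to show why "could not read state" and "state has no machines" must not look the same.

```python
def instances_to_stop(state_machines, cloud_instances):
    """Return the cloud instances that have no matching machine in state."""
    known = set(state_machines)
    return sorted(i for i in cloud_instances if i not in known)


def read_machines_swallowing_errors(fetch):
    """Hypothetical buggy reader: a failed read collapses into an empty list."""
    try:
        return fetch()
    except ConnectionError:
        return []  # indistinguishable from "every machine was deleted"


def broken_fetch():
    """Stand-in for a state query made while mongo is unreachable."""
    raise ConnectionError("lost connection to mongo")


cloud = ["i-1", "i-2", "i-3"]  # made-up instance IDs
machines = read_machines_swallowing_errors(broken_fetch)
doomed = instances_to_stop(machines, cloud)
print(doomed)  # ['i-1', 'i-2', 'i-3']: every running instance is marked for removal
```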
Related to : https:/
| tags: | added: landscape maas-provider |
| Mark Ramm (mark-ramm) wrote : | #1 |
| summary: |
- Make provisioner-safe-mode defaults to True on MAAS provider. + Machines are killed if mongo fails |
| Ian Booth (wallyworld) wrote : | #2 |
There are 2 issues here:
1. Juju lost connection with mongo, apparently because mongo stopped functioning
2. The lost connection resulted in Juju thinking that no machines existed in state. The default Juju behaviour is to make the instances in the cloud match the machines in state. Hence Juju went and destroyed the running cloud instances.
We can and will implement a fix for #2 above - that is a clear Juju bug.
If the Juju bug is fixed, there's no pressing need for any change to MAAS configuration. The original purpose of the safe mode was to be used when restore of a backup is running since during that time the state database will not be up to date. I'm not sure we want to spend engineering effort implementing a default configuration change that will make MAAS different to other providers. Such inconsistency adds to support costs etc etc.
So I'd like agreement that we not do any MAAS configuration changes, just fix the Juju bug. We can and should recommend that MAAS deployments turn on safe mode = true until the bug is fixed. And we can have a separate debate about whether safe mode should be the default for all environments.
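As an illustration only (not the fix that actually landed in juju-core), the general shape of a guard for issue 2 might look like the sketch below. The names `StateUnavailable` and `plan_stops` are invented for this example, and `safe_mode` is modelled in a deliberately simplified way, as "never stop instances that are missing from state", which is an assumption about the setting rather than a description of its real semantics.

```python
class StateUnavailable(Exception):
    """Raised when the state database cannot be read (hypothetical name)."""


def plan_stops(fetch_machines, cloud_instances, safe_mode=False):
    """Decide which cloud instances to stop, refusing to act on a failed read."""
    try:
        machines = set(fetch_machines())
    except ConnectionError as err:
        # A failed read means "unknown", not "empty": stop nothing.
        raise StateUnavailable("state unreadable; stopping nothing") from err
    if safe_mode:
        # Simplified assumption: safe mode leaves unknown instances alone.
        return []
    return sorted(i for i in cloud_instances if i not in machines)
```

With this shape, a caller that sees `StateUnavailable` can retry later instead of reconciling against an empty view of state, regardless of whether safe mode is on.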
| Changed in juju-core: | |
| milestone: | none → 1.20.1 |
| importance: | Undecided → Critical |
| Changed in juju-core: | |
| assignee: | nobody → Ian Booth (wallyworld) |
| status: | New → In Progress |
| tags: | added: canonical-is |
| Changed in juju-core: | |
| status: | In Progress → Fix Committed |
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |
| John A Meinel (jameinel) wrote : | #3 |
I agree it shouldn't be a MaaS specific change, but we could set safe-mode default True everywhere.
| description: | updated |
| no longer affects: | juju-core/1.18 |
| Raphaël Badin (rvb) wrote : | #4 |
> I'm not sure we want to spend engineering effort implementing a default configuration change that will make MAAS
> different to other providers. Such inconsistency adds to support costs etc etc.
> So I'd like agreement that we not do any MAAS configuration changes, just fix the Juju bug
Even though we saw this problem happen with MAAS, my understanding is that this bug affects *all* the providers, right? If that is the case, then I see no reason why the MAAS provider should be different from the other providers.
> We can and should recommend that MAAS deployments turn on safe mode = true until the bug is fixed.
Agreed, sounds like the best course of action (change the Juju config on the existing deployments, recommend turning on the safe mode until the bug is fixed and of course, fix the bug itself).
| Nate Finch (natefinch) wrote : | #5 |
Retargeted to 1.18 because the bug almost certainly exists in 1.18. Marked as won't fix, because we can't release new versions of 1.18 (also why there's no milestone set).
| Adam Conrad (adconrad) wrote : | #6 |
"Marked as won't fix, because we can't release new versions of 1.18."
Can't, or won't?
| Curtis Hovey (sinzui) wrote : | #7 |
1.18 is based on bzr; the current release and test infrastructure is based on git and 1.20.+
| Adam Conrad (adconrad) wrote : | #8 |
That still sounds like a "won't", not a "can't". Surely, there's a way to do another release of an older branch and test it.
| no longer affects: | juju-core (Ubuntu) |
| Adam Conrad (adconrad) wrote : | #9 |
Oh wait, never mind. I somehow completely missed the part where this bug *is* fixed in trusty already, which was my concern above. So, if it's not being updated correctly in production, I assume this is because juju's updates are out-of-band, rather than using the archive? :/


Juju should not be triggering MAAS release commands just because it goes down, even with safe-mode set to False. Are there any details available on what is actually happening?
But I agree that defaulting to True on MAAS is a good first step.