Machines are killed if mongo fails

Bug #1339770 reported by Jorge Niedbalski
This bug affects 9 people
Affects      Status         Importance   Assigned to   Milestone
juju-core    Fix Released   Critical     Ian Booth
1.18         Won't Fix      Undecided    Unassigned

Bug Description

If Juju loses connection with mongo, the provisioner can interpret that as being the same as all machines being deleted. This is bad because it then goes and kills all running instances because it thinks they shouldn't be there anymore.

Related to : https://bugs.launchpad.net/juju-core/+bug/1254729

tags: added: landscape maas-provider
Mark Ramm (mark-ramm) wrote :

Juju should not be triggering MAAS release commands just because it goes down, even with safe-mode set to False. Are there any details available on what is actually happening?

But I agree that defaulting to True on MAAS is a good first step.

Ian Booth (wallyworld)
summary: - Make provisioner-safe-mode defaults to True on MAAS provider.
+ Machines are killed if mongo fails
Ian Booth (wallyworld) wrote :

There are 2 issues here:

1. Juju lost connection with mongo, apparently because mongo stopped functioning
2. The lost connection resulted in Juju thinking that no machines existed in state. The default Juju behaviour is to make the instances in the cloud match the machines in state. Hence Juju went and destroyed the running cloud instances.

We can and will implement a fix for #2 above - that is a clear Juju bug.

If the Juju bug is fixed, there's no pressing need for any change to MAAS configuration. The original purpose of safe mode was for use while a backup is being restored, since during that time the state database will not be up to date. I'm not sure we want to spend engineering effort implementing a default configuration change that will make MAAS different to other providers. Such inconsistency adds to support costs etc etc.

So I'd like agreement that we not do any MAAS configuration changes, just fix the Juju bug. We can and should recommend that MAAS deployments turn on safe mode = true until the bug is fixed. And we can have a separate debate about whether safe mode should be the default for all environments.

Changed in juju-core:
milestone: none → 1.20.1
importance: Undecided → Critical
Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: New → In Progress
James Troup (elmo)
tags: added: canonical-is
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
John A Meinel (jameinel) wrote :

I agree it shouldn't be a MAAS-specific change, but we could set the safe-mode default to True everywhere.

Ian Booth (wallyworld)
description: updated
Curtis Hovey (sinzui)
no longer affects: juju-core/1.18
Raphaël Badin (rvb) wrote :

> I'm not sure we want to spend engineering effort implementing a default configuration change that will make MAAS
> different to other providers. Such inconsistency adds to support costs etc etc.

> So I'd like agreement that we not do any MAAS configuration changes, just fix the Juju bug

Even if we saw this problem happen with MAAS, my understanding is that this bug is affecting *all* the providers, right? If this is the case, then I see no reason why the MAAS provider should be different from the other providers.

> We can and should recommend that MAAS deployments turn on safe mode = true until the bug is fixed.

Agreed, sounds like the best course of action (change the Juju config on the existing deployments, recommend turning on the safe mode until the bug is fixed and of course, fix the bug itself).

Nate Finch (natefinch) wrote :

Retargeted to 1.18 because the bug almost certainly exists in 1.18. Marked as won't fix, because we can't release new versions of 1.18 (also why there's no milestone set).

Adam Conrad (adconrad) wrote :

"Marked as won't fix, because we can't release new versions of 1.18."

Can't, or won't?

Curtis Hovey (sinzui) wrote :

1.18 is based on bzr; the current release and test infrastructure is based on git and 1.20.+

Adam Conrad (adconrad) wrote :

That still sounds like a "won't", not a "can't". Surely, there's a way to do another release of an older branch and test it.

no longer affects: juju-core (Ubuntu)
Adam Conrad (adconrad) wrote :

Oh wait, never mind. I somehow completely missed the part where this bug *is* fixed in trusty already, which was my concern above. So, if it's not being updated correctly in production, I assume this is because juju's updates are out-of-band, rather than using the archive? :/

Eric Snow (ericsnowcurrently) wrote :