Client loses information about current model

Bug #1796148 reported by Jacek Nykis
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Anastasia
2.4             Fix Released  High        Anastasia

Bug Description

We have a large, shared juju controller with multiple models.

Recently, clients started losing information about the current model. When this happens, "juju status" returns:
ERROR current model for controller XXX not found

I can see model information in ~/.local/share/juju/models.yaml, but it's missing the "current-model" key.

"juju list-models" shows the model and when I run "juju switch $model" it works just fine:

XXX (controller) -> XXX:admin/YYY

After this operation current-model appears in models.yaml and juju status works fine.
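
For reference, a healthy models.yaml looks roughly like this (using the XXX/YYY placeholders from above; the uuid is made up and the exact layout may vary between client versions):

    controllers:
      XXX:
        models:
          admin/YYY:
            uuid: aaaabbbb-cccc-dddd-eeee-ffff00001111
            type: iaas
        current-model: admin/YYY

When the bug hits, the "current-model" key under the controller is missing, and "juju status" fails with the error above.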

juju client version 2.4.3-trusty-amd64
Both controller and my "user" model are on 2.4.3

Tags: canonical-is
Revision history for this message
Richard Harding (rharding) wrote :

Is there any sort of hint about what precedes the loss of the current model info? I can understand that if the current model is empty you get the error from status, but I wonder how it got lost. Are there multiple folks running simultaneous commands as the same user, or something else that might be confusing the code around recording the current model?

Changed in juju:
status: New → Incomplete
Paul Gear (paulgear)
summary: - Client looses information about current model
+ Client loses information about current model
Revision history for this message
Haw Loeung (hloeung) wrote :

I've seen this 3 times now, and even filed a bug earlier today - LP:1797823. The most recent occurrence hit a bunch of environments over the weekend, when no one from IS was around to run things - see this pastebin of nagios output for a bunch of environments - https://pastebin.canonical.com/p/wQqr2y9Mbb/

We have a cron job that periodically dumps the 'juju status' output for monitoring. From what I've seen, the juju client somehow overwrites ~/.local/share/juju/models.yaml with something very bare:

| controllers:
| prodstack-is: {}

We've only started seeing this since upgrading client+controller from 2.3.9 to 2.4.3, if this helps.

'juju switch <model>' fixes this and writes out a proper models.yaml with 'current-model', but it seems this is temporary until something happens again, perhaps related to issues with the controller? Still, the client shouldn't be rewriting models.yaml and clearing 'current-model' unless someone runs 'juju switch'.

Changed in juju:
status: Incomplete → Confirmed
Changed in juju:
status: Confirmed → Triaged
importance: Undecided → High
milestone: none → 2.5-beta1
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I agree with Rick - we need to know what is happening in between having a correct current model and 'losing' it.

Besides 'switch', the current model is set when you run 'bootstrap' (obviously not in this case), 'add-model', 'login', or 'register'.

Is it possible that one of these was executed in between and potentially unset your expected current model?

Changed in juju:
status: Triaged → Incomplete
Paul Gear (paulgear)
Changed in juju:
status: Incomplete → Confirmed
Revision history for this message
Tim Penhey (thumper) wrote : Re: [Bug 1796148] Re: Client loses information about current model

Hey Paul,

I'm pretty sure that the shift to Incomplete was to ask what commands had been run. What is the likelihood of someone inadvertently calling 'juju switch'?

Is the information being lost as part of an automated process?

Tim

Revision history for this message
Paul Gear (paulgear) wrote :

@anastasia-macmood We use dedicated environments for every model, and these environments have been working as-is for months (even years in some cases). We don't use add-model, login, or register in them unless there's a problem like this. Whilst we can't rule out that someone ran one of those commands in one of the environments in question, we hard-code JUJU_MODEL in every environment, so there's almost never a call to run any of those.

@hloeung's pastebin above (lines 9-16 & 19-32) shows this is happening in clusters to multiple different Linux user ids which don't have write access to each other's juju state. This makes it very unlikely that there's something inadvertent happening that might cause it. (And, as Haw mentioned, it happened over the weekend when no one was working on them.)

We have timestamped individual bash histories for each juju environment, so given that we have close-to-exact times in the pastebin, I'll update this shortly with what I've found in them.

Revision history for this message
Paul Gear (paulgear) wrote :

OK - I've audited all of the environments listed in the pastebin. They were all relatively new environments (by canonical-is standards), and did not have JUJU_MODEL set. They each have a dedicated user account (therefore juju switch commands run elsewhere should not affect them), and bash history timestamps indicate that no juju commands were run at all for many hours prior to juju status indicating a problem. https://docs.google.com/spreadsheets/d/1mbDz8-7QBBgExQqpHiHmAT7EKE2vcxegiWdKaasZf7Q/ has full details, but essentially, juju status just broke without anyone doing anything that we can see. (I'm assuming there's no possibility of a malicious attacker whose sole purpose was to disrupt juju status monitoring on a few random environments and not delete or steal data.)

Here's a snapshot of the controller performance statistics for the time covering the faults, plus about an hour on either side: https://grafana.admin.canonical.com/dashboard/snapshot/xLD9ALupU5U2tYmlztFttJM3iogDwE0Y I'll grab controller logs from the same time shortly, but my guess is that the CPU peaks correspond to times when the controller was restarted, and the sort of bugs we need to be looking for are race conditions or unexpected controller responses when juju status is run on the client while the server is restarting.

tags: added: canonical-is
John A Meinel (jameinel)
Changed in juju:
status: Confirmed → Triaged
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I am still puzzled as to what is going on...

Since you were only running 'juju status', with no other user activity during the times of the failures, the best I can guess is that:
1. we have recently introduced a retry within 'juju status', as some plugins (I think 'wait') were having trouble;
2. we also have code that will silently remove any model from the client cache if, when communicating with the controller, it is not found. Of course, if it is the current model that is not found, it will be removed as well - see the sketch below. https://github.com/juju/juju/blob/develop/cmd/modelcmd/base.go#L159

So it is possible that on one of the 'juju status' runs we did not get model details, removed them from the cache, and 'juju status' went into retry... This would have left the current model cleared.

Note that this is just speculation about what has happened.

The other side of the story is 'how do we fix it?' and how do we prevent it from happening again...
I do not think it is worthwhile storing the previous current model when we are updating the record. So what should we do? We can stop storing the current model altogether and require a user to always specify what model the command should run against. But I think that we all agree that having the concept of a current model is practical :)
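
To illustrate point 2, here is a rough, self-contained sketch of that kind of client-side logic (names are hypothetical; the real code is in cmd/modelcmd/base.go, linked above):

    package main

    import (
        "errors"
        "fmt"
    )

    // ErrModelNotFound stands in for the "model not found" error
    // returned by the controller.
    var ErrModelNotFound = errors.New("model not found")

    // ClientStore is a simplified stand-in for the client's on-disk
    // cache (models.yaml).
    type ClientStore struct {
        Models       map[string]bool
        CurrentModel string
    }

    // maybeRemoveModel mimics the behaviour under discussion: on a
    // 'model not found' error the client silently drops the model
    // from its cache, clearing current-model if it matches.
    func (s *ClientStore) maybeRemoveModel(model string, apiErr error) {
        if !errors.Is(apiErr, ErrModelNotFound) {
            return
        }
        delete(s.Models, model)
        if s.CurrentModel == model {
            s.CurrentModel = "" // the current-model reference is lost here
        }
    }

    func main() {
        store := &ClientStore{
            Models:       map[string]bool{"admin/prod": true},
            CurrentModel: "admin/prod",
        }
        // A transient controller problem (e.g. during a restart) that
        // surfaces as 'model not found' wipes the current model.
        store.maybeRemoveModel("admin/prod", ErrModelNotFound)
        fmt.Printf("current-model after error: %q\n", store.CurrentModel)
    }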

Revision history for this message
Paul Gear (paulgear) wrote :

Is there a particular need to remove models from the client cache? Why bother cleaning it at all? I've never seen a problem which I could attribute to that, and if there's no compelling need for it, it seems like something we could do without. My ~/.local/share/juju hasn't been cleaned for months (probably years), and it's only 328K.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

If more than one client is used and a model is destroyed, then only the client used to destroy it will have an updated cache. How would you update the caches on the other clients?

Revision history for this message
Richard Harding (rharding) wrote :

I think the use case is especially true in multi-user environments where your coworker removed a model and it was last active for you. We can definitely check the various situations and make sure we only do what's required.

Revision history for this message
Haw Loeung (hloeung) wrote :

Why not expire models instead? That is, add an entry in models.yaml recording when the model was last seen alive. Models not seen on the controller for over X days would then be removed.
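
A minimal sketch of that expiry idea, assuming a hypothetical last-seen timestamp per cache entry (Juju's models.yaml has no such field; this is a suggestion, not an existing feature):

    package main

    import (
        "fmt"
        "time"
    )

    // CachedModel is a hypothetical cache entry carrying the time the
    // model was last seen alive on the controller.
    type CachedModel struct {
        Name     string
        LastSeen time.Time
    }

    // expireStale drops only models not seen for longer than maxAge,
    // instead of removing them on the first 'not found' response.
    func expireStale(models []CachedModel, maxAge time.Duration, now time.Time) []CachedModel {
        var kept []CachedModel
        for _, m := range models {
            if now.Sub(m.LastSeen) <= maxAge {
                kept = append(kept, m)
            }
        }
        return kept
    }

    func main() {
        now := time.Now()
        cache := []CachedModel{
            {Name: "admin/prod", LastSeen: now.Add(-2 * time.Hour)},
            {Name: "admin/old", LastSeen: now.Add(-30 * 24 * time.Hour)},
        }
        // With a 7-day threshold, only admin/prod survives.
        for _, m := range expireStale(cache, 7*24*time.Hour, now) {
            fmt.Println("kept:", m.Name)
        }
    }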

Revision history for this message
Haw Loeung (hloeung) wrote :

With how it currently works, would it remove/clear models.yaml if the controllers were taken down for maintenance (backup, mgopurge, MongoDB surgery)?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Yes, the local current model will be:
i. updated when certain commands are used [login, register, bootstrap, add-model, switch];
ii. removed when a model is destroyed or not found.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

I'll double-check the current behavior and will ensure that the current model is only touched when users issue a command. If we just get 'not found' from the server, we'll keep asking users to switch but keep the reference in the local cache.

Changed in juju:
assignee: nobody → Anastasia (anastasia-macmood)
status: Triaged → In Progress
Revision history for this message
Anastasia (anastasia-macmood) wrote :
Revision history for this message
Anastasia (anastasia-macmood) wrote :

The PR in comment #16 only changes client behavior - we will no longer reset the current model if we cannot find it while making an API connection from any command.

The current model will be reset only by the commands that the user knows *could* change the current model - bootstrap, add/destroy-model, login, register, etc.; the full list is in the PR.

The question remains - what happened on the server for the client to decide to clear the current model reference? The only circumstance under which this could have happened is if we got a 'model not found' when making a connection to the server from the command - in this case, from 'juju status'. This error can only come from StatePool. Is there a race? Is it possible that StatePool does not become operational immediately as the controller comes up after a restart?

I'll let Tim decide, as he has recently made some changes to StatePool. I am out of ideas on the server side.

To sum up:
1. The controller was restarted;
2. The client received a 'model not found' for the current model and reset its local cache;
3. However, the model actually still exists on the controller...

Could 'juju status' sometimes fall into the black hole? That window between 'controller started' and 'state pool can see all models'?...
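
A minimal sketch of the fixed behaviour (hypothetical names; the actual change is in the linked PR): connection errors are reported without touching the cached current model, and only explicit user commands such as switch mutate it.

    package main

    import (
        "errors"
        "fmt"
    )

    var ErrModelNotFound = errors.New("model not found")

    // Store is a hypothetical stand-in for the client's models.yaml cache.
    type Store struct {
        CurrentModel string
    }

    // onConnectError surfaces the problem but deliberately leaves the
    // cached current model alone; before the fix it would be cleared here.
    func onConnectError(s *Store, apiErr error) error {
        if errors.Is(apiErr, ErrModelNotFound) {
            return fmt.Errorf("current model %q not found on the controller; "+
                "if it was destroyed, run 'juju switch'", s.CurrentModel)
        }
        return apiErr
    }

    // switchModel is an explicit user command, so it may update the cache.
    func switchModel(s *Store, model string) {
        s.CurrentModel = model
    }

    func main() {
        s := &Store{CurrentModel: "admin/prod"}
        fmt.Println(onConnectError(s, ErrModelNotFound))
        fmt.Println("still cached:", s.CurrentModel) // reference survives
        switchModel(s, "admin/staging")
        fmt.Println("after switch:", s.CurrentModel)
    }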

Revision history for this message
Jacek Nykis (jacekn) wrote :

Personally, I think it's not helpful for juju to remove the current model from the cache in any situation.

There are two possibilities:
1. Model still exists, client can't reach it temporarily
2. Model is gone and never coming back. Possibly destroyed by another user.

In scenario 1, removing the cached model actually adds extra work for operators - they have to figure out the previous model name and run juju switch.

In scenario 2, it causes confusion. juju status stops working and operators have to figure out why.
"juju models" will show them a list of models, but there is no way of knowing which model used to be set in the environment.
Eventually they may establish that the model was destroyed and run "juju switch", remove the controller credentials if they're no longer needed, or even destroy the controller.

I think a more tasteful way to deal with missing models would be:
$ juju status
ERROR: unable to get status for model "my-model-name". If it's a temporary problem, try again later. If the model was destroyed you can switch to another one using one of the commands below:
juju switch available-model1
juju switch available-model2
juju switch available-model3
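
A sketch of how a client could assemble that kind of message from the models it can still see (hypothetical helper, not actual Juju code):

    package main

    import (
        "fmt"
        "strings"
    )

    // missingModelError builds an error message along the lines suggested
    // above, listing the available switch targets.
    func missingModelError(model string, available []string) string {
        var b strings.Builder
        fmt.Fprintf(&b, "ERROR: unable to get status for model %q. "+
            "If it's a temporary problem, try again later. If the model was "+
            "destroyed you can switch to another one using one of the "+
            "commands below:\n", model)
        for _, m := range available {
            fmt.Fprintf(&b, "  juju switch %s\n", m)
        }
        return b.String()
    }

    func main() {
        fmt.Print(missingModelError("my-model-name",
            []string{"available-model1", "available-model2", "available-model3"}))
    }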

Revision history for this message
Anastasia (anastasia-macmood) wrote :

I agree, @Jacek Nykis (jacekn). Thank you for your support.
The linked PR will stop the surprises. I am just waiting for a 2nd +1 from Juju leadership and it will land :)

Revision history for this message
Anastasia (anastasia-macmood) wrote :

PR against develop (heading to 2.5): https://github.com/juju/juju/pull/9354

We think that on the controller side the problem was caused by http/api server start-up timing, and it has been fixed as part of bug #1792299

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released