juju models reports an error on half-dead model

Bug #1737480 reported by John A Meinel on 2017-12-11
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
High
John A Meinel
2.3
High
John A Meinel

Bug Description

With the new patch against having 'juju models' do direct queries against the database for all models, we can once again end up in an inconsistent state and return an error to "juju models".

To test this, I set up these scripts.

First the monitoring script:
$ cat monitor.sh
#!/bin/bash
set -o pipefail
set -e
tmpfile=$(mktemp ./out.XXXXXX)
juju models --debug 2>&1 | tee $tmpfile || cat ${tmpfile} >> bad.txt
rm ${tmpfile}

Then in 2 windows to stress it, I run:
$ watch -n 0.1 ./monitor.sh

Ultimately you can then find the failures in 'bad.txt'.

I then ran these scripts in 2 other windows to stress the system:
$ for i in `seq 50`; do juju add-model ma$i --no-switch; juju status -m ma$i 2>&1 >st.tmp || cat st.tmp; juju destroy-model -y ma$i; done
$ for i in `seq 50`; do juju add-model mb$i --no-switch; juju status -m mb$i 2>&1 >st.tmp || cat st.tmp; juju destroy-model -y mb$i; done

This is creating and destroying 2 models concurrently, which makes it more likely to fail.

After doing that, I was able to see failures like this:

10:26:40 INFO juju.cmd supercommand.go:56 running juju [2.3.2 gc go1.9.2]
10:26:40 DEBUG juju.cmd supercommand.go:57 args: []string{"juju", "models", "--debug"}
10:26:40 INFO juju.juju api.go:67 connecting to API addresses: [10.67.99.11:17070]
10:26:40 DEBUG juju.api apiclient.go:844 successfully dialed "wss://10.67.99.11:17070/api"
10:26:40 INFO juju.api apiclient.go:598 connection established to "wss://10.67.99.11:17070/api"
10:26:40 INFO cmd listmodels.go:130 cannot list models: could not find settings/config for models: [138ee6d6-f617-414e-86a6-e399e44f5486]
10:26:40 DEBUG juju.api monitor.go:35 RPC connection died
ERROR cannot list models: could not find settings/config for models: [138ee6d6-f617-414e-86a6-e399e44f5486]
10:26:40 DEBUG cmd supercommand.go:459 error stack:
could not find settings/config for models: [138ee6d6-f617-414e-86a6-e399e44f5486]
github.com/juju/juju/rpc/client.go:149:
github.com/juju/juju/api/apiclient.go:925:
github.com/juju/juju/api/modelmanager/modelmanager.go:190:
github.com/juju/juju/cmd/juju/controller/listmodels.go:169:
github.com/juju/juju/cmd/juju/controller/listmodels.go:124: cannot list models

On the controller I can see that it is calling the new ListModelSummaries API:
machine-0: 10:27:59 DEBUG juju.apiserver <- [234E] user-admin {"request-id":2,"type":"ModelManager","version":4,"request":"ListModelSummaries","params":"'params redacted'"}

John A Meinel (jameinel) wrote :

Note that this failure was also seen as intermittent failures in CI like:
http://qa.jujucharms.com/releases/6057/job/model-migration-azure-arm/attempt/1071

John A Meinel (jameinel) wrote :

The code in question is doing:
 if !remaining.IsEmpty() {
  // XXX: What error is appropriate? Do we need to care about models that its ok to be missing?
  return errors.Errorf("could not find settings/config for models: %v", remaining.SortedValues())
 }

we just need to figure out what the *right* behavior is.
Do we report a model, just without extra information.
Do we omit the model entirely? (I think this is probably the correct behavior).

Changed in juju:
status: Triaged → In Progress
John A Meinel (jameinel) wrote :

The really easy fix that I'm playing with is:
diff --git a/state/modelsummaries.go b/state/modelsummaries.go
index 923be6b..29b50db 100644
--- a/state/modelsummaries.go
+++ b/state/modelsummaries.go
@@ -162,7 +162,7 @@ func (p *modelSummaryProcessor) fillInFromConfig() error {
        }
        if !remaining.IsEmpty() {
                // XXX: What error is appropriate? Do we need to care about models that its ok to be missing?
- return errors.Errorf("could not find settings/config for models: %v", remaining.SortedValues())
+ // return errors.Errorf("could not find settings/config for models: %v", remaining.SortedValues())
        }
        return nil
 }

Essentially, just don't worry if the model's config doc goes missing.

While watching "juju models --format=yaml" I can see it going into this state:
...
- name: admin/ma30
  short-name: ma30
  model-uuid: 02533045-2e87-4047-8c20-973747bf0375
  controller-uuid: 2c6b2b72-7c30-46dc-8368-a7680fad3aa3
  controller-name: lxd
  owner: admin
  cloud: lxd
  type: lxd
  life: dead
  status:
    current: destroying
    message: tearing down cloud environment
    since: just now
  access: admin
  last-connection: 2 seconds ago
  sla-owner: admin
  agent-version: 2.3.2.2
- name: admin/mb30
  short-name: mb30
  model-uuid: 9f6c0753-322b-4143-8cb6-5034a9bcd895
  controller-uuid: 2c6b2b72-7c30-46dc-8368-a7680fad3aa3
  controller-name: lxd
  owner: admin
  cloud: lxd
  type: lxd
  life: dead
  status:
    current: available
    since: just now
  access: admin
  last-connection: 3 seconds ago
  sla-owner: admin

I've also seen status: {}

But this means that the Model itself is going to Dead, even though we are still doing things. In that second model, we even see status as "available" even though the life is dead.
The latter is being read from the Statuses collection, while the former is from the Models and Settings collections.

So there is an argument that Model.Life attribute should be "dying" in these circumstances, and not "dead", but for now, we can just return them naturally.

John A Meinel (jameinel) on 2017-12-11
Changed in juju:
milestone: 2.3.2 → 2.4-beta1
Anastasia (anastasia-macmood) wrote :

I am sure that the above PR has been merged into develop (2.4+) as part of a bigger merge.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers