removed model can cause allmodelwatcher to die permanently

Bug #1745231 reported by Roger Peppe on 2018-01-24
This bug affects 3 people

Bug Description

Every so often, we see that the all-model watcher becomes unavailable
and only available again when the controller machine agent is restarted.

One cause of this is when there's a dead model. We see log messages like this,
repeated over and over again:

 2018-01-24 15:02:32 INFO juju.state multiwatcher.go:214 store manager loop failed: model c037a410-bf55-4d3b-8ffb-ddead567bef9 has been removed
 2018-01-24 15:02:32 INFO juju.worker runner.go:483 stopped "allmodelmanager", err: model c037a410-bf55-4d3b-8ffb-ddead567bef9 has been removed
 2018-01-24 15:02:32 ERROR juju.worker runner.go:392 exited "allmodelmanager": model c037a410-bf55-4d3b-8ffb-ddead567bef9 has been removed
 2018-01-24 15:02:32 INFO juju.worker runner.go:467 restarting "allmodelmanager" in 1s

When we look at the models with "juju show-models", we see that the model with that UUID
does exist, but is dead:

  "agent-version": "2.2.9",
  "cloud": "aws",
  "controller-name": "jaas-aws-eu-west-1-001",
  "controller-uuid": "086f0bf8-da79-4ad4-8d73-890721332c8b",
  "life": "dead",
  "model-uuid": "c037a410-bf55-4d3b-8ffb-ddead567bef9",
  "name": "redacted@external/redacted",
  "owner": "redacted@external",
  "region": "eu-west-1",
  "short-name": "redacted",
  "sla": "unsupported",
  "status": {
   "current": "destroying",
   "message": "tearing down cloud environment",
   "since": "just now"
  },
  "type": "ec2",
  "users": {
   "admin": {
    "access": "admin",
    "display-name": "admin",
    "last-connection": "never connected"
   },
   "redacted@external": {
    "access": "admin",
    "display-name": "redacted",
    "last-connection": "6 hours ago"
   }
  }

It seems like allModelWatcherStateBacking.loadAllWatcherEntitiesForModel is
returning the error when State.Get is called.
A simple fix might be to return a nil error when the cause is ErrNotFound.

It seems like the Pool retains some state on each model in memory
(PoolItem.remove) and this would explain why restarting the machine
agent fixes the issue.
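
The simple fix suggested above (treat a not-found model as "skip it" rather than as a fatal error for the whole load) could look roughly like the sketch below. It is only an illustration and not juju's code: errNotFound, modelStore, get and loadAll are hypothetical stand-ins, and the standard library errors package is used in place of whatever error helpers the real code path relies on.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a NotFound error.
var errNotFound = errors.New("not found")

// modelStore is a hypothetical stand-in for the state pool: UUID -> model name.
type modelStore map[string]string

// get mimics a per-model lookup that fails once the model has been removed.
func (s modelStore) get(uuid string) (string, error) {
	name, ok := s[uuid]
	if !ok {
		return "", fmt.Errorf("model %s has been removed: %w", uuid, errNotFound)
	}
	return name, nil
}

// loadAll loads every listed model, skipping any that have been removed
// instead of failing the whole load.
func loadAll(s modelStore, uuids []string) ([]string, error) {
	var loaded []string
	for _, uuid := range uuids {
		name, err := s.get(uuid)
		if errors.Is(err, errNotFound) {
			continue // removed model: not fatal for the watcher as a whole
		}
		if err != nil {
			return nil, err
		}
		loaded = append(loaded, name)
	}
	return loaded, nil
}

func main() {
	store := modelStore{"uuid-a": "alpha", "uuid-c": "gamma"}
	// "uuid-b" is listed but no longer present in the store.
	loaded, err := loadAll(store, []string{"uuid-a", "uuid-b", "uuid-c"})
	fmt.Println(loaded, err) // prints "[alpha gamma] <nil>"
}
```

Any other error still aborts the load, so only the removed-model case is swallowed.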

Anastasia (anastasia-macmood) wrote :

@Roger Peppe,

Thank you for investigating this failure and such a detailed analysis in this report \o/

I completely agree with your reasoning and believe that a more comprehensive fix would be to filter out dead models from the initial query. This will prevent State.Get from trying to populate model details and will fix other places that potentially cannot handle dead models. Any handling of dead models should be exceptional and case-by-case: we should never get dead models in that list.

I'll propose against 2.3 branch first and forward port to develop (heading into 2.4) once the patch lands.

Changed in juju:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Anastasia (anastasia-macmood)

Roger Peppe (rogpeppe) wrote :

@Anastasia I thought about filtering out dead models from the query, but I'm not sure that's quite sufficient, as then there can be a race between the query and a model being removed - the model might not have been dead when we issued the query, but could be when we get around to asking for it.
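
A small sketch of that race, under invented names: models, liveUUIDs, lookup and loadNames are hypothetical and only simulate the sequence "query, then removal, then load". It suggests why filtering dead models out of the query and tolerating not-found during the load complement each other.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errModelNotFound is a hypothetical sentinel for a removed model.
var errModelNotFound = errors.New("model not found")

type model struct {
	name string
	life string // "alive", "dying" or "dead"
}

type models map[string]model

// liveUUIDs plays the role of the initial query with dead models filtered out.
func liveUUIDs(ms models) []string {
	var uuids []string
	for uuid, m := range ms {
		if m.life != "dead" {
			uuids = append(uuids, uuid)
		}
	}
	sort.Strings(uuids) // map order is random; sort for stable output
	return uuids
}

// lookup fails with errModelNotFound once the model is gone.
func lookup(ms models, uuid string) (model, error) {
	m, ok := ms[uuid]
	if !ok {
		return model{}, fmt.Errorf("model %s: %w", uuid, errModelNotFound)
	}
	return m, nil
}

// loadNames loads each queried model, skipping those removed since the query.
func loadNames(ms models, uuids []string) ([]string, error) {
	names := make([]string, 0, len(uuids))
	for _, uuid := range uuids {
		m, err := lookup(ms, uuid)
		if errors.Is(err, errModelNotFound) {
			continue // removed between the query and the load
		}
		if err != nil {
			return nil, err
		}
		names = append(names, m.name)
	}
	return names, nil
}

func main() {
	ms := models{
		"uuid-1": {name: "one", life: "alive"},
		"uuid-2": {name: "two", life: "dying"},
		"uuid-3": {name: "three", life: "dead"},
	}
	uuids := liveUUIDs(ms) // uuid-3 is filtered out by the query
	delete(ms, "uuid-2")   // uuid-2 is removed before it gets loaded
	names, err := loadNames(ms, uuids)
	fmt.Println(uuids, names, err) // prints "[uuid-1 uuid-2] [one] <nil>"
}
```

The filter alone still hands uuid-2 to the loader, so without the not-found check the load would fail exactly as in the logs above.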

Anastasia (anastasia-macmood) wrote :

Yes, agreed :) I'll do both: the query patch will reduce problems in other areas, at least for dead models, and handling ErrNotFound here, i.e. in the specific place you need it, will make your life better immediately :)

Anastasia (anastasia-macmood) wrote :

PR that filters out dead models on 2.3 -

Anastasia (anastasia-macmood) wrote :

PR that deals with NotFound in the above codepath directly (against 2.3):

Anastasia (anastasia-macmood) wrote :

PR catering for NotFound in the code path described above against develop (heading into 2.4):

Changed in juju:
status: In Progress → Fix Committed

Anastasia (anastasia-macmood) wrote :

PR that filters out dead models on develop (2.4) -

Changed in juju:
milestone: none → 2.4-beta1
Changed in juju:
status: Fix Committed → Fix Released