The Client.WatchAll API command never responds when the model has no machines

Bug #1560201 reported by Francis Ginther on 2016-03-21
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
Critical
Eric Snow

Bug Description

I'm using this API call:
https://godoc.org/github.com/juju/juju/apiserver/client#Client.WatchAll

with version 2.0-beta2-xenial-amd64,

and deploying to an openstack cloud (canonistack lcy02).

I have 2 models:

fginther@sandman:~$ juju list-models
NAME OWNER LAST CONNECTION
lcy02-ctl admin@local 1 hour ago
jd-test* admin@local 13 minutes ago

Whenever I target jd-test, the Client.WatchAll call never returns if the model has no machines. It will wait for a response forever. However, as soon as I 'juju add-machine' in another terminal the method returns a response and all appears well again. The problem only occurs when the target model has no machines.

Targeting the controller model, 'lcy02-ctl', works just fine. This model always has at least one machine, the bootstrap instance.

I can supply debug traces of the response and requests if it would help.

tags: added: landscape
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0-beta4
tags: added: juju-release-support
Changed in juju-core:
importance: High → Critical
Changed in juju-core:
assignee: nobody → Eric Snow (ericsnowcurrently)
Changed in juju-core:
status: Triaged → In Progress
Eric Snow (ericsnowcurrently) wrote :

Here are the steps I took to try (and fail) to reproduce:

--------------
git up master
godeps -u dependencies.tsv
go install ./...
juju bootstrap --upload-tools test-a lxd
juju create-model model-a
go run /tmp/test-lp1560201.go
--------------

(test-lp1560201.go is attached)

The critical part of the attached code is the following section:

--------------
func (c *testCommand) Run(ctx *cmd.Context) error {
        client, err := c.NewAPIClient()
        if err != nil {
                return err
        }
        defer client.Close()

        watcher, err := client.WatchAll()
        if err != nil {
                return err
        }
        if watcher == nil {
                panic("???")
        }
        fmt.Fprintln(ctx.Stdout, "did not block")
        return nil
}
--------------

It did not block.

Eric Snow (ericsnowcurrently) wrote :

@fginther, what would be the best way to reproduce this? The obvious-to-me approach did not block at all.

Note that I tested against master (effectively beta 3), so it may be that the problem has already been resolved. I'll double-check against the beta 2 code while waiting for your reply.

Eric Snow (ericsnowcurrently) wrote :

Using the code at the beta 2 tag, I was not able to reproduce the described failure using the attached script.

Changed in juju-core:
status: In Progress → Incomplete
tags: added: 2.0-count
Francis Ginther (fginther) wrote :

Eric, I'm trying to reproduce with beta3, but have been slowed by lp:1563576 and slowness with lcy01. If I get it to reproduce, I'll provide an API trace.

Eric Snow (ericsnowcurrently) wrote :

Thanks!

Francis Ginther (fginther) wrote :

Eric, I was able to reproduce this with beta3 on an lxd deployment. In my original report, I failed to notice that the hang occurs when the AllWatcherId is consumed by an AllWatcher API call. I'll attach a full trace.

I have pushed some code and a short test script to demonstrate this to lp:~fginther/python-jujuclient/lp-1560201-test.

To reproduce:
$ bzr branch lp:~fginther/python-jujuclient/lp-1560201-test
$ cd lp-1560201-test
$ virtualenv .venv
$ . .venv/bin/activate
$ python setup.py develop
$ ./test-case.py local.lxd-ctl:default

This assumes that you have bootstrapped an lxd controller named 'lxd-ctl' with an empty model named 'default'. The test script will dump out the juju API calls as they are being made. If the targeted model has no machines, it should hang. Adding a machine should break the hang.

Francis Ginther (fginther) wrote :

Trace of juju api calls leading to the hang.

Changed in juju-core:
status: Incomplete → New
tags: added: kanban-cross-team
Changed in juju-core:
status: New → Triaged
Eric Snow (ericsnowcurrently) wrote :

Thanks for the great instructions, Francis. I was able to reproduce the behavior you described. I also verified that after removing the machine, the API call goes back to hanging.

I also noticed that the API call finishes immediately after the the "juju add-machine" command returns. It does not wait for the machine to be provisioned. This implies to me that the watcher proceeds once the machine entry is added to the DB. I'm going to investigate further.

Changed in juju-core:
status: Triaged → In Progress
Eric Snow (ericsnowcurrently) wrote :

So...I'd expect a Next() call on the watcher to block until the first change happens (e.g. add a machine). After the watcher gets the first change I'd expect it to block until the next change, and so forth. However, that second part does not match up with what I'm seeing. I'm looking into that.

As to what changed in 2.0, under 1.X the old environment started off with an added machine (from bootstrap). Under 2.0, that remains the case for the "admin" model under the controller. For the "default" model nothing has changed yet because the model is still empty at first. That's why the the initial watcher Next() call blocks.

What's the use case for watching the model? To know when the controller has finished bootstrapping?

Eric Snow (ericsnowcurrently) wrote :

Ah, every watcher is automatically started off with a single "change", basically a snapshot of the model at that moment. Subsequent Next() calls for the same watcher will block until the next change.

So, it may be that jujuclient.py is making some assumptions (e.g. in get_stat()) that no longer hold true in the multi-model-controller world of Juju 2.0. Did you run into the problem originally while using jujuclient.py or via some other code?

Francis Ginther (fginther) wrote :

I ran into this from jujuclient.py. get_stat() is utilized as a general purpose method to get the status of everything in the model, which is why it can be called before anything has been deployed. Juju-deployer uses this to determine what's been deployed before doing anything else. The code indicates that the watcher API is used instead of FullStatus due to an older bug (lp:1203105) related to that interface. While that bug was fixed, I suspect no-one wanted to rework the code to use the status API.

I'm a little confused by the comment 'Ah, every watcher is automatically started off with a single "change"'. Does this hold for the default model that has no machines? Should it return some state?

Eric Snow (ericsnowcurrently) wrote :

Ah, then it sounds like jujuclient.py may need to accommodate the fact that new models start out empty in 2.0, meaning watchers do not get an initial "change" event. The alternative is for Juju to *always* send an initial change event to watchers, even if it is empty. I'll look into this.

Eric Snow (ericsnowcurrently) wrote :

We're going to fix this in Juju (core).

Francis Ginther (fginther) wrote :

Thanks for the update.

Eric Snow (ericsnowcurrently) wrote :

pending fix: http://reviews.vapour.ws/r/4406/

I've verified that the provided repro script works (does not hang) with the fix applied.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2016-04-14
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta4 → none
milestone: none → 2.0-beta4
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers