juju rc1 loses agents during an LXD deploy

Bug #1572237 reported by Adam Stokes
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Eric Snow

Bug Description

Using juju commit http://paste.ubuntu.com/15933149/, I'm seeing a few 'agent is lost, sorry!' errors come up.

Here is my juju status output: https://paste.ubuntu.com/15933366/

And here is log output from one of the failed services, 'rabbitmq': http://paste.ubuntu.com/15934148/

I also enabled DEBUG logging and captured this log from a ceph-osd unit that is failing:

https://paste.ubuntu.com/15934122/

Tags: lxd-provider
description: updated
Martin Packman (gz)
tags: added: lxd-provider
Changed in juju-core:
importance: Undecided → Critical
milestone: none → 2.0-rc1
status: New → Triaged
Revision history for this message
Adam Stokes (adam-stokes) wrote :

This is the rabbitmq machine log with TRACE debugging enabled:

http://paste.ubuntu.com/15935428/

This is the same unit's log with TRACE enabled:

http://paste.ubuntu.com/15935446/

Revision history for this message
Cheryl Jennings (cherylj) wrote :

By the time I was able to examine the environment, all the unit agents were lost. I saw the same error in the log over and over:

2016-04-19 23:42:43 TRACE juju.worker.dependency engine.go:422 failed to start "is-responsible-flag" manifold worker: lease manager stopped
2016-04-19 23:42:43 DEBUG juju.worker.dependency engine.go:479 "is-responsible-flag" manifold worker stopped: lease manager stopped
2016-04-19 23:42:43 ERROR juju.worker.dependency engine.go:526 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
2016-04-19 23:42:43 TRACE juju.apiserver apiserver.go:303 -> [2CC] unit-neutron-api-0 {"RequestId":2292,"Response":{"Results":[{"Error":{"Message":"lease manager stopped","Code":""}}]}}

I could not find any indication of why the lease manager was stopped. It is similar to bug 1539656, but in that case the apiserver died because it couldn't connect to mongo, and I didn't see any indication in the logs that there were mongo connectivity issues. Interestingly enough, though, I was unable to connect to mongo on the controller manually to inspect the state of the units in the DB.

I re-bootstrapped with a juju binary that included some extra logging, but haven't been able to recreate the issue. Most units are up, with 4 still waiting to install packages. I will need to check back in the morning.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-rc1
Changed in juju-core:
assignee: nobody → William Reade (fwereade)
Revision history for this message
William Reade (fwereade) wrote :

Controller logs would be especially helpful; the section of code that could most plausibly lose the agents is all server-side.

Revision history for this message
Cheryl Jennings (cherylj) wrote :
Revision history for this message
Cheryl Jennings (cherylj) wrote :

The above logs are for the controller at a TRACE level. In that case, I did not see the lease manager error.

I see entries for pinging the unit that was having problems:
2016-04-19 19:11:03 TRACE juju.state.presence presence.go:671 pinging "u#rabbitmq-server/0" with seq=45

But I looked at the logs again, and found this error hiding in plain sight shortly after the last of the above entries shows up in the log:

2016-04-19 19:11:03 INFO juju.state.presence presence.go:542 pinger loop failed: E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "9f3fbe9b-f105-412c-85af-3487741c5159:1461093060" }

After that error (which really should be logged at ERROR not INFO), I see no more logs for pinging the unit that was showing up as lost.
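
For illustration, the failure mode reads roughly like the following sketch: a ping loop that returns on its first write error simply stops, and nothing restarts it. (This is an illustrative sketch with invented names such as writePing, not Juju's actual presence code.)

package main

import (
	"errors"
	"fmt"
	"time"
)

// writePing stands in for the database write that records a presence ping.
func writePing(seq int) error {
	if seq == 3 {
		return errors.New("E11000 duplicate key error") // simulated failure
	}
	fmt.Printf("pinging with seq=%d\n", seq)
	return nil
}

// pingerLoop pings until the first error, then exits for good -- with no
// supervisor, presence updates for the unit silently stop and it looks lost.
func pingerLoop() error {
	for seq := 1; ; seq++ {
		if err := writePing(seq); err != nil {
			return fmt.Errorf("pinger loop failed: %w", err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	if err := pingerLoop(); err != nil {
		fmt.Println("INFO:", err) // the real error was only logged at INFO
	}
}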

Revision history for this message
William Reade (fwereade) wrote :

The core problem here is that we don't track apiserver resource lifetimes -- a Resource just has a Stop method, so there's no way to see if it's broken without breaking it. It's fundamentally the same problem as the "watcher has stopped" errors we're already familiar with, but more harmful because at least those have an external client that can start a fresh watch: this has nothing.
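
As a rough sketch of that interface gap (illustrative types only, not Juju's actual API): a Stop-only resource cannot be asked whether it has already broken, whereas a worker-style Wait method exposes the death to whatever is supervising it.

package main

import (
	"errors"
	"fmt"
)

// Resource is the Stop-only shape: the pool can shut it down, but it cannot
// tell whether the resource is already broken without destroying it.
type Resource interface {
	Stop() error
}

// Worker adds Wait, which blocks until the worker has stopped and reports why,
// so a supervisor can notice an unexpected death and react.
type Worker interface {
	Kill()
	Wait() error
}

// brokenWorker has already died internally; Wait exposes that fact.
type brokenWorker struct{ err error }

func (w *brokenWorker) Kill()       {}
func (w *brokenWorker) Wait() error { return w.err }

func main() {
	var w Worker = &brokenWorker{err: errors.New("pinger loop failed")}
	fmt.Println("worker stopped with:", w.Wait())
}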

It's *also* the same core problem as we have with sub-state workers ("lease manager stopped") -- some of them are at least proper workers now, but that's just a baby step: there's still no management structure in place, and it's not trivial to build (which is why it hasn't happened yet).

The resources side of the problem is *slightly* more tractable (and is the one that triggered this particular case), so I'm working on that, but it's unlikely to be a quick fix (and it indirectly affects most things in juju, so it's... not a great thing to start working on at RC time). For reference, see http://reviews.vapour.ws/r/3147/ -- that's the 9-page diff in which I fixed the lifetime issues with agent-side watchers *alone* -- and it does not include the non-trivial infrastructure work that change depended upon. And there are more things than watchers that can be apiserver resources...

[incidentally, we shouldn't *really* log that error at all -- we wouldn't need to if we tracked resource lifetime -- but that's by the by, and I'm glad we did because it was valuable confirmation of what went wrong]

Revision history for this message
William Reade (fwereade) wrote :

EOD update: I'm working on a bounded fix for the agent-lost issue in isolation, and I should be able to push it tomorrow.

The fix will wrap the Pinger in a new worker, which is responsible for starting fresh Pingers when anything goes wrong, but the technique is not generally applicable -- the Pinger resource is (uniquely?) not actually *used* as a resource by anything else, and is only in the resource pool so that it gets shut down when the connection drops for whatever reason.
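
A minimal sketch of that shape of fix, assuming hypothetical Pinger and supervisePinger names rather than the actual patch: a supervising worker owns the pinger and replaces it whenever a ping fails, so one bad write no longer leaves the unit permanently "lost".

package main

import (
	"errors"
	"fmt"
	"time"
)

// Pinger is a stand-in for the presence pinger: Ping records a heartbeat and
// can fail; Stop shuts the pinger down.
type Pinger interface {
	Ping() error
	Stop()
}

// flakyPinger fails every third ping, to exercise the restart path.
type flakyPinger struct{ n int }

func (p *flakyPinger) Ping() error {
	p.n++
	if p.n%3 == 0 {
		return errors.New("duplicate key error")
	}
	fmt.Println("ping", p.n)
	return nil
}

func (p *flakyPinger) Stop() { fmt.Println("pinger stopped") }

// supervisePinger owns the Pinger and replaces it whenever a ping fails,
// instead of leaving a dead pinger sitting in the resource pool.
func supervisePinger(newPinger func() Pinger, interval time.Duration, stop <-chan struct{}) {
	p := newPinger()
	defer func() { p.Stop() }()
	for {
		select {
		case <-stop:
			return
		case <-time.After(interval):
			if err := p.Ping(); err != nil {
				fmt.Println("pinger failed, starting a fresh one:", err)
				p.Stop()
				p = newPinger()
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go func() { time.Sleep(300 * time.Millisecond); close(stop) }()
	supervisePinger(func() Pinger { return &flakyPinger{} }, 20*time.Millisecond, stop)
}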

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

I've reviewed William's changes with him in a hangout and feel comfortable with his proposed solution. He is handing off the bug to me and I hope to have a patch up today based on his work.

Changed in juju-core:
status: Triaged → In Progress
assignee: William Reade (fwereade) → Eric Snow (ericsnowcurrently)
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

I'm still wrapping up tests but have high confidence in the solution.

Revision history for this message
Mike McCracken (mikemc) wrote :

A couple of tests of deploying an OpenStack bundle on LXD, run by danwest and me, both worked with binaries built from William and Eric's PR: https://github.com/juju/juju/pull/5267

Previously there were regular errors; now there are no 'agent lost' errors at all in two tries.

Revision history for this message
Adam Stokes (adam-stokes) wrote :

I can also say that I've verified this is working as expected, no agents lost :D

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta6 → none
milestone: none → 2.0-beta6