juju rc1 loses agents during an LXD deploy

Bug #1572237 reported by Adam Stokes
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Eric Snow

Bug Description

Using juju commit http://paste.ubuntu.com/15933149/, I'm seeing a few 'agent is lost, sorry!' errors come up.

Here is my juju status output: https://paste.ubuntu.com/15933366/

And here is log output from one of the failed services, 'rabbitmq': http://paste.ubuntu.com/15934148/

I also enabled DEBUG logging and captured this log from a ceph-osd unit that is failing:

https://paste.ubuntu.com/15934122/

Tags: lxd-provider
description: updated
Martin Packman (gz)
tags: added: lxd-provider
Changed in juju-core:
importance: Undecided → Critical
milestone: none → 2.0-rc1
status: New → Triaged
Revision history for this message
Adam Stokes (adam-stokes) wrote :

This is the rabbitmq machine log with TRACE debugging enabled:

http://paste.ubuntu.com/15935428/

This is the same unit's log with TRACE enabled:

http://paste.ubuntu.com/15935446/

Revision history for this message
Cheryl Jennings (cherylj) wrote :

By the time I was able to examine the environment, all the unit agents were lost. I saw the same error in the log over and over:

2016-04-19 23:42:43 TRACE juju.worker.dependency engine.go:422 failed to start "is-responsible-flag" manifold worker: lease manager stopped
2016-04-19 23:42:43 DEBUG juju.worker.dependency engine.go:479 "is-responsible-flag" manifold worker stopped: lease manager stopped
2016-04-19 23:42:43 ERROR juju.worker.dependency engine.go:526 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
2016-04-19 23:42:43 TRACE juju.apiserver apiserver.go:303 -> [2CC] unit-neutron-api-0 {"RequestId":2292,"Response":{"Results":[{"Error":{"Message":"lease manager stopped","Code":""}}]}}

I could not find any indication of why the lease manager was stopped. It is similar to bug 1539656, but in that case the apiserver died because it couldn't connect to mongo, and I didn't see any indication in the logs that there were mongo connectivity issues. Interestingly enough, though, I was unable to connect to mongo on the controller manually to inspect the state of the units in the DB.

I re-bootstrapped with a juju binary that included some extra logging, but haven't been able to recreate the issue. Most units are up, with 4 still waiting to install packages. I will need to check back in the morning.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-rc1
Changed in juju-core:
assignee: nobody → William Reade (fwereade)
Revision history for this message
William Reade (fwereade) wrote :

Controller logs would be especially helpful; the section of code that could most plausibly lose the agents is all server-side.

Revision history for this message
Cheryl Jennings (cherylj) wrote :
Revision history for this message
Cheryl Jennings (cherylj) wrote :

The above logs are for the controller at a TRACE level. In that case, I did not see the lease manager error.

I see entries for pinging the unit that was having problems:
2016-04-19 19:11:03 TRACE juju.state.presence presence.go:671 pinging "u#rabbitmq-server/0" with seq=45

But I looked at the logs again, and found this error hiding in plain sight shortly after the last of the above entries shows up in the log:

2016-04-19 19:11:03 INFO juju.state.presence presence.go:542 pinger loop failed: E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "9f3fbe9b-f105-412c-85af-3487741c5159:1461093060" }

After that error (which really should be logged at ERROR not INFO), I see no more logs for pinging the unit that was showing up as lost.
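
For illustration, the failure mode reads roughly like the following sketch: a ping loop that returns on its first write error simply stops, and nothing restarts it. (This is an illustrative sketch with invented names such as writePing, not Juju's actual presence code.)

package main

import (
	"errors"
	"fmt"
	"time"
)

// writePing stands in for the database write that records a presence ping.
func writePing(seq int) error {
	if seq == 3 {
		return errors.New("E11000 duplicate key error") // simulated failure
	}
	fmt.Printf("pinging with seq=%d\n", seq)
	return nil
}

// pingerLoop pings until the first error, then exits for good -- with no
// supervisor, presence updates for the unit silently stop and it looks lost.
func pingerLoop() error {
	for seq := 1; ; seq++ {
		if err := writePing(seq); err != nil {
			return fmt.Errorf("pinger loop failed: %w", err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	if err := pingerLoop(); err != nil {
		fmt.Println("INFO:", err) // the real error was only logged at INFO
	}
}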

Revision history for this message
William Reade (fwereade) wrote :

The core problem here is that we don't track apiserver resource lifetimes -- a Resource just has a Stop method, so there's no way to see if it's broken without breaking it. It's fundamentally the same problem as the "watcher has stopped" errors we're already familiar with, but more harmful because at least those have an external client that can start a fresh watch: this has nothing.
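
As a rough sketch of that interface gap (illustrative types only, not Juju's actual API): a Stop-only resource cannot be asked whether it has already broken, whereas a worker-style Wait method exposes the death to whatever is supervising it.

package main

import (
	"errors"
	"fmt"
)

// Resource is the Stop-only shape: the pool can shut it down, but it cannot
// tell whether the resource is already broken without destroying it.
type Resource interface {
	Stop() error
}

// Worker adds Wait, which blocks until the worker has stopped and reports why,
// so a supervisor can notice an unexpected death and react.
type Worker interface {
	Kill()
	Wait() error
}

// brokenWorker has already died internally; Wait exposes that fact.
type brokenWorker struct{ err error }

func (w *brokenWorker) Kill()       {}
func (w *brokenWorker) Wait() error { return w.err }

func main() {
	var w Worker = &brokenWorker{err: errors.New("pinger loop failed")}
	fmt.Println("worker stopped with:", w.Wait())
}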

It's *also* the same core problem as we have with sub-state workers ("lease manager stopped") -- some of them are at least proper workers now, but that's just a baby step: there's still no management structure in place, and it's not trivial to build (which is why it hasn't happened yet).

The resources side of the problem is *slightly* more tractable (and is the one that triggered this particular case), so I'm working on that, but it's unlikely to be a quick fix (and it indirectly affects most things in juju, so it's... not a great thing to start working on at RC time). For reference, see http://reviews.vapour.ws/r/3147/ -- that's the 9-page diff in which I fixed the lifetime issues with agent-side watchers *alone* -- and it does not include the non-trivial infrastructure work that change depended upon. And there are more things than watchers that can be apiserver resources...

[incidentally, we shouldn't *really* log that error at all -- we wouldn't need to if we tracked resource lifetime -- but that's by the by, and I'm glad we did because it was valuable confirmation of what went wrong]

Revision history for this message
William Reade (fwereade) wrote :

EOD update: I'm working on a bounded fix for the agent-lost issue in isolation, and I should be able to push it tomorrow.

The fix will wrap the Pinger in a new worker, which is responsible for starting fresh Pingers when anything goes wrong, but the technique is not generally applicable -- the Pinger resource is (uniquely?) not actually *used* as a resource by anything else, and is only in the resource pool so that it gets shut down when the connection drops for whatever reason.
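
A minimal sketch of that shape of fix, assuming hypothetical Pinger and supervisePinger names rather than the actual patch: a supervising worker owns the pinger and replaces it whenever a ping fails, so one bad write no longer leaves the unit permanently "lost".

package main

import (
	"errors"
	"fmt"
	"time"
)

// Pinger is a stand-in for the presence pinger: Ping records a heartbeat and
// can fail; Stop shuts the pinger down.
type Pinger interface {
	Ping() error
	Stop()
}

// flakyPinger fails every third ping, to exercise the restart path.
type flakyPinger struct{ n int }

func (p *flakyPinger) Ping() error {
	p.n++
	if p.n%3 == 0 {
		return errors.New("duplicate key error")
	}
	fmt.Println("ping", p.n)
	return nil
}

func (p *flakyPinger) Stop() { fmt.Println("pinger stopped") }

// supervisePinger owns the Pinger and replaces it whenever a ping fails,
// instead of leaving a dead pinger sitting in the resource pool.
func supervisePinger(newPinger func() Pinger, interval time.Duration, stop <-chan struct{}) {
	p := newPinger()
	defer func() { p.Stop() }()
	for {
		select {
		case <-stop:
			return
		case <-time.After(interval):
			if err := p.Ping(); err != nil {
				fmt.Println("pinger failed, starting a fresh one:", err)
				p.Stop()
				p = newPinger()
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go func() { time.Sleep(300 * time.Millisecond); close(stop) }()
	supervisePinger(func() Pinger { return &flakyPinger{} }, 20*time.Millisecond, stop)
}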

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

I've reviewed William's changes with him in a hangout and feel comfortable with his proposed solution. He is handing off the bug to me and I hope to have a patch up today based on his work.

Changed in juju-core:
status: Triaged → In Progress
assignee: William Reade (fwereade) → Eric Snow (ericsnowcurrently)
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

I'm still wrapping up tests but have high confidence in the solution.

Revision history for this message
Mike McCracken (mikemc) wrote :

A couple of tests of deploying an OpenStack bundle on LXD, run by danwest and me, both worked with binaries built from William and Eric's PR: https://github.com/juju/juju/pull/5267

Previously there were regular errors; now there are no 'agent lost' errors at all in two tries.

Revision history for this message
Adam Stokes (adam-stokes) wrote :

I can also say that I've verified this is working as expected, no agents lost :D

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta6 → none
milestone: none → 2.0-beta6