agent lost, even though agents are still responding

Bug #1703526 reported by John A Meinel
This bug affects 3 people
Affects           Status         Importance   Assigned to       Milestone
Canonical Juju    Fix Released   Critical     John A Meinel
2.2               Fix Released   Critical     John A Meinel

Bug Description

With 2.2.1 we are seeing agents show up as missing (lost), even though they still seem to be responding to events (update-status still fires, and changing configuration still appears to apply that configuration to the running applications).

It is possible that some of our recent changes around how we aggregate and record Pings have caused liveness information to get lost.

Debugging the presence tables (with the attached Python script) has shown a bunch of agents without pings in the most recent ping slots (as many as 7 consecutive slots without any ticks, which would be 3.5 minutes without presence information for those agents).
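
For reference, here is a rough Go equivalent of the kind of check the attached script performs (this is a sketch, not the attached script itself): the dial address is a placeholder, and the database/collection names ("presence"/"presence.pings") are inferred from the error messages quoted later in this report.

package main

import (
	"fmt"
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	// Placeholder address; a real controller needs the juju mongo port and credentials.
	session, err := mgo.Dial("localhost:37017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	pings := session.DB("presence").C("presence.pings")

	// Each document covers one 30-second slot, so 7 empty slots is about 3.5 minutes.
	var docs []bson.M
	if err := pings.Find(nil).Sort("-slot").Limit(10).All(&docs); err != nil {
		log.Fatal(err)
	}
	for _, doc := range docs {
		alive, _ := doc["alive"].(bson.M)
		fmt.Printf("slot %v: %d alive entries\n", doc["slot"], len(alive))
	}
}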

We are not seeing Ping entries that correspond to beings that are no longer in the database, so it is unlikely that pruning is removing active beings. (2.2.2 already has a patch related to this, so that instead of only preserving the beings for the most recent 2 slots, we preserve beings that are active in any slot still in the database.)

It seems to be necessary to have HA controllers in order to trigger this behavior.

At this point it is unclear whether the agents have stopped sending Ping() messages, or whether we are just failing to record them.

It is possible we have a bug where a PingBatcher gets wedged trying to write to the database and, while still accepting Pings and never getting restarted, simply never writes any more presence information. (We would thus lose presence for all agents that are connected to that specific batcher.)
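
To make the suspected failure mode concrete, here is a minimal sketch (the type and channel layout are assumptions, not juju's actual presence code) of how a batcher whose write loop has died can silently wedge every Pinger that still holds a reference to it:

package presencesketch

import (
	"errors"
	"time"
)

type pingBatcher struct {
	pings chan string   // incoming ping keys from Pingers
	dead  chan struct{} // closed when the flush loop exits
}

// loop drains pings and writes them to the database.  If a write fails,
// the loop exits: a worker runner may start a *new* batcher, but any
// Pinger still holding this one keeps sending into a dead channel.
func (b *pingBatcher) loop(flush func(string) error) {
	defer close(b.dead)
	for key := range b.pings {
		if flush(key) != nil {
			return
		}
	}
}

// Ping blocks once the loop is gone and the channel buffer is full,
// which matches the "agents look lost even though they keep pinging" symptom.
func (b *pingBatcher) Ping(key string) error {
	select {
	case b.pings <- key:
		return nil
	case <-b.dead:
		return errors.New("ping batcher has stopped")
	case <-time.After(10 * time.Second): // illustrative timeout
		return errors.New("timed out waiting for ping batcher")
	}
}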

We haven't yet reproduced this in isolation.

Revision history for this message
John A Meinel (jameinel) wrote :
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote :

So I managed to trigger a failure that we don't recover from. I don't know if this is the problem that is happening in the wild, but it is *a* problem, which we should fix.

If you inject an invalid entry in the database to force the PingBatcher to die:
db.presence.pings.insert({"_id": "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870", "slot" : NumberLong(1499793870), "alive" : { "1" : "a"}})

(note: the slot was picked to be 60 greater than the biggest slot already present)

Then the PingBatcher will fail with the resulting error:
machine-0: 21:24:32 ERROR juju.worker exited "pingbatcher": Cannot apply $inc to a value of non-numeric type. {_id: "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870"} has the field '1' of non-numeric type String

At that point we actually *restart* the PingBatcher, but all of the existing Pinger objects continue to use the now-dead PingBatcher, so they all end up blocked or timing out.

Now, this injection is overly forceful, as it causes all PingBatchers on all controllers to die. But if just one PingBatcher were dying on one controller, the symptom would be similar.

The issue is that we construct Pingers by passing in a PingBatcher to use, but we never have them switch to an updated PingBatcher if for any reason the PingBatcher has to be restarted.
Instead, we need to give them a function that returns whichever PingBatcher is currently live.
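
A minimal sketch of that shape (the names are illustrative, not the actual juju code or the pull request linked below): the Pinger holds a getter that is resolved on every ping, so a batcher that was restarted by the worker runner gets picked up automatically.

package presencesketch

type PingBatcher interface {
	Ping(key string) error
}

type Pinger struct {
	key        string
	getBatcher func() PingBatcher // resolved on every ping, not captured once
}

func NewPinger(key string, getBatcher func() PingBatcher) *Pinger {
	return &Pinger{key: key, getBatcher: getBatcher}
}

// ping asks for the currently live batcher each time, so a batcher that
// died and was replaced is used transparently instead of the dead one.
func (p *Pinger) ping() error {
	return p.getBatcher().Ping(p.key)
}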

(I'm also not sure what happens if a Pinger actually dies due to an error, as near as I can tell we don't ever restart them, either.)

Changed in juju:
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

Possible fix (needs real testing):
https://github.com/juju/juju/pull/7626

It may be that we need to change a different layer to restart Pinger objects, rather than fixing Pinger objects to reuse a freshly restarted PingBatcher object.

Revision history for this message
John A Meinel (jameinel) wrote :

Something still seems wrong with the internals; I'm not 100% sure what happened, but it appears that after restarting the controllers the agents still don't show up as alive. Watching the TRACE debug-log, I notice this:
machine-0: 22:48:49 TRACE juju.state.presence [8e869b] pinging "u#ul/6#charm" with seq=152
machine-0: 22:48:49 TRACE juju.state.presence [8e869b] pinging "u#peer-xplod/7#charm" with seq=154

Notice that all of those have "#charm" appended to their keys.
When I dump the database, I can see both:
{ "_id" : "8e869b13-85a7-4d86-8a3e-4d332b4306e8:116", "seq" : NumberLong(116), "key" : "u#peer-xplod/5" }
and
{ "_id" : "8e869b13-85a7-4d86-8a3e-4d332b4306e8:159", "seq" : NumberLong(159), "key" : "u#peer-xplod/5#charm" }

I wonder if the bug is that when we restart Pinger objects, we accidentally restart them with #charm instead of the actual global key for the unit itself.

Revision history for this message
John A Meinel (jameinel) wrote :

There is definitely a discrepancy because we see:
machine-2: 23:13:34 TRACE juju.state.presence [8e869b] got request: presence.reqAlive{key:"u#peer-xplod/0", result:(chan bool)(0xc4220669a0)} for model

That is clearly asking "is u#peer-xplod/0 alive?", but then we see:
machine-0: 23:15:32 TRACE juju.apiserver.presence.unit-peer-xplod-0 starting pinger...
machine-0: 23:15:33 TRACE juju.state.presence [8e869b] starting pinger for "u#peer-xplod/0#charm" with seq=181

^- that is starting a ping record for "u#peer-xplod/0#charm", which is obviously a mismatch.

Revision history for this message
John A Meinel (jameinel) wrote :

(this could just be a bug in my patch that was trying to fix it)

Revision history for this message
John A Meinel (jameinel) wrote :

u#peer-xplod/0#charm was, indeed, a bug in my patch. I had changed a call from Unit.globalAgentKey() to u.globalKey() because I was copying the code from Machine, which doesn't make that distinction and only has globalKey :(
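
For context, here is a simplified reconstruction of the two key formats visible in the logs above (not the actual state package code): presence should be tracking the unit agent key, while the "#charm" variant is a different global key for the unit, so mixing them means the alive checks never match the recorded pings.

package presencesketch

// Agent key, what presence tracking expects, e.g. "u#peer-xplod/0".
func unitAgentGlobalKey(name string) string {
	return "u#" + name
}

// Unit global key, e.g. "u#peer-xplod/0#charm"; starting the Pinger with
// this one produces records that the reqAlive lookups never find.
func unitGlobalKey(name string) string {
	return "u#" + name + "#charm"
}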

Revision history for this message
John A Meinel (jameinel) wrote :

Digging back through one of the machine-0.log files where we saw this happening, we actually see the PingBatcher dying midway through, which validates the theory:

2017-07-04 11:39:31 INFO juju.state.presence pingbatcher.go:119 ping batching loop failed: E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "ae6dfe7a-e99d-45d9-8230-da097440e17f:1499168370" }
ca084bdd-5a88-41cf-8d67-f9a9dbd7dfd9: machine-1 2017-07-04 11:39:31 INFO juju.worker runner.go:483 stopped "pingbatcher", err: E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "ae6dfe7a-e99d-45d9-8230-da097440e17f:1499168370" }
ca084bdd-5a88-41cf-8d67-f9a9dbd7dfd9: machine-1 2017-07-04 11:39:31 ERROR juju.worker runner.go:392 exited "pingbatcher": E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "ae6dfe7a-e99d-45d9-8230-da097440e17f:1499168370" }
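
As an illustration only (the fix under review is about Pingers reusing a restarted batcher, not about this): a flush could also treat the E11000 duplicate-key error above as retryable instead of fatal to the whole batching loop. The helper below is a sketch using mgo's IsDup, assuming one upserted document per slot as the error message suggests.

package presencesketch

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// flushSlot upserts the $inc updates for one ping slot.  If another
// controller races us and inserts the same slot document first, the
// upsert can fail with a duplicate-key error; retrying applies our
// increments to the now-existing document instead of killing the loop.
func flushSlot(pings *mgo.Collection, docID string, slot int64, incs bson.M) error {
	update := bson.M{"$inc": incs}
	_, err := pings.Upsert(bson.M{"_id": docID, "slot": slot}, update)
	if mgo.IsDup(err) {
		_, err = pings.Upsert(bson.M{"_id": docID, "slot": slot}, update)
	}
	return err
}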

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Built a snap with https://github.com/juju/juju/pull/7626 included. I will post an update tomorrow on whether the issue reproduces after an upgrade of an existing controller. If it does not, I will redeploy a model and try it out in a fresh environment.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I've left a deployment to run overnight - no agent loss this time.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

@Dmitrii: that's great news. Thanks for taking the time to confirm.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

This fix would have been merged into develop (2.3) as part of a bigger merge.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released