agent lost, even though agents are still responding

Bug #1703526 reported by John A Meinel
This bug affects 3 people
Affects           Status         Importance   Assigned to       Milestone
Canonical Juju    Fix Released   Critical     John A Meinel
2.2               Fix Released   Critical     John A Meinel

Bug Description

With 2.2.1 we are seeing agents show up as missing (lost), even though they still seem to be responding to events (update-status still fires, and changing configuration still appears to apply that configuration to the running applications).

It is possible that some of our recent changes around how we aggregate and record Pings have caused liveness information to get lost.

Debugging the presence tables (with the attached Python script) has shown a bunch of agents without pings in the most recent ping slots (as many as 7 consecutive slots without any ticks, which would be 3.5 minutes without presence information for those agents).
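
For reference, here is a rough Go equivalent of the kind of check the attached script performs (this is a sketch, not the attached script itself): the dial address is a placeholder, and the database/collection names ("presence"/"presence.pings") are inferred from the error messages quoted later in this report.

package main

import (
	"fmt"
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	// Placeholder address; a real controller needs the juju mongo port and credentials.
	session, err := mgo.Dial("localhost:37017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	pings := session.DB("presence").C("presence.pings")

	// Each document covers one 30-second slot, so 7 empty slots is about 3.5 minutes.
	var docs []bson.M
	if err := pings.Find(nil).Sort("-slot").Limit(10).All(&docs); err != nil {
		log.Fatal(err)
	}
	for _, doc := range docs {
		alive, _ := doc["alive"].(bson.M)
		fmt.Printf("slot %v: %d alive entries\n", doc["slot"], len(alive))
	}
}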

We are not seeing Ping entries that correspond to beings that are no longer in the database, so it is unlikely that pruning is removing active beings. (2.2.2 already has a patch related to this, so that instead of only preserving the beings for the most recent 2 slots, we preserve beings that are active in any slot still in the database.)

It seems to be necessary to have HA controllers in order to trigger this behavior.

At this point it is unclear whether the agents have stopped sending Ping() messages, or whether we are just failing to record them.

It is possible we have a bug where a PingBatcher gets wedged trying to write to the database and, while still accepting Pings and never getting restarted, simply never writes any more presence information. (We would thus lose presence for all agents that are connected to that specific batcher.)
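
To make the suspected failure mode concrete, here is a minimal sketch (the type and channel layout are assumptions, not juju's actual presence code) of how a batcher whose write loop has died can silently wedge every Pinger that still holds a reference to it:

package presencesketch

import (
	"errors"
	"time"
)

type pingBatcher struct {
	pings chan string   // incoming ping keys from Pingers
	dead  chan struct{} // closed when the flush loop exits
}

// loop drains pings and writes them to the database.  If a write fails,
// the loop exits: a worker runner may start a *new* batcher, but any
// Pinger still holding this one keeps sending into a dead channel.
func (b *pingBatcher) loop(flush func(string) error) {
	defer close(b.dead)
	for key := range b.pings {
		if flush(key) != nil {
			return
		}
	}
}

// Ping blocks once the loop is gone and the channel buffer is full,
// which matches the "agents look lost even though they keep pinging" symptom.
func (b *pingBatcher) Ping(key string) error {
	select {
	case b.pings <- key:
		return nil
	case <-b.dead:
		return errors.New("ping batcher has stopped")
	case <-time.After(10 * time.Second): // illustrative timeout
		return errors.New("timed out waiting for ping batcher")
	}
}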

We haven't yet reproduced this in isolation.

Revision history for this message
John A Meinel (jameinel) wrote :
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote :

So I managed to trigger a failure that we don't recover from. I don't know if this is the problem that is happening in the wild, but it is *a* problem, which we should fix.

If you inject an invalid entry in the database to force the PingBatcher to die:
db.presence.pings.insert({"_id": "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870", "slot" : NumberLong(1499793870), "alive" : { "1" : "a"}})

(note: the slot was picked to be 60 greater than the biggest slot already present)

Then the PingBatcher will fail with the resulting error:
machine-0: 21:24:32 ERROR juju.worker exited "pingbatcher": Cannot apply $inc to a value of non-numeric type. {_id: "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870"} has the field '1' of non-numeric type String

At that point we actually *restart* the PingBatcher, but all of the existing Pinger objects continue to use the now-dead PingBatcher, so they all end up blocked or timing out.

Now, this injection is overly forceful, as it causes all PingBatchers on all controllers to die. But if just one PingBatcher were dying on one controller, the symptom would be similar.

The issue is that we construct Pingers by passing in a PingBatcher to use, but we never have them switch to an updated PingBatcher if for any reason the PingBatcher has to be restarted.
Instead, we need to give them a function that returns whichever PingBatcher is currently live.
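
A minimal sketch of that shape (the names are illustrative, not the actual juju code or the pull request linked below): the Pinger holds a getter that is resolved on every ping, so a batcher that was restarted by the worker runner gets picked up automatically.

package presencesketch

type PingBatcher interface {
	Ping(key string) error
}

type Pinger struct {
	key        string
	getBatcher func() PingBatcher // resolved on every ping, not captured once
}

func NewPinger(key string, getBatcher func() PingBatcher) *Pinger {
	return &Pinger{key: key, getBatcher: getBatcher}
}

// ping asks for the currently live batcher each time, so a batcher that
// died and was replaced is used transparently instead of the dead one.
func (p *Pinger) ping() error {
	return p.getBatcher().Ping(p.key)
}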

(I'm also not sure what happens if a Pinger actually dies due to an error, as near as I can tell we don't ever restart them, either.)

Changed in juju:
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

Possible fix (needs real testing):
https://github.com/juju/juju/pull/7626

It may be that we need to change a different layer to restart Pinger objects, rather than fixing Pinger objects to reuse a freshly restarted PingBatcher object.

Revision history for this message
John A Meinel (jameinel) wrote :

Something still seems wrong with the internals; I'm not 100% sure what happened, but it appears that after restarting the controllers the agents still don't show up as alive. Watching the TRACE debug-log, I notice this:
machine-0: 22:48:49 TRACE juju.state.presence [8e869b] pinging "u#ul/6#charm" with seq=152
machine-0: 22:48:49 TRACE juju.state.presence [8e869b] pinging "u#peer-xplod/7#charm" with seq=154

Notice that all of those have "#charm" appended to their keys.
When I dump the database, I can see both:
{ "_id" : "8e869b13-85a7-4d86-8a3e-4d332b4306e8:116", "seq" : NumberLong(116), "key" : "u#peer-xplod/5" }
and
{ "_id" : "8e869b13-85a7-4d86-8a3e-4d332b4306e8:159", "seq" : NumberLong(159), "key" : "u#peer-xplod/5#charm" }

I wonder if the bug is that when we restart Pinger objects, we accidentally restart them with #charm instead of the actual global key for the unit itself.

Revision history for this message
John A Meinel (jameinel) wrote :

There is definitely a discrepancy because we see:
machine-2: 23:13:34 TRACE juju.state.presence [8e869b] got request: presence.reqAlive{key:"u#peer-xplod/0", result:(chan bool)(0xc4220669a0)} for model

That is clearly asking "is u#peer-xplod/0 alive?", but then we see:
machine-0: 23:15:32 TRACE juju.apiserver.presence.unit-peer-xplod-0 starting pinger...
machine-0: 23:15:33 TRACE juju.state.presence [8e869b] starting pinger for "u#peer-xplod/0#charm" with seq=181

^- that is starting a ping record for "u#peer-xplod/0#charm", which is obviously a mismatch.

Revision history for this message
John A Meinel (jameinel) wrote :

(this could just be a bug in my patch that was trying to fix it)

Revision history for this message
John A Meinel (jameinel) wrote :

u#peer-xplod/0#charm was, indeed, a bug in my patch. I had changed a call from Unit.globalAgentKey() to u.globalKey() because I was copying the code from Machine, which doesn't make that distinction and only has globalKey :(
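
For context, here is a simplified reconstruction of the two key formats visible in the logs above (not the actual state package code): presence should be tracking the unit agent key, while the "#charm" variant is a different global key for the unit, so mixing them means the alive checks never match the recorded pings.

package presencesketch

// Agent key, what presence tracking expects, e.g. "u#peer-xplod/0".
func unitAgentGlobalKey(name string) string {
	return "u#" + name
}

// Unit global key, e.g. "u#peer-xplod/0#charm"; starting the Pinger with
// this one produces records that the reqAlive lookups never find.
func unitGlobalKey(name string) string {
	return "u#" + name + "#charm"
}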

Revision history for this message
John A Meinel (jameinel) wrote :

Digging back through one of the machine-0.log files where we saw this happening, we actually see the PingBatcher dying midway through, which validates the theory:

2017-07-04 11:39:31 INFO juju.state.presence pingbatcher.go:119 ping batching loop failed: E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "ae6dfe7a-e99d-45d9-8230-da097440e17f:1499168370" }
ca084bdd-5a88-41cf-8d67-f9a9dbd7dfd9: machine-1 2017-07-04 11:39:31 INFO juju.worker runner.go:483 stopped "pingbatcher", err: E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "ae6dfe7a-e99d-45d9-8230-da097440e17f:1499168370" }
ca084bdd-5a88-41cf-8d67-f9a9dbd7dfd9: machine-1 2017-07-04 11:39:31 ERROR juju.worker runner.go:392 exited "pingbatcher": E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "ae6dfe7a-e99d-45d9-8230-da097440e17f:1499168370" }
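
As an illustration only (the fix under review is about Pingers reusing a restarted batcher, not about this): a flush could also treat the E11000 duplicate-key error above as retryable instead of fatal to the whole batching loop. The helper below is a sketch using mgo's IsDup, assuming one upserted document per slot as the error message suggests.

package presencesketch

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// flushSlot upserts the $inc updates for one ping slot.  If another
// controller races us and inserts the same slot document first, the
// upsert can fail with a duplicate-key error; retrying applies our
// increments to the now-existing document instead of killing the loop.
func flushSlot(pings *mgo.Collection, docID string, slot int64, incs bson.M) error {
	update := bson.M{"$inc": incs}
	_, err := pings.Upsert(bson.M{"_id": docID, "slot": slot}, update)
	if mgo.IsDup(err) {
		_, err = pings.Upsert(bson.M{"_id": docID, "slot": slot}, update)
	}
	return err
}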

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Built a snap with https://github.com/juju/juju/pull/7626 included. I will post an update tomorrow on whether the issue reproduces after an upgrade of an existing controller. If it does not, I will redeploy a model and try it out in a fresh environment.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I've left a deployment to run overnight - no agent loss this time.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

@Dmitrii: that's great news. Thanks for taking the time to confirm.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

This fix would have been merged into develop (2.3) as part of a bigger merge.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released