Comment 3 for bug 1703526

John A Meinel (jameinel) wrote :

So I managed to trigger a failure that we don't recover from. I don't know if this is the problem that is happening in the wild, but it is *a* problem, which we should fix.

If you inject an invalid entry in the database to force the PingBatcher to die:
db.presence.pings.insert({"_id": "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870", "slot" : NumberLong(1499793870), "alive" : { "1" : "a"}})

(note that the slot was picked to be 60 greater than the biggest slot already in the collection)

Then PingBatcher will fail with an error like:
machine-0: 21:24:32 ERROR juju.worker exited "pingbatcher": Cannot apply $inc to a value of non-numeric type. {_id: "8e869b13-85a7-4d86-8a3e-4d332b4306e8:1499793870"} has the field '1' of non-numeric type String

At that point, we actually *restart* PingBatcher, but all of the existing Pinger objects continue to use the now-dead PingBatcher, so they all actually end up blocked/timing out.

Now, this is overly forceful, as it will cause all PingBatchers on all controllers to die. But imagine that 1 PingBatcher was dying on 1 controller. Then it would have a similar symptom.

The issue is that we construct Pingers by passing in a PingBatcher to use, but nothing hands them an updated PingBatcher if we ever need to restart it for any reason.
Instead, we need to give them a function that returns whatever PingBatcher is currently live.
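
A rough sketch of that idea, not the actual Juju code: the batcher, supervisor and pinger types and the getBatcher field below are all hypothetical stand-ins, and the dead batcher returns an error immediately rather than blocking or timing out as the real one does. The point is only that a Pinger holding a getter picks up the replacement after a restart, while one holding a fixed reference does not.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBatcherDead is returned by a batcher that has stopped (hypothetical).
var errBatcherDead = errors.New("ping batcher is dead")

// batcher is a hypothetical stand-in for PingBatcher.
type batcher struct {
	mu   sync.Mutex
	dead bool
}

// ping accepts a ping, or fails if this batcher has already died.
func (b *batcher) ping() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.dead {
		return errBatcherDead
	}
	return nil
}

// kill marks the batcher as dead.
func (b *batcher) kill() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.dead = true
}

// supervisor owns the live batcher and swaps in a new one on restart.
type supervisor struct {
	mu      sync.Mutex
	current *batcher
}

func newSupervisor() *supervisor {
	return &supervisor{current: &batcher{}}
}

// live returns whichever batcher is current at the time of the call.
func (s *supervisor) live() *batcher {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current
}

// restart kills the current batcher and replaces it with a fresh one.
func (s *supervisor) restart() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.current.kill()
	s.current = &batcher{}
}

// pinger asks for the batcher on every ping instead of caching one.
type pinger struct {
	getBatcher func() *batcher
}

func (p *pinger) ping() error {
	return p.getBatcher().ping()
}

func main() {
	s := newSupervisor()

	// stale mimics the current behaviour: it captured one batcher at construction.
	old := s.live()
	stale := &pinger{getBatcher: func() *batcher { return old }}
	// fresh asks the supervisor each time, so it follows restarts.
	fresh := &pinger{getBatcher: s.live}

	s.restart()

	fmt.Println("stale pinger:", stale.ping())
	fmt.Println("fresh pinger:", fresh.ping())
}
```

The getter is deliberately just a func value, so whatever owns the PingBatcher lifecycle can pass in its own accessor without the Pinger needing to know about restarts.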

(I'm also not sure what happens if a Pinger actually dies due to an error, as near as I can tell we don't ever restart them, either.)