PingBatcher could gracefully handle "DuplicateKeyError"
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Canonical Juju | Fix Released | High | John A Meinel | |
2.2 | Fix Released | High | John A Meinel | |
Bug Description
Mongo has an underlying slightly-unreliable nature to Upsert:
https:/
It seems if you have 2 sessions to the same database, and both issue Upsert affecting a document that doesn't exist yet, they have the potential to race and both try to create it.
We've always had this potential race with Pings. I'm not sure whether Pingers simply weren't logging ERRORs and restarts, or whether the fact that PingBatcher actually randomizes its sync time made the race slightly more likely.
Regardless it manifests itself with a log entry like:
ERROR juju.worker runner.go:392 exited "pingbatcher": E11000 duplicate key error collection: presence.
You can only trigger this in HA, because otherwise there is only a single controller that is managing all the data anyway. And you *might* need to have sufficient load to cause the gap between 'does the document exist' and 'create the document' to be wide enough for it to fail.
Because we know we're doing an Upsert, we should never be getting "duplicate key error": if the document exists, we want to update it. So we can just treat that error as "retry the operation".
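A minimal sketch of the "treat it as retry" idea, using only the Go standard library. The names `isDuplicateKey`, `retryOnDuplicateKey` and `errSimulatedDup` are hypothetical and not taken from the Juju code base, and detecting the error by matching "E11000" in the message is an assumption made for illustration; a real driver may offer a proper check.

```go
package main

import (
	"errors"
	"strings"
)

// errSimulatedDup stands in for the error a Mongo driver might return.
// It exists only so this sketch can run without a database.
var errSimulatedDup = errors.New("E11000 duplicate key error collection: presence")

// isDuplicateKey reports whether err looks like a duplicate key error.
// Matching on the message text is an assumption made for this sketch.
func isDuplicateKey(err error) bool {
	return err != nil && strings.Contains(err.Error(), "E11000")
}

// retryOnDuplicateKey calls op at least once and at most maxAttempts times.
// It retries only when op fails with a duplicate key error; any other
// result (success or a different error) is returned immediately.
func retryOnDuplicateKey(maxAttempts int, op func() error) error {
	for attempt := 1; ; attempt++ {
		err := op()
		if !isDuplicateKey(err) || attempt >= maxAttempts {
			return err
		}
	}
}
```

Here `op` would wrap the Upsert. Retrying blindly like this is only safe if the operation inside `op` is idempotent.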
We need to be a little careful given PingBatcher is trying to use a Bulk api, which means we don't know which entries in Bulk actually were applied, and what failed. (And $inc is not idempotent.)
However, if we switch to {$bit: {$or: }} then it *is* idempotent and we could just trap E11000 and re-apply the same request.
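The difference between the two update styles can be illustrated with plain integers. This is only a sketch of the arithmetic, not of Juju's presence schema: `applyInc` and `applyBitOr` are hypothetical helpers that model what an increment and a bitwise OR do to a stored value.

```go
package main

// applyInc models an increment-style update: each application adds delta.
func applyInc(stored, delta uint64) uint64 {
	return stored + delta
}

// applyBitOr models a bitwise-OR update: each application sets the bits in mask.
func applyBitOr(stored, mask uint64) uint64 {
	return stored | mask
}

// replayTwice applies the same update twice, as a blind retry of a
// partially applied bulk request might do.
func replayTwice(update func(stored, arg uint64) uint64, start, arg uint64) uint64 {
	return update(update(start, arg), arg)
}
```

With a starting value of 0 and an argument of 0x5, replaying the OR leaves 0x5, the same as applying it once, while replaying the increment yields 0xA. That is why re-sending the whole request after E11000 is only safe with the OR form.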
Changed in juju:
milestone: 2.3-beta2 → 2.3-beta1
This may not need High importance or a 2.2 target. It shouldn't be a huge patch, and with the fix for bug #1703526 it shouldn't actually matter, because the PingBatcher will just get restarted. You might get a blip where some agents show as down for 30s-1min. So this is still a bug, but it may not need a high priority.