PingBatcher could gracefully handle "DuplicateKeyError"

Bug #1703675 reported by John A Meinel
Affects          Status         Importance   Assigned to     Milestone
Canonical Juju   Fix Released   High         John A Meinel
2.2              Fix Released   High         John A Meinel

Bug Description

Mongo's Upsert is inherently slightly unreliable:
https://jira.mongodb.org/browse/SERVER-14322

It seems if you have 2 sessions to the same database, and both issue Upsert affecting a document that doesn't exist yet, they have the potential to race and both try to create it.
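
The interleaving described above can be sketched with a toy in-memory store. This is only an illustration of the check-then-create gap, not Mongo's or Juju's actual code; `store`, `racingUpserts` and `errDuplicateKey` are hypothetical names invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errDuplicateKey stands in for Mongo's E11000 error (hypothetical name).
var errDuplicateKey = errors.New("E11000 duplicate key error")

// store is a toy in-memory collection with a unique _id index.
type store struct {
	docs map[string]int
}

// exists is the "does the document exist" half of an upsert.
func (s *store) exists(id string) bool {
	_, ok := s.docs[id]
	return ok
}

// insert is the "create the document" half; it fails on a duplicate _id.
func (s *store) insert(id string, value int) error {
	if s.exists(id) {
		return errDuplicateKey
	}
	s.docs[id] = value
	return nil
}

// racingUpserts interleaves two sessions so that both check for the
// document before either of them creates it.
func racingUpserts(s *store, id string) (errA, errB error) {
	aSawDoc := s.exists(id) // session A checks: not found
	bSawDoc := s.exists(id) // session B checks: not found
	if !aSawDoc {
		errA = s.insert(id, 1) // A creates the document
	}
	if !bSawDoc {
		errB = s.insert(id, 2) // B also tries to create it and loses
	}
	return errA, errB
}

func main() {
	s := &store{docs: map[string]int{}}
	errA, errB := racingUpserts(s, "example-model:1499168370")
	fmt.Println("session A:", errA)
	fmt.Println("session B:", errB)
}
```

Session B fails only because its existence check ran before session A's insert; with the two sessions run back to back, B would have found the document and updated it instead.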

We've always had this race potential with Pings. I'm not sure if it's just that Pingers weren't logging ERRORs and restarts, or if the fact that PingBatcher actually randomizes its sync time made it slightly more likely.

Regardless it manifests itself with a log entry like:
ERROR juju.worker runner.go:392 exited "pingbatcher": E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "ae6dfe7a-e99d-45d9-8230-da097440e17f:1499168370" }

You can only trigger this in HA, because otherwise there is only a single controller that is managing all the data anyway. And you *might* need to have sufficient load to cause the gap between 'does the document exist' and 'create the document' to be wide enough for it to fail.

Because we know we're doing an Upsert, we should never get a "duplicate key error": we would want to update the document that already exists. So we can just treat that error as "retry the operation".
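
A minimal sketch of "treat that error as retry" might look like the following. It is not the code from the eventual fix; `retryOnDuplicateKey` and `isDuplicateKey` are hypothetical helpers, and matching on the "E11000" text is an assumption made to keep the example free of driver dependencies.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errDup imitates the error text from the log entry above.
var errDup = errors.New("E11000 duplicate key error collection: presence.presence.pings")

// isDuplicateKey reports whether err looks like Mongo's E11000 error.
// Matching on the message text is an assumption made for this sketch.
func isDuplicateKey(err error) bool {
	return err != nil && strings.Contains(err.Error(), "E11000")
}

// retryOnDuplicateKey runs op up to attempts times. It retries only on
// duplicate key errors; any other result is returned immediately.
func retryOnDuplicateKey(attempts int, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if !isDuplicateKey(err) {
			return err // success, or an error we should not mask
		}
	}
	return err // still a duplicate key error after all attempts
}

func main() {
	calls := 0
	// A fake upsert that loses the race once, then succeeds.
	upsert := func() error {
		calls++
		if calls == 1 {
			return errDup
		}
		return nil
	}
	err := retryOnDuplicateKey(3, upsert)
	fmt.Println("calls:", calls, "err:", err)
}
```

Retrying like this is only safe if the operation can be re-applied without changing the result.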

We need to be a little careful given that PingBatcher is trying to use a Bulk API, which means we don't know which entries in the Bulk request were actually applied and which failed. (And $inc is not idempotent.)

However, if we switch to {$bit: {$or: }} then it *is* idempotent and we could just trap E11000 and re-apply the same request.
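
To illustrate why re-applying matters, here is a small sketch that assumes each agent's ping is recorded as a single bit in an integer field. `applyInc` and `applyBitOr` are hypothetical stand-ins for the two kinds of update, operating on a plain integer rather than a Mongo document.

```go
package main

import "fmt"

// applyInc models an $inc update: it adds the agent's bit value to the field.
func applyInc(field, bit uint64) uint64 {
	return field + bit
}

// applyBitOr models a $bit update using bitwise OR: it sets the agent's bit.
func applyBitOr(field, bit uint64) uint64 {
	return field | bit
}

func main() {
	bit := uint64(1) << 5 // this agent's position in the field (illustrative)

	// Applying the same $inc twice corrupts the field: the agent's own
	// bit ends up cleared and the neighbouring bit is set instead.
	inc := applyInc(applyInc(0, bit), bit)

	// Applying the same OR twice gives the same result as applying it once.
	or := applyBitOr(applyBitOr(0, bit), bit)

	fmt.Printf("inc twice: %b\n", inc)
	fmt.Printf("or twice:  %b\n", or)
}
```

With OR, a Bulk request that failed part-way can simply be sent again in full, since entries that were already applied are unaffected by the repeat.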

John A Meinel (jameinel) wrote :

This may not be High/need a 2.2 target. It shouldn't be a huge patch, and with the fix for bug #1703526 it shouldn't actually matter because the PingBatcher will just get restarted. You might get a blip where for 30s-1min some agents show as down. So this is still a bug, but it may not need a high priority.

Jacek Nykis (jacekn) wrote :

We've just hit this bug in production and it resulted in very high load on the controller.

We are currently on juju 2.2.4

Is there any workaround available?

tags: added: canonical-is
John A Meinel (jameinel) wrote :

From discussion on IRC, this does not seem to be the primary cause, just a symptom of what happens when Mongo starts to get unhappy. It probably doesn't *help*, as restarting PingBatcher will put some load on the system.

So I put together this PR, which simplifies the code a bit and addresses bug #1699678 at the same time.

https://github.com/juju/juju/pull/7863

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
John A Meinel (jameinel) wrote :

this should have landed in 2.2-beta1

Changed in juju:
status: In Progress → Fix Released
milestone: none → 2.3-beta2
John A Meinel (jameinel) wrote :

sorry, 2.3beta1

Changed in juju:
milestone: 2.3-beta2 → 2.3-beta1