Canonical Juju

PingBatcher could gracefully handle "DuplicateKeyError"

Bug #1703675 reported by John A Meinel on 2017-07-11

This bug affects 1 person

Affects         Status        Importance  Assigned to    Milestone
Canonical Juju  Fix Released  High        John A Meinel  Canonical Juju 2.3-beta1
2.2             Fix Released  High        John A Meinel  Canonical Juju 2.2.5

Bug Description

Mongo's Upsert is inherently slightly unreliable:
https://jira.mongodb.org/browse/SERVER-14322

It seems that if you have 2 sessions to the same database, and both issue an Upsert affecting a document that doesn't exist yet, they can race and both try to create it.
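
The race can be pictured with a toy in-memory store. This is only a sketch of the interleaving described above, not Mongo's actual implementation; the `store` type, `racedUpserts`, and `errDuplicateKey` are hypothetical names made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errDuplicateKey stands in for Mongo's E11000 error (hypothetical name).
var errDuplicateKey = errors.New("E11000 duplicate key error")

// store is a toy in-memory collection keyed by _id.
type store struct {
	docs map[string]int
}

// exists is the "does the document exist" half of an upsert.
func (s *store) exists(id string) bool {
	_, ok := s.docs[id]
	return ok
}

// insert is the "create the document" half; it fails on a duplicate _id.
func (s *store) insert(id string, value int) error {
	if _, ok := s.docs[id]; ok {
		return fmt.Errorf("%w: dup key %q", errDuplicateKey, id)
	}
	s.docs[id] = value
	return nil
}

// racedUpserts interleaves two sessions so that both check for the
// document before either of them inserts it.
func racedUpserts(s *store, id string) (errA, errB error) {
	missingA := !s.exists(id) // session A checks
	missingB := !s.exists(id) // session B checks before A has inserted
	if missingA {
		errA = s.insert(id, 1)
	}
	if missingB {
		errB = s.insert(id, 1) // loses the race
	}
	return errA, errB
}

func main() {
	s := &store{docs: map[string]int{}}
	errA, errB := racedUpserts(s, "model-uuid:1499168370")
	fmt.Println("session A:", errA)
	fmt.Println("session B:", errB)
}
```

Session A creates the document; session B already decided it was missing, so its insert fails with the duplicate key error even though an update would have been the right outcome.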

We've always had this race potential with Pings. I'm not sure if it's just that Pingers weren't logging ERRORs and restarts, or if the fact that PingBatcher actually randomizes its sync time made it slightly more likely.

Regardless it manifests itself with a log entry like:
ERROR juju.worker runner.go:392 exited "pingbatcher": E11000 duplicate key error collection: presence.presence.pings index: _id_ dup key: { : "ae6dfe7a-e99d-45d9-8230-da097440e17f:1499168370" }

You can only trigger this in HA, because otherwise there is only a single controller that is managing all the data anyway. And you *might* need to have sufficient load to cause the gap between 'does the document exist' and 'create the document' to be wide enough for it to fail.

Because we know we're doing an Upsert, we should never be getting "duplicate key error": we would want to be updating the document that exists. So we can just treat that error as "retry the operation".
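
A minimal sketch of "treat the error as retry", assuming the duplicate key error can be recognised by the "E11000" text in its message. The helper names `isDuplicateKey` and `retryOnDuplicateKey`, the attempt limit, and the fake error are all assumptions for illustration; this is not the code from the eventual fix.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errFakeDup imitates the logged error; used only to drive the example.
var errFakeDup = errors.New("E11000 duplicate key error collection: presence.presence.pings")

// isDuplicateKey reports whether err looks like Mongo's E11000 error.
// Matching on the message text is an assumption made for this sketch.
func isDuplicateKey(err error) bool {
	return err != nil && strings.Contains(err.Error(), "E11000")
}

// retryOnDuplicateKey runs op until it succeeds, fails with some other
// error, or maxAttempts attempts have all hit a duplicate key error.
func retryOnDuplicateKey(maxAttempts int, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = op()
		if !isDuplicateKey(err) {
			return err // success, or an error that should not be retried
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Fake upsert: loses the race once, then finds the document and updates it.
	upsert := func() error {
		calls++
		if calls == 1 {
			return errFakeDup
		}
		return nil
	}
	err := retryOnDuplicateKey(3, upsert)
	fmt.Println("error:", err, "calls:", calls)
}
```

A blind retry like this is only safe if the operation being retried is idempotent.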

We need to be a little careful, given that PingBatcher is trying to use the Bulk API, which means we don't know which entries in the Bulk were actually applied and which failed. (And $inc is not idempotent.)

However, if we switch to {$bit: {$or: }} then it *is* idempotent and we could just trap E11000 and re-apply the same request.
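
To make the idempotency argument concrete, here is a small sketch that models the two update styles as plain integer operations and replays each one, as a blind retry would. It does not talk to Mongo; `applyInc`, `applyBitOr`, `applyTwice`, `report`, and the bit position are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// applyInc models an $inc update: add delta to the stored field.
func applyInc(field, delta uint64) uint64 { return field + delta }

// applyBitOr models a $bit update with a bitwise OR: set the bits in mask.
func applyBitOr(field, mask uint64) uint64 { return field | mask }

// applyTwice replays the same update, as a retry after an ambiguous
// bulk failure would if the first attempt had in fact been applied.
func applyTwice(update func(field, arg uint64) uint64, field, arg uint64) uint64 {
	return update(update(field, arg), arg)
}

// report compares applying an update once and twice to an empty field.
func report(name string, update func(field, arg uint64) uint64, arg uint64) string {
	once := update(0, arg)
	twice := applyTwice(update, 0, arg)
	return fmt.Sprintf("%s: once=%d twice=%d idempotent=%t", name, once, twice, once == twice)
}

func main() {
	pingBit := uint64(1) << 3 // hypothetical bit for one pinger

	fmt.Println(report("inc", applyInc, pingBit))      // inc: once=8 twice=16 idempotent=false
	fmt.Println(report("bit or", applyBitOr, pingBit)) // bit or: once=8 twice=8 idempotent=true
}
```

Replaying the increment corrupts the field (the bit for a different pinger ends up set), while replaying the OR leaves it unchanged, which is why re-sending the whole request after E11000 is only safe with the bitwise form.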

John A Meinel (jameinel) wrote on 2017-07-11:

This may not need to be High or have a 2.2 target. It shouldn't be a huge patch, and with the fix for bug #1703526 it shouldn't actually matter because the PingBatcher will just get restarted. You might get a blip where for 30s-1min some agents show as down. That means this is still a bug; it just may not need a high priority to fix.

Jacek Nykis (jacekn) wrote on 2017-09-19:

We've just hit this bug in production and it resulted in very high load on the controller.

We are currently on juju 2.2.4.

Is there any workaround available?

tags:

added: canonical-is

John A Meinel (jameinel) wrote on 2017-09-19:

From discussion on IRC, this does not seem to be the primary cause, just a symptom of what happens when Mongo starts to get unhappy. It probably doesn't *help*, as restarting PingBatcher puts some load on the system.

So I put together this PR, which simplifies the code a bit and addresses bug #1699678 at the same time.

https://github.com/juju/juju/pull/7863

Changed in juju:
assignee:	nobody → John A Meinel (jameinel)
status:	Triaged → In Progress

John A Meinel (jameinel) wrote on 2017-10-11:

this should have landed in 2.2-beta1

Changed in juju:
status:	In Progress → Fix Released
milestone:	none → 2.3-beta2

John A Meinel (jameinel) wrote on 2017-10-11:

sorry, 2.3beta1

Anastasia (anastasia-macmood) on 2017-10-11

Changed in juju:
milestone:	2.3-beta2 → 2.3-beta1
