juju2beta12: E11000 duplicate key error collection: juju.txns.stash

Bug #1604644 reported by Adam Stokes
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Menno Finlay-Smits

Bug Description

This problem still exists with Juju 2 beta 12:

```
flume-syslog/0: cannot assign unit "flume-syslog/0" to machine: cannot assign unit "flume-syslog/0" to new machine or container: cannot assign unit "flume-syslog/0" to new machine: E11000 duplicate key error collection: juju.txns.stash index: _id_ dup key: { : { c: "assignUnits", id: "08760b44-e3cf-4a7e-81d8-164cd846bd3e:flume-syslog/0" } }
```
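
When counting how often a deployment hits this error, it can help to pull the collection and key out of log lines automatically. The following is a minimal Go sketch that assumes the message format is exactly the one quoted above; the `DupKey` type and `parseDupKey` helper are hypothetical names, not part of Juju or mgo.

```go
package main

import (
	"fmt"
	"regexp"
)

// dupKeyRe matches the E11000 message as it is quoted in this report.
// Assumption: other MongoDB versions may word the message differently.
var dupKeyRe = regexp.MustCompile(
	`E11000 duplicate key error collection: (\S+) index: (\S+) dup key: \{ : \{ c: "([^"]+)", id: "([^"]+)" \} \}`)

// DupKey holds the fields extracted from one E11000 message.
type DupKey struct {
	Collection string // e.g. juju.txns.stash
	Index      string // e.g. _id_
	C          string // the "c" field of the duplicated key
	ID         string // the "id" field of the duplicated key
}

// parseDupKey (hypothetical helper) reports whether line contains an
// E11000 message in the format above and returns its fields.
func parseDupKey(line string) (DupKey, bool) {
	m := dupKeyRe.FindStringSubmatch(line)
	if m == nil {
		return DupKey{}, false
	}
	return DupKey{Collection: m[1], Index: m[2], C: m[3], ID: m[4]}, true
}

func main() {
	line := `cannot assign unit "flume-syslog/0" to new machine: E11000 duplicate key error collection: juju.txns.stash index: _id_ dup key: { : { c: "assignUnits", id: "08760b44-e3cf-4a7e-81d8-164cd846bd3e:flume-syslog/0" } }`
	if k, ok := parseDupKey(line); ok {
		fmt.Printf("collection=%s index=%s c=%s id=%s\n", k.Collection, k.Index, k.C, k.ID)
	}
}
```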

This can easily be reproduced: create a default model, deploy a bundle, run destroy-model, run juju add-model, then deploy the bundle again.

Here is the machine-0 output:

http://paste.ubuntu.com/20126519/

This is ongoing from bug #1593828

summary: - E11000 duplicate key error collection: juju.txns.stash
+ juju2beta12: E11000 duplicate key error collection: juju.txns.stash
tags: added: conjure
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.0-beta13
tags: added: mongodb
tags: added: blocker
Menno Finlay-Smits (menno.smits) wrote :

It looks like the fix for bug #1593828 didn't make the beta12 release despite it being marked "Fix Committed" and it being mentioned in the release notes.

The fix was committed on July 15: 99cb2d1c148f5ed1d246bf4fe44064363226e12e
(PR: https://github.com/juju/juju/pull/5812)

As the fix is in master it will be part of 2.0-beta13. We need to decide whether a release which includes this fix needs to go out sooner than whenever beta13 is ready. beta12 (without the fix) has already been uploaded into the archives.

Menno Finlay-Smits (menno.smits) wrote :

It turns out that there were some issues during the build of the release: the Git tag for juju-2.0beta12 is incorrect, and the revision with the mgo patch was probably included in the release.

The source for the release is here: http://reports.vapour.ws/releases/4141/binaries. The log for the creation of this tarball is here: http://reports.vapour.ws/releases/4141/job/build-revision/attempt/4141. The application of the mgo patch can be seen towards the bottom.

There is also a source tarball for the release here: https://github.com/juju/juju/releases/tag/juju-2.0-beta12. This only contains the source for github.com/juju/juju though. The source for Juju's dependencies (including the patched mgo) is not included in this.

Things to figure out now:

1. Determine with complete certainty that the patched mgo was used to build the beta12 release that's now in the PPA.
2. If #1 holds, figure out why the patch isn't working.

Christian Muirhead (2-xtian) wrote :

At the moment mgo will retry an upsert a maximum of 5 times - it's possible that that's not enough attempts in some cases. Unfortunately there's no way to increase the number of retries other than changing the code. We can add some logging on the too-many-attempts path to see if that's the problem.
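
To make the idea concrete, here is a rough Go sketch of a bounded retry with a log line on the too-many-attempts path. It is not the mgo code: `retryUpsert` and `errDup` are hypothetical stand-ins, and treating the limit of 5 as a count of total attempts is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// maxUpsertRetries mirrors the limit of 5 mentioned above.
// Assumption: it is counted here as total attempts.
const maxUpsertRetries = 5

// errDup is a hypothetical stand-in for an E11000 duplicate key error.
var errDup = errors.New("E11000 duplicate key error")

// retryUpsert calls attempt until it succeeds, fails with a different
// error, or has been tried maxUpsertRetries times. It returns the number
// of attempts made and the final error.
func retryUpsert(attempt func() error) (int, error) {
	var err error
	for i := 1; i <= maxUpsertRetries; i++ {
		err = attempt()
		if !errors.Is(err, errDup) {
			// Success (nil) or an unrelated error: stop retrying.
			return i, err
		}
	}
	// Too-many-attempts path: log so the condition is visible.
	log.Printf("upsert still failing after %d attempts: %v", maxUpsertRetries, err)
	return maxUpsertRetries, err
}

func main() {
	calls := 0
	n, err := retryUpsert(func() error {
		calls++
		if calls < 3 {
			return errDup // simulate two lost races
		}
		return nil
	})
	fmt.Println("attempts:", n, "err:", err)
}
```

If that log line shows up in machine-0's log, the retry limit is being reached.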

Curtis Hovey (sinzui) wrote :
Adam Stokes (adam-stokes) wrote :

Ok, trying both Menno's build and retrying a fresh Xenial image using the juju beta12 from ppa:juju/devel I can no longer reproduce.

I'm going to try a few more different bundles and variations of destroy-model/add-model to see if anything changes. Otherwise, I think we can go ahead and mark this invalid for now.

Changed in juju-core:
status: Triaged → Invalid
milestone: 2.0-beta13 → none
tags: removed: blocker
Adam Stokes (adam-stokes) wrote :

Ran into the issue again, this time on a fresh bootstrap:

ceph-osd/2: cannot assign unit "ceph-osd/2" to machine: cannot assign unit "ceph-osd/2" to new machine or container: cannot assign unit "ceph-osd/2" to new machine: E11000 duplicate key error collection: juju.txns.stash index: _id_ dup key: { : { c: "assignUnits", id: "5d51713a-2356-40ec-8c4d-1967abb086de:ceph-osd/2" } }

machine-0 output: http://paste.ubuntu.com/20180271/

This is using the package from ppa:juju/devel, and run with the following:

juju bootstrap marin localhost --upload-tools --config image-stream=daily --config enable-os-upgrade=false --bootstrap-series=xenial

Changed in juju-core:
status: Invalid → New
Changed in juju-core:
status: New → Triaged
milestone: none → 2.0-beta13
Martin Packman (gz) wrote :

CI just hit bug 1604959 on master which is likely a different, restore-specific issue, but maybe worth thinking about.

Christian Muirhead (2-xtian) wrote :

Looking at the mgo patch, there's a bug in it that prevents it from propagating the duplicate key error, even after 5 failed attempts. That means that when we see the E11000 errors, they must be coming from binaries without the patch. So I guess that means the binaries from ppa:juju/devel aren't right.

You can see the bug here (on mgo.v2-unstable - the patch puts the same loop into v2): https://github.com/go-mgo/mgo/blob/v2-unstable/session.go#L4276

If the loop exits because there was a duplicate key error and i >= maxUpsertRetries, it needs to repeat the checks in L4271-4274 after the loop. As the code is now it can't return the E11000 error, only NotFound or an unmarshalling error.
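
A simplified Go model of that loop shape may make the problem easier to see. This is not the mgo source: `applyBuggy`, `applyFixed`, `finish`, `errDup` and `errNotFound` are all hypothetical stand-ins, and the post-loop step is reduced to a single check.

```go
package main

import (
	"errors"
	"fmt"
)

const maxUpsertRetries = 5

var (
	errDup      = errors.New("E11000 duplicate key error") // stand-in for the server error
	errNotFound = errors.New("not found")                  // stand-in for a NotFound result
)

// finish models the post-loop step that inspects the command result.
// With no result it can only report errNotFound.
func finish(result string) (string, error) {
	if result == "" {
		return "", errNotFound
	}
	return result, nil
}

// applyBuggy retries on duplicate key errors but never re-checks err
// after the loop, so an exhausted retry budget falls through to finish.
func applyBuggy(run func() (string, error)) (string, error) {
	var result string
	var err error
	for i := 0; i < maxUpsertRetries; i++ {
		result, err = run()
		if err == nil {
			break
		}
		if errors.Is(err, errDup) {
			continue
		}
		return "", err
	}
	return finish(result)
}

// applyFixed repeats the error check after the loop, so the duplicate
// key error is returned once the retries are used up.
func applyFixed(run func() (string, error)) (string, error) {
	var result string
	var err error
	for i := 0; i < maxUpsertRetries; i++ {
		result, err = run()
		if err == nil || !errors.Is(err, errDup) {
			break
		}
	}
	if err != nil {
		return "", err
	}
	return finish(result)
}

func main() {
	alwaysDup := func() (string, error) { return "", errDup }
	_, errB := applyBuggy(alwaysDup)
	_, errF := applyFixed(alwaysDup)
	fmt.Println("buggy:", errB)
	fmt.Println("fixed:", errF)
}
```

In this model the buggy variant reports the stand-in NotFound error when every attempt hits a duplicate key, while the fixed variant returns the duplicate key error itself.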

I'm going to fix the patch and make another PR for mgo. Menno suggests adding a one-time log message stating that the patch is present, so we can be sure which binaries have it and which don't. I'll do that too.
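
One way to get a log-once marker in Go is `sync.Once`. The sketch below only illustrates the idea; the `patchNotice` type and the message text are hypothetical and are not taken from the actual patch.

```go
package main

import (
	"log"
	"sync"
)

// patchNotice (hypothetical type) emits a marker message at most once,
// no matter how many times or from how many goroutines Log is called.
type patchNotice struct {
	once sync.Once
}

// Log writes the marker through logf on the first call only.
func (p *patchNotice) Log(logf func(format string, args ...interface{})) {
	p.once.Do(func() {
		logf("mgo duplicate key upsert retry patch is present")
	})
}

func main() {
	var notice patchNotice
	for i := 0; i < 3; i++ {
		// Imagine this call sitting on the patched upsert path.
		notice.Log(log.Printf)
	}
}
```

Searching the agent log for the marker would then show whether a given binary carries the patch.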

Menno Finlay-Smits (menno.smits) wrote :

I had provided Adam with a custom build of 2.0-beta12 which definitely had the mgo patch applied and he's seeing the E11000 error even with that. This contradicts what Christian is saying about E11000 not being possible at all for an upsert due to a bug with the patch.

I'm still digging...

Menno Finlay-Smits (menno.smits) wrote :

Due to the just-discovered bug 1605050, the patched custom build I provided to Adam will have been automatically "upgrading" to the beta12 from the PPA. This is why Adam was still seeing the E11000 errors with that build.

I'm producing a new beta12 variant which won't have this problem.

Menno Finlay-Smits (menno.smits) wrote :

Adam: please try to reproduce the E11000 issue with this build. It's called juju-2.0-gamma1 and it's beta12 with the mgo patch applied, but is called gamma1 to work around bug 1605050.

https://www.dropbox.com/sh/qg89e4y3ge6qeri/AAAzRZL_460S93AoJ4MpcpXfa?dl=0

Menno Finlay-Smits (menno.smits) wrote :

My current thinking is that the mgo patch is probably ok (apart from the issue which Christian discusses above) and that beta12 *didn't* have the patch applied. We need confirmation that the problem isn't reproducible using the "gamma1" build above to help confirm this.

Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Menno Smits (menno.smits)
Adam Stokes (adam-stokes) wrote :

I've been testing the binaries from https://bugs.launchpad.net/juju-core/+bug/1604644/comments/11 for the last 3 hours.

Doing various combinations:

1. bootstrap, deploy bundle, destroy-controller <controller> --destroy-all-models
2. bootstrap, deploy bundle, destroy-model <model>, add-model <model>, deploy bundle
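
For anyone repeating combination 2, the command sequence can be generated rather than typed by hand. This Go sketch only builds and prints the command lines, it does not run them; `redeployCycle` and its arguments are hypothetical, and any confirmation prompts from juju are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// redeployCycle (hypothetical helper) returns the juju command lines for
// combination 2: bootstrap and deploy once, then repeat
// destroy-model / add-model / deploy for the given number of rounds.
func redeployCycle(controller, cloud, model, bundle string, rounds int) [][]string {
	cmds := [][]string{
		{"juju", "bootstrap", controller, cloud},
		{"juju", "deploy", bundle},
	}
	for i := 0; i < rounds; i++ {
		cmds = append(cmds,
			[]string{"juju", "destroy-model", model},
			[]string{"juju", "add-model", model},
			[]string{"juju", "deploy", bundle},
		)
	}
	return cmds
}

func main() {
	// All argument values here are placeholders.
	for _, c := range redeployCycle("<controller>", "localhost", "<model>", "<bundle>", 2) {
		fmt.Println(strings.Join(c, " "))
	}
}
```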

All of this done on localhost (LXD).

I haven't run into this duplicate error yet. I'm going to continue testing for another hour or so but wanted to update with my findings so far.

Note: I usually hit the error after 10 deployments (in either combination above). So far I'm at about 40 deployments with no hiccup.
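
As a rough sanity check on how much 40 clean deployments says, the Go sketch below assumes, purely for illustration, that "usually after 10 deployments" corresponds to each deployment failing independently with probability 0.1. That probability and the `cleanRunProbability` helper are assumptions, not measurements from this bug.

```go
package main

import (
	"fmt"
	"math"
)

// cleanRunProbability returns the chance of n consecutive deployments
// without the error when each one fails independently with probability p.
func cleanRunProbability(p float64, n int) float64 {
	return math.Pow(1-p, float64(n))
}

func main() {
	// p = 0.1 is an assumed per-deployment failure rate.
	fmt.Printf("%.4f\n", cleanRunProbability(0.1, 40))
}
```

Under that assumption, 40 clean runs in a row would happen by chance only about 1.5% of the time on an unpatched binary.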

Adam Stokes (adam-stokes) wrote :

Update: going on 5 hours of continuous testing and no issues with the binary from comment #11.

Christian Muirhead (2-xtian) wrote :

Adam says the patched binary Menno gave him is good so far - no E11000s seen. (Thanks Adam!)

PR for the fix to mgo.v2-unstable:
https://github.com/go-mgo/mgo/pull/302

PR to update the patch in the Juju tree, and add logging there so we can see easily whether the running binary has the mgo patch:
https://github.com/juju/juju/pull/5854

Felipe Reyes (freyes) wrote :

After roughly 3 hours doing bootstraps (maas provider) I couldn't reproduce this issue anymore. Using beta12 I can reproduce the issue in 9 out of 10 deployments.

Felipe Reyes (freyes)
tags: added: sts
Tim Penhey (thumper)
Changed in juju-core:
status: In Progress → Fix Committed
Menno Finlay-Smits (menno.smits) wrote :

Adam and Felipe: Thanks for all your testing efforts so far. It would be great if you could also test juju-2.0-gamma2 from here:

https://www.dropbox.com/sh/qg89e4y3ge6qeri/AAAzRZL_460S93AoJ4MpcpXfa?dl=0

This is beta12 with the next iteration of the mgo patch. The updated patch addresses a problem that Christian noticed when the upsert retry limit is reached.

It would be awesome to get some confidence that the updated patch is ok before beta13 is cut (soon). I've done a few tests with the battlemidget/openstack-novalxd bundle and haven't managed to trigger the E11000 issue myself.

Felipe Reyes (freyes) wrote : Re: [Bug 1604644] Re: juju2beta12: E11000 duplicate key error collection: juju.txns.stash

On Fri, 22 Jul 2016 00:57:57 -0000
Menno Smits <email address hidden> wrote:

> It would be awesome to get some confidence that the updated patch is
> ok before beta13 is cut (soon). I've done a few tests with the
> battlemidget/openstack-novalxd bundle and haven't managed to trigger
> the E11000 issue myself.

I've been bootstrapping with this version for the last hour or so and I
couldn't reproduce the problem, all the controllers were provisioned
successfully \o/

Great work, kudos to everyone involved in this fix, because it was a
tricky one.

Menno Finlay-Smits (menno.smits) wrote :

Felipe: That's great news. Thanks very much for testing.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Menno Finlay-Smits (menno.smits) wrote :

Should this have been marked "Fix Released" for beta13?

Anastasia (anastasia-macmood) wrote :

@Menno

The release was done on an earlier commit. So this fix did not make it into beta 13.
However, beta13 is patched with the original fix that was intended to go with beta12.

This particular fix will be released with beta14.

Felipe Reyes (freyes) wrote :

On Mon, 25 Jul 2016 22:06:09 -0000
Anastasia <email address hidden> wrote:

> @Menno
>
> The release was done on an earlier commit. So this fix did not make
> it into beta 13. However, beta13 is patched with the original fix
> that was intended to go with beta12.

Anastasia, from the release notes "Fix for Mongo 'duplicate key
error' applied to both client and agents.", so they seem to be
inaccurate, unless there is another (similar) bug that I'm not aware of.

Anastasia (anastasia-macmood) wrote :

@Felipe

Beta13 is patched with the original mongo fix and should work for you, as per your recent testing of the binary provided by Menno. The patch is applied to the release, not the codebase. You should be all good :)

However, the code changes that Menno refers to above will be in the code base for beta14.

I hope this clears the mud \o/

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta14 → none
milestone: none → 2.0-beta14