cannot assign unit E11000 duplicate key error collection: juju.txns.stash

Bug #1593828 reported by Adam Stokes
This bug affects 9 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Christian Muirhead

Bug Description

Beta 9 fails to deploy 1 of 15 applications in our openstack deployment. Juju status reports:

cannot assign unit "nova-compute/0" to machine: cannot assign unit "nova-compute/0" to new machine or container: cannot assign unit "nova-compute/0" to new machine: E11000 duplicate key error collection: juju.txns.stash index: _id_ dup key: { : { c: "assignUnits", id: "7918e6b1-9b65-41ff-84b1-ac5c250adb80:nova-compute/0" }

Can easily be reproduced using the conjure bundle/spell from https://github.com/battlemidget/openstack-novalxd

tags: added: conjure
Revision history for this message
Reed O'Brien (reedobrien) wrote :

Can you please provide the logs for /var/log/juju/machine-0.log and /var/log/syslog from the controller?
Have you seen this more than once?
Was this the release beta-9 or the daily PPA?

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Reed O'Brien (reedobrien) wrote :

I tried reproducing this on aws using the instructions at: http://conjure-up.io/get-started/ -- no dice, but a different error.

Also tried on LXD locally, but that isn't doable AFAICT.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Which version of MongoDB is Juju using? Use this to check:

dpkg -l juju-\* | grep ^ii

If it's 3.2 I believe this is a known issue which is still to be resolved. I can't find the ticket right now though.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've poked babbageclunk as I believe he was looking into this issue recently.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

The bug I dup'ed to this one, bug #1594290, was on xenial, which would be using mongo 3.2.

Changed in juju-core:
status: Incomplete → Triaged
importance: Undecided → High
Revision history for this message
Christian Muirhead (2-xtian) wrote :

This is the same problem that I logged in bug #1588784. It seems to be an underlying mgo issue - I've created a github issue for it with a self-contained reproduction of the problem.
https://github.com/go-mgo/mgo/issues/277

I'll ask Gustavo if he has any ideas or advice for how to deal with it.
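
For illustration, here is a rough, hedged sketch in Go of the kind of concurrent insert/remove churn through mgo/txn that exercises the txns.stash collection. This is not the self-contained reproduction from the GitHub issue above; it assumes a plain local mongod on the default port and uses illustrative collection/document names:

package main

import (
	"log"
	"sync"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
	"gopkg.in/mgo.v2/txn"
)

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		log.Fatalf("dialling mongod: %v", err)
	}
	defer session.Close()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each goroutine gets its own session and runner, roughly like
			// separate API server connections hitting the same collections.
			s := session.Copy()
			defer s.Close()
			runner := txn.NewRunner(s.DB("repro").C("txns"))
			for j := 0; j < 200; j++ {
				// Repeatedly insert and remove the same document so the
				// runner has to park state in txns.stash while the other
				// goroutines race it.
				insert := []txn.Op{{
					C:      "units",
					Id:     "nova-compute/0",
					Assert: txn.DocMissing,
					Insert: bson.M{"life": "alive"},
				}}
				remove := []txn.Op{{
					C:      "units",
					Id:     "nova-compute/0",
					Assert: txn.DocExists,
					Remove: true,
				}}
				// Aborted transactions are expected when another goroutine
				// wins the race; an E11000 on repro.txns.stash is the bug.
				if err := runner.Run(insert, "", nil); err != nil && err != txn.ErrAborted {
					log.Printf("insert: %v", err)
				}
				if err := runner.Run(remove, "", nil); err != nil && err != txn.ErrAborted {
					log.Printf("remove: %v", err)
				}
			}
		}()
	}
	wg.Wait()
}

Whether and how quickly the duplicate key error shows up will depend on timing and the MongoDB version (the comments above point at 3.2), so treat this as an illustration of the access pattern rather than a guaranteed trigger.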

Revision history for this message
Martin Packman (gz) wrote :

We have now seen this in CI on the multi-model functional test, but only once, so the issue is apparently intermittent:

<http://reports.vapour.ws/releases/issue/576a8c03749a560e33f7c294>

tags: added: ci deploy intermittent-failure
Revision history for this message
Mike McCracken (mikemc) wrote :

I am seeing this regularly while testing deploys using a development version of conjure-up to deploy openstack services onto the lxd provider.

It is indeed intermittent, but not rare.

Some details that might help reproduce:

Because of design goals for conjure-up, we do not just 'juju deploy bundle.yaml' - conjure-up uses the juju API to deploy the applications in a bundle one at a time, and so issues many individual deploy-application requests. The most common thing I do in testing is to hit the 'deploy them all immediately' button, which issues them in a tight loop.

Let me know if I can help with tracking this down.

William Reade (fwereade)
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Christian and I have figured out what's happening with this. More details here:

https://lists.ubuntu.com/archives/juju-dev/2016-June/005706.html

We're about to propose a change to mgo/txn to deal with this.

Changed in juju-core:
status: Triaged → In Progress
Changed in juju-core:
assignee: Menno Smits (menno.smits) → Christian Muirhead (2-xtian)
Revision history for this message
Adam Stokes (adam-stokes) wrote :

I've seen this: https://github.com/go-mgo/mgo/pull/291#issuecomment-230511589

Can I get an update on where this issue stands?

Thank you!

Changed in juju-core:
milestone: none → 2.0-beta12
Changed in juju-core:
importance: High → Critical
Revision history for this message
Mark Shuttleworth (sabdfl) wrote : Re: [Bug 1593828] Re: cannot assign unit E11000 duplicate key error collection: juju.txns.stash

Thanks, I'm seeing it very frequently with b11 too.

Mark

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Hi Mark -

Sorry about that. We've got a fix for the underlying problem, but because it's an intermittent error it's been hard to reproduce in a test. It turns out there is a test in the txn sub-package that shows the error (and it passes when the fix is applied), but those tests weren't being run in CI. So I'm turning those on and fixing some other problems that this revealed. Hopefully with Gustavo's help I can get the fix merged in today.

Christian

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Thank you - a good reminder of the value of tests compared to the value
of tests-turned-off ;)

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Ok, the problems in those newly-enabled tests are resolved now. Gustavo was too busy to review the changes today, but hopefully he'll be able to look at them tomorrow and we can merge them across to the v2 branch.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Just hit this testing in the OIL lab, trying to add a model, on beta 11.

jenkins@jason-slave:~/test$ juju add-model test
ERROR failed to create new model: E11000 duplicate key error collection: juju.txns.stash index: _id_ dup key: { : { c: "statuseshistory", id: ObjectId('577e784b2b98233549d82577') } }

tags: added: oil
Revision history for this message
Christian Muirhead (2-xtian) wrote :

I got some good review feedback from Gustavo on the mgo PR - I've addressed his comments (I think) and I'll ask him to make another pass, and hopefully merge it into v2-unstable and v2. Once that's merged I can start updating the mgo dependencies for packages in our tree (and also any packages that depend on these packages):

gopkg.in/juju/jujusvg.v1
gopkg.in/juju/charmstore.v5-unstable
gopkg.in/juju/charm.v6-unstable
gopkg.in/juju/charmrepo.v2-unstable
gopkg.in/macaroon-bakery.v1
github.com/juju/bundlechanges
github.com/juju/gomaasapi
github.com/juju/idmclient
github.com/juju/romulus
github.com/juju/juju
github.com/juju/utils

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Topological sort of those so I can do them in dependency order:

gopkg.in/mgo.v2
github.com/juju/utils
github.com/juju/gomaasapi
gopkg.in/juju/charm.v6-unstable
gopkg.in/macaroon-bakery.v1
github.com/juju/idmclient
gopkg.in/juju/jujusvg.v1
gopkg.in/juju/charmrepo.v2-unstable
github.com/juju/bundlechanges
gopkg.in/juju/charmstore.v5-unstable
github.com/juju/romulus
github.com/juju/juju

This list ignores the dependency cycles between (romulus and juju) and (charmstore and charmrepo). I'm not too sure what to do about those yet.
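
For what it's worth, a small sketch of how that ordering can be computed: repeatedly emit packages whose dependencies have already been handled, and anything left over at the end is part of a cycle. The edge list below is abridged and assumed for illustration, not the real dependency graph:

package main

import "fmt"

func main() {
	// Illustrative only: deps[pkg] lists the packages that pkg depends on.
	deps := map[string][]string{
		"gopkg.in/mgo.v2":                 {},
		"github.com/juju/utils":           {"gopkg.in/mgo.v2"},
		"github.com/juju/gomaasapi":       {"github.com/juju/utils"},
		"gopkg.in/juju/charm.v6-unstable": {"github.com/juju/utils"},
		"gopkg.in/macaroon-bakery.v1":     {"gopkg.in/mgo.v2"},
		"github.com/juju/juju":            {"github.com/juju/gomaasapi", "gopkg.in/juju/charm.v6-unstable", "gopkg.in/macaroon-bakery.v1"},
	}

	done := map[string]bool{}
	var order []string
	for len(order) < len(deps) {
		progressed := false
		// Emit any package whose dependencies have all been handled already.
		for pkg, ds := range deps {
			if done[pkg] {
				continue
			}
			ready := true
			for _, d := range ds {
				if !done[d] {
					ready = false
					break
				}
			}
			if ready {
				done[pkg] = true
				order = append(order, pkg)
				progressed = true
			}
		}
		// If nothing was ready in a full pass, the leftovers form a cycle
		// (like romulus <-> juju or charmstore <-> charmrepo).
		if !progressed {
			fmt.Println("cycle detected among remaining packages")
			break
		}
	}
	for _, pkg := range order {
		fmt.Println(pkg)
	}
}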

Revision history for this message
Uros Jovanovic (uros-jovanovic) wrote :

Chris, I've built juju with the PR for the mgo fix. I haven't seen the dup errors in the logs anymore.

However, if you bootstrap LXD and do:

juju bootstrap localxd lxd --upload-tools
for i in {1..30}; do juju deploy ubuntu ubuntu$i; sleep 90; done

Somewhere between the 10th and 20th deploy, one fails with its machine stuck in the pending state (nothing useful in the logs), and none of the deploys after that first pending one succeeds. Might be a different bug, but it's easy to verify by running that for loop.

So this particular error was not in my logs, but the controller still ends up unable to provision even 30 machines ...

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Good catch Uros, let's save ourselves another whole beta iteration by nailing this before b13 :)

Mark

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Hi Uros -
That sounds like something different - although still something we need to fix! I've created another bug for it. https://bugs.launchpad.net/juju-core/+bug/1602192

I've seen behaviour that sounds similar when I ran out of space in the underlying ZFS pool - is that happening in your case? Sorry if that's obvious, I just figured I'd check. Having tried it out, though, I'm amazed at how little of the pool each new node needs - so that's probably not it.

tags: added: oil-2.0
Revision history for this message
Uros Jovanovic (uros-jovanovic) wrote :

Hey, Chris. I didn't check at the time, so I redid the test with today's master tip, but unfortunately the ZFS pool was not even close to being full.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta12 → 2.0-beta13
Changed in juju-core:
milestone: 2.0-beta13 → none
status: In Progress → Fix Released
Revision history for this message
Adam Stokes (adam-stokes) wrote :

FYI, this problem still persists with beta12: https://bugs.launchpad.net/juju-core/+bug/1604644

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Guys, this is surely the top blocker before we can actually ask for
wider testing of 2.0.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

The outcome of bug #1604644 is that the mgo patch didn't make it into beta12. An updated version of that patch has landed now and will be in beta13.

The new version of the patch has logging that will let us verify that it's applied to the binary that's running, which was a big part of the difficulty in working out what was happening. It also has a fix for a bug that occurs when the retry limit is hit, although we haven't seen that happen anywhere - I only found it when I changed the retry limit to 0.

affects: juju-core → juju