mgo-txn-resumer cannot resume transaction because missing document

Bug #1727679 reported by Junien F
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Triaged
Low
Unassigned

Bug Description

Hi,

On a controller running juju 2.2.4, these are the last lines of a controller log (see at the bottom).

The document is indeed gone from MongoDB :

juju:PRIMARY> db.leases.find({"_id":"d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#"}).count()
0

This has led (I believe) to very large txn queues in documents related to this model (which is apparently still running version 2.2.2) :
{ "_id" : "d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:clock#application-leadership#", "num" : 147466 }
{ "_id" : "d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#ceph-osd#", "num" : 84825 }
{ "_id" : "d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#ceph-mon#", "num" : 62633 }

Also, the txn-resumer isn't working, so there are currently 4.5M txns in the DB :
juju:PRIMARY> db.txns.count()
4550581

This is also, I think, creating large load spikes from times to times on the controller.

How can we fix this situation ?

Thanks

machine logs :
2017-10-26 10:18:24 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:25 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:26 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:27 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:27 ERROR juju.worker.dependency engine.go:546 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#} for applying transaction 59ef3f595f5ce87b42d1e774_3b03bbea
2017-10-26 10:18:28 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:29 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:30 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:31 ERROR juju.worker.dependency engine.go:546 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#} for applying transaction 59ef3f595f5ce87b42d1e774_3b03bbea
2017-10-26 10:18:31 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:32 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:33 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:34 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:34 ERROR juju.worker.dependency engine.go:546 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#} for applying transaction 59ef3f595f5ce87b42d1e774_3b03bbea
2017-10-26 10:18:35 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:36 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:37 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:38 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:38 ERROR juju.worker.dependency engine.go:546 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#} for applying transaction 59ef3f595f5ce87b42d1e774_3b03bbea
2017-10-26 10:18:39 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:40 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:41 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:42 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:43 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:43 ERROR juju.worker.dependency engine.go:546 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#} for applying transaction 59ef3f595f5ce87b42d1e774_3b03bbea
2017-10-26 10:18:44 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:45 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:46 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:47 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:48 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:49 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:50 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:50 ERROR juju.worker.dependency engine.go:546 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#} for applying transaction 59ef3f595f5ce87b42d1e774_3b03bbea
2017-10-26 10:18:51 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:52 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:53 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:54 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:55 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:55 ERROR juju.worker.dependency engine.go:546 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#} for applying transaction 59ef3f595f5ce87b42d1e774_3b03bbea
2017-10-26 10:18:56 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:57 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:58 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:18:59 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:19:00 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:19:01 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:19:02 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:19:03 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:19:03 ERROR juju.worker.dependency engine.go:546 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases d3ab9c43-dcbc-48b6-8efa-25fa800f73fb:lease#application-leadership#easyrsa#} for applying transaction 59ef3f595f5ce87b42d1e774_3b03bbea
2017-10-26 10:19:04 ERROR juju.worker runner.go:392 exited "txnlog": EOF
2017-10-26 10:19:05 ERROR juju.worker runner.go:392 exited "txnlog": EOF

Junien F (axino)
tags: added: canonical-is
Revision history for this message
Junien F (axino) wrote :

This got worked around by applying this change :
db.txns.update({_id: ObjectId("59ef3f595f5ce87b42d1e774")}, {$unset: {"n": 1}, $set: {"s": 1}})

And the, running a full mgopurge (with controllers stopped) to handle the big txn-queues.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1727679] Re: mgo-txn-resumer cannot resume transaction because missing document

That query moves the TXN from prepared (all docs should be ready to apply)
to preparing (create what needs to exist).

If we caught it earlier we probably wouldn't need the mgopurge

John
=:->

On Oct 27, 2017 11:21, "Junien Fridrick" <email address hidden> wrote:

> This got worked around by applying this change :
> db.txns.update({_id: ObjectId("59ef3f595f5ce87b42d1e774")}, {$unset:
> {"n": 1}, $set: {"s": 1}})
>
> And the, running a full mgopurge (with controllers stopped) to handle
> the big txn-queues.
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1727679
>
> Title:
> mgo-txn-resumer cannot resume transaction because missing document
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1727679/+subscriptions
>

Revision history for this message
John A Meinel (jameinel) wrote :

So the quick summary (IIRC), is that we had a transaction which had been flagged as "prepared", but didn't actually have an entry in the txn-queue for every document that was referenced by the transaction.
Note that some of the changes to mgo that should be in 2.3 will at least prevent this from growing inordinately large (as txns that would cause a queue to grow >1000 long should end up immediately flagged as aborted.)

Changed in juju:
importance: Undecided → Medium
status: New → Triaged
tags: added: cpe-onsite
tags: added: 4010
Michał Ajduk (majduk)
tags: removed: 4010
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.