txn suddenly went missing in a txn queue

Bug #1776673 reported by Junien F
This bug affects 2 people
Affects          Status         Importance   Assigned to      Milestone
Canonical Juju   Fix Released   High         John A Meinel
2.3              Fix Released   High         John A Meinel
2.4              Fix Released   High         John A Meinel

Bug Description

Hi,

Last night, we suddenly got the following logs on our Juju 2.3.7 controller:

2018-06-12 01:45:23 ERROR juju.state cleanup.go:150 cleanup failed for units("ubuntu"): cannot destroy unit "ubuntu/37264": cannot find transaction 5b1f25665f5ce833efeff19b_0e03bf64 in queue for document {annotations 6cd2c273-658c-44c9-8cff-42f47eb586ea:a#ubuntu}

I have no idea how it happened. There was some load on the mongodb primary, presumably because we were in the process of backing it up (mongodump).

Also, this happened during an internal prune:

2018-06-12 01:43:50 INFO juju.txn prune.go:127 txns after last prune: 209061, txns now: 356448, pruning: too many new transactions

2018-06-12 01:49:19 INFO juju.txn prune.go:152 txn pruning complete after 5m29.68502757s. txns now: 168206, inspected 68 collections, 24978 docs (623 cleaned)
   removed 334 stash docs and 208935 txn docs

The txn was indeed the first in the queue for the unit document. However, it wasn't in the queue for the annotation document (which was already in txns.stash). jam advised me to try adding it, which I did, using:

db.txns.stash.update(
    {"_id.id": "6cd2c273-658c-44c9-8cff-42f47eb586ea:a#ubuntu"},
    {$push: {
        "txn-queue": {
            $each: ["5b1f25665f5ce833efeff19b_0e03bf64"],
            $position: 0
        }
    }}
)

(with the controllers down). After that I ran an mgopurge, and the problem went away.

John A Meinel (jameinel) wrote : Re: [Bug 1776673] [NEW] txn suddenly went missing in a txn queue

Note that another possibility would be to update the txn from status "2"
("prepared") back to status "1" ("preparing"), in which case the resumer should
try to add the txn id back to all the queues. I think you also need to delete
the "n" (nonce) field. So something like:

db.txns.update(
    {_id: ObjectId("5b1f25665f5ce833efeff19b")},
    {$set: {"s": 1}, $unset: {"n": 1}}
)

John A Meinel (jameinel) wrote :

This seems to be the bogus code. I'm pretty sure it isn't legal to delete
a document and then insert a new one with the same id in a single txn.

func newUpdateUnitResourceOps(unitID string, stored storedResource, progress *int64) []txn.Op {
    doc := newUnitResourceDoc(unitID, stored)
    doc.DownloadProgress = progress

    // TODO(ericsnow) Using "update" doesn't work right...
    return append([]txn.Op{{
        C:      resourcesC,
        Id:     doc.DocID,
        Assert: txn.DocExists,
        Remove: true,
    }}, newInsertUnitResourceOps(unitID, stored, progress)...)
}

Junien F (axino) wrote :

Note that the transaction in this bug didn't do a delete/insert: https://pastebin.canonical.com/p/xcH85Jjqm6/ - the target object wasn't a resource but an annotation.

John A Meinel (jameinel) wrote :

Good point, Junien. Though the remove+add is something we can certainly just fix.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
importance: Undecided → High
status: New → In Progress
John A Meinel (jameinel) wrote :

Code for fixing resource updates:
https://github.com/juju/juju/pull/8928

Changed in juju:
milestone: none → 2.5-beta1
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Committed
Junien F (axino) wrote :

This happened again yesterday on a 2.4.3 controller. I don't have the exact transaction, but it was one creating juju.ip.addresses documents. The txn was in state 2. The juju.ip.addresses documents did not exist in that collection; they were present only in juju.txns.stash, and 2 of them did not have the txn in their queue. The mgopurge snap failed to clear things up, and I had to add the txn at the beginning of the queue for the 2 documents, like we did here.
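
Something like the following sketch can be used to list the stash documents for that collection whose queues are missing the token (the token below is just the one from the original report, used as a placeholder):

// placeholder token (from the original report); substitute the token
// from the actual "cannot find transaction ..." error
db.txns.stash.find(
    {"_id.c": "ip.addresses",
     "txn-queue": {$ne: "5b1f25665f5ce833efeff19b_0e03bf64"}},
    {"txn-queue": 1}
).pretty()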

Changed in juju:
status: Fix Committed → Fix Released
Andrea Ieri (aieri) wrote :

For future travelers, here's the workaround.

Given the following log line as an example:

machine-0: 17:44:45 ERROR juju.worker.dependency "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction 5e4f3a531e90b87add39e39e_72cbd94e in queue for document {ip.addresses 69acf424-2796-47d9-8d34-659df740ffba:m#2#d#lo#ip#127.0.0.1}

1. connect to mongo and switch to the juju database (use juju)
2. ensure the txn exists:

juju:PRIMARY> db.txns.find({"n":"72cbd94e"}).pretty()

3. ensure that the txn token is indeed missing from the txn-queue of the corresponding stash document:

juju:PRIMARY> db.txns.stash.find({"_id.id":"69acf424-2796-47d9-8d34-659df740ffba:m#2#d#lo#ip#127.0.0.1"}).pretty()

4. fix it by pushing the txn token onto the front of that txn-queue:

db.txns.stash.update(
    {"_id.id": "69acf424-2796-47d9-8d34-659df740ffba:m#2#d#lo#ip#127.0.0.1"},
    {$push: {
        "txn-queue": {
            $each: ["5e4f3a531e90b87add39e39e_72cbd94e"],
            $position: 0
        }
    }}
)

There is no need to stop mongo or the juju controllers, nor to run mgopurge.
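
As an optional sanity check, re-running the query from step 3 with a projection should now show the token at the front of the document's txn-queue:

db.txns.stash.find(
    {"_id.id": "69acf424-2796-47d9-8d34-659df740ffba:m#2#d#lo#ip#127.0.0.1"},
    {"txn-queue": 1}
).pretty()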
