Looking at the txns in 'db.txns' it seems there are 3 types of transactions that Preparing or Prepared:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count()
88195
> db.txns.find({"o.c": "units", "s": 2, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count()
189162
88k transactions that are Preparing that involve an action, and 189k that are Prepared that involve an action. However, that doesn't account for 595k transactions on that unit that are in Preparing state:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).count()
595887
Now that is trying to update the password for the unit. Which typically happens when the unit agent is starting up. Regardless it is unable to actually apply the transaction, because the queue is already full.
I don't know a way to do a search involving a sub-document that has a "$" in it, because unfortunately that is magic syntax as part of the mongo query language. However, we can approximate it with:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.u": {"$exists": 1} }).count()
507692
so if we have 507k updates to the document pending, 88k preparing actions and 189k prepared actions that amounts to: 785,049 documents.
1) cleanup txns.stash from all pending documents that are actions or action notifications
> db.txns.stash.find({"_id.c": "actions"}).count()
> db.txns.stash.find({"_id.c": "actionnotifications"}).count()
277356
> db.txns.stash.remove({"_id.c": "actions"})
> db.txns.stash.remove({"_id.c": "actionnotifications"})
WriteResult({ "nRemoved" : 277356 })
# note, I accidentally removed some of the 'actions' before stopping it, so I don't have the accurate counts there. I would expect it to be the same as "actionnotifications".
2) cleanup the txn-queue on the unit doc
> db.units.update({"_id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"$set": {"txn-queue": []}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
3) remove the Prepared and Preparing transactions from 'db.txns' that involve this document:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count()
88195
> db.txns.find({"o.c": "units", "s": 2, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count()
189162
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.u": {$exists: 1}}).count()
507692
# VERIFY the above numbers first
> db.txns.remove({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"})
WriteResult({ "nRemoved" : 88195 })
This finds a document that we shouldn't have deleted:
2018-10-16 09:27:27 ERROR failed stage resume: cannot find document {actions 6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334} for applying transaction 5ac24a1e5540e3051792857b_ded72eda
^- this was a transaction that was in progress that was trying to delete an actionnotification.
However, the action it was associated with is now missing. Maybe one of the other transactions was deleting it? Or trying to add it? Seems odd that it would be adding one that it was then trying to flag as completed, if it never succeeded in adding it.
One easy option is to roll this back to Preparing, which will cause it to naturally be aborted.
Looking at the txns in 'db.txns' it seems there are 3 types of transactions that Preparing or Prepared: find({" o.c": "units", "s": 1, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.c": "actions"}).count() find({" o.c": "units", "s": 2, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.c": "actions"}).count()
> db.txns.
88195
> db.txns.
189162
88k transactions that are Preparing that involve an action, and 189k that are Prepared that involve an action. However, that doesn't account for 595k transactions on that unit that are in Preparing state: find({" o.c": "units", "s": 1, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0"}).count( )
> db.txns.
595887
Poking around we find: find({" o.c": "units", "s": 1, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0"}).skip( 100).limit( 10).pretty( ) { "5b958bd85540e3 ba88519b81" ),
"c" : "units",
"d" : "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0",
"a" : {
"life" : {
"$ ne" : 2
}
},
"u" : {
"$set" : {
"passwordh ash" : <omitted>
}
}
> db.txns.
"_id" : ObjectId(
"s" : 1,
"o" : [
{
}
]
}
Now that is trying to update the password for the unit. Which typically happens when the unit agent is starting up. Regardless it is unable to actually apply the transaction, because the queue is already full.
I don't know a way to do a search involving a sub-document that has a "$" in it, because unfortunately that is magic syntax as part of the mongo query language. However, we can approximate it with: find({" o.c": "units", "s": 1, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.u": {"$exists": 1} }).count()
> db.txns.
507692
so if we have 507k updates to the document pending, 88k preparing actions and 189k prepared actions that amounts to: 785,049 documents.
> db.txns. find({" o.c": "units", "s": {$lt: 3}, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0"}).count( )
785049
> db.txns.count() 786349
So here is the cleanup I propose:
1) cleanup txns.stash from all pending documents that are actions or action notifications stash.find( {"_id.c" : "actions"}).count() stash.find( {"_id.c" : "actionnotifica tions"} ).count( ) stash.remove( {"_id.c" : "actions"}) stash.remove( {"_id.c" : "actionnotifica tions"} )
> db.txns.
> db.txns.
277356
> db.txns.
> db.txns.
WriteResult({ "nRemoved" : 277356 })
# note, I accidentally removed some of the 'actions' before stopping it, so I don't have the accurate counts there. I would expect it to be the same as "actionnotifica tions".
2) cleanup the txn-queue on the unit doc update( {"_id": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0"}, {"$set": {"txn-queue": []}})
> db.units.
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
3) remove the Prepared and Preparing transactions from 'db.txns' that involve this document: find({" o.c": "units", "s": 1, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.c": "actions"}).count() find({" o.c": "units", "s": 2, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.c": "actions"}).count() find({" o.c": "units", "s": 1, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.u": {$exists: 1}}).count()
> db.txns.
88195
> db.txns.
189162
> db.txns.
507692
# VERIFY the above numbers first remove( {"o.c": "units", "s": 1, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.c": "actions"})
> db.txns.
WriteResult({ "nRemoved" : 88195 })
> db.txns. remove( {"o.c": "units", "s": 2, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.c": "actions"})
WriteResult({ "nRemoved" : 189162 })
> db.txns. remove( {"o.c": "units", "s": 1, "o.d": "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 0", "o.u": {$exists: 1}})
WriteResult({ "nRemoved" : 507692 })
> db.txns.count() stash.count( )
1300
> db.txns.
0
4) run mgopurge.
This finds a document that we shouldn't have deleted: 0b48-45a3- 87fc-9646f8bd82 de:5c141543- 7e4b-4bbb- 8881-7458104cc3 34} for applying transaction 5ac24a1e5540e30 51792857b_ ded72eda
2018-10-16 09:27:27 ERROR failed stage resume: cannot find document {actions 6a783ac4-
Inspecting the transaction: find({" _id": ObjectId( "5ac24a1e5540e3 051792857b" )}).pretty( ) "5ac24a1e5540e3 051792857b" ),
"c" : "actions",
"d" : "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:5c141543- 7e4b-4bbb- 8881-7458104cc3 34",
"a" : {
"status" : {
"$ nin" : [
" completed" ,
" cancelled" ,
" failed"
]
}
},
"u" : {
"$set" : {
"status" : "completed",
"message" : "",
"results" : {
" Stdout" : "100.70.0.127\n",
" Code" : "0",
" Stderr" : ""
},
"completed " : ISODate( "2018-04- 02T15:19: 58Z")
}
}
"c" : "actionnotifica tions",
"d" : "6a783ac4- 0b48-45a3- 87fc-9646f8bd82 de:zookeeper/ 2_a_5c141543- 7e4b-4bbb- 8881-7458104cc3 34",
"r" : true
> db.txns.
{
"_id" : ObjectId(
"s" : 2,
"o" : [
{
},
{
}
],
"n" : "ded72eda"
}
^- this was a transaction that was in progress that was trying to delete an actionnotification.
However, the action it was associated with is now missing. Maybe one of the other transactions was deleting it? Or trying to add it? Seems odd that it would be adding one that it was then trying to flag as completed, if it never succeeded in adding it.
One easy option is to roll this back to Preparing, which will cause it to naturally be aborted.
> db.txns. update( {"_id": ObjectId( "5ac24a1e5540e3 051792857b" )}, {"$set": {"s": 1}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
And then running mgopurge again completes successfully.