Looking at the txns in 'db.txns' it seems there are 3 types of transactions that Preparing or Prepared: > db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count() 88195 > db.txns.find({"o.c": "units", "s": 2, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count() 189162 88k transactions that are Preparing that involve an action, and 189k that are Prepared that involve an action. However, that doesn't account for 595k transactions on that unit that are in Preparing state: > db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).count() 595887 Poking around we find: > db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).skip(100).limit(10).pretty() { "_id" : ObjectId("5b958bd85540e3ba88519b81"), "s" : 1, "o" : [ { "c" : "units", "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "a" : { "life" : { "$ne" : 2 } }, "u" : { "$set" : { "passwordhash" : } } } ] } Now that is trying to update the password for the unit. Which typically happens when the unit agent is starting up. Regardless it is unable to actually apply the transaction, because the queue is already full. I don't know a way to do a search involving a sub-document that has a "$" in it, because unfortunately that is magic syntax as part of the mongo query language. However, we can approximate it with: > db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.u": {"$exists": 1} }).count() 507692 so if we have 507k updates to the document pending, 88k preparing actions and 189k prepared actions that amounts to: 785,049 documents. > db.txns.find({"o.c": "units", "s": {$lt: 3}, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).count() 785049 > db.txns.count() 786349 So here is the cleanup I propose: 1) cleanup txns.stash from all pending documents that are actions or action notifications > db.txns.stash.find({"_id.c": "actions"}).count() > db.txns.stash.find({"_id.c": "actionnotifications"}).count() 277356 > db.txns.stash.remove({"_id.c": "actions"}) > db.txns.stash.remove({"_id.c": "actionnotifications"}) WriteResult({ "nRemoved" : 277356 }) # note, I accidentally removed some of the 'actions' before stopping it, so I don't have the accurate counts there. I would expect it to be the same as "actionnotifications". 2) cleanup the txn-queue on the unit doc > db.units.update({"_id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"$set": {"txn-queue": []}}) WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 }) 3) remove the Prepared and Preparing transactions from 'db.txns' that involve this document: > db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count() 88195 > db.txns.find({"o.c": "units", "s": 2, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count() 189162 > db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.u": {$exists: 1}}).count() 507692 # VERIFY the above numbers first > db.txns.remove({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}) WriteResult({ "nRemoved" : 88195 }) > db.txns.remove({"o.c": "units", "s": 2, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}) WriteResult({ "nRemoved" : 189162 }) > db.txns.remove({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.u": {$exists: 1}}) WriteResult({ "nRemoved" : 507692 }) > db.txns.count() 1300 > db.txns.stash.count() 0 4) run mgopurge. This finds a document that we shouldn't have deleted: 2018-10-16 09:27:27 ERROR failed stage resume: cannot find document {actions 6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334} for applying transaction 5ac24a1e5540e3051792857b_ded72eda Inspecting the transaction: > db.txns.find({"_id": ObjectId("5ac24a1e5540e3051792857b")}).pretty() { "_id" : ObjectId("5ac24a1e5540e3051792857b"), "s" : 2, "o" : [ { "c" : "actions", "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334", "a" : { "status" : { "$nin" : [ "completed", "cancelled", "failed" ] } }, "u" : { "$set" : { "status" : "completed", "message" : "", "results" : { "Stdout" : "100.70.0.127\n", "Code" : "0", "Stderr" : "" }, "completed" : ISODate("2018-04-02T15:19:58Z") } } }, { "c" : "actionnotifications", "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/2_a_5c141543-7e4b-4bbb-8881-7458104cc334", "r" : true } ], "n" : "ded72eda" } ^- this was a transaction that was in progress that was trying to delete an actionnotification. However, the action it was associated with is now missing. Maybe one of the other transactions was deleting it? Or trying to add it? Seems odd that it would be adding one that it was then trying to flag as completed, if it never succeeded in adding it. One easy option is to roll this back to Preparing, which will cause it to naturally be aborted. > db.txns.update({"_id": ObjectId("5ac24a1e5540e3051792857b")}, {"$set": {"s": 1}}) WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 }) And then running mgopurge again completes successfully.