While investigating the mongodump, I think I found the issue. There is a document in 'units' that has 272k transactions queued up for it:

> db.units.aggregate([{$match: {"txn-queue.100": {$exists: 1}}}, {$project: {"_id": 1, len: {$size: "$txn-queue"}}}])
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "len" : 272816 }

We've seen that in the past, but when we have, it was because of a transaction that kept being attempted and that involved the *same* documents. In this dump, however, sampling some of those 272k transactions shows that they involve *different* action documents:

> db.txns.stash.count()
554712

^- that means we have ~550k documents that are all referenced by the transactions being tried against the zookeeper unit.

The logic we had around "if you have 200k transactions, is it OK to just ignore all of them?" assumed that every document involved in those transactions would also have a very high number of transactions in its queue. For actions, though, it seems that each attempt to create an action uses a unique document id. That is why the graph is failing to load, and also why our logic is failing to cleanly handle this case.

I can think of ways we can manually clean this up, but I'll save that for another post. I'm leaving my debugging notes here in case they help other people follow how I got to this conclusion.

Looking at the mongodump, I came across:

> db.txns.count()
786352
> db.txns.find({"s": 2}).count()
189167
> db.txns.find({"s": 1}).count()
595945
> db.txns.find({"s": 3}).count()
0

So we have 786k transactions, 189k of them in 'Prepared', and 596k of them in 'Preparing'. Likely all of those are because of a backlog on the document that has a broken txn.

Since we suspected units:

> db.units.find({"txn-queue.100": {"$exists": 1}}, {"_id": 1})
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0" }

So there is a unit doc that has at least 100 transactions queued. How many?

> db.units.aggregate([{"$match": { "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0" }}, {"$project": {"_id": 1, len: {$size: "$txn-queue"}}}])
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "len" : 272816 }

272k transactions on that document. What are the first few?
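(Aside, not from the original session: a quick sanity check that zookeeper/0 is the only unit document with a pathological queue is to sort all unit docs by txn-queue length. The query below is just a sketch; the $ifNull guards against unit docs that have no txn-queue field at all.)

> db.units.aggregate([{$project: {len: {$size: {$ifNull: ["$txn-queue", []]}}}}, {$sort: {len: -1}}, {$limit: 5}])

Back to the first few entries in the queue: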
> db.units.find({ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"_id": 1, "txn-queue" : {$slice: 4}}).pretty()
{
    "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
    "txn-queue" : [
        "5b87d51c5540e3051751d249_0c5d64c0",
        "5b8907bb5540e370277cb5a9_69306aff",
        "5b8907bb5540e370277cb5b9_2753c79f",
        "5b8907bb5540e370277cb9ee_44c2dde3"
    ]
}

The first one is trying to run an action on zookeeper:

> db.txns.find({"_id": ObjectId("5b87d51c5540e3051751d249")}).pretty()
{
    "_id" : ObjectId("5b87d51c5540e3051751d249"),
    "s" : 2,
    "o" : [
        {
            "c" : "units",
            "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
            "a" : { "life" : { "$ne" : 2 } }
        },
        {
            "c" : "actions",
            "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:eb08e238-dfa8-4c46-842e-47a4eb929adf",
            "a" : "d-",
            "i" : {
                "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:eb08e238-dfa8-4c46-842e-47a4eb929adf",
                "model-uuid" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de",
                "receiver" : "zookeeper/0",
                "name" : "juju-run",
                "parameters" : {
                    "command" : "relation-get -r zkpeer:47 private-address zookeeper/0",
                    "timeout" : NumberLong("300000000000")
                },
                "enqueued" : ISODate("2018-08-30T11:29:33Z"),
                "started" : ISODate("0001-01-01T00:00:00Z"),
                "completed" : ISODate("0001-01-01T00:00:00Z"),
                "status" : "pending",
                "message" : "",
                "results" : { }
            }
        },
        {
            "c" : "actionnotifications",
            "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0_a_eb08e238-dfa8-4c46-842e-47a4eb929adf",
            "a" : "d-",
            "i" : {
                "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0_a_eb08e238-dfa8-4c46-842e-47a4eb929adf",
                "model-uuid" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de",
                "receiver" : "zookeeper/0",
                "actionid" : "eb08e238-dfa8-4c46-842e-47a4eb929adf"
            }
        }
    ],
    "n" : "63cdf932"
}

Both the 'actions' and 'actionnotifications' operations are insert requests, which means the associated documents are probably in the txns.stash collection, since they haven't yet been inserted into their target collections. Looking in there:

> db.txns.stash.find({"_id.c": "actions", "_id.id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:eb08e238-dfa8-4c46-842e-47a4eb929adf"}).pretty()

we see another really long txn-queue.

Now, mgopurge 1.6 had code to handle long transaction queues, but it didn't handle the case where one of the documents with a long txn-queue was in the stash. mgopurge 1.7 explicitly handles that problem. However, when trying to run it, I see it hit OOM and get killed as well. And looking at the docs, the ones in the stash actually only have a couple of hundred items in their queues:

> db.txns.stash.aggregate([{$match: {"txn-queue.100": {$exists: 1}}}, {$project: {"_id.id": 1, len: {$size: "$txn-queue"}}}])
{ "_id" : { "id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0_a_eb08e238-dfa8-4c46-842e-47a4eb929adf" }, "len" : 210 }
{ "_id" : { "id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:eb08e238-dfa8-4c46-842e-47a4eb929adf" }, "len" : 209 }

So one has 210 and the other has 209, but neither is anywhere close to the 272k that the units document has.
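(Another aside, not from the original notes: to check that this holds across the whole stash rather than just the two documents matched by "txn-queue.100", one could look at the queue-length distribution directly. Again, just a sketch; the $ifNull covers stash docs without a txn-queue field.)

> db.txns.stash.aggregate([{$project: {len: {$size: {$ifNull: ["$txn-queue", []]}}}}, {$group: {_id: null, docs: {$sum: 1}, maxLen: {$max: "$len"}, avgLen: {$avg: "$len"}}}])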
So let's try looking at some other transactions:

> db.units.find({ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"_id": 1, "txn-queue" : {$slice: [2000, 10]}}).pretty()
{
    "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
    "txn-queue" : [
        "5b8912cd5540e371d9212c06_3409559f",
        "5b8912cd5540e371d9212c07_281d9f4c",
        "5b8912d15540e371d92133f6_7bb4979c",
        "5b8912d15540e371d92133f7_94d1f373",
        "5b8912d15540e371d92133f8_af5b80d2",
        "5b8912d55540e371d9213bec_fb4fc874",
        "5b8912d55540e371d9213bed_45a843a7",
        "5b8912d55540e371d9213bee_d8e5a554",
        "5b8912d95540e371d92143ca_b3b75daa",
        "5b8912d95540e371d92143cb_373902c9"
    ]
}

> db.txns.find({_id: ObjectId("5b8912cd5540e371d9212c06")}).pretty()
{
    "_id" : ObjectId("5b8912cd5540e371d9212c06"),
    "s" : 2,
    "o" : [
        {
            "c" : "units",
            "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
            "a" : { "life" : { "$ne" : 2 } }
        },
        {
            "c" : "actions",
            "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:abc6f369-db63-46fa-8106-38158aca29c1",
            "a" : "d-",
            "i" : {
                "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:abc6f369-db63-46fa-8106-38158aca29c1",
                "model-uuid" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de",
                "receiver" : "zookeeper/0",
                "name" : "juju-run",
                "parameters" : {
                    "command" : "relation-get -r zkpeer:47 private-address zookeeper/1",
                    "timeout" : NumberLong("300000000000")
                },
                "enqueued" : ISODate("2018-08-31T10:05:01Z"),
                "started" : ISODate("0001-01-01T00:00:00Z"),
                "completed" : ISODate("0001-01-01T00:00:00Z"),
                "status" : "pending",
                "message" : "",
                "results" : { }
            }
        },
        {
            "c" : "actionnotifications",
            "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0_a_abc6f369-db63-46fa-8106-38158aca29c1",
            "a" : "d-",
            "i" : {
                "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0_a_abc6f369-db63-46fa-8106-38158aca29c1",
                "model-uuid" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de",
                "receiver" : "zookeeper/0",
                "actionid" : "abc6f369-db63-46fa-8106-38158aca29c1"
            }
        }
    ],
    "n" : "3409559f"
}

This is a similar change, and they all look roughly similar. Maybe this is the hint:

> db.txns.stash.count()
554712

So in this case, most of the changes to the original 'units' document each create a unique action document and a unique actionnotification document. So while *units* has ~272k items in its queue, most of the created action documents are referenced only a handful of times, and there are roughly 270k action and 270k actionnotification documents pending (which lines up with the ~554k documents in txns.stash). And loading *that* graph, involving those ~550k documents plus the one unit document, is what is causing the OOM.
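(One more aside, not from the original session: the claim that each queued transaction inserts its own action document can be spot-checked by sampling the unit's txn-queue. Each queue token has the form "<txn ObjectId>_<nonce>", as seen above, so stripping the suffix recovers the transaction id. The snippet below is only a sketch, and the variable names are mine; it can be pasted into the mongo shell against the same dump.)

var sample = db.units.findOne(
    { _id: "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0" },
    { "txn-queue": { $slice: [2000, 100] } }
)["txn-queue"];
var actionIds = {};
sample.forEach(function (token) {
    // token format: "<txn ObjectId hex>_<nonce>"
    var txn = db.txns.findOne({ _id: ObjectId(token.split("_")[0]) }, { o: 1 });
    if (!txn) return;  // skip tokens whose txn doc is missing
    txn.o.forEach(function (op) {
        if (op.c === "actions") { actionIds[op.d] = true; }
    });
});
print("sampled " + sample.length + " txns referencing " + Object.keys(actionIds).length + " distinct action docs");

If the second number comes out close to the first, that confirms the one-action-per-transaction pattern described above.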