Thank you very much for the dump. This is what we found:

1) The machine document has 1000 transactions in its txn-queue. Historically, a queue that long has been a sign that some transaction has gone wrong and is blocking us from making progress.

2) When we started investigating the transactions, we saw that they all appeared to be in state "6", which means Completed, e.g.:
> db.machines.find({"_id": "7e89219c-4431-4980-8687-ab38e0397809:0"}, {"_id": 1, "txn-queue": {$slice: [0, 10]}}).pretty()
{
        "_id" : "7e89219c-4431-4980-8687-ab38e0397809:0",
        "txn-queue" : [
                "5bc7a5298073ce118e39fdcd_9a32a0c9",
                "5bc7a52a8073ce118e39fdcf_103296be",
                "5bc7a52a8073ce118e39fdd0_4e663da2",
                "5bc7a52a8073ce118e39fdd1_16ff06ec",
                "5bc7a5bd8073ce118e3a05e6_6e9c4147",
                "5bc7a5bd8073ce118e3a05e7_25da133b",
                "5bc7ae668073ce118e3a295f_1e6966ec",
                "5bc7ae668073ce118e3a2960_826da260",
                "5bc7ae668073ce118e3a2961_41276ce4",
                "5bc7ae678073ce118e3a2963_2206d401"
        ]
}

> db.txns.find({"_id": ObjectId("5bc7a52a8073ce118e39fdcf")}).pretty()
{
        "_id" : ObjectId("5bc7a52a8073ce118e39fdcf"),
        "s" : 6,
        "o" : [
                {
                        "c" : "machines",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:0",
                        "a" : {
                                "life" : 0
                        }
                },
                {
                        "c" : "linklayerdevices",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:m#0#d#lo",
                        "a" : "d-",
...

3) We looked through several of them. Typically when there is a 'broken' transaction, it is the first transaction in the queue, and then all transactions after that are in state 2 (prepared).

However, we then noticed that the original transaction was an "assert only" transaction:
                {
                        "c" : "machines",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:0",
                        "a" : {
                                "life" : 0
                        }
                },

That says "ensure the machine is still alive before completing this transaction", but it doesn't actually modify the machine document at all.
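
The "assert only" distinction can also be checked mechanically. Below is a minimal sketch in plain JavaScript, which should also paste into the mongo shell. It assumes the op field names used by the txn serialization: "a" (assert) is visible in the dump above, while "i", "u" and "r" (insert, update, remove) do not appear in the dump and are an assumption here. The helper names isAssertOnly and opsFor are hypothetical.

```javascript
// Hypothetical helper: true when an op only asserts and never writes.
// "a" appears in the dump above; "i", "u" and "r" are assumed field names
// for insert, update and remove.
function isAssertOnly(op) {
  const writes = op.i !== undefined || op.u !== undefined || op.r === true;
  return op.a !== undefined && !writes;
}

// Hypothetical helper: the ops of a txn document that touch one document.
function opsFor(txn, collection, docId) {
  return txn.o.filter((op) => op.c === collection && op.d === docId);
}

// The machine op from the transaction shown above (other ops omitted).
const txn = {
  s: 6,
  o: [
    { c: "machines", d: "7e89219c-4431-4980-8687-ab38e0397809:0", a: { life: 0 } },
  ],
};
const machineOps = opsFor(txn, "machines", "7e89219c-4431-4980-8687-ab38e0397809:0");
console.log(machineOps.every(isAssertOnly)); // true
```

Against a live database, the txn object would come from db.txns.find(...) as in step 2 rather than being written out by hand.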

4) Running "mgopurge -stages prune" on the database ended with:
...
2018-10-18 07:57:22 DEBUG pruning completed: removed 1525 txns
2018-10-18 07:57:22 INFO  clean and prune cleaned 150 docs in 68 collections
  removed 1525 transactions and 250 stash documents

After running that you can see that the txn queue is empty:
> db.machines.find({"_id": "7e89219c-4431-4980-8687-ab38e0397809:0"}, {"_id": 1, "txn-queue": {$slice: [0, 10]}}).pretty()
{ "_id" : "7e89219c-4431-4980-8687-ab38e0397809:0", "txn-queue" : [ ] }


5) So three final takeaways:

a) There wasn't any database corruption. All of the transactions are in a happy state.
b) 'prune' is run automatically inside the Juju agent every hour.
The transactions involved span the following window (ObjectIds have an embedded timestamp):
ObjectId("5bc7a4ca8073ce118e39fd9f") => 2018-10-17T21:08:26.000Z
ObjectId("5bc7afa28073ce118e3a43f3") => 2018-10-17T21:54:42.000Z

So that is 1000 assert-only txns happening in < 1hr (about 46 minutes), which means the queue can reach the limit between two hourly prune runs.
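
For reference, the timestamps above can be reproduced without a database. This is a small sketch in plain JavaScript that relies only on the ObjectId layout (the first 4 bytes are a Unix timestamp in seconds). The function name objectIdTime is made up for this example; in the mongo shell, ObjectId("...").getTimestamp() returns the same value.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId are a big-endian
// Unix timestamp in seconds.
function objectIdTime(hex) {
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const first = objectIdTime("5bc7a4ca8073ce118e39fd9f");
const last = objectIdTime("5bc7afa28073ce118e3a43f3");

console.log(first.toISOString()); // 2018-10-17T21:08:26.000Z
console.log(last.toISOString());  // 2018-10-17T21:54:42.000Z
console.log((last - first) / 60000); // elapsed minutes, about 46
```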

c) When we came up with the limit of 1000, we were aware of assertion-only transactions, but didn't think they occurred at a frequency high enough to cause problems. A very long txn-queue is going to hurt performance, because every change you make has to check whether any of the other 1000 transactions still need to be applied, or whether they are all completed.

Assertion-only operations in a transaction were taking a performance "shortcut": since they didn't have to update the document right now, they would avoid updating the txn-queue, assuming that some future write would come along to clean it up. But we can change it so that if there are 'sufficient' transactions to be removed, we go ahead and do the write. I'm thinking a cap of either 10 or 100 would keep the queue short without adding enough extra writes to be a problem, and it will help future updates.
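
To make the proposal concrete, here is a rough sketch of the decision in plain JavaScript. It is not the actual txn code: the function shouldCleanQueue, the isCompleted callback and the cap parameter are all hypothetical names for illustration. It only assumes that queue tokens have the form "<txn id>_<nonce>", as in the dump above.

```javascript
// Hypothetical sketch of the proposed rule, not the real implementation.
// queue:       txn-queue tokens of the form "<txn id>_<nonce>"
// isCompleted: callback reporting whether a txn id is finished (e.g. state 6)
// cap:         how many stale tokens to tolerate before forcing a write
function shouldCleanQueue(queue, isCompleted, cap) {
  let stale = 0;
  for (const token of queue) {
    const txnId = token.split("_")[0];
    if (isCompleted(txnId) && ++stale >= cap) {
      return true; // enough completed tokens: do the write now
    }
  }
  return false; // keep the shortcut: leave cleanup to a future write
}

const queue = [
  "5bc7a5298073ce118e39fdcd_9a32a0c9",
  "5bc7a52a8073ce118e39fdcf_103296be",
  "5bc7a52a8073ce118e39fdd0_4e663da2",
];
const allDone = () => true; // stand-in for a real state lookup
console.log(shouldCleanQueue(queue, allDone, 10)); // false, only 3 stale tokens
console.log(shouldCleanQueue(queue, allDone, 3));  // true
```

With a cap of 10 or 100, an assert-only op would still skip the write in the common case and only pay for one once the queue has visibly accumulated completed tokens.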