Bug #1798485: txn-queue for $ID is "machines" has too many transactions (1001) (Canonical Juju, series 2.4)

David Ames (thedac) wrote on 2018-10-17:

#1

txn-queue-error.machine-log.tar.gz (59.5 KiB, application/x-tar)

David Ames (thedac) wrote on 2018-10-17:

#2

juju-backup-20181017-221641.tar.gz (80.0 MiB, application/x-tar)

John A Meinel (jameinel) on 2018-10-18

Changed in juju:
assignee:	nobody → John A Meinel (jameinel)
importance:	Undecided → High
milestone:	none → 2.5-beta1
status:	New → In Progress

John A Meinel (jameinel) wrote on 2018-10-18:

#3

Thank you tremendously for the dump. This is what we noted:

1) The machine document has 1000 transactions in its queue, which historically has been a sign that some transaction has gone wrong, and blocked us from making progress.

2) When we started investigating the transactions, we saw that they all looked to be in state "6", which means Completed, e.g.:
> db.machines.find({"_id": "7e89219c-4431-4980-8687-ab38e0397809:0"}, {"_id": 1, "txn-queue": {$slice: [0, 10]}}).pretty()
{
        "_id" : "7e89219c-4431-4980-8687-ab38e0397809:0",
        "txn-queue" : [
                "5bc7a5298073ce118e39fdcd_9a32a0c9",
                "5bc7a52a8073ce118e39fdcf_103296be",
                "5bc7a52a8073ce118e39fdd0_4e663da2",
                "5bc7a52a8073ce118e39fdd1_16ff06ec",
                "5bc7a5bd8073ce118e3a05e6_6e9c4147",
                "5bc7a5bd8073ce118e3a05e7_25da133b",
                "5bc7ae668073ce118e3a295f_1e6966ec",
                "5bc7ae668073ce118e3a2960_826da260",
                "5bc7ae668073ce118e3a2961_41276ce4",
                "5bc7ae678073ce118e3a2963_2206d401"
        ]
}
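
Each txn-queue entry above looks like the transaction's ObjectId in hex (24 characters), an underscore, and a nonce, which is why the second entry lines up with the ObjectId used in the next query. A minimal Go sketch of splitting such a token, assuming that layout holds for every entry; splitToken is a hypothetical helper written for illustration, not a function from mgo/txn:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitToken is a hypothetical helper. It splits a txn-queue entry into the
// 24-hex-character transaction id and the nonce after the underscore.
func splitToken(token string) (string, string, error) {
	i := strings.IndexByte(token, '_')
	if i != 24 {
		return "", "", errors.New("unexpected txn-queue token layout: " + token)
	}
	return token[:i], token[i+1:], nil
}

func main() {
	// Second entry of the txn-queue shown in the dump.
	id, nonce, err := splitToken("5bc7a52a8073ce118e39fdcf_103296be")
	if err != nil {
		panic(err)
	}
	fmt.Println("txns _id:", id)
	fmt.Println("nonce:   ", nonce)
}
```

The id half is what goes into ObjectId(...) when looking the transaction up in db.txns.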

> db.txns.find({"_id": ObjectId("5bc7a52a8073ce118e39fdcf")}).pretty()
{
        "_id" : ObjectId("5bc7a52a8073ce118e39fdcf"),
        "s" : 6,
        "o" : [
                {
                        "c" : "machines",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:0",
                        "a" : {
                                "life" : 0
                        }
                },
                {
                        "c" : "linklayerdevices",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:m#0#d#lo",
                        "a" : "d-",
...

3) We looked through several of them. Typically when there is a 'broken' transaction, it is the first transaction in the queue, and then all transactions after that are in state 2 (prepared).

However, we then noticed that the original transaction was an "assert only" transaction:
                {
                        "c" : "machines",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:0",
                        "a" : {
                                "life" : 0
                        }
                },

That says "ensure the machine is still alive before completing this transaction", but it doesn't actually modify the machine document at all.
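
To make "assert only" concrete, here is a small Go sketch that classifies an operation document by its keys. The "c", "d" and "a" keys are the ones visible in the dump; the "i", "u" and "r" keys for insert, update and remove are an assumption based on the field tags of mgo/txn's Op type, and isAssertOnly is a hypothetical helper, not library code:

```go
package main

import "fmt"

// op is a local stand-in for one entry of a transaction's "o" array.
type op map[string]interface{}

// isAssertOnly reports whether the operation only checks a condition ("a")
// and carries no insert ("i"), update ("u") or remove ("r") instruction.
func isAssertOnly(o op) bool {
	_, hasAssert := o["a"]
	_, hasInsert := o["i"]
	_, hasUpdate := o["u"]
	_, hasRemove := o["r"]
	return hasAssert && !hasInsert && !hasUpdate && !hasRemove
}

func main() {
	// The machine operation from the dump: assert life == 0, change nothing.
	machineOp := op{
		"c": "machines",
		"d": "7e89219c-4431-4980-8687-ab38e0397809:0",
		"a": map[string]interface{}{"life": 0},
	}
	// A made-up operation for contrast: same assertion plus an update.
	updatingOp := op{
		"c": "machines",
		"d": "7e89219c-4431-4980-8687-ab38e0397809:0",
		"a": map[string]interface{}{"life": 0},
		"u": map[string]interface{}{"$set": map[string]interface{}{"life": 1}},
	}
	fmt.Println(isAssertOnly(machineOp))  // true
	fmt.Println(isAssertOnly(updatingOp)) // false
}
```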

4) Running "mgopurge -stages prune" on the database ended with:
...
2018-10-18 07:57:22 DEBUG pruning completed: removed 1525 txns
2018-10-18 07:57:22 INFO  clean and prune cleaned 150 docs in 68 collections
  removed 1525 transactions and 250 stash documents

After running that you can see that the txn queue is empty:
> db.machines.find({"_id": "7e89219c-4431-4980-8687-ab38e0397809:0"}, {"_id": 1, "txn-queue": {$slice: [0, 10]}}).pretty()
{ "_id" : "7e89219c-4431-4980-8687-ab38e0397809:0", "txn-queue" : [ ] }

5) So three final takeaways:

a) There wasn't any database corruption. All of the transactions are in a happy state.
b) 'prune' is run automatically inside the Juju agent every hour.
The transactions involved span the following range (ObjectIds have an embedded timestamp):
ObjectId("5bc7a4ca8073ce118e39fd9f") => 2018-10-17T21:08:26.000Z
ObjectId("5bc7afa28073ce118e3a43f3") => 2018-10-17T21:54:42.000Z

So that is 1000 assert-only txns happening in < 1hr.

c) When we came up with the limit of 1000, we were aware of assertion-only transactions, but didn't think they occurred at a frequency high enough to cause problems. A very long txn-queue hurts performance, because every change you make has to check whether any of the other 1000 transactions still need to be applied or whether they are all completed.

Assertion-only operations in a transaction were taking a performance "shortcut": since they didn't have to update the document right now, they avoided updating the txn-queue, assuming that some future write would come along to clean it up. But we can change it so that if there are 'sufficient' transactions to be removed, we go ahead and do the write. I'm thinking a cap of either 10 or 100 would keep the extra writes from becoming a problem while still helping future updates.
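
The trade-off can be illustrated with a toy model. This is not mgo/txn code, just a counter simulation under simplifying assumptions: every transaction is assertion-only, each one adds one token to the queue, and all queued tokens are removable once the transaction is applied. simulate and its threshold parameter are hypothetical names:

```go
package main

import "fmt"

// simulate runs n assertion-only transactions against one document.
// threshold == 0 models the old behaviour (never write a cleanup);
// threshold > 0 writes a cleanup once that many tokens are removable.
// It returns the longest queue seen and the number of cleanup writes.
func simulate(n, threshold int) (maxLen, writes int) {
	queue := 0
	for i := 0; i < n; i++ {
		queue++ // the transaction enqueues its token on the document
		if queue > maxLen {
			maxLen = queue
		}
		// After applying, every token in the queue is removable.
		if threshold > 0 && queue >= threshold {
			queue = 0 // one write pulls all removable tokens
			writes++
		}
	}
	return maxLen, writes
}

func main() {
	for _, threshold := range []int{0, 10, 100} {
		maxLen, writes := simulate(1000, threshold)
		fmt.Printf("threshold=%d maxLen=%d cleanupWrites=%d\n", threshold, maxLen, writes)
	}
}
```

In this model, never cleaning up lets the queue reach 1000 entries, while a threshold of 10 caps it at 10 entries for the price of one extra write per 10 transactions.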

John A Meinel (jameinel) wrote on 2018-10-18:

#4

Possible patch against mgo/txn:
$ git diff flusher.go
diff --git a/txn/flusher.go b/txn/flusher.go
index c0fc36d..012933f 100644
--- a/txn/flusher.go
+++ b/txn/flusher.go
@@ -895,6 +895,14 @@ func (f *flusher) apply(t *transaction, pull map[bson.ObjectId]*transaction) err
                        }
                case op.Assert != nil:
                        // Pure assertion. No changes to apply.
+                       if len(pullAll) >= 20 {
+                               var d bson.D
+                               if d, err = addToDoc(d, "$pullAll", bson.D{{"txn-queue", pullAll}}); err != nil {
+                                       return err
+                               }
+                               chaos("update asserted document")
+                               err = c.Update(qdoc, d)
+                       }
                }

Needs testing, etc., but that should let docs that have accumulated lots of assertion-only transactions still be cleaned up as a natural process.

John A Meinel (jameinel) wrote on 2018-10-18:

#5

assertion-flush.diff (4.8 KiB, text/plain)

Attached is a patch along with tests that check the assertion behavior.

This defaults to asking for the queue to be no more than 10 long before we will flush a cleanup.
In testing this, there was a very noticeable effect from changing the default allowed length.

Running the test that does 600 transactions and varying the length of the queue we get:

queue length    time
           1    1.18s
           2    1.17s
           5    1.28s
          10    1.39s
          20    1.81s
          50    2.94s
         100    4.81s
           0   22.99s

So you *can* see that updating on every txn is a bit slower (though probably within the noise), and that the longer you let the queue get, the more it degrades performance.
Having a value of 10 gives us IMO a reasonable "don't force writing all the time, but do so when it will help future transactions not have to do as much work".

John A Meinel (jameinel) wrote on 2018-10-23:

#6

https://github.com/juju/juju/pull/9340 was merging into 2.3
https://github.com/juju/juju/pull/9359 is merging 2.3 into 2.4

John A Meinel (jameinel) on 2018-11-13

Changed in juju:
status:	In Progress → Fix Committed

Anastasia (anastasia-macmood) on 2019-03-22

Changed in juju:
status:	Fix Committed → Fix Released

txn-queue for $ID is "machines" has too many transactions (1001)

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	High	John A Meinel	Canonical Juju 2.5-beta1
2.3	Fix Released	High	John A Meinel	Canonical Juju 2.3.10
2.4	Fix Released	High	John A Meinel	Canonical Juju 2.4.5