txn-queue for $ID is "machines" has too many transactions (1001)

Bug #1798485 reported by David Ames
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
John A Meinel
2.3
Fix Released
High
John A Meinel
2.4
Fix Released
High
John A Meinel

Bug Description

Using snap version 2.5-beta1+develop-dda4cbc

The txn-queue has too many transaction error showed up.

$ juju status
Model Controller Cloud/Region Version SLA Timestamp Notes
thedac thedac-serverstack serverstack/serverstack 2.5-beta1 unsupported 15:25:28-07:00 attempt 4 to destroy model failed (will retry): model not empty, found 1 machine, 5 applications (model not empty)

App Version Status Scale Charm Store Rev OS Charm version Notes
barbican 6.0.0 blocked 1 barbican jujucharms 53 ubuntu
barbican-softhsm 2.0.0 terminated 1 barbican-softhsm local 0 ubuntu 0004a77
keystone waiting 0 keystone jujucharms 386 ubuntu
percona-cluster waiting 0 percona-cluster jujucharms 312 ubuntu
rabbitmq-server waiting 0 rabbitmq-server jujucharms 310 ubuntu

Unit Workload Agent Machine Public address Ports Message
barbican/0* blocked idle 0 10.5.0.21 9311/tcp,9312/tcp Ready for do-release-upgrade and reboot. Set complete when finished.
  barbican-softhsm/0 terminated executing 10.5.0.21 (stop)

Machine State DNS Inst id Series AZ Message
0 started 10.5.0.21 24520ee4-ae4b-4669-b795-d8f74e857390 xenial nova ACTIVE

$ juju remove-machine 0 --force
removing machine 0 failed: failed to run transaction: []txn.Op{
    {
        C: "machines",
        Id: "7e89219c-4431-4980-8687-ab38e0397809:0",
        Assert: bson.D{
            {
                Name: "jobs",
                Value: bson.D{
                    {
                        Name: "$nin",
                        Value: []state.MachineJob{2},
                    },
                },
            },
        },
        Insert: nil,
        Update: nil,
        Remove: false,
    },
    {
        C: "cleanups",
        Id: "5bc7b6628073ce118e3a530f",
        Assert: nil,
        Insert: &state.cleanupDoc{
            DocID: "5bc7b6628073ce118e3a530f",
            Kind: "machine",
            Prefix: "0",
            Args: nil,
        },
        Update: nil,
        Remove: false,
    },
}: txn-queue for 7e89219c-4431-4980-8687-ab38e0397809:0 in "machines" has too many transactions (1001)

Attaching juju controller backup and machine logs.

Revision history for this message
David Ames (thedac) wrote :
Revision history for this message
David Ames (thedac) wrote :
John A Meinel (jameinel)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
importance: Undecided → High
milestone: none → 2.5-beta1
status: New → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (4.1 KiB)

Thank you tremendously for the dump. This is what we noted:

1) The machine document has 1000 transactions in its queue, which historically has been a sign that some transaction has gone wrong, and blocked us from making progress.

2) When we started investigating the transactions, we saw that they all looked to be in state "6", which means Completed, eg:
> db.machines.find({"_id": "7e89219c-4431-4980-8687-ab38e0397809:0"}, {"_id": 1, "txn-queue": {$slice: [0, 10]}}).pretty()
{
        "_id" : "7e89219c-4431-4980-8687-ab38e0397809:0",
        "txn-queue" : [
                "5bc7a5298073ce118e39fdcd_9a32a0c9",
                "5bc7a52a8073ce118e39fdcf_103296be",
                "5bc7a52a8073ce118e39fdd0_4e663da2",
                "5bc7a52a8073ce118e39fdd1_16ff06ec",
                "5bc7a5bd8073ce118e3a05e6_6e9c4147",
                "5bc7a5bd8073ce118e3a05e7_25da133b",
                "5bc7ae668073ce118e3a295f_1e6966ec",
                "5bc7ae668073ce118e3a2960_826da260",
                "5bc7ae668073ce118e3a2961_41276ce4",
                "5bc7ae678073ce118e3a2963_2206d401"
        ]
}

> db.txns.find({"_id": ObjectId("5bc7a52a8073ce118e39fdcf")}).pretty()
{
        "_id" : ObjectId("5bc7a52a8073ce118e39fdcf"),
        "s" : 6,
        "o" : [
                {
                        "c" : "machines",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:0",
                        "a" : {
                                "life" : 0
                        }
                },
                {
                        "c" : "linklayerdevices",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:m#0#d#lo",
                        "a" : "d-",
...

3) We looked through several of them. Typically when there is a 'broken' transaction, it is the first transaction in the queue, and then all transactions after that are in state 2 (prepared).

However, we then noticed that the original transaction was an "assert only" transaction:
                {
                        "c" : "machines",
                        "d" : "7e89219c-4431-4980-8687-ab38e0397809:0",
                        "a" : {
                                "life" : 0
                        }
                },

That says "ensure the machine is still alive before completing this transaction", but it doesn't actually modify the machine document at all.

4) Running: mgopurge -stages prune on the database ended with:
...
2018-10-18 07:57:22 DEBUG pruning completed: removed 1525 txns
2018-10-18 07:57:22 INFO clean and prune cleaned 150 docs in 68 collections
  removed 1525 transactions and 250 stash documents

After running that you can see that the txn queue is empty:
> db.machines.find({"_id": "7e89219c-4431-4980-8687-ab38e0397809:0"}, {"_id": 1, "txn-queue": {$slice: [0, 10]}}).pretty()
{ "_id" : "7e89219c-4431-4980-8687-ab38e0397809:0", "txn-queue" : [ ] }

5) So three final takeaways:

a) there wasn't any database corruption. all of the transactions are in a happy state.
b) 'prune' is run automatically inside the Juju agent every hour.
The transactions involve start at (object ids have an embedded timestamp):
ObjectId("5bc7a4ca8073ce118e39fd9f") =>...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

Possible patch against mgo/txn:
$ git diff flusher.go
diff --git a/txn/flusher.go b/txn/flusher.go
index c0fc36d..012933f 100644
--- a/txn/flusher.go
+++ b/txn/flusher.go
@@ -895,6 +895,14 @@ func (f *flusher) apply(t *transaction, pull map[bson.ObjectId]*transaction) err
                        }
                case op.Assert != nil:
                        // Pure assertion. No changes to apply.
+ if len(pullAll) >= 20 {
+ var d bson.D
+ if d, err = addToDoc(d, "$pullAll", bson.D{{"txn-queue", pullAll}}); err != nil {
+ return err
+ }
+ chaos("update asserted document")
+ err = c.Update(qdoc, d)
+ }
                }

Needs testing, etc. But that should cleanup docs that have lots of only assertions on them still be able to be cleaned up as a natural process.

Revision history for this message
John A Meinel (jameinel) wrote :

Attached is a patch along with tests that check the assertion behavior.

This defaults to asking for the queue to be no more than 10 long before we will flush a cleanup.
In testing this, there was a very noticeable affect of changing the default allowed length.

Running the test that does 600 transactions and varying the length of the queue we get:

  1 1.18s
  2 1.17s
  5 1.28s
 10 1.39s
 20 1.81s
 50 2.94s
100 4.81s
  0 22.99s

So you *can* see that updating on every txn is a bit slower (though probably within the noise), and that the longer you let the queue get, the more it degrades performance.
Having a value of 10 gives us IMO a reasonable "don't force writing all the time, but do so when it will help future transactions not have to do as much work".

Revision history for this message
John A Meinel (jameinel) wrote :
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.