txn-queue for $ID is "machines" has too many transactions (1001)
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
High
|
John A Meinel | ||
2.3 |
Fix Released
|
High
|
John A Meinel | ||
2.4 |
Fix Released
|
High
|
John A Meinel |
Bug Description
Using snap version 2.5-beta1+
The txn-queue has too many transaction error showed up.
$ juju status
Model Controller Cloud/Region Version SLA Timestamp Notes
thedac thedac-serverstack serverstack/
App Version Status Scale Charm Store Rev OS Charm version Notes
barbican 6.0.0 blocked 1 barbican jujucharms 53 ubuntu
barbican-softhsm 2.0.0 terminated 1 barbican-softhsm local 0 ubuntu 0004a77
keystone waiting 0 keystone jujucharms 386 ubuntu
percona-cluster waiting 0 percona-cluster jujucharms 312 ubuntu
rabbitmq-server waiting 0 rabbitmq-server jujucharms 310 ubuntu
Unit Workload Agent Machine Public address Ports Message
barbican/0* blocked idle 0 10.5.0.21 9311/tcp,9312/tcp Ready for do-release-upgrade and reboot. Set complete when finished.
barbican-
Machine State DNS Inst id Series AZ Message
0 started 10.5.0.21 24520ee4-
$ juju remove-machine 0 --force
removing machine 0 failed: failed to run transaction: []txn.Op{
{
C: "machines",
Id: "7e89219c-
Assert: bson.D{
{
},
},
},
Insert: nil,
Update: nil,
Remove: false,
},
{
C: "cleanups",
Id: "5bc7b6628073ce
Assert: nil,
Insert: &state.cleanupDoc{
DocID: "5bc7b6628073ce
Kind: "machine",
Prefix: "0",
Args: nil,
},
Update: nil,
Remove: false,
},
}: txn-queue for 7e89219c-
Attaching juju controller backup and machine logs.
Changed in juju: | |
assignee: | nobody → John A Meinel (jameinel) |
importance: | Undecided → High |
milestone: | none → 2.5-beta1 |
status: | New → In Progress |
Changed in juju: | |
status: | In Progress → Fix Committed |
Changed in juju: | |
status: | Fix Committed → Fix Released |
Thank you tremendously for the dump. This is what we noted:
1) The machine document has 1000 transactions in its queue, which historically has been a sign that some transaction has gone wrong, and blocked us from making progress.
2) When we started investigating the transactions, we saw that they all looked to be in state "6", which means Completed, eg: find({" _id": "7e89219c- 4431-4980- 8687-ab38e03978 09:0"}, {"_id": 1, "txn-queue": {$slice: [0, 10]}}).pretty() 4431-4980- 8687-ab38e03978 09:0",
"5bc7a5298073 ce118e39fdcd_ 9a32a0c9" ,
"5bc7a52a8073 ce118e39fdcf_ 103296be" ,
"5bc7a52a8073 ce118e39fdd0_ 4e663da2" ,
"5bc7a52a8073 ce118e39fdd1_ 16ff06ec" ,
"5bc7a5bd8073 ce118e3a05e6_ 6e9c4147" ,
"5bc7a5bd8073 ce118e3a05e7_ 25da133b" ,
"5bc7ae668073 ce118e3a295f_ 1e6966ec" ,
"5bc7ae668073 ce118e3a2960_ 826da260" ,
"5bc7ae668073 ce118e3a2961_ 41276ce4" ,
"5bc7ae678073 ce118e3a2963_ 2206d401"
> db.machines.
{
"_id" : "7e89219c-
"txn-queue" : [
]
}
> db.txns. find({" _id": ObjectId( "5bc7a52a8073ce 118e39fdcf" )}).pretty( ) "5bc7a52a8073ce 118e39fdcf" ),
"c" : "machines",
"d" : "7e89219c- 4431-4980- 8687-ab38e03978 09:0",
"a" : {
"life" : 0
}
"c" : "linklayerdevices",
"d" : "7e89219c- 4431-4980- 8687-ab38e03978 09:m#0# d#lo",
"a" : "d-",
{
"_id" : ObjectId(
"s" : 6,
"o" : [
{
},
{
...
3) We looked through several of them. Typically when there is a 'broken' transaction, it is the first transaction in the queue, and then all transactions after that are in state 2 (prepared).
However, we then noticed that the original transaction was an "assert only" transaction:
"c" : "machines",
"d" : "7e89219c- 4431-4980- 8687-ab38e03978 09:0",
"a" : {
"life" : 0
}
{
},
That says "ensure the machine is still alive before completing this transaction", but it doesn't actually modify the machine document at all.
4) Running: mgopurge -stages prune on the database ended with:
...
2018-10-18 07:57:22 DEBUG pruning completed: removed 1525 txns
2018-10-18 07:57:22 INFO clean and prune cleaned 150 docs in 68 collections
removed 1525 transactions and 250 stash documents
After running that you can see that the txn queue is empty: find({" _id": "7e89219c- 4431-4980- 8687-ab38e03978 09:0"}, {"_id": 1, "txn-queue": {$slice: [0, 10]}}).pretty() 4431-4980- 8687-ab38e03978 09:0", "txn-queue" : [ ] }
> db.machines.
{ "_id" : "7e89219c-
5) So three final takeaways:
a) there wasn't any database corruption. all of the transactions are in a happy state. "5bc7a4ca8073ce 118e39fd9f" ) =>...
b) 'prune' is run automatically inside the Juju agent every hour.
The transactions involve start at (object ids have an embedded timestamp):
ObjectId(