[2.3.8] jujud exhausts resources

Bug #1797816 reported by Felipe Reyes
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
John A Meinel

Bug Description

We have an HA environment that was being affected by https://github.com/juju/mgopurge/issues/28 . To apply the suggested change, jujud was stopped and the query executed; when jujud was started again, we noticed after a few minutes (~10 mins) that the controller was slow to respond to queries from the client (e.g. juju status). We then tried to run mgopurge, which never completed the resume stage because it ended up allocating all the memory available in the server (256 GB). After some inspection we found that the txns collection had a high number of transactions in the "applied" state[0]; running mgopurge's prune stage with a large max-txns allowed us to reduce the txns collection from 12 GB[1] to just 300 MB[2].

After all this, jujud, when started, consumes all the available memory within about 2 to 3 minutes, along with a fairly high amount of CPU (~300%), and ends up being killed by the OOM killer.

[0] > db.txns.count();
32662679
> db.txns.find({ "s": 6 }).count();
31872010
[1] juju.txns: 12048562065 (4102180864) - full list: https://pastebin.canonical.com/p/VVqVBnJjc8/
[2] juju.txns: 390761578 (4084633600) - full list: https://pastebin.canonical.com/p/nBs4CCRNjC/
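
A quick way to see how the txns collection breaks down by state is an aggregation along these lines (a sketch; as far as I know, mgo/txn's state codes are 1=preparing, 2=prepared, 5=aborted and 6=applied, so the "s: 6" query above counts applied transactions, which is roughly what the prune stage cleans up):
> // group the txns collection by state and count each bucket
> db.txns.aggregate([{ "$group": { "_id": "$s", "count": { "$sum": 1 } } }])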

Tags: 4010 sts
Revision history for this message
Felipe Reyes (freyes) wrote :

This is a mongostat capture from when jujud is started until it starts using too many resources.

Revision history for this message
Felipe Reyes (freyes) wrote :

mongotop output for the same period of time

Revision history for this message
Felipe Reyes (freyes) wrote :
Revision history for this message
Richard Harding (rharding) wrote :

Can you please provide typical bug details, such as what version of Juju this is and what substrate it's on? You reference a bug around running mgopurge on a large collection; is this something you're also doing? Are you using the later, updated batching mgopurge tool?

Changed in juju:
status: New → Incomplete
Revision history for this message
John A Meinel (jameinel) wrote :

Typically it is not the number of txns that causes OOM during resume, but the size of one of the docs that has too many txns.

namely db.txns.find({s: 2}).count()

and there should be some document that has way too many txns (100k)

Can you check the version of mgopurge you're running (mgopurge --version)?

New ones are supposed to have a workaround for txn-queue being too long so this could be something new.
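
For reference, one way to look for such a document (a sketch, assuming the hot collection is units; the same aggregation works on any collection that carries a txn-queue field):
> // list docs whose txn-queue holds more than 100 entries, with the queue length
> db.units.aggregate([{ "$match": { "txn-queue.100": { "$exists": true } } }, { "$project": { "_id": 1, "len": { "$size": "$txn-queue" } } }])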

Revision history for this message
Tim Penhey (thumper) wrote :

Based on the mongostat results in the pastebin and the mongotop output, it doesn't appear to be a real mongo issue. More likely there is something in the database that is tickling mgopurge in a bad way so that it enters some form of infinite loop.

The juju controllers run some safe aspects of mgopurge on a regular basis, and it seems like they are hitting the same infinite loop.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
importance: Undecided → High
Revision history for this message
John A Meinel (jameinel) wrote :

Agreed. It also looks like the issue is fairly clearly in juju.units, since that is where we see all of the activity in mongotop.

Revision history for this message
Tilman Baumann (tilmanbaumann) wrote :

mgopurge is the current release version.
Juju is 2.3.3 running on xenial.

I can reconstruct this rough timeline:

At some point before the upgrade to 2.3.8, the complaints about this single transaction had already started.
Juju was at version 2.2.6.

September the 3rd.
The upgrade to 2.3.8 was done.
We were a little concerned about the big transaction backlog back then, but decided it should not interfere with the upgrade.

Intermediate time.
ERROR juju.worker.dependency engine.go:551 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction
5b87d51c5540e3051751d249_0c5d64c0 in queue for document {actions 6a783ac4-0b48-45a3-87fc-9646f8bd82de:eb08e238-dfa8-4c46-842e-47a4eb929adf}
happening constantly, every few seconds.

2018-09-09
The error about the single transaction turns into "document too large" (16777216 bytes is MongoDB's 16 MB per-document limit):
2018-09-09 20:50:05 ERROR juju.worker.dependency engine.go:551 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction
5b87d51c5540e3051751d249_0c5d64c0 in queue for document {actions 6a783ac4-0b48-45a3-87fc-9646f8bd82de:eb08e238-dfa8-4c46-842e-47a4eb929adf}
2018-09-09 20:50:09 ERROR juju.worker.dependency engine.go:551 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: Resulting document after update is larger than 16777216

October 12th
Controllers stopped.
db.txns.update({"_id": ObjectId("5b87d51c5540e3051751d249")}, {"$set": {"s": 1}, "$unset": {"n": 1}})
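// (presumably the workaround from the mgopurge issue linked in the description: reset the stuck txn to state 1, "preparing", and clear its nonce)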
Controllers started. Very heavy system load. Unreliable operations.

mgopurge. Noticed resume OOM.

mgopurge purge. Cleaned a lot of finished transactions.

mgopurge resume would still OOM.

jujud would still not work well.
2018-10-13 17:38:27 ERROR juju.worker.dependency engine.go:551 "log-pruner" manifold worker returned unexpected error: failed to prune logs by time: read tcp 100.107.2.44:48498->100.107.2.44:37017: i/o timeout
log db massively bloated.

juju logs db dropped.

jujud would start, log no errors, but very quickly run out of memory.

Revision history for this message
John A Meinel (jameinel) wrote :

While investigating the mongodump, I think I found the issue. The issue is that there is a document in 'units' that has 272k transactions queued up for it:

> db.units.aggregate([{$match: {"txn-queue.100": {$exists: 1}}}, {$project: {"_id": 1, len: {$size: "$txn-queue"}}}])
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "len" : 272816 }

We've seen that in the past, but when we've seen it, it was because of a transaction that kept being attempted, that involved the *same* documents. However, in this dump, sampling some of those 272k transactions, we find that they involve *different* action documents:

> db.txns.stash.count()
554712
^- that means that we have 500k docs that are all referenced in the transactions being tried against the zookeeper unit.

The logic that we had around "if you have 200k transactions, is it ok to just ignore all of them" used the idea that all documents in each transaction would also have a very high number of transactions in their queue. However, for actions, it seems that each attempt to create an action uses a unique document id. That is why the graph is failing to load, and also why our logic fails to cleanly handle this case.
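
The sampling mentioned above can be done along these lines (a sketch, not the exact commands used; each txn-queue entry has the form "<txn ObjectId>_<nonce>"):
> // take a few entries from the unit's txn-queue, strip the "_<nonce>" suffix,
> // and print the collection/doc pairs each queued transaction operates on
> var u = db.units.findOne({"_id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"txn-queue": {"$slice": 10}})
> u["txn-queue"].forEach(function (t) { var txn = db.txns.findOne({"_id": ObjectId(t.split("_")[0])}, {"o.c": 1, "o.d": 1}); if (txn) printjson(txn.o.map(function (op) { return op.c + " " + op.d; })); })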

I can think of ways that we can manually clean this up, but I'll save that for another post.

I'm leaving my debugging notes around, in case it helps other people follow how I got to this conclusion in the future.

Looking at the mongodump, I came across:
> db.txns.count()
786352
> db.txns.find({"s": 2}).count()
189167
> db.txns.find({"s": 1}).count()
595945
> db.txns.find({"s": 3}).count()
0

So we have 786k transactions; 189k of them are in 'Prepared', and 596k of them are in 'Preparing'.
Likely all of those are because of a backlog on the document that has a broken txn.

Since we suspected units:
> db.units.find({"txn-queue.100": {"$exists": 1}}, {"_id": 1})
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0" }

So there is a unit doc that has more than 100 queued transactions. How many?
> db.units.aggregate([{"$match": { "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0" }}, {"$project": {"_id": 1, len: {$size: "$txn-queue"}}}])
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "len" : 272816 }

272k transactions on that document.

What are the first few?
> db.units.find({ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"_id": 1, "txn-queue" : {$slice: 4}}).pretty()
{
        "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
        "txn-queue" : [
                "5b87d51c5540e3051751d249_0c5d64c0",
                "5b8907bb5540e370277cb5a9_69306aff",
                "5b8907bb5540e370277cb5b9_2753c79f",
                "5b8907bb5540e370277cb9ee_44c2dde3"
        ]
}

The first one is trying to run an action on zookeeper:
> db.txns.find({"_id": ObjectId("5b87d51c5540e3051751d249")}).pretty()
{
        "_id" : ObjectId("5b87d51c5540e3051751d249"),
        "s" : 2,
        "o" : [
                {
                        "c" : "units",
                        "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
                        "a" : {
                                "life" : {
                        ...

Revision history for this message
John A Meinel (jameinel) wrote :

Looking at the txns in 'db.txns', it seems there are 3 types of transactions that are Preparing or Prepared:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count()
88195
> db.txns.find({"o.c": "units", "s": 2, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count()
189162

88k transactions in Preparing that involve an action, and 189k in Prepared that involve an action. However, that doesn't account for the 595k transactions on that unit that are in the Preparing state:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).count()
595887

Poking around we find:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).skip(100).limit(10).pretty() {
        "_id" : ObjectId("5b958bd85540e3ba88519b81"),
        "s" : 1,
        "o" : [
                {
                        "c" : "units",
                        "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
                        "a" : {
                                "life" : {
                                        "$ne" : 2
                                }
                        },
                        "u" : {
                                "$set" : {
                                        "passwordhash" : <omitted>
                                }
                        }
                }
        ]
}

Now that is trying to update the password for the unit, which typically happens when the unit agent is starting up. Regardless, it is unable to actually apply the transaction, because the queue is already full.

I don't know a way to do a search involving a sub-document key that has a "$" in it, because unfortunately that is magic syntax in the mongo query language. However, we can approximate it with:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.u": {"$exists": 1} }).count()
507692

So if we have 507k pending updates to the document, 88k preparing actions, and 189k prepared actions, that amounts to 785,049 transactions.

> db.txns.find({"o.c": "units", "s": {$lt: 3}, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).count()
785049

> db.txns.count()
786349

So here is the cleanup I propose:

1) cleanup txns.stash from all pending documents that are actions or action notifications
> db.txns.stash.find({"_id.c": "actions"}).count()
> db.txns.stash.find({"_id.c": "actionnotifications"}).count()
277356
> db.txns.stash.remove({"_id.c": "actions"})
> db.txns.stash.remove({"_id.c": "actionnotifications"})
WriteResult({ "nRemoved" : 277356 })

# note, I accidentally removed some of the 'actions' before stopping it, so I don't have the accurate counts there. I would expect it to be the same as "actionnotifications".

2) cleanup the txn-queue on the unit doc
> db.units.update({"_id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"$set": {"txn-queue": []}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModi...


Revision history for this message
John A Meinel (jameinel) wrote :

I should make a few other comments

1) New versions of Juju have a patch that should prevent ever getting a 200k transaction queue (if adding a transaction would make the queue >1000, we reject the txn as invalid). That should keep the txn queue from growing large, which was what blocked us from automatically pruning the table and keeping the size under control.

2) We are currently looking at plans for how we can change the apply step so that it can note invalid transactions and abort them. That also helps avoid getting into this situation in the first place.

3) I don't think that we will add something to mgopurge explicitly for this use case, as we should mostly be able to avoid it in the future.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1797816] Re: [2.3.8] jujud exhausts resources

Ok, if the txn was broken in 2.2.6, that would explain why we're seeing the large document in 2.3; the fixes that we have around not letting the queue grow too large weren't in 2.2.

Revision history for this message
Tilman Baumann (tilmanbaumann) wrote :

Thanks John. Sounds good, I will give that a try.

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

Hello, John.

We are trying this on an 'isolated' mongodb, where we imported the dump Felipe pointed to earlier in this bug. After applying all the .update and .remove commands you suggested, mgopurge now breaks with a different message:

2018-10-17 09:06:49 ERROR failed stage resume: cannot find document {actions 6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334} for applying transaction 5ac24a1e5540e3051792857b_ded72eda

Indeed, I cannot find that document in the database. Is it safe to just remove this transaction (5ac24a1e5540e3051792857b_ded72eda)? Also, if we run into more of these 'orphaned' transactions, can we safely remove them?

Here is the full output of the commands run just after the import of the dump: https://paste.ubuntu.com/p/GP4ksw7xjK/
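
One way to sanity-check an orphan like that before removing anything (a sketch, not advice from this thread; the ids come from the error above) is to look at the transaction itself and confirm the document it references is missing from both the collection and the stash:
> // inspect the txn, then look for the action doc it references
> db.txns.findOne({"_id": ObjectId("5ac24a1e5540e3051792857b")})
> db.actions.findOne({"_id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334"})
> db.txns.stash.findOne({"_id.c": "actions", "_id.id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334"})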

Revision history for this message
Tilman Baumann (tilmanbaumann) wrote :

We successfully repaired the database with John's procedure.
Thanks John, that was very helpful.

The document Mario commented about was already covered in John's comment.

Service restored. Bug can be closed. Thanks

John A Meinel (jameinel)
Changed in juju:
status: Incomplete → Fix Released