[2.3.8] jujud exhausts resources

Bug #1797816 reported by Felipe Reyes
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
John A Meinel

Bug Description

We have an HA environment that was being affected by https://github.com/juju/mgopurge/issues/28 . To apply the suggested change, jujud was stopped and the query executed; when jujud was started again, we noticed after a few minutes (~10 mins) that the controller was slow to respond to queries from the client (e.g. juju status). We then tried to run mgopurge, which never completed the resume stage because it ended up allocating all the memory available in the server (256 GB). After some inspection we found that the txns collection had a high number of transactions in the "applied" state[0]; running mgopurge's prune stage with a large max-txns allowed us to reduce the txns collection from 12 GB[1] to just 300 MB[2].

After all this, jujud, when started, consumes all the available memory within about 2 to 3 minutes, along with a fairly high amount of CPU (~300%), and ends up being killed by the OOM killer.

[0] > db.txns.count();
32662679
> db.txns.find({ "s": 6 }).count();
31872010
[1] juju.txns: 12048562065 (4102180864) - full list: https://pastebin.canonical.com/p/VVqVBnJjc8/
[2] juju.txns: 390761578 (4084633600) - full list: https://pastebin.canonical.com/p/nBs4CCRNjC/
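
A quick way to see how the txns collection breaks down by state is an aggregation along these lines (a sketch; as far as I know, mgo/txn's state codes are 1=preparing, 2=prepared, 5=aborted and 6=applied, so the "s: 6" query above counts applied transactions, which is roughly what the prune stage cleans up):
> // group the txns collection by state and count each bucket
> db.txns.aggregate([{ "$group": { "_id": "$s", "count": { "$sum": 1 } } }])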

Tags: 4010 sts
Revision history for this message
Felipe Reyes (freyes) wrote :

This is a mongostat capture from when jujud is started until it starts using too many resources.

Revision history for this message
Felipe Reyes (freyes) wrote :

mongotop output for the same period of time

Revision history for this message
Felipe Reyes (freyes) wrote :
Revision history for this message
Richard Harding (rharding) wrote :

Can you please provide typical bug details, such as what version of Juju this is and what substrate it's on? You reference a bug around running mgopurge on a large collection; is this something you're also doing? Are you using the later, updated batching mgopurge tool?

Changed in juju:
status: New → Incomplete
Revision history for this message
John A Meinel (jameinel) wrote :

Typically it is not the number of txns that causes OOM during resume, but the size of one of the docs that has too many txns.

namely db.txns.find({s: 2}).count()

and there should be some document that has way too many txns (100k)

Can you check the version of mgopurge you're running (mgopurge --version)?

New ones are supposed to have a workaround for txn-queue being too long so this could be something new.
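
For reference, one way to look for such a document (a sketch, assuming the hot collection is units; the same aggregation works on any collection that carries a txn-queue field):
> // list docs whose txn-queue holds more than 100 entries, with the queue length
> db.units.aggregate([{ "$match": { "txn-queue.100": { "$exists": true } } }, { "$project": { "_id": 1, "len": { "$size": "$txn-queue" } } }])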

Revision history for this message
Tim Penhey (thumper) wrote :

Based on the mongostat results in the pastebin and the mongotop output, it doesn't appear to be a real mongo issue. More likely there is something in the database that is tickling mgopurge in a bad way so that it enters some form of infinite loop.

The juju controllers run some safe aspects of mgopurge on a regular basis, and it seems like they are hitting the same infinite loop.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
importance: Undecided → High
Revision history for this message
John A Meinel (jameinel) wrote :

Agreed. It also looks like the issue is fairly clearly in juju.units, since that is where we see all of the activity in mongotop.

Revision history for this message
Tilman Baumann (tilmanbaumann) wrote :

mgopurge is the current release version.
Juju is 2.3.3 running on xenial.

I can reconstruct this rough timeline:

At some point before the upgrade to 2.3.8, the complaints about this single transaction had already started.
Juju was at version 2.2.6.

September the 3rd.
The upgrade to 2.3.8 was done.
We were a little concerned about the big transaction backlog back then, but decided it should not interfere with the upgrade.

Intermediate time.
ERROR juju.worker.dependency engine.go:551 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction
5b87d51c5540e3051751d249_0c5d64c0 in queue for document {actions 6a783ac4-0b48-45a3-87fc-9646f8bd82de:eb08e238-dfa8-4c46-842e-47a4eb929adf}
happening constantly, every few seconds.

2018-09-09
The error about the single transaction turns into "document too large" (16777216 bytes is MongoDB's 16 MB per-document limit):
2018-09-09 20:50:05 ERROR juju.worker.dependency engine.go:551 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction
5b87d51c5540e3051751d249_0c5d64c0 in queue for document {actions 6a783ac4-0b48-45a3-87fc-9646f8bd82de:eb08e238-dfa8-4c46-842e-47a4eb929adf}
2018-09-09 20:50:09 ERROR juju.worker.dependency engine.go:551 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: Resulting document after update is larger than 16777216

October 12th
Controllers stopped.
db.txns.update({"_id": ObjectId("5b87d51c5540e3051751d249")}, {"$set": {"s": 1}, "$unset": {"n": 1}})
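// (presumably the workaround from the mgopurge issue linked in the description: reset the stuck txn to state 1, "preparing", and clear its nonce)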
Controllers started. Very heavy system load. Unreliable operations.

mgopurge. Noticed resume OOM.

mgopurge purge. Cleaned a lot of finished transactions.

mgopurge resume would still OOM.

jujud would still not work well.
2018-10-13 17:38:27 ERROR juju.worker.dependency engine.go:551 "log-pruner" manifold worker returned unexpected error: failed to prune logs by time: read tcp 100.107.2.44:48498->100.107.2.44:37017: i/o timeout
log db massively bloated.

juju logs db dropped.

jujud would start, log no errors, but very quickly run out of memory.

Revision history for this message
John A Meinel (jameinel) wrote :

While investigating the mongodump, I think I found the issue. The issue is that there is a document in 'units' that has 272k transactions queued up for it:

> db.units.aggregate([{$match: {"txn-queue.100": {$exists: 1}}}, {$project: {"_id": 1, len: {$size: "$txn-queue"}}}])
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "len" : 272816 }

We've seen that in the past, but when we've seen it, it was because of a transaction that kept being attempted, that involved the *same* documents. However, in this dump, sampling some of those 272k transactions, we find that they involve *different* action documents:

> db.txns.stash.count()
554712
^- that means that we have 500k docs that are all referenced in the transactions being tried against the zookeeper unit.

The logic that we had around "if you have 200k transactions, is it ok to just ignore all of them" used the idea that all documents in each transaction would also have a very high number of transactions in their queue. However, for actions, it seems that each attempt to create an action uses a unique document id. That is why the graph is failing to load, and also why our logic fails to cleanly handle this case.
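
The sampling mentioned above can be done along these lines (a sketch, not the exact commands used; each txn-queue entry has the form "<txn ObjectId>_<nonce>"):
> // take a few entries from the unit's txn-queue, strip the "_<nonce>" suffix,
> // and print the collection/doc pairs each queued transaction operates on
> var u = db.units.findOne({"_id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"txn-queue": {"$slice": 10}})
> u["txn-queue"].forEach(function (t) { var txn = db.txns.findOne({"_id": ObjectId(t.split("_")[0])}, {"o.c": 1, "o.d": 1}); if (txn) printjson(txn.o.map(function (op) { return op.c + " " + op.d; })); })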

I can think of ways that we can manually clean this up, but I'll save that for another post.

I'm leaving my debugging notes around, in case it helps other people follow how I got to this conclusion in the future.

Looking at the mongodump, I came across:
> db.txns.count()
786352
> db.txns.find({"s": 2}).count()
189167
> db.txns.find({"s": 1}).count()
595945
> db.txns.find({"s": 3}).count()
0

So we have 786k transactions; 189k of them are in 'Prepared', and 596k of them are in 'Preparing'.
Likely all of those are because of a backlog on the document that has a broken txn.

Since we suspected units:
> db.units.find({"txn-queue.100": {"$exists": 1}}, {"_id": 1})
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0" }

So there is a unit doc that has more than 100 queued transactions. How many?
> db.units.aggregate([{"$match": { "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0" }}, {"$project": {"_id": 1, len: {$size: "$txn-queue"}}}])
{ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "len" : 272816 }

272k transactions on that document.

What are the first few?
> db.units.find({ "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"_id": 1, "txn-queue" : {$slice: 4}}).pretty()
{
        "_id" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
        "txn-queue" : [
                "5b87d51c5540e3051751d249_0c5d64c0",
                "5b8907bb5540e370277cb5a9_69306aff",
                "5b8907bb5540e370277cb5b9_2753c79f",
                "5b8907bb5540e370277cb9ee_44c2dde3"
        ]
}

The first one is trying to run an action on zookeeper:
> db.txns.find({"_id": ObjectId("5b87d51c5540e3051751d249")}).pretty()
{
        "_id" : ObjectId("5b87d51c5540e3051751d249"),
        "s" : 2,
        "o" : [
                {
                        "c" : "units",
                        "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
                        "a" : {
                                "life" : {
                        ...

Revision history for this message
John A Meinel (jameinel) wrote :

Looking at the txns in 'db.txns', it seems there are 3 types of transactions that are Preparing or Prepared:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count()
88195
> db.txns.find({"o.c": "units", "s": 2, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.c": "actions"}).count()
189162

88k transactions in Preparing that involve an action, and 189k in Prepared that involve an action. However, that doesn't account for the 595k transactions on that unit that are in the Preparing state:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).count()
595887

Poking around we find:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).skip(100).limit(10).pretty() {
        "_id" : ObjectId("5b958bd85540e3ba88519b81"),
        "s" : 1,
        "o" : [
                {
                        "c" : "units",
                        "d" : "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0",
                        "a" : {
                                "life" : {
                                        "$ne" : 2
                                }
                        },
                        "u" : {
                                "$set" : {
                                        "passwordhash" : <omitted>
                                }
                        }
                }
        ]
}

Now that is trying to update the password for the unit, which typically happens when the unit agent is starting up. Regardless, it is unable to actually apply the transaction, because the queue is already full.

I don't know a way to do a search involving a sub-document key that has a "$" in it, because unfortunately that is magic syntax in the mongo query language. However, we can approximate it with:
> db.txns.find({"o.c": "units", "s": 1, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0", "o.u": {"$exists": 1} }).count()
507692

So if we have 507k pending updates to the document, 88k preparing actions, and 189k prepared actions, that amounts to 785,049 transactions.

> db.txns.find({"o.c": "units", "s": {$lt: 3}, "o.d": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}).count()
785049

> db.txns.count()
786349

So here is the cleanup I propose:

1) cleanup txns.stash from all pending documents that are actions or action notifications
> db.txns.stash.find({"_id.c": "actions"}).count()
> db.txns.stash.find({"_id.c": "actionnotifications"}).count()
277356
> db.txns.stash.remove({"_id.c": "actions"})
> db.txns.stash.remove({"_id.c": "actionnotifications"})
WriteResult({ "nRemoved" : 277356 })

# note, I accidentally removed some of the 'actions' before stopping it, so I don't have the accurate counts there. I would expect it to be the same as "actionnotifications".

2) cleanup the txn-queue on the unit doc
> db.units.update({"_id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:zookeeper/0"}, {"$set": {"txn-queue": []}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModi...


Revision history for this message
John A Meinel (jameinel) wrote :

I should make a few other comments

1) New versions of Juju have a patch that should prevent ever getting a 200k transaction queue (if adding a transaction would make the queue >1000, we reject the txn as invalid). That should keep the txn queue from growing large, which was what blocked us from automatically pruning the table and keeping the size under control.

2) We are currently looking at plans for how we can change the apply step so that it can note invalid transactions and abort them. That also helps avoid getting into this situation in the first place.

3) I don't think that we will add something to mgopurge explicitly for this use case, as we should mostly be able to avoid it in the future.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1797816] Re: [2.3.8] jujud exhausts resources

Ok, if the txn was broken in 2.2.6, that would explain why we're seeing the large document in 2.3; the fixes that we have around not letting the queue grow too large weren't in 2.2.

Revision history for this message
Tilman Baumann (tilmanbaumann) wrote :

Thanks John. Sounds good, I will give that a try.

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

Hello, John.

We are trying this on an 'isolated' mongodb, where we imported the dump Felipe pointed to earlier in this bug. After applying all the .update and .remove commands you suggested, mgopurge now breaks with a different message:

2018-10-17 09:06:49 ERROR failed stage resume: cannot find document {actions 6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334} for applying transaction 5ac24a1e5540e3051792857b_ded72eda

Indeed, I cannot find that document in the database. Is it safe to just remove this transaction (5ac24a1e5540e3051792857b_ded72eda)? Also, if we run into more of these 'orphaned' transactions, can we safely remove them?

Here is the full output of the commands run just after the import of the dump: https://paste.ubuntu.com/p/GP4ksw7xjK/
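
One way to sanity-check an orphan like that before removing anything (a sketch, not advice from this thread; the ids come from the error above) is to look at the transaction itself and confirm the document it references is missing from both the collection and the stash:
> // inspect the txn, then look for the action doc it references
> db.txns.findOne({"_id": ObjectId("5ac24a1e5540e3051792857b")})
> db.actions.findOne({"_id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334"})
> db.txns.stash.findOne({"_id.c": "actions", "_id.id": "6a783ac4-0b48-45a3-87fc-9646f8bd82de:5c141543-7e4b-4bbb-8881-7458104cc334"})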

Revision history for this message
Tilman Baumann (tilmanbaumann) wrote :

We successfully repaired the database with John's procedure.
Thanks John, that was very helpful.

The document Mario commented about was already covered in John's comment.

Service restored. Bug can be closed. Thanks

John A Meinel (jameinel)
Changed in juju:
status: Incomplete → Fix Released