jujud on state server panic misses transaction in queue

Bug #1318366 reported by James Page on 2014-05-11
This bug affects 6 people
Affects             Importance  Assigned to
juju-core           High        Menno Smits
juju-core 1.20      High        Menno Smits
juju-core (Ubuntu)  High        Unassigned

Bug Description

Environment: 256 server MAAS environment, juju-core 1.18.1 or 1.18.2.
Deployment: OpenStack charms with four instances of the nova-compute service, each containing ~64 servers

All 256 servers are provisioned correctly with the principal service being deployed, but as soon as the relations are added between services, the load on the bootstrap node goes high and the jujud machine 0 agent starts logging a lot of errors (see attached).

It's a big log, with a large goroutine stack trace, triggered by:

panic: rescanned document misses transaction in queue

at around line 20000 of the log

James Page (james-page) on 2014-05-11
summary: - provisioning large environment with complex relations caused jujud on
+ provisioning large environments with complex relations caused jujud on
bootstrap node to fail
James Page (james-page) on 2014-05-11
tags: added: sm15k
James Page (james-page) wrote :

I had to work around this by bringing up 128 servers to start with and then adding more in 64-server chunks; see bug 1317909 for details on how this degrades as environments get large.

James Page (james-page) wrote :
Curtis Hovey (sinzui) on 2014-05-12
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → next-stable
tags: added: panic performance
James Page (james-page) on 2014-06-18
Changed in juju-core (Ubuntu):
importance: Undecided → High
status: New → Triaged
John A Meinel (jameinel) wrote :

I wonder if this is just saturating the transaction log. The txn log is a capped collection, so it only allows N entries before it starts overwriting earlier ones. I can imagine there are other issues going on, but we might want a knob that lets you say "I'm going to be running thousands of things, make sure my TXN log is oversized."
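
For illustration only, here is a rough sketch of what such a knob could look like using the mgo driver; the dial address, database/collection names and sizes are made up and this is not juju's actual bootstrap code:

    // Sketch only: create an oversized capped collection so that many more
    // entries fit before old ones start being overwritten.
    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        info := &mgo.CollectionInfo{
            Capped:   true,
            MaxBytes: 512 * 1024 * 1024, // MongoDB requires a byte size for capped collections
            MaxDocs:  10000000,          // the hypothetical "oversized" knob
        }
        if err := session.DB("juju").C("txns.log").Create(info); err != nil {
            log.Fatal(err)
        }
    }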

Adam Collard (adam-collard) wrote :

I just hit this on a much smaller deploy: bootstrap with 9 LXC containers and a hulk-smashed unit on machine 0, plus 4 other physical machines.

Note that there have been many, many such deployments with the exact same configuration, but this time it failed with a "panic: rescanned document misses transaction in queue"

tags: added: cloud-installer landscape
summary: - provisioning large environments with complex relations caused jujud on
- bootstrap node to fail
+ jujud on state server panic misses transaction in queue
Adam Collard (adam-collard) wrote :

Please see the attached mongo log; it seems to be dying a lot.

Adam Collard (adam-collard) wrote :

Note my panic was from Juju 1.20.1. I have the full log, but am worried about leaking sensitive information.

tags: added: orange-box
Curtis Hovey (sinzui) on 2014-07-24
Changed in juju-core:
milestone: next-stable → 1.21-alpha1
Nate Finch (natefinch) wrote :

I'm looking at this right now.

Nate Finch (natefinch) wrote :

Is there a more complete log than the one posted on james-page's link? That one appears to be cut off at a million lines, doesn't contain the full panic output, and has nothing before the panic, which would be very helpful.

Nate Finch (natefinch) wrote :

Note: I received full logs privately, as they may have sensitive info.

Gustavo Niemeyer (niemeyer) wrote :

This error should never happen on a healthy database. The only case I've debugged with such an issue was on a system that had a corrupted database due to an out-of-space situation.

The reason why this should never happen is clear in the code of the txn package: before anything is ever done with regard to a transaction, the document that represents that transaction is inserted into the database. Only after that is the id of that transaction used to tag the documents affected by it. What the error is saying is that, while attempting to flush pending transactions on an affected document, it found a reference to a transaction that does not exist.

Things that might lead to such a situation include, for example:

1. A corrupted database

2. An abrupt shutdown on a database without a journal

3. A capped transaction collection that is eating its tail while old transactions are still in use
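
For context, a minimal sketch of how the txn package is normally driven (assuming the gopkg.in/mgo.v2/txn API; the dial address, collection names and the operation itself are only illustrative). As described above, the runner records the transaction as a document in the txns collection before it tags any of the documents that the transaction affects:

    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        db := session.DB("juju")
        // The runner stores each transaction document in "txns" before
        // touching any of the affected documents.
        runner := txn.NewRunner(db.C("txns"))

        ops := []txn.Op{{
            C:      "services",
            Id:     "ntp",
            Assert: txn.DocExists,
            Update: bson.M{"$inc": bson.M{"unitcount": 1}}, // made-up field
        }}
        if err := runner.Run(ops, "", nil); err != nil {
            log.Fatal(err)
        }
    }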

Gustavo Niemeyer (niemeyer) wrote :

@John, it's definitely a bad idea to have transactions in a capped collection for that sort of reason, but as far as I can see the _txns_ collection, the one holding the transactions themselves, is not capped. Having missing logs for a transaction would not cause this issue.

Adam Collard (adam-collard) wrote :

If anyone hits this again, you should grab a dump of the Mongo db (mongodump pointed at the juju-db address).

Gustavo Niemeyer (niemeyer) wrote :

Looking at the logs from Adam that Nate forwarded to me, I can see the database is being terminated and restarted over and over, every few seconds. Looking at the logs around it, it looks like at least rsyslogd is also being refreshed on the same cadence.

By itself, this should not be an issue, and I could not yet find anything obvious to point the finger at, but the environment is definitely not healthy.

Mark Shuttleworth (sabdfl) wrote :

I've got this in a live environment from the cloud-installer too. How do I know where to point mongodump?

Mark Shuttleworth (sabdfl) wrote :

Yes, I saw the same restarting of Mongo, looks like every 10-15 seconds.

Mark Shuttleworth (sabdfl) wrote :

Attached is a dump of the Juju database in this case.

Mark Shuttleworth (sabdfl) wrote :

After some experiments in compression options, here are all the Juju logs you could ever want :)

http://people.canonical.com/~mark/juju-server-crash-logs.tar.xz

68M compressed, about 1.9G uncompressed. That's /var/log/juju/ from machine 0.

Kapil Thangavelu (hazmat) wrote :

And the db dump (732k) of the juju db in mongo on Mark's state server is at
http://chinstrap.canonical.com/~kapil/bug-13183656-juju-db.dump.tbz2

Mark Shuttleworth (sabdfl) wrote :

Here is a snippet of syslog showing two cycles of Mongo starts and restarts. This is happening constantly!

Gustavo and I are wondering whether the numactl advice might be relevant.

On 29 July 2014 20:12, Mark Shuttleworth <email address hidden> wrote:

>
> Gustavo and I are wondering whether the numactl advice might be
> relevant.

I humbly suggest that's a red herring. Note this happened on an Orange Box
(see log attached in
https://bugs.launchpad.net/juju-core/+bug/1318366/comments/7 ) and,
unsurprisingly, there was no such advice from Mongo.

Mark Shuttleworth (sabdfl) wrote :

Digging in further, it appears that jujud is writing to /etc/init/juju-db.conf (the Upstart job for its database) every few seconds. I'll file a separate bug about this because it plausibly is the root cause of the mongo restarts we're seeing.

Menno Smits (menno.smits) wrote :

Summarising what we know so far and adding a few more details:

1. We understand why mongo was continually restarting. jujud currently restarts mongo every time it starts, so every time jujud panicked (~every 10s), upstart would restart jujud and jujud would restart mongo. This explains the observed mongo restarts and upstart config rewrites. This isn't ideal behaviour and is being fixed right now (bug #1349969).

2. As per @niemeyer, the txn collection is definitely not capped. txn.log is capped, but in the dump provided by Kapil above it only has 15,671 documents in it. The cap is 10,000,000, so there's plenty of breathing room. Capped collections are unlikely to be at play here (a way to check this against a dump is sketched below).

3. @sabdfl, @adam-collard and @james-page have all seen this issue in separate environments. There may be a scale/load component to this (although Adam says he's seen it with a small-scale Juju environment).

Does anyone have any ideas on what to look at from here?
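
For anyone checking their own dump, point 2 can be verified with something along these lines (a sketch only, assuming the gopkg.in/mgo.v2 driver; the dial address and collection name are placeholders):

    // Sketch: inspect the txn log collection stats from a restored dump.
    package main

    import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        var stats bson.M
        if err := session.DB("juju").Run(bson.D{{"collStats", "txns.log"}}, &stats); err != nil {
            log.Fatal(err)
        }
        // "capped" should be true, "count" is the number of documents, and
        // "max" is the document cap (10,000,000 in the dump discussed above).
        fmt.Println(stats["capped"], stats["count"], stats["max"])
    }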

Mark Shuttleworth (sabdfl) wrote :

Is there a way to run Go processes under a debugger and generate very
high-resolution debugging output? I'm seeing this every second or third
attempt to build a cloud. It might be that debugging overhead makes the
problem vanish (yay Heisenberg) but it might give us a useful picture.

Alternatively, I can provide ssh access to an environment which
demonstrates the crash-and-restart behaviour.

Mark

Gustavo Niemeyer (niemeyer) wrote :

As posted in juju-dev last night:

Okay, I couldn't resist investigating a bit. I've been looking at the
database dump from earlier today and it's smelling like a simpler bug
in the txn package, and I might have found the cause already.

Here is a quick walkthrough while debugging the problem, to also serve
as future aid in similar quests.

Enabling full debug for the txn package with SetDebug and SetLogger,
and doing a ResumeAll to flush all pending transactions, we can
quickly get to the affected document and transaction:

2014/07/30 02:19:23 Resuming all unfinished transactions
2014/07/30 02:19:23 Resuming 53d6057930009a01ba0002e7 from "prepared"
2014/07/30 02:19:23 a) Processing 53d6057930009a01ba0002e7_dcdbc894
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002e7_dcdbc894
2014/07/30 02:19:23 a) Rescanned queue with 53d6057930009a01ba0002e7_dcdbc894: has prereqs, not forced
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002eb_98124806
2014/07/30 02:19:23 a) Rescanned queue with 53d6057930009a01ba0002eb_98124806: has prereqs, not forced
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002ee_a83bd775
2014/07/30 02:19:23 a) Rescanned document {services ntp} misses 53d6057930009a01ba0002ee_a83bd775 in queue: [53d6057930009a01ba0002eb_98124806 53d6057930009a01ba0002ea_4ca6ed41 53d6057c30009a01ba0002fd_4d8d9123 53d6057e30009a01ba000301_ba0b61dd 53d6057e30009a01ba000303_a26cb429]
2014/07/30 02:19:23 a) Reloaded 53d6057930009a01ba0002ee_a83bd775: "prepared"
panic: rescanned document misses transaction in queue
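
For reference, a rough sketch of how that debug trace can be produced against a restored dump (assuming the gopkg.in/mgo.v2/txn API; the dial address and collection name are placeholders):

    // Sketch only: enable txn debug logging and flush pending transactions.
    package main

    import (
        "log"
        "os"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017") // the restored dump
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Full tracing of the txn state machine.
        txn.SetDebug(true)
        txn.SetLogger(log.New(os.Stderr, "", log.LstdFlags))

        // Flushing every pending transaction is what hits the panic above.
        runner := txn.NewRunner(session.DB("juju").C("txns"))
        if err := runner.ResumeAll(); err != nil {
            log.Fatal(err)
        }
    }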

So this error actually means something slightly different from what I
pointed out in the bug.

The transaction runner state machine creates transactions in the
"preparing" state, and then moves them over to "prepared" once all
affected documents have been tagged with the transaction id+nonce. So
what this means is that there is a transaction in progress in the
prepared state, while the affected document misses that id in its local
queue, which is an impossible situation unless the document was fiddled
with, there was corruption, or there is a bug in the code.

So, let's have a look at the affected documents. First, the document
being changed:

    db.services.findOne({_id: "ntp"})
    http://paste.ubuntu.com/7902134/

We can see a few transactions in the queue, but the one raising the
issue is not there as reported by the error.

And this is the full transaction raised by the error:

    db.txns.findOne({_id: ObjectId("53d6057930009a01ba0002ee")})
    http://paste.ubuntu.com/7902095/

One interesting thing we can do from here is verify that other
documents which are part of this transaction are properly tagged, and
indeed they are, so this one document alone has the issue in this
transaction. Good hint.

Another detail worth checking is getting the set of all transactions
involving the affected document since it was created, stripping
operations from unrelated documents:

    db.txns.aggregate(
            {$match: {"o.c": "services", "o.d": "ntp"}},
            {$unwind: "$o"},
            {$match: {"o.c": "services", "o.d": "ntp"}})
    http://paste.ubuntu.com/7902097/

This has quite a few useful hints about the problem. The document was
c...


Gustavo Niemeyer (niemeyer) wrote :

Alright, the guess last night was correct, and the candidate fix as well. I've managed to reproduce the problem by stressing out the scenario described with 4 concurrent runners running the following two operations, while the chaos mechanism injects random slowdowns at various critical points:

        []txn.Op{{
                C: "accounts",
                Id: 0,
                Update: M{"$inc": M{"balance": 1}},
        }, {
                C: "accounts",
                Id: 1,
                Update: M{"$inc": M{"balance": 1}},
        }}

To reach the bug, the stress test also has to run half of the transactions in this order, and the other half with these same operations in the opposite order, so that dependency cycles are created between the transactions. Note that the txn package guarantees that operations are always executed in the order provided in the transaction.
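
A simplified sketch of that scenario, for illustration only (the real stress test is in the commit linked below; the dial address, database name, runner count and the assumption that the two account documents already exist are placeholders):

    // Illustrative sketch: four concurrent runners, half applying the two
    // operations in the given order and half in the opposite order, which
    // creates dependency cycles between the transactions.
    package main

    import (
        "sync"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            panic(err)
        }
        defer session.Close()
        db := session.DB("txn-stress")

        // Assumes the two account documents were inserted beforehand.
        forward := []txn.Op{{
            C:      "accounts",
            Id:     0,
            Update: bson.M{"$inc": bson.M{"balance": 1}},
        }, {
            C:      "accounts",
            Id:     1,
            Update: bson.M{"$inc": bson.M{"balance": 1}},
        }}
        backward := []txn.Op{forward[1], forward[0]}

        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            ops := forward
            if i%2 == 1 {
                ops = backward
            }
            wg.Add(1)
            go func(ops []txn.Op) {
                defer wg.Done()
                runner := txn.NewRunner(db.C("txns"))
                if err := runner.Run(ops, "", nil); err != nil && err != txn.ErrAborted {
                    panic(err)
                }
            }(ops)
        }
        wg.Wait()
    }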

The fix and the complete test is available in this change:

    https://github.com/go-mgo/mgo/commit/3bc3ddaa

The numbers there are lower so it runs in a reasonable amount of time, but to give some confidence in the fix and the code in general, I've run this test with 100k transactions being concurrently executed with no problems.

Also, to give a better perspective of the sort of outcome that the logic for concurrent runners produces, this output was generated by that test while running 100 transactions:

    http://paste.ubuntu.com/7906618/

Tokens like "a)" in these lines are the unique identifiers of the individual transaction runners. Note how every single operation is executed in precise lock-step despite the concurrency and the ordering issues, even assigning the same revision to both documents since they were created together.

Also, perhaps most interestingly, note the occurrences such as:

[LOG] 0:00.180 b) Applying 53d92a4bca654539e7000003_7791e1dc op 0 (update) on {accounts 0} with txn-revno 2: DONE
[LOG] 0:00.186 d) Applying 53d92a4bca654539e7000003_7791e1dc op 1 (update) on {accounts 1} with txn-revno 2: DONE

Note the first one is "b)" while the second one is "d)", which means there are two completely independent runners, in different goroutines (might as well be different machines), collaborating towards the completion of a single transaction.

So, I believe this is sorted. Please let me know how it goes there.

Mark Shuttleworth (sabdfl) wrote :

Thanks Gustavo, this is 50% of the issues I see on cloud builds, so I am excited to get a build of the tools with this fix applied. Curtis, do you think we can spin a build through CI ASAP that would show up in the testing tools bucket on S3?

Ian Booth (wallyworld) on 2014-07-31
Changed in juju-core:
status: Triaged → Fix Committed
assignee: nobody → Menno Smits (menno.smits)
Curtis Hovey (sinzui) on 2014-09-08
Changed in juju-core:
status: Fix Committed → Fix Released