1.23.2.1, mongo: document is larger than capped size
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | High | Ian Booth | |
| 1.23 | | High | Ian Booth | |
| 1.24 | | High | Ian Booth | |
Bug Description
Nothing too special here, just a local environment running a few ubuntu services where I was testing the landscape client charm.
machine-0: 2015-05-14 00:50:02 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659012 > 1048576
machine-0: 2015-05-14 00:51:03 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659092 > 1048576
machine-0: 2015-05-14 00:52:12 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659192 > 1048576
machine-0: 2015-05-14 00:53:21 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659240 > 1048576
machine-0: 2015-05-14 00:54:21 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659320 > 1048576
machine-0: 2015-05-14 00:55:30 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659372 > 1048576
machine-0: 2015-05-14 00:56:31 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659452 > 1048576
machine-0: 2015-05-14 00:57:40 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659552 > 1048576
Dump of environment attached. Here is a stat dump requested by menn0 as well that didn't yield too many results:
MongoDB shell version: 2.6.3
connecting to: 127.0.0.
{
    "ns" : "juju.txns.log",
    "count" : 38600,
    "size" : 4742664,
    "nindexes" : 1,
    "userFlags" : 0,
    "capped" : true,
    "max" : NumberLong(
    "ok" : 1
}
bye
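For reference, a minimal Go sketch of pulling the same capped-collection stats programmatically with the gopkg.in/mgo.v2 driver. The dial string is an assumption; a real controller also needs Juju's credentials and TLS setup.

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	// Assumed dial string pointing at the state server's mongod.
	session, err := mgo.Dial("127.0.0.1:37017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// collStats is the command behind the shell's db.txns.log.stats().
	var stats bson.M
	if err := session.DB("juju").Run(bson.D{{Name: "collStats", Value: "txns.log"}}, &stats); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("capped=%v count=%v size=%v max=%v\n",
		stats["capped"], stats["count"], stats["size"], stats["max"])
}
```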
| David Britton (davidpbritton) wrote : | #1 |
| Changed in juju-core: | |
| assignee: | nobody → Menno Smits (menno.smits) |
| David Britton (davidpbritton) wrote : | #2 |
| tags: | added: mongodb |
| Changed in juju-core: | |
| status: | New → Triaged |
| milestone: | none → 1.25.0 |
| importance: | Undecided → High |
| John A Meinel (jameinel) wrote : | #3 |
| John A Meinel (jameinel) wrote : | #4 |
Note that I think the symptom:
machine-0: 2015-05-14 00:50:02 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659012 > 1048576
is just that things got out of hand internally and we ended up with so many txn ids on the document that we ran out of document size.
I have the feeling we might be running into a lost transaction, causing a bunch of pending transactions to queue up, and it is just happening on the leadership document because that is a document we try to touch very often (every 30s?).
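For context, the resumer's "cannot resume transactions" error surfaces when mgo/txn is asked to re-run incomplete transactions and fails. A minimal sketch of that call, under the assumption of a locally reachable mongod (Juju's real resumer goes through its state layer rather than calling the runner directly):

```go
package main

import (
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/txn"
)

func main() {
	session, err := mgo.Dial("127.0.0.1:37017") // assumed dial string
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// ResumeAll scans juju.txns for transactions that never reached a terminal
	// state and tries to drive them to completion; the resumer worker's error
	// above is what surfaces when that fails.
	runner := txn.NewRunner(session.DB("juju").C("txns"))
	if err := runner.ResumeAll(); err != nil {
		log.Fatalf("cannot resume transactions: %v", err)
	}
}
```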
| Menno Finlay-Smits (menno.smits) wrote : | #5 |
I restored the DB locally and ran the (fixed) PurgeMissing function over all collections. It found no problems - so it doesn't appear that we have a txn missing from the txns collection.
Still looking.
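For reference, mgo/txn exposes a PurgeMissing helper for this kind of check. A sketch of invoking it against a locally restored dump; the dial string and the collection names are assumptions (an illustrative subset, not the full list Juju checks):

```go
package main

import (
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/txn"
)

func main() {
	session, err := mgo.Dial("127.0.0.1:37017") // assumed local restore of the dump
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	db := session.DB("juju")
	runner := txn.NewRunner(db.C("txns"))

	// PurgeMissing fixes up documents whose txn-queue references transaction
	// ids that no longer exist in juju.txns.
	if err := runner.PurgeMissing("lease", "machines", "services", "units"); err != nil {
		log.Fatalf("purge missing: %v", err)
	}
}
```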
| Menno Finlay-Smits (menno.smits) wrote : | #6 |
There are 4953 txns in the "preparing" state, 2 in the "prepared" state, and 1 (stuck?) in the "applying" state. They are all leadership update attempts for the "ubuntu" charm.
The code that generates these txns looks potentially problematic to me:
ops := []txn.Op{
    // First remove anything that's there.
    {
        C:      p.collectionName,
        Id:     id,
        Remove: true,
    },
    // Then insert the token.
    {
        Assert: txn.DocMissing,
        C:      p.collectionName,
        Id:     id,
        Insert: entity,
    },
}
Removing and inserting the same id in the one txn could be triggering a problem in mgo/txn.
Could a simulated upsert be used instead?
http://
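For illustration only, here is one possible shape for such a simulated upsert on top of mgo/txn: check whether the document exists, assert on what was seen, and retry if a concurrent writer invalidates the assertion. The writeToken helper, its signature, and the retry limit are hypothetical and not Juju's actual lease code:

```go
package lease

import (
	"fmt"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
	"gopkg.in/mgo.v2/txn"
)

// writeToken is a hypothetical simulated upsert: rather than Remove+Insert in
// a single transaction, it updates the document if it exists and inserts it
// otherwise, retrying when the assertion aborts due to a concurrent change.
// token must not contain _id; the Op's Id supplies it on insert.
func writeToken(runner *txn.Runner, coll *mgo.Collection, id string, token bson.M) error {
	for attempt := 0; attempt < 3; attempt++ {
		n, err := coll.FindId(id).Count()
		if err != nil {
			return err
		}
		op := txn.Op{C: coll.Name, Id: id}
		if n == 0 {
			op.Assert = txn.DocMissing
			op.Insert = token
		} else {
			op.Assert = txn.DocExists
			op.Update = bson.M{"$set": token}
		}
		// An empty id lets the runner allocate a transaction id itself.
		err = runner.Run([]txn.Op{op}, "", nil)
		if err == nil {
			return nil
		}
		if err != txn.ErrAborted {
			return err
		}
		// Aborted: the document appeared or vanished between the read and the
		// transaction; re-check and try again.
	}
	return fmt.Errorf("lease write for %q kept aborting", id)
}
```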
| Menno Finlay-Smits (menno.smits) wrote : | #7 |
For completeness here's how the stuck txns look:
{
    "_id" : ObjectId(
    "s" : 1,
    "o" : [
        {
            "c" : "lease",
            "d" : "ubuntu-
            "r" : true
        },
        {
            "c" : "lease",
            "d" : "ubuntu-
            "a" : "d-",
            "i" : {
                "lastupdate" : ISODate(
                "token" : {
                    "namespace" : "ubuntu-
                    "id" : "ubuntu/0",
                    "expiration" : ISODate(
                }
            }
        }
    ]
}
Only the lastupdate and expiration values vary. Going by the lastupdate values, the problem started around 2015-05-07 18:34, with the latest txns continuing right up to the time the DB dump was taken.
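A small sketch of how counts like those in comments #6 and #7 can be pulled from a restored dump. The dial string is assumed, and the numeric values of the "s" field are mgo/txn's internal state encoding, which as far as I know is 1=preparing, 2=prepared, 3=aborting, 4=applying, 5=aborted, 6=applied:

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	session, err := mgo.Dial("127.0.0.1:37017") // assumed local restore
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	txns := session.DB("juju").C("txns")
	names := map[int]string{1: "preparing", 2: "prepared", 3: "aborting", 4: "applying", 5: "aborted", 6: "applied"}
	for s := 1; s <= 6; s++ {
		// Restrict to txns that touch the lease collection, since those are the stuck ones.
		n, err := txns.Find(bson.M{"s": s, "o.c": "lease"}).Count()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-10s %d\n", names[s], n)
	}
}
```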
| Menno Finlay-Smits (menno.smits) wrote : | #8 |
Actually, it looks like things went bad a little before that time but certainly on 7 May.
Strangely, it seems like the logs just stop from 2015-05-05 19:04:06 until 2015-05-07 14:30:51, when the resumer starts complaining. This is a local provider deployment, so perhaps the host was off/suspended?
When the resumer starts reporting problems the initial messages are:
2015-05-07 14:30:51 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: !loc.isNull()
2015-05-07 14:31:56 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: !loc.isNull()
2015-05-07 14:32:57 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1048580 > 1048576
I'm not sure what those first 2 messages mean.
| William Reade (fwereade) wrote : | #9 |
Hmm, yes, ISTM that that lease code will only work if the document's not already there. I'd better go take a look at that.
| William Reade (fwereade) wrote : | #10 |
1) Those `!loc.isNull()` messages seem to be coming out of mongo, but I'm not sure why.
2) Those transactions are (somewhat to my surprise) fine (at least in that the assert doesn't fail as I would have expected it to -- it seems that mgo/txn does handle that case).
3) Those transactions are very much *not* fine because they're unconditional writes that can be arbitrarily delayed and run in any order, rendering them... er, unhelpful for tracking lease ownership.
Can't immediately see a path from (3) to the bug, but that's what leaps out at me.
| David Britton (davidpbritton) wrote : | #11 |
Note, I just filed this issue, which is happening on a non-local provider (openstack), 1.23.0. Is it the same thing? Should it be duped out here?
If someone wants to log into that machine and take a look, they certainly can. Let me know.
| Menno Finlay-Smits (menno.smits) wrote : | #12 |
The symptoms for bug 1457645 are somewhat different but the problems could be related. I've added some more detail to the other ticket.
| Changed in juju-core: | |
| assignee: | Menno Smits (menno.smits) → nobody |
| Ian Booth (wallyworld) wrote : | #13 |
I'm marking this as Fix Committed on the basis that the fix for bug 1457645 may have fixed this also. Please re-open if the issue re-appears.
| Changed in juju-core: | |
| assignee: | nobody → Ian Booth (wallyworld) |
| status: | Triaged → Fix Committed |
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |


In the initial paste (private link): https://pastebin.canonical.com/131827/
log line attempted (677k) over max size(10k), printing beginning and end ... update juju.txns.stash query: { _id: { c: "lease", id: "trusty-server-leadership" }, n: "d5b7100b" } update: { $set: { txn-queue: [ "55352def91a7b10f460007b4_127104b4", "55352e1091a7b10f46000802_01edfd48", ...
That looks like something that is having severe problems with conflicting transactions on the trusty-server-leadership document, enough so that it ends up with hundreds (thousands?) of txn-queue ids (these are transactions that want to be applied to the document, but are unable to be cleaned up for some reason).
Also on the document:
txn-revno: -37441
I'm pretty sure txn-revno is supposed to stay a positive value, and negative values are used to toggle some sort of boolean state:
https://github.com/go-mgo/mgo/blob/v2/txn/flusher.go#L481
Remove operations try to invert the revno, as do Insert operations (Insert because it should start at -1), and Update operations are supposed to just increment it.
So maybe one of the txns is trying to Remove the document, and a bunch of others are trying to Update it?
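For reference, a small sketch of how the txn-queue length and txn-revno mentioned above could be read off the lease document. The dial string and the document id are assumptions pieced together from the fragments quoted in this report:

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/mgo.v2"
)

func main() {
	session, err := mgo.Dial("127.0.0.1:37017") // assumed local restore of the dump
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// mgo/txn bookkeeping lives on the target document itself: txn-queue holds
	// the ids of transactions still wanting to touch the doc, and txn-revno is
	// its revision number (the negative value quoted above is the oddity here).
	var doc struct {
		Queue []string `bson:"txn-queue"`
		Revno int64    `bson:"txn-revno"`
	}
	leaseID := "trusty-server-leadership" // assumed id, per the quoted log line
	if err := session.DB("juju").C("lease").FindId(leaseID).One(&doc); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("txn-queue length: %d, txn-revno: %d\n", len(doc.Queue), doc.Revno)
}
```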