Juju state server database is overly large

Bug #1344940 reported by Mark Shuttleworth on 2014-07-19
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
High
Andrew Wilkins
1.20
High
Andrew Wilkins

Bug Description

On a machine which is being used to build out an openstack:

ubuntu@pbdpm:~$ sudo du -csh /var/lib/juju/
51G /var/lib/juju/
51G total

Is there any reason to pre-allocate that much space?

Kapil Thangavelu (hazmat) wrote :

the problem is 1000s of writes without an actual change, ie. the transaction log is full of these

    stateServers:apiHostPorts update {u'$set': {u'apihostports': [[{u'networkscope': u'public', u'addresstype': u'hostname', u'port': 17070, u'value': u'pbdpm.cluster.mallards'}, {u'networkscope': u'local-cloud',\
 u'addresstype': u'ipv4', u'port': 17070, u'value': u'192.168.9.74'}, {u'networkscope': u'local-machine', u'addresstype': u'ipv4', u'port': 17070, u'value': u'127.0.0.1'}, {u'networkscope': u'local-machine', u'ad\
dresstype': u'ipv6', u'port': 17070, u'value': u'::1'}, {u'addresstype': u'ipv6', u'port': 17070, u'value': u'fe80::225:90ff:fe4c:e79e'}]]}}
2014/07/19-15:41:13 applied
    stateServers:apiHostPorts update {u'$set': {u'apihostports': [[{u'networkscope': u'public', u'addresstype': u'hostname', u'port': 17070, u'value': u'pbdpm.cluster.mallards'}, {u'networkscope': u'local-cloud',\
 u'addresstype': u'ipv4', u'port': 17070, u'value': u'192.168.9.74'}, {u'networkscope': u'local-machine', u'addresstype': u'ipv4', u'port': 17070, u'value': u'127.0.0.1'}, {u'networkscope': u'local-machine', u'ad\
dresstype': u'ipv6', u'port': 17070, u'value': u'::1'}, {u'addresstype': u'ipv6', u'port': 17070, u'value': u'fe80::225:90ff:fe4c:e79e'}]]}}

we should not be writing to the database when there is no actual change.

Kapil Thangavelu (hazmat) wrote :

here's a pretty print of the last 100 txns in the transaction collection.

http://paste.ubuntu.com/7820254/

Kapil Thangavelu (hazmat) wrote :

so the actual issue.. is that we're configuring a replica set with 1 member and mongodb is storing the replication oplog in a very large capped collection. combined with constant database writes, we get very large oplog being stored.

so solution would be

a) stop configuring a replica set before we have one, tune the settings on the replica set oplog ( currently set to 51200MB) which gives us the 51Gb number
b) have juju stop writing to the database when there are no changes to actually write cause its inflating the oplog

even steady state this system is producing 15-50 write ops a second, it has saturated the disk i/o during actual activity.

juju:PRIMARY> show dbs;
admin 0.03125GB
juju 0.03125GB
local 50.515380859375GB
presence 0.03125GB
test (empty)

# on local, back to back commands show the growth due to spurious writes.

juju:PRIMARY> db['oplog.rs'].count()
55033
juju:PRIMARY> db['oplog.rs'].count()
55041

# on juju

rs.status()
{
        "set" : "juju",
        "date" : ISODate("2014-07-19T16:19:49Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 1,
                        "name" : "192.168.9.74:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 10732,
                        "optime" : Timestamp(1405786789, 2),
                        "optimeDate" : ISODate("2014-07-19T16:19:49Z"),
                        "self" : true
                }
        ],
        "ok" : 1
}

juju:PRIMARY> db.printReplicationInfo()
configured oplog size: 51200MB
log length start to end: 11119secs (3.09hrs)
oplog first event time: Sat Jul 19 2014 13:20:46 GMT+0000 (UTC)
oplog last event time: Sat Jul 19 2014 16:26:05 GMT+0000 (UTC)
now: Sat Jul 19 2014 16:26:05 GMT+0000 (UTC)

James Troup (elmo) on 2014-07-19
tags: added: canonical-is
James Troup (elmo) wrote :

FWIW, we're seeing the same thing and this appears to be a regression
in 1.20.1:

450M /var/lib/juju/db
  ^-- an OpenStack cloud deployed by Juju 1.18.4

23G /var/lib/juju/db/
  ^-- an OpenStack cloud deployed by Juju 1.20.1

Kapil Thangavelu (hazmat) wrote :
Download full text (4.3 KiB)

we should also rethink replication on the presence database which is the majority source of constant writes (and O(n) size of env). from op log

{ "ts" : Timestamp(1405780166, 2), "h" : NumberLong("7248998997472728194"), "v" : 2, "op" : "u", "ns" : "juju.stateServers", "o2" : { "_id" : "apiHostPorts" }, "o" : { "$set" : { "txn-queue.0" : "53ca80c6ac02cd3e930005cc_14963302" } } }
{ "ts" : Timestamp(1405780166, 3), "h" : NumberLong("-5418504314646616294"), "v" : 2, "op" : "u", "ns" : "juju.txns", "o2" : { "_id" : ObjectId("53ca80c6ac02cd3e930005cc") }, "o" : { "$set" : { "n" : "14963302", "s" : 2 } } }
{ "ts" : Timestamp(1405780166, 4), "h" : NumberLong("7570052371187181427"), "v" : 2, "op" : "u", "ns" : "juju.txns", "o2" : { "_id" : ObjectId("53ca80c6ac02cd3e930005cc") }, "o" : { "$set" : { "r" : [ NumberLong(67) ], "s" : 4 } } }
{ "ts" : Timestamp(1405780166, 5), "h" : NumberLong("8591084146477772223"), "v" : 2, "op" : "u", "ns" : "juju.stateServers", "o2" : { "_id" : "apiHostPorts" }, "o" : { "$set" : { "apihostports" : [ [ { "value" : "pbdpm.cluster.mallards", "addresstype" : "hostname", "networkscope" : "public", "port" : 17070 }, { "value" : "192.168.9.74", "addresstype" : "ipv4", "networkscope" : "local-cloud", "port" : 17070 }, { "value" : "127.0.0.1", "addresstype" : "ipv4", "networkscope" : "local-machine", "port" : 17070 }, { "value" : "::1", "addresstype" : "ipv6", "networkscope" : "local-machine", "port" : 17070 }, { "value" : "fe80::225:90ff:fe4c:e79e", "addresstype" : "ipv6", "port" : 17070 } ] ], "txn-queue" : [ ], "txn-revno" : NumberLong(68) } } }
{ "ts" : Timestamp(1405780166, 6), "h" : NumberLong("-4967417421360925452"), "v" : 2, "op" : "i", "ns" : "juju.txns.log", "o" : { "_id" : ObjectId("53ca80c6ac02cd3e930005cc"), "stateServers" : { "d" : [ "apiHostPorts" ], "r" : [ NumberLong(68) ] } } }
Type "it" for more
juju:PRIMARY> it
{ "ts" : Timestamp(1405780167, 1), "h" : NumberLong("-2430360262590208338"), "v" : 2, "op" : "u", "ns" : "juju.txns", "o2" : { "_id" : ObjectId("53ca80c6ac02cd3e930005cc") }, "o" : { "$set" : { "s" : 6 } } }
{ "ts" : Timestamp(1405780168, 1), "h" : NumberLong("2892245837688454076"), "v" : 2, "op" : "i", "ns" : "presence.presence.pings", "o" : { "_id" : NumberLong(1405780170), "alive" : { "1" : NumberLong("4294967296") } } }
{ "ts" : Timestamp(1405780169, 1), "h" : NumberLong("-4920408163957595002"), "v" : 2, "op" : "u", "ns" : "presence.presence.pings", "o2" : { "_id" : NumberLong(1405780170) }, "o" : { "$set" : { "alive.0" : NumberLong("9007199254740992") } } }
{ "ts" : Timestamp(1405780169, 2), "h" : NumberLong("-8421208321601717803"), "v" : 2, "op" : "u", "ns" : "presence.presence.pings", "o2" : { "_id" : NumberLong(1405780170) }, "o" : { "$set" : { "alive.1" : NumberLong("5368709120") } } }
{ "ts" : Timestamp(1405780170, 1), "h" : NumberLong("-1730416142908418047"), "v" : 2, "op" : "u", "ns" : "presence.presence.pings", "o2" : { "_id" : NumberLong(1405780170) }, "o" : { "$set" : { "alive.1" : NumberLong("5372903424") } } }
{ "ts" : Timestamp(1405780170, 2), "h" ...

Read more...

Thanks Kapil.

Which team is handling the replication story, and can we have a response
from them on prioritisation and fix strategy please?

Do we need a separate bug for the noop database writes?

I filed a separate bug #1345541 to ask for acceptance tests that would
catch this sort of regression in future.

Mark

Kapil Thangavelu (hazmat) wrote :

The op log size does seem to thankfully be dependent on some algorithm, on an aws env our default instances have an 8Gb root disk and the replication oplog seems to be constrained there to 1Gb. So public cloud envs won't kill environments by filling up the disk entirely, just issues with i/o starvation.

Ian Booth (wallyworld) wrote :

The initial op log size is calculated according to the Mongo recommendations:
http://docs.mongodb.org/manual/core/replica-set-oplog/

On 64bit Linux systems, it is 5% of the free disk space, but never less than 1GB and never more than 50GB.
So as Kapil says, on AWS deployments with an 8GB root disk, the op log size will be 1GB.

But for systems with a very big root disk, the size will indeed max out to 50GB. This is as per the documented Mongo behaviour.
What Juju does is pre-allocate the oplog prior to starting Mongo to reduce the start up delay and subsequent timing issues which were causing machine agent flakiness. But the pre-allocation is done according the the Mongo algorithm used for the file size.

On local provider, we do actually reduce the initial oplog size to improve startup times, but didn't feel the same decision to move away from the Mongo defaults would be pertinent for real cloud deployments.

The real issue is the fact that database writes appear to be occurring without an actual model change occurring. I've raised a separate bug for this, bug 1345832

The other 3 issues mentioned in this bug:

1. Mongo oplog size
I'd like to leave this the way it is since we're following Mongo recommendations (unless there's a technical reason not to and we get agreement from Kapil and William and whoever else that we should change).

2. Should the presence database be replicated.
I'll initiate a conversation but it does seem on the surface like it's not really required.

3. Don't configure a repliaset before we have one.
I'm not a Mongo expert but with ask the folks who implemented the repliaset stuff whether that's feasible.

These issues will be given top priority.

Changed in juju-core:
importance: Undecided → High
status: New → Triaged
milestone: none → 1.20.2
tags: added: cloud-installer landscape
Ian Booth (wallyworld) wrote :

I've emailed Mark Ramm an update to distribute to stakeholders. Here's a summary:

Bug 1345832 (unnecessary writes to database for unchanged api host addresses)
Fix committed to trunk, being backported to 1.20

1. Mongo oplog size

The allocated file size as is. It's not the file size that's the problem but the constant writes.

2. Should the presence database be replicated.

 tl;dr; Juju's implementation has evolved to the point whereby the use of a presence database (and associated replication thereof) is no longer an optimal approach. Andrew is investigating a solution whereby the presence status is maintained in memory and communicated amongst state servers via api calls.

3. Don't configure a repliaset before we have one.

We don't want to introduce complexity by going down this road. The cost to setting up a replicaset is minimal compared with the
complexity and additional moving parts needed to initially configure without a replicaset and activate later when needed. Most if not all serious deployments will want HA and HA means using a replicaset.

Mark Shuttleworth (sabdfl) wrote :

On 21/07/14 01:00, Ian Booth wrote:
> The initial op log size is calculated according to the Mongo recommendations:
> http://docs.mongodb.org/manual/core/replica-set-oplog/
>
> On 64bit Linux systems, it is 5% of the free disk space, but never less than 1GB and never more than 50GB.
> So as Kapil says, on AWS deployments with an 8GB root disk, the op log size will be 1GB.
>
> But for systems with a very big root disk, the size will indeed max out to 50GB. This is as per the documented Mongo behaviour.
> What Juju does is pre-allocate the oplog prior to starting Mongo to reduce the start up delay and subsequent timing issues which were causing machine agent flakiness. But the pre-allocation is done according the the Mongo algorithm used for the file size.

That Mongo algorithm may make sense for "generic Mongo likely to be big
data" but isn't it true that we know the likely behaviour of the
database much more accurately?

In a world without multi-tenanted multi-environment state servers the
"natural" size of the database is likely to be relatively small. If one
is bootstrapping a next-gen, multi-tenant multi-environment state
service then one could say so and we might apply a different algorithm.

Even 1GB seems big to me.

> On local provider, we do actually reduce the initial oplog size to
> improve startup times, but didn't feel the same decision to move away
> from the Mongo defaults would be pertinent for real cloud deployments.

Let's please take a more opinionated view of what a single-environment
database size should look like, anywhere!

The consequences of this decision are a 51G db on a reasonably sized
server on which I'd just like to run the state service for an OpenStack,
which is obviously unacceptable.

> The real issue is the fact that database writes appear to be occurring
> without an actual model change occurring. I've raised a separate bug for
> this, bug 1345832

Thank you! And I see Andrew is right on it :)

> The other 3 issues mentioned in this bug:
>
> 1. Mongo oplog size
> I'd like to leave this the way it is since we're following Mongo recommendations (unless there's a technical reason not to and we get agreement from Kapil and William and whoever else that we should change).

I think Mongo is pitching their database as a repo for streamed, low
value data. That's not how we are using it. So I don't think their
recommendations make sense. Let's go the other way: go for the smallest
(and hence lightest footprint) size we think is reasonable. If needed,
put in place some sort of reporting so we can see how actual databases
in the wild are behaving and come up with an appropriate sizing
algorithm for our application. Regardless, the current position is
unacceptable.

Thanks!
Mark

Mark Shuttleworth (sabdfl) wrote :

On 21/07/14 09:11, Ian Booth wrote:
> Bug 1345832 (unnecessary writes to database for unchanged api host addresses)
> Fix committed to trunk, being backported to 1.20

Thank you!

> 1. Mongo oplog size
>
> The allocated file size as is. It's not the file size that's the problem
> but the constant writes.

Consuming resources you won't plausibly use is bad form - it's easily
interpreted as thoughtlessness and arrogance. Please take a smarter view
of what's appropriate and implement that.

> 2. Should the presence database be replicated.
>
> tl;dr; Juju's implementation has evolved to the point whereby the use
> of a presence database (and associated replication thereof) is no longer
> an optimal approach. Andrew is investigating a solution whereby the
> presence status is maintained in memory and communicated amongst state
> servers via api calls.

Distributed state is harder than it looks. I'd prefer we leverage some
existing implementation unless we have a consensus on building something
new.

AIUI the goal is to replace polling of a database with a distributed
in-memory database of "things that are being watched" so that any
incoming or outgoing event can be checked against that. Is that the case?

> 3. Don't configure a repliaset before we have one.
>
> We don't want to introduce complexity by going down this road. The cost to setting up a replicaset is minimal compared with the
> complexity and additional moving parts needed to initially configure without a replicaset and activate later when needed. Most if not all serious deployments will want HA and HA means using a replicaset.

OK, as long as it is not the replicaset which has been blowing out our
resource and performance numbers, that's sensible.

Mark

John A Meinel (jameinel) wrote :
Download full text (3.4 KiB)

Some constraints on why we have an oplog, and why we have to be a bit careful as to how we size it.

1) If you ever want to go HA, you need to start mongo with --replicaSet and have an oplog. AIUI it isn't possible to change this after the database has been defined, so if we wanted to start without it, then we would need to spend time creating a new database, and copying the data over there.
2) It isn't possible to resize the oplog after the fact. Once it has been allocated, mongo doesn't provide tools to change its size.
3) Backups are much more reliable if you have the oplog, because it is possible to actually snapshot the database without stopping it, because it can monitor the oplog and make sure all applied changes are recorded in the backup.

I do agree that the goals of mongo's defaults don't quite match what we actually need out of it. I would like to understand what our actual consumption is likely to be. (How many operations would fit in a 1GB oplog?)
Mongo's defaults are intended to allow replicas to get out of date on a "busy" db by a day, and still be able to get back into sync without having to start from scratch again.
We've generally operated on the assumption that our dataset size isn't enormous (1-10GB would be quite large for a Juju DB), thus we don't really need 50GB for just the oplog.

I feel like we could *probably* get away with a 1GB default oplog size.

I don't think we want to remove replicaset entirely. It makes it far easier to go HA when we need it, and it makes the backup story *much* better. (You can do a safe backup without taking the system offline.)

As for "noop" writes, the problem (AIUI) is in the transaction model we have, it isn't easy to realize that a transaction is a no op and then just not apply it. We've generally used the model of "set the value of X to Y", and allowed that if the value is already Y, thats fine. Even if the transaction was "if the value is not Y, set it to Y", AIUI that still creates an entry in the transaction log (it is a failed transaction, but it still needed to be recorded so that it could be tested for whether it applied.)

I have the feeling that for IP addresses, we aren't particularly concerned with concurrency (last write always wins, because we want to just have whatever the current value is), and if we are polling it, even if we missed one update we would get it right later, and we are unlikely to change IP address and then quickly change back to the previous address. So likely reading the database, and then doing the check of "should this be changed" only in memory and only if we think it might change something then doing the transaction in the database.

I've seen comments in the past that we actually were doing that, but we have significant evidence to the contrary. So maybe what check we have is faulty somehow.

Anyway, if we can be more careful about not creating no-op transactions that probably also helps shrink the bounds of what oplog size we need. Since there will be fewer 'things changing in mongo' that the oplog is tracking.

Could we go below 1GB for the oplog? Maybe, but I'd like us to do a lot of careful analysis if we wanted to do so. Going with the sta...

Read more...

Curtis Hovey (sinzui) on 2014-07-21
tags: added: mongodb
Andrew Wilkins (axwalk) wrote :

I still think we should analyse this (but how? do we have a representative "busy" test scenario?), but it is actually possible to resize the oplog: http://docs.mongodb.org/manual/tutorial/change-oplog-size/

Mark Shuttleworth (sabdfl) wrote :

Can we see the size of the log that's actually being used by some of our
many Juju environments in production? They will vary in scale and
complexity and longevity but that would sure be a good place to start
looking for data?

Mark

Andrew Wilkins (axwalk) wrote :

This is not a production setup (I'd love to see stats from one too), but I've just created an environment on Azure with 3 state servers and 16 units of the "ubuntu" charm. Yes it's a donkey charm, but provisioning and just having the machine agent active causes enough churn on state that it should be somewhat useful.

Here's what I get:

juju:PRIMARY> db.getReplicationInfo()
{
 "logSizeMB" : 1368,
 "usedMB" : 4.24,
 "timeDiff" : 4231,
 "timeDiffHours" : 1.18,
 "tFirst" : "Wed Jul 23 2014 11:15:17 GMT+0800 (WST)",
 "tLast" : "Wed Jul 23 2014 12:25:48 GMT+0800 (WST)",
 "now" : "Wed Jul 23 2014 12:25:48 GMT+0800 (WST)"
}

So, a 4.24 MB oplog over 1.18 hours. Given that, and that we can resize if necessary, I'm going to drop it to 1GB max. If someone (Kapil?) could get stats for a real production environment, it'd be great to get some more numbers.

Ian Booth (wallyworld) on 2014-07-23
Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
milestone: 1.20.2 → 1.21-alpha1
Ian Booth (wallyworld) wrote :

>> 2. Should the presence database be replicated.
>>
>> tl;dr; Juju's implementation has evolved to the point whereby the use
>> of a presence database (and associated replication thereof) is no longer
>> an optimal approach. Andrew is investigating a solution whereby the
>> presence status is maintained in memory and communicated amongst state
>> servers via api calls.

>Distributed state is harder than it looks. I'd prefer we leverage some
>existing implementation unless we have a consensus on building something
>new.

>AIUI the goal is to replace polling of a database with a distributed
>in-memory database of "things that are being watched" so that any
>incoming or outgoing event can be checked against that. Is that the case?

Essentially that's the case. Distributed state is indeed hard. The current solution which uses mongo to persist (and replicate) essentially transient heartbeat information offers little value for what is potentially significant cost, and in a way that doesn't scale so well for large deployments (as written). Moving away from mongo to an in memory model will perform and scale better. The model that is being replicated changes very slowly - agents are expected to be in a given life cycle state for some time and changes will be infrequent (in the usual case). The way we do it now, essentially writing incoming pings to mongo, filling the oplog etc, for the purpose of recording "yes this machine is still alive just like last time" doesn't make much sense.

The work being undertaken is not for inclusion in the next 1.20 release - it's a recognition of that the fact our implementation needs to evolve and so options are being discussed. The multiple oplog writes identified in this bug is one symptom driving the need for the investigation.

Andrew Wilkins (axwalk) wrote :

> So, a 4.24 MB oplog over 1.18 hours. Given that, and that we can resize if necessary, I'm going to drop it to 1GB max. If someone (Kapil?) could get stats for a real production environment, it'd be great to get some more numbers.

John pointed me at the "peer-xplod" charm, wherein units will communicate with each other peer (so N^2) continuously, but updating relation settings. I've deployed 16 units of that charm, and it's less than doubled the oplog size. If I extrapolate to 24 hours of runtime, then it'll consume roughly 200MB of the oplog.

I'm happy to drop the size lower (512MB?), but I'd be more comfortable doing so if we had real numbers.

Andrew Wilkins (axwalk) wrote :

I've lowered the min to 512MB and max to 1024MB for now (https://github.com/juju/juju/pull/368). Will revisit if this is found to still be too large.

Changed in juju-core:
status: Triaged → In Progress
Mark Shuttleworth (sabdfl) wrote :

Thanks, agree these are appropriate thresholds. If we wanted to get
fancy, we could give users the ability to auto-tune this to ensure there
is at least n days of data in the oplog, but these steps are fine for now.

Mark

Andrew Wilkins (axwalk) on 2014-07-23
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2014-09-08
Changed in juju-core:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers