juju-core

Bug #1344940
Comment #16

Ian Booth (wallyworld) wrote on 2014-07-23:

>> 2. Should the presence database be replicated.
>>
>> tl;dr; Juju's implementation has evolved to the point whereby the use
>> of a presence database (and associated replication thereof) is no longer
>> an optimal approach. Andrew is investigating a solution whereby the
>> presence status is maintained in memory and communicated amongst state
>> servers via api calls.

>Distributed state is harder than it looks. I'd prefer we leverage some
>existing implementation unless we have a consensus on building something
>new.

>AIUI the goal is to replace polling of a database with a distributed
>in-memory database of "things that are being watched" so that any
>incoming or outgoing event can be checked against that. Is that the case?

Essentially that's the case, and distributed state is indeed hard. The current solution uses mongo to persist (and replicate) what is essentially transient heartbeat information. That offers little value for a potentially significant cost, and as written it doesn't scale well for large deployments. Moving away from mongo to an in-memory model will perform and scale better. The model being replicated changes very slowly: agents are expected to stay in a given life cycle state for some time, so changes will be infrequent in the usual case. What we do now, essentially writing every incoming ping to mongo and filling the oplog, just to record "yes, this machine is still alive, just like last time", doesn't make much sense.

The work being undertaken is not for inclusion in the next 1.20 release. It's a recognition of the fact that our implementation needs to evolve, so options are being discussed. The multiple oplog writes identified in this bug are one symptom driving the need for the investigation.