Bug #878114 “launch multiple bootstrap nodes” : Bugs : pyjuju

William Reade (fwereade) on 2011-10-19

Changed in juju:
status:	New → In Progress
assignee:	nobody → William Reade (fwereade)

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2011-10-19: Re: [Bug 878114] [NEW] launch multiple bootstrap nodes

#1

Excerpts from William Reade's message of Wed Oct 19 11:43:07 UTC 2011:
> Public bug reported:
>
> As progress towards lp:803042, we should have an environment parameter
> that controls how many [er: "master", or "bootstrap", or whatever we
> call them] nodes are launched when we bootstrap.
>
> To avoid the work for this bug ballooning out of control, I propose that
> this should only be required to actually work properly in the single-
> node case; there are several things we need to handle before we have a
> fully-working solution, including:
>
> * making sure all the zookeepers know about all the other zookeepers
> * making sure provisioning agents work well together
> * figuring out what to do when we lose a master node
> * probably more
>
> These bullet points are heavily reminiscent of what juju itself is meant
> to do, but thinking too much about how we'd make juju deploy itself is
> giving me a headache. Definitely worthy of further thought/discussion...

Hi william this would be pretty nice, but we should probably talk about this. We
need to refactor bootstrap to launch zk and the provisioning agent as a juju
service, so the management of the zk is the same as any other service.

Revision history for this message

William Reade (fwereade) wrote on 2011-10-19:

#2

Download full text (5.7 KiB)

OK, here's some further thought. I'm suddenly uncertain whether any of this is a good idea or not; it's just my first approaching-serious attempt at thinking about the intersection of juju, zookeeper, and high availability. Comments appreciated.

First of all: launching a bunch of individual machines from the initial client is problematic, because we don't necessarily know instance-id, dns-name, etc. until we've actually launched the instances, but we need to know all the participating machines in order to run a ZK ensemble (and to communicate the instances to all the potential clients).

We could, I suppose, just launch all the machines and subsequently fiddle around with the ZK config on each of them, restarting ZK as necessary, but it strikes me as rather an ugly way to do things... especially since we have a whole pile of code for managing just that sort of interaction.

Now, just after I joined, Gustavo talked about making juju self-hosting, and I think it's time to consider that possibility seriously. Consider the following a tentative proposal, that I fully expect to get shredded by those wiser than me... but that may eventually evolve into a reasonable plan ;).

(preparation)

* [tweak the initialize command so it checks for /initialized and stops immediately if it finds it]
* [If a PA can't currently tell which MA it's colocated with, add some code/state that allows us to find out]

(bootstrap)

* Launch an instance in "master" mode, exactly as we currently do (preconfigured ZK; run initialize, MA, PA as usual).
* When the PA starts, it should automatically deploy a zookeeper half-charm (which I'll call a ZHC) onto the local MA.
** Note: the ZHC would not do anything to actually install zookeeper -- it'd assume it was already running -- but it would know how to deal with its relations changing (by rewriting the config file and bouncing ZK, perhaps? I understand there's stuff to handle this planned for 3.5.0 but that doesn't help us now).

As I understand it, this would give us exactly what we already have, and not break anything (nobody's going to be messing with zookeeper yet). It's the next bit that involves complexity and possible breakage.

(hardening)

* The PA, when it started, also grabbed a lock, which allows it to consider itself the provisioning agent provisioning agent (note PAPA acronym; you'd think it was planned this way from the start). Holding this lock obligates it to keep track of the number of "master" instances, and launch new ones in "master" mode until the right number are available.

(hm, that sounds suspiciously simple, what will actually happen?)

* PA launches a new instance in "master" mode, which imagines the one and only zookeeper to be the one running on the PA that launched it.
* The new instance comes up:
** initialize connects to ZK, finds everything's already sorted, exits happy;
** MA starts up, connects, isn't asked to do anything, is happy;
** PA starts up, connects, and deploys ZHK onto the local MA;
** huge rocks fall from the sky and kill everyone.

(er, wait)

* OK. What the PA needs to do is essentially either deploy or add-unit, depending on whether or not units already exist.
* Each unit ru...

OK, here's some further thought. I'm suddenly uncertain whether any of this is a good idea or not; it's just my first approaching-serious attempt at thinking about the intersection of juju, zookeeper, and high availability. Comments appreciated.

First of all: launching a bunch of individual machines from the initial client is problematic, because we don't necessarily know instance-id, dns-name, etc. until we've actually launched the instances, but we need to know all the participating machines in order to run a ZK ensemble (and to communicate the instances to all the potential clients).

We could, I suppose, just launch all the machines and subsequently fiddle around with the ZK config on each of them, restarting ZK as necessary, but it strikes me as rather an ugly way to do things... especially since we have a whole pile of code for managing just that sort of interaction.

Now, just after I joined, Gustavo talked about making juju self-hosting, and I think it's time to consider that possibility seriously. Consider the following a tentative proposal, that I fully expect to get shredded by those wiser than me... but that may eventually evolve into a reasonable plan ;).

(preparation)

* [tweak the initialize command so it checks for /initialized and stops immediately if it finds it]
* [If a PA can't currently tell which MA it's colocated with, add some code/state that allows us to find out]

(bootstrap)

* Launch an instance in "master" mode, exactly as we currently do (preconfigured ZK; run initialize, MA, PA as usual).
* When the PA starts, it should automatically deploy a zookeeper half-charm (which I'll call a ZHC) onto the local MA.
** Note: the ZHC would not do anything to actually install zookeeper -- it'd assume it was already running -- but it would know how to deal with its relations changing (by rewriting the config file and bouncing ZK, perhaps? I understand there's stuff to handle this planned for 3.5.0 but that doesn't help us now).

As I understand it, this would give us exactly what we already have, and not break anything (nobody's going to be messing with zookeeper yet). It's the next bit that involves complexity and possible breakage.

(hardening)

* The PA, when it started, also grabbed a lock, which allows it to consider itself the provisioning agent provisioning agent (note PAPA acronym; you'd think it was planned this way from the start). Holding this lock obligates it to keep track of the number of "master" instances, and launch new ones in "master" mode until the right number are available.

(hm, that sounds suspiciously simple, what will actually happen?)

* PA launches a new instance in "master" mode, which imagines the one and only zookeeper to be the one running on the PA that launched it.
* The new instance comes up:
** initialize connects to ZK, finds everything's already sorted, exits happy;
** MA starts up, connects, isn't asked to do anything, is happy;
** PA starts up, connects, and deploys ZHK onto the local MA;
** huge rocks fall from the sky and kill everyone.

(er, wait)

* OK. What the PA needs to do is essentially either deploy or add-unit, depending on whether or not units already exist.
* Each unit runs the appropriate ourobouros-relation-joined/changed[0] hook, which reconfigures and bounces the local zookeeper.
* Connections are lost all over the place. One might think that this would seriously inconvenience every component of juju... BUT, as I understand it, every component of juju needs to be able to handle random ZK disconnections at any time, and to be able to get itself back on track, so I don't believe it's an exceptionally high price to pay.
* Now, all clients will still be using their original list of zookeepers, which is now wrong, but this doesn't feel like a fundamentally unfixable state of affairs. Possibly whatever PA happens to grab the PAPA lock can keep a watch on the relation, and update another node (that everyone else watches) with the list of zookeeper addresses; assuming all agents remember the list they were started with, and watch that node for changes, they should be able to restart their connections when they realise their ZK list is out of date.

(potential problems)

* I haven't thought about how multiple provisioning agents will need to interact, beyond my vague idea that (1) we need them and (2) we don't want more than one PA taking it on itself to launch more PAs.
* I'm not quite sure how ZK will actually react to all this tomfoolery[1], but I haven't found anything to directly contradict the possibility. Others may be able to blow this whole idea out of the water with two or three well-chosen words.
* The stuff described above could reasonably be considered to be abuse of juju -- especially the HZK, which is rather a perversion of the concept of a charm. I guess we could set everything up with an independent zookeeper, start a distinct zookeeper from a charm, and do a little dance to move everything across, but I don't see that gains us much -- we still need multiple PAs for high availability, and the idea of making a PA a charm itself makes my head hurt even more.
* I just bet there are more...

Opinions?

[0] ourobouros strikes me as a good name, but whatever ;)
[1] most of all, it assumes that everything is tolerably resilient to sudden forcible loss of zookeeper, and that needing-a-majority-to-do-anything is not practically a big deal (I'm imagining people will set up the number of servers they need *before* they actually need them...). If we need to maintain "normal" service throughout, we could try to do as described in [2] instead, but still: what we *really* want is [3].
[2] http://www.thesocialdeveloper.com/2011/07/21/rolling-restart-in-apache-zookeeper-to-dynamically-add-servers-to-the-ensemble/
[3] https://issues.apache.org/jira/browse/ZOOKEEPER-107

Revision history for this message

William Reade (fwereade) wrote on 2011-10-19:

#3

Addendum: the more I think about it, the more I think the "ZHC" and the PA should be real, proper, no-fooling charms, and that any slight weirdness in bootstrapping is a small price to pay.

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2011-10-19: Re: [Bug 878114] Re: launch multiple bootstrap nodes

#4

Excerpts from William Reade's message of Wed Oct 19 21:30:32 UTC 2011:
> Addendum: the more I think about it, the more I think the "ZHC" and the
> PA should be real, proper, no-fooling charms, and that any slight
> weirdness in bootstrapping is a small price to pay.
>

+1 on the addendum, still digesting the main course ;-)

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2011-10-19:

#5

Download full text (8.7 KiB)

Excerpts from William Reade's message of Wed Oct 19 17:18:17 UTC 2011:
> OK, here's some further thought. I'm suddenly uncertain whether any of
> this is a good idea or not; it's just my first approaching-serious
> attempt at thinking about the intersection of juju, zookeeper, and high
> availability. Comments appreciated.
>
> First of all: launching a bunch of individual machines from the initial
> client is problematic, because we don't necessarily know instance-id,
> dns-name, etc. until we've actually launched the instances, but we need
> to know all the participating machines in order to run a ZK ensemble
> (and to communicate the instances to all the potential clients).
>
> We could, I suppose, just launch all the machines and subsequently
> fiddle around with the ZK config on each of them, restarting ZK as
> necessary, but it strikes me as rather an ugly way to do things...
> especially since we have a whole pile of code for managing just that
> sort of interaction.

Re multi-launch, its possible to distinguish the different machines in a
multi-machine bootstrap (on ec2) via metadata namespaceing, its probably not
portable. while for looking at allocating additional capacity from bootstrap,
the best way i found was to just modify initialize to add additional machine
states.

I'm thinking this might be better off initially as a specification pushed for
review, just so the discussion is captured in a documentation artifact.

>
> Now, just after I joined, Gustavo talked about making juju self-hosting,
> and I think it's time to consider that possibility seriously. Consider
> the following a tentative proposal, that I fully expect to get shredded
> by those wiser than me... but that may eventually evolve into a
> reasonable plan ;).
>
> (preparation)
>
> * [tweak the initialize command so it checks for /initialized and stops immediately if it finds it]

That's fine, but ideally we're not even executing it multiple times. The
perspective isn't master nodes vs. normal nodes. its just a bootstrap node, vs
additional nodes. The services on the bootstrap node are just normal services
from both a management and code perspective.

> * [If a PA can't currently tell which MA it's colocated with, add some code/state that allows us to find out]
>

the pa would get subsumed into a service consisting of itself and the machine
agent. the pa needs additional work to deal with concurrency, preferrably in the
form of fine grain locks taken out as needed on the specific machines being
acted on, else the work is parallel between machines.

> (bootstrap)
>
> * Launch an instance in "master" mode, exactly as we currently do (preconfigured ZK; run initialize, MA, PA as usual).
> * When the PA starts, it should automatically deploy a zookeeper half-charm (which I'll call a ZHC) onto the local MA.
> ** Note: the ZHC would not do anything to actually install zookeeper -- it'd assume it was already running -- but it would know how to deal with its relations changing (by rewriting the config file and bouncing ZK, perhaps? I understand there's stuff to handle this planned for 3.5.0 but that doesn't help us now).
>
> As I understand it, this would...

Excerpts from William Reade's message of Wed Oct 19 17:18:17 UTC 2011:
> OK, here's some further thought. I'm suddenly uncertain whether any of
> this is a good idea or not; it's just my first approaching-serious
> attempt at thinking about the intersection of juju, zookeeper, and high
> availability. Comments appreciated.
> 
> First of all: launching a bunch of individual machines from the initial
> client is problematic, because we don't necessarily know instance-id,
> dns-name, etc. until we've actually launched the instances, but we need
> to know all the participating machines in order to run a ZK ensemble
> (and to communicate the instances to all the potential clients).
> 
> We could, I suppose, just launch all the machines and subsequently
> fiddle around with the ZK config on each of them, restarting ZK as
> necessary, but it strikes me as rather an ugly way to do things...
> especially since we have a whole pile of code for managing just that
> sort of interaction.

Re multi-launch, its possible to distinguish the different machines in a 
multi-machine bootstrap (on ec2) via metadata namespaceing, its probably not 
portable. while for looking at allocating additional capacity from bootstrap, 
the best way i found was to just modify initialize to add additional machine 
states.

I'm thinking this might be better off initially as a specification pushed for 
review, just so the discussion is captured in a documentation artifact.

> 
> Now, just after I joined, Gustavo talked about making juju self-hosting,
> and I think it's time to consider that possibility seriously. Consider
> the following a tentative proposal, that I fully expect to get shredded
> by those wiser than me... but that may eventually evolve into a
> reasonable plan ;).
> 
> (preparation)
> 
> * [tweak the initialize command so it checks for /initialized and stops immediately if it finds it]

That's fine, but ideally we're not even executing it multiple times. The 
perspective isn't master nodes vs. normal nodes. its just a bootstrap node, vs 
additional nodes. The services on the bootstrap node are just normal services 
from both a management and code perspective.

> * [If a PA can't currently tell which MA it's colocated with, add some code/state that allows us to find out]
>

the pa would get subsumed into a service consisting of itself and the machine 
agent. the pa needs additional work to deal with concurrency, preferrably in the 
form of fine grain locks taken out as needed on the specific machines being 
acted on, else the work is parallel between machines.

> (bootstrap)
> 
> * Launch an instance in "master" mode, exactly as we currently do (preconfigured ZK; run initialize, MA, PA as usual).
> * When the PA starts, it should automatically deploy a zookeeper half-charm (which I'll call a ZHC) onto the local MA.
> ** Note: the ZHC would not do anything to actually install zookeeper -- it'd assume it was already running -- but it would know how to deal with its relations changing (by rewriting the config file and bouncing ZK, perhaps? I understand there's stuff to handle this planned for 3.5.0 but that doesn't help us now).
> 
> As I understand it, this would give us exactly what we already have, and
> not break anything (nobody's going to be messing with zookeeper yet).
> It's the next bit that involves complexity and possible breakage.

So following up on my own thoughts and your addendum, the pa and zk are just a 
service that gets initialized into the tree via 'initialize' and we start up the 
unit agent for it (prior to the machine agent). we setup the unit states to 
reflect the running status for this initial bootstrap node only, additional 
units added can go through the full charm install. So effectively this unit is 
already in place on bootstrap. This approach can be extended to deploy multiple 
machines in a zk ensemble at bootstrap time.

the pa doesn't do any charm deployments/assignments.

> 
> (hardening)
> 
> * The PA, when it started, also grabbed a lock, which allows it to
> consider itself the provisioning agent provisioning agent (note PAPA
> acronym; you'd think it was planned this way from the start). Holding
> this lock obligates it to keep track of the number of "master"
> instances, and launch new ones in "master" mode until the right number
> are available.

PAs should take only fine grained locks on things they need to do, not master 
locks/leader election. There are no master nodes, just the juju internal 
service that gets deployed in similiar fashion to other services.

separately for hardening we need to start moving to lxc everywhere, thats 
somewhat dependent on redoing network security, i'll have a proposal out for 
that in the next day or so.

> 
> (hm, that sounds suspiciously simple, what will actually happen?)
> 
> * PA launches a new instance in "master" mode, which imagines the one and only zookeeper to be the one running on the PA that launched it.
> * The new instance comes up:
> ** initialize connects to ZK, finds everything's already sorted, exits happy;
> ** MA starts up, connects, isn't asked to do anything, is happy;
> ** PA starts up, connects, and deploys ZHK onto the local MA;
> ** huge rocks fall from the sky and kill everyone.
>

it seems like your subsuming a bunch of responsibilities onto the pa by thinking 
of master nodes, vs. just services here.

the pa shouldn't be doing any deploying.

> (er, wait)
> 
> * OK. What the PA needs to do is essentially either deploy or add-unit, depending on whether or not units already exist.
> * Each unit runs the appropriate ourobouros-relation-joined/changed[0] hook, which reconfigures and bounces the local zookeeper.
> * Connections are lost all over the place. One might think that this would seriously inconvenience every component of juju... BUT, as I understand it, every component of juju needs to be able to handle random ZK disconnections at any time, and to be able to get itself back on track, so I don't believe it's an exceptionally high price to pay.
> * Now, all clients will still be using their original list of zookeepers, which is now wrong, but this doesn't feel like a fundamentally unfixable state of affairs. Possibly whatever PA happens to grab the PAPA lock can keep a watch on the relation, and update another node (that everyone else watches) with the list of zookeeper addresses; assuming all agents remember the list they were started with, and watch that node for changes, they should be able to restart their connections when they realise their ZK list is out of date.
>

regarding clients and zk, they'll rotate automatically till they find one that 
works, and will poll if they exhaust that. simply taking a server up and down 
doesn't kill a session if its within the session timeout.

the initial clients on the boostrap node are fine, they'll just wait till its 
up, the next ones are fine.

> (potential problems)
> 
> * I haven't thought about how multiple provisioning agents will need to interact, beyond my vague idea that (1) we need them and (2) we don't want more than one PA taking it on itself to launch more PAs.
> * I'm not quite sure how ZK will actually react to all this tomfoolery[1], but I haven't found anything to directly contradict the possibility. Others may be able to blow this whole idea out of the water with two or three well-chosen words.
> * The stuff described above could reasonably be considered to be abuse of juju -- especially the HZK, which is rather a perversion of the concept of a charm. I guess we could set everything up with an independent zookeeper, start a distinct zookeeper from a charm, and do a little dance to move everything across, but I don't see that gains us much -- we still need multiple PAs for high availability, and the idea of making a PA a charm itself makes my head hurt even more.
> * I just bet there are more...
> 
> Opinions?

inline above.

the other question that hasn't been addressed here is communicating the new 
zookeepers to existing clients.

the pa is also going to need to take responsibility for subscribing to the units 
of this service and updating the state file as needed.

probably good to take this into a specification for review.

cheers,

Kapil

> 
> [0] ourobouros strikes me as a good name, but whatever ;)
> [1] most of all, it assumes that everything is tolerably resilient to sudden forcible loss of zookeeper, and that needing-a-majority-to-do-anything is not practically a big deal (I'm imagining people will set up the number of servers they need *before* they actually need them...). If we need to maintain "normal" service throughout, we could try to do as described in [2] instead, but still: what we *really* want is [3].
> [2] http://www.thesocialdeveloper.com/2011/07/21/rolling-restart-in-apache-zookeeper-to-dynamically-add-servers-to-the-ensemble/
> [3] https://issues.apache.org/jira/browse/ZOOKEEPER-107
>

Kapil Thangavelu (hazmat) on 2012-01-27

Changed in juju:
status:	In Progress → Confirmed
assignee:	William Reade (fwereade) → nobody

Curtis Hovey (sinzui) on 2013-10-12

Changed in juju:
importance:	Undecided → Low
status:	Confirmed → Triaged

pyjuju

launch multiple bootstrap nodes

Bug Description

Other bug subscribers

Remote bug watches