Create minimal architecture example for HA

Bug #1645339 reported by Alexandra Settle on 2016-11-28
Affects: openstack-manuals
Importance: Low
Assigned to: Adam Spiers

Bug Description

Provide a minimal architecture example for HA, expanded on that given in the *Environment* section of http://docs.openstack.org/project-install-guide/newton (depending on the distribution) for easy comparison.

Would be worth discussing with the Arch guide team.

Link: https://docs.openstack.org/draft/ha-guide-draft/ref-arch-examples.html

This is a TODO that was moved out of the HA guide for more visibility.

tags: added: ha-guide
Ben Silverman (tersian) wrote :

What kind of minimal architecture? Would this be a traditional monolithic 3 control node/2 compute node minimum with pacemaker/keepalived or would this be a newer "shared nothing" architecture? I recommend sticking with the monolithic design that most distros are still using.

Ben - I've added Adam and Andrew onto this conversation. Hopefully they will be able to help you out! :)

tags: added: arch-guide
Adam Spiers (adam.spiers) wrote :

I'll be happy to help out with this - can we discuss in Atlanta? I should have plenty of time for working on the docs whilst I'm there, but limited time before then.

Perfect, thank you Adam! :)

Changed in openstack-manuals:
assignee: nobody → Adam Spiers (adam.spiers)
Darren Chan (dazzachan) on 2017-02-06
Changed in openstack-manuals:
milestone: none → pike
Andrew Beekhof (1nbr3w) wrote :

I don't think "monolithic" accurately reflects either the SUSE or Red Hat architectures.

Ben Silverman (tersian) wrote :

I'd like to discuss this at the PTG, but with regard to monolithic vs. shared-nothing architecture for the control plane, I'd separate it this way.

1. Does any part of your control plane run on dedicated hosts alongside other control plane services? Do they use common clustering/replication technologies (e.g. pacemaker/corosync/galera) and have a static load-balancing configuration? Is your infrastructure part of your HA configuration? Is infrastructure availability a factor in how many nodes you use for your control plane? Do you need traditional HA tools to manage the control plane?

If so, you might be monolithic

2. Can all of your services, both stateful and stateless, be placed and scaled independently across the underlying infrastructure without regard for the other services running on that architecture? Do your control plane services have a configuration that looks very much like a 12-factor application? Do you use container technologies to isolate your control plane functions into services and manage them with a separate application (Kubernetes, Magnum, etc.)?

If so, you may be a "shared nothing" architecture.

Of course, these are broad strokes, but I wanted to be more clear on my own internal definitions I'm using here.

My request above was to pick a commonly used HA reference architecture for the OpenStack control plane and document it as a starting point for consumers. Over the years, many distributions have had opinionated versions of HA, but they all have common threads. I think, for those looking for guidance in the HA guide, having a simple and straightforward HA sample that could be configured would be a great starting point for education and a jumping-off point for future documentation and discussion.

Andrew Beekhof (1nbr3w) wrote :

I think you're making false distinctions.

"Can all of your services, both stateful and stateless be placed and scaled independently across underlying infrastructure without regard for other services running on that architecture." is not in any way excluded by the use of "common clustering/replication technologies" or "traditional HA tools".

Ben Silverman (tersian) wrote :

Andrew,

I don't want to get into a semantics debate on a bug thread; monolithic and "shared nothing" are terms I've used to try to get my point across, so please feel free to use whatever terms you prefer.

All I'm outlining is the difference between the legacy, or what I call the legacy monolithic control plane architectures of distributions like Mirantis, Red Hat, Ubuntu, Cisco, Helion vs. what is being seen as the new "shared nothing" architectures. Since I'm not aware of any of the "shared nothing" architectures being part of a major distribution yet, I have to lump together some common traits that have previously defined both monolithic and what I am seeing in the new shared nothing architecture.

My list wasn't meant to be exclusive, more of a list of potential characteristics that, in partial or totality _may_ equal one of the types of architecture.

For many years, the whole HA control plane existed on a minimum of 3 bare-metal servers that would run all, or many, of the control plane services that were not running on the compute hosts. Sometimes they would even double as network or storage nodes. The underlying clustering and replication tools required quorum, hence the recommendation of 3 as the starting point. This is where monolithic began, IMHO. Other distributions began to place services in virtualized environments on the control plane servers; they remained on the same bare metal, but within their own namespaces.
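The quorum point Ben makes is worth spelling out: majority-vote clustering tools (corosync, galera) need a strict majority of member nodes alive to keep operating, so 3 is the smallest cluster size that survives a single node failure. A quick illustrative sketch (my own, not from the thread):

```python
# Quorum for an n-node cluster: a strict majority of member votes.
def quorum(n: int) -> int:
    return n // 2 + 1

# How many nodes can fail while the cluster still has quorum.
def tolerated_failures(n: int) -> int:
    return n - quorum(n)

for n in (2, 3, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

A 2-node cluster tolerates zero failures (losing either node loses quorum), while 3 nodes tolerate one, which is exactly why 3 shows up as the usual minimum.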

Once containerized and refactored into discrete services, the control plane would take a different shape: hopefully horizontally scalable on demand and managed by a central tool.

All I want is a basic HA architecture defined in the manuals, followed by a discussion of what needs HA, why, and how operators have achieved it in a few examples (but I'll settle for a single example :)). From there, different components can be added, like metadata replication, glance replication, etc.

Andrew Beekhof (1nbr3w) wrote :

I support a move to both containers and kubernetes, but your terms are more than just semantically wrong.

Traditional cluster technologies can be containerised, can manage containerised services, can run on more or fewer than 3 nodes, and don't require a homogeneous distribution of services[0]; and kubernetes-based approaches rely on shared storage *cough* database *cough* neutron *cough*.

In any case, we should document the "traditional reliable HA" architecture that works today, rather than the "optimistic HA" (terms don't matter right?) ones that are still evolving[1].

[0] Whether they should, or are the best tool for the job is a different question.
[1] https://github.com/kubernetes/kubernetes/pull/34160

Ben Silverman (tersian) wrote :

Andrew,

I said the "best practices" from the major distros were to run 3 nodes; I've seen it run with 2 and with 16 nodes :) Kubernetes is a tough nut to crack, and OpenStack ON Kubernetes is even tougher. As I said, I'm not aware of a distro that's actually supporting it, but there are a few that are working in that direction.

I agree we should document what is working today, what we would recommend and what is possible for users to set up today.

Let's continue this in a couple weeks at the PTG.

Ben Silverman (tersian) wrote :

sorry, not "best practices" but "recommendation."

Adam Spiers (adam.spiers) wrote :

> [recommendation] from the major distros were to run 3 nodes

Actually that's not quite accurate; at least SUSE currently *recommends* running 3 separate clusters with a minimum of 2 nodes per cluster:

https://www.suse.com/documentation/suse-openstack-cloud-6/book_cloud_deploy/data/sec_depl_req_ha.html#sec_depl_reg_ha_control_spof

(although we support anything from a single 2-node cluster up to potentially 10 or more clusters of anything up to 31 nodes each)

Nevertheless, current standard practices probably usually involve clusters with 3-5 nodes. I have no problem with documenting a reasonable setup which is typically used today, and I am happy to help out with this in Atlanta.

I expect recommendations will start to change in the next 6-12 months if they haven't already. Andrew has had some very interesting ideas (perhaps more than just ideas by now) which involve controlling control plane nodes via pacemaker_remote. This allows greater flexibility and scalability. Again, I think I know enough of the details to be able to help document this if we get that far in Atlanta. Looking forward to meeting you there!

Adam Spiers (adam.spiers) wrote :

Of course there's nothing wrong with 3 nodes as a starting point in many contexts. It's highly contextual though; as always with HA, the devil is in the details, so I'm not sure how much value there is in trying to make generalizations.

Nevertheless I agree with the goal of picking a commonly used, minimal reference architecture for HA of the OpenStack control plane, and documenting it as a starting point for consumers. If some distros don't easily support starting with 2 nodes, or have other reasons for needing to avoid that, I'm fine with accommodating them and picking 3 nodes as the size of the minimal reference architecture to document.

I would think 3 nodes would be a good starting point for a default config, but it shouldn't be set in stone. 2 would be enough for general HA, but some of the cluster software recommends at least 3. Your environment will evolve as it grows; this might mean adding additional controllers as you add more API servers for load balancing. We have had 7 or more nova API servers and 10 or more glance servers handling requests in larger environments. As long as monitoring is in place, you should see this coming and be able to plan accordingly. If one of your services starts to compete for resources on the existing controller cluster nodes, you can move its containers off to their own controller node cluster instead of adding a new node to the existing cluster.

You might also want to think about cells as well, since last I heard they were going to be the default. This might mean a separate 'cells' controller cluster that houses rabbit, galera, nova-cells, nova-scheduler and nova-conductor containers for each cell added to the environment.

To add to that, most admins have not had a lot of experience with pacemaker or Red Hat clustering. This might cause reservations about initial POC deployments, as it can get complicated quickly. It's also more difficult to maintain: I have seen it used in production on MySQL master/slave configs in an OpenStack environment where neither the admins nor the developers knew enough about it to manage it properly, which led to many broken clusters that just ended up staying that way to avoid impacting SLAs. You may have a few techs with experience, but by the time someone brings them in, things are already broken.

It might be nice to have a simple haproxy setup (with keepalived), along with any software-specific clustering in place (galera, rabbit, ...), to lower the barrier to entry. This may be preferable to a lot of customers, as they still treat servers like pets and fencing off a compute node is undesirable.
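As an illustration of the kind of setup described above, a minimal haproxy frontend for one API service might look like the fragment below. The addresses, ports and backend names are placeholders I've made up, not from this thread; keepalived would float the bind VIP between the haproxy hosts:

```
# /etc/haproxy/haproxy.cfg (fragment) -- illustrative sketch only
listen keystone_public
    bind 10.0.0.100:5000              # VIP held by keepalived
    balance roundrobin
    option httpchk GET /v3            # basic HTTP health check
    server ctrl1 10.0.0.11:5000 check inter 2000 rise 2 fall 5
    server ctrl2 10.0.0.12:5000 check inter 2000 rise 2 fall 5
    server ctrl3 10.0.0.13:5000 check inter 2000 rise 2 fall 5
```

The same pattern repeats per API endpoint; the point is that haproxy plus keepalived gives a working active/standby load-balancing layer without pulling in a full pacemaker stack.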

It would also be nice to have the pacemaker options available in the documentation as well, for more experienced users who want the extra automation available in those solutions. Give them an option and the pros and cons of each.

Darren Chan (dazzachan) on 2017-02-13
tags: removed: arch-guide
Darren Chan (dazzachan) on 2017-04-07
tags: added: ha-guide-draft
removed: ha-guide
description: updated
Changed in openstack-manuals:
importance: Wishlist → Low
Frank Kloeker (f-kloeker) wrote :

Adam, any progress on this bug? Would love to close it here or renew on the story board.

Changed in openstack-manuals:
status: Confirmed → In Progress
Ben Silverman (tersian) wrote :

Seeing how there's no arch-guide team anymore, the TODO should probably be re-written :) I'm all that's left of that team.

Adam Spiers (adam.spiers) wrote :

I think this should be moved to StoryBoard, as per the [Transfer old ha-guide bugs from Launchpad to StoryBoard](https://storyboard.openstack.org/#!/story/2005594) story I just submitted.

Adam Spiers (adam.spiers) wrote :

Doh, no Markdown on Launchpad. Yet another reason to move to Storyboard!

Frank Kloeker (f-kloeker) wrote :

Yes, new bugs for the HA Guide are already routed to Storyboard. Due to the retirement of the Docs Project Team, I've started to clean up all open bugs here, to solve or migrate them.
