openstack-manuals

Missing HA Guide Content: Systemd (vs Cluster Managers)

Bug #1677682 reported by ianeta hutchinson on 2017-03-30

This bug affects 1 person

Affects:	openstack-manuals
Status:	Won't Fix
Importance:	Low
Assigned to:	Unassigned
Milestone:	openstack-manuals pike

Bug Description

An explanation of Systemd is missing from the HA Guide. At the moment, an explanation of cluster managers is present, but with no alternative. A comparison to communicate why a user would utilize Systemd or a cluster manager would also be beneficial.

https://github.com/openstack/openstack-manuals/blob/master/doc/ha-guide-draft/source/intro-os-ha-cluster.rst


ianeta hutchinson (iphutch) on 2017-03-30

tags:

added: ha-guide

ianeta hutchinson (iphutch) on 2017-03-30

tags:

added: ha-guide-draft
removed: ha-guide

ianeta hutchinson (iphutch) on 2017-03-30

description:	updated
summary:	- Missing HA Guide Content: Cluster Managers
	+ Missing HA Guide Content: Systemd (vs Cluster Managers)

ianeta hutchinson (iphutch) on 2017-03-30

description:	updated

Lana (loquacity) on 2017-03-31

Changed in openstack-manuals:
status:	New → Triaged
importance:	Undecided → Low
milestone:	none → pike

Adam Spiers (adam.spiers) wrote on 2017-04-06:

Briefly (for now):

http://blog.clusterlabs.org/blog/2016/next-openstack-ha-arch contends that management of OpenStack services via Pacemaker can be replaced by systemd under certain circumstances. Whilst the article makes several good points, I feel that it misses others. I have already talked to Andrew at length about this, and I think we are mostly on the same page by now.

My take is that systemd *may* be adequate for active/active services where:

1) systemd is configured to auto-restart the services on failure
2) no cross-node ordering is required
3) the services can keep functioning correctly (e.g. graceful failure) even if their dependencies go down
4) the services can recover correctly if their dependencies come back up
5) nothing more than pid-level monitoring is required
6) an external alerting / notification system is present
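
As a rough illustration of condition 1), the sketch below renders a systemd drop-in that turns on automatic restarts. `Restart=` and `RestartSec=` are standard `[Service]` directives; the helper name `render_restart_dropin` and the chosen values are assumptions made for this example only, not something taken from the HA Guide.

```python
import configparser
import io


def render_restart_dropin(restart="on-failure", restart_sec="5s"):
    """Return the text of a systemd drop-in enabling automatic restarts.

    Hypothetical helper: the default values are illustrative, not recommendations.
    """
    parser = configparser.ConfigParser()
    # Keep the CamelCase directive names; configparser lowercases keys by default.
    parser.optionxform = str
    parser["Service"] = {"Restart": restart, "RestartSec": restart_sec}
    buf = io.StringIO()
    # systemd unit files conventionally use KEY=VALUE without spaces.
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_restart_dropin())
```

The resulting text would typically be saved as a `.conf` file in the unit's `.service.d/` drop-in directory, followed by `systemctl daemon-reload`. Note that this only addresses restart on failure, not conditions 2) to 6).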

Whilst 1) should be easily satisfied, and 2) is true in some cases, especially if 3) and 4) hold, some caveats are as follows:

- 3) and 4) are all well and good in theory, but in practice I'm dubious that all OpenStack services have reached this level of robustness yet.

- Regarding 5), only doing pid-level monitoring misses some key failure cases such as a service hanging rather than crashing, or falling victim to a bug which renders it non-functional even though the process is still running. This is why I continue to believe that the openstack-resource-agents project still brings value, although as the (bad) maintainer I am of course horribly biased towards it.

- 6) can of course be satisfied, but requires a lot of additional work to ensure that such a system is deployed and configured in parallel to the services, in a way which matches the deployment of the services. It's definitely worth doing, but the effort shouldn't be underestimated. Pacemaker serves as a poor man's alternative to a real monitoring system, which is not a good long-term solution, but is useful in the short term.
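
To make the caveat about 5) concrete, here is a minimal sketch contrasting a pid-level check with a functional check. It assumes a POSIX system. The function names and the `probe` callable are hypothetical stand-ins (a real probe might issue an API request to the service); this is not how openstack-resource-agents is implemented.

```python
import os
import threading


def pid_alive(pid):
    """pid-level check: is there a process with this pid? (POSIX only)"""
    try:
        os.kill(pid, 0)  # signal 0 performs error checking only; nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def functional_check(probe, timeout):
    """Functional check: healthy only if probe() returns a truthy value in time.

    `probe` is a hypothetical callable that exercises the service.
    """
    result = []

    def run():
        try:
            result.append(bool(probe()))
        except Exception:
            result.append(False)  # a failing probe counts as unhealthy

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout)
    # An empty result means the probe is still running, i.e. the service hung.
    return bool(result) and result[0]
```

A service that hangs still passes `pid_alive`, because its process exists, but fails `functional_check`, because the probe never returns within the timeout.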

Shannon Mitchell (shannon-mitchell) wrote on 2017-04-06:

We may be comparing apples and oranges when we put systemd and Pacemaker side by side in an HA discussion. Here are just my thoughts on it.

HA in OpenStack can be achieved with the following technologies:

- Load balancing:

HA is achieved by running multiple instances of a service behind a load balancer, so if a node goes down the others keep serving requests while it is brought back online, either manually or automatically. This covers pretty much all of the API services in OpenStack, plus Horizon.

- Application Clustering:

HA is achieved via the application. This might be a Galera or RabbitMQ cluster. Swift and Ceph have their own internal clustering components. The Neutron agents can also be configured to spin up multiple DHCP interfaces or use keepalived with VRRP in the background to handle router issues.

- Queue Workers:

In OpenStack most things pass through a RabbitMQ message queue. Several OpenStack services only process requests from the queue and do nothing else. HA is usually achieved by having 2 or more workers handling messages in a queue, so if one goes down the others will keep processing messages. Things like nova-conductor, nova-scheduler, nova-cert, heat-engine, cinder-scheduler, cinder-backup and cinder-volume fall under this. (Note: cinder-volume may have issues in certain configs and may require external clustering.)

- Generic HA Clusters:

These are services that provide HA clustering for applications that can't do it for themselves and may not be able to do an active-active load balanced solution. Pacemaker and keepalived may be good examples of this. These can migrate IP addresses and other resources around to different nodes on failure. If a service doesn't fall in any of the above categories, one of these can be used to provide HA for the service. The haproxy service is a perfect example of this.

- Recovery options for the HA services mentioned above:

When using any of the HA solutions above, you normally have monitoring in place or some sort of automatic recovery. With monitoring, someone is usually notified when a service fails over and ops is required to go in and fix the issue. With automated recovery, ops is usually notified but actions are taken to automatically recover the service. This is where systemd/upstart comes in. Many of the OpenStack services should be safe to set up to respawn on error as long as it's monitored and someone knows it's happening. For services handled under Pacemaker, the recovery is automatic due to the way Pacemaker handles migrating the services.

So I think systemd, upstart and Pacemaker should be mentioned under a section going over monitoring and recovery methods, mainly as different methods for automatic recovery.
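
The queue-worker category above can be sketched with an in-process queue standing in for RabbitMQ. Everything here (`run_workers`, its parameters, the thread-based workers) is a hypothetical toy model, not OpenStack code. It also assumes the failing worker stops between messages; a real broker would additionally need to redeliver any message that was taken but never acknowledged.

```python
import queue
import threading


def run_workers(messages, worker_count=3, failing_worker=0, fail_after=1):
    """Drain a shared queue with competing workers, one of which goes down early.

    Returns a dict mapping each processed message to the id of the worker
    that handled it.
    """
    pending = queue.Queue()
    for message in messages:
        pending.put(message)

    processed = {}
    lock = threading.Lock()

    def worker(worker_id):
        handled = 0
        while True:
            if worker_id == failing_worker and handled >= fail_after:
                return  # simulate this worker going down
            try:
                message = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                processed[message] = worker_id
            handled += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed
```

Even though worker 0 stops after at most one message, the remaining workers drain the queue, which is the property that makes respawn-on-error a recovery detail rather than the HA mechanism itself for this class of service.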

Frank Kloeker (f-kloeker) wrote on 2019-05-07:

tracked here: https://storyboard.openstack.org/#!/story/2005590

Changed in openstack-manuals:
status:	Triaged → Won't Fix
