Comment 2 for bug 1677682

Shannon Mitchell (shannon-mitchell) wrote :

We may be comparing apples to oranges with systemd and pacemaker when talking about HA. Here are my thoughts on it.

HA in openstack can be achieved with the following technologies:

  - Load balancing:

        HA is achieved by having multiple instances of a service behind a load balancer, so if a node goes down the others keep serving requests while it is brought back online, either manually or automatically. This covers pretty much all of the api services in openstack, plus horizon.

  - Application Clustering:

        HA is achieved via the application itself. This might be a galera or rabbit cluster. Swift and ceph have their own internal clustering components. The neutron agents can also be configured to spin up multiple dhcp interfaces, or to use keepalived with VRRP in the background to handle router failures.

  - Queue Workers:

        In openstack most things pass through a rabbitmq message queue. Several openstack services only process requests from the queue and do nothing else. HA is usually achieved by having 2 or more workers handling messages in a queue, so if one goes down the others will keep processing messages. Things like nova-conductor, nova-scheduler, nova-cert, heat-engine, cinder-scheduler, cinder-backup and cinder-volume fall under this. (Note: cinder-volume may have issues in certain configs and may require external clustering.)

  - Generic HA Clusters:

       These are services that provide HA clustering for applications that can't do it for themselves and may not be able to run as an active-active load balanced solution. Pacemaker and keepalived are good examples of this. They can migrate ip addresses and other resources to different nodes on failure. If a service doesn't fall into any of the above categories, one of these can be used to provide HA for it. The haproxy service is a perfect example of this.
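To make the queue worker case above a bit more concrete, here is a rough sketch in plain Python. It is only an illustration: it uses an in-process `queue.Queue` as a stand-in for rabbitmq, and the names `worker`, `run_workers` and `fail_after` are made up for this example, not taken from any openstack service. The point is just that when one worker stops consuming, the remaining worker still drains the queue.

```python
import queue
import threading


def worker(name, jobs, done, lock, fail_after=None):
    """Consume messages from the shared queue until it is empty.

    fail_after (hypothetical knob) simulates the worker dying after it
    has handled that many messages.
    """
    handled = 0
    while True:
        if fail_after is not None and handled >= fail_after:
            return  # simulated crash: this worker stops consuming
        try:
            msg = jobs.get_nowait()
        except queue.Empty:
            return  # nothing left to process
        with lock:
            done.append((name, msg))
        handled += 1


def run_workers(messages, fail_after_by_worker):
    """Start one thread per worker and return the (worker, message) pairs handled."""
    jobs = queue.Queue()
    for msg in messages:
        jobs.put(msg)
    done = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(name, jobs, done, lock, fail_after))
        for name, fail_after in fail_after_by_worker.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    # worker-1 "dies" after one message; worker-2 keeps going.
    handled = run_workers(list(range(10)), {"worker-1": 1, "worker-2": None})
    print(sorted(msg for _, msg in handled))
```

Every message still gets handled even though worker-1 drops out early, which is the whole HA story for this category. A real deployment would also need the broker to redeliver any message a worker was holding when it died, which this sketch does not model.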

  - Recovery options for HA services mentioned above.

When using any of the HA solutions above, you normally have monitoring in place or some sort of automatic recovery. With monitoring, someone is usually notified when a service fails over, and ops is required to go in and fix the issue. With automated recovery, ops is usually still notified, but actions are taken to automatically recover the service. This is where systemd/upstart comes in. Many of the openstack services should be safe to set up to respawn on error, as long as it's monitored and someone knows it's happening. For services handled under pacemaker, the recovery is automatic due to the way pacemaker handles migrating the services.

So I think systemd, upstart and pacemaker should be mentioned under a section going over monitoring and recovery methods, mainly as different methods for automatic recovery.
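As a rough illustration of "respawn on error, but make sure someone knows", here is a small Python sketch of the idea. This is not how systemd or upstart are implemented; in practice you would use systemd's `Restart=on-failure` or upstart's `respawn` stanza rather than your own loop. The `supervise` function and its `max_restarts`, `delay` and `notify` parameters are hypothetical names used only for this example.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay=0.0, notify=print):
    """Run cmd and restart it whenever it exits non-zero.

    Every failure is passed to notify() so ops hears about it.
    Returns the number of restarts once cmd exits cleanly, and raises
    RuntimeError when the restart limit is exhausted.
    """
    restarts = 0
    while True:
        rc = subprocess.call(cmd)
        if rc == 0:
            return restarts
        notify(f"{cmd[0]} exited with status {rc}")
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts")
        restarts += 1
        time.sleep(delay)  # back off a little before respawning


if __name__ == "__main__":
    # A stand-in "service" that always fails, to show the give-up path.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    try:
        supervise(failing, max_restarts=2)
    except RuntimeError as exc:
        print(exc)
```

The restart limit matters: a service that respawns forever without anyone noticing hides the underlying problem, which is why the notification hook is called on every failure and not only on the last one.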