HA doesn't work: Galera cluster can't automatically recover when several controllers go down

Bug #1434477 reported by Timur Nurlygayanov
Affects: Fuel for OpenStack
Status: Invalid
Importance: Critical
Assigned to: Fuel Library (Deprecated)

Bug Description

Note: the bug was found on MOS 6.1, but it looks like it can be reproduced on MOS 5.x and 6.x releases as well.
It was reproduced in a VirtualBox environment, but it looks like it can be reproduced on KVM/baremetal as well.
The diagnostic snapshot is available at: https://yadi.sk/d/4am7qbV_fP7NU

Steps To Reproduce:
1. Take the fresh MOS image (in my case it was the 202 image: http://mc0n2-msk.msk.mirantis.net/fuelweb-iso/fuel-6.1-202-2015-03-16_22-54-44.iso)
2. Deploy an OpenStack cloud with the following configuration: Ubuntu, HA, 3 controllers, 1 compute, Neutron VLAN, Swift file storage backend (using the VirtualBox scripts).
3. Shut down 2 controllers (to make sure the issue is reproduced, shut down both the primary and a non-primary controller). Example for VirtualBox:
VBoxManage controlvm "fuel-slave-1" poweroff
VBoxManage controlvm "fuel-slave-2" poweroff
4. Wait 10 minutes
5. Try to open the Horizon dashboard. It will not be available (404 code).
6. Log in to the remaining controller node and try to run any OpenStack CLI command; it will fail:
source openrc ; keystone user-list
7. Check the status of the Galera cluster:
mysql -e "SHOW STATUS LIKE 'wsrep%';"

Observed Result:
The cluster doesn't work at all: we can't access the Horizon dashboard (404 code), OpenStack CLI commands don't work (500 code), and the Galera cluster is down; the surviving Galera node reports the cluster status 'non-Primary':

-----------
root@node-2:~# mysql -e "SHOW STATUS LIKE 'wsrep%';"
+------------------------------+--------------------------------------+
| Variable_name | Value |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid | 3c9f95e7-ce27-11e4-bbdd-3ff507860ec1 |
| wsrep_protocol_version | 5 |
| wsrep_last_committed | 246131 |
| wsrep_replicated | 7 |
| wsrep_replicated_bytes | 1905 |
| wsrep_repl_keys | 7 |
| wsrep_repl_keys_bytes | 217 |
| wsrep_repl_data_bytes | 1240 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 245822 |
| wsrep_received_bytes | 202955040 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.001623 |
| wsrep_local_cached_downto | 91007 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 11.239005 |
| wsrep_apply_oooe | 0.000201 |
| wsrep_apply_oool | 0.000074 |
| wsrep_apply_window | 1.000209 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.000057 |
| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized |
| wsrep_cert_index_size | 63 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.009545 |
| wsrep_incoming_addresses | 192.168.0.4:3307 |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | 3c9f95e7-ce27-11e4-bbdd-3ff507860ec1 |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 3.5(rXXXX) |
| wsrep_ready | OFF |
+------------------------------+--------------------------------------+
-----------
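For reference (this is a manual workaround, not a fix for the missing automatic recovery reported here), a node stuck in a non-Primary component can usually be forced back to Primary via the standard Galera pc.bootstrap provider option; a sketch run on the surviving controller:

-----------
# Force the surviving node to bootstrap a new Primary component:
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=true';"
# Re-check the state: wsrep_cluster_status should now be 'Primary'
# and wsrep_ready should return to 'ON'.
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
mysql -e "SHOW STATUS LIKE 'wsrep_ready';"
-----------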

Tags: galera ha
Revision history for this message
Bogdan Dobrelya (bogdando) wrote:

According to the HA reference architecture http://docs.mirantis.com/fuel/fuel-6.0/reference-architecture.html#openstack-environment-architecture, this test case is invalid. When you deploy 3 controllers, you must maintain a quorum, which is 2 nodes, in order to keep your cluster operating. That means the failover procedure can succeed only in the 3-1 case, but will fail in the 3-2 case.

Note: we have an unresolved documentation bug about the missing list of supported failover cases:
https://bugs.launchpad.net/bugs/1415398
https://bugs.launchpad.net/fuel/+bug/1326605
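For reference, the quorum arithmetic behind the comment above (a sketch, not project code):

-----------
# quorum = floor(N/2) + 1; with N=3 controllers the quorum is 2,
# so the cluster survives losing 1 node (3-1) but not 2 nodes (3-2).
N=3; echo "quorum for $N controllers: $(( N / 2 + 1 ))"
-----------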

Changed in fuel:
status: Confirmed → Invalid
assignee: nobody → Fuel Library Team (fuel-library)