Fuel for OpenStack

HA. mysql cluster failover issues after a connection loss on primary controller management nic

Bug #1399181 reported by Kirill Omelchenko on 2014-12-04

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Invalid	High	Sergii Golovatiuk	Fuel for OpenStack 6.0
5.1.x	Won't Fix	High	Fuel Library (Deprecated)	Fuel for OpenStack 5.1.1-updates
6.0.x	Invalid	High	Sergii Golovatiuk	Fuel for OpenStack 6.0

Bug Description

Scenario:
1. Deploy an HA environment using the 5.1.1 48-RC2 ISO:
with nova-flat, Ceph for images and volumes; 3x controllers, 2x computes, 2x ceph-storage
2. Run network verification tests (pass)
3. Run OSTF (all tests pass)
4. Simulate connection loss on the management interface of the primary controller node
by running the following command:
# brctl delif <br> <if>
where 'if' is the node interface attached to the management network
and 'br' is the bridge that 'if' is attached to.
5. Wait 30+ minutes
6. Run OSTF
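
Step 4 can be scripted so that the interface is reattached after the test. This is only a minimal sketch, assuming the bridge-utils `brctl` tool is present on the controller. The bridge and interface names (`br-mgmt`, `eth1`) are placeholders rather than values from this environment, and the helper names are hypothetical.

```python
import subprocess


def detach_cmd(bridge, iface):
    """Build the command that removes iface from bridge (step 4)."""
    return ["brctl", "delif", bridge, iface]


def attach_cmd(bridge, iface):
    """Build the command that puts iface back on bridge to undo step 4."""
    return ["brctl", "addif", bridge, iface]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False (needs root)."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder names: substitute the real bridge and interface.
    run(detach_cmd("br-mgmt", "eth1"))
    run(attach_cmd("br-mgmt", "eth1"))
```

By default the commands are only printed; pass `dry_run=False` on the controller itself to actually run them.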

Result: several tests fail:
- Create volume and boot instance from it
- Check network connectivity from instance without floating IP
- Check network connectivity from instance via floating IP
- Launch instance, create snapshot, launch instance from snapshot
HA
- Mysql node detection failed Please refer to OpenStack logs for more details.
- Check amount of tables in databases is the same on each node
- Check RabbitMQ is available
though 'Check galera environment state' passes.

- Platform tests also fail.

Almost every test fails because of a timeout or a similar error.

In Horizon, most actions can still be completed, but not always, and they almost always take a long time,
e.g. creating an instance or assigning a floating IP to an instance (the IP WAS eventually assigned, but the error messages 'HTTP 504' and 'Unable to assign...' were shown).

Cluster status:
http://paste.openstack.org/show/144536/

Kirill Omelchenko (komelchenko) wrote on 2014-12-04:

fuel-snapshot-2014-12-04_12-31-22.tgz (18.2 MiB, application/x-tar)

Bogdan Dobrelya (bogdando) wrote on 2014-12-04:

Failover is not instant. It takes some time to bring services back into operation. That is why it is important to have relevant OSTF checks, such as the HA group, and to run them as a mandatory prerequisite for any other checks. If the HA health checks cannot pass, you should not expect any others to.

Changed in fuel:
status:	New → Invalid

Kirill Omelchenko (komelchenko) wrote on 2014-12-04:

I forgot to mention that quite a while passed between the connection loss and the OSTF run.

description:	updated

Nastya Urlapova (aurlapova) on 2014-12-04

Changed in fuel:
status:	Invalid → Confirmed
status:	Confirmed → New

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

The point is not how long to wait, but that the failover results must be checked *before* continuing. That is, the OSTF HA health checks should pass before proceeding with the other OSTF checks. For example, the rabbitmq cluster could take 5 minutes to reassemble, and sometimes it could fail to do so as well (a known issue). If it did, the OSTF HA checks would not pass, so you should consider the failover procedure failed and not continue with any other checks: they will obviously fail due to the broken cluster state.
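
The gating described above could look roughly like the following. It is a sketch only and not how OSTF is implemented: `wait_for_ha` and `run_suite` are hypothetical names, each check is assumed to be a zero-argument callable returning a bool, and the 600-second default timeout is an arbitrary assumption.

```python
import time


def wait_for_ha(ha_checks, timeout=600.0, interval=15.0,
                clock=time.monotonic, sleep=time.sleep):
    """Poll every HA check until all pass or the timeout expires.

    ha_checks maps a check name to a zero-argument callable returning bool.
    Returns (True, []) on success or (False, names_still_failing).
    """
    deadline = clock() + timeout
    while True:
        failing = [name for name, check in ha_checks.items() if not check()]
        if not failing:
            return True, []
        if clock() >= deadline:
            return False, failing
        sleep(interval)


def run_suite(ha_checks, other_checks, **wait_kwargs):
    """Run the remaining checks only if the HA group passed first."""
    ok, failing = wait_for_ha(ha_checks, **wait_kwargs)
    if not ok:
        # Failover is considered failed, so every other check is skipped.
        return {"ha_passed": False, "ha_failing": failing, "results": {}}
    results = {name: check() for name, check in other_checks.items()}
    return {"ha_passed": True, "ha_failing": [], "results": results}
```

The result then separates "failover did not complete" from "failover completed but a functional test failed", which is the distinction this comment is asking for.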

Changed in fuel:
status:	New → Incomplete

Kirill Omelchenko (komelchenko) wrote on 2014-12-05:

But what about the MySQL status on the affected node?
Is MySQL supposed to be inaccessible even from localhost?

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

The current OSTF HA checks should be improved to cover all potential failover issues with mysql, rabbitmq, neutron agents, etc. There is a bug about rabbitmq improvements: https://bugs.launchpad.net/fuel/+bug/1387567. Perhaps we should file a similar one for mysql.
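
As an illustration of what a per-node mysql check might look at, here is a small sketch that interprets Galera's `wsrep_*` status variables. It is not the existing OSTF check. The function names are hypothetical, and the input is assumed to be the tab-separated output of `mysql -N -B -e "SHOW STATUS LIKE 'wsrep_%'"` run on the node with suitable credentials.

```python
def parse_wsrep_status(text):
    """Parse tab-separated 'name<TAB>value' lines into a dict."""
    status = {}
    for line in text.splitlines():
        parts = line.strip().split("\t", 1)
        if len(parts) == 2:
            status[parts[0]] = parts[1]
    return status


def galera_node_problems(status, expected_size=3):
    """Return a list of problems for one node; an empty list means healthy.

    expected_size is an assumption: 3 matches the three controllers in
    the scenario, but it should be whatever the deployment expects.
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the primary component")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not synced with the cluster")
    size = status.get("wsrep_cluster_size", "0")
    if not size.isdigit() or int(size) != expected_size:
        problems.append("unexpected cluster size: %s" % size)
    return problems
```

Running this against each controller separately would show whether the isolated node dropped out of the primary component while the remaining nodes stayed in sync.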

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

@Kirill, according to the bug description, it looks like the failover procedure didn't succeed for the mysql cluster: 'Mysql node detection failed Please refer to OpenStack logs for more details.' Please update the bug title to be less generic, for example: mysql cluster failover issues after a connection loss on primary controller management nic

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
milestone:	5.1.1 → 5.1.2

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

Please note that we cannot target High bugs for 5.1.1 due to HCF (Hard Code Freeze), so this was moved to 5.1.2.

Kirill Omelchenko (komelchenko) wrote on 2014-12-05:

@Bogdan, thanks a lot for the explanation.

summary:
- HA. Some OSTF tests don't pass after connection loss on primary
+ HA. mysql cluster failover issues after a connection loss on primary controller management nic
Changed in fuel:
status:	Incomplete → New