HA. mysql cluster failover issues after a connection loss on primary controller management nic

Bug #1399181 reported by Kirill Omelchenko
Affects             Status     Importance  Assigned to                Milestone
Fuel for OpenStack  Invalid    High        Sergii Golovatiuk
5.1.x               Won't Fix  High        Fuel Library (Deprecated)
6.0.x               Invalid    High        Sergii Golovatiuk

Bug Description

Scenario:
1. Deploy an HA environment from the 5.1.1 48-RC2 ISO
with nova-flat, Ceph for images and volumes: 3x controllers, 2x computes, 2x ceph-storage
2. Run the network verification tests (they pass)
3. Run OSTF (all tests pass)
4. Simulate a connection loss on the management interface of the primary controller node
by running the following command:
# brctl delif <br> <if>
where 'if' is the node interface attached to the management network
and 'br' is the bridge that 'if' is attached to.
5. Wait 30+ minutes
6. Run OSTF again
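
Step 4 can be scripted so that the link is detached and later reattached in a repeatable way. The following is only a minimal sketch: the helper names, the `dry_run` switch, and the example names `br-mgmt` and `eth1` are assumptions for illustration, and reattaching with `brctl addif` is not part of the original scenario.

```python
import subprocess


def brctl_command(action, bridge, interface):
    """Build the brctl command line for detaching or attaching an interface."""
    if action not in ("delif", "addif"):
        raise ValueError("action must be 'delif' or 'addif'")
    return ["brctl", action, bridge, interface]


def set_management_link(bridge, interface, connected, dry_run=True):
    """Attach (connected=True) or detach (connected=False) the interface.

    With dry_run=True the command is only returned, not executed.
    Executing it needs root privileges on the controller node.
    """
    action = "addif" if connected else "delif"
    cmd = brctl_command(action, bridge, interface)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical names: substitute the real bridge and interface of the node.
print(set_management_link("br-mgmt", "eth1", connected=False))
```

Keeping `dry_run=True` as the default makes it possible to review the exact command before running it on a live controller.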

Result: several tests fail:
 - Create volume and boot instance from it
 - Check network connectivity from instance without floating IP
 - Check network connectivity from instance via floating IP
 - Launch instance, create snapshot, launch instance from snapshot
HA
 - Mysql node detection failed Please refer to OpenStack logs for more details.
 - Check amount of tables in databases is the same on each node
 - Check RabbitMQ is available
though 'Check galera environment state' passes.

 - Platform tests also fail.

Almost every test fails because of a timeout or a similar error.

In Horizon, most actions can still be completed, but not always, and they almost always take a long time.
Examples: creating an instance, and assigning a floating IP to an instance (the IP WAS eventually assigned, but the error messages 'HTTP 504' and 'Unable to assign...' were shown).

Cluster status:
http://paste.openstack.org/show/144536/

Bogdan Dobrelya (bogdando) wrote :

Failover is not instant. It takes some time to bring services back into operation. That is why it is important to have relevant OSTF checks, such as the HA group, and to run them as a mandatory prerequisite for any other checks. If the HA health checks cannot pass, you should not expect any others to pass.

Changed in fuel:
status: New → Invalid
Kirill Omelchenko (komelchenko) wrote :

I forgot to mention that quite a while passed between the connection loss and the OSTF run.

description: updated
Changed in fuel:
status: Invalid → Confirmed
status: Confirmed → New
Bogdan Dobrelya (bogdando) wrote :

The point is not how long to wait, but which failover results to check *before* continuing. That is, the OSTF HA health checks should pass before you proceed with the other OSTF checks. For example, the RabbitMQ cluster can take 5 minutes to reassemble, and sometimes it fails to do so at all (a known issue). If it did, the OSTF HA checks would not have passed, so you should consider the failover procedure failed and not continue with any other checks; they will obviously fail because of the broken cluster state.
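
The gating described here can be sketched as follows. This is an illustration only, not how OSTF is implemented: the function names, the check callables, and the timeout and interval values are all assumptions.

```python
import time
from typing import Callable, Iterable


def wait_for_ha_checks(
    ha_checks: Iterable[Callable[[], bool]],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll every HA check until all of them pass or the timeout expires."""
    checks = list(ha_checks)
    deadline = clock() + timeout_s
    while True:
        if all(check() for check in checks):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def run_suite(ha_checks, other_checks, **wait_kwargs):
    """Run the remaining checks only if the HA group passed first."""
    if not wait_for_ha_checks(ha_checks, **wait_kwargs):
        # Failover is considered failed; the other checks are skipped.
        return {"failover": "failed", "results": {}}
    return {
        "failover": "ok",
        "results": {check.__name__: check() for check in other_checks},
    }
```

With this structure a broken cluster is reported once as a failed failover instead of as a long list of unrelated timeouts.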

Changed in fuel:
status: New → Incomplete
Kirill Omelchenko (komelchenko) wrote :

But what about the MySQL status on the affected node?
Is MySQL supposed to be inaccessible even from localhost?

Bogdan Dobrelya (bogdando) wrote :

The current OSTF HA checks should be improved to cover all potential failover issues with MySQL, RabbitMQ, Neutron agents, etc. There is a bug about RabbitMQ improvements: https://bugs.launchpad.net/fuel/+bug/1387567. Perhaps we should file a similar one for MySQL.
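
As an illustration of what a stricter MySQL check might look at, the sketch below evaluates the Galera `wsrep_*` status variables of a single node. It assumes the values were already fetched (for example with `SHOW GLOBAL STATUS LIKE 'wsrep_%'`) into a dict of strings; the function name and the chosen set of conditions are assumptions, not an existing OSTF check.

```python
def galera_node_problems(status, expected_cluster_size):
    """Return a list of problems for one node; an empty list means healthy.

    `status` maps wsrep status variable names to their string values.
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not in the primary component")
    if status.get("wsrep_connected") != "ON":
        problems.append("node is not connected to the cluster")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not synced")
    try:
        size = int(status.get("wsrep_cluster_size", 0))
    except (TypeError, ValueError):
        size = 0
    if size != expected_cluster_size:
        problems.append(
            "cluster size is %d, expected %d" % (size, expected_cluster_size)
        )
    return problems
```

Running such a check against every controller, rather than only through the virtual IP, would also show the state of a node that has been cut off from the management network.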

Bogdan Dobrelya (bogdando) wrote :

@Kirill, according to the bug description, it looks like the failover procedure did not succeed for the MySQL cluster: "Mysql node detection failed Please refer to OpenStack logs for more details." Please update the bug title to be less generic, for example: "mysql cluster failover issues after a connection loss on primary controller management nic".

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
milestone: 5.1.1 → 5.1.2
Bogdan Dobrelya (bogdando) wrote :

Please note that we cannot target High bugs for 5.1.1 because of HCF, so this bug was moved to 5.1.2.

Kirill Omelchenko (komelchenko) wrote :

@Bogdan, thanks a lot for the explanation.

summary: - HA. Some OSTF tests don't pass after connection loss on primary
+ HA. mysql cluster failover issues after a connection loss on primary
controller management nic
Changed in fuel:
status: Incomplete → New
Changed in fuel:
status: New → Confirmed
Sergii Golovatiuk (sgolovatiuk) wrote :

I was not able to reproduce this bug, except for the nova message, which is correct because services are down on the stopped controller.

Vitaly Sedelnik (vsedelnik) wrote :

This seems to be a leftover from 5.1.1 acceptance testing. Closing as Won't Fix because no similar issues have been reported by customers.
