Deployment of third controller fails in HA

Bug #1348839 reported by Ryan Moe
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Confirmed
High
Fuel Library (Deprecated)

Bug Description

"build_id": "2014-07-25_20-13-50"
"ostf_sha": "8c328521b1444f22c50463b9432193e20ed33813"
"build_number": "361"
"auth_required": true
"api": "1.0"
"nailgun_sha": "83cc9ed44ebc8dd97248483b6d414ebbc4cff3c0"
"production": "docker"
"fuelmain_sha": "0111e138e328f1bbd4453f56cc178fafbd9c4e16"
"astute_sha": "aa5aed61035a8dc4035ab1619a8bb540a7430a95"
"feature_groups": ["mirantis"]
"release": "5.1"
"fuellib_sha": "448dee95b84f4f6585baf0d98f84e376854233eb"

HA Ubuntu with nova-network. 3 controllers, 1 compute, 2 ceph OSD.

The load on the controller gets very high (> 50) and Fuel thinks the node is offline. Looking on the system I see cib with high cpu usage and mysqld using a very large amount of memory.

From top:

10485 mysql -2 0 2859m 822m 4748 S 0.7 54.9 0:06.63 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --init-

On the console I see:

INFO: task mysqld:10534 blocked for more than 120 seconds.

Ryan Moe (rmoe)
description: updated
Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 5.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Ryan, we need description of your environment.

Changed in fuel:
status: New → Incomplete
Ryan Moe (rmoe)
description: updated
Revision history for this message
Ryan Moe (rmoe) wrote :

This environment is still running if somebody wants access to it. The problem has been 100% reproducible.

Mike Scherbakov (mihgen)
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Mike Scherbakov (mihgen) wrote :

on node-3, diag snapshot from BVT (attached)
root 11284 0.3 0.1 574660 3596 ? Ssl 02:17 0:04 /usr/sbin/corosync
107 11305 41.1 0.2 97864 5048 ? S 02:17 7:26 \_ /usr/lib/pacemaker/cib

41% is a lot. It might be a root cause of MCAgent connectivity loss in BVT: http://jenkins-product.srt.mirantis.net:8080/job/fuel_master.ubuntu.bvt_2/212/.

Log details of MCAgent connectivity loss could be found here: https://bugs.launchpad.net/fuel/+bug/1350016/comments/2

tags: added: corosync
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the logs, galera cluster issues were in between

2014-07-30T03:16:02.216507 node-1 ./node-1.test.domain.local/puppet-apply.log:2014-07-30T03:16:02.216507+01:00 notice: Finished catalog run in 896.86 seconds

and

2014-07-30T03:32:25.211335 node-4 ./node-4.test.domain.local/puppet-apply.log:2014-07-30T03:32:25.211335+01:00 notice: Finished catalog run in 961.96 seconds

tags: added: galera ha
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There was a rabbitmq cluster issues also
2014-07-30T03:25:46.822889 node-4 ./node-4.test.domain.local/rabbitmq-server.log:2014-07-30T03:25:46.822889+01:00 err: ERROR: Resetting node 'rabbit
@node-4' ... Error: {corrupt_or_missing_cluster_files,{error,enoent},{error,enoent}}
2014-07-30T03:25:46.824226 node-4 ./node-4.test.domain.local/rabbitmq-server.log:2014-07-30T03:25:46.824226+01:00 err: ERROR: Forcefully resetting n
ode 'rabbit@node-4' ... Error: {corrupt_or_missing_cluster_files,{error,enoent},{error,enoent}}
2014-07-30T03:25:46.824226 node-4 ./node-4.test.domain.local/rabbitmq-server.log:2014-07-30T03:25:46.824226+01:00 err: ERROR: Mnesia couldn't cleane
d, even by force-reset command.

rca looks like a c nnectivity issues at corosync layer of logic due to its high cpu load?
Here could be a good job for fencing :)

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

wsrep slave threads related as well

Revision history for this message
Ryan Moe (rmoe) wrote :

I'm not sure this is a duplicate of #1350245. The logs provided by Mike don't contain the galera failures and the logs from 1350245 don't contain the kernel messages about the hung tasks we saw here. I will test an ISO with the wsrep changes and see if it fixes this issue as well.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

guys. failing deployment of the third controller is too abstract as a bug. hung mysqld task in kernel logs says there is a really big load that blocked mysqld for execution for 120 seconds. most of the time it is related to I/O system performance. Also, let's decide whether this bug is a duplicate of 1350245 or not. If galera is being stopped by OCF script or load is spiking right at the time when galera starts full state transfer, then it is a duplicate. Otherwise it is an environmental issue which I could not reproduce.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.