Fuel for OpenStack

Deployment of third controller fails in HA

Bug #1348839 reported by Ryan Moe on 2014-07-25

This bug report is a duplicate of: Bug #1354479: Galera is not syncing on the slaves sometimes. Edit Remove

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Confirmed	High	Fuel Library (Deprecated)	Fuel for OpenStack 5.1

Bug Description

"build_id": "2014-07-25_20-13-50"
"ostf_sha": "8c328521b1444f22c50463b9432193e20ed33813"
"build_number": "361"
"auth_required": true
"api": "1.0"
"nailgun_sha": "83cc9ed44ebc8dd97248483b6d414ebbc4cff3c0"
"production": "docker"
"fuelmain_sha": "0111e138e328f1bbd4453f56cc178fafbd9c4e16"
"astute_sha": "aa5aed61035a8dc4035ab1619a8bb540a7430a95"
"feature_groups": ["mirantis"]
"release": "5.1"
"fuellib_sha": "448dee95b84f4f6585baf0d98f84e376854233eb"

HA Ubuntu with nova-network. 3 controllers, 1 compute, 2 ceph OSD.

The load on the controller gets very high (> 50) and Fuel thinks the node is offline. Looking on the system I see cib with high cpu usage and mysqld using a very large amount of memory.

From top:

10485 mysql -2 0 2859m 822m 4748 S 0.7 54.9 0:06.63 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --init-

On the console I see:

INFO: task mysqld:10534 blocked for more than 120 seconds.

See original description

Tags:

Ryan Moe (rmoe) on 2014-07-25

description:

updated

Nastya Urlapova (aurlapova) on 2014-07-27

Changed in fuel:
importance:	Undecided → High
assignee:	nobody → Fuel Library Team (fuel-library)
milestone:	none → 5.1

Revision history for this message

Vladimir Kuklin (vkuklin) wrote on 2014-07-28:

Ryan, we need description of your environment.

Changed in fuel:
status:	New → Incomplete

Ryan Moe (rmoe) on 2014-07-28

description:

updated

Revision history for this message

Ryan Moe (rmoe) wrote on 2014-07-28:

This environment is still running if somebody wants access to it. The problem has been 100% reproducible.

Mike Scherbakov (mihgen) on 2014-07-30

Changed in fuel:
status:	Incomplete → Confirmed

Revision history for this message

Mike Scherbakov (mihgen) wrote on 2014-07-30:

fail_deploy_ha_vlan-2014_07_30__02_35_58.tar.gz Edit (3.6 MiB, application/x-tar)

on node-3, diag snapshot from BVT (attached)
root 11284 0.3 0.1 574660 3596 ? Ssl 02:17 0:04 /usr/sbin/corosync
107 11305 41.1 0.2 97864 5048 ? S 02:17 7:26 \_ /usr/lib/pacemaker/cib

41% is a lot. It might be a root cause of MCAgent connectivity loss in BVT: http://jenkins-product.srt.mirantis.net:8080/job/fuel_master.ubuntu.bvt_2/212/.

Log details of MCAgent connectivity loss could be found here: https://bugs.launchpad.net/fuel/+bug/1350016/comments/2

Bogdan Dobrelya (bogdando) on 2014-07-30

tags:

added: corosync

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-07-30:

According to the logs, galera cluster issues were in between

2014-07-30T03:16:02.216507 node-1 ./node-1.test.domain.local/puppet-apply.log:2014-07-30T03:16:02.216507+01:00 notice: Finished catalog run in 896.86 seconds

and

2014-07-30T03:32:25.211335 node-4 ./node-4.test.domain.local/puppet-apply.log:2014-07-30T03:32:25.211335+01:00 notice: Finished catalog run in 961.96 seconds

tags:

added: galera ha

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-07-30:

There was a rabbitmq cluster issues also
2014-07-30T03:25:46.822889 node-4 ./node-4.test.domain.local/rabbitmq-server.log:2014-07-30T03:25:46.822889+01:00 err: ERROR: Resetting node 'rabbit
@node-4' ... Error: {corrupt_or_missing_cluster_files,{error,enoent},{error,enoent}}
2014-07-30T03:25:46.824226 node-4 ./node-4.test.domain.local/rabbitmq-server.log:2014-07-30T03:25:46.824226+01:00 err: ERROR: Forcefully resetting n
ode 'rabbit@node-4' ... Error: {corrupt_or_missing_cluster_files,{error,enoent},{error,enoent}}
2014-07-30T03:25:46.824226 node-4 ./node-4.test.domain.local/rabbitmq-server.log:2014-07-30T03:25:46.824226+01:00 err: ERROR: Mnesia couldn't cleane
d, even by force-reset command.

rca looks like a c nnectivity issues at corosync layer of logic due to its high cpu load?
Here could be a good job for fencing :)

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-07-31:

wsrep slave threads related as well

Revision history for this message

Ryan Moe (rmoe) wrote on 2014-07-31:

I'm not sure this is a duplicate of #1350245. The logs provided by Mike don't contain the galera failures and the logs from 1350245 don't contain the kernel messages about the hung tasks we saw here. I will test an ISO with the wsrep changes and see if it fixes this issue as well.

Revision history for this message

Vladimir Kuklin (vkuklin) wrote on 2014-08-01:

guys. failing deployment of the third controller is too abstract as a bug. hung mysqld task in kernel logs says there is a really big load that blocked mysqld for execution for 120 seconds. most of the time it is related to I/O system performance. Also, let's decide whether this bug is a duplicate of 1350245 or not. If galera is being stopped by OCF script or load is spiking right at the time when galera starts full state transfer, then it is a duplicate. Otherwise it is an environmental issue which I could not reproduce.