[10.0-mitaka]bvt_2: mysql cluster crashed

Bug #1600792 reported by Pavel Kholkin
This bug affects 1 person
Affects             Status     Importance  Assigned to     Milestone
Fuel for OpenStack  Confirmed  High        Maksim Malchuk  -
Mitaka              Confirmed  High        Maksim Malchuk  -

Bug Description

reproduced on bvt CI https://product-ci.infra.mirantis.net/job/10.0-mitaka.main.ubuntu.bvt_2/70/

OSTF tests failed because the MySQL connection was lost (node-1 neutron-server.log):

2016-07-09 11:55:19.374 4447 ERROR oslo_messaging.rpc.dispatcher DBConnectionError: (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0")
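
Error 2013 at 'reading initial communication packet' means the client reached mysqld but the server went away before completing the handshake, which is consistent with mysqld being killed and restarted underneath the services. A minimal sketch of the kind of liveness probe that surfaces this error, assuming the MySQLdb driver seen in the traceback (the helper name ping_database is hypothetical):

import MySQLdb
from MySQLdb import OperationalError

def ping_database(host, user, passwd, db):
    """Return False if the server drops the connection (error 2013)."""
    try:
        conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db,
                               connect_timeout=5)
        try:
            cur = conn.cursor()
            # The same probe query that shows up in the cinder log quoted
            # later in this bug ([SQL: u'SELECT 1']).
            cur.execute("SELECT 1")
            return cur.fetchone() == (1,)
        finally:
            conn.close()
    except OperationalError as exc:
        if exc.args[0] == 2013:  # "Lost connection to MySQL server ..."
            return False
        raise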

node-1 syslog:

<27>Jul 9 11:55:00 node-1 ocf-mysql-wss: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:13765, this is a split-brain!
<27>Jul 9 11:55:00 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: 15c64ce7-45c6-11e6-99f8-ea2cb83a3a56:5144, which was not expected
<27>Jul 9 11:55:06 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Jul 9 11:55:08 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running
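
The agent's complaint is about cluster membership: node-1 bootstrapped its own Primary Component instead of joining the existing one, so its GTID no longer matches the rest of the cluster. A rough illustration of the membership check, assuming direct access to the Galera wsrep status variables (this is not the actual ocf-mysql-wss code):

import MySQLdb

def galera_node_state(host, user, passwd):
    """Fetch the two Galera status values relevant to the split-brain check."""
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd)
    try:
        cur = conn.cursor()
        cur.execute("SHOW STATUS LIKE 'wsrep_cluster_status'")
        status = cur.fetchone()[1]  # 'Primary' on a healthy member
        cur.execute("SHOW STATUS LIKE 'wsrep_cluster_state_uuid'")
        uuid = cur.fetchone()[1]    # must be identical on every member
        return status, uuid
    finally:
        conn.close()

A node that started a new cluster of its own reports a wsrep_cluster_state_uuid that differs from the other members, which is the "split-brain" the agent detects above.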

node-1 pacemaker.log:

Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: get_failcount_full: p_mysqld:0 has failed 1 times on node-1.test.domain.local
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: common_apply_stickiness: clone_p_mysqld can fail 9 more times on node-1.test.domain.local before being forced off
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: get_failcount_full: p_mysqld:1 has failed 1 times on node-1.test.domain.local
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: common_apply_stickiness: clone_p_mysqld can fail 9 more times on node-1.test.domain.local before being forced off
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: get_failcount_full: p_mysqld:2 has failed 1 times on node-1.test.domain.local
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: common_apply_stickiness: clone_p_mysqld can fail 9 more times on node-1.test.domain.local before being forced off
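
These pengine lines show Pacemaker's fail-count accounting: each clone instance of p_mysqld has fail-count=1, and "can fail 9 more times" implies a migration-threshold of 10, after which the resource would be banned from the node. A small helper for pulling these counts out of pacemaker.log during triage (hypothetical, not part of Fuel):

import re

# Matches the get_failcount_full lines quoted above.
FAILCOUNT_RE = re.compile(
    r"get_failcount_full: (?P<rsc>\S+) has failed (?P<count>\d+) times "
    r"on (?P<node>\S+)")

def parse_failcounts(lines):
    """Yield (resource, node, fail_count) tuples from pacemaker.log lines."""
    for line in lines:
        m = FAILCOUNT_RE.search(line)
        if m:
            yield m.group("rsc"), m.group("node"), int(m.group("count"))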

Dmitry Klenov (dklenov)
tags: added: area-library
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
importance: Undecided → High
status: New → Confirmed
tags: added: swarm-blocker
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Oleksiy Molchanov (omolchanov)
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

I checked a few swarm runs after this failure; they passed successfully. According to the logs, the issue is environment-specific. I am invalidating this bug.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Anna Babich (ababich) wrote :

The bug has been reproduced at the OSTF test step by the Tempest CI job: http://cz7776.bud.mirantis.net:8080/jenkins/view/Tempest_9.%D0%A5/job/9.x_Tempest_LVM_no_ssl/89/consoleFull

The tracebacks for one of the failed tests:
From the OSTF log - http://paste.openstack.org/show/570172/ :
ClientException: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_db.exception.DBConnectionError'> (HTTP 500) (Request-ID: req-6accf408-72f3-457d-b3d7-cb43cc0b333e)

From the nova-api log - http://paste.openstack.org/show/570171/ :
2016-09-08T22:55:12.277204+00:00 info: 2016-09-08 22:55:12.275 28769 INFO nova.api.openstack.wsgi [req-6accf408-72f3-457d-b3d7-cb43cc0b333e 8218301469224e5085ddb183a0cafbea bbb0de9a173d441f89c454a0545871d1 - - -] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_db.exception.DBConnectionError'>

From the cinder log - http://paste.openstack.org/show/570177/ :
2016-09-08T22:55:12.288621+00:00 err: 2016-09-08 22:55:12.259 26281 ERROR cinder.api.middleware.fault [req-61dab4c5-aaf5-4e7d-b0d6-009e71fdc686 8218301469224e5085ddb183a0cafbea bbb0de9a173d441f89c454a0545871d1 - - -] Caught error: <class 'oslo_messaging.rpc.client.RemoteError'> Remote error: Remote error: DBConnectionError (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") [SQL: u'SELECT 1']
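
The [SQL: u'SELECT 1'] fragment shows the failure was raised by the connection liveness probe, not by a real query. On the consumer side, oslo.db can retry an operation when the connection drops; a sketch of that mitigation, assuming an enginefacade-style SQLAlchemy session (it masks brief outages but does nothing about the split-brain itself):

from oslo_db import api as oslo_db_api

@oslo_db_api.wrap_db_retry(max_retries=5, retry_on_disconnect=True)
def ping(session):
    # Any DBConnectionError raised here triggers a retry with backoff.
    return session.execute("SELECT 1").scalar()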

Changed in fuel:
status: Invalid → New
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Maksim Malchuk (mmalchuk)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Dev team, Tempest CI is a large CI where the whole Tempest test suite usually passes. This failure is not related just to low environment performance; something is broken in the MySQL cluster logic, and we need to understand what is wrong here.

The issue sometimes reproduces even on BVT.
