[10.0-mitaka]bvt_2: mysql cluster crashed

Bug #1600792 reported by Pavel Kholkin
This bug affects 1 person
Affects             Status     Importance  Assigned to     Milestone
Fuel for OpenStack  Confirmed  High        Maksim Malchuk  -
Mitaka              Confirmed  High        Maksim Malchuk  -

Bug Description

reproduced on bvt CI https://product-ci.infra.mirantis.net/job/10.0-mitaka.main.ubuntu.bvt_2/70/

OSTF tests failed because the MySQL connection was lost (node-1 neutron-server.log):

2016-07-09 11:55:19.374 4447 ERROR oslo_messaging.rpc.dispatcher DBConnectionError: (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0")
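
Error 2013 at 'reading initial communication packet' means the client reached mysqld but the server went away before completing the handshake, which is consistent with mysqld being killed and restarted underneath the services. A minimal sketch of the kind of liveness probe that surfaces this error, assuming the MySQLdb driver seen in the traceback (the helper name ping_database is hypothetical):

import MySQLdb
from MySQLdb import OperationalError

def ping_database(host, user, passwd, db):
    """Return False if the server drops the connection (error 2013)."""
    try:
        conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db,
                               connect_timeout=5)
        try:
            cur = conn.cursor()
            # The same probe query that shows up in the cinder log quoted
            # later in this bug ([SQL: u'SELECT 1']).
            cur.execute("SELECT 1")
            return cur.fetchone() == (1,)
        finally:
            conn.close()
    except OperationalError as exc:
        if exc.args[0] == 2013:  # "Lost connection to MySQL server ..."
            return False
        raise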

node-1 syslog:

<27>Jul 9 11:55:00 node-1 ocf-mysql-wss: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:13765, this is a split-brain!
<27>Jul 9 11:55:00 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: 15c64ce7-45c6-11e6-99f8-ea2cb83a3a56:5144, which was not expected
<27>Jul 9 11:55:06 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Jul 9 11:55:08 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running
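
The agent's complaint is about cluster membership: node-1 bootstrapped its own Primary Component instead of joining the existing one, so its GTID no longer matches the rest of the cluster. A rough illustration of the membership check, assuming direct access to the Galera wsrep status variables (this is not the actual ocf-mysql-wss code):

import MySQLdb

def galera_node_state(host, user, passwd):
    """Fetch the two Galera status values relevant to the split-brain check."""
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd)
    try:
        cur = conn.cursor()
        cur.execute("SHOW STATUS LIKE 'wsrep_cluster_status'")
        status = cur.fetchone()[1]  # 'Primary' on a healthy member
        cur.execute("SHOW STATUS LIKE 'wsrep_cluster_state_uuid'")
        uuid = cur.fetchone()[1]    # must be identical on every member
        return status, uuid
    finally:
        conn.close()

A node that started a new cluster of its own reports a wsrep_cluster_state_uuid that differs from the other members, which is the "split-brain" the agent detects above.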

node-1 pacemaker.log:

Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: get_failcount_full: p_mysqld:0 has failed 1 times on node-1.test.domain.local
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: common_apply_stickiness: clone_p_mysqld can fail 9 more times on node-1.test.domain.local before being forced off
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: get_failcount_full: p_mysqld:1 has failed 1 times on node-1.test.domain.local
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: common_apply_stickiness: clone_p_mysqld can fail 9 more times on node-1.test.domain.local before being forced off
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: get_failcount_full: p_mysqld:2 has failed 1 times on node-1.test.domain.local
Jul 09 11:55:23 [9154] node-1.test.domain.local pengine: info: common_apply_stickiness: clone_p_mysqld can fail 9 more times on node-1.test.domain.local before being forced off
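
These pengine lines show Pacemaker's fail-count accounting: each clone instance of p_mysqld has fail-count=1, and "can fail 9 more times" implies a migration-threshold of 10, after which the resource would be banned from the node. A small helper for pulling these counts out of pacemaker.log during triage (hypothetical, not part of Fuel):

import re

# Matches the get_failcount_full lines quoted above.
FAILCOUNT_RE = re.compile(
    r"get_failcount_full: (?P<rsc>\S+) has failed (?P<count>\d+) times "
    r"on (?P<node>\S+)")

def parse_failcounts(lines):
    """Yield (resource, node, fail_count) tuples from pacemaker.log lines."""
    for line in lines:
        m = FAILCOUNT_RE.search(line)
        if m:
            yield m.group("rsc"), m.group("node"), int(m.group("count"))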

Dmitry Klenov (dklenov)
tags: added: area-library
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
importance: Undecided → High
status: New → Confirmed
tags: added: swarm-blocker
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Oleksiy Molchanov (omolchanov)
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

I checked a few swarm runs after this failure; they passed successfully. According to the logs, the issue is environment-specific. I am invalidating this bug.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Anna Babich (ababich) wrote :

The bug has been reproduced at the OSTF test step by the Tempest CI job: http://cz7776.bud.mirantis.net:8080/jenkins/view/Tempest_9.%D0%A5/job/9.x_Tempest_LVM_no_ssl/89/consoleFull

The tracebacks for one of the failed tests:
From the OSTF log - http://paste.openstack.org/show/570172/ :
ClientException: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_db.exception.DBConnectionError'> (HTTP 500) (Request-ID: req-6accf408-72f3-457d-b3d7-cb43cc0b333e)

From the nova-api log - http://paste.openstack.org/show/570171/ :
2016-09-08T22:55:12.277204+00:00 info: 2016-09-08 22:55:12.275 28769 INFO nova.api.openstack.wsgi [req-6accf408-72f3-457d-b3d7-cb43cc0b333e 8218301469224e5085ddb183a0cafbea bbb0de9a173d441f89c454a0545871d1 - - -] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_db.exception.DBConnectionError'>

From the cinder log - http://paste.openstack.org/show/570177/ :
2016-09-08T22:55:12.288621+00:00 err: 2016-09-08 22:55:12.259 26281 ERROR cinder.api.middleware.fault [req-61dab4c5-aaf5-4e7d-b0d6-009e71fdc686 8218301469224e5085ddb183a0cafbea bbb0de9a173d441f89c454a0545871d1 - - -] Caught error: <class 'oslo_messaging.rpc.client.RemoteError'> Remote error: Remote error: DBConnectionError (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") [SQL: u'SELECT 1']
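
The [SQL: u'SELECT 1'] fragment shows the failure was raised by the connection liveness probe, not by a real query. On the consumer side, oslo.db can retry an operation when the connection drops; a sketch of that mitigation, assuming an enginefacade-style SQLAlchemy session (it masks brief outages but does nothing about the split-brain itself):

from oslo_db import api as oslo_db_api

@oslo_db_api.wrap_db_retry(max_retries=5, retry_on_disconnect=True)
def ping(session):
    # Any DBConnectionError raised here triggers a retry with backoff.
    return session.execute("SELECT 1").scalar()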

Changed in fuel:
status: Invalid → New
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Maksim Malchuk (mmalchuk)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Dev team, Tempest CI is a large CI where the whole Tempest test suite usually passes. This failure is not related just to low environment performance; something is broken in the MySQL cluster logic, and we need to understand what is wrong here.

The issue sometimes reproduces even on BVT.
