MySQL split-brain issue after successful deploy

Bug #1620268 reported by Victor Ryzhenkin
This bug affects 12 people
Affects               Status          Importance   Assigned to    Milestone
Fuel for OpenStack    Fix Committed   High         Ivan Suzdal
Mitaka                Fix Released    High         MOS Linux

Bug Description

Detailed bug description:
 Murano can't remove a package due to MySQL connection errors.
Steps to reproduce:
 1. Deploy MOS 9.1 with Murano
 2. Run the Platform OSTF tests
Expected results:
 The Murano platform OSTF tests pass
Actual result:
 The test "Check application deployment in Murano environment with GLARE" failed
Reproducibility:
 It may be a race condition or a connection problem.
Workaround:
 None
Impact:
 Test Impact
Description of the environment:
 Operating system: Ubuntu
 Versions of components: 9.x
 Reference architecture: HA
 Network model: Neutron VXLAN
 Related projects installed: Murano, Mos, Fuel
Additional information:
  The OSTF log shows this client-side error:
glanceclient.common.http: DEBUG: Request returned failure status 503.
fuel_health.common.test_mixins: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/fuel_health/common/test_mixins.py", line 177, in verify
    result = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/fuel_health/muranomanager.py", line 459, in delete_package
    self.murano_art_client.packages.delete(package_id)
  File "/usr/lib/python2.7/site-packages/muranoclient/v1/artifact_packages.py", line 29, in inner
    raise exc.from_code(e.code)
HTTPServiceUnavailable: HTTPServiceUnavailable (HTTP 503)

There are no tracebacks in the glance-glare logs.

An additional cleanup of Murano packages raised this (part of the log):
[SQL: u'SELECT 1']\n", "type": "DBConnectionError"}, "title": "Internal Server Error"} (HTTP 500)

Reproduced only once, on SWARM 9.x:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.services_ha/49/console

There are also tracebacks from Cinder, likewise connected with MySQL:
2016-09-05T00:46:58.039663+00:00 crit: 2016-09-05 00:46:58.001 7513 CRITICAL cinder [req-3c321312-7828-41ff-b136-3a6995c11a94 - - - - -] OperationalError: (_mysql_exceptions.OperationalError) (1054, "Unknown column 'services.rpc_current_version' in 'field list'") [SQL: u'SELECT services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.id AS services_id, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.availability_zone AS services_availability_zone, services.disabled_reason AS services_disabled_reason, services.modified_at AS services_modified_at, services.rpc_current_version AS services_rpc_current_version, services.object_current_version AS services_object_current_version, services.replication_status AS services_replication_status, services.active_backend_id AS services_active_backend_id, services.frozen AS services_frozen \nFROM services \nWHERE services.deleted = false AND services.`binary` = %s'] [parameters: ('cinder-scheduler',)]

summary: - Murano can't delete package due MyAQL error
+ Murano can't delete package due MySQL error
summary: - Murano can't delete package due MySQL error
+ [Murano] Murano can't delete package due 503 error from glare
description: updated
Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote : Re: [Murano] Murano can't delete package due 503 error from glare
summary: - [Murano] Murano can't delete package due 503 error from glare
+ [Murano] MySQL unexpectedly crashed on primary controller
Changed in fuel:
importance: High → Critical
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote : Re: [Murano] MySQL unexpectedly crashed on primary controller

In this log you can find the following messages:
<27>Sep 5 00:28:41 node-5 ocf-mysql-wss: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
<27>Sep 5 00:29:22 node-5 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Sep 5 00:29:24 node-5 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Lowering the priority to High because it was a one-time failure and the next run finished successfully.

Changed in fuel:
importance: Critical → High
Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote :

It may be that bug https://bugs.launchpad.net/fuel/+bug/1614947 is about the same problem.

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Is it really a crash? It could have been killed by the OOM killer, as has happened before.

Revision history for this message
Peter Razumovsky (prazumovsky) wrote :

A similar issue happened in [1] and [2], but with a FloatingIP and a Cinder volume, respectively.

*Logs for [2] (Similar for [1])*:

cinder-volume.log (node-6):
---------------------------

2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume [req-d2f92157-adb4-4704-8205-1812a12cb884 - - - - -] Volume service rbd:volumes@RBD-backend failed to start.
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume Traceback (most recent call last):
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/cmd/volume.py", line 81, in main
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume binary='cinder-volume')
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 268, in create
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume service_name=service_name)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 139, in __init__
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume service_ref = objects.Service.get_by_args(ctxt, host, binary)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 181, in wrapper
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume result = fn(cls, context, *args, **kwargs)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/objects/service.py", line 81, in get_by_args
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume db_service = db.service_get_by_args(context, host, binary_key)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/db/api.py", line 127, in service_get_by_args
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume return IMPL.service_get_by_args(context, host, binary)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/db/sqlalchemy/api.py", line 175, in wrapper
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume return f(*args, **kwargs)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/db/sqlalchemy/api.py", line 450, in service_get_by_args
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume filter_by(binary=binary).\
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2588, in all
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume return list(self)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2736, in __iter__
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume return self._execute_and_instances(context)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2751, in _execute_and_instances
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume result = conn.execute(querycontext.statement, self._params)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2...


Changed in fuel:
status: New → Confirmed
tags: added: area-library
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

MySQL didn't crash as described in the issue; it got a split-brain:

./ocf-mysql-wss.log:994:2016-09-05T01:07:20.281618+00:00 err: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:11395, this is a split-brain!

and Pacemaker gracefully stopped mysqld on node-1:

2016-09-05T01:07:20.511963+00:00 err: 2016-09-05 01:07:20 11395 [Note] /usr/sbin/mysqld: Normal shutdown

Several seconds later it started again:

2016-09-05T01:07:37.569858+00:00 err: 2016-09-05 01:07:37 0 [Note] /usr/sbin/mysqld (mysqld 5.6.30-0~u14.04+mos1) starting as process 13109 ...

Several seconds later the Galera cluster was ready:

2016-09-05T01:07:41.697481+00:00 err: 2016-09-05 01:07:41 13262 [Note] WSREP: New cluster view: global state: 9f259a0e-72ff-11e6-aeb9-fab6b3c5476e:4265, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3
2016-09-05T01:07:41.697481+00:00 err: 2016-09-05 01:07:41 13262 [Note] WSREP: SST complete, seqno: 4265

So, mysqld on node-1 was unreachable for 21 seconds, from 01:07:20 to 01:07:41; right in this window the Murano client made its request and got an error:

2016-09-05T01:07:33.274404+00:00 info: HTTPInternalServerError: {"explanation": "The server has either erred or is incapable of performing the requested operation.", "code": 500, "error": {"message": "
(_mysql_exceptions.OperationalError) (2013, \"Lost connection to MySQL server at 'reading initial communication packet', system error: 0\") [SQL: u'SELECT 1']"

The Murano client in the OSTF tests should retry such requests.
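
A minimal sketch of such a retry on the test side, assuming the murano artifact client from the traceback above; the helper name and the retry/delay values are illustrative, not actual fuel_health code:

    import time

    def delete_package_with_retries(client, package_id, retries=5, delay=5):
        """Retry package deletion while the Galera cluster re-forms.

        'client' stands for the murano artifact client used in the OSTF
        test above; the retry/delay values are arbitrary examples.
        """
        for attempt in range(retries):
            try:
                return client.packages.delete(package_id)
            except Exception:  # real code would catch only HTTP 500/503 errors
                if attempt == retries - 1:
                    raise
                time.sleep(delay)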

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The same issue, but not related to the Murano client:

https://bugs.launchpad.net/fuel/+bug/1604731

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → MOS QA Team (mos-qa)
Changed in fuel:
assignee: MOS QA Team (mos-qa) → Victor Ryzhenkin (vryzhenkin)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The MySQL database is in split-brain. Dev team, please do not ignore this issue. It can be a critical issue for customers.

Changed in fuel:
assignee: Victor Ryzhenkin (vryzhenkin) → Fuel Sustaining (fuel-sustaining-team)
assignee: Fuel Sustaining (fuel-sustaining-team) → MOS Puppet Team (mos-puppet)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi MOS Puppet team, could you please take a look? It looks like the MySQL database is in split-brain, and we need to figure out how to fix it in Pacemaker/Puppet.

Changed in fuel:
assignee: MOS Puppet Team (mos-puppet) → Fuel Sustaining (fuel-sustaining-team)
Revision history for this message
Denis Egorenko (degorenko) wrote :

A possible solution to prevent split-brain is to install the Galera Arbitrator daemon (garbd). Fuel Sustaining, can you comment on this solution?

summary: - [Murano] MySQL unexpectedly crashed on primary controller
+ MySQL unexpectedly crashed on primary controller
Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote : Re: MySQL unexpectedly crashed on primary controller

Folks, adding retries to the tests is not a solution in this case. If we add retries to the tests, we will not see problems like this. This is not a test issue. It may show up on customer environments, and then the problem will become bigger.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Victor, please do not mix up issues in the tests with issues in the MySQL client logic.

Our MySQL cluster (Galera) is well managed by Pacemaker, which correctly resolves split-brain situations and single mysqld crashes. But if your client connects directly to the mysqld server (instead of the haproxy backend) and gets an error, it should retry. Likewise, if your code works through the haproxy backend and an issue with one of the MySQL servers happens during SQL command execution, you get an error and should retry while haproxy switches to a working node and Pacemaker resolves the issue with the MySQL server.

So, we should fix the tests and add retries or checks for Galera sync (depending on the case).
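
For the "checks for Galera sync" option, a test helper could wait for the node behind the haproxy VIP to report a synced state before proceeding. A rough sketch, assuming PyMySQL is available in the test environment; the host and credentials would come from the test config, and the timeout/poll values are illustrative:

    import time

    import pymysql

    def wait_for_galera_sync(host, user, password, timeout=120):
        """Poll Galera until the node behind haproxy reports 'Synced'."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                conn = pymysql.connect(host=host, user=user, password=password)
                try:
                    with conn.cursor() as cur:
                        cur.execute("SHOW STATUS LIKE 'wsrep_local_state_comment'")
                        row = cur.fetchone()
                        if row and row[1] == 'Synced':
                            return True
                finally:
                    conn.close()
            except pymysql.MySQLError:
                pass  # node is restarting or haproxy has no live backend yet
            time.sleep(5)
        return False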

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Maksim, the tests do not work with the MySQL database directly; we only use OpenStack API calls, and the OpenStack components work with the MySQL database.

Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote :

Max, there is no problem in the tests.
Let me explain the logic.

1. The service (murano-api/engine, for example) is running and has a backend (MySQL for now).
2. The test runs and makes calls to murano-api via muranoclient.
3. On each call, murano-api goes to MySQL and gets the required information (the MySQL host listed in the config is a haproxy host).
4. So, in this bug, we got the traceback from the service because the service couldn't get information from the DB, not from the client. This is not connected with the tests (see the sketch after this list).
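
If the conclusion is that the service layer (rather than the test) should tolerate a short Galera failover, the usual OpenStack approach is oslo.db's retry support. A hedged sketch only: the wrap_db_retry decorator and its keyword arguments come from Mitaka-era oslo.db, while the db_ping function below is purely illustrative and not actual murano-api code:

    from oslo_db import api as oslo_db_api

    @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_disconnect=True,
                               retry_interval=1, inc_retry_interval=True)
    def db_ping(session):
        # The same 'SELECT 1' liveness query that failed with
        # DBConnectionError in the log quoted in the bug description.
        return session.execute("SELECT 1").scalar()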

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → MOS Murano (mos-murano)
assignee: MOS Murano (mos-murano) → Maksim Malchuk (mmalchuk)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

OK, let's wait until the issue is reproduced again.

Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Timur Nurlygayanov (tnurlygayanov)
status: Confirmed → Incomplete
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Note:
we discussed the issue with the dev team, and they said it could be an issue with the virtual environment / a random failure. This is why we are waiting for a reproduction.

Revision history for this message
Alexey. Kalashnikov (akalashnikov) wrote :
Revision history for this message
Alexey. Kalashnikov (akalashnikov) wrote :

<27>Sep 9 00:31:55 node-2 ocf-mysql-wss: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:23443, this is a split-brain!
<27>Sep 9 00:31:55 node-2 ocf-mysql-wss: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: 1f27d411-7622-11e6-a663-5bacd23b1468:1393, which was not expected
<27>Sep 9 00:32:02 node-2 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Sep 9 00:32:04 node-2 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running
<129>Sep 9 00:32:16 node-2 haproxy[3167]: Server mysqld/node-2 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 22ms. 0 active and 2 backup servers left. Running on backup. 15 sessions active, 0 requeued, 0 remaining in queue.

Changed in fuel:
status: Incomplete → Confirmed
assignee: Timur Nurlygayanov (tnurlygayanov) → Maksim Malchuk (mmalchuk)
Revision history for this message
Nikolay Starodubtsev (starodubcevna) wrote :

Probably we have the same issue at https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.smoke_neutron/656/ and https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.smoke_neutron/652/

Here is an output from syslog on controller:

<27>Sep 12 06:20:26 node-1 ocf-mysql-wss: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
<27>Sep 12 06:21:01 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Sep 12 06:21:03 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Nikolay, neither of those is related:

- the failure on #656 was because of a resource outage:

2016-09-12T07:10:03.180032+00:00 err: [ 3560.158545] Out of memory: Kill process 27788 (mysqld) score 44 or sacrifice child
2016-09-12T07:10:03.180032+00:00 err: [ 3560.158679] Killed process 27788 (mysqld) total-vm:2548480kB, anon-rss:112300kB, file-rss:0kB

- #652 also had a lack of resources:

2016-09-11T07:00:05.491042+00:00 warning: [ 3595.682903] glance-cache-pr invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0

In both tests, processes on the controllers were killed by the OOM killer.

summary: - MySQL unexpectedly crashed on primary controller
+ MySQL split-brain issue after successful deploy
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

A summary of all the split-brain failures not caused by the OOM killer: http://paste.openstack.org/show/572400/

Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/367996

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Please remove the 5.6_5.6.30 version packages from the proposed repository and snapshots.

Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Roman Vyalov (r0mikiam)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/mitaka)

Change abandoned by Maksim Malchuk (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/367996

Revision history for this message
Roman Vyalov (r0mikiam) wrote :

Please revert the MySQL packages in the repository.

Changed in fuel:
assignee: Roman Vyalov (r0mikiam) → MOS Linux (mos-linux)
status: In Progress → New
Revision history for this message
Roman Vyalov (r0mikiam) wrote :
Changed in fuel:
assignee: MOS Linux (mos-linux) → Ivan Suzdal (isuzdal)
Revision history for this message
Roman Vyalov (r0mikiam) wrote :
Changed in fuel:
assignee: Ivan Suzdal (isuzdal) → Dmitry Burmistrov (dburmistrov)
status: New → In Progress
status: In Progress → Fix Committed
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Reopened due to new failures: https://bugs.launchpad.net/fuel/+bug/1624368

Changed in fuel:
status: Fix Committed → New
assignee: Dmitry Burmistrov (dburmistrov) → Ivan Suzdal (isuzdal)
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

AFAIK the fix is in progress: https://review.fuel-infra.org/#/c/26461/

Revision history for this message
Andrey Maximov (maximov) wrote :

Ivan, this is a critical issue; it is moved to the Confirmed state because it is a recurring issue.

Changed in fuel:
importance: High → Critical
status: New → Confirmed
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Last Friday (16.09.2016) we decided to upgrade MySQL to the latest version (5.6.33).
The bvt2 results are here [0]. They look promising, but this needs more testing.

[0] https://custom-ci.infra.mirantis.net/view/9.x/job/9.x.custom.ubuntu.bvt_2/72/

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Revision history for this message
Tatyana Kuterina (tkuterina) wrote :
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Well, let's take a look at the logs.
A split-brain was detected once, at 01:05:49 (node-1 syslog).
After that, Pacemaker rebuilt the cluster, and at 01:06:58 the cluster was started successfully.
In haproxy.log we can see that MySQL reached the 'UP' state at 01:06:36.
So, this is absolutely normal (and expected) behavior for clustered resources.

Changed in fuel:
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Revision history for this message
Andrey Lavrentyev (alavrentyev) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.repetitive_restart/67/testReport/%28root%29/ceph_partitions_repetitive_cold_restart/

The same as https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_3/70/testReport/(root)/ceph_ha_restart/ceph_ha_restart/.

A split-brain was detected at 22:50:05 (see the node-1 syslog), and it was detected only once. At 22:50:47 MySQL was recovered and started by Pacemaker (see the node-1 pacemaker.log).

BTW: possibly this is related to a lack of memory. Take a look at the atop logs.

MEM | tot 2.9G | free 78.4M | cache 81.5M | dirty 0.6M | buff 1.9M | slab 70.4M |
SWP | tot 3.0G | free 1.9G |
PAG…| swin 7478 | | swout 1710 |

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Colleagues, before you attach another set of failure logs, please read the haproxy/pacemaker/mysqld logs.
A split-brain was detected on node-3 at 22:56:54; at 22:56:59 MySQL was recovered and started by Pacemaker.

I have a question: how do you actually decide that some particular failure is related to this bug?
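
One mechanical way to decide: scan the controller logs for the markers quoted in this thread and classify the failure before attaching it here. A rough triage sketch; the marker strings come from the log snippets above, while the function and the example path are illustrative:

    import re

    SPLIT_BRAIN = re.compile(r'this is a split-brain!')
    OOM_KILL = re.compile(r'Out of memory: Kill process \d+ \(mysqld\)')

    def classify_failure(log_lines):
        """Return 'oom', 'split-brain' or 'unknown' for a controller log."""
        verdict = 'unknown'
        for line in log_lines:
            if OOM_KILL.search(line):
                return 'oom'  # resource outage, not this bug
            if SPLIT_BRAIN.search(line):
                verdict = 'split-brain'
        return verdict

    # Example usage against a collected snapshot (path is illustrative):
    # with open('/var/log/remote/node-1/ocf-mysql-wss.log') as f:
    #     print(classify_failure(f))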

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/374265

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/374266

Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Let me explain what is actually going on.
Let's take a look at mysqld.log. As we can see, at 22:50:06 mysql got a SIGTERM and went into a "normal shutdown".
Now take a look at pacemaker.log. At 20:50:05 crmd failed the p_mysqld monitor ("Detected action (47.49) p_mysqld_monitor_60000.59=unknown error: failed").
After that, Pacemaker sent the TERM signal to mysqld.
After a little while, mysqld was recovered and started again.
If something tries to access mysqld at this time (while MySQL is recovering), any request will fail.
Again, for clustered services this behavior is absolutely normal.
So, my point is: we shouldn't treat clustered services the same as standalone services.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/374266
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=0c09d3d24a8df841f7763cdec763bc9caf42838b
Submitter: Jenkins
Branch: stable/mitaka

commit 0c09d3d24a8df841f7763cdec763bc9caf42838b
Author: Denis Egorenko <email address hidden>
Date: Wed Sep 21 18:56:19 2016 +0300

    Add retries for murano-dbmanage task

    Murano-dbmanage fails without any chance to retry via
    additional puppet runs if it was completed successfully on the
    first run. Adding retries reduces the likelihood that deployment fails.

    Change-Id: I567c0c944b308db344326ad0555e98e21f422236
    Related-bug: #1620268

tags: added: in-stable-mitaka
Changed in fuel:
status: Confirmed → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Mykola Stolyarenko (mstolyarenko) wrote :

Not reproducible. fuel rpm: fuel-release-9.0.0-1.mos6357.noarch
murano plugin: 0.11.0.dev13-1~u14.04+mos2

Revision history for this message
Alexey. Kalashnikov (akalashnikov) wrote :

Reproduced on yesterday's swarm:
9.1 snapshot #311
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_3/75/testReport/(root)/ceph_ha_restart/ceph_ha_restart/

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6357.noarch
 fuel-misc-9.0.0-1.mos8605.noarch
 fuel-9.0.0-1.mos6357.noarch
 fuel-openstack-metadata-9.0.0-1.mos8861.noarch
 fuel-nailgun-9.0.0-1.mos8861.noarch
 fuel-agent-9.0.0-1.mos291.noarch
 fuel-mirror-9.0.0-1.mos151.noarch
 nailgun-mcagents-9.0.0-1.mos774.noarch
 fuel-ui-9.0.0-1.mos2814.noarch
 shotgun-9.0.0-1.mos90.noarch
 network-checker-9.0.0-1.mos77.x86_64
 fuel-utils-9.0.0-1.mos8605.noarch
 fuel-migrate-9.0.0-1.mos8605.noarch
 python-fuelclient-9.0.0-1.mos356.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8861.noarch
 fuel-notify-9.0.0-1.mos8605.noarch
 rubygem-astute-9.0.0-1.mos774.noarch
 fuelmenu-9.0.0-1.mos275.noarch
 python-packetary-9.0.0-1.mos151.noarch
 fuel-bootstrap-cli-9.0.0-1.mos291.noarch
 fuel-setup-9.0.0-1.mos6357.noarch
 fuel-library9.0-9.0.0-1.mos8605.noarch
 fuel-ostf-9.0.0-1.mos946.noarch
[root@nailgun ~]#

Revision history for this message
Andrey Lavrentyev (alavrentyev) wrote :

Looks like there is a similar issue that happened during auto acceptance:
https://product-ci.infra.mirantis.net/job/9.x.acceptance.ubuntu.failover_group_mongo/11/testReport/(root)/deploy_mongo_cluster/

[root@nailgun ~]# grep -ir 'split-brain' /var/log
/var/log/remote/node-1.test.domain.local/ocf-mysql-wss.log:2016-09-28T07:11:16.756341+00:00 err: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:18969, this is a split-brain!

Also:

2016-09-28T07:11:17.461567+00:00 err: (/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]) Failed to call refresh: neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head returned 1 instead of one of [0]

2016-09-28T07:11:17.465163+00:00 err: (/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]) neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head returned 1 instead of one of [0]

Env Description:
9.1 snapshot #315

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/378492

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/378937

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/374265
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=354b1b1c342c1843596e55fe0ea30df5f1eb2c30
Submitter: Jenkins
Branch: master

commit 354b1b1c342c1843596e55fe0ea30df5f1eb2c30
Author: Denis Egorenko <email address hidden>
Date: Wed Sep 21 18:56:19 2016 +0300

    Add retries for murano-dbmanage task

    Murano-dbmanage fails without any chance to retry via
    additional puppet runs if it was completed successfully on the
    first run. Adding retries reduces the likelihood that deployment fails.

    Change-Id: I567c0c944b308db344326ad0555e98e21f422236
    Related-bug: #1620268

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/378937
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=b415c05b039302a42dfec58357438585c06073b4
Submitter: Jenkins
Branch: master

commit b415c05b039302a42dfec58357438585c06073b4
Author: Maksim Malchuk <email address hidden>
Date: Wed Sep 28 12:01:27 2016 +0300

    Add retries for neutron-db-sync task

    Exec 'neutron-db-sync' fails without any chance to retry via
    additional puppet runs if it was completed successfully on the first
    run. Adding retries reduces the likelihood that deployment fails.

    Change-Id: I27522de30fc29ef7516e3c9baf36516723ced4a5
    Related-bug: #1620268
    Signed-off-by: Maksim Malchuk <email address hidden>

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Fixes for the db-sync-related failures were proposed to all Puppet modules in the master branch: https://bugs.launchpad.net/puppet-aodh/+bug/1628580

Changed in fuel:
status: Fix Released → Fix Committed
importance: Critical → High
milestone: 9.1 → 10.0
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Added the Newton milestone to the bug due to the several related fixes in the master branch.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

It's not only about retrying db syncs, as it also affects services/tests after deployment, like in https://bugs.launchpad.net/mos/+bug/1628942.

I suggest we re-open this and try to understand the root cause of the split brain. If it's indeed scarce resources, then we should set correct expectations for QA engineers, so that they fix their scripts and do not file a new pile of duplicates.

Revision history for this message
Alisa Tselovalnikova (atselovalnikova) wrote :

This bug affects two swarm tests (jumbo_frames_neutron_vlan, jumbo_frames_neutron_vxlan):
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.jumbo_frames/80/
http://paste.openstack.org/show/584028/
snapshot #339

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

It is with great sadness that I inform you that two swarm tests failed because of performance issues:
* one of the controller nodes was heavily loaded and MySQL didn't respond in time to Pacemaker
* Pacemaker sent SIGTERM to MySQL and started a new instance
* because of the high load, the 'old' MySQL instance wasn't stopped immediately, and the 'new' one detected a split-brain
* once the 'old' MySQL instance was stopped, the MySQL cluster went back to normal
* the tests failed because of performance issues / no retries when a single request to MySQL failed

The MySQL 'split-brain' occurred during the following time intervals:

fail_error_jumbo_frames_neutron_vlan-fuel-snapshot-2016-10-03_06-32-49
----------------------------------------------------------------------

Time: 06:32:47 - 06:33:05
  * 06:32:47 - mysql failed to respond, SIGTERM sent
  * 06:32:47 - new instance started, 'split-brain' detected
  * 06:32:51 - 'old' mysql stopped
  * 06:33:05 - mysql recovered

This test failed because the MySQL instance stopped responding due to high load.
This is NOT a split-brain issue; the root cause is a performance issue.

---
2016-10-03 06:32:49.400 3032 ERROR nova.api.openstack.extensions [req-909b13bc-abd3-46ca-81ce-838c707248b0 5fdd105ee97e4883b4187e307e98dc14 bfeb52a6529e4166a1f182740cb311b3 - - -] Unexpected exception in API method
2016-10-03 06:32:49.400 3032 ERROR nova.api.openstack.extensions Traceback (most recent call last):
...
2016-10-03 06:32:49.400 3032 ERROR nova.api.openstack.extensions DBConnectionError: (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") [SQL: u'SELECT 1']
---

fail_error_jumbo_frames_neutron_vxlan-fuel-snapshot-2016-10-03_07-40-03
-----------------------------------------------------------------------

Time: 07:16:33 - 07:16:53:
  * 07:16:33 - mysql failed to respond, SIGTERM sent
  * 07:16:33 - new instance started, 'split-brain' detected
  * 07:16:38 - 'old' mysql stopped
  * 07:16:51 - mysql recovered

The issue with Cinder occurred because of a deadlock at the same moment MySQL wasn't able to respond properly due to high load.
This is NOT a split-brain issue; the root cause is a performance issue.

---
2016-10-03 07:16:34.136 24163 CRITICAL cinder [-] DBDeadlock: (_mysql_exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'\nALTER TABLE volume_type_projects CHANGE COLUMN deleted deleted INTEGER']
---

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Dima, we agree that this is a performance issue! The thing is that you didn't find the real cause of the degradation: why do our tests work on the same HW with 9.0, while with 9.1 we see performance degradation? What is the real cause of it?

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

We've found that there is intensive swapping at the time of the MySQL failures. Please try adding 2 GB of RAM to each controller.

Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

@Dmitry Teselkin

Which service requires that additional 2 GB? For 9.0 the current capacities were fine.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Nastya, it is not fair to compare 9.x and 10.0 because they differ too much:
1. Different OSes (14.04 Trusty vs 16.04 Xenial)
2. Different packages and package dependencies (see #1)
3. The upstream Puppet modules differ by more than just branches (9.x doesn't have puppet-oslo and puppet-ceph at all)
4. The downstream Puppet modules differ even more (due to #3, for example, and they also have more features not backported to Mitaka)

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Regarding the difference between 9.1 and 9.0: the code contains more backports from 10.0, and the memory requirements have also risen, maybe only a little, but they have.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Dima T., please don't change the assignment without a proper comment, and don't forget comment #59!

@Maksim, I was talking about comparing 9.1 and 9.0, not 9.x/master.

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Guys, I'm sorry, but I'm closing the bug. This bug is about the split-brain issue and is not related to the current performance issues. Please create a new report for your case.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

New bug for the MySQL/Galera performance issue: https://bugs.launchpad.net/fuel/+bug/1630233
Closing the current one.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/378492
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=6c6954df72406d761e49823ae72089ce83f45df7
Submitter: Jenkins
Branch: stable/mitaka

commit 6c6954df72406d761e49823ae72089ce83f45df7
Author: Maksim Malchuk <email address hidden>
Date: Wed Sep 28 12:01:27 2016 +0300

    Add retries for neutron-db-sync task

    Exec 'neutron-db-sync' fails without any chance to retry via
    additional puppet runs if it was completed successfully on the first
    run. Adding retries reduces the likelihood that deployment fails.

    Change-Id: I27522de30fc29ef7516e3c9baf36516723ced4a5
    Related-bug: #1620268
    Signed-off-by: Maksim Malchuk <email address hidden>
