MySQL split-brain issue after successful deploy

Bug #1620268 reported by Victor Ryzhenkin
This bug affects 12 people
Affects               Status          Importance   Assigned to    Milestone
Fuel for OpenStack    Fix Committed   High         Ivan Suzdal
Mitaka                Fix Released    High         MOS Linux

Bug Description

Detailed bug description:
 Murano can't remove a package due to MySQL connection errors.
Steps to reproduce:
 1. Deploy MOS 9.1 with Murano
 2. Run the Platform OSTF tests
Expected results:
 The Murano platform OSTF tests pass
Actual result:
 The test "Check application deployment in Murano environment with GLARE" failed
Reproducibility:
 It may be a race condition or a connection problem.
Workaround:
 None
Impact:
 Test Impact
Description of the environment:
 Operating system: Ubuntu
 Versions of components: 9.x
 Reference architecture: HA
 Network model: Neutron VXLAN
 Related projects installed: Murano, Mos, Fuel
Additional information:
  The OSTF log shows this client-side error:
glanceclient.common.http: DEBUG: Request returned failure status 503.
fuel_health.common.test_mixins: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/fuel_health/common/test_mixins.py", line 177, in verify
    result = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/fuel_health/muranomanager.py", line 459, in delete_package
    self.murano_art_client.packages.delete(package_id)
  File "/usr/lib/python2.7/site-packages/muranoclient/v1/artifact_packages.py", line 29, in inner
    raise exc.from_code(e.code)
HTTPServiceUnavailable: HTTPServiceUnavailable (HTTP 503)

There are no tracebacks in the glance-glare logs.

An additional cleanup of Murano packages raised this (part of the log):
[SQL: u'SELECT 1']\n", "type": "DBConnectionError"}, "title": "Internal Server Error"} (HTTP 500)

Reproduced only once, on SWARM 9.x:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.services_ha/49/console

There are also tracebacks from Cinder, likewise connected with MySQL:
2016-09-05T00:46:58.039663+00:00 crit: 2016-09-05 00:46:58.001 7513 CRITICAL cinder [req-3c321312-7828-41ff-b136-3a6995c11a94 - - - - -] OperationalError: (_mysql_exceptions.OperationalError) (1054, "Unknown column 'services.rpc_current_version' in 'field list'") [SQL: u'SELECT services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.id AS services_id, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.availability_zone AS services_availability_zone, services.disabled_reason AS services_disabled_reason, services.modified_at AS services_modified_at, services.rpc_current_version AS services_rpc_current_version, services.object_current_version AS services_object_current_version, services.replication_status AS services_replication_status, services.active_backend_id AS services_active_backend_id, services.frozen AS services_frozen \nFROM services \nWHERE services.deleted = false AND services.`binary` = %s'] [parameters: ('cinder-scheduler',)]

summary: - Murano can't delete package due MyAQL error
+ Murano can't delete package due MySQL error
summary: - Murano can't delete package due MySQL error
+ [Murano] Murano can't delete package due 503 error from glare
description: updated
Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote : Re: [Murano] Murano can't delete package due 503 error from glare
summary: - [Murano] Murano can't delete package due 503 error from glare
+ [Murano] MySQL unexpectedly crashed on primary controller
Changed in fuel:
importance: High → Critical
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote : Re: [Murano] MySQL unexpectedly crashed on primary controller

In this log you can find the following messages:
<27>Sep 5 00:28:41 node-5 ocf-mysql-wss: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
<27>Sep 5 00:29:22 node-5 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Sep 5 00:29:24 node-5 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Lowering the priority to High because it was a one-time failure and the next run finished successfully.

Changed in fuel:
importance: Critical → High
Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote :

It may be that bug https://bugs.launchpad.net/fuel/+bug/1614947 is about the same problem.

Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Is it really a crash? It could have been killed by the OOM killer, as has happened before.

Revision history for this message
Peter Razumovsky (prazumovsky) wrote :

A similar issue happened in [1] and [2], but with a FloatingIP and a Cinder volume, respectively.

*Logs for [2] (Similar for [1])*:

cinder-volume.log (node-6):
---------------------------

2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume [req-d2f92157-adb4-4704-8205-1812a12cb884 - - - - -] Volume service rbd:volumes@RBD-backend failed to start.
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume Traceback (most recent call last):
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/cmd/volume.py", line 81, in main
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume binary='cinder-volume')
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 268, in create
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume service_name=service_name)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 139, in __init__
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume service_ref = objects.Service.get_by_args(ctxt, host, binary)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 181, in wrapper
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume result = fn(cls, context, *args, **kwargs)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/objects/service.py", line 81, in get_by_args
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume db_service = db.service_get_by_args(context, host, binary_key)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/db/api.py", line 127, in service_get_by_args
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume return IMPL.service_get_by_args(context, host, binary)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/db/sqlalchemy/api.py", line 175, in wrapper
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume return f(*args, **kwargs)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/cinder/db/sqlalchemy/api.py", line 450, in service_get_by_args
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume filter_by(binary=binary).\
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2588, in all
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume return list(self)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2736, in __iter__
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume return self._execute_and_instances(context)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2751, in _execute_and_instances
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume result = conn.execute(querycontext.statement, self._params)
2016-09-07 01:22:37.952 27393 ERROR cinder.cmd.volume File "/usr/lib/python2...


Changed in fuel:
status: New → Confirmed
tags: added: area-library
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

MySQL didn't crash as described in the issue; it got a split-brain:

./ocf-mysql-wss.log:994:2016-09-05T01:07:20.281618+00:00 err: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:11395, this is a split-brain!

and Pacemaker gracefully stopped mysqld on node-1:

2016-09-05T01:07:20.511963+00:00 err: 2016-09-05 01:07:20 11395 [Note] /usr/sbin/mysqld: Normal shutdown

Several seconds later it started again:

2016-09-05T01:07:37.569858+00:00 err: 2016-09-05 01:07:37 0 [Note] /usr/sbin/mysqld (mysqld 5.6.30-0~u14.04+mos1) starting as process 13109 ...

Several seconds later the Galera cluster was ready:

2016-09-05T01:07:41.697481+00:00 err: 2016-09-05 01:07:41 13262 [Note] WSREP: New cluster view: global state: 9f259a0e-72ff-11e6-aeb9-fab6b3c5476e:4265, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3
2016-09-05T01:07:41.697481+00:00 err: 2016-09-05 01:07:41 13262 [Note] WSREP: SST complete, seqno: 4265

So, mysqld on node-1 was unreachable for 21 seconds, from 01:07:20 to 01:07:41; right in this window the Murano client made its request and got an error:

2016-09-05T01:07:33.274404+00:00 info: HTTPInternalServerError: {"explanation": "The server has either erred or is incapable of performing the requested operation.", "code": 500, "error": {"message": "
(_mysql_exceptions.OperationalError) (2013, \"Lost connection to MySQL server at 'reading initial communication packet', system error: 0\") [SQL: u'SELECT 1']"

The Murano client in the OSTF tests should retry such requests.
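
A minimal sketch of such a retry on the test side, assuming the murano artifact client from the traceback above; the helper name and the retry/delay values are illustrative, not actual fuel_health code:

    import time

    def delete_package_with_retries(client, package_id, retries=5, delay=5):
        """Retry package deletion while the Galera cluster re-forms.

        'client' stands for the murano artifact client used in the OSTF
        test above; the retry/delay values are arbitrary examples.
        """
        for attempt in range(retries):
            try:
                return client.packages.delete(package_id)
            except Exception:  # real code would catch only HTTP 500/503 errors
                if attempt == retries - 1:
                    raise
                time.sleep(delay)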

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The same issue, but not related to the Murano client:

https://bugs.launchpad.net/fuel/+bug/1604731

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → MOS QA Team (mos-qa)
Changed in fuel:
assignee: MOS QA Team (mos-qa) → Victor Ryzhenkin (vryzhenkin)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The MySQL database is in split-brain. Dev team, please do not ignore this issue. It can be a critical issue for customers.

Changed in fuel:
assignee: Victor Ryzhenkin (vryzhenkin) → Fuel Sustaining (fuel-sustaining-team)
assignee: Fuel Sustaining (fuel-sustaining-team) → MOS Puppet Team (mos-puppet)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi MOS Puppet team, could you please take a look? It looks like the MySQL database is in split-brain, and we need to figure out how to fix it in Pacemaker/Puppet.

Changed in fuel:
assignee: MOS Puppet Team (mos-puppet) → Fuel Sustaining (fuel-sustaining-team)
Revision history for this message
Denis Egorenko (degorenko) wrote :

A possible solution to prevent split-brain is to install the Galera Arbitrator daemon (garbd). Fuel Sustaining, can you comment on this solution?

summary: - [Murano] MySQL unexpectedly crashed on primary controller
+ MySQL unexpectedly crashed on primary controller
Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote : Re: MySQL unexpectedly crashed on primary controller

Folks, adding retries to the tests is not a solution in this case. If we add retries to the tests, we will not see problems like this. This is not a test issue. It may show up on customer environments, and then the problem will become bigger.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Victor, please do not mix up issues in the tests with issues in the MySQL client logic.

Our MySQL cluster (Galera) is well managed by Pacemaker, which correctly resolves split-brain situations and single mysqld crashes. But if your client connects directly to the mysqld server (instead of the haproxy backend) and gets an error, it should retry. Likewise, if your code works through the haproxy backend and an issue with one of the MySQL servers happens during SQL command execution, you get an error and should retry while haproxy switches to a working node and Pacemaker resolves the issue with the MySQL server.

So, we should fix the tests and add retries or checks for Galera sync (depending on the case).
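
For the "checks for Galera sync" option, a test helper could wait for the node behind the haproxy VIP to report a synced state before proceeding. A rough sketch, assuming PyMySQL is available in the test environment; the host and credentials would come from the test config, and the timeout/poll values are illustrative:

    import time

    import pymysql

    def wait_for_galera_sync(host, user, password, timeout=120):
        """Poll Galera until the node behind haproxy reports 'Synced'."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                conn = pymysql.connect(host=host, user=user, password=password)
                try:
                    with conn.cursor() as cur:
                        cur.execute("SHOW STATUS LIKE 'wsrep_local_state_comment'")
                        row = cur.fetchone()
                        if row and row[1] == 'Synced':
                            return True
                finally:
                    conn.close()
            except pymysql.MySQLError:
                pass  # node is restarting or haproxy has no live backend yet
            time.sleep(5)
        return False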

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Maksim, the tests do not work with the MySQL database directly; we only use OpenStack API calls, and the OpenStack components work with the MySQL database.

Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote :

Max, there is no problem in the tests.
Let me explain the logic.

1. The service (murano-api/engine, for example) is running and has a backend (MySQL for now).
2. The test runs and makes calls to murano-api via muranoclient.
3. On each call, murano-api goes to MySQL and gets the required information (the MySQL host listed in the config is a haproxy host).
4. So, in this bug, we got the traceback from the service because the service couldn't get information from the DB, not from the client. This is not connected with the tests (see the sketch after this list).
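
If the conclusion is that the service layer (rather than the test) should tolerate a short Galera failover, the usual OpenStack approach is oslo.db's retry support. A hedged sketch only: the wrap_db_retry decorator and its keyword arguments come from Mitaka-era oslo.db, while the db_ping function below is purely illustrative and not actual murano-api code:

    from oslo_db import api as oslo_db_api

    @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_disconnect=True,
                               retry_interval=1, inc_retry_interval=True)
    def db_ping(session):
        # The same 'SELECT 1' liveness query that failed with
        # DBConnectionError in the log quoted in the bug description.
        return session.execute("SELECT 1").scalar()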

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → MOS Murano (mos-murano)
assignee: MOS Murano (mos-murano) → Maksim Malchuk (mmalchuk)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

OK, let's wait until the issue is reproduced again.

Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Timur Nurlygayanov (tnurlygayanov)
status: Confirmed → Incomplete
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Note:
we discussed the issue with the dev team, and they said it could be an issue with the virtual environment / a random failure. This is why we are waiting for a reproduction.

Revision history for this message
Alexey. Kalashnikov (akalashnikov) wrote :
Revision history for this message
Alexey. Kalashnikov (akalashnikov) wrote :

<27>Sep 9 00:31:55 node-2 ocf-mysql-wss: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:23443, this is a split-brain!
<27>Sep 9 00:31:55 node-2 ocf-mysql-wss: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: 1f27d411-7622-11e6-a663-5bacd23b1468:1393, which was not expected
<27>Sep 9 00:32:02 node-2 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Sep 9 00:32:04 node-2 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running
<129>Sep 9 00:32:16 node-2 haproxy[3167]: Server mysqld/node-2 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 22ms. 0 active and 2 backup servers left. Running on backup. 15 sessions active, 0 requeued, 0 remaining in queue.

Changed in fuel:
status: Incomplete → Confirmed
assignee: Timur Nurlygayanov (tnurlygayanov) → Maksim Malchuk (mmalchuk)
Revision history for this message
Nikolay Starodubtsev (starodubcevna) wrote :

Probably we have the same issue at https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.smoke_neutron/656/ and https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.smoke_neutron/652/

Here is an output from syslog on controller:

<27>Sep 12 06:20:26 node-1 ocf-mysql-wss: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
<27>Sep 12 06:21:01 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Sep 12 06:21:03 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Nikolay, neither of those is related:

- the failure on #656 was because of a resource outage:

2016-09-12T07:10:03.180032+00:00 err: [ 3560.158545] Out of memory: Kill process 27788 (mysqld) score 44 or sacrifice child
2016-09-12T07:10:03.180032+00:00 err: [ 3560.158679] Killed process 27788 (mysqld) total-vm:2548480kB, anon-rss:112300kB, file-rss:0kB

- #652 also had a lack of resources:

2016-09-11T07:00:05.491042+00:00 warning: [ 3595.682903] glance-cache-pr invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0

In both tests, processes on the controllers were killed by the OOM killer.

summary: - MySQL unexpectedly crashed on primary controller
+ MySQL split-brain issue after successful deploy
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

A summary of all the split-brain failures not caused by the OOM killer: http://paste.openstack.org/show/572400/

Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/367996

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Please remove the 5.6_5.6.30 version packages from the proposed repository and snapshots.

Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Roman Vyalov (r0mikiam)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/mitaka)

Change abandoned by Maksim Malchuk (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/367996

Revision history for this message
Roman Vyalov (r0mikiam) wrote :

Please revert the MySQL packages in the repository.

Changed in fuel:
assignee: Roman Vyalov (r0mikiam) → MOS Linux (mos-linux)
status: In Progress → New
Revision history for this message
Roman Vyalov (r0mikiam) wrote :
Changed in fuel:
assignee: MOS Linux (mos-linux) → Ivan Suzdal (isuzdal)
Revision history for this message
Roman Vyalov (r0mikiam) wrote :
Changed in fuel:
assignee: Ivan Suzdal (isuzdal) → Dmitry Burmistrov (dburmistrov)
status: New → In Progress
status: In Progress → Fix Committed
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Reopened due to new failures: https://bugs.launchpad.net/fuel/+bug/1624368

Changed in fuel:
status: Fix Committed → New
assignee: Dmitry Burmistrov (dburmistrov) → Ivan Suzdal (isuzdal)
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

AFAIK the fix is in progress: https://review.fuel-infra.org/#/c/26461/

Revision history for this message
Andrey Maximov (maximov) wrote :

Ivan, this is a critical issue; it is moved to the Confirmed state because it is a recurring issue.

Changed in fuel:
importance: High → Critical
status: New → Confirmed
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Last Friday (16.09.2016) we decided to upgrade MySQL to the latest version (5.6.33).
The bvt2 results are here [0]. They look promising, but this needs more testing.

[0] https://custom-ci.infra.mirantis.net/view/9.x/job/9.x.custom.ubuntu.bvt_2/72/

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Revision history for this message
Tatyana Kuterina (tkuterina) wrote :
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Well, let's take a look at the logs.
A split-brain was detected once, at 01:05:49 (node-1 syslog).
After that, Pacemaker rebuilt the cluster, and at 01:06:58 the cluster was started successfully.
In haproxy.log we can see that MySQL reached the 'UP' state at 01:06:36.
So, this is absolutely normal (and expected) behavior for clustered resources.

Changed in fuel:
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Revision history for this message
Andrey Lavrentyev (alavrentyev) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.repetitive_restart/67/testReport/%28root%29/ceph_partitions_repetitive_cold_restart/

The same as https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_3/70/testReport/(root)/ceph_ha_restart/ceph_ha_restart/.

A split-brain was detected at 22:50:05 (see the node-1 syslog), and it was detected only once. At 22:50:47 MySQL was recovered and started by Pacemaker (see the node-1 pacemaker.log).

BTW: possibly this is related to a lack of memory. Take a look at the atop logs.

MEM | tot 2.9G | free 78.4M | cache 81.5M | dirty 0.6M | buff 1.9M | slab 70.4M |
SWP | tot 3.0G | free 1.9G |
PAG…| swin 7478 | | swout 1710 |

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Colleagues, before you attach another set of failure logs, please read the haproxy/pacemaker/mysqld logs.
A split-brain was detected on node-3 at 22:56:54; at 22:56:59 MySQL was recovered and started by Pacemaker.

I have a question: how do you actually decide that some particular failure is related to this bug?
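
One mechanical way to decide: scan the controller logs for the markers quoted in this thread and classify the failure before attaching it here. A rough triage sketch; the marker strings come from the log snippets above, while the function and the example path are illustrative:

    import re

    SPLIT_BRAIN = re.compile(r'this is a split-brain!')
    OOM_KILL = re.compile(r'Out of memory: Kill process \d+ \(mysqld\)')

    def classify_failure(log_lines):
        """Return 'oom', 'split-brain' or 'unknown' for a controller log."""
        verdict = 'unknown'
        for line in log_lines:
            if OOM_KILL.search(line):
                return 'oom'  # resource outage, not this bug
            if SPLIT_BRAIN.search(line):
                verdict = 'split-brain'
        return verdict

    # Example usage against a collected snapshot (path is illustrative):
    # with open('/var/log/remote/node-1/ocf-mysql-wss.log') as f:
    #     print(classify_failure(f))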

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/374265

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/374266

Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Let me explain what is actually going on.
Let's take a look at mysqld.log. As we can see, at 22:50:06 mysql got a SIGTERM and went into a "normal shutdown".
Now take a look at pacemaker.log. At 20:50:05 crmd failed the p_mysqld monitor ("Detected action (47.49) p_mysqld_monitor_60000.59=unknown error: failed").
After that, Pacemaker sent the TERM signal to mysqld.
After a little while, mysqld was recovered and started again.
If something tries to access mysqld at this time (while MySQL is recovering), any request will fail.
Again, for clustered services this behavior is absolutely normal.
So, my point is: we shouldn't treat clustered services the same as standalone services.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/374266
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=0c09d3d24a8df841f7763cdec763bc9caf42838b
Submitter: Jenkins
Branch: stable/mitaka

commit 0c09d3d24a8df841f7763cdec763bc9caf42838b
Author: Denis Egorenko <email address hidden>
Date: Wed Sep 21 18:56:19 2016 +0300

    Add retries for murano-dbmanage task

    Murano-dbmanage fails without any chance to retry via
    additional puppet runs if it was completed successfully on the
    first run. Adding retries reduces the likelihood that deployment fails.

    Change-Id: I567c0c944b308db344326ad0555e98e21f422236
    Related-bug: #1620268

tags: added: in-stable-mitaka
Changed in fuel:
status: Confirmed → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Mykola Stolyarenko (mstolyarenko) wrote :

Not reproducible. fuel rpm: fuel-release-9.0.0-1.mos6357.noarch
murano plugin: 0.11.0.dev13-1~u14.04+mos2

Revision history for this message
Alexey. Kalashnikov (akalashnikov) wrote :

Reproduced on yesterday's swarm:
9.1 snapshot #311
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_3/75/testReport/(root)/ceph_ha_restart/ceph_ha_restart/

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6357.noarch
 fuel-misc-9.0.0-1.mos8605.noarch
 fuel-9.0.0-1.mos6357.noarch
 fuel-openstack-metadata-9.0.0-1.mos8861.noarch
 fuel-nailgun-9.0.0-1.mos8861.noarch
 fuel-agent-9.0.0-1.mos291.noarch
 fuel-mirror-9.0.0-1.mos151.noarch
 nailgun-mcagents-9.0.0-1.mos774.noarch
 fuel-ui-9.0.0-1.mos2814.noarch
 shotgun-9.0.0-1.mos90.noarch
 network-checker-9.0.0-1.mos77.x86_64
 fuel-utils-9.0.0-1.mos8605.noarch
 fuel-migrate-9.0.0-1.mos8605.noarch
 python-fuelclient-9.0.0-1.mos356.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8861.noarch
 fuel-notify-9.0.0-1.mos8605.noarch
 rubygem-astute-9.0.0-1.mos774.noarch
 fuelmenu-9.0.0-1.mos275.noarch
 python-packetary-9.0.0-1.mos151.noarch
 fuel-bootstrap-cli-9.0.0-1.mos291.noarch
 fuel-setup-9.0.0-1.mos6357.noarch
 fuel-library9.0-9.0.0-1.mos8605.noarch
 fuel-ostf-9.0.0-1.mos946.noarch
[root@nailgun ~]#

Revision history for this message
Andrey Lavrentyev (alavrentyev) wrote :

Looks like there is a similar issue that happened during auto acceptance:
https://product-ci.infra.mirantis.net/job/9.x.acceptance.ubuntu.failover_group_mongo/11/testReport/(root)/deploy_mongo_cluster/

[root@nailgun ~]# grep -ir 'split-brain' /var/log
/var/log/remote/node-1.test.domain.local/ocf-mysql-wss.log:2016-09-28T07:11:16.756341+00:00 err: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:18969, this is a split-brain!

Also:

2016-09-28T07:11:17.461567+00:00 err: (/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]) Failed to call refresh: neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head returned 1 instead of one of [0]

2016-09-28T07:11:17.465163+00:00 err: (/Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]) neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head returned 1 instead of one of [0]

Env Description:
9.1 snapshot #315

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/378492

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/378937

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/374265
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=354b1b1c342c1843596e55fe0ea30df5f1eb2c30
Submitter: Jenkins
Branch: master

commit 354b1b1c342c1843596e55fe0ea30df5f1eb2c30
Author: Denis Egorenko <email address hidden>
Date: Wed Sep 21 18:56:19 2016 +0300

    Add retries for murano-dbmanage task

    Murano-dbmanage fails without any chance to retry via
    additional puppet runs if it was completed successfully on the
    first run. Adding retries reduces the likelihood that deployment fails.

    Change-Id: I567c0c944b308db344326ad0555e98e21f422236
    Related-bug: #1620268

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/378937
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=b415c05b039302a42dfec58357438585c06073b4
Submitter: Jenkins
Branch: master

commit b415c05b039302a42dfec58357438585c06073b4
Author: Maksim Malchuk <email address hidden>
Date: Wed Sep 28 12:01:27 2016 +0300

    Add retries for neutron-db-sync task

    Exec 'neutron-db-sync' fails without any chance to retry via
    additional puppet runs if it was completed successfully on the first
    run. Adding retries reduces the likelihood that deployment fails.

    Change-Id: I27522de30fc29ef7516e3c9baf36516723ced4a5
    Related-bug: #1620268
    Signed-off-by: Maksim Malchuk <email address hidden>

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Fixes for the db-sync-related failures were proposed to all Puppet modules in the master branch: https://bugs.launchpad.net/puppet-aodh/+bug/1628580

Changed in fuel:
status: Fix Released → Fix Committed
importance: Critical → High
milestone: 9.1 → 10.0
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Added the Newton milestone to the bug due to the several related fixes in the master branch.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

It's not only about retrying db syncs, as it also affects services/tests after deployment, like in https://bugs.launchpad.net/mos/+bug/1628942.

I suggest we re-open this and try to understand the root cause of the split brain. If it's indeed scarce resources, then we should set correct expectations for QA engineers, so that they fix their scripts and do not file a new pile of duplicates.

Revision history for this message
Alisa Tselovalnikova (atselovalnikova) wrote :

This bug affects two swarm tests (jumbo_frames_neutron_vlan, jumbo_frames_neutron_vxlan):
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.jumbo_frames/80/
http://paste.openstack.org/show/584028/
snapshot #339

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

It is with great sadness that I inform you that two swarm tests failed because of performance issues:
* one of the controller nodes was heavily loaded and MySQL didn't respond in time to Pacemaker
* Pacemaker sent SIGTERM to MySQL and started a new instance
* because of the high load, the 'old' MySQL instance wasn't stopped immediately, and the 'new' one detected a split-brain
* once the 'old' MySQL instance was stopped, the MySQL cluster went back to normal
* the tests failed because of performance issues / no retries when a single request to MySQL failed

The MySQL 'split-brain' occurred during the following time intervals:

fail_error_jumbo_frames_neutron_vlan-fuel-snapshot-2016-10-03_06-32-49
----------------------------------------------------------------------

Time: 06:32:47 - 06:33:05
  * 06:32:47 - mysql failed to respond, SIGTERM sent
  * 06:32:47 - new instance started, 'split-brain' detected
  * 06:32:51 - 'old' mysql stopped
  * 06:33:05 - mysql recovered

This test failed because the MySQL instance stopped responding due to high load.
This is NOT a split-brain issue; the root cause is a performance issue.

---
2016-10-03 06:32:49.400 3032 ERROR nova.api.openstack.extensions [req-909b13bc-abd3-46ca-81ce-838c707248b0 5fdd105ee97e4883b4187e307e98dc14 bfeb52a6529e4166a1f182740cb311b3 - - -] Unexpected exception in API method
2016-10-03 06:32:49.400 3032 ERROR nova.api.openstack.extensions Traceback (most recent call last):
...
2016-10-03 06:32:49.400 3032 ERROR nova.api.openstack.extensions DBConnectionError: (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") [SQL: u'SELECT 1']
---

fail_error_jumbo_frames_neutron_vxlan-fuel-snapshot-2016-10-03_07-40-03
-----------------------------------------------------------------------

Time: 07:16:33 - 07:16:53:
  * 07:16:33 - mysql failed to respond, SIGTERM sent
  * 07:16:33 - new instance started, 'split-brain' detected
  * 07:16:38 - 'old' mysql stopped
  * 07:16:51 - mysql recovered

The issue with Cinder occurred because of a deadlock at the same moment MySQL wasn't able to respond properly due to high load.
This is NOT a split-brain issue; the root cause is a performance issue.

---
2016-10-03 07:16:34.136 24163 CRITICAL cinder [-] DBDeadlock: (_mysql_exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'\nALTER TABLE volume_type_projects CHANGE COLUMN deleted deleted INTEGER']
---

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Dima, we agree that this is a performance issue! The thing is that you didn't find the real cause of the degradation: why do our tests work on the same HW with 9.0, while with 9.1 we see performance degradation? What is the real cause of it?

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

We've found that there is intensive swapping at the time of the MySQL failures. Please try adding 2 GB of RAM to each controller.

Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

@Dmitry Teselkin

Which service requires that additional 2 GB? For 9.0 the current capacities were fine.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Nastya, it is not fair to compare 9.x and 10.0 because they differ too much:
1. Different OSes (14.04 Trusty vs 16.04 Xenial)
2. Different packages and package dependencies (see #1)
3. The upstream Puppet modules differ by more than just branches (9.x doesn't have puppet-oslo and puppet-ceph at all)
4. The downstream Puppet modules differ even more (due to #3, for example, and they also have more features not backported to Mitaka)

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Regarding the difference between 9.1 and 9.0: the code contains more backports from 10.0, and the memory requirements have also risen, maybe only a little, but they have.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Dima T., please don't change the assignment without a proper comment, and don't forget comment #59!

@Maksim, I was talking about comparing 9.1 and 9.0, not 9.x/master.

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Guys, I'm sorry, but I'm closing the bug. This bug is about the split-brain issue and is not related to the current performance issues. Please create a new report for your case.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

New bug for the MySQL/Galera performance issue: https://bugs.launchpad.net/fuel/+bug/1630233
Closing the current one.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/378492
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=6c6954df72406d761e49823ae72089ce83f45df7
Submitter: Jenkins
Branch: stable/mitaka

commit 6c6954df72406d761e49823ae72089ce83f45df7
Author: Maksim Malchuk <email address hidden>
Date: Wed Sep 28 12:01:27 2016 +0300

    Add retries for neutron-db-sync task

    Exec 'neutron-db-sync' fails without any chance to retry via
    additional puppet runs if it was completed successfully on the first
    run. Adding retries reduces the likelihood that deployment fails.

    Change-Id: I27522de30fc29ef7516e3c9baf36516723ced4a5
    Related-bug: #1620268
    Signed-off-by: Maksim Malchuk <email address hidden>
