Mysql crashing preventing deployments

Bug #1624013 reported by Bob Ball
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Linux
Milestone: 9.2

Bug Description

Detailed bug description:
 A MOS 9 environment cannot deploy, potentially due to mysql crashing. (Edit: it seems that the mysql crash from /var/log/mysql/error.log could be unrelated to the failures to deploy due to mysql being inaccessible.)

Puppet logs for the failed controller say:
 (/Stage[main]/Cluster::Mysql/Exec[wait-initial-sync]) mysql -uclustercheck -pOObsCqCTtkLkRHK52n0H0N8O -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && sleep 10 returned 1 instead of one of [0]
 (/Stage[main]/Cluster::Mysql/Exec[wait-initial-sync]) Failed to call refresh: mysql -uclustercheck -pOObsCqCTtkLkRHK52n0H0N8O -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && sleep 10 returned 1 instead of one of [0]
 (/Stage[main]/Cluster::Mysql/Exec[wait-initial-sync]/returns) ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)

Logging in to failed controller node shows that, indeed, mysql is not running:
root@node-4:~# service mysql status
mysql stop/waiting

/var/log/mysql/error.log is attached, and shows a segfault occurring, possibly from the wsrep post commit function:

14:22:43 UTC - mysqld got signal 11 ;
stack_bottom = 7fbebb747e88 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0x7fbebbf81b7c]
/usr/sbin/mysqld(handle_fatal_signal+0x3c2)[0x7fbebbcd25c2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fbeba9c7330]
/usr/sbin/mysqld(thd_get_ha_data+0xc)[0x7fbebbd1f54c]
/usr/sbin/mysqld(_Z20thd_binlog_trx_resetP3THD+0x2e)[0x7fbebbf2c79e]
/usr/sbin/mysqld(_Z17wsrep_post_commitP3THDb+0xcc)[0x7fbebbe0c32c]
/usr/sbin/mysqld(_Z12trans_commitP3THD+0x6f)[0x7fbebbdf2dcf]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x38d1)[0x7fbebbd60851]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x3c8)[0x7fbebbd649d8]
/usr/sbin/mysqld(+0x508c24)[0x7fbebbd64c24]
/usr/sbin/mysqld(_Z19do_handle_bootstrapP3THD+0x111)[0x7fbebbd64ff1]
/usr/sbin/mysqld(+0x509060)[0x7fbebbd65060]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fbeba9bf184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fbeba0e237d]

Steps to reproduce:
 Not sure which steps are needed, but my environment has:
  3x Controller (4 CPU, 6GB RAM, 80GB HDD)
  3x Qemu Compute/Cinder/Ceph-OSD (2 CPU, 1GB RAM, 50GB HDD)

 Each host has two interfaces - PXE (eth0) and a VLAN network (eth1).
 Public network is on a VLAN over eth1, and Neutron is also configured to use VLANs

This has been reproduced several times on different hardware and with different Fuel 9 installations, with different Ubuntu repositories and with the XenServer plugin disabled as well as enabled. MD5 sum of ISO has been confirmed:

# md5sum MirantisOpenStack-9.0.iso
07461ba42d5056830dd6f203e8fe9691 MirantisOpenStack-9.0.iso

Expected results:
 Deployment succeeds :)

Actual result:
 Deployment fails with the above errors :)

Reproducibility:
 Deployments are occasionally successful, but once a deployment has succeeded it is not possible to add a new controller node; adding a controller fails 100% of the time.

Workaround:
 None known

Impact:
 Fatal; it is also preventing validation of the XenServer plugin for MOS 9, as this issue occurs with the plugin installed as well.

Fuel snapshot is attached, with a second snapshot (with the XenServer plugin enabled) at https://citrix.sharefile.com/d-s4a809f3542947818

Revision history for this message
Bob Ball (bob-ball) wrote :

Updated snapshot from a deployment which failed to add a new controller node following a successful deploy.

Changed in fuel:
importance: Undecided → High
milestone: none → 9.1
tags: added: customer-found
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: nobody → MOS Linux (mos-linux)
Changed in fuel:
status: New → Confirmed
milestone: 9.1 → 9.2
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Now we have the newer version of mysql-wsrep (5.6.33) in repos.
It should fix this issue.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Jianghua Wang (wjh-fresh) wrote :

Thanks for the fix.
Where can we get the newer version of "mysql-wsrep (5.6.33)" so that we can proceed with our XenServer fuel plugin test? It doesn't seem to be in the mos-repos: http://mirror.fuel-infra.org/mos-repos/ubuntu/master/pool/main/m/mysql-wsrep-5.6/

Revision history for this message
Dmitry Burmistrov (dburmistrov) wrote :

You can find the new mysql-wsrep here:
for 9.0 - http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-latest/pool/main/m/mysql-wsrep-5.6/ (still in proposed; will be available in updates with MU)
for xenial/master - http://mirror.fuel-infra.org/mos-repos/xenial/master/pool/main/m/mysql-wsrep-5.6/

Revision history for this message
Bob Ball (bob-ball) wrote :

Thanks Dmitry - do you know what triggered this issue? Why are we hitting it now when previous MOS9 deployments succeeded?
Is there a workaround that we can use internally, or are all MOS 9 deployments globally broken until this MU is released?

Revision history for this message
Bob Ball (bob-ball) wrote :

We've been investigating this proposed fix, and we're not convinced it is the right fix. We now believe that this is a race condition / ordering issue - if we manually run the "mysql -uclustercheck -p50cUWiAnI86oXBhx7OcthBer -Nbe "show status like 'wsrep_local_state_comment'"" after a failed deployment, the command returns successfully. As previously mentioned, a second attempt to deploy does often succeed as well (but deploying a new controller to an existing environment has never succeeded).

Even with successful deployments, two of the symptoms I originally described still appear (service mysql status always shows it is stopped, and /var/log/mysql/error.log includes this error). We also noticed that the same mysql package is used in MOS 8 and MOS 7.

We've been unable to test the proposed fix as our attempts to regenerate the local package cache with the replacement package are failing with verification errors.

How can we prove that this proposed fix works in our environment?

Reverting to 'Confirmed' for further input, and requesting confirmation that, despite the above concerns, Mirantis is convinced that the mysql package upgrade will definitely fix this issue.

Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

Hi,

Let me comment on the steps:

> service mysql status

MySQL/Galera is not controlled or run as a SysV/Upstart/systemd service; it is controlled by Pacemaker.

pcs resource show clone_p_mysqld

is the proper command to check the status of the Galera cluster.
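For completeness, a hedged sketch of other ways to look at the Galera resource state via Pacemaker (these commands assume the standard Fuel/Pacemaker resource names mentioned above and are not taken from this report):

# runtime state of the Pacemaker-managed resources, filtered to mysql
pcs status resources | grep -i mysql
# one-shot view of the whole cluster, including the mysqld clone
crm_mon -1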

> 14:22:43 UTC - mysqld got signal 11 ;

It's a known issue where mysqld references a file descriptor which doesn't exist anymore. However, it happens only on the first run of mysqld, when the db structure (/var/lib/mysql) has not been initialized by mysql_install_db. Here is a patch to MySQL that I tried to push to the community:

 diff --git a/sql/sql_class.cc b/mysql-wsrep-5.6/sql/sql_class.cc
 index 79d56ad..ac55818 100644
 --- a/sql/sql_class.cc
 +++ b/sql/sql_class.cc
 @@ -665,6 +665,7 @@ void thd_storage_lock_wait(THD *thd, long long value)
  extern "C"
  void *thd_get_ha_data(const THD *thd, const struct handlerton *hton)
  {
 +  if (hton->slot == -1) return NULL;
    return *thd_ha_data(thd, hton);
  }

Though it has not been accepted, as the case is very rare and happens only when mysqld gets SIGTERM before the database structure is initialized.

Revision history for this message
Bob Ball (bob-ball) wrote :

Great - thanks for explaining the pacemaker integration.

So is the assertion that the majority of MOS 9 deployments will currently fail, because something (what?) has happened to make this rare race condition almost certain to occur?

Or does this issue not affect every MOS 9 deployment?
If it does not affect the majority of MOS 9 deployments, why has it recently started to affect all of ours?

Also, as above, how can we prove that this proposed fix works in our environment?

Bob Ball (bob-ball)
description: updated
description: updated
Revision history for this message
Jianghua Wang (wjh-fresh) wrote :

Some more observations from my side:

The problem is that the following command:
mysql -uclustercheck -pOObsCqCTtkLkRHK52n0H0N8O -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && sleep 10

results in this error:
(HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)

If we check the environment just after the deployment failure, we can get the same error by manually running the mysql command. But the error will go away after a while - actually, we need to wait a long while (not sure of the exact time, but >1 hour in one of my tests).

When a new controller node joins, it seems mysql will try to transfer a State Snapshot from the other existing controllers. During this transfer period, mysql does not provide regular service. That may have resulted in the error.

But I have no idea so far why the SST takes longer than expected. Any idea on what we can do for further troubleshooting?

root@node-25:~# pstree -pas 17159
init,1
  └─mysqld_safe,17159 /usr/bin/mysqld_safe --pid-file=/var/run/resource-agents/mysql-wss/mysql-wss.pid --socket=/var/run/mysqld/mysqld.sock ...
      ├─logger,18113 -t mysqld -p daemon.error
      └─mysqld,18112 --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --open-files-limit=102400...
          ├─sh,18257 -c...
          │ └─wsrep_sst_xtrab,18258 -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 192.168.0.7 --auth root:iM9VWfY1anaR5ZXWconKfZq3 ...
          │ ├─socat,18530 -u TCP-LISTEN:4444,reuseaddr stdio
          │ └─xbstream,18531 -x
          ├─{mysqld},18114
          ├─{mysqld},18252
          ├─{mysqld},18253
          ├─{mysqld},18254
          ├─{mysqld},18255
          └─{mysqld},18256
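For reference, a hedged sketch of how one might watch SST progress on the joiner (the credentials and log file name are assumptions based on the log and pstree output above; adjust to the environment):

# poll the wsrep state; it should eventually change to 'Synced'
mysql -uclustercheck -p<password> -Nbe "show status like 'wsrep_local_state_comment'"
# check whether the SST helper processes are still running
ps -ef | grep -E 'wsrep_sst|xbstream|socat' | grep -v grep
# joiner-side xtrabackup SST log, if the datadir contains one
tail -f /var/lib/mysql/innobackup.prepare.log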

Revision history for this message
Jianghua Wang (wjh-fresh) wrote :

Yeah, I just finished a test to confirm the above statement on SST.

The status check passes just after the SST process terminates/finishes, and the process took ~72 minutes. So the next question that needs to be answered is why it takes so long to finish SST.

LOG:
root@node-25:~# mysql -uclustercheck -p50cUWiAnI86oXBhx7OcthBer -Nbe "show status like 'wsrep_local_state_comment'"
Warning: Using a password on the command line interface can be insecure.
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+

mysql 18257 18112 0 07:28 ? 00:00:00 sh -c wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.0.7' --auth 'root:iM9VWfY1anaR5ZXWconKfZq3' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '18112' ''

The termination time was Thu Sep 22 08:40:41 UTC 2016.

Revision history for this message
Jianghua Wang (wjh-fresh) wrote :

During the SST period, there is plenty of free CPU/disk IO/memory on both the 'joiner' and 'donor' controller nodes. So the long run time does not seem to be related to those resources.

Revision history for this message
Jianghua Wang (wjh-fresh) wrote :

This may be a workaround: after adding a controller fails, wait for mysql to finish SST, then re-deploy the environment. It will then pass successfully.
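A rough sketch of that wait as a polling loop (credentials as quoted earlier in this report; the 60-second interval is an arbitrary assumption):

# block until the joiner reports Synced, then retry the deployment
until mysql -uclustercheck -p<password> -Nbe "show status like 'wsrep_local_state_comment'" | grep -q Synced; do
    sleep 60
done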

Revision history for this message
Jianghua Wang (wjh-fresh) wrote :

We can see that the mysql service is not restarted during the re-deployment (see the process start-up time). So there is no fresh start, which means there is no SST during the second re-deployment.
root@node-25:~# date
Thu Sep 22 09:30:48 UTC 2016
root@node-25:~# ps -ef | grep mysqld_safe
root 11742 7377 0 09:30 pts/3 00:00:00 grep --color=auto mysqld_safe
root 17159 1 0 07:28 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --pid-file=/var/run/resource-agents/mysql-wss/mysql-wss.pid --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql --user=mysql

Revision history for this message
Jianghua Wang (wjh-fresh) wrote :

Sergii and Dmitry: we tested in a single-host environment with 3 controller nodes + 1 compute node, then removed one controller and added it back. It worked well, and the SST finished quickly - in one or two minutes. Do you have any suggestions on what could cause this difference? Thanks.

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

Please provide more details:

* exact lab configuration - what HW (CPU, RAM, disks, etc.) you use, what hypervisor you use, how the HW nodes are connected to each other, etc.

* in the two snapshots you've provided I didn't find logs for nodes other than those where you found errors in puppet.log. If you removed the logs for other nodes - please don't do this in the future. If not - it looks like an issue with the snapshots and should be reported.

* is it correct that you are still using the "old" mysql and weren't able to install an updated version? If so, what prevented you from installing the new version? What errors occurred?

Revision history for this message
Bob Ball (bob-ball) wrote :

1) We've tried this with a variety of different hardware and it's always reproduced. Typically we're giving the controller nodes 80-100GB disk and 6GB RAM and 4 CPUs. Hardware is connected through an integrated switch (e.g. virtual connect) with VLANs defined by the switch.

2) No modifications of the snapshot archives were made - I've verified that the logs are indeed included. For example:
C:\Users\Bob\Downloads\fuel-snapshot-2016-09-15_16-56-42\fuel.domain.tld\var\log\remote\node-4.domain.tld\puppet-apply.log contains failures such as the following (the check is attempted 3 times with a 10-minute timeout):
2016-09-15T14:33:06.393264+00:00 err: (/Stage[main]/Cluster::Mysql/Exec[wait-initial-sync]) Failed to call refresh: mysql -uclustercheck -pOObsCqCTtkLkRHK52n0H0N8O -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && sleep 10 returned 1 instead of one of [0]

As this is referring to node-4, I also checked some other node-4 logs at the offending time and mysqld.log contains failures such as:
2016-09-15T14:25:24.509238+00:00 err: 2016-09-15 14:25:24 17678 [ERROR] WSREP: Requesting state transfer failed: -77(File descriptor in bad state)
2016-09-15T14:25:24.509238+00:00 err: 2016-09-15 14:25:24 17678 [ERROR] WSREP: State transfer request failed unrecoverably: 77 (File descriptor in bad state). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2016-09-15T14:25:24.509238+00:00 err: 2016-09-15 14:25:24 17678 [Note] WSREP: /usr/sbin/mysqld: Terminated.

The mysqld log file actually contains a *lot* of errors here - but I don't understand well enough what it's trying to do.

3) We are using the mysql included in the ubuntu repositories in MOS9 - e.g.:
/var/www/nailgun/mitaka-9.0/ubuntu/x86_64/pool/main/m/mysql-wsrep-5.6/mysql-wsrep-common-5.6_5.6.23-1~u14.04+mos1_all.deb

MOS9 of course pulls mysql from this location to install on the nodes and therefore this location would need to be updated. We have tried a few methods to replace mysql files in this directory but get package verification errors when attempting to regenerate the contents. Advice for how to update this for MOS9 would be greatly appreciated.
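For what it's worth, a hedged sketch of the kind of repository regeneration we have been attempting (the dists/<suite> metadata paths and the signing step are assumptions and may not match Fuel's own tooling):

cd /var/www/nailgun/mitaka-9.0/ubuntu/x86_64
cp /path/to/mysql-wsrep-*5.6.33*.deb pool/main/m/mysql-wsrep-5.6/
# rebuild the package index and release metadata
dpkg-scanpackages --multiversion pool /dev/null > dists/<suite>/main/binary-amd64/Packages
gzip -9c dists/<suite>/main/binary-amd64/Packages > dists/<suite>/main/binary-amd64/Packages.gz
apt-ftparchive release dists/<suite> > dists/<suite>/Release
# if the Release file is GPG-signed, it must be re-signed with the repo key,
# otherwise nodes fail package verification (which matches the errors we see)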

Revision history for this message
Bob Ball (bob-ball) wrote :

Embarrassingly, we may have just discovered the root cause. More thorough testing is required, but it looks like there is something *very* strange going on with networking in our environment. We’ve been using physical VLANs for management/storage/neutron and it seems that somehow these VLANs are not behaving correctly.
Basic testing shows everything working; however, when used in the MOS deployment environment, this very slow mysql SST behaviour is seen.

We have tested creating GRE tunnels to host the management/storage/neutron networks and are unable to reproduce this issue. Our current belief is that there is some unusual behaviour in the NICs or virtual connect switch between the hosts which is somehow causing this effect.
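For context, the GRE test setup was roughly the following hedged sketch (interface names, addresses and MTU are illustrative assumptions, not the actual lab values):

# on host A; mirror on host B with local/remote swapped
ip tunnel add gre-mgmt mode gre local 10.20.0.1 remote 10.20.0.2 ttl 255
ip link set gre-mgmt mtu 1400 up
ip addr add 172.16.1.1/24 dev gre-mgmt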

We’re just at the start of understanding this, and don’t have any real answers yet, but I wanted to let you know that we’re now thinking this is hardware related but do not yet understand it, or why the reproduction also occurs on multiple chassis from different vendors.

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

The issue seems to be hardware-related, so I'm moving it to 'Incomplete' now, to be invalidated in a month.
Anyway, it would be great to get the result of further investigation, if possible.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Bob Ball (bob-ball) wrote :

We still don't have real answers, but some of the hardware this was reproducible on is HP BL480 G6 gear - so pretty old. There may be some weird incompatibility between this particular hardware and the NIC driver? We've managed to produce some effects I don't understand yet, such as pause flooding and loop detection triggering when we're using the physical VLANs.
We're going to investigate whether there are similar effects on the newer Dell kit on which this was also reproducible but the machines are currently in use by another project, so that investigation may take some time.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Invalid because the bug was in Incomplete status for more than a month. Please reopen if you get additional information.

Changed in fuel:
status: Incomplete → Invalid