Failed to call refresh: /usr/bin/mysql -uwsrep_sst -ppassword -Nbe "show status like 'wsrep_local_state_comment'"

Bug #1350245 reported by Anastasia Palkina on 2014-07-30
Affects: Fuel for OpenStack
Importance: Critical
Assigned to: Sergii Golovatiuk

Bug Description

"build_id": "2014-07-29_02-01-14",
"ostf_sha": "9c0454b2197756051fc9cee3cfd856cf2a4f0875",
"build_number": "369",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "98ea1ce54f4d084e2533c7ea23aa98551f955ec5",
"production": "docker",
"fuelmain_sha": "2379783517e4830868f66b5fbab512eec4695679",
"astute_sha": "aa5aed61035a8dc4035ab1619a8bb540a7430a95",
"feature_groups": ["mirantis", "experimental"],
"release": "5.1",
"fuellib_sha": "851b29ac6434c1671078ef293bb25a04dc492f49"

1. Create new environment (Ubuntu, HA mode)
2. Choose nova-network, vlan manager
3. Choose Ceph for images
4. Choose Ceilometer
5. Add 3 controllers+mongo, compute, cinder, 2 ceph
6. Start deployment. The deployment fails.
7. puppet.log on the 3rd controller (node-12) contains this error:

2014-07-29 16:42:30 ERR (/Stage[main]/Galera/Exec[wait-initial-sync]) Failed to call refresh: /usr/bin/mysql -uwsrep_sst -ppassword -Nbe "show status like 'wsrep_local_state_comment'" | /bin/grep -q -e Synced -e Initialized && sleep 10 returned 1 instead of one of [0]
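For readers unfamiliar with the check: the Exec pipes the mysql output through grep -q, which exits 1 when neither Synced nor Initialized appears, so the && chain stops and Puppet sees exit code 1. Below is a minimal Python sketch of the same logic. The names state_matches and wait_for_state are hypothetical and not part of fuel-library, and the query function is supplied by the caller, so nothing here talks to a real database.

```python
import time
from typing import Callable, Tuple

# States accepted by the wait-initial-sync check quoted in the log line.
ACCEPTED_STATES: Tuple[str, ...] = ("Synced", "Initialized")


def state_matches(status_output: str,
                  accepted: Tuple[str, ...] = ACCEPTED_STATES) -> bool:
    """Mimic `grep -q -e Synced -e Initialized`: true if any line
    of the output contains any of the accepted words."""
    return any(word in line
               for line in status_output.splitlines()
               for word in accepted)


def wait_for_state(query: Callable[[], str],
                   tries: int = 3,
                   delay: float = 0.0,
                   accepted: Tuple[str, ...] = ACCEPTED_STATES) -> bool:
    """Call `query` up to `tries` times, sleeping `delay` seconds
    between attempts, until the output shows an accepted state."""
    for attempt in range(tries):
        if state_matches(query(), accepted):
            return True
        if attempt < tries - 1:
            time.sleep(delay)
    return False
```

A node that stays in a Joining state for longer than tries * delay makes wait_for_state return False, which corresponds to the "returned 1 instead of one of [0]" failure above.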

Anastasia Palkina (apalkina) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/110710

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
status: New → In Progress
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin)
assignee: Vladimir Kuklin (vkuklin) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin)

Reviewed: https://review.openstack.org/110710
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ce8284294c8a313aa03a322dff72315ae1f1c955
Submitter: Jenkins
Branch: master

commit ce8284294c8a313aa03a322dff72315ae1f1c955
Author: Sergii Golovatiuk <email address hidden>
Date: Wed Jul 30 16:07:01 2014 +0000

    Increase threads for Galera slaves

    Set galera wsrep_threads to at least
    4 or 2*physical_cpus but not more than
    12.

    Low number of slave threads leads to
    primary controller node failing to
    send state transfer and thus to
    deployment failure.

    Change-Id: Ic2cfe8a50ec61a2f64562f0db64fe7a74b5d04a5
    Closes-Bug: 1350245
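
The thread rule described in this commit message can be read as a simple clamp. The sketch below assumes that reading (at least 4, otherwise 2 * physical CPUs, capped at 12). The function name is hypothetical and the actual erb template is not shown in this report.

```python
def galera_slave_threads(physical_cpus: int,
                         floor: int = 4,
                         ceiling: int = 12) -> int:
    """Return the number of Galera slave threads: twice the number of
    physical CPUs, but never below `floor` and never above `ceiling`."""
    return min(ceiling, max(floor, 2 * physical_cpus))


if __name__ == "__main__":
    for cpus in (1, 2, 4, 8):
        print(cpus, galera_slave_threads(cpus))
```

Under this reading, a node with 1 or 2 physical CPUs gets 4 threads, 4 CPUs give 8, and 6 or more CPUs give 12.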

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → In Progress

Reviewed: https://review.openstack.org/110790
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=9a64237c70fc464ef0b11ac7c0bad34e8c202135
Submitter: Jenkins
Branch: master

commit 9a64237c70fc464ef0b11ac7c0bad34e8c202135
Author: Vladimir Kuklin <email address hidden>
Date: Thu Jul 31 01:08:15 2014 +0400

    Fix galera erb template line carriage

    Stupid mistake in erb file did not return the line

    Change-Id: I9ba8eb0807a402f54855b595c018178d93d60c2d
    Closes-bug: #1350245

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/110745
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=6119e1017771a704d0ff6dc84299801d211748b2
Submitter: Jenkins
Branch: master

commit 6119e1017771a704d0ff6dc84299801d211748b2
Author: Vladimir Kuklin <email address hidden>
Date: Wed Jul 30 22:31:26 2014 +0400

    Deploy secondary controllers in portions

    Altering priorities for secondary controllers.
    Do not deploy more than 6 secondary controllers
    simultaneously or galera will not be able to
    digest this amount of slaves for initial state
    transfer of the database.

    Related-bug: #1350245
    Change-Id: I01554a3ad9dac22c655c6f9e97eba4f9444df8ee
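
As an illustration of deploying in portions, the sketch below splits a list of secondary controllers into groups of at most 6. This is not the nailgun implementation, which works by altering priorities and is not shown here. The function name and node names are made up for the example.

```python
from typing import List, Sequence


def controller_batches(nodes: Sequence[str],
                       max_parallel: int = 6) -> List[List[str]]:
    """Split `nodes` into consecutive groups of at most `max_parallel`,
    so that no more than that many join Galera at the same time."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return [list(nodes[i:i + max_parallel])
            for i in range(0, len(nodes), max_parallel)]


if __name__ == "__main__":
    secondaries = [f"node-{n}" for n in range(2, 16)]  # 14 hypothetical nodes
    for batch in controller_batches(secondaries):
        print(len(batch), batch)
```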

Vladimir Kuklin (vkuklin) wrote :
Changed in fuel:
status: Fix Committed → In Progress
Artem Panchenko (apanchenko-8) wrote :

api: '1.0'
astute_sha: b16efcec6b4af1fb8669055c053fbabe188afa67
auth_required: true
build_id: 2014-07-31_10-30-25
build_number: '378'
feature_groups:
- mirantis
fuellib_sha: 9a64237c70fc464ef0b11ac7c0bad34e8c202135
fuelmain_sha: 63d0775708b0f5fa4d6d1e09a316d9c26f7e5444
nailgun_sha: 6119e1017771a704d0ff6dc84299801d211748b2
ostf_sha: b4c5efa51909404fd9ec1d0bbc38a31b200e1d6d
production: docker
release: '5.1'

This issue was reproduced during system tests: http://jenkins-product.srt.mirantis.net:8080/job/fuel_master.ubuntu.bvt_2/217/testReport/junit/(root)/deploy_ha_vlan/deploy_ha_vlan/

Deployment of the 3rd controller (node-5) failed. Here is the relevant part of the puppet logs:

 (/Stage[main]/Galera/Exec[wait-for-synced-state]/returns) change from notrun to 0 failed: /usr/bin/mysql -uwsrep_sst -ppassword -Nbe "show status like 'wsrep_local_state_comment'" | /bin/grep -q Synced && sleep 10 returned 1 instead of one of [0]

root@node-5:~# mysql 2>/dev/null -uwsrep_sst -ppassword -Nbe "show status like 'wsrep_local_state_comment'"
+---------------------------+-----------------------------------+
| wsrep_local_state_comment | Joining: receiving State Transfer |
+---------------------------+-----------------------------------+

MySQL logs from all controllers are attached.

Artem Panchenko (apanchenko-8) wrote :
OSCI Robot (oscirobot) wrote :

Package MySQL-wsrep has been built from changeset: http://gerrit.mirantis.com/19736
RPM Repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-5.1-stable-19736/centos
You can build an ISO with this package:
make iso EXTRA_RPM_REPOS="osci-testing,http://osci-obs.vm.mirantis.net:82/centos-fuel-5.1-stable-19736/centos"

OSCI Robot (oscirobot) wrote :

Package MySQL-wsrep has been built from changeset: http://gerrit.mirantis.com/19739
DEB Repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.1-stable-19739/ubuntu
You can build an ISO with this package:
make iso EXTRA_DEB_REPOS="http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.1-stable-19739/ubuntu /"

OSCI Robot (oscirobot) wrote :

Package MySQL-wsrep has been built from changeset: http://gerrit.mirantis.com/19736
RPM Repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-5.1-stable/centos
You can build an ISO with this package:
make iso EXTRA_RPM_REPOS="osci-testing,http://osci-obs.vm.mirantis.net:82/centos-fuel-5.1-stable/centos"

OSCI Robot (oscirobot) wrote :

Package MySQL-wsrep has been built from changeset: http://gerrit.mirantis.com/19739
DEB Repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.1-stable/ubuntu
You can build an ISO with this package:
make iso EXTRA_DEB_REPOS="http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.1-stable/ubuntu /"

Vladimir Kuklin (vkuklin) wrote :

Fixed by renicing the mysqldump and mysql calls made by wsrep_sst_mysqldump.

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → In Progress

The issue is still present. I was able to reproduce it in my environment.

The root cause analysis (RCA) of the issue is as follows:

/var/log/mysql/error.log
2014-08-01 11:15:57 3683 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted

/var/log/mysqld.log
<27>Aug 1 11:15:48 node-2 mysql-wss[3767]: ERROR: GTID have wrong format: wsrep_local_state_uuid:18446744073709551615
<27>Aug 1 11:15:48 node-2 mysql-wss[3767]: ERROR: Wrong GTID, not updating gtid attribute
<27>Aug 1 11:16:48 node-2 mysql-wss[4430]: ERROR: MySQL not running: removing old PID file

/var/log/mysql/error.log
2014-08-01 11:17:24 7790 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted

/var/log/mysqld.log
<27>Aug 1 11:17:21 node-2 mysql-wss[7882]: ERROR: GTID have wrong format: wsrep_local_state_uuid:18446744073709551615
<27>Aug 1 11:17:21 node-2 mysql-wss[7882]: ERROR: Wrong GTID, not updating gtid attribute
<27>Aug 1 11:18:22 node-2 mysql-wss[8577]: ERROR: MySQL not running: removing old PID file

/var/log/mysql/error.log
2014-08-01 11:18:57 11783 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted

/var/log/mysqld.log
<27>Aug 1 11:18:54 node-2 mysql-wss[11874]: ERROR: GTID have wrong format: wsrep_local_state_uuid:18446744073709551615
<27>Aug 1 11:18:54 node-2 mysql-wss[11874]: ERROR: Wrong GTID, not updating gtid attribute
<27>Aug 1 11:19:55 node-2 mysql-wss[12593]: ERROR: MySQL not running: removing old PID file

/var/log/mysql/error.log
2014-08-01 11:22:03 20410 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted

/var/log/mysqld.log
<27>Aug 1 11:22:00 node-2 mysql-wss[20516]: ERROR: GTID have wrong format: wsrep_local_state_uuid:18446744073709551615
<27>Aug 1 11:22:00 node-2 mysql-wss[20516]: ERROR: Wrong GTID, not updating gtid attribute
<27>Aug 1 11:23:01 node-2 mysql-wss[22166]: ERROR: MySQL not running: removing old PID file

Comparing the MySQL log with the Pacemaker log shows that Pacemaker cannot validate the GTID, because mysqld is in the SST/IST state and does not have a proper GTID yet. In this case Pacemaker should also check whether MySQL is in the SST/IST state and, if it is, skip the check until the transfer is done.
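
A note on the rejected value: 18446744073709551615 is 2^64 - 1, i.e. -1 read as an unsigned 64-bit integer, which Galera uses for an undefined seqno, and the uuid part in the log is the literal variable name rather than a UUID. The sketch below shows one way such a uuid:seqno pair could be validated. It is an assumption about what a format check might look like, not the code of the OCF script, and parse_gtid is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)
# -1 interpreted as an unsigned 64-bit integer.
UNDEFINED_SEQNO = 2**64 - 1


def parse_gtid(gtid: str) -> Optional[Tuple[str, int]]:
    """Return (uuid, seqno) for a well-formed 'uuid:seqno' string,
    or None if the uuid is malformed or the seqno is undefined."""
    uuid, sep, seqno = gtid.rpartition(":")
    if not sep or not UUID_RE.match(uuid) or not seqno.isdigit():
        return None
    number = int(seqno)
    if number == UNDEFINED_SEQNO:
        return None
    return uuid, number
```

With this check, the string from the log above is rejected both because the uuid part is not a UUID and because the seqno is the undefined value.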

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Sergii Golovatiuk (sgolovatiuk)

Reviewed: https://review.openstack.org/111375
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b283b191b029f0d8165c8ae46fdb833d56b1625f
Submitter: Jenkins
Branch: master

commit b283b191b029f0d8165c8ae46fdb833d56b1625f
Author: Sergii Golovatiuk <email address hidden>
Date: Fri Aug 1 19:35:00 2014 +0000

    Add Desync/Donor states to Pacemaker OCF script

    - Increase timeouts for OCF script as we don't need to check every
      minute, HAProxy clustercheck script verifies MySQL every second.
      Even if MySQL got Desync it will be dropped out from HAProxy,
      until it will be Synced.
    - Add Donor/Desync to monitor as on slow hardware or when DB is
      large (in production). Monitor function verified only Synced
      state, and terminated mysqld during SST/IST (mysqldump). After
      change OCF should verify states correctly and not to kill mysqld.
      HAProxy relies on clustercheck script waiting for Synced state
      until it adds MySQL node to production.
    - Replace --wsrep-cluster-address=gcomm://
      to --wsrep-new-cluster

    Change-Id: I0a659c72ab80a33dcdc1c0e26de12c2fd88c75be
    Closes-Bug: 1350245
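
The monitor behaviour described in this commit message can be sketched as a small decision function. The real OCF script is a shell script and is not reproduced here. The function name, the return values, and the inclusion of the Joining and Joined states are assumptions made for illustration.

```python
# wsrep_local_state_comment prefixes that indicate the node is busy with
# a state transfer (assumed list; the commit names Donor/Desync only).
TRANSFER_PREFIXES = ("Joining", "Donor/Desynced", "Joined")


def monitor_action(state_comment: str, mysqld_running: bool) -> str:
    """Decide what a monitor could do for a given node state:
    'ok'      - node is Synced,
    'wait'    - node is in a transfer state, leave mysqld alone,
    'restart' - mysqld is down or in an unexpected state."""
    if not mysqld_running:
        return "restart"
    state = state_comment.strip()
    if state.startswith("Synced"):
        return "ok"
    if state.startswith(TRANSFER_PREFIXES):
        return "wait"
    return "restart"


if __name__ == "__main__":
    print(monitor_action("Joining: receiving State Transfer", True))
```

The point of the 'wait' branch is that a node receiving SST/IST is not terminated just because it is not Synced yet.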

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Confirmed
Anastasia Palkina (apalkina) wrote :

Reproduced on ISO #464
"build_id": "2014-08-21_02-01-17",
"ostf_sha": "c6ecd0137b5d7c1576fa65baef0fc70f9a150daa",
"build_number": "464",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "25eba6fbb2047f26d9da4d27ffdb742c9c27832a",
"production": "docker",
"fuelmain_sha": "25a0c228d998707f90e90877559f17817a749d2f",
"astute_sha": "efe3cb3668b9079e68fb1534fd4649ac45a344e1",
"feature_groups": ["mirantis"],
"release": "5.1",
"fuellib_sha": "52f3ebfa968f0338e0584edf47cff10911109de5"

1. Create new environment (Ubuntu, HA mode)
2. Choose neutron, vlan
3. Choose both Ceph
4. Add 3 controller+ceph, 1 compute
5. Start deployment. The deployment fails because the deployment timeout is exceeded.

There is an error on the 3rd controller (node-7):

2014-08-21 12:07:47 ERR (/Stage[main]/Galera/Exec[wait-initial-sync]) Failed to call refresh: /usr/bin/mysql -uwsrep_sst -ppassword -Nbe "show status like 'wsrep_local_state_comment'" | /bin/grep -q -e Synced -e Initialized && sleep 10 returned 1 instead of one of [0]

Anastasia Palkina (apalkina) wrote :