Fuel for OpenStack

Mysql has become unmanaged by pacemaker on controller after mysql termination

Bug #1388771 reported by Andrey Sledzinskiy on 2014-11-03

This bug affects 4 people

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Released	High	Bogdan Dobrelya	Fuel for OpenStack 7.0
5.1.x	Won't Fix	High	Fuel Library (Deprecated)	Fuel for OpenStack 5.1.1
6.0.x	Won't Fix	High	Fuel Library (Deprecated)	Fuel for OpenStack 6.0
6.1.x	Fix Released	High	Denis Puchkin	Fuel for OpenStack 6.1-mu-4
7.0.x	Fix Released	High	Bogdan Dobrelya	Fuel for OpenStack 7.0

Bug Description

{
    "build_id": "2014-10-31_00-01-50",
    "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
    "build_number": "7",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "28f54f91fae722c26d3c0f6b2cf34bbf95a2be03",
    "production": "docker",
    "fuelmain_sha": "117540694173d7a20ba022e091c469b4da1666e1",
    "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13",
    "feature_groups": [
        "mirantis"
    ],
    "release": "5.1.1",
    "release_versions": {
        "2014.1.1-5.1.1": {
            "VERSION": {
                "build_id": "2014-10-31_00-01-50",
                "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
                "build_number": "7",
                "api": "1.0",
                "nailgun_sha": "28f54f91fae722c26d3c0f6b2cf34bbf95a2be03",
                "production": "docker",
                "fuelmain_sha": "117540694173d7a20ba022e091c469b4da1666e1",
                "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "5.1.1",
                "fuellib_sha": "0686b8ba1cb6f05c8fd611bb4f2fcdf800d7331c"
            }
        }
    },
    "fuellib_sha": "0686b8ba1cb6f05c8fd611bb4f2fcdf800d7331c"
}

Steps:
1. Create and deploy the following cluster: CentOS, HA, Nova-network flat, 3 controllers, 2 compute nodes
2. Terminate mysql on the first controller with: pkill -9 "mysqld_safe|mysqld"
3. Wait until mysql starts on the controller, i.e. until [ -r /var/run/mysql/mysqld.pid ] && pkill -0 -F /var/run/mysql/mysqld.pid returns exit code 0
4. Do the same steps for the second controller
5. Do the same steps for the third controller (node-2 - primary)

Expected - mysql is started by pacemaker
Actual - mysql wasn't started by pacemaker. It is in the unmanaged state
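
The wait in step 3 can be sketched as a small polling helper. This is only an illustration, not the system test code: the function name wait_for_pid and the 300-second default timeout are assumptions, and it uses kill -0 on the pid read from the file instead of the pkill -0 -F form quoted in the steps.

```shell
#!/bin/sh
# Hypothetical helper (not part of Fuel): poll until the pidfile is readable
# and the pid stored in it names a live process, or give up after a timeout.
# usage: wait_for_pid <pidfile> [timeout_seconds]
wait_for_pid() {
    pidfile=$1
    timeout=${2:-300}   # assumed default, not taken from the bug report
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Same idea as the check in step 3: pidfile readable and pid alive.
        if [ -r "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

On a controller this could be used right after step 2, for example: wait_for_pid /var/run/mysql/mysqld.pid 300 || echo "mysql did not come back"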

[root@node-2 ~]# crm status
Last updated: Mon Nov 3 11:55:55 2014
Last change: Mon Nov 3 11:55:48 2014 via crm_attribute on node-4.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-2.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
15 Resources configured

Online: [ node-2.test.domain.local node-3.test.domain.local node-4.test.domain.local ]

vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
Clone Set: clone_ping_vip__public_old [ping_vip__public_old]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-4.test.domain.local ]
Clone Set: clone_p_mysql [p_mysql]
     p_mysql (ocf::mirantis:mysql-wss): FAILED node-2.test.domain.local (unmanaged)
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-3.test.domain.local node-4.test.domain.local ]
Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-4.test.domain.local ]
p_openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-2.test.domain.local

Failed actions:
    p_mysql_stop_0 on node-2.test.domain.local 'unknown error' (1): call=170, status=complete, last-rc-change='Sun Nov 2 06:08:52 2014', queued=20ms, exec=0ms
    p_mysql_monitor_120000 on node-4.test.domain.local 'not running' (7): call=95, status=complete, last-rc-change='Sun Nov 2 06:08:28 2014', queued=0ms, exec=0ms
    p_mysql_monitor_120000 on node-3.test.domain.local 'not running' (7): call=92, status=complete, last-rc-change='Sun Nov 2 06:06:07 2014', queued=0ms, exec=0ms
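
A quick way to spot this state in a status dump like the one above is to filter for the "(unmanaged)" marker. The helper below is a sketch under the assumption that the line layout matches the output shown in this report (resource name first, node name just before the marker); list_unmanaged is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read `crm status` text on stdin and print
# "<resource> <node>" for every line flagged as (unmanaged).
# Assumes the layout shown in this report:
#   p_mysql (ocf::...): FAILED node-2.test.domain.local (unmanaged)
list_unmanaged() {
    awk '/\(unmanaged\)/ { print $1, $(NF-1) }'
}
```

Usage on a controller would be: crm status | list_unmanaged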

Logs are attached

Andrey Sledzinskiy (asledzinskiy) wrote on 2014-11-03:

fail_error_ha_mysql_termination-2014_11_02__06_14_52.tar.gz (6.5 MiB, application/x-tar)

Andrey Sledzinskiy (asledzinskiy) on 2014-11-03

summary:

- [System Tests] ha-mysql-termination test fails with timeout - MySQL
- daemon was down on 3-rd controller after kill
+ Mysql has become unmanaged by pacemaker on controller after mysql
+ termination

Andrey Sledzinskiy (asledzinskiy) on 2014-11-03

description:	updated
Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)

Bogdan Dobrelya (bogdando) on 2014-11-03

Changed in fuel:
status:	New → Confirmed
tags:	added: ha pacemaker

Andrey Sledzinskiy (asledzinskiy) wrote on 2014-11-03:

The same issue on 6.0 ISO
{
    "build_id": "2014-11-02_21-27-58",
    "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa",
    "build_number": "69",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129",
    "production": "docker",
    "fuelmain_sha": "ac3ba5f5c6073b7776ec69fc3cb4dd3c56df36c5",
    "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf",
    "feature_groups": [
        "mirantis"
    ],
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "build_id": "2014-11-02_21-27-58",
                "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa",
                "build_number": "69",
                "api": "1.0",
                "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129",
                "production": "docker",
                "fuelmain_sha": "ac3ba5f5c6073b7776ec69fc3cb4dd3c56df36c5",
                "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "6.0",
                "fuellib_sha": "45ad9b42666d7e3e14ab9af2911808e6c8806842"
            }
        }
    },
    "fuellib_sha": "45ad9b42666d7e3e14ab9af2911808e6c8806842"
}

mysql on node-2 is unmanaged again
Logs are attached

Andrey Sledzinskiy (asledzinskiy) wrote on 2014-11-03:

fail_error_ha_haproxy_termination-2014_11_03__04_44_40.tar.gz (8.8 MiB, application/x-tar)

Matthew Mosesohn (raytrac3r) wrote on 2014-11-14:

This doesn't fall in line with our usual HA style where single failures are tolerated. This is simultaneous failure of all mysql in a non-standard way. Lowering priority because it doesn't affect normal deployments.

Vladimir Kuklin (vkuklin) on 2014-11-17

Changed in fuel:
importance:	High → Medium
no longer affects:	fuel/6.1.x
Changed in fuel:
milestone:	6.0 → 6.1
tags:	added: release-notes

Vladimir Kuklin (vkuklin) on 2015-03-31

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)

Sergii Golovatiuk (sgolovatiuk) on 2015-04-02

Changed in fuel:
status:	Confirmed → Won't Fix

Bogdan Dobrelya (bogdando) wrote on 2015-07-31:

I'm confirming this issue because too many duplicates have been reported so far.
Normally, a pacemaker resource should never end up unmanaged after any type of failure, AFAIK.
This might be a flaw in the OCF logic or in the resource parameters.

Changed in fuel:
importance:	Medium → High
assignee:	Sergii Golovatiuk (sgolovatiuk) → MOS Sustaining (mos-sustaining)

Bogdan Dobrelya (bogdando) wrote on 2015-07-31:

I'm pretty sure the RC (root cause) is the same for all of the duplicates, despite the different repro steps.
It should be somehow related to the kill -9 execution logic.

Bogdan Dobrelya (bogdando) wrote on 2015-08-03:

Note, pacemaker renders a resource unmanaged only if it has failed to stop and there is no stonith configured. So, the issue must be related to failed stop actions of the p_mysql resource.
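
The rule in the note above can be written down as a toy decision function. This is a model for illustration only, not Pacemaker code; it assumes the default failure policy for the stop operation (fence the node when stonith is enabled, otherwise block the resource, which shows up as "unmanaged"). The name stop_outcome and its arguments are hypothetical.

```shell
#!/bin/sh
# Toy model (not Pacemaker code) of what happens after a stop action,
# assuming the default on-fail policy for stop.
# usage: stop_outcome <stop_rc> <stonith_enabled: true|false>
stop_outcome() {
    stop_rc=$1
    stonith=$2
    if [ "$stop_rc" -eq 0 ]; then
        echo "stopped"        # stop succeeded, resource can be started again
    elif [ "$stonith" = "true" ]; then
        echo "node fenced"    # failed stop with stonith: the node is fenced
    else
        echo "unmanaged"      # failed stop without stonith: resource is blocked
    fi
}
```

With the values from this report (p_mysql_stop_0 returned 1, no stonith configured), stop_outcome 1 false prints "unmanaged".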

Bogdan Dobrelya (bogdando) on 2015-08-04

tags:	added: tricky

Sergey Yudin (tsipa740) wrote on 2015-08-04:

I have tried to reproduce it with

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "132"
  build_id: "2015-08-03_22-04-28"
  nailgun_sha: "d1536c3a57459e205e39bc4d86d2b4efc6141c4e"
  python-fuelclient_sha: "4fe70fb5c0ce8905ae5908f63d45b45e89a99340"
  fuel-agent_sha: "1fe47720ba554818a0be707f2e16281791492d50"
  fuel-nailgun-agent_sha: "1512b9af6b41cc95c4d891c593aeebe0faca5a63"
  astute_sha: "6d09f3fc7f69ac558095299211ebfd081fa54b8f"
  fuel-library_sha: "1cfd80a833ed27c777c950006a8d4e4080f81616"
  fuel-ostf_sha: "53109a99d923cccdf88c5cf5aba0af8050df47e3"
  fuelmain_sha: "7a374fbd1f5ebde943cb391a4f71b94888ce4a15"

using different combinations of mysql kill -9 on different controllers, one by one and in different orders, and was not able to reproduce the issue. I only got a dozen of

Failed actions:
    p_mysql_monitor_60000 on node-1.test.domain.local 'unknown error' (1): call=137, status=complete, last-rc-change='Tue Aug 4 17:17:20 2015', queued=0ms, exec=0ms
    p_mysql_monitor_60000 on node-3.test.domain.local 'unknown error' (1): call=142, status=complete, last-rc-change='Tue Aug 4 17:13:19 2015', queued=0ms, exec=0ms
    p_mysql_monitor_60000 on node-5.test.domain.local 'unknown error' (1): call=143, status=complete, last-rc-change='Tue Aug 4 17:15:20 2015', queued=0ms, exec=0ms

Bartosz Kupidura (zynzel) wrote on 2015-08-05:

I can't reproduce the issue. I tried a few scenarios:
1) from the bug report (kill mysql on ctrl #1, wait for start on #1, kill mysql on ctrl #2, wait for start on #2, kill mysql on ctrl #3)
2) kill all controllers at the same time
3) kill controllers at different intervals (10-50s)
4) ban mysql on ctrl #1, insert ~1kk records into the DB on ctrl #2/#3, unban mysql on ctrl #1, then kill mysql on ctrl #2 and #3 during DB synchronization on ctrl #1

Each scenario was tested ~5-10 times, and none of them managed to break the galera cluster.

Andrey Sledzinskiy (asledzinskiy) wrote on 2015-08-05:

#10

This issue has also not been reproduced in our system tests with mysql termination.

Alexander Arzhanov (aarzhanov) wrote on 2015-08-05:

#11

Because this bug is not reproduced, I am changing the status to Incomplete.

Bogdan Dobrelya (bogdando) wrote on 2015-08-06:

#12

The scale lab team reproduced this bug; please contact Leontiy Istomin for details.

Bartosz Kupidura (zynzel) wrote on 2015-08-06:

#13

The scale lab team's problems are not related to this bug.
There is also an unmanaged mysql service there, but it is caused by a missing (uninstalled) mysql-server-wsrep package. This was probably an ISO build fault.

Alexey Shtokolov (ashtokolov) on 2015-08-06

tags:	added: known-issue

Bogdan Dobrelya (bogdando) wrote on 2015-08-12:

#14

Managed to reproduce manually:
1) make the mysql_stop() function in the OCF script return an error unconditionally:
mysql_stop() {
return $OCF_ERR_GENERIC
...

2) issue the command:
pcs resource disable clone_p_mysql

3) check the mysql resource, it will be unmanaged:
Clone Set: clone_p_mysql [p_mysql]
p_mysql (ocf::fuel:mysql-wss): FAILED node-2.test.domain.local (unmanaged)

The RC is the code in mysql_stop(), which may return an unknown error to pacemaker. The stop action must never return errors unless we want the resource to become unmanaged or the node to be STONITHed.

tags:	added: scale

OpenStack Infra (hudson-openstack) wrote on 2015-08-12: Fix proposed to fuel-library (master)

#16

Fix proposed to branch: master
Review: https://review.openstack.org/212051

Bogdan Dobrelya (bogdando) wrote on 2015-08-12:

#17

In production environments, the use case for this bug is when the mysqld process is running but its pidfile is incorrect for some reason. From the perspective of the stop action there is nothing to do here except kill the process and report SUCCESS. Pid issues are normally addressed by the start action instead. So, we should not put the resource into the unmanaged state, or the start would never happen.
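
The behaviour described above could look roughly like the sketch below. It is not the code from the fuel-library OCF script or from the proposed review; tolerant_stop, its arguments and the 5-second grace period are assumptions, and it signals a single pid from the pidfile rather than a whole process group. The only point it illustrates is that a failed kill (for example ESRCH for a stale pid) produces a warning instead of an error return.

```shell
#!/bin/sh
OCF_SUCCESS=0   # numeric value of OCF_SUCCESS in the OCF return code set

# Hypothetical stop helper: a stale or wrong pid is logged, never reported
# to the caller as a failed stop.
# usage: tolerant_stop <pidfile> [grace_seconds]
tolerant_stop() {
    pidfile=$1
    grace=${2:-5}   # assumed grace period before SIGKILL
    if [ ! -r "$pidfile" ]; then
        echo "WARNING: no readable pidfile $pidfile, nothing to stop" >&2
        return $OCF_SUCCESS
    fi
    pid=$(cat "$pidfile")
    if ! kill -TERM "$pid" 2>/dev/null; then
        # For example ESRCH: no such process. Warn and carry on.
        echo "WARNING: could not signal pid '$pid', treating it as stopped" >&2
    fi
    # Give the process a moment to exit, then force it.
    waited=0
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    kill -KILL "$pid" 2>/dev/null || true
    return $OCF_SUCCESS
}
```

Returning success here leaves the pidfile problem for the next start action to sort out, which is the division of work the comment above argues for.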

OpenStack Infra (hudson-openstack) wrote on 2015-08-13: Fix merged to fuel-library (master)

#18

Reviewed: https://review.openstack.org/212051
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=533372f2098e29b8d3bbded5308379331de2c803
Submitter: Jenkins
Branch: master

commit 533372f2098e29b8d3bbded5308379331de2c803
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Aug 12 16:39:33 2015 +0200

Fix OCF stop errors for mysql

    W/o this fix, the pacemaker mysql resource
    would become unmanaged when failed to kill
    mysql on OCF stop action. For mysql OCF,
    this is possible only when the kill command failed:
    ESRCH
       No process or process group can be found
    corresponding to that specified by pid.

    This is an issue as pacemaker resources must
    never turn unmanaged for such cases (wrong pid)
    and the stop action should not return error.
    Instead we consider this case as "nothing to do
    here, return OK" in the hope of the nearest start
    action to fix this pid issue later.

    The solution is:
    - Do not exit with error from the stop() when
    failed to terminate with SIGTERM. This impacts
    nothing as there is the code below which
    shutdowns the whole mysql process group anyway.
    - Add a warning message instead
    - Update the log message to provide all PIDs under
    the process group being terminated

Closes-bug: #1388771

Change-Id: I2c4ecc10e9129e94e5610d774c92f6eaf5228759
Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status:	In Progress → Fix Committed

Tatyanka (tatyana-leontovich) wrote on 2015-09-04:

#19

on verification for 7.0

Tatyanka (tatyana-leontovich) wrote on 2015-09-04:

#20

verified 256 iso

OpenStack Infra (hudson-openstack) wrote on 2015-09-15: Related fix proposed to fuel-docs (master)

#21

Related fix proposed to branch: master
Review: https://review.openstack.org/223473

OpenStack Infra (hudson-openstack) wrote on 2015-09-15: Related fix merged to fuel-docs (master)

#22

Reviewed: https://review.openstack.org/223473
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=42ec9edb529186c7ea81ec63a98c416fe2e76c10
Submitter: Jenkins
Branch: master

commit 42ec9edb529186c7ea81ec63a98c416fe2e76c10
Author: evkonstantinov <email address hidden>
Date: Tue Sep 15 11:16:22 2015 +0300

Add mysql unmanaged resolved issue to relnotes

Change-Id: I3d7a8908a13b4e61d12b828d89045be7e5b229ff
Related-Bug:#1388771

OpenStack Infra (hudson-openstack) wrote on 2015-11-17: Fix proposed to fuel-library (stable/6.1)

#23

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/246291

OpenStack Infra (hudson-openstack) wrote on 2015-12-01: Fix merged to fuel-library (stable/6.1)

#24

Reviewed: https://review.openstack.org/246291
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=a9e444e733f5c6a7a98decddf3349e09d218270b
Submitter: Jenkins
Branch: stable/6.1

commit a9e444e733f5c6a7a98decddf3349e09d218270b
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Aug 12 16:39:33 2015 +0200

Fix OCF stop errors for mysql

Closes-bug: #1388771

    cherry-pick from 533372f2098e29b8d3bbded5308379331de2c803
    Change-Id: I2c4ecc10e9129e94e5610d774c92f6eaf5228759
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Alexey Stupnikov (astupnikov) on 2015-12-16

tags:	added: on-verification

Alexey Stupnikov (astupnikov) wrote on 2015-12-18:

#25

Verified on Fuel 6.1 (Ubuntu/Centos).

tags:	removed: on-verification
