Mysql has become unmanaged by pacemaker on controller after mysql termination

Bug #1388771 reported by Andrey Sledzinskiy
This bug affects 4 people
Affects                 Status         Importance  Assigned to                Milestone
Fuel for OpenStack      Fix Released   High        Bogdan Dobrelya
  5.1.x                 Won't Fix      High        Fuel Library (Deprecated)
  6.0.x                 Won't Fix      High        Fuel Library (Deprecated)
  6.1.x                 Fix Released   High        Denis Puchkin
  7.0.x                 Fix Released   High        Bogdan Dobrelya

Bug Description

{
    "build_id": "2014-10-31_00-01-50",
    "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
    "build_number": "7",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "28f54f91fae722c26d3c0f6b2cf34bbf95a2be03",
    "production": "docker",
    "fuelmain_sha": "117540694173d7a20ba022e091c469b4da1666e1",
    "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13",
    "feature_groups": [
        "mirantis"
    ],
    "release": "5.1.1",
    "release_versions": {
        "2014.1.1-5.1.1": {
            "VERSION": {
                "build_id": "2014-10-31_00-01-50",
                "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
                "build_number": "7",
                "api": "1.0",
                "nailgun_sha": "28f54f91fae722c26d3c0f6b2cf34bbf95a2be03",
                "production": "docker",
                "fuelmain_sha": "117540694173d7a20ba022e091c469b4da1666e1",
                "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "5.1.1",
                "fuellib_sha": "0686b8ba1cb6f05c8fd611bb4f2fcdf800d7331c"
            }
        }
    },
    "fuellib_sha": "0686b8ba1cb6f05c8fd611bb4f2fcdf800d7331c"
}

Steps:
1. Create and deploy the following cluster: CentOS, HA, Nova-network flat, 3 controllers, 2 compute nodes
2. Terminate mysql on the first controller with: pkill -9 "mysqld_safe|mysqld"
3. Wait until mysql starts on that controller, i.e. until [ -r /var/run/mysql/mysqld.pid ] && pkill -0 -F /var/run/mysql/mysqld.pid returns exit code 0
4. Repeat steps 2-3 on the second controller
5. Repeat steps 2-3 on the third controller (node-2, the primary)

Expected: mysql is started by pacemaker
Actual: mysql was not started by pacemaker. The resource is in the unmanaged state
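
The wait in step 3 could be scripted roughly as below. This is only a sketch: the function name wait_for_pidfile, the timeout and the polling interval are assumptions and are not taken from the system test, and it uses plain kill -0 instead of the pkill -0 -F form quoted above.

```shell
#!/bin/sh
# Sketch: poll until the process recorded in a pidfile is alive.
# wait_for_pidfile, the 300 s timeout and the 5 s interval are
# illustrative assumptions, not part of the original test.
wait_for_pidfile() {
    wfp_pidfile=$1
    wfp_timeout=${2:-300}
    wfp_interval=${3:-5}
    wfp_elapsed=0
    while [ "$wfp_elapsed" -lt "$wfp_timeout" ]; do
        # Pidfile must be readable and the pid in it must accept signal 0.
        if [ -r "$wfp_pidfile" ] && kill -0 "$(cat "$wfp_pidfile")" 2>/dev/null; then
            return 0
        fi
        sleep "$wfp_interval"
        wfp_elapsed=$((wfp_elapsed + wfp_interval))
    done
    return 1
}

# Example (path taken from the steps above):
#   wait_for_pidfile /var/run/mysql/mysqld.pid || echo "mysql did not come back"
```

A non-zero return after the timeout corresponds to the "Actual" result reported here.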

[root@node-2 ~]# crm status
Last updated: Mon Nov 3 11:55:55 2014
Last change: Mon Nov 3 11:55:48 2014 via crm_attribute on node-4.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-2.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
15 Resources configured

Online: [ node-2.test.domain.local node-3.test.domain.local node-4.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 Clone Set: clone_ping_vip__public_old [ping_vip__public_old]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-4.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     p_mysql (ocf::mirantis:mysql-wss): FAILED node-2.test.domain.local (unmanaged)
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-3.test.domain.local node-4.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-4.test.domain.local ]
 p_openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-2.test.domain.local

Failed actions:
    p_mysql_stop_0 on node-2.test.domain.local 'unknown error' (1): call=170, status=complete, last-rc-change='Sun Nov 2 06:08:52 2014', queued=20ms, exec=0ms
    p_mysql_monitor_120000 on node-4.test.domain.local 'not running' (7): call=95, status=complete, last-rc-change='Sun Nov 2 06:08:28 2014', queued=0ms, exec=0ms
    p_mysql_monitor_120000 on node-3.test.domain.local 'not running' (7): call=92, status=complete, last-rc-change='Sun Nov 2 06:06:07 2014', queued=0ms, exec=0ms

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
summary: - [System Tests] ha-mysql-termination test fails with timeout - MySQL
- daemon was down on 3-rd controller after kill
+ Mysql has become unmanaged by pacemaker on controller after mysql
+ termination
description: updated
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
Changed in fuel:
status: New → Confirmed
tags: added: ha pacemaker
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The same issue on 6.0 ISO
{
    "build_id": "2014-11-02_21-27-58",
    "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa",
    "build_number": "69",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129",
    "production": "docker",
    "fuelmain_sha": "ac3ba5f5c6073b7776ec69fc3cb4dd3c56df36c5",
    "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf",
    "feature_groups": [
        "mirantis"
    ],
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "build_id": "2014-11-02_21-27-58",
                "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa",
                "build_number": "69",
                "api": "1.0",
                "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129",
                "production": "docker",
                "fuelmain_sha": "ac3ba5f5c6073b7776ec69fc3cb4dd3c56df36c5",
                "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "6.0",
                "fuellib_sha": "45ad9b42666d7e3e14ab9af2911808e6c8806842"
            }
        }
    },
    "fuellib_sha": "45ad9b42666d7e3e14ab9af2911808e6c8806842"
}

mysql on node-2 is unmanaged again
Logs are attached

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

This doesn't fall in line with our usual HA style where single failures are tolerated. This is simultaneous failure of all mysql in a non-standard way. Lowering priority because it doesn't affect normal deployments.

Changed in fuel:
importance: High → Medium
no longer affects: fuel/6.1.x
Changed in fuel:
milestone: 6.0 → 6.1
tags: added: release-notes
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
status: Confirmed → Won't Fix
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I'm confirming this issue because too many duplicates have been reported so far.
Normally, a pacemaker resource should never end up unmanaged after any type of failure, AFAIK.
This might be a flaw in the OCF logic or in the resource parameters.

Changed in fuel:
importance: Medium → High
assignee: Sergii Golovatiuk (sgolovatiuk) → MOS Sustaining (mos-sustaining)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I'm pretty sure the root cause (RC) is the same for all of the duplicates, despite the different repro steps.
It should be somehow related to the kill -9 execution logic.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note that pacemaker renders a resource unmanaged only if it has failed to stop and there is no stonith configured. So the issue must be related to failed stop actions of the p_mysql resource.
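
To spot the resources that ended up in this state, the crm status output shown in the description can be filtered. The helper below is a hypothetical sketch (list_unmanaged is not an existing tool); it assumes the "(unmanaged)" marker is the last field of the line and the node name the one before it, as in the output above.

```shell
#!/bin/sh
# Sketch: print "<resource> <node>" for every line of crm status output
# that carries the "(unmanaged)" marker. list_unmanaged is a made-up name.
list_unmanaged() {
    awk 'index($0, "(unmanaged)") { print $1, $(NF - 1) }'
}

# Example on a controller:
#   crm status | list_unmanaged
# Whether stonith is enabled could be checked with, for example,
#   crm_attribute --type crm_config --name stonith-enabled --query
# (assuming this option set is available in the installed pacemaker).
```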

tags: added: tricky
Revision history for this message
Sergey Yudin (tsipa740) wrote :

I have tried to reproduce it with

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "132"
  build_id: "2015-08-03_22-04-28"
  nailgun_sha: "d1536c3a57459e205e39bc4d86d2b4efc6141c4e"
  python-fuelclient_sha: "4fe70fb5c0ce8905ae5908f63d45b45e89a99340"
  fuel-agent_sha: "1fe47720ba554818a0be707f2e16281791492d50"
  fuel-nailgun-agent_sha: "1512b9af6b41cc95c4d891c593aeebe0faca5a63"
  astute_sha: "6d09f3fc7f69ac558095299211ebfd081fa54b8f"
  fuel-library_sha: "1cfd80a833ed27c777c950006a8d4e4080f81616"
  fuel-ostf_sha: "53109a99d923cccdf88c5cf5aba0af8050df47e3"
  fuelmain_sha: "7a374fbd1f5ebde943cb391a4f71b94888ce4a15"

using different combinations of kill -9 on mysql on different controllers, one by one and in different orders, and was not able to reproduce the issue. I only got a dozen of

Failed actions:
    p_mysql_monitor_60000 on node-1.test.domain.local 'unknown error' (1): call=137, status=complete, last-rc-change='Tue Aug 4 17:17:20 2015', queued=0ms, exec=0ms
    p_mysql_monitor_60000 on node-3.test.domain.local 'unknown error' (1): call=142, status=complete, last-rc-change='Tue Aug 4 17:13:19 2015', queued=0ms, exec=0ms
    p_mysql_monitor_60000 on node-5.test.domain.local 'unknown error' (1): call=143, status=complete, last-rc-change='Tue Aug 4 17:15:20 2015', queued=0ms, exec=0ms

Revision history for this message
Bartosz Kupidura (zynzel) wrote :

I can't reproduce the issue. I tried a few scenarios:
1) from the bug report (kill mysql on ctrl #1, wait for start on #1, kill mysql on ctrl #2, wait for start on #2, kill mysql on ctrl #3)
2) kill mysql on all controllers at the same time
3) kill mysql on the controllers at different intervals (10-50s)
4) ban mysql on ctrl #1, insert ~1 million records into the DB on ctrl #2/#3, unban mysql on ctrl #1, then kill mysql on ctrl #2 and #3 during DB synchronization on ctrl #1

Each scenario was tested ~5-10 times, and none of them managed to break the galera cluster.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

This issue has also not been reproduced in our system tests with mysql termination

Revision history for this message
Alexander Arzhanov (aarzhanov) wrote :

Because this bug could not be reproduced, I am changing the status to Incomplete.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The scale lab team reproduced this bug; please contact Leontiy Istomin for details

Revision history for this message
Bartosz Kupidura (zynzel) wrote :

The scale lab team's problems are not related to this bug.
There is also an unmanaged mysql service there, but it is caused by a missing (uninstalled) mysql-server-wsrep package. This was probably an ISO build fault.

tags: added: known-issue
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Managed to reproduce manually:
1) make the mysql_stop() function in the OCF script return an error unconditionally:
mysql_stop() {
    return $OCF_ERR_GENERIC
    ...

2) issue the command:
pcs resource disable clone_p_mysql

3) check the mysql resource, it will be unmanaged:
 Clone Set: clone_p_mysql [p_mysql]
     p_mysql (ocf::fuel:mysql-wss): FAILED node-2.test.domain.local (unmanaged)

The root cause is the code in mysql_stop(), which may return an unknown error to pacemaker. The stop action must never return errors unless we want the resource to become unmanaged or the node to be STONITHed.

tags: added: scale
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/212051

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

In production environments, the use case for this bug is a running mysqld process whose pidfile is incorrect for some reason. From the stop action's perspective there is nothing to do here except kill the process and report SUCCESS. Pid issues are normally addressed by the start action instead. So we should not put the resource into the unmanaged state, or the start would never happen.
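
A minimal sketch of that idea follows. It is not the actual mysql-wss OCF code: tolerant_stop and its messages are hypothetical, and the process-group shutdown that the real agent performs afterwards is left out. The point is only that a failed kill (for example ESRCH because of a stale pid) is logged as a warning instead of being returned to pacemaker as an error.

```shell
#!/bin/sh
# Sketch only: a stop helper that never reports failure for a bad pidfile.
# tolerant_stop is a made-up name; a real agent would still have to make
# sure the service is really down before returning.
OCF_SUCCESS=0   # standard OCF return code for success

tolerant_stop() {
    ts_pidfile=$1
    if [ -r "$ts_pidfile" ]; then
        ts_pid=$(cat "$ts_pidfile")
        if ! kill -TERM "$ts_pid" 2>/dev/null; then
            # Stale or invalid pid: nothing to signal, warn and carry on.
            echo "WARNING: could not signal pid '$ts_pid' from $ts_pidfile" >&2
        fi
    else
        echo "WARNING: pidfile $ts_pidfile is missing or unreadable" >&2
    fi
    return $OCF_SUCCESS
}
```

With this shape, a wrong pidfile leaves the resource stopped from pacemaker's point of view, so a later start action gets the chance to repair the pid state.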

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/212051
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=533372f2098e29b8d3bbded5308379331de2c803
Submitter: Jenkins
Branch: master

commit 533372f2098e29b8d3bbded5308379331de2c803
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Aug 12 16:39:33 2015 +0200

    Fix OCF stop errors for mysql

    W/o this fix, the pacemaker mysql resource
    would became unmanaged when failed to kill
    mysql on OCF stop action. For mysql OCF,
    this possible only when the kill command failed:
    ESRCH
       No process or process group can be found
    corresponding to that specified by pid.

    This is an issue as pacemaker resources must
    never turn unmanaged for such cases (wrong pid)
    and the stop action should not return error.
    Instead we consider this case as "nothing to do
    here, return OK" in the hope of the nearest start
    action to fix this pid issue later.

    The solution is:
    - Do not exit with error from the stop() when
    failed to terminate with SIGTERM. This impacts
    nothing as there is the code below which
    shutdowns the whole mysql process group anyway.
    - Add a warning message instead
    - Update the log message to provide all PIDs under
    the process group being terminated

    Closes-bug: #1388771

    Change-Id: I2c4ecc10e9129e94e5610d774c92f6eaf5228759
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

on verification for 7.0

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Verified on ISO 256

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-docs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/223473

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/223473
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=42ec9edb529186c7ea81ec63a98c416fe2e76c10
Submitter: Jenkins
Branch: master

commit 42ec9edb529186c7ea81ec63a98c416fe2e76c10
Author: evkonstantinov <email address hidden>
Date: Tue Sep 15 11:16:22 2015 +0300

    Add mysql unmanaged resolved issue to relnotes

    Change-Id: I3d7a8908a13b4e61d12b828d89045be7e5b229ff
    Related-Bug:#1388771

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/246291

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/246291
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=a9e444e733f5c6a7a98decddf3349e09d218270b
Submitter: Jenkins
Branch: stable/6.1

commit a9e444e733f5c6a7a98decddf3349e09d218270b
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Aug 12 16:39:33 2015 +0200

    Fix OCF stop errors for mysql

    W/o this fix, the pacemaker mysql resource
    would became unmanaged when failed to kill
    mysql on OCF stop action. For mysql OCF,
    this possible only when the kill command failed:
    ESRCH
       No process or process group can be found
    corresponding to that specified by pid.

    This is an issue as pacemaker resources must
    never turn unmanaged for such cases (wrong pid)
    and the stop action should not return error.
    Instead we consider this case as "nothing to do
    here, return OK" in the hope of the nearest start
    action to fix this pid issue later.

    The solution is:
    - Do not exit with error from the stop() when
    failed to terminate with SIGTERM. This impacts
    nothing as there is the code below which
    shutdowns the whole mysql process group anyway.
    - Add a warning message instead
    - Update the log message to provide all PIDs under
    the process group being terminated

    Closes-bug: #1388771

    cherry-pick from 533372f2098e29b8d3bbded5308379331de2c803
    Change-Id: I2c4ecc10e9129e94e5610d774c92f6eaf5228759
    Signed-off-by: Bogdan Dobrelya <email address hidden>

tags: added: on-verification
Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

Verified on Fuel 6.1 (Ubuntu/Centos).

tags: removed: on-verification