mariadb fails when standby controller rebooted in AIO-DX

Bug #1837724 reported by Bart Wensley
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   High         Bin Qian

Bug Description

Brief Description
-----------------
After force rebooting the standby controller in an AIO-DX (two-node) system, mariadb was unavailable until the standby controller came back into service (which can take about 10 minutes).

Severity
--------
Critical: If the standby controller had been powered down instead of rebooted, this would result in a complete outage of all openstack services.

Steps to Reproduce
------------------
Install an AIO-DX system and the stx-openstack application. Force reboot the standby controller.

Expected Behavior
------------------
When the standby controller is rebooted, the mariadb pod on the active controller should continue to function (after a brief disruption when the peer is lost).

Actual Behavior
----------------
The mariadb pod on the active controller (controller-0) did not become primary, so the database was inaccessible. The mariadb pod on controller-0 shows 0/1 containers ready:
NAME                                          READY  STATUS       RESTARTS  AGE   IP              NODE          NOMINATED NODE  READINESS GATES
mariadb-ingress-6ff964556d-74kv5              0/1    Running      0         153m  172.16.192.127  controller-0  <none>          <none>
mariadb-ingress-6ff964556d-lghk7              0/1    Pending      0         102s  <none>          <none>        <none>          <none>
mariadb-ingress-6ff964556d-vdc29              1/1    Terminating  0         35m   172.16.167.31   controller-1  <none>          <none>
mariadb-ingress-error-pages-764cfd869b-kpwqj  1/1    Running      0         137m  172.16.192.119  controller-0  <none>          <none>
mariadb-server-0                              1/1    Terminating  0         29m   172.16.167.35   controller-1  <none>          <none>
mariadb-server-1                              0/1    Running      0         10m   172.16.192.112  controller-0  <none>          <none>
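
For anyone trying to spot the same symptom, here is a rough sketch of how the pod listing above could be scanned for pods on a given node whose READY count is below the total. The helper name `not_ready_pods` and the `SAMPLE` constant are made up for illustration; the sketch only assumes whitespace-separated columns in the order shown in this report (NAME, READY, STATUS, RESTARTS, AGE, IP, NODE).

```python
from typing import List, Tuple

# Pod listing copied from this bug report (whitespace-separated columns).
SAMPLE = """\
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mariadb-ingress-6ff964556d-74kv5 0/1 Running 0 153m 172.16.192.127 controller-0 <none> <none>
mariadb-ingress-6ff964556d-lghk7 0/1 Pending 0 102s <none> <none> <none> <none>
mariadb-ingress-6ff964556d-vdc29 1/1 Terminating 0 35m 172.16.167.31 controller-1 <none> <none>
mariadb-ingress-error-pages-764cfd869b-kpwqj 1/1 Running 0 137m 172.16.192.119 controller-0 <none> <none>
mariadb-server-0 1/1 Terminating 0 29m 172.16.167.35 controller-1 <none> <none>
mariadb-server-1 0/1 Running 0 10m 172.16.192.112 controller-0 <none> <none>
"""


def not_ready_pods(table: str, node: str) -> List[Tuple[str, str]]:
    """Return (pod name, status) for pods on `node` with ready < total.

    Hypothetical helper: assumes the column order of the listing above.
    """
    results = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 7:
            continue  # not a pod row
        name, ready, status, node_name = fields[0], fields[1], fields[2], fields[6]
        ready_count, total = (int(part) for part in ready.split("/"))
        if node_name == node and ready_count < total:
            results.append((name, status))
    return results


if __name__ == "__main__":
    for pod, status in not_ready_pods(SAMPLE, "controller-0"):
        print(pod, status)
```

Run against the listing above, this picks out mariadb-ingress-6ff964556d-74kv5 and mariadb-server-1 on controller-0, both in Running status but not ready.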

The /var/log/daemon-ocf.log file on the active controller seems to indicate that the OCF script for the dbmon resource is not being run. This points to a bug in SM. Bin Qian investigated and said that this is likely an issue in SM with recovery from failure.

Reproducibility
---------------
Intermittent: I have seen this several times now.

System Configuration
--------------------
AIO-DX (two node) system

Branch/Pull Time/Commit
-----------------------
Designer load:
BUILD_DATE="2019-07-22 11:21:07 -0500"

Last Pass
---------
Unsure

Timestamp/Logs
--------------
The collect logs for controller-0 and controller-1 will be attached. Some relevant timestamps:

2019-07-23T17:08:11 - force reboot of controller-1 - mariadb did not recover until controller-1 recovered

2019-07-23T18:53:30 - deleted mariadb pod on controller-0 (with kubectl) and pod recovered

2019-07-23T19:01:26 - force reboot of controller-1 - mariadb did not recover until controller-1 recovered

Test Activity
-------------
Developer testing

Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 / high priority given the system impact

tags: added: stx.2.0 stx.containers
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bin Qian (bqian20)
tags: added: stx.ha
removed: stx.containers
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675936

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/675936
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=66e040421732423562cc58c840dcadf13efb5e13
Submitter: Zuul
Branch: master

commit 66e040421732423562cc58c840dcadf13efb5e13
Author: Bin Qian <email address hidden>
Date: Wed Aug 7 14:00:11 2019 -0400

    Enhance timer system to avoid double deregister

    The bug reported was because the dbmon service audit timer was
    overwritten accidentally, therefore no audit was performed so the
    dbmon service was not actually being audit.

    Major change is to enhance timer system to use global unique timer
    id (not reused) to ensure timer is not double deregistered by 2
    different mechanisms (disarm/deregister).
    Change the timer id to 64 bit integer to ensure id never overflow.

    Above change eliminates the double deregistering a timer issue which
    could accidentally deregister a new timer that reuses the same id.

    Also some cleaning to get rid of cases that could double deregister
    timer (although it is no longer harmful as above mentioned change is
    in place)

    Change-Id: I2603870d2eb2749d78456e406095ae543353963f
    Closes-Bug: 1837724
    Signed-off-by: Bin Qian <email address hidden>
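
To make the failure mode in the commit message above easier to picture, here is a minimal sketch in Python. It is not the SM code (which is not shown in this report); the class names `ReusedIdTimers` and `UniqueIdTimers` and the function `survives_double_deregister` are hypothetical. It only illustrates the idea that a second, stale deregister can remove a newer timer when ids are reused, and that a never-reused id turns the stale call into a no-op.

```python
import itertools
from typing import Callable, Dict, Set


class ReusedIdTimers:
    """Hypothetical registry that hands out the lowest free slot as the id."""

    def __init__(self) -> None:
        self._timers: Dict[int, Callable[[], None]] = {}

    def register(self, callback: Callable[[], None]) -> int:
        timer_id = 0
        while timer_id in self._timers:  # reuse the first free id
            timer_id += 1
        self._timers[timer_id] = callback
        return timer_id

    def deregister(self, timer_id: int) -> bool:
        # Returns True if a timer was actually removed.
        return self._timers.pop(timer_id, None) is not None

    def active(self) -> Set[int]:
        return set(self._timers)


class UniqueIdTimers(ReusedIdTimers):
    """Same registry, but ids come from a counter and are never reused."""

    def __init__(self) -> None:
        super().__init__()
        # Python ints do not overflow; in C this would be a 64-bit counter.
        self._next_id = itertools.count(1)

    def register(self, callback: Callable[[], None]) -> int:
        timer_id = next(self._next_id)
        self._timers[timer_id] = callback
        return timer_id


def survives_double_deregister(registry: ReusedIdTimers) -> bool:
    """Return True if a new timer survives a stale second deregister."""
    old_id = registry.register(lambda: None)    # some short-lived timer
    registry.deregister(old_id)                 # first mechanism removes it
    audit_id = registry.register(lambda: None)  # new timer, e.g. an audit
    registry.deregister(old_id)                 # second mechanism, stale id
    return audit_id in registry.active()


if __name__ == "__main__":
    print("reused ids:", survives_double_deregister(ReusedIdTimers()))
    print("unique ids:", survives_double_deregister(UniqueIdTimers()))
```

With reused ids the new timer is silently removed by the stale call; with unique ids the stale call finds nothing to remove and the new timer keeps running.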

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/678029

OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (r/stx.2.0)

Reviewed: https://review.opendev.org/678029
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=3034fef823387e66a332d63a5b69dcda9d3feb0d
Submitter: Zuul
Branch: r/stx.2.0

commit 3034fef823387e66a332d63a5b69dcda9d3feb0d
Author: Bin Qian <email address hidden>
Date: Wed Aug 7 14:00:11 2019 -0400

    Enhance timer system to avoid double deregister

    The bug reported was because the dbmon service audit timer was
    overwritten accidentally, therefore no audit was performed so the
    dbmon service was not actually being audit.

    Major change is to enhance timer system to use global unique timer
    id (not reused) to ensure timer is not double deregistered by 2
    different mechanisms (disarm/deregister).
    Change the timer id to 64 bit integer to ensure id never overflow.

    Above change eliminates the double deregistering a timer issue which
    could accidentally deregister a new timer that reuses the same id.

    Also some cleaning to get rid of cases that could double deregister
    timer (although it is no longer harmful as above mentioned change is
    in place)

    Change-Id: I2603870d2eb2749d78456e406095ae543353963f
    Closes-Bug: 1837724
    Signed-off-by: Bin Qian <email address hidden>

Ghada Khalil (gkhalil)
tags: added: in-r-stx20
Raviteja naidu Jagalmarri (raviteja0218) wrote :

When the standby controller is rebooted, the mariadb pod on the active controller is functioning without any issues. Change tested in duplex and is working as expected.

Build info:

OS="centos"
SW_VERSION="19.08"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.2.0"

JOB="STX_BUILD_2.0"
<email address hidden>"
BUILD_NUMBER="40"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-26 23:30:00 +0000"
