StarlingX

mariadb fails when standby controller rebooted in AIO-DX

Bug #1837724 reported by Bart Wensley on 2019-07-24

This bug affects 1 person

Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   High         Bin Qian

Bug Description

Brief Description
-----------------
After force rebooting the standby controller in an AIO-DX (two node) system, mariadb was unavailable until the standby controller came back into service (which can take about 10 minutes).

Severity
--------
Critical: If the standby controller had been powered down instead of rebooted, this would result in a complete outage of all openstack services.

Steps to Reproduce
------------------
Install an AIO-DX system and the stx-openstack application. Force reboot the standby controller.

Expected Behavior
------------------
When the standby controller is rebooted, the mariadb pod on the active controller should continue to function (after a brief disruption when the peer is lost).

Actual Behavior
----------------
The mariadb pod on the active controller (controller-0) did not become primary, so the database was inaccessible. The mariadb pod on controller-0 shows 0/1 containers ready:
NAME                                           READY   STATUS        RESTARTS   AGE    IP               NODE           NOMINATED NODE   READINESS GATES
mariadb-ingress-6ff964556d-74kv5               0/1     Running       0          153m   172.16.192.127   controller-0   <none>           <none>
mariadb-ingress-6ff964556d-lghk7               0/1     Pending       0          102s   <none>           <none>         <none>           <none>
mariadb-ingress-6ff964556d-vdc29               1/1     Terminating   0          35m    172.16.167.31    controller-1   <none>           <none>
mariadb-ingress-error-pages-764cfd869b-kpwqj   1/1     Running       0          137m   172.16.192.119   controller-0   <none>           <none>
mariadb-server-0                               1/1     Terminating   0          29m    172.16.167.35    controller-1   <none>           <none>
mariadb-server-1                               0/1     Running       0          10m    172.16.192.112   controller-0   <none>           <none>
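
For anyone trying to spot this state quickly, the pod listing above can be checked mechanically. The following is only a rough sketch: the helper name not_ready_pods is made up for illustration, and it assumes the text is in the same whitespace-separated layout as the listing in this report (NAME, READY, STATUS as the first three columns).

```python
def not_ready_pods(listing):
    """Return (name, ready, status) for each pod row where ready < total.

    Hypothetical helper: assumes whitespace-separated rows whose first three
    columns are NAME, READY (as "x/y") and STATUS, as in the listing above.
    """
    flagged = []
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[:3]
        ready_count, _, total = ready.partition("/")
        if not (ready_count.isdigit() and total.isdigit()):
            continue
        if int(ready_count) < int(total):
            flagged.append((name, ready, status))
    return flagged


# Pod listing copied from this bug report.
LISTING = """
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mariadb-ingress-6ff964556d-74kv5 0/1 Running 0 153m 172.16.192.127 controller-0 <none> <none>
mariadb-ingress-6ff964556d-lghk7 0/1 Pending 0 102s <none> <none> <none> <none>
mariadb-ingress-6ff964556d-vdc29 1/1 Terminating 0 35m 172.16.167.31 controller-1 <none> <none>
mariadb-ingress-error-pages-764cfd869b-kpwqj 1/1 Running 0 137m 172.16.192.119 controller-0 <none> <none>
mariadb-server-0 1/1 Terminating 0 29m 172.16.167.35 controller-1 <none> <none>
mariadb-server-1 0/1 Running 0 10m 172.16.192.112 controller-0 <none> <none>
"""

for name, ready, status in not_ready_pods(LISTING):
    print(name, ready, status)
```

On the listing from this report it flags the two mariadb-ingress pods showing 0/1 and mariadb-server-1, the server pod on controller-0 that never became ready.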

The /var/log/daemon-ocf.log file on the active controller seems to indicate that the OCF script for the dbmon resource is not being run. This points to a bug in SM. Bin Qian investigated and said that this is likely an issue in SM with recovery from failure.

Reproducibility
---------------
Intermittent: I have seen this several times now.

System Configuration
--------------------
AIO-DX (two node) system

Branch/Pull Time/Commit
-----------------------
Designer load:
BUILD_DATE="2019-07-22 11:21:07 -0500"

Last Pass
---------
Unsure

Timestamp/Logs
--------------
The collect logs for controller-0 and controller-1 will be attached. Some relevant timestamps:

2019-07-23T17:08:11 - force reboot of controller-1 - mariadb did not recover until controller-1 recovered

2019-07-23T18:53:30 - deleted mariadb pod on controller-0 (with kubectl) and pod recovered

2019-07-23T19:01:26 - force reboot of controller-1 - mariadb did not recover until controller-1 recovered

Test Activity
-------------
Developer testing

Tags:

Revision history for this message

Bart Wensley (bartwensley) wrote on 2019-07-24:

controller-0_20190724.115731.tgz (61.7 MiB, application/x-tar)

Bart Wensley (bartwensley) wrote on 2019-07-24:

controller-1_20190724.115731.tgz (47.0 MiB, application/x-tar)

Ghada Khalil (gkhalil) wrote on 2019-07-24:

Marking as stx.2.0 / high priority given the system impact

tags:	added: stx.2.0 stx.containers
Changed in starlingx:
importance:	Undecided → High
status:	New → Triaged
assignee:	nobody → Bin Qian (bqian20)
tags:	added: stx.ha removed: stx.containers

OpenStack Infra (hudson-openstack) wrote on 2019-08-12: Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675936

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-08-22: Fix merged to ha (master)

Reviewed: https://review.opendev.org/675936
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=66e040421732423562cc58c840dcadf13efb5e13
Submitter: Zuul
Branch: master

commit 66e040421732423562cc58c840dcadf13efb5e13
Author: Bin Qian <email address hidden>
Date: Wed Aug 7 14:00:11 2019 -0400

Enhance timer system to avoid double deregister

    The bug reported was because the dbmon service audit timer was
    overwritten accidentally, therefore no audit was performed so the
    dbmon service was not actually being audit.

    Major change is to enhance timer system to use global unique timer
    id (not reused) to ensure timer is not double deregistered by 2
    different mechanisms (disarm/deregister).
    Change the timer id to 64 bit integer to ensure id never overflow.

    Above change eliminates the double deregistering a timer issue which
    could accidentally deregister a new timer that reuses the same id.

    Also some cleaning to get rid of cases that could double deregister
    timer (although it is no longer harmful as above mentioned change is
    in place)

    Change-Id: I2603870d2eb2749d78456e406095ae543353963f
    Closes-Bug: 1837724
    Signed-off-by: Bin Qian <email address hidden>
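
The failure mode described in the commit message above can be sketched in a few lines. This is not the SM code (SM is not written in Python); both registry classes, the double_deregister helper and the timer name "dbmon-audit" are made up purely to illustrate why reused ids make a second deregister harmful and why never-reused ids make it harmless. Python integers do not overflow, so the 64-bit width mentioned in the commit is not modeled here.

```python
import itertools


class ReusingTimerRegistry:
    """Illustrative only: ids are slot numbers and are reused after deregister."""

    def __init__(self):
        self.timers = {}

    def register(self, name):
        timer_id = 0
        while timer_id in self.timers:  # pick the lowest free slot
            timer_id += 1
        self.timers[timer_id] = name
        return timer_id

    def deregister(self, timer_id):
        # A stale id silently removes whichever timer owns the slot now.
        return self.timers.pop(timer_id, None)


class UniqueTimerRegistry:
    """Illustrative only: ids come from a counter and are never handed out twice."""

    def __init__(self):
        self.timers = {}
        self._next_id = itertools.count(1)

    def register(self, name):
        timer_id = next(self._next_id)
        self.timers[timer_id] = name
        return timer_id

    def deregister(self, timer_id):
        # A stale id no longer matches anything, so this is a no-op.
        return self.timers.pop(timer_id, None)


def double_deregister(registry):
    """Deregister one timer twice, with a new timer registered in between."""
    old = registry.register("old-timer")
    registry.deregister(old)          # first mechanism (e.g. disarm)
    registry.register("dbmon-audit")  # new timer, may reuse the old id
    registry.deregister(old)          # second mechanism, using the stale id
    return sorted(registry.timers.values())


print(double_deregister(ReusingTimerRegistry()))  # []
print(double_deregister(UniqueTimerRegistry()))   # ['dbmon-audit']
```

With reused ids the second deregister removes the new timer, which matches the symptom of the audit never running. With unique ids the new timer survives. As a rough sanity check on the "never overflow" claim: even at an assumed one million registrations per second, a 64-bit counter would take roughly 584,000 years to wrap.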

Changed in starlingx:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2019-08-22: Fix proposed to ha (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/678029

OpenStack Infra (hudson-openstack) wrote on 2019-08-22: Fix merged to ha (r/stx.2.0)

Reviewed: https://review.opendev.org/678029
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=3034fef823387e66a332d63a5b69dcda9d3feb0d
Submitter: Zuul
Branch: r/stx.2.0

commit 3034fef823387e66a332d63a5b69dcda9d3feb0d
Author: Bin Qian <email address hidden>
Date: Wed Aug 7 14:00:11 2019 -0400

Enhance timer system to avoid double deregister

    The bug reported was because the dbmon service audit timer was
    overwritten accidentally, therefore no audit was performed so the
    dbmon service was not actually being audit.

    Above change eliminates the double deregistering a timer issue which
    could accidentally deregister a new timer that reuses the same id.

    Also some cleaning to get rid of cases that could double deregister
    timer (although it is no longer harmful as above mentioned change is
    in place)

    Change-Id: I2603870d2eb2749d78456e406095ae543353963f
    Closes-Bug: 1837724
    Signed-off-by: Bin Qian <email address hidden>

Ghada Khalil (gkhalil) on 2019-08-22

tags:	added: in-r-stx20

Raviteja naidu Jagalmarri (raviteja0218) wrote on 2019-08-27:

When the standby controller is rebooted, the mariadb pod on the active controller is functioning without any issues. Change tested in duplex and is working as expected.

Build info:

OS="centos"
SW_VERSION="19.08"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.2.0"

JOB="STX_BUILD_2.0"
<email address hidden>"
BUILD_NUMBER="40"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-26 23:30:00 +0000"
