rook-ceph after swact cluster down

Bug #1920882 reported by chen haochuan
This bug affects 1 person
Affects   | Status       | Importance | Assigned to   | Milestone
StarlingX | Fix Released | Medium     | chen haochuan |

Bug Description

Brief Description
-----------------
On a duplex system, if a swact is performed and the inactive controller is then quickly shut down, two ceph-mon-a pods end up existing: one running and the other terminating.
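The symptom can be checked mechanically. The sketch below is only an illustration and is not part of the fix. It assumes the mon pod list has already been exported as JSON (for example with `kubectl get pods -o json` in the Rook namespace) and that each mon pod carries a `mon` label naming the monitor; both are assumptions about the deployment, and `find_duplicate_mons` is a hypothetical helper name. A pod is treated as terminating when `metadata.deletionTimestamp` is set.

```python
from collections import defaultdict


def find_duplicate_mons(pod_list, label_key="mon"):
    """Return the mon ids that have both a terminating and a live pod.

    pod_list is the parsed JSON of a Kubernetes pod list.
    label_key is an assumed label that names the monitor (e.g. "a").
    """
    counts = defaultdict(lambda: {"terminating": 0, "live": 0})
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        mon_id = meta.get("labels", {}).get(label_key)
        if mon_id is None:
            continue  # not a mon pod under the assumed labelling
        # Kubernetes sets deletionTimestamp once a pod is being deleted.
        state = "terminating" if meta.get("deletionTimestamp") else "live"
        counts[mon_id][state] += 1
    return sorted(
        mon_id
        for mon_id, c in counts.items()
        if c["terminating"] and c["live"]
    )


if __name__ == "__main__":
    # Hypothetical pod list mirroring the reported state.
    sample = {
        "items": [
            {"metadata": {"name": "rook-ceph-mon-a-1", "labels": {"mon": "a"}}},
            {
                "metadata": {
                    "name": "rook-ceph-mon-a-2",
                    "labels": {"mon": "a"},
                    "deletionTimestamp": "2021-03-23T08:00:00Z",
                }
            },
        ]
    }
    print(find_duplicate_mons(sample))
```

An empty result means no monitor has a running pod and a terminating pod at the same time.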

Steps to Reproduce
------------------
1. On a duplex system, perform a host swact.
2. Quickly shut down the inactive controller.
3. Run ceph -s and observe that the cluster stays down.

Expected Behavior
------------------
The Ceph cluster does not go down.

Changed in starlingx:
assignee: nobody → chen haochuan (martin1982)
Austin Sun (sunausti) wrote :
Ghada Khalil (gkhalil) wrote :

screening: @Austin, what is the system impact of this? Does this need to be addressed/cherrypicked for stx.5.0? If so, please plan to have the fix merged/cherrypicked by April 30.

tags: added: stx.storage
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
Ghada Khalil (gkhalil) wrote :

screening: I've marked this as medium / stx.5.0 so that the fix is cherrypicked to the release branch if available in time.

tags: added: stx.5.0
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

Based on the stx release meeting (2021-05-19), agreed not to hold up the stx.5.0 release for this bug as it is related to a very specific scenario which is unlikely during typical operations.

tags: added: stx.6.0
removed: stx.5.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/utilities/+/792213

OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/792213
Committed: https://opendev.org/starlingx/utilities/commit/c4d042615e6fe8944a4628fa1a29e86e012a9bf5
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 557cada006fd5a3bd81ad5af387c37657801f8c5
Author: Fernando Theirs <email address hidden>
Date: Thu May 13 16:21:47 2021 -0300

    Collect is missing etcdctl output

    When the collect tool is run, it does not include the contents
    of the etcd database. Fixes have been made for this to dump the
    contents in "etcd_database.dump" file.

    Verify if etcd access is secured. In that case, certificates
    will be used.

    Closes-Bug: 1911935

    Signed-off-by: Fernando Theirs <email address hidden>
    Change-Id: Idbc60edffa978a7a6bead939a4eb54f4abae29a6

commit 6045b1b8a0d8ed6a94d06cdfc994bf1a5fa9dbb5
Author: Jim Gauld <email address hidden>
Date: Thu May 6 11:58:34 2021 -0400

    Provide utility script is-rootdisk-device.sh

    This provides a utility script to determine which disk contains the root
    filesystem. This can also be used as a helper function for io-scheduler
    udev rules that require specific configuration for root disk.

    Example usage:
    /usr/local/bin/is-rootdisk-device.sh
    ROOTDISK_DEVICE=sda

    /usr/local/bin/is-rootdisk-device.sh /dev/sda
    ROOTDISK_DEVICE=sda

    /usr/local/bin/is-rootdisk-device.sh /dev/sdb
    (i.e., no output)

    Partial-Bug: 1927515
    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Ib0d4a161a407b08d294c5ff9aa0b7590961e18c9

commit 88a678f142cfe86c58b6405aae6babbc08de0e8f
Author: Chen, Haochuan Z <email address hidden>
Date: Fri Mar 26 09:09:41 2021 +0800

    Add packages to stx-ceph-manager image

    This update installs ceph-mgr, ceph-mon, ceph-osd packages as part
    of stx-ceph-manager image.

    Partial-Bug: 1920882

    Change-Id: I4afde8b1476e14453fac8561f1edde7360b8ee96
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 09b3542fcc6cc0300a9cae0d302225e6977780f3
Author: Scott Little <email address hidden>
Date: Thu Mar 25 11:49:49 2021 -0400

    Set SW_VERSION 21.05

    Prep for the StarlingX 5.0 release.
    SW_VERSION, also known as PLATFORM_RELEASE, uses YY.MM format.

    Story: 2008055
    Task: 42115
    Signed-off-by: Scott Little <email address hidden>
    Change-Id: If7c91a2b523358269ae4850961cf4189ffcd7a75

commit ae4cefd0e2a0001476782c31e1003810da2b4838
Author: Chris Friesen <email address hidden>
Date: Thu Mar 4 18:04:12 2021 -0500

    add dcmanager-audit-worker to patch restart script

    Need to add the new process to the patch restart script.

    Story: 2007267
    Task: 41999
    Signed-off-by: Chris Friesen <email address hidden>
    Change-Id: If5faa806bd0d52ddbf1343b064959f4207cf975a

commit 27fce5a52321f3014fa8ae9181d344bc774289da
Author: Enzo Candotti <email address hidden>
Date: Mon Feb 1 12:47:38 2021 -0300

    Add resource CPU and memory info in collect

    This adds commands to collect more data to debug
    resource allocations and...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix merged to rook-ceph (master)

Reviewed: https://review.opendev.org/c/starlingx/rook-ceph/+/783584
Committed: https://opendev.org/starlingx/rook-ceph/commit/5945b9b82a4f372ba9a1bd39fa6371fe0d6a4598
Submitter: "Zuul (22348)"
Branch: master

commit 5945b9b82a4f372ba9a1bd39fa6371fe0d6a4598
Author: Chen, Haochuan Z <email address hidden>
Date: Mon Mar 29 08:16:49 2021 +0800

    Add cronjob to osd audit

    After host-swact, the mon and osd deployments on the active controller
    are deleted so that drbd can switch the /var/lib/ceph/mon folder between
    primary and secondary on the two controllers. After the swact,
    rook-ceph-operator launches new mon and osd deployments. If one
    controller suddenly shuts down during this process, only the osd
    deployment on the powered-on controller exists, even after the other
    controller later powers on and recovers. So add a cron job that checks
    the osd deployment status to ensure cluster health.

    Partial-Bug: 1920882

    Depends-On: I4afde8b1476e14453fac8561f1edde7360b8ee96

    Change-Id: I39cb66daecf4052821ceb28344a90ea70f63a742
    Signed-off-by: Chen, Haochuan Z <email address hidden>
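
The audit described in the commit message above can be pictured as a comparison between the OSDs that should exist and the OSD deployments that are actually present. The following is a minimal sketch of that idea only, not the merged cron job: `missing_osd_deployments` is a hypothetical helper, the deployment naming pattern `rook-ceph-osd-<id>` is an assumption, and how the expected ids and the deployment names are collected is left out.

```python
import re

# Assumed naming pattern for OSD deployments: rook-ceph-osd-<numeric id>.
OSD_DEPLOYMENT = re.compile(r"^rook-ceph-osd-(\d+)$")


def missing_osd_deployments(expected_ids, deployment_names):
    """Return the expected OSD ids that have no matching deployment."""
    present = set()
    for name in deployment_names:
        match = OSD_DEPLOYMENT.match(name)
        if match:
            present.add(int(match.group(1)))
    return sorted(set(expected_ids) - present)


if __name__ == "__main__":
    # Hypothetical state after the failed swact: only one OSD deployment left.
    names = ["rook-ceph-mon-a", "rook-ceph-osd-0"]
    print(missing_osd_deployments([0, 1], names))
```

A periodic job could run such a check and trigger recovery whenever the returned list is not empty; what recovery action to take is outside the scope of this sketch.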

Ghada Khalil (gkhalil) wrote :

@chen haochuan, Is there anything outstanding for this LP? Or can it be considered fixed based on the code merge above? If this LP is fixed, please mark the status as Fix Released.

Changed in starlingx:
status: In Progress → Fix Released