Switch reboot collapses Ceph storage when workload applications are installed

Bug #2004183 reported by Hediberto Cavalcante da Silva
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Hediberto Cavalcante da Silva

Bug Description

Brief Description
-----------------
This issue is mainly observed on Duplex and Duplex + 2 workers configurations: if the switch connecting the oam0 and mgmt0 networks is rebooted, both controllers reboot and Ceph is not able to recover automatically.

As a result, the radio site goes offline.

Severity
--------
Critical

Steps to Reproduce
------------------
- Both controllers are online and configured.
- The workload application is installed, online, and configured.
- The switch is rebooted.
- The system comes back online with active and standby controllers.
- Ceph does NOT recover.

Expected Behavior
------------------
The system should recover both the Kubernetes cluster and the Ceph cluster.

Actual Behavior
----------------
The Kubernetes cluster looks OK, but the Ceph cluster does not recover.

Reproducibility
---------------
100% reproducible.

System Configuration
--------------------
Multi-node deployment with Ceph storage, replication factor = 2.
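
For reference, the configured replication factor can be confirmed with standard Ceph pool commands; a minimal sketch (the pool name is an assumption, list the actual pools first):

    # List the pools, then read the replica count ("size") of one of them
    ceph osd pool ls
    ceph osd pool get kube-rbd size    # kube-rbd is an assumed pool name; expect "size: 2"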

Timestamp/Logs
--------------
+----------+-------------+-----------+----------+------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------+-----------+----------+------------+
| 400.001  | Service group storage-services warning; ceph-osd(enabling, failed) | service_domain=controller.service_group=storage-services.host=controller-1 | minor | 2022-09-26T11:02:52.003632 |
| 800.011  | Loss of replication in replication group group-0: OSDs are down | cluster=334fd130-954d-4361-b0ea-e65e33193e6e.peergroup=group-0.host=controller-1 | major | 2022-09-26T10:59:10.452143 |
| 800.001  | Storage Alarm Condition: HEALTH_ERR [PGs are degraded/stuck or undersized; Possible data damage: 46 pgs recovery_unfound]. Please check 'ceph -s' for more details. | cluster=334fd130-954d-4361-b0ea-e65e33193e6e | critical | 2022-09-26T10:59:10.267334 |
+----------+-------------+-----------+----------+------------+
cluster:
    id:     334fd130-954d-4361-b0ea-e65e33193e6e
    health: HEALTH_ERR
            1 filesystem is degraded
            1 osds down
            1 host (1 osds) down
            1044/98787 objects unfound (1.057%)
            Reduced data availability: 4 pgs inactive, 4 pgs down
            Possible data damage: 46 pgs recovery_unfound
            Degraded data redundancy: 99817/197574 objects degraded (50.521%), 178 pgs degraded, 188 pgs undersized

services:
    mon: 1 daemons, quorum controller (age 13m)
    mgr: controller-1(active, since 13m), standbys: controller-0
    mds: kube-cephfs:1/1 {0=controller-1=up:replay} 1 up:standby
    osd: 2 osds: 1 up (since 4m), 2 in (since 4w)

data:
    pools:   3 pools, 192 pgs
    objects: 98.79k objects, 16 GiB
    usage:   33 GiB used, 1.7 TiB / 1.7 TiB avail
    pgs:     2.083% pgs not active
             99817/197574 objects degraded (50.521%)
             1044/98787 objects unfound (1.057%)
             132 active+undersized+degraded
             46  active+recovery_unfound+undersized+degraded
             10  active+undersized
             4   down
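
For reference, the unfound objects reported above can be inspected with standard read-only Ceph commands; a minimal sketch (the PG id is a placeholder taken from the 'ceph health detail' output):

    # Show which of the 46 PGs are in recovery_unfound and why
    ceph health detail
    # For a PG id reported there, list its unfound objects
    ceph pg <pgid> list_unfound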

ID CLASS WEIGHT  TYPE NAME                 STATUS REWEIGHT PRI-AFF
-1       1.74377 root storage-tier
-2       1.74377     chassis group-0
-4       0.87189         host controller-0
 0   ssd 0.87189             osd.0         up     1.00000  1.00000
-3       0.87189         host controller-1
 1   ssd 0.87189             osd.1         down   1.00000  1.00000

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Workaround
----------
No workaround available.

Changed in starlingx:
assignee: nobody → Hediberto Cavalcante da Silva (hcavalca)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/872196

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/872202

Ghada Khalil (gkhalil)
tags: added: stx.storage
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/872202
Committed: https://opendev.org/starlingx/integ/commit/b629db6b9fc58a96171f2142b347463cc9fd288f
Submitter: "Zuul (22348)"
Branch: master

commit b629db6b9fc58a96171f2142b347463cc9fd288f
Author: Hediberto Cavalcante da Silva <email address hidden>
Date: Mon Jan 30 13:09:42 2023 -0500

    AIO-DX Ceph Optimizations

    This change is part of the solution to resolve the scenario where
    Ceph MON starts without having data in sync when there is no
    communication with the peer, leading to PG issues.

    Improvements:

    Removed starting Ceph MON and MDS from ceph.sh script called by
    mtcClient for AIO-DX:
    - Ceph MDS was not being managed, only started by ceph.sh
      script called from mtcClient. Now it will be managed by PMON.
    - Ceph MON will continue to be managed by SM.

    The ceph-init-wrapper script will verify some conditions to start
    Ceph MON safely (see the sketch after this list):
    - First, check if the drbd-cephmon role is Primary.
    - Then, check if the drbd-cephmon partition is mounted correctly.
    - Check flags (inside the drbd-cephmon path) for the last active
      Ceph MON process (Controller-0 or Controller-1). This flag is
      created by the last successful Ceph MON start.
    - If the last active monitor is the other one, check that
      drbd-cephmon is UpToDate/UpToDate, meaning that data is
      synchronized between controllers.
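
    A minimal sketch of these checks, assuming the resource name drbd-cephmon
    and a hypothetical mount point and flag file (not the exact implementation):

        RES=drbd-cephmon
        MON_PATH=/var/lib/ceph/mon          # assumed drbd-cephmon mount point
        FLAG="${MON_PATH}/.last_mon_host"   # hypothetical flag written on successful MON start

        can_start_mon() {
            # 1) drbd-cephmon must be Primary on this node
            [ "$(drbdadm role ${RES} | cut -d/ -f1)" = "Primary" ] || return 1
            # 2) the drbd-cephmon partition must be mounted
            mountpoint -q "${MON_PATH}" || return 1
            # 3) if the last active MON was the peer, require fully synced data
            if [ -f "${FLAG}" ] && [ "$(cat "${FLAG}")" != "$(hostname)" ]; then
                [ "$(drbdadm dstate ${RES})" = "UpToDate/UpToDate" ] || return 1
            fi
            return 0
        }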

    We also made some improvements to the /etc/init.d/ceph script so that
    Ceph OSD can be stopped even when no Ceph MON is available. Previously,
    stopping an OSD without a Ceph Monitor would hang because the journal
    flush command waited forever to reach an available Ceph Monitor.
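
    One possible shape of that change (the timeout value and variable name are
    illustrative, not the exact code):

        # Do not let the journal flush block forever when no monitor is reachable;
        # on timeout, log and continue stopping the OSD.
        timeout 30 ceph-osd -i "${osd_id}" --flush-journal || \
            echo "journal flush skipped or timed out; continuing to stop osd.${osd_id}"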

    Test Plan:
        PASS: system host-swact.
        PASS: Ceph recovery after a mgmt network outage of a few minutes,
              even when rebooting controllers.
        PASS: Ceph recovery after rebooting the active controller.
        PASS: Ceph recovery after a dead office recovery (DOR).
        PASS: Running shellcheck on ceph-base.ceph.init, ceph.sh,
              and ceph-init-wrapper.sh files without any complaints
              about the lines related to the changes.

    Closes-bug: 2004183

    Signed-off-by: Hediberto Cavalcante da Silva <email address hidden>
    Change-Id: Id09432aecef68b39adabf633c74545f2efa02e99

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/872196
Committed: https://opendev.org/starlingx/stx-puppet/commit/1eb3bec38d799f9eb5a57a74c9efe904a5ad3a26
Submitter: "Zuul (22348)"
Branch: master

commit 1eb3bec38d799f9eb5a57a74c9efe904a5ad3a26
Author: Hediberto Cavalcante da Silva <email address hidden>
Date: Mon Jan 30 12:11:28 2023 -0500

    AIO-DX Ceph Monitoring

    This change is part of the solution to resolve the scenario where
    Ceph MON starts without having data in sync when there is no
    communication with the peer, leading to PG issues.

    Ceph MDS was not being managed, only started by ceph.sh script
    called from mtcClient. Now it will be managed by pmon.
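
    A quick way to verify this wiring after the manifest is applied (using the
    symlink path named in the test plan below):

        # The pmon config for ceph-mds should be a symlink installed by puppet
        readlink -f /etc/pmon.d/ceph-mds.conf   # expect: /etc/ceph/ceph-mds.conf.pmon
        # And the MDS process should be up and monitored
        ps -ef | grep '[c]eph-mds'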

    Test Plan:
        PASS: After applying the puppet manifest, the symlink
              /etc/pmon.d/ceph-mds.conf -> /etc/ceph/ceph-mds.conf.pmon
              exists.
        PASS: system host-swact.
        PASS: Ceph recovery after a mgmt network outage of a few minutes,
              even when rebooting controllers.
        PASS: Ceph recovery after rebooting the active controller.
        PASS: Ceph recovery after a dead office recovery (DOR).

    Partial-Bug: 2004183

    Depends-On: https://review.opendev.org/c/starlingx/integ/+/872202

    Signed-off-by: Hediberto Cavalcante da Silva <email address hidden>
    Change-Id: Id5b9f456a60235fe6446f58de9d513f0a9179c9b

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0