Switch reboot collapses Ceph storage when workload applications are installed

Bug #2004183 reported by Hediberto Cavalcante da Silva
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Hediberto Cavalcante da Silva

Bug Description

Brief Description
-----------------
This issue is mainly observed on Duplex and Duplex + 2 workers configurations: if the switch connecting the oam0 and mgmt0 networks is rebooted, both controllers reboot and Ceph is not able to recover automatically.

As a result, the radio site goes offline.

Severity
--------
Critical

Steps to Reproduce
------------------
- Both controllers are online and configured.
- The workload application is installed, online, and configured.
- The switch is rebooted.
- The system comes back online with active and standby controllers.
- Ceph does NOT recover.

Expected Behavior
------------------
The system should recover both the Kubernetes cluster and the Ceph cluster.

Actual Behavior
----------------
The Kubernetes cluster looks OK, but the Ceph cluster does not recover.

Reproducibility
---------------
100% reproducible.

System Configuration
--------------------
Multi-node deployment with Ceph storage, replication factor = 2.
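
For reference, the configured replication factor can be confirmed with standard Ceph pool commands; a minimal sketch (the pool name is an assumption, list the actual pools first):

    # List the pools, then read the replica count ("size") of one of them
    ceph osd pool ls
    ceph osd pool get kube-rbd size    # kube-rbd is an assumed pool name; expect "size: 2"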

Timestamp/Logs
--------------
+----------+-------------+-----------+----------+------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------+-----------+----------+------------+
| 400.001  | Service group storage-services warning; ceph-osd(enabling, failed) | service_domain=controller.service_group=storage-services.host=controller-1 | minor | 2022-09-26T11:02:52.003632 |
| 800.011  | Loss of replication in replication group group-0: OSDs are down | cluster=334fd130-954d-4361-b0ea-e65e33193e6e.peergroup=group-0.host=controller-1 | major | 2022-09-26T10:59:10.452143 |
| 800.001  | Storage Alarm Condition: HEALTH_ERR [PGs are degraded/stuck or undersized; Possible data damage: 46 pgs recovery_unfound]. Please check 'ceph -s' for more details. | cluster=334fd130-954d-4361-b0ea-e65e33193e6e | critical | 2022-09-26T10:59:10.267334 |
+----------+-------------+-----------+----------+------------+
cluster:
    id:     334fd130-954d-4361-b0ea-e65e33193e6e
    health: HEALTH_ERR
            1 filesystem is degraded
            1 osds down
            1 host (1 osds) down
            1044/98787 objects unfound (1.057%)
            Reduced data availability: 4 pgs inactive, 4 pgs down
            Possible data damage: 46 pgs recovery_unfound
            Degraded data redundancy: 99817/197574 objects degraded (50.521%), 178 pgs degraded, 188 pgs undersized

services:
    mon: 1 daemons, quorum controller (age 13m)
    mgr: controller-1(active, since 13m), standbys: controller-0
    mds: kube-cephfs:1/1 {0=controller-1=up:replay} 1 up:standby
    osd: 2 osds: 1 up (since 4m), 2 in (since 4w)

data:
    pools:   3 pools, 192 pgs
    objects: 98.79k objects, 16 GiB
    usage:   33 GiB used, 1.7 TiB / 1.7 TiB avail
    pgs:     2.083% pgs not active
             99817/197574 objects degraded (50.521%)
             1044/98787 objects unfound (1.057%)
             132 active+undersized+degraded
             46  active+recovery_unfound+undersized+degraded
             10  active+undersized
             4   down
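
For reference, the unfound objects reported above can be inspected with standard read-only Ceph commands; a minimal sketch (the PG id is a placeholder taken from the 'ceph health detail' output):

    # Show which of the 46 PGs are in recovery_unfound and why
    ceph health detail
    # For a PG id reported there, list its unfound objects
    ceph pg <pgid> list_unfound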

ID CLASS WEIGHT  TYPE NAME                 STATUS REWEIGHT PRI-AFF
-1       1.74377 root storage-tier
-2       1.74377     chassis group-0
-4       0.87189         host controller-0
 0   ssd 0.87189             osd.0         up     1.00000  1.00000
-3       0.87189         host controller-1
 1   ssd 0.87189             osd.1         down   1.00000  1.00000

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Workaround
----------
No workaround available.

Changed in starlingx:
assignee: nobody → Hediberto Cavalcante da Silva (hcavalca)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/872196

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/872202

Ghada Khalil (gkhalil)
tags: added: stx.storage
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/872202
Committed: https://opendev.org/starlingx/integ/commit/b629db6b9fc58a96171f2142b347463cc9fd288f
Submitter: "Zuul (22348)"
Branch: master

commit b629db6b9fc58a96171f2142b347463cc9fd288f
Author: Hediberto Cavalcante da Silva <email address hidden>
Date: Mon Jan 30 13:09:42 2023 -0500

    AIO-DX Ceph Optimizations

    This change is part of the solution to resolve the scenario where
    Ceph MON starts without having data in sync when there is no
    communication with the peer, leading to PG issues.

    Improvements:

    Removed starting Ceph MON and MDS from ceph.sh script called by
    mtcClient for AIO-DX:
    - Ceph MDS was not being managed, only started by ceph.sh
      script called from mtcClient. Now it will be managed by PMON.
    - Ceph MON will continue to be managed by SM.

    The ceph-init-wrapper script will verify some conditions to start
    Ceph MON safely (see the sketch after this list):
    - First, check if the drbd-cephmon role is Primary.
    - Then, check if the drbd-cephmon partition is mounted correctly.
    - Check flags (inside the drbd-cephmon path) for the last active
      Ceph MON process (Controller-0 or Controller-1). This flag is
      created by the last successful Ceph MON start.
    - If the last active monitor is the other one, check that
      drbd-cephmon is UpToDate/UpToDate, meaning that data is
      synchronized between controllers.
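
    A minimal sketch of these checks, assuming the resource name drbd-cephmon
    and a hypothetical mount point and flag file (not the exact implementation):

        RES=drbd-cephmon
        MON_PATH=/var/lib/ceph/mon          # assumed drbd-cephmon mount point
        FLAG="${MON_PATH}/.last_mon_host"   # hypothetical flag written on successful MON start

        can_start_mon() {
            # 1) drbd-cephmon must be Primary on this node
            [ "$(drbdadm role ${RES} | cut -d/ -f1)" = "Primary" ] || return 1
            # 2) the drbd-cephmon partition must be mounted
            mountpoint -q "${MON_PATH}" || return 1
            # 3) if the last active MON was the peer, require fully synced data
            if [ -f "${FLAG}" ] && [ "$(cat "${FLAG}")" != "$(hostname)" ]; then
                [ "$(drbdadm dstate ${RES})" = "UpToDate/UpToDate" ] || return 1
            fi
            return 0
        }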

    We also made some improvements to the /etc/init.d/ceph script so that
    Ceph OSD can be stopped even when no Ceph MON is available. Previously,
    stopping an OSD without a Ceph Monitor would hang because the journal
    flush command waited forever to reach an available Ceph Monitor.
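
    One possible shape of that change (the timeout value and variable name are
    illustrative, not the exact code):

        # Do not let the journal flush block forever when no monitor is reachable;
        # on timeout, log and continue stopping the OSD.
        timeout 30 ceph-osd -i "${osd_id}" --flush-journal || \
            echo "journal flush skipped or timed out; continuing to stop osd.${osd_id}"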

    Test Plan:
        PASS: system host-swact.
        PASS: Ceph recovery after a mgmt network outage of a few minutes,
              even when rebooting controllers.
        PASS: Ceph recovery after rebooting the active controller.
        PASS: Ceph recovery after a dead office recovery (DOR).
        PASS: Running shellcheck on ceph-base.ceph.init, ceph.sh,
              and ceph-init-wrapper.sh files without any complaints
              about the lines related to the changes.

    Closes-bug: 2004183

    Signed-off-by: Hediberto Cavalcante da Silva <email address hidden>
    Change-Id: Id09432aecef68b39adabf633c74545f2efa02e99

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/872196
Committed: https://opendev.org/starlingx/stx-puppet/commit/1eb3bec38d799f9eb5a57a74c9efe904a5ad3a26
Submitter: "Zuul (22348)"
Branch: master

commit 1eb3bec38d799f9eb5a57a74c9efe904a5ad3a26
Author: Hediberto Cavalcante da Silva <email address hidden>
Date: Mon Jan 30 12:11:28 2023 -0500

    AIO-DX Ceph Monitoring

    This change is part of the solution to resolve the scenario where
    Ceph MON starts without having data in sync when there is no
    communication with the peer, leading to PG issues.

    Ceph MDS was not being managed, only started by ceph.sh script
    called from mtcClient. Now it will be managed by pmon.
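
    A quick way to verify this wiring after the manifest is applied (using the
    symlink path named in the test plan below):

        # The pmon config for ceph-mds should be a symlink installed by puppet
        readlink -f /etc/pmon.d/ceph-mds.conf   # expect: /etc/ceph/ceph-mds.conf.pmon
        # And the MDS process should be up and monitored
        ps -ef | grep '[c]eph-mds'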

    Test Plan:
        PASS: After applying the puppet manifest, the symlink
              /etc/pmon.d/ceph-mds.conf -> /etc/ceph/ceph-mds.conf.pmon
              exists.
        PASS: system host-swact.
        PASS: Ceph recovery after a mgmt network outage of a few minutes,
              even when rebooting controllers.
        PASS: Ceph recovery after rebooting the active controller.
        PASS: Ceph recovery after a dead office recovery (DOR).

    Partial-Bug: 2004183

    Depends-On: https://review.opendev.org/c/starlingx/integ/+/872202

    Signed-off-by: Hediberto Cavalcante da Silva <email address hidden>
    Change-Id: Id5b9f456a60235fe6446f58de9d513f0a9179c9b

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0