Switch reboot collapses CEPH storage when workload applications are installed
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Hediberto Cavalcante da Silva |
Bug Description
Brief Description
-----------------
This issue is mainly observed on both Duplex and Duplex + 2 worker configurations: if the switch connecting the oam0 and mgmt0 networks is rebooted, both controllers reboot and CEPH is not able to recover automatically, leaving the Radio Site offline.
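The failure state can be confirmed from the active controller; a minimal sketch using standard StarlingX and CEPH CLIs (these are the commands behind the alarm and cluster output captured under Timestamp/Logs below):

  # Active platform alarms (the 400.001/800.x storage alarms appear here)
  fm alarm-list

  # Overall CEPH health and OSD topology
  ceph -s
  ceph osd tree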
Severity
--------
Critical
Steps to Reproduce
------------------
- Both controllers are online and configured.
- The application workload is installed, online, and configured.
- The switch is rebooted.
- The system comes back online with active and standby controllers.
- But CEPH does NOT recover (see the verification sketch below).
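A minimal sketch of how the last two steps can be verified (standard StarlingX and CEPH CLIs; the 10-second poll interval is an arbitrary choice):

  # Both controllers report unlocked/enabled/available once the switch is back
  system host-list

  # CEPH, however, never returns to HEALTH_OK; poll its health summary
  watch -n 10 "ceph health"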
Expected Behavior
------------------
The system should recover both the KUBERNETES CLUSTER and the CEPH CLUSTER.
Actual Behavior
----------------
The KUBERNETES CLUSTER looks OK, but the CEPH CLUSTER does not recover.
Reproducibility
---------------
100% reproducible.
System Configuration
--------------------
Multi-node deployment with CEPH storage, replication factor = 2.
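The replication factor can be checked on the running system; a small sketch assuming the default StarlingX CEPH backend and the default kube-rbd pool name:

  # Configured storage backend and its replication setting
  system storage-backend-list

  # Per-pool replica count; 'size: 2' corresponds to replication factor 2
  ceph osd pool get kube-rbd size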
Timestamp/Logs
--------------
+----------+---------------------------------------------------------------------+---------------+----------+------------+
| Alarm ID | Reason Text                                                         | Entity ID     | Severity | Time Stamp |
+----------+---------------------------------------------------------------------+---------------+----------+------------+
| 400.001  | Service group storage-services warning; ceph-osd(enabling, failed) | service_      |          |            |
|          |                                                                     | controller-1  |          |            |
|          |                                                                     |               |          |            |
| 800.011  | Loss of replication in replication group group-0: OSDs are down    | cluster=      |          |            |
|          |                                                                     | e65e33193e6e. |          |            |
|          |                                                                     | controller-1  |          |            |
|          |                                                                     |               |          |            |
| 800.001  | Storage Alarm Condition: HEALTH_ERR [PGs are degraded/stuck or     | cluster=      |          | 10.267334  |
|          | undersized; Possible data damage: 46 pgs recovery_unfound].        | e65e33193e6e  |          |            |
|          | Please check 'ceph -s' for more details.                           |               |          |            |
+----------+---------------------------------------------------------------------+---------------+----------+------------+
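The entity IDs above are truncated in this capture; the full reason text and entity instance of each alarm can be retrieved individually (the UUID placeholder below is hypothetical and would come from the alarm listing):

  # Complete details for a single alarm
  fm alarm-show <alarm-uuid>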
Output of 'ceph -s':

  cluster:
    id:     334fd130-
    health: HEALTH_ERR
            1 filesystem is degraded
            1 osds down
            1 host (1 osds) down
            Reduced data availability: 4 pgs inactive, 4 pgs down

  services:
    mon: 1 daemons, quorum controller (age 13m)
    mgr: controller-
    mds: kube-cephfs:1/1 {0=controller-
    osd: 2 osds: 1 up (since 4m), 2 in (since 4w)

  data:
    pools:   3 pools, 192 pgs
    objects: 98.79k objects, 16 GiB
    usage:   33 GiB used, 1.7 TiB / 1.7 TiB avail
    pgs:     2.083% pgs not active
             132 active+
             46  active+
             10  active+undersized
             4   down
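A sketch of standard follow-up diagnostics for the HEALTH_ERR state above (plain upstream ceph CLI; nothing StarlingX-specific assumed):

  # Per-PG detail behind HEALTH_ERR: degraded, undersized, down and unfound PGs
  ceph health detail

  # Placement groups stuck inactive/unclean and the OSDs they map to
  ceph pg dump_stuck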
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 1.74377 root storage-tier
-2 1.74377 chassis group-0
-4 0.87189 host controller-0
0 ssd 0.87189 osd.0 up 1.00000 1.00000
-3 0.87189 host controller-1
1 ssd 0.87189 osd.1 down 1.00000 1.00000
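With osd.1 reported down on controller-1, a hedged sketch of how the failed daemon could be inspected on that host (the log path is the conventional CEPH location and may differ on a given load):

  # On controller-1: is the OSD process running at all?
  ps -ef | grep ceph-osd

  # Recent OSD log output, for the enabling/failed reason behind alarm 400.001
  tail -n 100 /var/log/ceph/ceph-osd.1.log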
Output of 'system host-list':

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
Workaround
----------
No workaround is available.
Changed in starlingx:
  assignee: nobody → Hediberto Cavalcante da Silva (hcavalca)
  tags: added: stx.storage
Changed in starlingx:
  importance: Undecided → Medium
  tags: added: stx.9.0
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/872196