Distributed Cloud: ceph-osd crashed and was unrecoverable
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | In Progress | Medium | Paul-Ionut Vaduva |
Bug Description
Brief Description
-----------------
During a week-long soak to verify the curator function of stx-monitor, ceph-osd crashed on controller-0 and was unrecoverable.
Severity
--------
Critical - monitor data had to be wiped, controller-0 rebooted, and the stx-monitor app removed and reapplied.
Steps to Reproduce
------------------
1. Set up a decent-sized distributed cloud (50+ subclouds)
2. Configure 425GB of ceph storage for monitor
3. Install stx-monitor on all subclouds
4. Deploy some test pods on all subclouds
5. Soak for a few days to check the curator behavior when ceph storage usage reaches its configured threshold (see the example commands after this list)
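For context, the storage and application setup in steps 2-3 is normally driven through the StarlingX system CLI. A minimal sketch on one controller follows; the disk UUID and tarball name are placeholders, not values from this setup:

  # List disks and add one as a ceph OSD (disk UUID is a placeholder)
  system host-disk-list controller-0
  system host-stor-add controller-0 osd <disk-uuid>

  # Upload and apply the stx-monitor application (tarball name is a placeholder)
  system application-upload stx-monitor-1.0-0.tgz
  system application-apply stx-monitor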
Expected Behavior
------------------
Curator is activated to remove older data to make room for new data when storage usage reaches the configured threshold.
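The usage that curator reacts to can also be checked by hand. As a rough sketch (the Elasticsearch endpoint localhost:9200 is an assumption, not taken from this system):

  # Overall ceph pool usage backing the monitor
  ceph df

  # Elasticsearch index sizes, largest first (endpoint is an assumption)
  curl -s 'http://localhost:9200/_cat/indices?v&s=store.size:desc'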
Actual Behavior
----------------
ceph-osd on controller-0 crashed and could not recover, which led to a controller failover.
(fields shown as '...' were truncated in the capture)
+----------+---------------------------------------------------------------------------+-------------------------------+----------+-----------------+
| Alarm ID | Reason Text                                                               | Entity ID                     | Severity | Time Stamp      |
+----------+---------------------------------------------------------------------------+-------------------------------+----------+-----------------+
| 400.001  | Service group storage-services warning; ceph-osd(...                     | service_...controller-0       |          |                 |
| 800.011  | Loss of replication in replication group group-0: OSDs are down          | cluster=...ad30-2777184f2f... |          |                 |
|          |                                                                           | host=controller-0             |          |                 |
| 800.001  | Storage Alarm Condition: HEALTH_ERR [PGs are degraded/stuck or           | cluster=...ad30-2777184f2f8a  |          | ...53:46.880226 |
|          | undersized; Possible data damage: 4 pgs recovery_unfound]. Please        |                               |          |                 |
|          | check 'ceph -s' for more details.                                         |                               |          |                 |
| 100.114  | NTP address 64:ff9b::d173:b56b is not a valid or a reachable NTP server. | host=controller...b56b        |          | ...51:16.253590 |
| 400.001  | Service group controller-services degraded; drbd-cephmon(... degraded,   | ...                           |          |                 |
|          | data-standalone), drbd-dockerdist(... data-standalone), drbd-etcd(...    |                               |          |                 |
+----------+---------------------------------------------------------------------------+-------------------------------+----------+-----------------+
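On a live system, the ceph state behind these alarms is usually cross-checked with the commands below; this is a generic sketch, not output from this incident:

  # Cluster health and OSD status (alarm 800.001 itself points at 'ceph -s')
  ceph -s
  ceph osd tree

  # Active StarlingX fault-management alarms
  fm alarm-list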
On the day this issue occurred, there was a power issue in the server room. However, there was no evidence that controller-0 was restarted due to power loss.
Reproducibility
---------------
Seen once
This issue is not readily reproducible. An extended soak with a large number of subclouds or repeated DOR (dead office recovery) tests might induce it.
System Configuration
--------------------
IPv6 distributed cloud
Branch/Pull Time/Commit
-----------------------
Feb 22 master load
Last Pass
---------
N/A
Timestamp/Logs
--------------
See ceph logs attached.
Test Activity
-------------
Evaluation
Workaround
----------
Wipe the ceph monitor data, remove stx-monitor, reboot the controller, then reapply stx-monitor (see the sketch below).
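A sketch of the application half of the workaround, using the standard StarlingX application commands; the data wipe and reboot steps are site-specific and only summarized in comments:

  # Remove the stx-monitor application
  system application-remove stx-monitor

  # Wipe the monitor's ceph-backed data and reboot controller-0
  # (site-specific; exact pools/paths are not captured in this report)

  # Reapply the application once the controller is back up
  system application-apply stx-monitor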
Changed in starlingx:
assignee: nobody → Paul-Ionut Vaduva (pvaduva)
stx.4.0 / medium priority - issue seen once, but should be investigated given ceph was not recoverable