Distributed Cloud: ceph-osd crashed and was unrecoverable

Bug #1873988 reported by Tee Ngo
This bug affects 1 person
Affects: StarlingX
Status: In Progress
Importance: Medium
Assigned to: Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
During a week-long soak to verify the curator function of stx-monitor, ceph-osd crashed on controller-0 and was unrecoverable.

Severity
--------
Critical - monitor data had to be wiped, controller-0 rebooted, and the stx-monitor app removed and reapplied.

Steps to Reproduce
------------------
Set up a decent-sized distributed cloud (50+ subclouds)
Configure 425GB of ceph storage for monitor
Install stx-monitor on all subclouds (a minimal command sketch follows this list)
Deploy some test pods on all subclouds
Soak for a few days to check curator behavior when ceph storage usage reaches its configured threshold
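
The report does not include the exact commands used; the following is a minimal sketch of how stx-monitor is typically uploaded and applied on a StarlingX controller. The tarball name/version is illustrative, and the helm overrides for the 425GB monitor storage are omitted because they are not given in the report.

    # illustrative only - tarball name/version assumed, not taken from the report
    system application-upload stx-monitor-1.0-0.tgz
    system application-apply stx-monitor
    system application-list        # wait for stx-monitor to reach 'applied'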

Expected Behavior
------------------
Curator is activated to remove older data, making room for new data, when storage usage reaches the configured threshold.
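
The report does not show how curator activity was to be verified; one hedged way to check, assuming stx-monitor runs curator as a Kubernetes CronJob in a 'monitor' namespace (an assumption, not stated in the report), is:

    kubectl -n monitor get cronjobs                  # look for a curator job
    kubectl -n monitor get jobs                      # recent curator runs
    kubectl -n monitor logs job/<curator-job-name>   # confirm older indices were removed
    ceph df                                          # confirm pool usage dropped back under the threshold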

Actual Behavior
----------------
ceph-osd on controller-0 crashed and could not recover, which led to a controller failover.

| Alarm ID | Severity | Time Stamp                 | Entity ID                                                                         | Reason Text |
| 400.001  | minor    | 2020-04-13T21:54:39.744935 | service_domain=controller.service_group=storage-services.host=controller-0       | Service group storage-services warning; ceph-osd(enabled-active, failed) |
| 800.011  | major    | 2020-04-13T21:53:47.163216 | cluster=dfbbcd79-8114-417e-ad30-2777184f2f8a.peergroup=group-0.host=controller-0 | Loss of replication in replication group group-0: OSDs are down |
| 800.001  | critical | 2020-04-13T21:53:46.880226 | cluster=dfbbcd79-8114-417e-ad30-2777184f2f8a                                      | Storage Alarm Condition: HEALTH_ERR [PGs are degraded/stuck or undersized; Possible data damage: 4 pgs recovery_unfound]. Please check 'ceph -s' for more details. |
| 100.114  | minor    | 2020-04-13T21:51:16.253590 | host=controller-1.ntp=64:ff9b::d173:b56b                                          | NTP address 64:ff9b::d173:b56b is not a valid or a reachable NTP server. |
| 400.001  | major    | 2020-04-13T15:59:59.082588 | service_domain=controller.service_group=controller-services.host=controller-0     | Service group controller-services degraded; drbd-cephmon(enabled-standby, degraded, data-standalone), drbd-dockerdistribution(enabled-standby, degraded, data-standalone), drbd-etcd(enabled-standby, degraded, data-standalone), ... |

On the day this issue occurred, there was a power issue in the server room. However, there was no evidence that controller-0 was restarted due to power loss.
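
The 800.001 alarm reports recovery_unfound PGs. As a hedged sketch (standard ceph CLI, not commands taken from the report), the reported state could be inspected on the controller with:

    ceph -s                     # overall health, as suggested by the alarm text
    ceph health detail          # lists degraded/undersized PGs and the 4 recovery_unfound PGs
    ceph osd tree               # confirm which OSD(s) are down
    ceph pg dump_stuck unclean  # stuck placement groups, if any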

Reproducibility
---------------
Seen once

This issue is not readily reproducible. An extended soak with a large number of subclouds, or repeated DOR (dead-office-recovery) tests, might induce the issue.

System Configuration
--------------------
IPv6 distributed cloud

Branch/Pull Time/Commit
-----------------------
Feb 22 master load

Last Pass
---------
N/A

Timestamp/Logs
--------------
See ceph logs attached.

Test Activity
-------------
Evaluation

Workaround
----------
Wipe ceph data, remove stx-monitor, reboot controller, reapply stx-monitor
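
The exact recovery commands are not given in the report; a plausible sketch follows. The application commands are standard StarlingX CLI, while the data-wipe step is left as a comment because the report does not say which pools or paths were cleared.

    system application-remove stx-monitor
    system application-list                 # wait for stx-monitor to return to 'uploaded'
    # wipe the monitor data held in ceph (specific pools/paths not stated in the report)
    sudo reboot                             # reboot the affected controller
    system application-apply stx-monitor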

Revision history for this message
Tee Ngo (teewrs) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue seen once, but should be investigated given ceph was not recoverable

tags: added: stx.4.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

This is most likely a duplicate of https://bugs.launchpad.net/starlingx/+bug/1856064.
Although this bug was raised on 21-04-2020, I can see ceph-init logs from
"Mon Feb 24 20:33:33 UTC 2020", so I will assume the system wasn't reinstalled
after 2020-03-31, when the fix was merged into the master branch.

I propose marking this as a duplicate, because that is all I can assess using just the
ceph logs provided, and investigating further if it reappears.

Changed in starlingx:
status: Triaged → In Progress