Distributed Cloud: ceph-osd crashed and was unrecoverable

Bug #1873988 reported by Tee Ngo
This bug affects 1 person
Affects: StarlingX
Status: In Progress
Importance: Medium
Assigned to: Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
During a week-long soak to verify the curator function of stx-monitor, ceph-osd crashed on controller-0 and was unrecoverable.

Severity
--------
Critical - monitor data had to be wiped, controller-0 rebooted, and the stx-monitor app removed and reapplied.

Steps to Reproduce
------------------
Set up a decent-sized distributed cloud (50+ subclouds)
Configure 425GB of ceph storage for monitor
Install stx-monitor on all subclouds (a minimal command sketch follows this list)
Deploy some test pods on all subclouds
Soak for a few days to check curator behavior when ceph storage usage reaches its configured threshold
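
The report does not include the exact commands used; the following is a minimal sketch of how stx-monitor is typically uploaded and applied on a StarlingX controller. The tarball name/version is illustrative, and the helm overrides for the 425GB monitor storage are omitted because they are not given in the report.

    # illustrative only - tarball name/version assumed, not taken from the report
    system application-upload stx-monitor-1.0-0.tgz
    system application-apply stx-monitor
    system application-list        # wait for stx-monitor to reach 'applied'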

Expected Behavior
------------------
Curator is activated to remove older data, making room for new data, when storage usage reaches the configured threshold.
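
The report does not show how curator activity was to be verified; one hedged way to check, assuming stx-monitor runs curator as a Kubernetes CronJob in a 'monitor' namespace (an assumption, not stated in the report), is:

    kubectl -n monitor get cronjobs                  # look for a curator job
    kubectl -n monitor get jobs                      # recent curator runs
    kubectl -n monitor logs job/<curator-job-name>   # confirm older indices were removed
    ceph df                                          # confirm pool usage dropped back under the threshold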

Actual Behavior
----------------
ceph-osd on controller-0 crashed and could not recover, which led to a controller failover.

| Alarm ID | Severity | Time Stamp                 | Entity ID                                                                         | Reason Text |
| 400.001  | minor    | 2020-04-13T21:54:39.744935 | service_domain=controller.service_group=storage-services.host=controller-0       | Service group storage-services warning; ceph-osd(enabled-active, failed) |
| 800.011  | major    | 2020-04-13T21:53:47.163216 | cluster=dfbbcd79-8114-417e-ad30-2777184f2f8a.peergroup=group-0.host=controller-0 | Loss of replication in replication group group-0: OSDs are down |
| 800.001  | critical | 2020-04-13T21:53:46.880226 | cluster=dfbbcd79-8114-417e-ad30-2777184f2f8a                                      | Storage Alarm Condition: HEALTH_ERR [PGs are degraded/stuck or undersized; Possible data damage: 4 pgs recovery_unfound]. Please check 'ceph -s' for more details. |
| 100.114  | minor    | 2020-04-13T21:51:16.253590 | host=controller-1.ntp=64:ff9b::d173:b56b                                          | NTP address 64:ff9b::d173:b56b is not a valid or a reachable NTP server. |
| 400.001  | major    | 2020-04-13T15:59:59.082588 | service_domain=controller.service_group=controller-services.host=controller-0     | Service group controller-services degraded; drbd-cephmon(enabled-standby, degraded, data-standalone), drbd-dockerdistribution(enabled-standby, degraded, data-standalone), drbd-etcd(enabled-standby, degraded, data-standalone), ... |

On the day this issue occurred, there was a power issue in the server room. However, there was no evidence that controller-0 was restarted due to power loss.
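
The 800.001 alarm reports recovery_unfound PGs. As a hedged sketch (standard ceph CLI, not commands taken from the report), the reported state could be inspected on the controller with:

    ceph -s                     # overall health, as suggested by the alarm text
    ceph health detail          # lists degraded/undersized PGs and the 4 recovery_unfound PGs
    ceph osd tree               # confirm which OSD(s) are down
    ceph pg dump_stuck unclean  # stuck placement groups, if any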

Reproducibility
---------------
Seen once

This issue is not readily reproducible. An extended soak with a large number of subclouds, or repeated DOR (dead-office-recovery) tests, might induce the issue.

System Configuration
--------------------
IPv6 distributed cloud

Branch/Pull Time/Commit
-----------------------
Feb 22 master load

Last Pass
---------
N/A

Timestamp/Logs
--------------
See ceph logs attached.

Test Activity
-------------
Evaluation

Workaround
----------
Wipe ceph data, remove stx-monitor, reboot controller, reapply stx-monitor
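
The exact recovery commands are not given in the report; a plausible sketch follows. The application commands are standard StarlingX CLI, while the data-wipe step is left as a comment because the report does not say which pools or paths were cleared.

    system application-remove stx-monitor
    system application-list                 # wait for stx-monitor to return to 'uploaded'
    # wipe the monitor data held in ceph (specific pools/paths not stated in the report)
    sudo reboot                             # reboot the affected controller
    system application-apply stx-monitor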

Revision history for this message
Tee Ngo (teewrs) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue seen once, but should be investigated given ceph was not recoverable

tags: added: stx.4.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

This is most likely a duplicate of https://bugs.launchpad.net/starlingx/+bug/1856064.
Although this bug was raised on 21-04-2020, I can see ceph-init logs from
"Mon Feb 24 20:33:33 UTC 2020", so I will assume the system wasn't reinstalled
after 2020-03-31, when the fix was merged into the master branch.

I propose marking this as a duplicate, because that is all I can assess using just the
ceph logs provided, and investigating further if it reappears.

Changed in starlingx:
status: Triaged → In Progress