ceph-osds are down after swact operation generating a coredump

Bug #1890293 reported by ayyappa
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Medium
Assigned to: Andrei Grosu
Milestone:

Bug Description

Brief Description
-----------------
ceph-osds are down after the swact operation, raising the alarm "Loss of replication in replication group group-0: OSDs are down" and generating a coredump.
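
As a reference point, the OSD state and the coredump can be confirmed from the active controller. This is a minimal sketch assuming standard Ceph tooling and systemd-coredump; the exact core-file location and tooling on the load may differ:

# Show overall cluster health and any OSDs reported down
ceph -s
ceph osd tree | grep -i down

# Look for a ceph-osd core dump (assumes systemd-coredump is collecting cores)
coredumpctl list ceph-osd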

Severity
--------
Minor

Steps to Reproduce
------------------
This issue was noticed in cert-manager test automation after a swact operation.

1) After a swact operation, the following alarms are generated on the system (see the command sketch after the alarm tables below):

+----------+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
| 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=60b1de06-21f9-4ed0-9fd8-769b34ff8092.peergroup=group-0.host=controller-1 | major | 2020-08-01T13:15:30.993123 |
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=60b1de06-21f9-4ed0-9fd8-769b34ff8092 | warning | 2020-08-01T13:15:30.706021 |
+----------+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+

'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'

+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
| fb516ca6-5001-43bc-a337-d8e6377dd5f5 | 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=60b1de06-21f9-4ed0-9fd8-769b34ff8092 | warning | 2020-08-01T13:16:31.602822 |
| 2759ec2b-8f8e-4e94-8d92-225eebfb9ce7 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=60b1de06-21f9-4ed0-9fd8-769b34ff8092.peergroup=group-0.host=controller-1 | major | 2020-08-01T13:15:30.993123 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
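
A minimal command sketch for reproducing this check, on the assumption that the admin credentials are sourced from /etc/platform/openrc and that controller-0 is the active controller at the time (adjust the hostname as needed):

source /etc/platform/openrc

# Swact services away from the active controller
system host-swact controller-0

# Once the swact completes, list the active alarms (short form of the fm command above)
fm alarm-list --nowrap --uuid

# Check ceph health from the newly active controller
ceph -s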

Expected Behavior
------------------
swact should be successful without any errors

Actual Behavior
----------------
Ceph OSDs are down after the swact operation.

Reproducibility
---------------
50%

System Configuration
--------------------
WP_13_14 ipv6

Branch/Pull Time/Commit
-----------------------
2020-07-31

Last Pass
---------
2020-07-25_00-00-00

Timestamp/Logs
--------------
2020-08-01T13:15:30.993123

Test Activity
-------------
Automation

Workaround
----------
Haven't found any

Ghada Khalil (gkhalil)
tags: added: stx.storage
ayyappa (mantri425)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - ceph appears to be down after a swact. The current assumption is that this is fairly intermittent since we don't see this reported in other sanity/regression runs.

tags: added: stx.5.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Bob Church (rchurch)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Andrei Grosu (agrosu1)
Revision history for this message
Frank Miller (sensfan22) wrote :

This issue has not been seen recently and was only reported once. If the frequency of this issue increases, please open a new LP with a recent load.

Changed in starlingx:
status: Triaged → Won't Fix
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded