ceph-osds are down after swact operation generating a coredump

Bug #1890293 reported by ayyappa
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Medium
Assigned to: Andrei Grosu
Milestone:

Bug Description

Brief Description
-----------------
ceph-osds are down after the swact operation, raising the alarm "Loss of replication in replication group group-0: OSDs are down" and generating a coredump.
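
As a reference point, the OSD state and the coredump can be confirmed from the active controller. This is a minimal sketch assuming standard Ceph tooling and systemd-coredump; the exact core-file location and tooling on the load may differ:

# Show overall cluster health and any OSDs reported down
ceph -s
ceph osd tree | grep -i down

# Look for a ceph-osd core dump (assumes systemd-coredump is collecting cores)
coredumpctl list ceph-osd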

Severity
--------
Minor

Steps to Reproduce
------------------
This issue was noticed in cert-manager test automation after a swact operation.

1) After a swact operation, the following alarms are generated on the system (see the command sketch after the alarm tables below):

+----------+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
| 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=60b1de06-21f9-4ed0-9fd8-769b34ff8092.peergroup=group-0.host=controller-1 | major | 2020-08-01T13:15:30.993123 |
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=60b1de06-21f9-4ed0-9fd8-769b34ff8092 | warning | 2020-08-01T13:15:30.706021 |
+----------+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+

'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'

+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
| fb516ca6-5001-43bc-a337-d8e6377dd5f5 | 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=60b1de06-21f9-4ed0-9fd8-769b34ff8092 | warning | 2020-08-01T13:16:31.602822 |
| 2759ec2b-8f8e-4e94-8d92-225eebfb9ce7 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=60b1de06-21f9-4ed0-9fd8-769b34ff8092.peergroup=group-0.host=controller-1 | major | 2020-08-01T13:15:30.993123 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+----------+----------------------------+
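
A minimal command sketch for reproducing this check, on the assumption that the admin credentials are sourced from /etc/platform/openrc and that controller-0 is the active controller at the time (adjust the hostname as needed):

source /etc/platform/openrc

# Swact services away from the active controller
system host-swact controller-0

# Once the swact completes, list the active alarms (short form of the fm command above)
fm alarm-list --nowrap --uuid

# Check ceph health from the newly active controller
ceph -s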

Expected Behavior
------------------
swact should be successful without any errors

Actual Behavior
----------------
Ceph OSDs are down after the swact operation.

Reproducibility
---------------
50%

System Configuration
--------------------
WP_13_14 ipv6

Branch/Pull Time/Commit
-----------------------
2020-07-31

Last Pass
---------
2020-07-25_00-00-00

Timestamp/Logs
--------------
2020-08-01T13:15:30.993123

Test Activity
-------------
Automation

Workaround
----------
Haven't found any

Ghada Khalil (gkhalil)
tags: added: stx.storage
ayyappa (mantri425)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - ceph appears to be down after a swact. The current assumption is that this is fairly intermittent since we don't see this reported in other sanity/regression runs.

tags: added: stx.5.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Bob Church (rchurch)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Andrei Grosu (agrosu1)
Revision history for this message
Frank Miller (sensfan22) wrote :

This issue has not been seen recently and was only reported once. If the frequency of this issue increases, please open a new LP with a recent load.

Changed in starlingx:
status: Triaged → Won't Fix
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded