Ceph storage condition alarm was not cleared after lock and unlock controller and swact

Bug #1892907 reported by Anujeyan Manokeran
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Medium
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
During an automation run, while configuring a PTP interface over a dedicated interface, the alarms below were displayed. These alarms were generated during the lock and unlock of controller-0, followed by a swact of controller-0. Afterwards, the "Ceph Storage Alarm Condition: health warn" alarm was not cleared.

fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap'
[2020-08-24 05:27:18,788] 436 DEBUG MainThread ssh.expect :: Output:
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 700.016 | Multi-Node Recovery Mode | subsystem=vim | major | 2020-08-24T05:26:13.657625 |
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=5df66a40-7bf8-46a7-8e2b-d5e0cda289b2 | warning | 2020-08-24T05:16:07.536024 |
| 750.006 | A configuration change requires a reapply of the cert-manager application. | k8s_application=cert-manager | warning | 2020-08-24T05:15:03.793042 |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+

[2020-08-24 05:35:56,206] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap'
[2020-08-24 05:35:58,214] 436 DEBUG MainThread ssh.expect :: Output:
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=5df66a40-7bf8-46a7-8e2b-d5e0cda289b2 | warning | 2020-08-24T05:33:27.691966 |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
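
To confirm whether the 800.001 alarm eventually clears, a polling loop along the following lines can be used (a sketch only; it assumes admin credentials are sourced from /etc/platform/openrc rather than the inline --os-* flags the automation passes):

source /etc/platform/openrc          # assumed location of admin credentials
# Poll the FM alarm list until the Ceph 800.001 alarm clears.
while fm alarm-list --nowrap | grep -q '800.001'; do
    echo "800.001 still raised; current Ceph health:"
    ceph -s | grep -A1 'health:'     # show the HEALTH_WARN detail
    sleep 60
done
echo "800.001 cleared"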

Steps to Reproduce
------------------

1. Initial health condition: no alarms present.
2. Configure PTP on the system (a condensed command sketch of steps 2-8 follows this list).
3. Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne ptp-show'
[2020-08-24 05:08:47,810] 436 DEBUG MainThread ssh.expect :: Output:
+--------------+--------------------------------------+
| Property | Value |
+--------------+--------------------------------------+
| uuid | 42dd70b4-facf-4e3b-8c6b-5b58f4289f5e |
| mode | hardware |
| transport | l2 |
| mechanism | p2p |
| isystem_uuid | c88683bf-42fc-4907-8658-50e2084c5f11 |
| created_at | 2020-08-23T19:27:24.954757+00:00 |
| updated_at | 2020-08-24T03:50:41.607423+00:00 |
+--------------+--------------------------------------+
'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne service-parameter-add ptp global delay_mechanism=p2p'
[2020-08-24 05:09:04,802] 436 DEBUG MainThread ssh.expect :: Output:
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| uuid | 25696c3b-4e72-495a-9375-f63ccb736ea0 |
| service | ptp |
| section | global |
| name | delay_mechanism |
| value | p2p |
| personality | None |
| resource | None |
+-------------+--------------------------------------+

system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne service-parameter-add ptp global domainNumber=24'
[2020-08-24 05:09:12,100] 436 DEBUG MainThread ssh.expect :: Output:
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| uuid | bc44edaf-8ca0-4e46-a01c-675ff444c57a |
| service | ptp |
| section | global |
| name | domainNumber |
| value | 24 |
| personality | None |
| resource | None |
+-------------+--------------------------------------+
4. Lock and unlock controller-0
'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'
     system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne storage-backend-list'
[2020-08-24 05:10:49,749] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------+---------+------------+-------------------+----------+----------------------------------------------------------------------+
| uuid | name | backend | state | task | services | capabilities |
+--------------------------------------+------------+---------+------------+-------------------+----------+----------------------------------------------------------------------+
| f30c1cc8-356a-4d9f-81a9-45690c36b177 | ceph-store | ceph | configured | provision-storage | None | min_replication: 1 replication: 2 |
| | | | | | | |
+--------------------------------------+------------+---------+------------+-------------------+----------+----------------------------------------------------------------------+
5. system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-if-modify controller-0 mgmt0 --ptp-role none'
6. 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-if-modify controller-0 ptp0 --ptp-role slave'
7. Lock and unlock compute-1
'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-1'

8. system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'
[2020-08-24 05:25:59,871] 436 DEBUG MainThread ssh.expect :: Output:
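
Condensed, the reproduction sequence above amounts to the following (a sketch only; it assumes admin credentials are sourced from /etc/platform/openrc in place of the long --os-* flag sets logged by the automation, and 'service-parameter-apply' is added as the usual step to push the PTP parameters, although it does not appear in the original log):

source /etc/platform/openrc
# Steps 2-3: PTP service configuration
system ptp-show                                           # expect mode=hardware, transport=l2, mechanism=p2p
system service-parameter-add ptp global delay_mechanism=p2p
system service-parameter-add ptp global domainNumber=24
system service-parameter-apply ptp                        # assumed follow-up; not shown in the log above
# Steps 4-6: lock controller-0, move the PTP role to the dedicated interface, unlock
system host-lock controller-0
system host-if-modify controller-0 mgmt0 --ptp-role none
system host-if-modify controller-0 ptp0 --ptp-role slave
system host-unlock controller-0
# Step 7: lock and unlock compute-1
system host-lock compute-1
system host-unlock compute-1
# Step 8: swact, then check whether the 800.001 alarm clears
system host-swact controller-1
fm alarm-list --nowrap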

System Configuration
--------------------
regular system WCP7-10

Expected Behavior
------------------
The alarm should be cleared after the lock and unlock.
Actual Behavior
----------------
As described above, the alarms are not cleared.
Reproducibility
---------------
100% reproducible on WCP7-10.

Load
----

2020-08-22_20-00-00

Last Pass
---------
2020-07-31_20-00-00 in WCP7-10

Timestamp/Logs
--------------
fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap'
[2020-08-24 05:27:18,788] 436 DEBUG MainThread ssh.expect :: Output:
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 700.016 | Multi-Node Recovery Mode | subsystem=vim | major | 2020-08-24T05:26:13.657625 |
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=5df66a40-7bf8-46a7-8e2b-d5e0cda289b2 | warning | 2020-08-24T05:16:07.536024 |
| 750.006 | A configuration change requires a reapply of the cert-manager application. | k8s_application=cert-manager | warning | 2020-08-24T05:15:03.793042 |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
Test Activity
-------------
Automated regression

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - as per Yang, the issue seems reproducible on a system configured with PTP

summary: - Ceph starrage condition alarm was not cleared after lock and unlock
+ Ceph storage condition alarm was not cleared after lock and unlock
controller and swact
tags: added: stx.5.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Elena Taivan (etaivan)
Revision history for this message
Yang Liu (yliu12) wrote :

This is reproducible on the system with PTP configured (wcp7-10) after locking/unlocking a host, such as the standby controller.

I tried this manually once, and the alarm cleared after 30+ minutes; a clock skew was reported in 'ceph -s'.
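
The clock-skew observation can be checked directly (a sketch; these are stock Ceph monitor and NTP commands, not taken from the original logs):

ceph health detail      # expands HEALTH_WARN, including any clock-skew detail
ceph time-sync-status   # per-monitor time-sync/skew report
sudo ntpq -pn           # confirm NTP peers are reachable and offsets are sane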

Revision history for this message
Difu Hu (difuhu) wrote :

A similar issue was reproduced on load 2020-08-26_00-00-00.

Steps:
lock controller-0
remove sriov vf interface on host controller-0
remove sriov data network
unlock controller-0

+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 5e1a71f2-e4b1-41b6-8641-6223a406ec73 | 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=4750164e-1cbe-4e56-ba82-e56389f3918b | warning | 2020-08-27T07:07:36.319531 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+

[2020-08-27 07:09:14,677] 314 DEBUG MainThread ssh.send :: Send 'sudo ntpq -pn'
     remote refid st t when poll reach delay offset jitter
==============================================================================
 abcd:204::3
                 .XFAC. 16 u - 64 0 0.000 0.000 0.000
 64:ff9b::c11d:3f96
                 142.66.101.13 2 u 51 64 7 16.788 0.022 0.354
 64:ff9b::2d4c:f4c1
                 216.239.35.8 2 u 49 64 7 63.436 -2.186 1.248
 64:ff9b::4a06:a848
                 208.71.46.33 2 u 46 64 7 64.818 0.403 3.358

[2020-08-27 07:08:53,582] 314 DEBUG MainThread ssh.send :: Send 'ceph -s'
  cluster:
    id: 4750164e-1cbe-4e56-ba82-e56389f3918b
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive

  services:
    mon: 1 daemons, quorum controller-0
    mgr: controller-0(active)
    osd: 1 osds: 1 up, 1 in

  data:
    pools: 1 pools, 64 pgs
    objects: 0 objects, 0 B
    usage: 107 MiB used, 930 GiB / 930 GiB avail
    pgs: 100.000% pgs unknown
             64 unknown
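
To drill into the unknown/inactive PGs shown above, stock Ceph commands can be used (a sketch; none of these appear in the original log):

ceph health detail             # expand the HEALTH_WARN reason text
ceph pg dump_stuck inactive    # list the PGs stuck inactive
ceph osd tree                  # confirm the single OSD is up and in
ceph mon stat                  # check the controller-0 monitor quorum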

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced in build 2020-09-18_20-00-08.

description: updated
description: updated
Bob Church (rchurch)
Changed in starlingx:
assignee: Elena Taivan (etaivan) → Bob Church (rchurch)
Revision history for this message
Frank Miller (sensfan22) wrote :

This issue has not been seen recently and was only reported the one time. If the frequency of this issue increases, please open a new LP with a recent load.

Changed in starlingx:
status: Triaged → Won't Fix
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded