Ceph storage condition alarm was not cleared after lock and unlock controller and swact

Bug #1892907 reported by Anujeyan Manokeran
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Medium
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
During an automation run, while configuring a PTP interface over a dedicated interface, the alarms below were displayed. These alarms were generated during the lock and unlock of controller-0, followed by a swact of controller-0. Afterwards, the "Ceph Storage Alarm Condition: health warn" alarm was not cleared.

fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap'
[2020-08-24 05:27:18,788] 436 DEBUG MainThread ssh.expect :: Output:
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 700.016 | Multi-Node Recovery Mode | subsystem=vim | major | 2020-08-24T05:26:13.657625 |
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=5df66a40-7bf8-46a7-8e2b-d5e0cda289b2 | warning | 2020-08-24T05:16:07.536024 |
| 750.006 | A configuration change requires a reapply of the cert-manager application. | k8s_application=cert-manager | warning | 2020-08-24T05:15:03.793042 |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+

[2020-08-24 05:35:56,206] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap'
[2020-08-24 05:35:58,214] 436 DEBUG MainThread ssh.expect :: Output:
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=5df66a40-7bf8-46a7-8e2b-d5e0cda289b2 | warning | 2020-08-24T05:33:27.691966 |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
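
To confirm whether the 800.001 alarm eventually clears, a polling loop along the following lines can be used (a sketch only; it assumes admin credentials are sourced from /etc/platform/openrc rather than the inline --os-* flags the automation passes):

source /etc/platform/openrc          # assumed location of admin credentials
# Poll the FM alarm list until the Ceph 800.001 alarm clears.
while fm alarm-list --nowrap | grep -q '800.001'; do
    echo "800.001 still raised; current Ceph health:"
    ceph -s | grep -A1 'health:'     # show the HEALTH_WARN detail
    sleep 60
done
echo "800.001 cleared"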

Steps to Reproduce
------------------

1. Initial health condition: no alarms present.
2. Configure PTP on the system (a condensed command sketch of steps 2-8 follows this list).
3. Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne ptp-show'
[2020-08-24 05:08:47,810] 436 DEBUG MainThread ssh.expect :: Output:
+--------------+--------------------------------------+
| Property | Value |
+--------------+--------------------------------------+
| uuid | 42dd70b4-facf-4e3b-8c6b-5b58f4289f5e |
| mode | hardware |
| transport | l2 |
| mechanism | p2p |
| isystem_uuid | c88683bf-42fc-4907-8658-50e2084c5f11 |
| created_at | 2020-08-23T19:27:24.954757+00:00 |
| updated_at | 2020-08-24T03:50:41.607423+00:00 |
+--------------+--------------------------------------+
'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne service-parameter-add ptp global delay_mechanism=p2p'
[2020-08-24 05:09:04,802] 436 DEBUG MainThread ssh.expect :: Output:
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| uuid | 25696c3b-4e72-495a-9375-f63ccb736ea0 |
| service | ptp |
| section | global |
| name | delay_mechanism |
| value | p2p |
| personality | None |
| resource | None |
+-------------+--------------------------------------+

system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne service-parameter-add ptp global domainNumber=24'
[2020-08-24 05:09:12,100] 436 DEBUG MainThread ssh.expect :: Output:
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| uuid | bc44edaf-8ca0-4e46-a01c-675ff444c57a |
| service | ptp |
| section | global |
| name | domainNumber |
| value | 24 |
| personality | None |
| resource | None |
+-------------+--------------------------------------+
4. Lock and unlock controller-0
'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'
     system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne storage-backend-list'
[2020-08-24 05:10:49,749] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------+---------+------------+-------------------+----------+----------------------------------------------------------------------+
| uuid | name | backend | state | task | services | capabilities |
+--------------------------------------+------------+---------+------------+-------------------+----------+----------------------------------------------------------------------+
| f30c1cc8-356a-4d9f-81a9-45690c36b177 | ceph-store | ceph | configured | provision-storage | None | min_replication: 1 replication: 2 |
| | | | | | | |
+--------------------------------------+------------+---------+------------+-------------------+----------+----------------------------------------------------------------------+
5. system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-if-modify controller-0 mgmt0 --ptp-role none'
6. 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-if-modify controller-0 ptp0 --ptp-role slave'
7. Lock and unlock compute-1
'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-1'

8. system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'
[2020-08-24 05:25:59,871] 436 DEBUG MainThread ssh.expect :: Output:
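
Condensed, the reproduction sequence above amounts to the following (a sketch only; it assumes admin credentials are sourced from /etc/platform/openrc in place of the long --os-* flag sets logged by the automation, and 'service-parameter-apply' is added as the usual step to push the PTP parameters, although it does not appear in the original log):

source /etc/platform/openrc
# Steps 2-3: PTP service configuration
system ptp-show                                           # expect mode=hardware, transport=l2, mechanism=p2p
system service-parameter-add ptp global delay_mechanism=p2p
system service-parameter-add ptp global domainNumber=24
system service-parameter-apply ptp                        # assumed follow-up; not shown in the log above
# Steps 4-6: lock controller-0, move the PTP role to the dedicated interface, unlock
system host-lock controller-0
system host-if-modify controller-0 mgmt0 --ptp-role none
system host-if-modify controller-0 ptp0 --ptp-role slave
system host-unlock controller-0
# Step 7: lock and unlock compute-1
system host-lock compute-1
system host-unlock compute-1
# Step 8: swact, then check whether the 800.001 alarm clears
system host-swact controller-1
fm alarm-list --nowrap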

System Configuration
--------------------
regular system WCP7-10

Expected Behavior
------------------
The alarm should be cleared after the lock and unlock.
Actual Behavior
----------------
As described above, the alarms are not cleared.
Reproducibility
---------------
100% reproducible on WCP7-10.

Load
----

2020-08-22_20-00-00

Last Pass
---------
2020-07-31_20-00-00 in WCP7-10

Timestamp/Logs
--------------
fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap'
[2020-08-24 05:27:18,788] 436 DEBUG MainThread ssh.expect :: Output:
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 700.016 | Multi-Node Recovery Mode | subsystem=vim | major | 2020-08-24T05:26:13.657625 |
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=5df66a40-7bf8-46a7-8e2b-d5e0cda289b2 | warning | 2020-08-24T05:16:07.536024 |
| 750.006 | A configuration change requires a reapply of the cert-manager application. | k8s_application=cert-manager | warning | 2020-08-24T05:15:03.793042 |
+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
Test Activity
-------------
Automated regression

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - as per Yang, the issue seems reproducible on a system configured with PTP

summary: - Ceph starrage condition alarm was not cleared after lock and unlock
+ Ceph storage condition alarm was not cleared after lock and unlock
controller and swact
tags: added: stx.5.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Elena Taivan (etaivan)
Revision history for this message
Yang Liu (yliu12) wrote :

This is reproducible on the system with PTP configured (wcp7-10) after locking/unlocking a host, such as the standby controller.

I tried this manually once, and the alarm cleared after 30+ minutes; a clock skew was reported in 'ceph -s'.
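
The clock-skew observation can be checked directly (a sketch; these are stock Ceph monitor and NTP commands, not taken from the original logs):

ceph health detail      # expands HEALTH_WARN, including any clock-skew detail
ceph time-sync-status   # per-monitor time-sync/skew report
sudo ntpq -pn           # confirm NTP peers are reachable and offsets are sane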

Revision history for this message
Difu Hu (difuhu) wrote :

A similar issue was reproduced on load 2020-08-26_00-00-00.

Steps:
lock controller-0
remove sriov vf interface on host controller-0
remove sriov data network
unlock controller-0

+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 5e1a71f2-e4b1-41b6-8641-6223a406ec73 | 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=4750164e-1cbe-4e56-ba82-e56389f3918b | warning | 2020-08-27T07:07:36.319531 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+

[2020-08-27 07:09:14,677] 314 DEBUG MainThread ssh.send :: Send 'sudo ntpq -pn'
     remote refid st t when poll reach delay offset jitter
==============================================================================
 abcd:204::3
                 .XFAC. 16 u - 64 0 0.000 0.000 0.000
 64:ff9b::c11d:3f96
                 142.66.101.13 2 u 51 64 7 16.788 0.022 0.354
 64:ff9b::2d4c:f4c1
                 216.239.35.8 2 u 49 64 7 63.436 -2.186 1.248
 64:ff9b::4a06:a848
                 208.71.46.33 2 u 46 64 7 64.818 0.403 3.358

[2020-08-27 07:08:53,582] 314 DEBUG MainThread ssh.send :: Send 'ceph -s'
  cluster:
    id: 4750164e-1cbe-4e56-ba82-e56389f3918b
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive

  services:
    mon: 1 daemons, quorum controller-0
    mgr: controller-0(active)
    osd: 1 osds: 1 up, 1 in

  data:
    pools: 1 pools, 64 pgs
    objects: 0 objects, 0 B
    usage: 107 MiB used, 930 GiB / 930 GiB avail
    pgs: 100.000% pgs unknown
             64 unknown
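
To drill into the unknown/inactive PGs shown above, stock Ceph commands can be used (a sketch; none of these appear in the original log):

ceph health detail             # expand the HEALTH_WARN reason text
ceph pg dump_stuck inactive    # list the PGs stuck inactive
ceph osd tree                  # confirm the single OSD is up and in
ceph mon stat                  # check the controller-0 monitor quorum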

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced in build 2020-09-18_20-00-08.

description: updated
description: updated
Bob Church (rchurch)
Changed in starlingx:
assignee: Elena Taivan (etaivan) → Bob Church (rchurch)
Revision history for this message
Frank Miller (sensfan22) wrote :

This issue has not been seen recently and was only reported the one time. If the frequency of this issue increases, please open a new LP with a recent load.

Changed in starlingx:
status: Triaged → Won't Fix
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded