alarm 800.001 raised on lock storage-0 and not cleared when storage-0 unlocks

Bug #1844164 reported by Wendy Mitchell
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Low
Assigned to: chen haochuan
Milestone:

Bug Description

Brief Description
-----------------
Lock then unlock of storage-0 returned storage-0 to unlocked as expected, but the 800.001 alarm did not clear:
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c | warning | 2019-09-14T23:06:47.386732 |

Severity
--------
standard

Steps to Reproduce
------------------

step 1.
confirm with ceph -s that all OSDs are up and in
@ [2019-09-14 23:01:19,546] all OSDs are up and in according to the ceph -s output:
 osd: 6 osds: 6 up, 6 in
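A hypothetical standalone check for this step (illustrative only, not part of the test framework; it simply parses the 'osd:' line of ceph -s, whose exact wording can vary between Ceph releases):

import re
import subprocess

def all_osds_up_and_in():
    out = subprocess.run(["ceph", "-s"], capture_output=True, text=True).stdout
    # Matches e.g. "osd: 6 osds: 6 up, 6 in"
    m = re.search(r"osd:\s*(\d+) osds:\s*(\d+) up.*?(\d+) in", out)
    if not m:
        return False
    total, up, num_in = map(int, m.groups())
    return total == up == num_in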

step 2.
lock storage-0

[2019-09-14 23:01:22,925] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock storage-0'

Verify the Ceph cluster health reflects the OSDs on storage-0 being down:

ceph -s @ [2019-09-14 23:01:43,527]
  cluster:
    id: 6231df84-33be-4aa4-82ea-7408e0f2421c
    health: HEALTH_WARN
            3 osds down
            1 host (3 osds) down
            Reduced data availability: 3 pgs inactive, 9 pgs peering
            too few PGs per OSD (19 < min 30)
            1/3 mons down, quorum controller-0,controller-1

| 2deecf77-765f-4749-a6c4-b7d8160c91b0 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c.peergroup=group-0.host=storage-0 | major | 2019-09-14T23:01:43.581914

 Test Step 3: Check that alarms are raised when storage-0 is locked
 Test Step 4: Check that OSDs are down
 Test Step 5: Check that the loss of replication alarm is raised
 Test Step 6: Check that the ceph health warning alarm is raised
 Test Step 7: Attempt to lock the controller-0
 Test Step 8: Attempt to force lock controller-0
 Test Step 9: Attempt to lock the controller-1
Only 2 storage monitor available. At least 2 unlocked and enabled hosts with monitors are required. Please ensure hosts with monitors are unlocked and enabled.
Test Step 10: Attempt to force lock controller-1
Only 2 storage monitor available. At least 2 unlocked and enabled hosts with monitors are required. Please ensure hosts with monitors are unlocked and enabled.

Test Step 11: Unlock storage host storage-0 at this time:

[2019-09-14 23:02:10,244] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock storage-0'

see corresponding sysinv.log
2019-09-14 23:02:14.024 106806 WARNING sysinv.api.controllers.v1.vim_api [-] vim_host_action hostname=storage-0, action=unlock

Test Step 12: Confirm services enabled, host unlocked
system host-show storage-0 indicates the host is now unlocked, available
| action | none |
| administrative | unlocked |
| availability | available
| config_applied | e60c1692-46d0-4f17-95cb-199ff02a6c60
| boot_device | /dev/disk/by-path/pci-0000:85:00.0-nvme-
| peers | {u'hosts': [u'storage-0', u'storage-1'], u'name': u'group-0'}
| operational | enabled
| vim_progress_status | services-enabled

Expected Behavior
------------------
The 800.001 alarm that was raised when storage-0 was locked should have cleared after storage-0 was unlocked.

Actual Behavior
----------------

The 800.001 alarm remained raised and did not clear; see @ 2019-09-14T23:06:47.

see fm-event.log
2019-09-14T23:06:47.000 controller-0 fmManager: info { "event_log_id" : "800.001", "reason_text" : "Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-113-121.cluster=6231df84-33be-4aa4-82ea-7408e0f2421c", "severity" : "warning", "state" : "set", "timestamp" : "2019-09-14 23:06:47.386732" }
2019-09-14T23:06:47.000 controller-0 fmManager: info { "event_log_id" : "800.011", "reason_text" : "Loss of replication in replication group group-0: OSDs are down", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-113-121.cluster=6231df84-33be-4aa4-82ea-7408e0f2421c.peergroup=group-0.host=storage-0", "severity" : "major", "state" : "clear", "timestamp" : "2019-09-14 23:06:47.724669"

The alarm was still there @ [2019-09-14 23:09:39,109]

[2019-09-14 23:09:39,109] fm alarm-list

| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| f5496afd-0703-4065-bcd0-1ee78ab6753b | 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c | warning | 2019-09-14T23:06:47.386732

Reproducibility
---------------
weekly regression

System Configuration
--------------------
storage system WCP_113_121

Lab-name:

Branch/Pull Time/Commit
-----------------------
2019-09-13_20-09-52

Last Pass
---------

Timestamp/Logs
--------------
see inline

Test Activity
-------------
storage regression

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

This caused the following testcase to fail: test_ceph.py::test_lock_cont_check_mon_down (and also the next testcase; see below)
started at [2019-09-14 22:51:52,679]

FAIL ceph/test_ceph.py::test_lock_cont_check_mon_down
E Details: Timed out waiting for alarm 800.001 to disappear [2019-09-14 23:00:55,360]
FAIL ceph/test_ceph.py::test_storgroup_semantic_checks
E Details: Timed out waiting for alarm 800.001 to disappear [2019-09-14 23:09:17,136] 338
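For reference, the wait the tests time out on is essentially a poll for the alarm to clear. A standalone sketch of that wait (illustrative only; the real tests drive fm over ssh through the test framework, and the helper name and timeouts here are assumptions):

import subprocess
import time

def wait_for_alarm_clear(alarm_id="800.001", timeout=600, interval=30):
    # Poll 'fm alarm-list' until the given alarm id disappears or we time out.
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(["fm", "alarm-list"],
                             capture_output=True, text=True).stdout
        if alarm_id not in out:
            return True
        time.sleep(interval)
    raise TimeoutError("Timed out waiting for alarm %s to disappear" % alarm_id)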

tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to the storage team to triage and make a recommendation regarding importance/release target.

tags: added: stx.storage
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Revision history for this message
Cindy Xie (xxie1) wrote :

Is this bug similar to bug 1840267?

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
Ghada Khalil (gkhalil)
tags: added: stx.3.0
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Logs attached as requested

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Changed in starlingx:
assignee: chen haochuan (martin1982) → nobody
yong hu (yhu6)
Changed in starlingx:
assignee: nobody → Linjia Chang (linjiachang)
Revision history for this message
chen haochuan (martin1982) wrote :

Not reproduced with image 20191007.

Revision history for this message
Linjia Chang (linjiachang) wrote :

What timeout does the test use?
We tried testing with image 2019-09-13_20-09-52 with 4 OSDs and did not reproduce the issue.
From the time storage-0 showed unlocked until the 800.001 and 800.011 alarms disappeared, it took about 2 minutes (17:16:16 to 17:18:38).

Will try 6 OSDs next.

Revision history for this message
chen haochuan (martin1982) wrote :

This alarm is not caused by host-lock and host-unlock. Before these two actions, there was already an alarm with alarm id 800.001.

log excerpted from controller-0/var/log/ceph-manager.log
2019-09-14 14:34:47.891 102577 INFO ceph_manager.monitor [-] Current alarm: {'alarm_state': 'set', 'service_affecting': 'False', 'severity': 'warning', 'alarm_type': 'equipment', 'timestamp': '2019-09-14 14:29:44.697974', 'entity_type_id': u'cluster', 'probable_cause': 'equipment-malfunction', 'alarm_id': '800.001', 'proposed_repair_action': u'If problem persists, contact next level of support.', 'reason_text': u"Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details.", 'suppression': 'False', 'uuid': '260c9ed4-ad4a-40b4-9f59-dcb3393db06a', 'entity_instance_id': u'cluster=6231df84-33be-4aa4-82ea-7408e0f2421c'}
2019-09-14 14:34:47.891 102577 INFO ceph_client [-] Request params: url=https://controller-0:5001/request?wait=1, json={'prefix': 'health', 'format': 'text'}
2019-09-14 14:34:47.907 102577 INFO ceph_client [-] Result: {u'waiting': [], u'has_failed': False, u'state': u'success', u'is_waiting': False, u'running': [], u'failed': [], u'finished': [{u'outb': u'HEALTH_WARN too few PGs per OSD (21 < min 30)\n', u'outs': u'', u'command': u'health format=text'}], u'is_finished': True, u'id': u'139677338605200'}
2019-09-14 14:34:47.920 102577 INFO ceph_manager.monitor [-] Current Ceph health: HEALTH_WARN detail: too few PGs per OSD (21 < min 30)
2019-09-14 14:34:47.961 102577 INFO ceph_manager.monitor [-] Created storage alarm 5d0e07c0-fa05-42d8-a157-d6db156e3983 - severity: warning, reason: Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details., service_affecting: False
2019-09-14 14:34:47.962 102577 INFO ceph_manager.monitor [-] Ceph status changed: HEALTH_WARN detailed reason: too few PGs per OSD (21 < min 30)

This alarm can be seen even as early as when storage node provisioning is done.

The root cause is that the current default pg_num setting is not appropriate for deployments with 3, 6, 9 or 12 OSDs.

Similar cases:
https://forum.proxmox.com/threads/too-few-pgs-per-osd-21-min-30.50299/
https://ekuric.wordpress.com/2016/02/09/increase-number-of-pgpgp-for-ceph-cluster-ceph-error-too-few-pgs-per-osd/
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018573.html

We should study the sysinv-conductor settings for pg_num and pgp_num in ceph.py.

Revision history for this message
Linjia Chang (linjiachang) wrote :

In this env, there are 128 PGs in total and 6 OSDs, so the warning persists the entire time.

pg_num is set as follows:
        if data_pt and osds:
            # [(Target PGs per OSD) * (# OSD) * (% Data)] / Size
            target_pg_num_raw = ((osds * constants.CEPH_TARGET_PGS_PER_OSD * data_pt / 100) /
                                 replication)
            # find next highest power of 2 via shift bit length
            target_pg_num = 1 << (int(target_pg_num_raw) - 1).bit_length()

Where constants.CEPH_TARGET_PGS_PER_OSD = 200 and data_pt could be 10%, 20%, 28%, 27%, 35% or 80% depending on the pool type, so it is possible that as the number of OSDs grows the number of PGs per OSD drops below 30.

We should either (1) change how this is calculated, or (2) add a check at the end to verify it meets the minimum requirement of 30 PGs per OSD. A rough sketch of option (2) follows.
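A minimal sketch of option (2), purely for illustration (this is not the actual sysinv change; the helper names are made up, and it assumes the warning compares roughly total PGs divided by OSD count against a minimum of 30, which appears to line up with 128 PGs / 6 OSDs ~= 21 in the logs above):

CEPH_TARGET_PGS_PER_OSD = 200   # value quoted above
MIN_PGS_PER_OSD = 30            # the "min 30" from the health warning

def target_pg_num(osds, data_pt, replication):
    # Existing formula: [(Target PGs per OSD) * (# OSD) * (% Data)] / Size
    raw = (osds * CEPH_TARGET_PGS_PER_OSD * data_pt / 100.0) / replication
    # next highest power of 2 via bit length, as in the snippet above
    return 1 << (int(raw) - 1).bit_length()

def meets_min_pgs_per_osd(pool_pg_nums, osds, minimum=MIN_PGS_PER_OSD):
    return sum(pool_pg_nums) / float(osds) >= minimum

# Hypothetical single pool holding 20% of the data on the 6-OSD lab:
pg_num = target_pg_num(osds=6, data_pt=20, replication=2)  # -> 128
print(meets_min_pgs_per_osd([pg_num], osds=6))             # False, so 800.001 stays raised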

Changed in starlingx:
assignee: Linjia Chang (linjiachang) → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

controller-0:/opt/platform$ sudo grep "chunk_size" ./* --color -r -n
./helm/19.09/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml:14: chunk_size: 64
./helm/19.09/stx-openstack/1.0-19/openstack-nova.yaml:11: - rbd_chunk_size: 512
./helm/19.09/stx-openstack/1.0-19/openstack-cinder.yaml:24: chunk_size: 8
./helm/19.09/stx-openstack/1.0-19/openstack-cinder.yaml:29: chunk_size: 8

Currently the chunk_size for all Ceph pools is hardcoded:

stx/config/sysinv/sysinv/sysinv/sysinv/helm/rbd_provisioner.py
                "chunk_size": 64,

stx/config/sysinv/sysinv/sysinv/sysinv/helm/nova.py
            'rbd_chunk_size': constants.CEPH_POOL_EPHEMERAL_PG_NUM

stx/config/sysinv/sysinv/sysinv/sysinv/helm/cinder.py
                'chunk_size': constants.CEPH_POOL_VOLUMES_CHUNK_SIZE,

Based on the guidance in the Ceph documentation:
https://docs.ceph.com/docs/master/rados/operations/placement-groups/#placement-groups-tradeoffs

this chunk size should be calculated from the application usage, OSD count and replica count instead of being hardcoded. A rough sketch follows.
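A minimal sketch of what such a calculation could look like (hypothetical helper, not the sysinv/helm plugin code; the 200 PGs-per-OSD target and the floor of 32 are assumed defaults):

def pool_chunk_size(osds, replication, data_pt, target_per_osd=200, floor=32):
    # Same target-PG formula as in ceph.py above, applied per pool instead of
    # returning a hardcoded 64 / 512 / 8.
    raw = (osds * target_per_osd * data_pt / 100.0) / replication
    pg_num = 1 << max(int(raw) - 1, 0).bit_length()
    return max(pg_num, floor)  # keep small deployments above the warning level

# e.g. a 6-OSD, replication-2 cluster, pool expected to hold 20% of the data:
print(pool_chunk_size(6, 2, 20))  # 128 instead of the hardcoded 64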

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/697625

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving unresolved medium priority bugs (< 100 days OR recently reproduced) from stx.3.0 to stx.4.0

tags: added: stx.4.0
removed: stx.3.0
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Lowering the priority as this is a minor issue (stale alarm). This was also reported in previous stx releases and, therefore, will not hold up stx.4.0.

tags: removed: stx.4.0
Changed in starlingx:
importance: Medium → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by Bart Wensley (<email address hidden>) on branch: master
Review: https://review.opendev.org/697625
Reason: This patch has been idle for six months. I am abandoning it to keep the review queue sane. If you are still interested in working on this patch, please unabandon it and upload a new patchset.

Revision history for this message
Ramaswamy Subramanian (rsubrama) wrote :

No progress on this bug for more than 2 years. Candidate for closure.

If there is no update, this issue is targeted to be closed as 'Won't Fix' in 2 weeks.

Revision history for this message
Ramaswamy Subramanian (rsubrama) wrote :

Changing the status to 'Won't Fix' as there is no activity.

Changed in starlingx:
status: In Progress → Won't Fix