alarm 800.001 raised on lock storage-0 and not cleared when storage-0 unlocks

Bug #1844164 reported by Wendy Mitchell on 2019-09-16
This bug affects 1 person
Affects: StarlingX
Status: In Progress
Importance: Medium
Assigned to: chen haochuan

Bug Description

Brief Description
-----------------
Lock and unlock operations on storage-0 unlocked storage-0 as expected, but the 800.001 alarm did not clear:
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c | warning | 2019-09-14T23:06:47.386732 |

Severity
--------
standard

Steps to Reproduce
------------------

step 1.
Confirm with 'ceph -s' that all OSDs are up and in.
@ [2019-09-14 23:01:19,546] all OSDs are up and in according to 'ceph -s' output:
 osd: 6 osds: 6 up, 6 in

step 2.
lock storage-0

[2019-09-14 23:01:22,925] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock storage-0'

Verify that the Ceph cluster health reflects the OSDs being down.

ceph -s @ [2019-09-14 23:01:43,527]
  cluster:
    id: 6231df84-33be-4aa4-82ea-7408e0f2421c
    health: HEALTH_WARN
            3 osds down
            1 host (3 osds) down
            Reduced data availability: 3 pgs inactive, 9 pgs peering
            too few PGs per OSD (19 < min 30)
            1/3 mons down, quorum controller-0,controller-1

| 2deecf77-765f-4749-a6c4-b7d8160c91b0 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c.peergroup=group-0.host=storage-0 | major | 2019-09-14T23:01:43.581914

Test Step 3: Check that alarms are raised when storage-0 is locked
Test Step 4: Check that the OSDs are down
Test Step 5: Check that the loss of replication alarm is raised
Test Step 6: Check that the ceph health warning alarm is raised
Test Step 7: Attempt to lock controller-0
Test Step 8: Attempt to force lock controller-0
Test Step 9: Attempt to lock controller-1
Rejected with: Only 2 storage monitor available. At least 2 unlocked and enabled hosts with monitors are required. Please ensure hosts with monitors are unlocked and enabled.
Test Step 10: Attempt to force lock controller-1
Rejected with: Only 2 storage monitor available. At least 2 unlocked and enabled hosts with monitors are required. Please ensure hosts with monitors are unlocked and enabled.

Test Step 11:
Unlock storage host storage-0 at this time:

[2019-09-14 23:02:10,244] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock storage-0'

See the corresponding sysinv.log:
2019-09-14 23:02:14.024 106806 WARNING sysinv.api.controllers.v1.vim_api [-] vim_host_action hostname=storage-0, action=unlock

Test Step 12: Confirm services enabled, host unlocked
system host-show storage-0 indicates the host is now unlocked and available:
| action              | none                                                          |
| administrative      | unlocked                                                      |
| availability        | available                                                     |
| config_applied      | e60c1692-46d0-4f17-95cb-199ff02a6c60                          |
| boot_device         | /dev/disk/by-path/pci-0000:85:00.0-nvme-                      |
| peers               | {u'hosts': [u'storage-0', u'storage-1'], u'name': u'group-0'} |
| operational         | enabled                                                       |
| vim_progress_status | services-enabled                                              |

Expected Behavior
------------------
The 800.001 alarm that was raised when storage-0 was locked should have cleared after storage-0 was unlocked.

Actual Behavior
----------------

The 800.001 alarm was raised and did not clear; see @ 2019-09-14T23:06:47

see fm-event.log
2019-09-14T23:06:47.000 controller-0 fmManager: info { "event_log_id" : "800.001", "reason_text" : "Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-113-121.cluster=6231df84-33be-4aa4-82ea-7408e0f2421c", "severity" : "warning", "state" : "set", "timestamp" : "2019-09-14 23:06:47.386732" }
2019-09-14T23:06:47.000 controller-0 fmManager: info { "event_log_id" : "800.011", "reason_text" : "Loss of replication in replication group group-0: OSDs are down", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-113-121.cluster=6231df84-33be-4aa4-82ea-7408e0f2421c.peergroup=group-0.host=storage-0", "severity" : "major", "state" : "clear", "timestamp" : "2019-09-14 23:06:47.724669"

The alarm was still there @ [2019-09-14 23:09:39,109]

[2019-09-14 23:09:39,109] fm alarm-list

| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| f5496afd-0703-4065-bcd0-1ee78ab6753b | 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c | warning | 2019-09-14T23:06:47.386732

Reproducibility
---------------
weekly regression

System Configuration
--------------------
storage system WCP_113_121

Lab-name:

Branch/Pull Time/Commit
-----------------------
2019-09-13_20-09-52

Last Pass
---------

Timestamp/Logs
--------------
see inline

Test Activity
-------------
storage regression

Wendy Mitchell (wmitchellwr) wrote :

This caused the following test case to fail: test_ceph.py::test_lock_cont_check_mon_down (and also the next test case; see below)
started at [2019-09-14 22:51:52,679]

FAIL ceph/test_ceph.py::test_lock_cont_check_mon_down
E Details: Timed out waiting for alarm 800.001 to disappear [2019-09-14 23:00:55,360]
FAIL ceph/test_ceph.py::test_storgroup_semantic_checks
E Details: Timed out waiting for alarm 800.001 to disappear [2019-09-14 23:09:17,136] 338

tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Assigning to the storage team to triage and make a recommendation regarding importance/release target.

tags: added: stx.storage
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Cindy Xie (xxie1) wrote :

Is this bug similar to bug 1840267?

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
Ghada Khalil (gkhalil) on 2019-09-23
tags: added: stx.3.0
Wendy Mitchell (wmitchellwr) wrote :

Logs attached as requested

Wendy Mitchell (wmitchellwr) wrote :
Changed in starlingx:
assignee: chen haochuan (martin1982) → nobody
yong hu (yhu6) on 2019-10-21
Changed in starlingx:
assignee: nobody → Linjia Chang (linjiachang)
chen haochuan (martin1982) wrote :

Not reproduced with image 20191007.

Linjia Chang (linjiachang) wrote :

What timeout is used in the test?
We tried testing with image 2019-09-13_20-09-52 with 4 OSDs and did not reproduce the issue.
From the time storage-0 showed unlocked until the 800.001 and 800.011 alarms disappeared took about 2 minutes (from 17:16:16 to 17:18:38).

Will try 6 OSDs next.

chen haochuan (martin1982) wrote :

This alarm is not caused by the host-lock and host-unlock actions; the 800.001 alarm was already present before those two actions.

log excerpted from controller-0/var/log/ceph-manager.log
2019-09-14 14:34:47.891 102577 INFO ceph_manager.monitor [-] Current alarm: {'alarm_state': 'set', 'service_affecting': 'False', 'severity': 'warning', 'alarm_type': 'equipment', 'timestamp': '2019-09-14 14:29:44.697974', 'entity_type_id': u'cluster', 'probable_cause': 'equipment-malfunction', 'alarm_id': '800.001', 'proposed_repair_action': u'If problem persists, contact next level of support.', 'reason_text': u"Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details.", 'suppression': 'False', 'uuid': '260c9ed4-ad4a-40b4-9f59-dcb3393db06a', 'entity_instance_id': u'cluster=6231df84-33be-4aa4-82ea-7408e0f2421c'}
2019-09-14 14:34:47.891 102577 INFO ceph_client [-] Request params: url=https://controller-0:5001/request?wait=1, json={'prefix': 'health', 'format': 'text'}
2019-09-14 14:34:47.907 102577 INFO ceph_client [-] Result: {u'waiting': [], u'has_failed': False, u'state': u'success', u'is_waiting': False, u'running': [], u'failed': [], u'finished': [{u'outb': u'HEALTH_WARN too few PGs per OSD (21 < min 30)\n', u'outs': u'', u'command': u'health format=text'}], u'is_finished': True, u'id': u'139677338605200'}
2019-09-14 14:34:47.920 102577 INFO ceph_manager.monitor [-] Current Ceph health: HEALTH_WARN detail: too few PGs per OSD (21 < min 30)
2019-09-14 14:34:47.961 102577 INFO ceph_manager.monitor [-] Created storage alarm 5d0e07c0-fa05-42d8-a157-d6db156e3983 - severity: warning, reason: Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details., service_affecting: False
2019-09-14 14:34:47.962 102577 INFO ceph_manager.monitor [-] Ceph status changed: HEALTH_WARN detailed reason: too few PGs per OSD (21 < min 30)

This alarm can be found as early as when storage node provisioning is done.

The root cause is that the current default pg_num setting is not appropriate for deployments with 3, 6, 9, or 12 OSDs.

Similar cases:
https://forum.proxmox.com/threads/too-few-pgs-per-osd-21-min-30.50299/
https://ekuric.wordpress.com/2016/02/09/increase-number-of-pgpgp-for-ceph-cluster-ceph-error-too-few-pgs-per-osd/
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018573.html

We should study the sysinv-conductor pg_num and pgp_num settings in ceph.py.
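
A minimal inspection sketch, assuming it is run on a controller node with Ceph admin credentials: it dumps the current pg_num and pgp_num of every pool using only standard ceph CLI commands, so the values produced by ceph.py can be compared against the 30-PGs-per-OSD warning threshold.

import subprocess

def ceph(*args):
    # run a ceph CLI command and return its trimmed stdout
    return subprocess.check_output(('ceph',) + args).decode().strip()

# list every pool, then query its pg_num and pgp_num
for pool in ceph('osd', 'pool', 'ls').splitlines():
    pg_num = ceph('osd', 'pool', 'get', pool, 'pg_num')    # e.g. "pg_num: 64"
    pgp_num = ceph('osd', 'pool', 'get', pool, 'pgp_num')  # e.g. "pgp_num: 64"
    print('%s: %s, %s' % (pool, pg_num, pgp_num))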

Linjia Chang (linjiachang) wrote :

In this environment there are 128 PGs in total and 6 OSDs, so the warning persists the entire time.

pg_num is set as follows:
if data_pt and osds:
    # [(Target PGs per OSD) * (# OSD) * (% Data)] / Size
    target_pg_num_raw = ((osds * constants.CEPH_TARGET_PGS_PER_OSD * data_pt / 100) /
                         replication)
    # find next highest power of 2 via shift bit length
    target_pg_num = 1 << (int(target_pg_num_raw) - 1).bit_length()

where constants.CEPH_TARGET_PGS_PER_OSD = 200 and data_pt could be 10%, 20%, 28%, 27%, 35%, or 80% depending on the pool type. So it is possible that, as the number of OSDs grows, the number of PGs for each OSD drops below 30; in this environment, 128 PGs spread over 6 OSDs averages out to roughly 21 PGs per OSD, which matches the warning in the ceph-manager log above.

We should either (1) change how this is calculated, or (2) add a cap at the end to check that the result meets the minimum of 30 PGs per OSD. A sketch of option 2 follows.
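
A minimal sketch of option 2, assuming the check runs after the existing per-pool calculation; the helper name enforce_min_pgs_per_osd and the example pool split are hypothetical, not existing sysinv code.

# Sketch only: after the per-pool pg_num values have been computed by the
# existing formula, check the cluster-wide average and scale up if it falls
# below Ceph's warning threshold. Doubling keeps each value a power of two.
def enforce_min_pgs_per_osd(pool_pg_nums, osds, min_per_osd=30):
    # pool_pg_nums: pool name -> pg_num from the existing calculation
    while pool_pg_nums and sum(pool_pg_nums.values()) < min_per_osd * osds:
        pool_pg_nums = {pool: pg * 2 for pool, pg in pool_pg_nums.items()}
    return pool_pg_nums

# With this report's numbers (128 PGs total, 6 OSDs) the check doubles each
# pool once: 128 < 30 * 6 = 180, so the total becomes 256.
pools = {'kube-rbd': 64, 'cinder-volumes': 64}  # illustrative split of the 128 PGs
print(enforce_min_pgs_per_osd(pools, osds=6))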

Changed in starlingx:
assignee: Linjia Chang (linjiachang) → chen haochuan (martin1982)
chen haochuan (martin1982) wrote :

controller-0:/opt/platform$ sudo grep "chunk_size" ./* --color -r -n
./helm/19.09/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml:14: chunk_size: 64
./helm/19.09/stx-openstack/1.0-19/openstack-nova.yaml:11: - rbd_chunk_size: 512
./helm/19.09/stx-openstack/1.0-19/openstack-cinder.yaml:24: chunk_size: 8
./helm/19.09/stx-openstack/1.0-19/openstack-cinder.yaml:29: chunk_size: 8

Currently the chunk_size for every Ceph pool is hardcoded:

stx/config/sysinv/sysinv/sysinv/sysinv/helm/rbd_provisioner.py
                "chunk_size": 64,

stx/config/sysinv/sysinv/sysinv/sysinv/helm/nova.py
            'rbd_chunk_size': constants.CEPH_POOL_EPHEMERAL_PG_NUM

stx/config/sysinv/sysinv/sysinv/sysinv/helm/cinder.py
                'chunk_size': constants.CEPH_POOL_VOLUMES_CHUNK_SIZE,

Per the guidance in the Ceph documentation:
https://docs.ceph.com/docs/master/rados/operations/placement-groups/#placement-groups-tradeoffs

this chunk size should be calculated from the application usage, the number of OSDs, and the replication factor.
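
A minimal sketch of that direction, reusing the same formula and power-of-two rounding as ceph.py to derive per-pool values that could feed the 'chunk_size' / 'rbd_chunk_size' helm overrides instead of the fixed 64 / 8 / 512 above; the pool names, data shares, and replication factor are illustrative assumptions, not values taken from sysinv.

# Sketch only: pool names, data shares and replication are assumptions.
TARGET_PGS_PER_OSD = 200  # mirrors constants.CEPH_TARGET_PGS_PER_OSD

def chunk_size_for_pool(osds, data_pt, replication):
    # [(Target PGs per OSD) * (# OSD) * (% Data)] / Size, rounded up to the
    # next power of two, as in sysinv's ceph.py
    raw = (osds * TARGET_PGS_PER_OSD * data_pt / 100.0) / replication
    return 1 << (int(raw) - 1).bit_length()

data_shares = {'kube-rbd': 20, 'cinder-volumes': 35, 'ephemeral': 27}
overrides = {pool: chunk_size_for_pool(osds=6, data_pt=pct, replication=2)
             for pool, pct in data_shares.items()}
print(overrides)  # values like these could replace the hardcoded chunk_size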

Fix proposed to branch: master
Review: https://review.opendev.org/697625

Changed in starlingx:
status: Triaged → In Progress