alarm 800.001 raised on lock storage-0 and not cleared when storage-0 unlocks

Bug #1844164 reported by Wendy Mitchell on 2019-09-16
This bug affects 1 person
Affects: StarlingX
Status: In Progress
Importance: Medium
Assigned to: chen haochuan

Bug Description

Brief Description
-----------------
Lock and unlock operations on storage-0 unlocked storage-0 as expected, but the 800.001 alarm did not clear:
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c | warning | 2019-09-14T23:06:47.386732 |

Severity
--------
standard

Steps to Reproduce
------------------

step 1.
Confirm with 'ceph -s' that all OSDs are up and in.
@ [2019-09-14 23:01:19,546] all OSDs are up and in according to 'ceph -s' output:
 osd: 6 osds: 6 up, 6 in

step 2.
lock storage-0

[2019-09-14 23:01:22,925] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock storage-0'

Verify that the Ceph cluster health reflects the OSDs being down.

ceph -s @ [2019-09-14 23:01:43,527]
  cluster:
    id: 6231df84-33be-4aa4-82ea-7408e0f2421c
    health: HEALTH_WARN
            3 osds down
            1 host (3 osds) down
            Reduced data availability: 3 pgs inactive, 9 pgs peering
            too few PGs per OSD (19 < min 30)
            1/3 mons down, quorum controller-0,controller-1

| 2deecf77-765f-4749-a6c4-b7d8160c91b0 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c.peergroup=group-0.host=storage-0 | major | 2019-09-14T23:01:43.581914

Test Step 3: Check that alarms are raised when storage-0 is locked
Test Step 4: Check that the OSDs are down
Test Step 5: Check that the loss of replication alarm is raised
Test Step 6: Check that the ceph health warning alarm is raised
Test Step 7: Attempt to lock controller-0
Test Step 8: Attempt to force lock controller-0
Test Step 9: Attempt to lock controller-1
Rejected with: Only 2 storage monitor available. At least 2 unlocked and enabled hosts with monitors are required. Please ensure hosts with monitors are unlocked and enabled.
Test Step 10: Attempt to force lock controller-1
Rejected with: Only 2 storage monitor available. At least 2 unlocked and enabled hosts with monitors are required. Please ensure hosts with monitors are unlocked and enabled.

Test Step 11:
Unlock storage host storage-0 at this time:

[2019-09-14 23:02:10,244] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock storage-0'

See the corresponding sysinv.log:
2019-09-14 23:02:14.024 106806 WARNING sysinv.api.controllers.v1.vim_api [-] vim_host_action hostname=storage-0, action=unlock

Test Step 12: Confirm services enabled, host unlocked
system host-show storage-0 indicates the host is now unlocked and available:
| action              | none                                                          |
| administrative      | unlocked                                                      |
| availability        | available                                                     |
| config_applied      | e60c1692-46d0-4f17-95cb-199ff02a6c60                          |
| boot_device         | /dev/disk/by-path/pci-0000:85:00.0-nvme-                      |
| peers               | {u'hosts': [u'storage-0', u'storage-1'], u'name': u'group-0'} |
| operational         | enabled                                                       |
| vim_progress_status | services-enabled                                              |

Expected Behavior
------------------
The 800.001 alarm that was raised when storage-0 was locked should have cleared after storage-0 was unlocked.

Actual Behavior
----------------

The 800.001 alarm was raised and did not clear; see @ 2019-09-14T23:06:47

see fm-event.log
2019-09-14T23:06:47.000 controller-0 fmManager: info { "event_log_id" : "800.001", "reason_text" : "Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-113-121.cluster=6231df84-33be-4aa4-82ea-7408e0f2421c", "severity" : "warning", "state" : "set", "timestamp" : "2019-09-14 23:06:47.386732" }
2019-09-14T23:06:47.000 controller-0 fmManager: info { "event_log_id" : "800.011", "reason_text" : "Loss of replication in replication group group-0: OSDs are down", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-113-121.cluster=6231df84-33be-4aa4-82ea-7408e0f2421c.peergroup=group-0.host=storage-0", "severity" : "major", "state" : "clear", "timestamp" : "2019-09-14 23:06:47.724669"

The alarm was still there @ [2019-09-14 23:09:39,109]

[2019-09-14 23:09:39,109] fm alarm-list

| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| f5496afd-0703-4065-bcd0-1ee78ab6753b | 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=6231df84-33be-4aa4-82ea-7408e0f2421c | warning | 2019-09-14T23:06:47.386732

Reproducibility
---------------
weekly regression

System Configuration
--------------------
storage system WCP_113_121

Lab-name:

Branch/Pull Time/Commit
-----------------------
2019-09-13_20-09-52

Last Pass
---------

Timestamp/Logs
--------------
see inline

Test Activity
-------------
storage regression

Wendy Mitchell (wmitchellwr) wrote :

This caused the following test case to fail: test_ceph.py::test_lock_cont_check_mon_down (and also the next test case; see below)
started at [2019-09-14 22:51:52,679]

FAIL ceph/test_ceph.py::test_lock_cont_check_mon_down
E Details: Timed out waiting for alarm 800.001 to disappear [2019-09-14 23:00:55,360]
FAIL ceph/test_ceph.py::test_storgroup_semantic_checks
E Details: Timed out waiting for alarm 800.001 to disappear [2019-09-14 23:09:17,136] 338

tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Assigning to the storage team to triage and make a recommendation regarding importance/release target.

tags: added: stx.storage
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Cindy Xie (xxie1) wrote :

Is this bug similar to bug 1840267?

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
Ghada Khalil (gkhalil) on 2019-09-23
tags: added: stx.3.0
Wendy Mitchell (wmitchellwr) wrote :

Logs attached as requested

Wendy Mitchell (wmitchellwr) wrote :
Changed in starlingx:
assignee: chen haochuan (martin1982) → nobody
yong hu (yhu6) on 2019-10-21
Changed in starlingx:
assignee: nobody → Linjia Chang (linjiachang)
chen haochuan (martin1982) wrote :

Not reproduced with image 20191007.

Linjia Chang (linjiachang) wrote :

What timeout is used in the test?
We tried testing with image 2019-09-13_20-09-52 with 4 OSDs and did not reproduce the issue.
From the time storage-0 showed unlocked until the 800.001 and 800.011 alarms disappeared took about 2 minutes (from 17:16:16 to 17:18:38).

Will try 6 OSDs next.

chen haochuan (martin1982) wrote :

This alarm is not caused by the host-lock and host-unlock actions; the 800.001 alarm was already present before those two actions.

log excerpted from controller-0/var/log/ceph-manager.log
2019-09-14 14:34:47.891 102577 INFO ceph_manager.monitor [-] Current alarm: {'alarm_state': 'set', 'service_affecting': 'False', 'severity': 'warning', 'alarm_type': 'equipment', 'timestamp': '2019-09-14 14:29:44.697974', 'entity_type_id': u'cluster', 'probable_cause': 'equipment-malfunction', 'alarm_id': '800.001', 'proposed_repair_action': u'If problem persists, contact next level of support.', 'reason_text': u"Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details.", 'suppression': 'False', 'uuid': '260c9ed4-ad4a-40b4-9f59-dcb3393db06a', 'entity_instance_id': u'cluster=6231df84-33be-4aa4-82ea-7408e0f2421c'}
2019-09-14 14:34:47.891 102577 INFO ceph_client [-] Request params: url=https://controller-0:5001/request?wait=1, json={'prefix': 'health', 'format': 'text'}
2019-09-14 14:34:47.907 102577 INFO ceph_client [-] Result: {u'waiting': [], u'has_failed': False, u'state': u'success', u'is_waiting': False, u'running': [], u'failed': [], u'finished': [{u'outb': u'HEALTH_WARN too few PGs per OSD (21 < min 30)\n', u'outs': u'', u'command': u'health format=text'}], u'is_finished': True, u'id': u'139677338605200'}
2019-09-14 14:34:47.920 102577 INFO ceph_manager.monitor [-] Current Ceph health: HEALTH_WARN detail: too few PGs per OSD (21 < min 30)
2019-09-14 14:34:47.961 102577 INFO ceph_manager.monitor [-] Created storage alarm 5d0e07c0-fa05-42d8-a157-d6db156e3983 - severity: warning, reason: Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details., service_affecting: False
2019-09-14 14:34:47.962 102577 INFO ceph_manager.monitor [-] Ceph status changed: HEALTH_WARN detailed reason: too few PGs per OSD (21 < min 30)

This alarm can be found as early as when storage node provisioning is done.

The root cause is that the current default pg_num setting is not appropriate for deployments with 3, 6, 9, or 12 OSDs.

Similar cases:
https://forum.proxmox.com/threads/too-few-pgs-per-osd-21-min-30.50299/
https://ekuric.wordpress.com/2016/02/09/increase-number-of-pgpgp-for-ceph-cluster-ceph-error-too-few-pgs-per-osd/
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018573.html

We should study the sysinv-conductor pg_num and pgp_num settings in ceph.py.
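
A minimal inspection sketch, assuming it is run on a controller node with Ceph admin credentials: it dumps the current pg_num and pgp_num of every pool using only standard ceph CLI commands, so the values produced by ceph.py can be compared against the 30-PGs-per-OSD warning threshold.

import subprocess

def ceph(*args):
    # run a ceph CLI command and return its trimmed stdout
    return subprocess.check_output(('ceph',) + args).decode().strip()

# list every pool, then query its pg_num and pgp_num
for pool in ceph('osd', 'pool', 'ls').splitlines():
    pg_num = ceph('osd', 'pool', 'get', pool, 'pg_num')    # e.g. "pg_num: 64"
    pgp_num = ceph('osd', 'pool', 'get', pool, 'pgp_num')  # e.g. "pgp_num: 64"
    print('%s: %s, %s' % (pool, pg_num, pgp_num))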

Linjia Chang (linjiachang) wrote :

In this environment there are 128 PGs in total and 6 OSDs, so the warning persists the entire time.

pg_num is set as follows:
if data_pt and osds:
    # [(Target PGs per OSD) * (# OSD) * (% Data)] / Size
    target_pg_num_raw = ((osds * constants.CEPH_TARGET_PGS_PER_OSD * data_pt / 100) /
                         replication)
    # find next highest power of 2 via shift bit length
    target_pg_num = 1 << (int(target_pg_num_raw) - 1).bit_length()

where constants.CEPH_TARGET_PGS_PER_OSD = 200 and data_pt could be 10%, 20%, 28%, 27%, 35%, or 80% depending on the pool type. So it is possible that, as the number of OSDs grows, the number of PGs for each OSD drops below 30; in this environment, 128 PGs spread over 6 OSDs averages out to roughly 21 PGs per OSD, which matches the warning in the ceph-manager log above.

We should either (1) change how this is calculated, or (2) add a cap at the end to check that the result meets the minimum of 30 PGs per OSD. A sketch of option 2 follows.
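
A minimal sketch of option 2, assuming the check runs after the existing per-pool calculation; the helper name enforce_min_pgs_per_osd and the example pool split are hypothetical, not existing sysinv code.

# Sketch only: after the per-pool pg_num values have been computed by the
# existing formula, check the cluster-wide average and scale up if it falls
# below Ceph's warning threshold. Doubling keeps each value a power of two.
def enforce_min_pgs_per_osd(pool_pg_nums, osds, min_per_osd=30):
    # pool_pg_nums: pool name -> pg_num from the existing calculation
    while pool_pg_nums and sum(pool_pg_nums.values()) < min_per_osd * osds:
        pool_pg_nums = {pool: pg * 2 for pool, pg in pool_pg_nums.items()}
    return pool_pg_nums

# With this report's numbers (128 PGs total, 6 OSDs) the check doubles each
# pool once: 128 < 30 * 6 = 180, so the total becomes 256.
pools = {'kube-rbd': 64, 'cinder-volumes': 64}  # illustrative split of the 128 PGs
print(enforce_min_pgs_per_osd(pools, osds=6))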

Changed in starlingx:
assignee: Linjia Chang (linjiachang) → chen haochuan (martin1982)
chen haochuan (martin1982) wrote :

controller-0:/opt/platform$ sudo grep "chunk_size" ./* --color -r -n
./helm/19.09/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml:14: chunk_size: 64
./helm/19.09/stx-openstack/1.0-19/openstack-nova.yaml:11: - rbd_chunk_size: 512
./helm/19.09/stx-openstack/1.0-19/openstack-cinder.yaml:24: chunk_size: 8
./helm/19.09/stx-openstack/1.0-19/openstack-cinder.yaml:29: chunk_size: 8

Currently the chunk_size for every Ceph pool is hardcoded:

stx/config/sysinv/sysinv/sysinv/sysinv/helm/rbd_provisioner.py
                "chunk_size": 64,

stx/config/sysinv/sysinv/sysinv/sysinv/helm/nova.py
            'rbd_chunk_size': constants.CEPH_POOL_EPHEMERAL_PG_NUM

stx/config/sysinv/sysinv/sysinv/sysinv/helm/cinder.py
                'chunk_size': constants.CEPH_POOL_VOLUMES_CHUNK_SIZE,

Per the guidance in the Ceph documentation:
https://docs.ceph.com/docs/master/rados/operations/placement-groups/#placement-groups-tradeoffs

this chunk size should be calculated from the application usage, the number of OSDs, and the replication factor.
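
A minimal sketch of that direction, reusing the same formula and power-of-two rounding as ceph.py to derive per-pool values that could feed the 'chunk_size' / 'rbd_chunk_size' helm overrides instead of the fixed 64 / 8 / 512 above; the pool names, data shares, and replication factor are illustrative assumptions, not values taken from sysinv.

# Sketch only: pool names, data shares and replication are assumptions.
TARGET_PGS_PER_OSD = 200  # mirrors constants.CEPH_TARGET_PGS_PER_OSD

def chunk_size_for_pool(osds, data_pt, replication):
    # [(Target PGs per OSD) * (# OSD) * (% Data)] / Size, rounded up to the
    # next power of two, as in sysinv's ceph.py
    raw = (osds * TARGET_PGS_PER_OSD * data_pt / 100.0) / replication
    return 1 << (int(raw) - 1).bit_length()

data_shares = {'kube-rbd': 20, 'cinder-volumes': 35, 'ephemeral': 27}
overrides = {pool: chunk_size_for_pool(osds=6, data_pt=pct, replication=2)
             for pool, pct in data_shares.items()}
print(overrides)  # values like these could replace the hardcoded chunk_size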

Fix proposed to branch: master
Review: https://review.opendev.org/697625

Changed in starlingx:
status: Triaged → In Progress