Ceph health is too many PGs per OSD (320 > max 300) after trying to delete ceph osds

Bug #1539555 reported by Andrey Sledzinskiy
This bug affects 4 people
Affects             Status   Importance  Assigned to  Milestone
Fuel for OpenStack  Invalid  High        MOS Ceph
8.0.x               Invalid  High        Egor Kotko
Mitaka              Invalid  High        MOS Ceph

Bug Description

Steps to reproduce:
1. Create and deploy the following cluster: Neutron VLAN; Ceph for volumes/images/ephemeral and RadosGW for objects; 3 controllers, 3 Ceph nodes, 1 compute node
2. After deployment, add one Ceph node and re-deploy (not necessary to reproduce)
3. After the re-deploy, start preparing a Ceph node for deletion (following this guide: https://docs.mirantis.com/openstack/fuel/fuel-7.0/operations.html#how-to-safely-remove-a-ceph-osd-node; the full removal sequence is sketched after this list)
4. Execute the following commands (on node-2 in my case):
- ceph osd out 1
- ceph osd out 3
5. Wait for 'ceph -s' to show OK status
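
For reference, the steps in the linked guide boil down to the standard Ceph OSD removal sequence; a sketch for osd.1 (the service management command varies by distribution and is an assumption here):

    ceph osd out 1                 # mark the OSD out so its data rebalances away
    # wait for 'ceph -s' to report recovery as finished, then:
    stop ceph-osd id=1             # Upstart syntax on Ubuntu; use the distro's equivalent
    ceph osd crush remove osd.1    # remove the OSD from the CRUSH map
    ceph auth del osd.1            # delete its cephx key
    ceph osd rm 1                  # remove the OSD from the cluster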

Actual result: after an hour of waiting (test cluster without any data on the Ceph nodes), the status is:

ceph -s
    cluster c3c93807-159d-46b4-93bf-285f03414733
     health HEALTH_WARN
            too many PGs per OSD (320 > max 300)
     monmap e3: 3 mons at {node-3=10.109.1.8:6789/0,node-4=10.109.1.6:6789/0,node-5=10.109.1.9:6789/0}
            election epoch 4, quorum 0,1,2 node-4,node-3,node-5
     osdmap e65: 8 osds: 8 up, 6 in
      pgmap v194: 640 pgs, 10 pools, 12977 kB data, 51 objects
            12566 MB used, 284 GB / 296 GB avail
                 640 active+clean

Fuel ISO - 478
Logs are attached.

Tags: area-ceph
Revision history for this message
Ivan Ponomarev (ivanzipfer) wrote :

Please provide the Fuel ISO version.

Changed in fuel:
status: New → Incomplete
Vasily Gorin (vgorin)
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
description: updated
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

After executing 'ceph osd out $id' on a Ceph node, its health is constantly:
ceph -s
    cluster c3c93807-159d-46b4-93bf-285f03414733
     health HEALTH_WARN
            too many PGs per OSD (320 > max 300)
     monmap e3: 3 mons at {node-3=10.109.1.8:6789/0,node-4=10.109.1.6:6789/0,node-5=10.109.1.9:6789/0}
            election epoch 4, quorum 0,1,2 node-4,node-3,node-5
     osdmap e65: 8 osds: 8 up, 6 in
      pgmap v194: 640 pgs, 10 pools, 12977 kB data, 51 objects
            12566 MB used, 284 GB / 296 GB avail
                 640 active+clean

So it seems it's not a QA issue.

description: updated
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → MOS Ceph (mos-ceph)
summary: - add_delete_ceph test timed out waiting ceph health to be ok
+ Ceph health is too many PGs per OSD (320 > max 300) after trying to
+ delete ceph osds
tags: removed: area-qa
tags: added: area-ceph
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Not a bug.

The number of placement groups per OSD increases after removing OSDs (the placement
groups which were served by the removed OSDs get distributed among the remaining OSDs).
Ceph warns that having that many placement groups per OSD *might* be suboptimal.
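
For the cluster in this report the arithmetic works out as follows (a sketch assuming the default pool size of 3 replicas, which the report does not show):

    640 PGs x 3 replicas = 1920 PG replicas
    before 'ceph osd out':  1920 / 8 "in" OSDs = 240 PGs per OSD  (under the 300 limit)
    after 2 OSDs are out:   1920 / 6 "in" OSDs = 320 PGs per OSD  (matches the warning)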

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

@asledzinskiy

As I remember, I have sometimes seen the HEALTH_WARN state after step 2 (before any delete steps).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/274689

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (stable/8.0)

Related fix proposed to branch: stable/8.0
Review: https://review.openstack.org/275417

Revision history for this message
Egor Kotko (ykotko) wrote :

Reproduced on RC2 ISO #570
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "570"
  build_id: "570"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "c2a335b5b725f1b994f78d4c78723d29fa44685a"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Egor, have you seen Alexei's comment on this? ( https://bugs.launchpad.net/fuel/+bug/1539555/comments/4 )

Why do you think this should be re-opened? What exactly breaks?

Revision history for this message
Egor Kotko (ykotko) wrote :

Currently we get the warning because we exceed the maximum number of placement groups per OSD. Can we increase the max value?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

So it looks like 300 is the default maximum PG count per OSD (according to https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3/storage-strategies/chapter-14-pg-count#maximum-pg-count), which *can* be tweaked.

As Alexei pointed out, having more PGs per OSD node is not actually fatal, but rather suboptimal. The Ceph cluster remains fully functional.

How many PGs we should have per OSD node is probably a topic for another discussion.

Just to make this clear: given that you still have to move PGs somewhere when removing an OSD node, there will always be a case (depending on the number of OSD nodes and PGs per node) when a Ceph cluster remains in HEALTH_WARN status unless you add a new node.
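
For reference, the warning threshold itself can be raised; a sketch, assuming a Hammer-era Ceph where the option is named mon_pg_warn_max_per_osd (400 is an arbitrary example value):

    # raise the threshold at runtime on all monitors (not persistent across restarts)
    ceph tell mon.* injectargs '--mon_pg_warn_max_per_osd 400'
    # to persist it, add the following to the [global] section of ceph.conf
    # on the monitor nodes and restart the monitors:
    #   mon_pg_warn_max_per_osd = 400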

Revision history for this message
Volodymyr Shypyguzov (vshypyguzov) wrote :

Also, according to Ceph documentation ( http://docs.ceph.com/docs/master/rados/operations/placement-groups/#set-the-number-of-placement-groups ), you cannot decrease PG number:

To set the number of placement groups in a pool, you must specify the number of placement groups at the time you create the pool. See Create a Pool for details. Once you’ve set placement groups for a pool, you may increase the number of placement groups (but you cannot decrease the number of placement groups)

So this is kind of expected behavior.
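
In practice that means pg_num has to be chosen correctly at pool creation time; a sketch with a hypothetical pool name:

    # create a pool with 128 placement groups (name and counts are examples)
    ceph osd pool create testpool 128 128
    # increasing pg_num (and then pgp_num) later is allowed:
    ceph osd pool set testpool pg_num 256
    ceph osd pool set testpool pgp_num 256
    # decreasing pg_num is rejected by Ceph releases of this era
    # (PG merging only appeared much later, in Nautilus)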

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

It looks like expected behaviour for Ceph; it is not a blocker for users, and it is OK for a Ceph cluster to be in WARN status when two nodes out of three are down. This issue will not be reproduced on a large cluster with many nodes.

Please update your test scenario according to real use cases.

Status changed to Invalid for all releases.
