gnocchi pool has many more objects per pg than average

Bug #1804846 reported by José Pekkarinen
This bug affects 1 person
Affects: Ceph Monitor Charm
Status: Fix Released
Importance: Medium
Assigned to: James Page
Milestone: 19.04

Bug Description

On a fresh deployment of Queens on Xenial, ceph reports a warning status
and never reaches good health:

# ceph -s
  cluster:
    id: <uuid>
    health: HEALTH_WARN
            1 pools have many more objects per pg than average

  services:
    mon: 6 daemons, quorum juju-<hash>-28-lxd-0,juju-<hash>-18-lxd-0,juju-<hash>-8-lxd-0,juju-<hash>-13-lxd-0,juju-<hash>-23-lxd-0,juju-<hash>-3-lxd-0
    mgr: juju-<hash>-8-lxd-0(active), standbys: juju-<hash>-18-lxd-0, juju-<hash>-13-lxd-0, juju-<hash>-28-lxd-0, juju-<hash>-3-lxd-0, juju-<hash>-23-lxd-0
    osd: 180 osds: 180 up, 180 in

  data:
    pools: 21 pools, 2544 pgs
    objects: 83723 objects, 1601 MB
    usage: 193 GB used, 1309 TB / 1309 TB avail
    pgs: 2544 active+clean

  io:
    client: 1091 B/s rd, 1 op/s rd, 0 op/s wr

# ceph health detail
HEALTH_WARN 1 pools have many more objects per pg than average
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
    pool gnocchi objects per pg (2606) is more than 81.4375 times cluster average (32)
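
For clarity, the arithmetic behind that message, using the figures from the ceph -s output above (treating 10 as the threshold is an assumption based on the default value of "mon pg warn max object skew" at the time):

    cluster average objects per pg  ≈ 83723 objects / 2544 pgs ≈ 32 (as reported above)
    gnocchi objects per pg          = 2606
    skew                            = 2606 / 32 = 81.4375  (well above the default threshold of 10)

so the MANY_OBJECTS_PER_PG check fires even though all placement groups are active+clean.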

This warning triggers alarms in Nagios, which expects Ceph to report a healthy status.
The expected behaviour on a new deployment is that the cluster reaches HEALTH_OK, and
that if it ever leaves that state it recovers on its own within a reasonable time.

Thanks!

José.

Revision history for this message
James Page (james-page) wrote :

I also see this on a Bionic/Queens deployment:

$ sudo ceph -s
  cluster:
    id: 80f2a4b2-7e13-11e8-a8c7-00163eb36865
    health: HEALTH_WARN
            1 pools have many more objects per pg than average

  services:
    mon: 3 daemons, quorum juju-182b52-6-lxd-0,juju-182b52-4-lxd-0,juju-182b52-5-lxd-0
    mgr: juju-182b52-4-lxd-0(active), standbys: juju-182b52-6-lxd-0, juju-182b52-5-lxd-0
    osd: 12 osds: 12 up, 12 in
    rgw: 1 daemon active

  data:
    pools: 22 pools, 420 pgs
    objects: 3502k objects, 421 GB
    usage: 938 GB used, 10235 GB / 11174 GB avail
    pgs: 419 active+clean
             1 active+clean+scrubbing+deep

  io:
    client: 6280 B/s rd, 17836 B/s wr, 7 op/s rd, 29 op/s wr

Changed in charm-ceph-mon:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
James Page (james-page) wrote :

rados df for reference:

POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR
.rgw.root 1113 4 0 12 0 0 0 324 216k 4 4096
cinder-ceph 319G 84379 0 253137 0 0 0 189110731 781G 112724504 9124G
default.intent-log 0 0 0 0 0 0 0 0 0 0 0
default.log 0 0 0 0 0 0 0 0 0 0 0
default.rgw 0 0 0 0 0 0 0 0 0 0 0
default.rgw.buckets 0 0 0 0 0 0 0 0 0 0 0
default.rgw.buckets.data 13113M 3299 0 9897 0 0 0 287708 59148M 63648 13285M
default.rgw.buckets.extra 0 0 0 0 0 0 0 0 0 0 0
default.rgw.buckets.index 0 16 0 48 0 0 0 42873 42892k 25458 0
default.rgw.control 0 8 0 24 0 0 0 0 0 0 0
default.rgw.gc 0 0 0 0 0 0 0 0 0 0 0
default.rgw.log 0 207 0 621 0 0 0 30403188 29690M 20258723 0
default.rgw.meta 11618 56 0 168 0 0 0 2529057 1543M 10394 1244k
default.rgw.root 0 0 0 0 0 0 0 0 0 0 0
default.usage 0 0 0 0 0 0 0 0 0 0 0
default.users 0 0 0 0 0 0 0 0 0 0 0
default.users.email 0 0 0 0 0 0 0 0 0 0 0
default.users.swift 0 0 0 0 0 0 0 0 0 0 0
default.users.uid 0 0 0 0 0 0 0 0 0 0 0
glance 47321M 6472 0 19416 0 0 0 429110 400G 154681 202G
gnocchi 44558M 3492359 0 10477077 0 0 0 98827394 68749M 476545467 210G
nova 19 2 0 6 0 0 0 2893482 69727M 9270630 626G

Revision history for this message
James Page (james-page) wrote :

Gnocchi just appears to generate a large number of objects - 3492359 vs 84379 (cinder-ceph), the next highest.

Revision history for this message
James Page (james-page) wrote :

From upstream ceph ML for this exact issue:

"> How to solve this problem correctly?

As a workaround, I'd just increase the skew option to make the warning go
away.

It seems to me like the underlying problem is that we're looking at object
count vs pg count, but ignoring the object sizes. Unfortunately it's a
bit awkward to fix because we don't have a way to quantify the size of
omap objects via the stats (currently). So for now, just adjust the skew
value enough to make the warning go away!"

Leaving that one Anon but a reliable source nonetheless!

Revision history for this message
James Page (james-page) wrote :

So we can either increase "mon pg warn max object skew" or disable the check completely.
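
For reference, a minimal sketch of the ceph.conf route (not taken from this bug): a value of 0 or less disables the MANY_OBJECTS_PER_PG check entirely, while a larger value only raises the threshold above the default of 10. Which daemon actually consumes the option varies by Ceph release (monitor on older releases, manager on newer ones), so the safe approach is to set it in the [global] section on the mon/mgr hosts and restart those daemons:

    # /etc/ceph/ceph.conf (sketch only)
    [global]
    # 0 disables the object skew warning; a value like 100 would merely raise the threshold
    mon pg warn max object skew = 0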

Changed in charm-ceph-mon:
milestone: none → 19.04
James Page (james-page)
Changed in charm-ceph-mon:
status: Confirmed → Triaged
Revision history for this message
James Page (james-page) wrote :

I've gone for 'disable this check as it's not super useful' rather than increasing the skew threshold value, as I think we'll just need to keep increasing it over time for busy deployments.
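
For anyone hitting this before the charm change lands, a runtime-only workaround (a sketch, not part of the charm fix) is to inject the setting into the running monitors; injected values do not survive a daemon restart, and on newer Ceph releases the equivalent change may also be needed on the manager daemons:

    $ sudo ceph tell mon.* injectargs '--mon_pg_warn_max_object_skew 0'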

Revision history for this message
James Page (james-page) wrote :
Changed in charm-ceph-mon:
assignee: nobody → James Page (james-page)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.openstack.org/624398
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=33f9bae6c7eba61b931191882f0696db909fb84e
Submitter: Zuul
Branch: master

commit 33f9bae6c7eba61b931191882f0696db909fb84e
Author: James Page <email address hidden>
Date: Tue Dec 11 13:39:54 2018 +0000

    Disable object skew warnings

    Ceph will issue a HEALTH_WARN in the event that one pool has a
    large number of objects compared to other pools in the cluster:

     "Issue a HEALTH_WARN in cluster log if the average object
      number of a certain pool is greater than mon pg warn max
      object skew times the average object number of the whole
      pool."

    For OpenStack deployments, Gnocchi and RADOS gateway can generate
    a large number of small objects compared to Cinder, Glance and
    Nova usage, causing the cluster to go into HEALTH_WARN status.

    Disable this check until the skew evaluation also includes the
    size of the objects as well as the number.

    Change-Id: I83211dbdec4dea8dca5b27a66e26a4431d2a7b77
    Closes-Bug: 1804846

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
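
Once a charm release carrying this change has been deployed (or upgraded to), the warning should clear. A quick way to confirm, assuming a Juju 2.x client and that the application is named ceph-mon, is to run the health check across the mon units; the expected output once the new configuration has been rendered and the monitors restarted is HEALTH_OK:

    $ juju run --application ceph-mon 'sudo ceph health detail'
    HEALTH_OK
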
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/18.11)

Fix proposed to branch: stable/18.11
Review: https://review.openstack.org/628917

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/18.11)

Reviewed: https://review.openstack.org/628917
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=a0e0db5774c573f50db3c11892b91a91d6eafec9
Submitter: Zuul
Branch: stable/18.11

commit a0e0db5774c573f50db3c11892b91a91d6eafec9
Author: James Page <email address hidden>
Date: Tue Dec 11 13:39:54 2018 +0000

    Disable object skew warnings

    Ceph will issue a HEALTH_WARN in the event that one pool has a
    large number of objects compared to other pools in the cluster:

     "Issue a HEALTH_WARN in cluster log if the average object
      number of a certain pool is greater than mon pg warn max
      object skew times the average object number of the whole
      pool."

    For OpenStack deployments, Gnocchi and RADOS gateway can generate
    a large number of small objects compared to Cinder, Glance and
    Nova usage, causing the cluster to go into HEALTH_WARN status.

    Disable this check until the skew evaluation also includes the
    size of the objects as well as the number.

    Change-Id: I83211dbdec4dea8dca5b27a66e26a4431d2a7b77
    Closes-Bug: 1804846
    (cherry picked from commit 33f9bae6c7eba61b931191882f0696db909fb84e)

David Ames (thedac)
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released