[SRU] Gathering of thin provisioning stats breaks c-vol

Bug #1704106 reported by Arne Wiebalck
This bug affects 9 people
Affects                     | Status       | Importance | Assigned to        | Milestone
Cinder                      | Fix Released | Undecided  | Gorka Eguileor     |
OpenStack Cinder-Ceph charm | Fix Released | Medium     | Edward Hope-Morley |
Ubuntu Cloud Archive        | Fix Released | Undecided  | Unassigned         |
Ocata                       | Fix Released | High       | Unassigned         |
Pike                        | Fix Released | High       | Unassigned         |

Bug Description

[Impact]

Backport the config option added in Queens that allows disabling
the collection of stats from all rbd volumes. This collection
causes many non-fatal race conditions and slows down deletes to
the point where the rpc thread pool fills up, blocking further
requests. Our charms do not configure pool by default and we
are not aware of anyone doing this in the field, so this patch
enables this option by default.

[Test Case]

By default no change in behaviour should occur. To test the
new feature we need to enable it i.e.:

* deploy openstack ocata
* set rbd_exclusive_cinder_pool = true in cinder.conf
* create 100 volumes via cinder
* also create 100 volumes from the cinder pool but using the rbd client directly
* delete cinder volumes (via cinder) and delete the non-cinder rbd volumes using rbd client
* ensure there are no exceptions in cinder-volume.log
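
The last step can be scripted. The following is only a minimal sketch: it assumes that "exceptions" show up in cinder-volume.log as Python tracebacks or as ERROR/CRITICAL records, which depends on the logging format in use. The helper name `find_exception_lines` is made up for this example.

```python
import re

# Assumption: tracebacks and ERROR/CRITICAL records are what count as
# "exceptions" here; adjust the pattern to the log format actually in use.
_PROBLEM = re.compile(
    r"Traceback \(most recent call last\)|\b(?:ERROR|CRITICAL)\b"
)


def find_exception_lines(log_text):
    """Return (line_number, line) pairs that look like errors or tracebacks."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if _PROBLEM.search(line)
    ]
```

Feed it the contents of the log after the delete step; an empty list means the check passed under the assumptions above.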

[Regression Potential]
The default behaviour is unchanged so no regression is expected.

==========================================================================

The gathering of the thin provisioning stats is done by looping over all volumes:

https://github.com/openstack/cinder/blob/master/cinder/volume/drivers/rbd.py#L369

For larger deployments, this loop (done at start-up, upon volume deletion and as a periodic task) is taking too long and hence breaks the c-vol service.

From what I understand, the overall idea of this stats gathering is to bring the current real fill status of the pool to the admin's attention in case over-subscription was configured. For this, a fill status at the pool level (rather than the volume level) should be good enough.
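
To illustrate why the per-volume loop does not scale, here is a rough back-of-the-envelope sketch. It assumes the images are queried one after another and that each query costs about the same; the function names and any numbers plugged in are illustrative, not measurements from a real deployment.

```python
def stats_pass_ms(image_count, ms_per_image):
    """Duration of one stats pass that queries every image sequentially."""
    return image_count * ms_per_image


def overruns_period(image_count, ms_per_image, period_ms):
    """True if a single pass takes longer than the periodic task interval."""
    return stats_pass_ms(image_count, ms_per_image) > period_ms
```

For example, with a hypothetical 10,000 images at 50 ms each, one pass takes 500,000 ms (over 8 minutes), far longer than a 60,000 ms interval. A single pool-level query would not grow with the number of volumes.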

summary: - Gat
+ Gathering of thin provisioning stats breaks c-vol
Gorka Eguileor (gorka)
tags: added: ceph drivers rbd
Changed in cinder:
status: New → Confirmed
Revision history for this message
Arne Wiebalck (arne-wiebalck) wrote : Re: Gathering of thin provisioning stats breaks c-vol

As a simple mitigation, how about adding two config options to
- enable/disable provisioning stats gathering
- control the frequency of its periodic task ?

Revision history for this message
Daniel Speichert (dasp) wrote :

This is probably happening to me as well. No matter how many osapi_volume_workers we have, the periodic task takes down the cinder-volume service (it's running but not handling any RPC calls and shows as dead - XXX).

Disabling _get_usage_info() (https://github.com/openstack/cinder/blob/master/cinder/volume/drivers/rbd.py#L367-L375) e.g. by changing to "return" solves the problem.

My main concern is why can a single periodic task take over all threads and take down the agent? Shouldn't it be contained to one worker leaving the rest available for processing everything else? It's not like the periodic task runs multiple times because even before periodic_interval passes once from start-up, cinder-volume is already unavailable.

Revision history for this message
Daniel Speichert (dasp) wrote :

Refreshing this issue. It is still happening in latest stable/newton.

Revision history for this message
Marcus Klein (kleini76) wrote :

It happens with Ocata Cinder 10.0.6, too.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/541285

Changed in cinder:
assignee: nobody → Gorka Eguileor (gorka)
status: Confirmed → In Progress
Revision history for this message
George (lmihaiescu) wrote : Re: Gathering of thin provisioning stats breaks c-vol

Running Ocata Cinder 10.0.6 too and having the same problem.

Applied the following patch to get cinder-volume working again for the time being, but I would love a proper fix backported to Ocata.

root@controller1:~# diff rbd.py_bak /usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py
363,370c363
< with RADOSClient(self) as client:
< for t in self.RBDProxy().list(client.ioctx):
< if t.startswith('volume'):
< # Only check for "volume" to allow some flexibility with
< # non-default volume_name_template settings. Template
< # must start with "volume".
< with RBDVolumeProxy(self, t, read_only=True) as v:
< v.diff_iterate(0, v.size(), None, self._iterate_cb)
---
> return

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/541285
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=f33baccc3544cbda6cd5908328a56096046657ed
Submitter: Zuul
Branch: master

commit f33baccc3544cbda6cd5908328a56096046657ed
Author: Gorka Eguileor <email address hidden>
Date: Tue Feb 6 14:54:57 2018 +0100

    RBD: Don't query Ceph on stats for exclusive pools

    Collecting stats for provisioned_capacity_gb takes a long time since we
    have to query each individual image for the provisioned size. If we are
    using the pool just for Cinder and/or are willing to accept a potential
    deviation in Cinder stats we could just not retrieve this information
    and calculate this based on the DB information for the volumes.

    This patch adds configuration option `rbd_exclusive_cinder_pool` that
    allows us to disable the size collection and thus improve the stats
    reporting speed.

    Change-Id: I32c7746fa9149bce6cdec96ee9aa87b303de4271
    Closes-Bug: #1704106
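
The trade-off described in the commit message can be sketched roughly as follows. This is not the driver's actual code: `provisioned_capacity_gb`, its parameters and the `query_size_bytes` callable are hypothetical stand-ins, and only the option name `rbd_exclusive_cinder_pool` comes from the patch.

```python
GIB = 1024 ** 3


def provisioned_capacity_gb(exclusive_pool, db_sizes_gb, image_names,
                            query_size_bytes):
    """Return the provisioned capacity of a pool in GiB.

    exclusive_pool: value of rbd_exclusive_cinder_pool.
    db_sizes_gb: volume sizes (GiB) as recorded in the Cinder database.
    image_names: names of all images found in the pool.
    query_size_bytes: callable(name) -> bytes; stands in for the
        per-image request to Ceph (hypothetical helper).
    """
    if exclusive_pool:
        # Pool is used only by Cinder: rely on the database and make
        # no per-image requests at all.
        return float(sum(db_sizes_gb))
    # Shared pool: one request per image, which is the slow path.
    total_bytes = sum(query_size_bytes(name) for name in image_names)
    return total_bytes / GIB
```

With the option enabled the cost no longer depends on the number of images, at the price of ignoring any images in the pool that Cinder does not know about.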

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
Aaron (game-on-deactivatedaccount) wrote : Re: Gathering of thin provisioning stats breaks c-vol

This is affecting an OpenStack Ansible Pike deployment, running 16.0.8

Is there any possibility this could be backported to Pike?

Thank you

Revision history for this message
Aaron (game-on-deactivatedaccount) wrote :

This also affects the Pike version of Cinder

Changed in ubuntu:
status: New → Confirmed
no longer affects: ubuntu
Revision history for this message
Gorka Eguileor (gorka) wrote :

The biggest issue was when we were incorrectly doing calculations for the real used size instead of the provisioned size, but that was also fixed and backported to Pike: https://review.openstack.org/#/c/501316/

This new patch adds a new feature, with a specific configuration option, so unfortunately this cannot be backported, as my latest patch is more a feature/optimization/improvement.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/561721

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/queens)

Reviewed: https://review.openstack.org/561721
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=21821c16580377c4e6443d0b440f41cb7de0ca8d
Submitter: Zuul
Branch: stable/queens

commit 21821c16580377c4e6443d0b440f41cb7de0ca8d
Author: Gorka Eguileor <email address hidden>
Date: Tue Feb 6 14:54:57 2018 +0100

    RBD: Don't query Ceph on stats for exclusive pools

    Collecting stats for provisioned_capacity_gb takes a long time since we
    have to query each individual image for the provisioned size. If we are
    using the pool just for Cinder and/or are willing to accept a potential
    deviation in Cinder stats we could just not retrieve this information
    and calculate this based on the DB information for the volumes.

    This patch adds configuration option `rbd_exclusive_cinder_pool` that
    allows us to disable the size collection and thus improve the stats
    reporting speed.

    Change-Id: I32c7746fa9149bce6cdec96ee9aa87b303de4271
    Closes-Bug: #1704106
    (cherry picked from commit f33baccc3544cbda6cd5908328a56096046657ed)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 13.0.0.0b1

This issue was fixed in the openstack/cinder 13.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 12.0.2

This issue was fixed in the openstack/cinder 12.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/607192

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/pike)

Reviewed: https://review.openstack.org/607192
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=1dca272d8f47bc180cc481e6c6a835eda0bb06a8
Submitter: Zuul
Branch: stable/pike

commit 1dca272d8f47bc180cc481e6c6a835eda0bb06a8
Author: Gorka Eguileor <email address hidden>
Date: Tue Feb 6 14:54:57 2018 +0100

    RBD: Don't query Ceph on stats for exclusive pools

    Collecting stats for provisioned_capacity_gb takes a long time since we
    have to query each individual image for the provisioned size. If we are
    using the pool just for Cinder and/or are willing to accept a potential
    deviation in Cinder stats we could just not retrieve this information
    and calculate this based on the DB information for the volumes.

    This patch adds configuration option `rbd_exclusive_cinder_pool` that
    allows us to disable the size collection and thus improve the stats
    reporting speed.

    Change-Id: I32c7746fa9149bce6cdec96ee9aa87b303de4271
    Closes-Bug: #1704106
    (cherry picked from commit f33baccc3544cbda6cd5908328a56096046657ed)
    (cherry picked from commit 21821c16580377c4e6443d0b440f41cb7de0ca8d)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 11.2.0

This issue was fixed in the openstack/cinder 11.2.0 release.

Revision history for this message
Edward Hope-Morley (hopem) wrote :
summary: - Gathering of thin provisioning stats breaks c-vol
+ [SRU] Gathering of thin provisioning stats breaks c-vol
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/ocata)

Reviewed: https://review.openstack.org/644232
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=b1a0d62f357e431a6b74d38440a8392de972b824
Submitter: Zuul
Branch: stable/ocata

commit b1a0d62f357e431a6b74d38440a8392de972b824
Author: Gorka Eguileor <email address hidden>
Date: Tue Feb 6 14:54:57 2018 +0100

    RBD: Don't query Ceph on stats for exclusive pools

    Collecting stats for provisioned_capacity_gb takes a long time since we
    have to query each individual image for the provisioned size. If we are
    using the pool just for Cinder and/or are willing to accept a potential
    deviation in Cinder stats we could just not retrieve this information
    and calculate this based on the DB information for the volumes.

    This patch adds configuration option `rbd_exclusive_cinder_pool` that
    allows us to disable the size collection and thus improve the stats
    reporting speed.

    Change-Id: I32c7746fa9149bce6cdec96ee9aa87b303de4271
    Closes-Bug: #1704106
    (cherry picked from commit f33baccc3544cbda6cd5908328a56096046657ed)
    (cherry picked from commit 21821c16580377c4e6443d0b440f41cb7de0ca8d)
    (cherry picked from commit 1dca272d8f47bc180cc481e6c6a835eda0bb06a8)

Changed in cloud-archive:
status: New → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Switching to triaged/high for the Ubuntu cloud archive ocata and pike releases as this patch is not yet committed there.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This is fixed in cinder 11.2.1 point release for pike so we might as well just pick this up in a new round of point releases. We'll do that via this bug: https://bugs.launchpad.net/cloud-archive/+bug/1822192

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This is currently not in the latest point release for ocata (10.0.8) so we'll have to cherry-pick it for now.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

New cinder package versions have been uploaded to ocata-staging and pike-staging.

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Arne, or anyone else affected,

Accepted cinder into ocata-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ocata-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ocata-needed to verification-ocata-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ocata-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ocata-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This is also ready for testing in pike-proposed in the cinder 2:11.2.1-0ubuntu1~cloud0 package version.

Changed in charm-cinder-ceph:
milestone: none → 19.04
assignee: nobody → Edward Hope-Morley (hopem)
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-cinder-ceph (master)

Fix proposed to branch: master
Review: https://review.openstack.org/649106

Changed in charm-cinder-ceph:
status: New → In Progress
Revision history for this message
Edward Hope-Morley (hopem) wrote :

ocata-proposed tested and verified

tags: added: verification-ocata-done
removed: verification-ocata-needed
Revision history for this message
James Page (james-page) wrote : Update Released

The verification of the Stable Release Update for cinder has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
James Page (james-page) wrote :

This bug was fixed in the package cinder - 2:10.0.8-0ubuntu1~cloud1
---------------

 cinder (2:10.0.8-0ubuntu1~cloud1) xenial-ocata; urgency=medium
 .
   * Backport: allow stop querying all rbd volumes for stats (LP: #1704106)
     - d/p/RBD-Don-t-query-Ceph-on-stats-for-exclusive-pools.patch

tags: added: sts-sru-done
David Ames (thedac)
Changed in charm-cinder-ceph:
milestone: 19.04 → 19.07
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-cinder-ceph (master)

Reviewed: https://review.opendev.org/649106
Committed: https://git.openstack.org/cgit/openstack/charm-cinder-ceph/commit/?id=616ba364296687bfeb8d9bf35bb0e7628d573e5b
Submitter: Zuul
Branch: master

commit 616ba364296687bfeb8d9bf35bb0e7628d573e5b
Author: Edward Hope-Morley <email address hidden>
Date: Mon Apr 1 17:43:58 2019 +0100

    Add rbd_exclusive_pool support for Ocata and Pike

    The fix has been backported so we can now use it.

    Change-Id: I5dfad42def68634313af05d21ebe61bc229ebc35
    Closes-Bug: #1704106

Changed in charm-cinder-ceph:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-cinder-ceph (stable/19.04)

Fix proposed to branch: stable/19.04
Review: https://review.opendev.org/662986

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-cinder-ceph (stable/19.04)

Reviewed: https://review.opendev.org/662986
Committed: https://git.openstack.org/cgit/openstack/charm-cinder-ceph/commit/?id=2b7ffcaee81045af5893327f0cf2e603d8fd8f5f
Submitter: Zuul
Branch: stable/19.04

commit 2b7ffcaee81045af5893327f0cf2e603d8fd8f5f
Author: Edward Hope-Morley <email address hidden>
Date: Mon Apr 1 17:43:58 2019 +0100

    Add rbd_exclusive_pool support for Ocata and Pike

    The fix has been backported in Cinder so we
    can now use it in the charm.

    Also includes amulet test fix and context generator
    fix from commit c3456b7.

    Change-Id: I5dfad42def68634313af05d21ebe61bc229ebc35
    Closes-Bug: #1704106
    (cherry picked from commit 616ba364296687bfeb8d9bf35bb0e7628d573e5b)

David Ames (thedac)
Changed in charm-cinder-ceph:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/761422

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/cinder/+/792128

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/792128
Committed: https://opendev.org/openstack/cinder/commit/93b3747806216cb44b23daf848d70e23fe0d3512
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 93b3747806216cb44b23daf848d70e23fe0d3512
Author: Gorka Eguileor <email address hidden>
Date: Wed Nov 4 15:34:37 2020 +0100

    RBD: Change rbd_exclusive_cinder_pool's default

    In Cinder we always try to have sane defaults, but the current RBD
    default for rbd_exclusive_cinder_pool may lead to issues on deployments
    with a large number of volumes:

    - Cinder taking a long time to start.
    - Cinder becoming non-responsive.
    - Cinder stats gathering taking longer than the gathering period.

    This is caused by the driver making an independent request to get
    detailed information on each image to accurately calculate the space
    used by the Cinder volumes.

    With this patch we change the default to make sure that these issues
    don't happen in the most common deployment case (the exclusive Cinder
    pool).

    Related-Bug: #1704106
    Change-Id: I839441a71238cdad540ba8d9d4d18b1f0fa3ee9d
    (cherry picked from commit 4ba6664deefdec5379d31b987be322745eeb4e87)

tags: added: in-stable-victoria