Expose block_device_allocate_retries as a dedicated config option

Bug #1758607 reported by Nobuto Murata
This bug affects 7 people
Affects                       Status        Importance  Assigned to    Milestone
OpenStack Charm Guide         Fix Released  Undecided   Nobuto Murata
OpenStack Nova Compute Charm  Fix Released  Wishlist    Nobuto Murata  22.04

Bug Description

It might be a good idea to expose "block_device_allocate_retries" as a dedicated config option in the charm, although it can already be tweaked easily via the "config-flags" option.
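
For example, the current workaround looks something like this (a sketch; it assumes the application is deployed under the name "nova-compute"):

    juju config nova-compute config-flags="block_device_allocate_retries=300"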

"block_device_allocate_retries" is an option I change in every engagement Windows guests are involved. The default block device attachment time out is 3 minutes (block_device_allocate_retries_interval=3 * block_device_allocate_retries=60 = 180 seconds). It may not be enough to download 10+ GB Windows image and convert it from QCOW2 to RAW (if the image was uploaded as QCOW2 originally and needs to be converted to RAW for Ceph backend).

I usually bump "block_device_allocate_retries" to 300 to set the timeout to 15 minutes to be safe (block_device_allocate_retries_interval=3 * block_device_allocate_retries=300 = 900 seconds).
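
In nova.conf terms, the tuned values look like this (a sketch; both options live in nova's [DEFAULT] section):

    [DEFAULT]
    # effective timeout = retries * interval = 300 * 3 = 900 seconds (15 min)
    block_device_allocate_retries = 300
    block_device_allocate_retries_interval = 3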

block_device_allocate_retries_interval = 3
> (Integer) Interval (in seconds) between block device allocation retries on failures.
>
> This option allows the user to specify the time interval between consecutive retries. ‘block_device_allocate_retries’ option specifies the maximum number of retries.

block_device_allocate_retries = 60
> (Integer) Number of times to retry block device allocation on failures. Starting with Liberty, Cinder can use image volume cache. This may help with block device allocation performance. Look at the cinder image_volume_cache_enabled configuration option.

James Page (james-page)
Changed in charm-nova-compute:
status: New → Triaged
importance: Undecided → Wishlist
Trent Lloyd (lathiat)
tags: added: sts
Revision history for this message
Trent Lloyd (lathiat) wrote :

I agree with changing this and I think we should set the default to 10 or 15 minutes.

Many environments use qcow2 images with Ceph volumes, which requires the cinder node to download the image, convert it, and then upload it, usually through a HDD on the root disk. This thrashes the HDD on those nodes, hurting performance, and if you try to create more than 2-6 VMs at once it will almost certainly time out with the default (180 seconds) or even 240.

Revision history for this message
Trent Lloyd (lathiat) wrote :

Recently hit this in two production environments when deploying Windows images. Even 240 seconds is not enough.

Revision history for this message
Jamon Camisso (jamon) wrote :

Likewise, supporting an arbitrarily large interval/retry combination (within reason) would be helpful. A customer ran into this timeout after 186 seconds in a cloud with an 80 GB Windows image.

Revision history for this message
Mark Maglana (mmaglana) wrote :

Does the charm have to expose block_device_allocate_retries directly, or can we make it friendlier so that the charm user can specify a timeout in seconds rather than having to compute it in their head via block_device_allocate_retries and block_device_allocate_retries_interval?

Revision history for this message
Nobuto Murata (nobuto) wrote :

As long as the config description mentions block_device_allocate_retries and block_device_allocate_retries_interval to give an idea of the upstream config values, a friendlier config name would be nice to have.
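
Such a friendlier option would only need a trivial mapping from a timeout in seconds to a retry count, along these lines (a hypothetical sketch; the option name and the fixed 3-second interval are assumptions):

    # hypothetical: operator supplies a timeout in seconds,
    # the charm derives the retry count from the interval
    timeout_seconds=900
    interval=3
    retries=$(( (timeout_seconds + interval - 1) / interval ))  # ceil(900/3) = 300
    echo "block_device_allocate_retries=${retries}"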

Mark Maglana (mmaglana)
Changed in charm-nova-compute:
assignee: nobody → Mark Maglana (mmaglana)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (master)

Fix proposed to branch: master
Review: https://review.opendev.org/668933

Changed in charm-nova-compute:
status: Triaged → In Progress
Revision history for this message
Mark Maglana (mmaglana) wrote :

The preceding proposed fix now exposes block_device_allocate_retries_interval and block_device_allocate_retries as configuration options for the charm.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/669060

Revision history for this message
Mark Maglana (mmaglana) wrote :

For clarity:

* https://review.opendev.org/#/c/668933/ - Addresses the bug directly by exposing two OpenStack options

* https://review.opendev.org/#/c/669060/ - Adds a friendlier alternative to the two options in the above change.

Ryan Beisner (1chb1n)
Changed in charm-nova-compute:
milestone: none → 19.10
Revision history for this message
Trent Lloyd (lathiat) wrote :

I would love to see a fix for this merged, including an increased default, as needing to raise this value is a common problem: not only for large images, but also when deploying multiple VMs at once, since images copying in parallel are often converted through the root disk of the cinder node and can bottleneck on slower root disks.

In the spirit of the charms' opinionated and simplified configuration, I think we should not merge 668933 and should prefer the simpler 669060 patch set, since users can already use config-flags to set the individual values if they really need to.

David Ames (thedac)
Changed in charm-nova-compute:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-nova-compute:
milestone: 20.01 → 20.05
David Ames (thedac)
Changed in charm-nova-compute:
milestone: 20.05 → 20.08
Revision history for this message
David Coronel (davecore) wrote :

I just hit this problem in a lab environment with every OpenStack component virtualised in VMs on one bare-metal machine. It's a low-performance environment. I set block_device_allocate_retries = 300 and block_device_allocate_retries_interval = 3 and it worked for me.

James Page (james-page)
Changed in charm-nova-compute:
milestone: 20.08 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-nova-compute (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-nova-compute/+/669060
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-nova-compute/+/668933
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Nobuto Murata (nobuto)
Changed in charm-nova-compute:
assignee: Mark Maglana (mmaglana) → nobody
Nobuto Murata (nobuto)
Changed in charm-nova-compute:
assignee: nobody → Nobuto Murata (nobuto)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (master)
Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-medium for tracking.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (master)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-compute/+/828253
Committed: https://opendev.org/openstack/charm-nova-compute/commit/2283f12eddc45ab97701ece264391e545f6bda1c
Submitter: "Zuul (22348)"
Branch: master

commit 2283f12eddc45ab97701ece264391e545f6bda1c
Author: Nobuto Murata <email address hidden>
Date: Tue Feb 8 18:02:12 2022 +0900

    Expose block-device-allocate-retries and interval

    The upstream has 3 min as the timeout (60 retries at 3-seconds
    interval). It should work if an image is in a raw format to leverage
    Ceph's copy-on-write or an image is small enough to be copied quickly.
    However, there are some cases exceeding the 3 min deadline, such as a
    big enough image in Qcow2 or other formats like Windows images, or a
    storage backend that doesn't have copy-on-write from Glance.

    Let's bump the deadline to 15 min (300 retries at 3-seconds interval) to
    cover most of the cases out of the box, and let operators tune it
    further by exposing those options.

    Co-authored-by: Mark Maglana <email address hidden>
    Closes-Bug: 1758607
    Change-Id: I6f6da8e90c6bbcd031ee183ae86d88eccd392230
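
With the fix merged, operators can presumably tune the deadline directly, along these lines (a sketch; the charm option names are inferred from the commit title, and the 30-minute value is only an example):

    juju config nova-compute \
        block-device-allocate-retries=600 \
        block-device-allocate-retries-interval=3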

Changed in charm-nova-compute:
status: In Progress → Fix Committed
Nobuto Murata (nobuto)
Changed in charm-guide:
assignee: nobody → Nobuto Murata (nobuto)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/charm-nova-compute/+/836163

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-compute/+/836163
Committed: https://opendev.org/openstack/charm-nova-compute/commit/c52d87f071874dfa55e84ea96c2d5d29df29fad9
Submitter: "Zuul (22348)"
Branch: stable/xena

commit c52d87f071874dfa55e84ea96c2d5d29df29fad9
Author: Nobuto Murata <email address hidden>
Date: Tue Feb 8 18:02:12 2022 +0900

    Expose block-device-allocate-retries and interval

    The upstream has 3 min as the timeout (60 retries at 3-seconds
    interval). It should work if an image is in a raw format to leverage
    Ceph's copy-on-write or an image is small enough to be copied quickly.
    However, there are some cases exceeding the 3 min deadline, such as a
    big enough image in Qcow2 or other formats like Windows images, or a
    storage backend that doesn't have copy-on-write from Glance.

    Let's bump the deadline to 15 min (300 retries at 3-seconds interval) to
    cover most of the cases out of the box, and let operators tune it
    further by exposing those options.

    Co-authored-by: Mark Maglana <email address hidden>
    Closes-Bug: 1758607
    Change-Id: I6f6da8e90c6bbcd031ee183ae86d88eccd392230
    (cherry picked from commit 2283f12eddc45ab97701ece264391e545f6bda1c)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-guide (master)
Changed in charm-guide:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-guide (master)

Reviewed: https://review.opendev.org/c/openstack/charm-guide/+/837651
Committed: https://opendev.org/openstack/charm-guide/commit/fa51adbbd5d1c91c342b75d1ce7df2398ff50729
Submitter: "Zuul (22348)"
Branch: master

commit fa51adbbd5d1c91c342b75d1ce7df2398ff50729
Author: Nobuto Murata <email address hidden>
Date: Wed Apr 13 11:46:11 2022 +0900

    release-notes: block-device-allocate timeout

    Closes-Bug: #1758607
    Change-Id: I056d79682213a39bcaa44b847cb78b84fbaf95de

Changed in charm-guide:
status: In Progress → Fix Released
Changed in charm-nova-compute:
milestone: none → 22.04
Changed in charm-nova-compute:
status: Fix Committed → Fix Released