Backup create failed: RBD volume flatten takes too long, causing the MQ to time out.

Bug #1916843 reported by likai
This bug affects 6 people
Affects: Cinder
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Creating a backup from a snapshot involves creating a temporary volume from that snapshot, which requires cinder-backup and cinder-volume to interact over RPC. The default value of "rpc_response_timeout" is 60s.
With the Ceph RBD volume driver, if rbd_flatten_volume_from_snapshot=true, the flatten operation can take so long that the RPC call does not return in time, which raises an exception and causes the backup to fail.

backup/manager.py:
backup_device = self.volume_rpcapi.get_backup_device(context, backup, volume)
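
The failure mode described above can be illustrated with a minimal, self-contained sketch. It is not Cinder code: the function names, the scaled-down timeout, and the sleep that stands in for the clone and flatten are all assumptions made for illustration. It only shows how a caller that waits for a reply with a fixed timeout fails once the remote work takes longer than that timeout.

```python
import concurrent.futures
import time

# Scaled-down stand-in for rpc_response_timeout (60s by default in the report).
RPC_RESPONSE_TIMEOUT = 0.5


def get_backup_device(flatten_seconds):
    """Hypothetical stand-in for the volume service creating a temporary volume."""
    time.sleep(flatten_seconds)  # simulates the clone plus flatten
    return "temp-volume"


def blocking_rpc_call(flatten_seconds, timeout=RPC_RESPONSE_TIMEOUT):
    """Wait for the reply the way a synchronous RPC call would."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(get_backup_device, flatten_seconds)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # The caller gives up here, although the worker keeps running
            # until its simulated flatten has finished.
            return "timeout"


fast = blocking_rpc_call(0.0)  # work finishes well inside the timeout
slow = blocking_rpc_call(1.0)  # work outlasts the timeout
```

In this sketch a larger timeout only moves the threshold: any operation whose duration grows with volume size can eventually exceed it.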

Revision history for this message
likai (likai0906) wrote :
tags: added: backup-service rbd snapshot
Changed in cinder:
importance: Undecided → Medium
Revision history for this message
Sofia Enriquez (lsofia-enriquez) wrote :

Hi Likai,
After discussing this in the last upstream meeting [1], we need more information:
- Please add more context about the environment you are using.

- It looks like the main point of this bug is that a slow call shouldn't block the service from handling RPC calls. Does it work for small volumes, while large ones always take longer than 60s so that the RPC call times out?

Looking forward to your reply.
Regards,
Sofia
[1] http://eavesdrop.openstack.org/meetings/cinder/2021/cinder.2021-03-03-14.00.log.html

Changed in cinder:
status: New → Incomplete
Revision history for this message
likai (likai0906) wrote :

Of course, Sofia.
- About the environment: the volume backend is Ceph RBD, with rbd_flatten_volume_from_snapshot=true.
- Yes, the time it takes to flatten a clone increases with the size of the volume snapshot. It can take longer than the RPC timeout. Increasing rpc_response_timeout will not fundamentally solve this problem.

Revision history for this message
Walt Boring (walter-boring) wrote :

We are running into this same problem in our deployments with VMware.
get_backup_device() blocks waiting for a temporary volume to be created on the backend over RPC.
With the VMware driver, it boils down to doing a full clone of the original volume, which can take hours for large volumes full of data. We have many customers that have 200G and even 2TB volumes. Every backup fails due to this RPC timeout.

Revision history for this message
Eric Harney (eharney) wrote :

It looks like this oslo.messaging functionality:
https://review.opendev.org/c/openstack/oslo.messaging/+/546763

might provide a simpler way to fix this than what is proposed in https://review.opendev.org/c/openstack/cinder/+/784477 currently.

Have you taken a look at this option?

Revision history for this message
Walt Boring (walter-boring) wrote :

The oslo option doesn't help, and it isn't reasonable for backups that can take upwards of 10 hours to complete.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/790492

Changed in cinder:
status: Incomplete → In Progress
Revision history for this message
Gorka Eguileor (gorka) wrote :

Proposed fix to skip flattening temporary volumes:

 https://review.opendev.org/c/openstack/cinder/+/790492

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/784477
Committed: https://opendev.org/openstack/cinder/commit/e38fb71aac05a9ddc29670d4395c408d565f5d37
Submitter: "Zuul (22348)"
Branch: master

commit e38fb71aac05a9ddc29670d4395c408d565f5d37
Author: Hemna <email address hidden>
Date: Thu Apr 1 16:37:20 2021 -0400

    Rework backup process to make it async

    This patch updates the backup process to call the volume manager
    asynchronously to get the backup device in which to do the backup on.
    This fixes a major issue with certain cinder drivers that take a long
    time to create a temporary clone of the volume being backed up.

    Closes-Bug: #1916843
    Change-Id: Ib861e1bc35247f932fbae3796ed9025a560461c4
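
The asynchronous approach described in this commit message can be sketched in isolation as follows. This is a simplified illustration, not the merged patch: the class names, method names, and the callback mechanism are hypothetical, and threads stand in for the message queue. The point it shows is that the backup side sends the request without waiting for a reply, and the volume side calls back when the device is ready, so no response timeout applies to the slow step.

```python
import queue
import threading
import time


class VolumeService:
    """Hypothetical stand-in for cinder-volume."""

    def get_backup_device_async(self, backup_id, on_ready):
        # Fire-and-forget: start the slow work and return immediately.
        worker = threading.Thread(
            target=self._prepare_device, args=(backup_id, on_ready))
        worker.start()
        return worker

    def _prepare_device(self, backup_id, on_ready):
        time.sleep(0.2)  # simulates a slow clone of the source volume
        on_ready(backup_id, "temp-volume-for-" + backup_id)


class BackupService:
    """Hypothetical stand-in for cinder-backup."""

    def __init__(self, volume_service):
        self.volume_service = volume_service
        self.completed = queue.Queue()

    def create_backup(self, backup_id):
        # No reply is awaited here, so no response timeout is involved.
        return self.volume_service.get_backup_device_async(
            backup_id, self.continue_backup)

    def continue_backup(self, backup_id, device):
        # Called back once the device exists; the data copy would happen here.
        self.completed.put((backup_id, device))


backup = BackupService(VolumeService())
worker = backup.create_backup("b1")
worker.join()
result = backup.completed.get(timeout=1)
```

The trade-off in a design like this is that the backup flow is split into two steps, so the state between the request and the callback has to be tracked somewhere rather than living on the caller's stack.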

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 20.0.0.0rc1

This issue was fixed in the openstack/cinder 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/790492
Committed: https://opendev.org/openstack/cinder/commit/e726c07948138f514706cc69440971a2105c2bc0
Submitter: "Zuul (22348)"
Branch: master

commit e726c07948138f514706cc69440971a2105c2bc0
Author: Gorka Eguileor <email address hidden>
Date: Mon May 10 10:35:57 2021 +0200

    RBD: Don't flatten temporary resources

    There are instances where cinder needs to create a temporary volume and
    this can trigger a flatten of the new temporary volume, which will make
    the operation take a lot longer.

    In some cases this means slower operations, but in others it leads to
    rpc timeout failures.

    A case where we see timeout failures is when doing a backup of a
    snapshot and we have rbd_flatten_volume_from_snapshot=true.

    This patch ensures that we don't flatten temporary volumes.

    Closes-Bug: #1916843
    Change-Id: I8f55c3beb2f8df5b2227506f82ddf6ee57c951ae
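
The decision this commit message describes can be sketched as a small predicate. The sketch below is an assumption-laden illustration, not the driver's code: the `Volume` class, the `is_temporary` flag, and the `should_flatten` helper are hypothetical names, and how the real driver identifies a temporary volume is not stated in this report.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    is_temporary: bool = False  # hypothetical flag marking internal temp volumes


def should_flatten(volume, flatten_from_snapshot):
    """Flatten only when the option is enabled and the volume is not temporary.

    flatten_from_snapshot stands in for rbd_flatten_volume_from_snapshot.
    """
    return flatten_from_snapshot and not volume.is_temporary


user_volume = Volume("volume-1")
temp_volume = Volume("backup-temp-1", is_temporary=True)
```

With the option enabled, `should_flatten(user_volume, True)` is true while `should_flatten(temp_volume, True)` is false, so a temporary volume created only to serve a backup skips the expensive flatten.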

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/cinder/+/845039

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/cinder/+/845130

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/845039
Committed: https://opendev.org/openstack/cinder/commit/50c94ed0960e8bebbf5d17bac06c1646538f2fc2
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 50c94ed0960e8bebbf5d17bac06c1646538f2fc2
Author: Gorka Eguileor <email address hidden>
Date: Mon May 10 10:35:57 2021 +0200

    RBD: Don't flatten temporary resources

    There are instances where cinder needs to create a temporary volume and
    this can trigger a flatten of the new temporary volume, which will make
    the operation take a lot longer.

    In some cases this means slower operations, but in others it leads to
    rpc timeout failures.

    A case where we see timeout failures is when doing a backup of a
    snapshot and we have rbd_flatten_volume_from_snapshot=true.

    This patch ensures that we don't flatten temporary volumes.

    Closes-Bug: #1916843
    Change-Id: I8f55c3beb2f8df5b2227506f82ddf6ee57c951ae
    (cherry picked from commit e726c07948138f514706cc69440971a2105c2bc0)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/845130
Committed: https://opendev.org/openstack/cinder/commit/5a21cd9be243e076a402db2c465319d5d946658d
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 5a21cd9be243e076a402db2c465319d5d946658d
Author: Gorka Eguileor <email address hidden>
Date: Mon May 10 10:35:57 2021 +0200

    RBD: Don't flatten temporary resources

    There are instances where cinder needs to create a temporary volume and
    this can trigger a flatten of the new temporary volume, which will make
    the operation take a lot longer.

    In some cases this means slower operations, but in others it leads to
    rpc timeout failures.

    A case where we see timeout failures is when doing a backup of a
    snapshot and we have rbd_flatten_volume_from_snapshot=true.

    This patch ensures that we don't flatten temporary volumes.

    Closes-Bug: #1916843
    Change-Id: I8f55c3beb2f8df5b2227506f82ddf6ee57c951ae
    (cherry picked from commit e726c07948138f514706cc69440971a2105c2bc0)
    (cherry picked from commit 50c94ed0960e8bebbf5d17bac06c1646538f2fc2)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 21.0.0.0rc1

This issue was fixed in the openstack/cinder 21.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 19.2.0

This issue was fixed in the openstack/cinder 19.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 20.1.0

This issue was fixed in the openstack/cinder 20.1.0 release.
