Volume backup timeout for large volumes when using backend based on chunkeddriver

Bug #1918119 reported by kiran pawar
This bug affects 8 people
Affects: Cinder
Status: In Progress
Importance: Low
Assigned to: Unassigned

Bug Description

It was observed that backing up large volumes (e.g. 2 TB) takes around 20 hours and sometimes times out. The volume is divided into chunks, which are uploaded sequentially.
We could parallelize the chunk upload (backup chunk) process by creating a thread pool of around 5-10 threads, as sketched below.
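A minimal sketch of the idea, using a standard-library thread pool (`chunks` and `upload_chunk` are hypothetical stand-ins for the driver's per-chunk logic, not actual cinder code):

```
# Sketch only: upload backup chunks concurrently instead of one by one.
from concurrent.futures import ThreadPoolExecutor, as_completed


def backup_chunks_parallel(chunks, upload_chunk, max_workers=5):
    """Submit every chunk to a small thread pool and wait for completion."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload_chunk, chunk) for chunk in chunks]
        for future in as_completed(futures):
            future.result()  # re-raise the first upload failure, if any
```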

Revision history for this message
kiran pawar (kpdev) wrote :
Changed in cinder:
status: New → In Progress
assignee: nobody → kiran pawar (kiranpawar89)
Changed in cinder:
importance: Undecided → Low
milestone: none → 19.0.0
Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

This response is going to be long. Please excuse that, but also take it as an indication that I am really interested in improving cinder-backup's current state, most importantly with regard to performance.

I myself was also wondering about experiences with Object Storage (S3) as a cinder-backup target and started a thread on the ML: https://lists.openstack.org/pipermail/openstack-discuss/2022-September/030263.html. I attended the operator hour at the Antelope PTG, where we discussed (among other things not really great in cinder-backup) the performance issues with non-RBD drivers; see the notes starting at https://etherpad.opendev.org/p/antelope-ptg-cinder#L119. Following some more conversations at the PTG and in the cinder weekly meetings, I was just about to open a new bug, but I am observing the same performance issues with all cinder-backup drivers based on the abstract chunked driver (https://opendev.org/openstack/cinder/src/branch/master/cinder/backup/chunkeddriver.py), i.e. almost all of them except RBD.

1) Some benchmarks and observations:

```
Test scenario is a new, clean 1 TiB volume attached to a VM. 20 GiB were written to it via dd from /dev/urandom, so the volume holds ~20 GiB in total (plus file system metadata). Ceph and S3 backup drivers were tested, creating a new full backup.

Ceph RBD: 8m42s (avg. time over 3 runs)
S3 to local MinIO (via chunked driver): Error after 5h30m
```

The problem with the S3 driver (or rather the chunked driver) is that every block is processed and uploaded individually, with no deduplication at all. Even with only 20 GiB written to the 1 TiB volume, the S3 driver hashes, compresses and uploads all the "empty" blocks bit by bit.
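As an illustration of the missing optimization, an all-zero chunk could be detected cheaply before it ever reaches the hash/compress/upload pipeline (a hedged sketch; the chunk size and the `read_chunk`/`upload_chunk` callables are assumptions, not the driver's real interface):

```
# Sketch only: skip all-zero ("empty") chunks before doing any expensive work.
CHUNK_SIZE = 32 * 1024 * 1024  # hypothetical example chunk size (32 MiB)


def backup_volume(read_chunk, upload_chunk):
    while chunk := read_chunk(CHUNK_SIZE):
        if chunk == bytes(len(chunk)):  # all zeroes: record a hole instead
            continue
        upload_chunk(chunk)  # only non-empty chunks get hashed/compressed/uploaded
```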

2) We dug a little deeper and tried to find the bottlenecks via profiling

We used OpenStack Wallaby (but could gladly repeat this with Yoga if someone believes the results would be different). To speed things up, all test scenarios now used a 10 GB volume with 3 GB written from /dev/urandom. The profiling analyzed the main stages of the chunked backup process: volume read, hashing, compression and S3 upload. The following average times are given per chunk.
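For context, per-stage numbers of this kind can be gathered with simple wall-clock instrumentation around each step of the chunk pipeline, roughly like this sketch (the reader and uploader stubs are placeholders, not our actual test setup):

```
# Sketch: time each stage of one chunk's backup pipeline.
import hashlib
import time
import zlib


def timed(label, fn, *args):
    """Run fn(*args) and print its wall-clock duration."""
    start = time.monotonic()
    result = fn(*args)
    print(f"{label}: {time.monotonic() - start:.2f}s")
    return result


def read_chunk():
    # Placeholder: the real test read the next chunk from the volume.
    return b"\x00" * (32 * 1024 * 1024)


def upload_object(data):
    # Placeholder: the real test uploaded to MinIO via the S3 driver.
    pass


chunk = timed("Volume read", read_chunk)
timed("Hashing", hashlib.sha256, chunk)
data = timed("Compression", zlib.compress, chunk)
timed("S3 upload", upload_object, data)
```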

Results of profiling:

```
Avg times as-is

    Volume read: ~0.2s
    Hashing: ~0.1s
    Compression: ~1.3s
    S3 upload: ~0.4s
    Total backup time: ~8m
```

```
Avg times with your concurrency patch (https://review.opendev.org/c/openstack/cinder/+/779233) applied:

    Volume read: ~0.2s
    Hashing: ~0.1s
    Compression: ~1.3s
    S3 upload: ~0.4s
    Total backup time: ~7-8m (fastest: ~6m)
    Summary: A little bit faster than without concurrency
```

```
Avg times with concurrency patch and zstd

    Volume read: ~0.2s
    Hashing: ~0.1s
    Compression: ~0.05s
    S3 upload: ~0.4s
    Total backup time: ~2.5m
    Summary: Compression with zstd is 24x faster than with zlib; overall 3x faster
```

```
Avg times as-is with zstd

    Volume read: ~0.2s
    Hashing: ~0.1s
    Compression: ~0.05s
    S3 upload: ~0.4s
    Total backup time: ~4.4m
    Summary: Much faster than with zlib, even without upload concurrency
```

Summary:
The overall backup time can be optimized easily by switching from the default zlib to zstd compression.
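The gap is easy to reproduce in isolation with the third-party zstandard package (a standalone sketch mirroring the /dev/urandom scenario above; within cinder the compressor is selected via the backup_compression_algorithm option):

```
# Standalone micro-benchmark: zlib vs zstd on one chunk of random data.
import os
import time
import zlib

import zstandard  # third-party: pip install zstandard

chunk = os.urandom(32 * 1024 * 1024)  # example 32 MiB chunk

start = time.monotonic()
zlib.compress(chunk)
print(f"zlib: {time.monotonic() - start:.2f}s")

start = time.monotonic()
zstandard.ZstdCompressor().compress(chunk)
print(f"zstd: {time.monotonic() - start:.2f}s")
```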


Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

I just found https://review.opendev.org/c/openstack/cinder/+/611079, which complains about backups to NFS being too slow / blocking. This is likely not specific to the way object storages are used: parallelizing over multiple connections (e.g. multipart uploads) should similarly reduce the effect of long round trips.
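To make that concrete: with S3, the parts of a single multipart upload can be sent over parallel connections, which hides much of the per-request round-trip latency. A hedged boto3 sketch (bucket, key and worker count are made-up examples; the chunked driver itself stores each chunk as a separate object):

```
# Sketch: parallel S3 multipart upload with boto3; names are examples only.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "backup-bucket", "volume-backup-object"  # example names


def upload_part(upload_id, part_number, body):
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                          PartNumber=part_number, Body=body)
    return {"PartNumber": part_number, "ETag": resp["ETag"]}


def parallel_multipart_upload(parts, max_workers=5):
    # Each part must be >= 5 MiB except the last one (S3 API requirement).
    upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        done = list(pool.map(lambda p: upload_part(upload_id, *p),
                             enumerate(parts, start=1)))
    s3.complete_multipart_upload(
        Bucket=BUCKET, Key=KEY, UploadId=upload_id,
        MultipartUpload={"Parts": sorted(done, key=lambda p: p["PartNumber"])})
```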

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

I suppose this should now be left unassigned so a core dev can pick it up?

Changed in cinder:
assignee: kiran pawar (kpdev) → nobody
summary: - Volume backup timeout for large volumes
+ Volume backup timeout for large volumes when using backend based on
+ chunkeddriver
Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Talk by CERN on their observations of sub-par Cinder Backup performance using the chunked driver: https://youtu.be/ni-UgftgAy0?si=n3px75TpTApa1v7c&t=705

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

There was some discussion about this in the cinder weekly, see https://meetings.opendev.org/meetings/cinder/2024/cinder.2024-01-24-14.01.log.html#l-64
