Comment 2 for bug 1918119

Christian Rohmann (christian-rohmann) wrote : Re: Volume backup timeout for large volumes

This response is going to be long. Please excuse that, but also take it as an indication that I am really interested in improving cinder-backup's current state, most importantly with regard to performance.

I myself have also been wondering about experiences with Object Storage (S3) as a cinder-backup target and started a thread on the ML: https://lists.openstack.org/pipermail/openstack-discuss/2022-September/030263.html. I attended the operator hour at the Antelope PTG, where we discussed (among other things that are not great in cinder-backup) the performance issues with non-RBD drivers; see the notes starting at https://etherpad.opendev.org/p/antelope-ptg-cinder#L119. Following some more conversations at the PTG and in the cinder weekly meetings, I was just about to open a new bug, but I am observing the very same performance issues with all cinder-backup drivers that use the abstract chunked driver (https://opendev.org/openstack/cinder/src/branch/master/cinder/backup/chunkeddriver.py), i.e. almost all of them except RBD.

1) Some benchmarks and observations:

```
Test scenario is a new, clean 1 TiB volume attached to a VM. 20 GiB were written to it via dd from /dev/urandom, so the volume holds ~20 GiB in total (plus file system metadata). Ceph and S3 backup drivers were tested, creating a new full backup.

Ceph RBD: 8m42s (avg. time over 3 runs)
S3 to local MinIO (via chunked driver): Error after 5h30m
```

The problem with the S3 driver (or rather the chunked driver) is that every chunk is processed and uploaded individually, with no deduplication at all. Even with only 20 GiB written to the 1 TiB volume, the S3 driver hashes, compresses and uploads every "empty" chunk as well.
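
To illustrate, the per-chunk loop works roughly like the following simplified sketch. This is not the actual cinder code; `volume_file`, `upload_object` and the chunk size are stand-ins. The point is that there is no check that would let it skip unallocated/all-zero chunks:

```
import hashlib
import zlib

CHUNK_SIZE = 32 * 1024 * 1024  # illustrative chunk size; the real value is configurable

def backup_volume(volume_file, upload_object):
    """Simplified sketch of a chunked backup loop (not the actual cinder code)."""
    index = 0
    while True:
        chunk = volume_file.read(CHUNK_SIZE)   # read the next chunk into memory
        if not chunk:
            break
        # Every chunk is hashed, compressed and uploaded individually --
        # even if it only contains zeroes (unwritten space on the volume).
        checksum = hashlib.sha256(chunk).hexdigest()
        compressed = zlib.compress(chunk)      # default compression: zlib
        upload_object('backup-%05d' % index, compressed, checksum)
        index += 1
```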

2) We dug a little deeper and tried to find the bottlenecks via profiling

We used OpenStack Wallaby (but could gladly repeat this with Yoga if someone believes the results would be different). To speed things up, all test scenarios now used a 10 GB volume with 3 GB written from /dev/urandom. The profiling measured the main stages of the chunked backup process: volume read, hashing, compression and S3 upload. The average times below are per chunk.
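
Such per-stage numbers can be collected with simple wall-clock timers around each stage, roughly like this (a sketch; `volume_file`, `compressor` and `upload_object` are placeholders for the respective driver calls):

```
import time
from collections import defaultdict

timings = defaultdict(list)

def timed(stage, func, *args):
    """Record the wall-clock time of one stage for one chunk."""
    start = time.monotonic()
    result = func(*args)
    timings[stage].append(time.monotonic() - start)
    return result

# Inside the per-chunk loop the stages are wrapped like this:
#   chunk      = timed('read', volume_file.read, CHUNK_SIZE)
#   digest     = timed('hash', hashlib.sha256, chunk)
#   compressed = timed('compress', compressor.compress, chunk)
#                timed('upload', upload_object, object_name, compressed)
```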

Results of profiling:

```
Avg times as-is

    Volume read: ~0.2s
    Hashing: ~0.1s
    Compression: ~1.3s
    S3 upload: ~0.4s
    Total backup time: ~8m
```

```
Avg times with your concurrency patch (https://review.opendev.org/c/openstack/cinder/+/779233) applied:

    Volume read: ~0.2s
    Hashing: ~0.1s
    Compression: ~1.3s
    S3 upload: ~0.4s
    Total backup time: ~7-8m (fastest: ~6m)
    Summary: A little bit faster than without concurrency
```

```
Avg times with concurrency patch and zstd

    Volume read: ~0.2s
    Hashing: ~0.1s
    Compression: ~0.05s
    S3 upload: ~0.4s
    Total backup time: ~2.5m
    Summary: Compression with zstd is 24x faster than with zlib; overall 3x faster
```

```
Avg times as-is with zstd

    Volume read: ~0.2s
    Hashing: ~0.1s
    Compression: ~0.05s
    S3 upload: ~0.4s
    Total backup time: ~4.4m
    Summary: Much faster than with zlib, even without upload concurrency
```

Summary:
The overall backup time can be reduced significantly simply by switching from the default zlib compression to zstd. Python's zstd implementation seems to be multi-threaded by default [0], making compression around 24x faster and the overall backup about 3x faster.
I also tested the concurrency patch [1] and observed a slight performance increase: the fastest overall backup time was ~20% faster than without the patch, which seems to match the patch author's measurements.
[0]: https://pypi.org/project/zstd/
[1]: https://review.opendev.org/c/openstack/cinder/+/779233
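
For reference, the zlib vs. zstd difference is easy to reproduce stand-alone. A minimal micro-benchmark sketch, assuming the zstd package from [0] is installed and exposes a module-level compress() (exact numbers depend on hardware and data):

```
import os
import time
import zlib
import zstd  # the python-zstd package referenced in [0]

chunk = os.urandom(32 * 1024 * 1024)  # one 32 MiB chunk of (incompressible) random data

for name, compress in (('zlib', zlib.compress), ('zstd', zstd.compress)):
    start = time.monotonic()
    compress(chunk)
    print('%s: %.2fs' % (name, time.monotonic() - start))
```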

Regarding your patch, I am wondering if and how it relates to the `backup_native_threads_pool_size` parameter
(https://opendev.org/openstack/cinder/commit/e570436d1cca5cfa89388aec8b2daa63d01d0250)?

3) High memory utilization
See https://etherpad.opendev.org/p/cinder-bobcat-meetings#L164:

    Issues of high memory utilization:
    * https://github.com/openstack/cinder/commit/b661d115f5011cf51095e698c68acc4ab5440011
    * https://github.com/openstack/cinder/commit/30c2289c9b0456d3783f01e3d65985ed1b09976a

    (How) to reduce memory footprint for "all" deployment tools, as done here for
    Bug: https://bugs.launchpad.net/cinder/+bug/1908805
    Change to DevStack https://review.opendev.org/c/openstack/devstack/+/845805
    Change to Triple-O https://review.opendev.org/c/openstack/tripleo-common/+/845807

    This is particularly important for all "chunked" backup drivers as they read blocks into memory.
    RBD, in contrast, simply pipes the data via "rbd export -> rbd import".

I intend to write a similar change for openstack-ansible to get the memory consumption down
for those installations as well.

4) Streaming instead of chunking?

Could we switch to using streams to "pipe" the data read from the volume through the compressor
and then directly to the driver / storage? That would reduce the memory footprint for compressing and
decompressing chunks and make the whole process more efficient (see the sketch after the list below).

This seems to be supported by (most of) the components involved:

        Python:
            Streams - https://docs.python.org/3/library/io.html#

        Storage:
            https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html

        Compressors:
            GZIP https://docs.python.org/3/library/zlib.html#zlib.compressobj

            BZ2 https://docs.python.org/3/library/bz2.html#bz2.decompress

            ZSTD: not implemented in python-zstd (https://github.com/sergey-dryabzhinsky/python-zstd/pull/31#issuecomment-429532288), but there is https://github.com/indygreg/python-zstandard
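
A minimal sketch of what such a streaming pipeline could look like, using zlib.compressobj as the incremental compressor; `upload_stream` is a hypothetical driver call that consumes an iterator of bytes:

```
import zlib

def stream_backup(volume_file, upload_stream, read_size=4 * 1024 * 1024):
    """Pipe volume data through an incremental compressor straight to the storage driver.

    Only `read_size` bytes are held in memory at a time instead of a full chunk.
    """
    def compressed_blocks():
        compressor = zlib.compressobj()
        while True:
            data = volume_file.read(read_size)
            if not data:
                break
            out = compressor.compress(data)
            if out:
                yield out
        yield compressor.flush()

    upload_stream(compressed_blocks())
```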

5) Restoring into sparse volumes
There is now support for restoring into sparse volumes, see https://review.opendev.org/c/openstack/cinder/+/852654.
This was also an issue for the drivers using the chunked driver as their base, but it is now implemented and merged.
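
The core idea, as a simplified sketch (not the merged implementation): instead of writing all-zero data on restore, seek over it so the target volume stays sparse:

```
def restore_chunk(volume_file, chunk):
    """Write one restored chunk while keeping the target volume sparse (sketch)."""
    if chunk.count(0) == len(chunk):     # chunk consists only of zero bytes
        volume_file.seek(len(chunk), 1)  # skip ahead, leaving a hole
    else:
        volume_file.write(chunk)
```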

6) Decouple backup state from volume state
Shameless cross-advertising of a spec I worked on: https://review.opendev.org/c/openstack/cinder-specs/+/868761

The idea is to decouple the state of a backup from the state of the volume, so that cinder-backup working on a volume
does not block any other volume actions during that time. This is an even bigger issue because backups take such a long time.

To sum this all-too-long response up:
-------------------------------------------
Realistic volume sizes of 4-10 TiB should be something cinder-backup can handle.
Also, considering that there are now servers with many CPU cores, 100G network interfaces and fast NVMe storage, a throughput
of >1 GiB/s for a single backup (stream) should be the goal.

Consider a not unreasonably large 8 TiB volume being backed up in full:
even at 1 GiB/s the backup will still take well over two hours to complete (see the quick calculation below).
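
Quick back-of-the-envelope check:

```
volume_gib = 8 * 1024      # 8 TiB expressed in GiB
throughput = 1.0           # assumed sustained backup throughput in GiB/s
print(volume_gib / throughput / 3600)  # ~2.28 hours
```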

And I am not even talking about the issue of cinder-backup not being able to resume a
backup if it was interrupted for some reason (restarted, updated, network glitch, ...).