Performance degradation of "zfs clone"

Bug #1567557 reported by Stéphane Graber on 2016-04-07
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Native ZFS for Linux
New
Unknown
zfs-linux (Ubuntu)
Low
Colin Ian King
Xenial
Low
Colin Ian King
Zesty
Low
Colin Ian King
Artful
Low
Colin Ian King

Bug Description

[SRU Justification]

Creating tens of hundreds of clones can be prohibitively slow. The underlying mechanism to gather clone information is using a 16K buffer which limits performance. Also, the initial assumption is to pass in zero sized buffer to the underlying ioctl() to get an idea of the size of the buffer required to fetch information back to userspace. If we bump the initial buffer to a larger size then we reduce the need for two ioctl calls which improves performance.

[Fix]
Bump initial buffer size from 16K to 256K

[Regression Potential]
This is minimal as this is just a tweak in the initial buffer size and larger sizes are handled correctly by ZFS since they are normally used on the second ioctl() call once we have established the size of the buffer required from the first ioctl() call. Larger initial buffers just remove the need for the initial size estimation for most cases where the number of clones is less than ~5000. There is a risk that a larger buffer size could lead to a ENOMEM issue when allocating the buffer, but the size of buffer used is still trivial for modern large 64 bit servers running ZFS.

[Test case]
Create 4000 clones. With the fix this takes 35-40% less time than without the fix. See the example test.sh script as an example of how to create this many clones.

------------------------------------------

I've been running some scale tests for LXD and what I've noticed is that "zfs clone" gets slower and slower as the zfs filesystem is getting busier.

It feels like "zfs clone" requires some kind of pool-wide lock or something and so needs for all operations to complete before it can clone a new filesystem.

A basic LXD scale test with btrfs vs zfs shows what I mean, see below for the reports.

The test is run on a completely dedicated physical server with the pool on a dedicated SSD, the exact same machine and SSD was used for the btrfs test.

The zfs filesystem is configured with those settings:
 - relatime=on
 - sync=disabled
 - xattr=sa

So it shouldn't be related to pending sync() calls...

The workload in this case is ultimately 1024 containers running busybox as their init system and udhcpc grabbing an IP.
The problem gets significantly worse if spawning busier containers, say a full Ubuntu system.

=== zfs ===
root@edfu:~# /home/ubuntu/lxd-benchmark spawn --count=1024 --image=images:alpine/edge/amd64 --privileged=true
Test environment:
  Server backend: lxd
  Server version: 2.0.0.rc8
  Kernel: Linux
  Kernel architecture: x86_64
  Kernel version: 4.4.0-16-generic
  Storage backend: zfs
  Storage version: 5
  Container backend: lxc
  Container version: 2.0.0.rc15

Test variables:
  Container count: 1024
  Container mode: privileged
  Image: images:alpine/edge/amd64
  Batches: 128
  Batch size: 8
  Remainder: 0

[Apr 3 06:42:51.170] Importing image into local store: 64192037277800298d8c19473c055868e0288b039349b1c6579971fe99fdbac7
[Apr 3 06:42:52.657] Starting the test
[Apr 3 06:42:53.994] Started 8 containers in 1.336s
[Apr 3 06:42:55.521] Started 16 containers in 2.864s
[Apr 3 06:42:58.632] Started 32 containers in 5.975s
[Apr 3 06:43:05.399] Started 64 containers in 12.742s
[Apr 3 06:43:20.343] Started 128 containers in 27.686s
[Apr 3 06:43:57.269] Started 256 containers in 64.612s
[Apr 3 06:46:09.112] Started 512 containers in 196.455s
[Apr 3 06:58:19.309] Started 1024 containers in 926.652s
[Apr 3 06:58:19.309] Test completed in 926.652s

=== btrfs ===
Test environment:
  Server backend: lxd
  Server version: 2.0.0.rc8
  Kernel: Linux
  Kernel architecture: x86_64
  Kernel version: 4.4.0-16-generic
  Storage backend: btrfs
  Storage version: 4.4
  Container backend: lxc
  Container version: 2.0.0.rc15

Test variables:
  Container count: 1024
  Container mode: privileged
  Image: images:alpine/edge/amd64
  Batches: 128
  Batch size: 8
  Remainder: 0

[Apr 3 07:42:12.053] Importing image into local store: 64192037277800298d8c19473c055868e0288b039349b1c6579971fe99fdbac7
[Apr 3 07:42:13.351] Starting the test
[Apr 3 07:42:14.793] Started 8 containers in 1.442s
[Apr 3 07:42:16.495] Started 16 containers in 3.144s
[Apr 3 07:42:19.881] Started 32 containers in 6.530s
[Apr 3 07:42:26.798] Started 64 containers in 13.447s
[Apr 3 07:42:42.048] Started 128 containers in 28.697s
[Apr 3 07:43:13.210] Started 256 containers in 59.859s
[Apr 3 07:44:26.238] Started 512 containers in 132.887s
[Apr 3 07:47:30.708] Started 1024 containers in 317.357s
[Apr 3 07:47:30.708] Test completed in 317.357s

Richard Laager (rlaager) wrote :

Since you've already filed one bug report upstream, would you be interested in filing this one upstream? I can certainly copy-and-paste it upstream, but it seems like it'd be better to have it come from you. I don't know anything about LXD. (I'm just trying to help out with ZoL bug reports.)

Changed in zfs-linux (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Colin Ian King (colin-king)
status: New → In Progress
Colin Ian King (colin-king) wrote :

I've tried to reproduce this but when I pass more than 256 containers I run out of resources. What kind of configuration are you using to test this. Once I can reproduce this I'll try and see what the performance constraint is.

Changed in zfs-linux (Ubuntu):
status: In Progress → Incomplete
Stéphane Graber (stgraber) wrote :

Our test machines aren't particularly impressive, just 12GB of RAM or so.
Note that as can be seen above, we're using Alpine (busybox) images rather than Ubuntu to limit the resource usage and get us to a lot more containers per system.

Stéphane Graber (stgraber) wrote :

I'm trying to remember if we had to bump any of the sysctls to actually reach 1024 containers, I don't think any of the usual suspects would be in play until you reach 2000+ Alpine containers though.

If you do run out of some kernel resources, you can try applying the following sysctls to get you past that:
net.ipv6.neigh.default.gc_thresh3 = 8192
net.ipv6.neigh.default.gc_thresh2 = 4096
net.ipv6.neigh.default.gc_thresh1 = 1024
net.ipv6.route.gc_thresh = 8192
kernel.pty.max = 65536
kernel.pid_max = 2097152
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288

Colin Ian King (colin-king) wrote :

I've created 4096 zfs clones using the same commands that lxd is using and I have observed that the biggest pinch point is when fetching the zfs clone stats as this has to lock using dsl_dataset_hold_obj, fetch the data and then unlock with dsl_dataset_rele. Traversing hundreds of clones is linear and slow, taking ~1 second with ~4000+ clones.

+ 99.22% 0.00% zfs [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath
- 99.14% 0.00% zfs [kernel.kallsyms] [k] zfsdev_ioctl
   - zfsdev_ioctl
      - 99.08% zfs_ioc_objset_stats
         - 98.87% zfs_ioc_objset_stats_impl.part.20
            - 98.81% dmu_objset_stats
               - 98.81% dsl_dataset_stats
                  - 98.76% get_clones_stat
                     - 63.56% fnvlist_add_nvlist
                          nvlist_add_nvlist
                          nvlist_add_common.part.51
                          nvlist_copy_embedded.isra.54
                        + nvlist_copy_pairs.isra.52
                     + 32.74% fnvlist_add_boolean
                     + 1.81% dsl_dataset_hold_obj

Colin Ian King (colin-king) wrote :
Download full text (3.9 KiB)

Most of the CPU being consumed is in strcmp()

- 99.39% 0.00% zfs [kernel.kallsyms] [k] zfsdev_ioctl ▒
   - 99.39% zfsdev_ioctl
      - 99.34% zfs_ioc_objset_stats
         - 99.22% zfs_ioc_objset_stats_impl.part.20
            - 99.01% dmu_objset_stats
               - 99.00% dsl_dataset_stats
                  - 98.98% get_clones_stat
                     - 62.55% fnvlist_add_nvlist
                          nvlist_add_nvlist
                          nvlist_add_common.part.51
                          nvlist_copy_embedded.isra.54
                        - nvlist_copy_pairs.isra.52
                           - 62.52% nvlist_add_common.part.51
                              - 31.25% nvlist_copy_embedded.isra.54
                                 - nvlist_copy_pairs.isra.52
                                    - 31.23% nvlist_add_common.part.51
                                       - 30.78% nvlist_remove_all.part.49
                                            strcmp
                              - 30.86% nvlist_remove_all.part.49
                                   strcmp
            ...

Read more...

Colin Ian King (colin-king) wrote :

After a load tests, I've found the same anomaly with btrfs. I'm using a 80GB btrfs raw device, 16GB memory, 16 CPU threads:

BTRFS:
Time
  24.026 32
  54.584 64
 125.520 128
 272.835 256
 809.038 512
3451.172 1024

So the pinch point happens later than with ZFS, but there seems to be a scaling issue here too.

Colin Ian King (colin-king) wrote :

I've now compared ZFS with BTRFS and the overall CPU load profile from lxd in both scenarios is surprisingly similar - lxd is eating a lot of CPU and I think there is an underlying issue there rather than a massively broken performance issue with ZFS.

See the attached data sheet; the CPU consumption graphs for lxd in both scenarios show that lxd's CPU consumption increases with the number of container instances and this seems to be causing throttling of the ZFS backing store. Why is lxd eating so much CPU?

Changed in zfs-linux (Ubuntu):
status: Incomplete → In Progress
Colin Ian King (colin-king) wrote :

Running health-check against lxd I see that there is a lot of contention on a futex and a load of context switches going on in lxd.

Colin Ian King (colin-king) wrote :
Download full text (7.3 KiB)

Looks like there is a lot of contention on a futex and the underlying database too. Perf record and perf report can show that most of the issues are more to do with lxd, database and futex lock contention. Once these are resolved, I'll be happy to re-analyze the entire ZFS + lxd stack, but I believe the main issue is lxd eating cycles and not ZFS per se.

- 27.21% 1.05% lxd [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath
   - 26.16% entry_SYSCALL_64_fastpath
      - 19.28% sys_futex
         - 18.91% do_futex
            - 12.70% futex_wake
               + 9.91% wake_up_q
               + 0.90% _raw_spin_lock
            + 5.00% futex_wait
      + 1.48% sys_select
      + 0.73% sys_newstat
      + 0.57% syscall_return_slowpath
      + 0.54% sys_pread64
- 25.16% 0.00% lxd libgolang-1.8-std.so.1 [.] runtime.goexit ▒
   - runtime.goexit
      - 14.98% main.(*operation).Run.func1
         - 13.92% main.createFromImage.func1
            - 13.86% main.containerCreateFromImage
               - 12.26% main.containerCreateInternal
                  - 12.24% main.containerLXCCreate
                     - 11.30% main.networkUpdateStatic
                        - 9.16% main.containerLoadByName ...

Read more...

Changed in lxd (Ubuntu):
importance: Undecided → High
Changed in zfs-linux (Ubuntu):
importance: Medium → Low
Changed in lxd (Ubuntu):
status: New → Confirmed
Changed in zfs-linux (Ubuntu):
status: In Progress → Confirmed
status: Confirmed → Triaged
Stéphane Graber (stgraber) wrote :

Creating 100 clones
Took: 4 seconds (25/s)
Creating 200 clones
Took: 13 seconds (15/s)
Creating 400 clones
Took: 46 seconds (8/s)
Creating 600 clones
Took: 156 seconds (3/s)

```
#!/bin/sh
zfs destroy -R castiana/testzfs
rm -Rf /tmp/testzfs

zfs create castiana/testzfs -o mountpoint=none
zfs snapshot castiana/testzfs@base
mkdir /tmp/testzfs

clone() {
    echo "Creating ${1} clones"
    BASE=$(date +%s)
    for i in $(seq ${1}); do
        UUID=$(uuidgen)
        zfs clone castiana/testzfs@base castiana/testzfs-$UUID -o mountpoint=/tmp/testzfs/${UUID}
    done
    END=$(date +%s)
    TOTAL=$((END-BASE))
    PER=$((${1}/${TOTAL}))
    echo "Took: $TOTAL seconds ($PER/s)"
}

clone 100
clone 200
clone 400
clone 600
```

That's on an up to date artful system. I'll do a similar test on an up to date xenial system.

Stéphane Graber (stgraber) wrote :

I'm seeing similar performance degradation on xenial except that the server I used seems to be pretty slow to start with too:

root@athos:~# sh test.sh
Creating 100 clones
Took: 11 seconds (9/s)
Creating 200 clones
Took: 46 seconds (4/s)
Creating 400 clones
Took: 297 seconds (1/s)

Stéphane Graber (stgraber) wrote :

That's pure ZFS completely outside of LXD.

Stéphane Graber (stgraber) wrote :

(but doing something pretty similar to what LXD does internally as far as clones and mountpoint handling)

Colin Ian King (colin-king) wrote :

So compound that with the huge amount of CPU suckage I'm seeing with lxd...

Colin Ian King (colin-king) wrote :
Download full text (5.7 KiB)

Running strace against zfs create I see the following ioctl() taking the time:

1500475028.118005 ioctl(3, _IOC(0, 0x5a, 0x12, 0x00), 0x7ffc7c2184f0) = -1 ENOMEM (Cannot allocate memory) <0.390093>
1500475028.508153 mmap(NULL, 290816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd487b000 <0.000017>
1500475028.508201 ioctl(3, _IOC(0, 0x5a, 0x12, 0x00), 0x7ffc7c2184f0) = 0 <0.382304>

I believe this and ioctl on /dev/zfs, namely ZFS_IOC_OBJSET_STATS which is getting stats on all the zfs file systems. This ioctl takes longer to do as the number of clones increases. I believe this is the API causing the bottleneck.

perf shows that over 99.9% of the zfs clone is indeed performing this ioctl:

- 99.39% 0.00% zfs [kernel.kallsyms] [k] sys_ioctl
     sys_ioctl
     do_vfs_ioctl
   - zfsdev_ioctl
      - 99.33% zfs_ioc_objset_stats
         - 99.30% zfs_ioc_objset_stats_impl.part.20
            - 99.21% dmu_objset_stats
               - dsl_dataset_stats
                  - 99.18% get_clones_stat
                     - 60.46% fnvlist_add_nvlist
                          nvlist_add_nvlist
                          nvlist_add_common.part.51
                          nvlist_copy_embedded.isra.54
                          nvlist_copy_pairs.isra.52
                        - nvlist_add_common.part.51
                           - 30.35% nvlist_copy_embedded.isra.54
                                nvlist_copy_pairs.isra.52
                              + nvlist_add_common.part.51
                           - 29.62% nvli...

Read more...

Colin Ian King (colin-king) wrote :

Based on measurements, we get a x^3 polynomial in clone time. Sub 2000 clones it's not too bad, but over 2000 clones the delay in the ioctl() ramps up pretty quickly. I've got a fairly solid set of stats and drew some estimates based on the trend line. See attached datasheet.

Changed in zfs:
status: Unknown → New
summary: - Performance degradation of "zfs clone" when under load
+ Performance degradation of "zfs clone"
Colin Ian King (colin-king) wrote :

I've bumped up the zcmd size to reduce the double ioctl() calls and we're seeing 25-40% improvement clones > 1000. This is an interim fix until we get a better fix from upstream. I'm going to SRU this once we're OK with it in Arful. Attached are the new benchmarks.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.6.5.11-1ubuntu2

---------------
zfs-linux (0.6.5.11-1ubuntu2) artful; urgency=medium

  * Improve cloning performance for large numbers of clones (LP: #1567557)
    Bump zcmd buffer from 16K to 256K.

 -- Colin Ian King <email address hidden> Tue, 8 Aug 2017 17:19:21 +0100

Changed in zfs-linux (Ubuntu Artful):
status: Triaged → Fix Released
Changed in zfs-linux (Ubuntu Xenial):
assignee: nobody → Colin Ian King (colin-king)
Changed in zfs-linux (Ubuntu Zesty):
assignee: nobody → Colin Ian King (colin-king)
status: New → In Progress
Changed in zfs-linux (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → Low
Changed in zfs-linux (Ubuntu Zesty):
importance: Undecided → Low
description: updated

Hello Stéphane, or anyone else affected,

Accepted zfs-linux into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/0.6.5.9-5ubuntu4.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in zfs-linux (Ubuntu Zesty):
status: In Progress → Fix Committed
Brian Murray (brian-murray) wrote :

Hello Stéphane, or anyone else affected,

Accepted zfs-linux into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/zfs-linux/0.6.5.6-0ubuntu18 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in zfs-linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Colin Ian King (colin-king) wrote :

Tested with zfs 0.6.5.6-0ubuntu18 on Xenial. Passes zfs smoke tests. The performance increase on Xenial is not so significant than later releases, but I'm seeing double digit % increase in performance with a test of 3000 clones.

tags: added: verification-done-xenial
no longer affects: lxd (Ubuntu)
no longer affects: lxd (Ubuntu Xenial)
no longer affects: lxd (Ubuntu Zesty)
no longer affects: lxd (Ubuntu Artful)
Brian Murray (brian-murray) wrote :

Has this also been verified with Ubuntu 17.04? While its unlikely people are upgrading from 16.04 to 17.04 it is an upgrade path and we should ensure this fix does not regress and release these updates together.

Colin Ian King (colin-king) wrote :

@Brian, Yes indeed I have performed exhaustive testing on this on Xenial, Zesty, Artful. A full exhaustive test takes 5 hours and I have performed the upgrade steps and checked this out.

Łukasz Zemczak (sil2100) wrote :

As per comment #24, considering the upload to be tested for zesty as well ^

tags: added: verification-done-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.6.5.9-5ubuntu4.2

---------------
zfs-linux (0.6.5.9-5ubuntu4.2) zesty; urgency=medium

  * Improve cloning performance for large numbers of clones (LP: #1567557)
    Bump zcmd buffer from 16K to 256K.

 -- Colin Ian King <email address hidden> Thu, 10 Aug 2017 13:48:11 +0100

Changed in zfs-linux (Ubuntu Zesty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for zfs-linux has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package zfs-linux - 0.6.5.6-0ubuntu18

---------------
zfs-linux (0.6.5.6-0ubuntu18) xenial; urgency=medium

  * Improve cloning performance for large numbers of clones (LP: #1567557)
    Bump zcmd buffer from 16K to 256K.
  * Correctly install libraries such as libnvpair.so.1 to /lib and not /usr/lib,
    this also fixes boot issues (LP: #1596835)

 -- Colin Ian King <email address hidden> Fri, 11 Aug 2017 08:33:23 +0100

Changed in zfs-linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.