ZFS performance drops suddenly

Bug #1883676 reported by Andrey Gelman
This bug affects 1 person
Affects             Status     Importance  Assigned to      Milestone
zfs-linux (Ubuntu)  Won't Fix  High        Colin Ian King

Bug Description

Run Phoronix Flexible IO / Random write benchmark.
After running the benchmark repeatedly (sometimes it takes 30 minutes, sometimes 8 hours), performance drops suddenly. The only way I know of to get back to high performance is to reboot.
---
    $ for i in $(seq 1 1000); do printf "2\n4\n2\n1\n1\n1\nn\n" | phoronix-test-suite run fio | awk '/Average.*IOPS/ {print $2}' >> /tmp/iops; done
    $ less /tmp/iops
    ...
    2095
    2095
    2093
    783
    388
    389
    ...
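
A quick way to spot where the drop occurs in the log (the halving threshold is arbitrary):

    # print the first sample that falls below half of its predecessor
    awk 'NR > 1 && $1 < prev / 2 { print NR ": " prev " -> " $1; exit } { prev = $1 }' /tmp/iops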

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: zfsutils-linux 0.7.5-1ubuntu16.9
ProcVersionSignature: User Name 5.3.0-1023.25~18.04.1-aws 5.3.18
Uname: Linux 5.3.0-1023-aws x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.9-0ubuntu7.15
Architecture: amd64
Date: Tue Jun 16 08:55:01 2020
Ec2AMI: ami-06fd83ae47b05282f
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1f
Ec2InstanceType: c4.xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
SourcePackage: zfs-linux
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.sudoers.d.zfs: [deleted]

Revision history for this message
Andrey Gelman (andrey-gelman) wrote :
Revision history for this message
Balint Harmath (bharmath) wrote :

Reproducible.

Changed in zfs-linux (Ubuntu):
status: New → Confirmed
Changed in zfs-linux (Ubuntu):
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
Revision history for this message
Colin Ian King (colin-king) wrote :

Can you tell me the exact test configuration you were using?

Revision history for this message
Colin Ian King (colin-king) wrote :

Oh, no problem, I can see that in the bug report. Apologies.

Revision history for this message
Colin Ian King (colin-king) wrote :

It would be interesting to know the following:

1. How much memory does the machine have when booted?

run: free

2. What the arcstats look like when it slows down

use: cat /proc/spl/kstat/zfs/arcstats

thanks
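
For example, something like this, run once while it is still fast and again after the slowdown, would be helpful (the output paths are just suggestions):

ts=$(date +%s)
free > /tmp/free.$ts
cat /proc/spl/kstat/zfs/arcstats > /tmp/arcstats.$ts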

Changed in zfs-linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Andrey Gelman (andrey-gelman) wrote : Re: [Bug 1883676] Re: ZFS performance drops suddenly

1. free
              total        used        free      shared  buff/cache   available
Mem:        8025436     1255616     6457736       11780      312084     6512296
Swap:       1425404           0     1425404

2. cat /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 4085867100 20862566338548
name type data
hits 4 8977275
misses 4 20770
demand_data_hits 4 7154775
demand_data_misses 4 13760
demand_metadata_hits 4 1810210
demand_metadata_misses 4 4292
prefetch_data_hits 4 112
prefetch_data_misses 4 719
prefetch_metadata_hits 4 12178
prefetch_metadata_misses 4 1999
mru_hits 4 4041554
mru_ghost_hits 4 0
mfu_hits 4 4923504
mfu_ghost_hits 4 0
deleted 4 31
mutex_miss 4 0
access_skip 4 9
evict_skip 4 124
evict_not_enough 4 0
evict_l2_cached 4 0
evict_l2_eligible 4 396800
evict_l2_ineligible 4 8192
evict_l2_skip 4 0
hash_elements 4 20337
hash_elements_max 4 22033
hash_collisions 4 82157
hash_chains 4 196
hash_chain_max 4 2
p 4 2054511616
c 4 4109023232
c_min 4 256813952
c_max 4 4109023232
size 4 1856513648
compressed_size 4 1349638144
uncompressed_size 4 1738882048
overhead_size 4 446370304
hdr_size 4 7828496
data_size 4 1709473792
metadata_size 4 86534656
dbuf_size 4 9437376
dnode_size 4 23086368
bonus_size 4 20152960
anon_size 4 438411264
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 1186710528
mru_evictable_data 4 908280320
mru_evictable_metadata 4 4658688
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 170886656
mfu_evictable_data 4 134954496
mfu_evictable_metadata 4 3134976
mfu_ghost_size 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_r...


Revision history for this message
Colin Ian King (colin-king) wrote :

I've been examining the zfs_arc_max setting to see whether it can stabilize the I/O performance. You may like to try setting it as a proportion of your available memory to see if that improves things. For example, on a machine with 8GB of memory, I set it to 2GB and my test system went from reproducing your issue to one where I/O performance remained constant over hundreds of fio runs.

For example, to set a 2GB ARC max, use:

echo 2147483648 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

If it is set too low, cache hits suffer and I/O rates drop. If it is set too high, you hit the issue you are seeing because of cache contention between ZFS and the block layer. I suspect the only way to find the sweet spot for your use case is careful experimentation. Don't rely on synthetic tests to set this; real-world I/O patterns are best for this kind of tuning.
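
If a value works for you, it can be made persistent across reboots with a modprobe options file (the path below is the conventional Ubuntu location; 2GB is only my example value):

echo "options zfs zfs_arc_max=2147483648" | sudo tee /etc/modprobe.d/zfs.conf
sudo update-initramfs -u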

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi Andrey, does setting zfs_arc_max resolve your issue?

Revision history for this message
Colin Ian King (colin-king) wrote :

@Andrey, ZFS I/O is known to slow down as pools fill up; at around 80-85% capacity, writes can slow dramatically while free space is being found. Could this be the reason for the bottleneck?
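
You can check how full the pool is with, for example:

zpool list

and look at the CAP column.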

Revision history for this message
Colin Ian King (colin-king) wrote :

After a lot of testing on AWS with both the generic and AWS kernels, and on non-AWS bare-metal systems with the same kernels, I found that I/O throttling occurs on AWS instances whether or not ZFS is used.

I ran a really simple test writing to and reading from the raw device and found the *same* I/O throttling characteristics.
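
A minimal version of such a raw-device test might look like this (the device name /dev/xvdf and the sizes are placeholders, and writing to the raw device destroys any data on it):

# direct write straight to the block device, bypassing ZFS and the page cache
sudo dd if=/dev/zero of=/dev/xvdf bs=1M count=4096 oflag=direct
# direct read back
sudo dd if=/dev/xvdf of=/dev/null bs=1M count=4096 iflag=direct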

This is due to a feature known as "EBS I/O throttling", see:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-io-characteristics.html
and
https://aws.amazon.com/ebs/features/

This is not a ZFS issue per se, nor a kernel issue, but I/O throttling by the cloud provider. Marking this as Won't Fix.

Changed in zfs-linux (Ubuntu):
status: Incomplete → Won't Fix