Performance degradation when copying from LVM snapshot backed by NVMe disk

Bug #1833319 reported by Matthew Ruffell on 2019-06-18
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Xenial
Undecided
Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/1833319

[Impact]
When copying files from a mounted LVM snapshot which resides on NVMe storage devices, there is a massive performance degradation in the rate sectors are read from the disk.

The kernel is not merging sector requests and is instead issuing many small
sector requests to the NVMe storage controller instead of one larger request.

Experiments have shown a 14x-25x performance degradation in reads, where copies used to take seconds, now take minutes, and copies which took thirty minutes now take many hours.

The following was found with btrace, running alongside cat (see Testing):

A = IO remapped to different device
Q = IO handled by request queue
G = Get request
U = Unplug request
I = IO inserted onto request queue
D = IO issued to driver
C = IO completion

When reading from the LVM snapshot, we see:

259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824
259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat]
259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat]
259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832
259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat]
259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat]
259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat]
259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat]
259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat]
...
259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H]
259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H]
259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H]
...
259,0 1 183 0.001848512 0 C RS 837632 + 8 [0]

Now what is happening here, is that a request for 8 sector read is placed onto the IO request queue, and is then inserted one at a time to the driver request queue and then fetched by the driver.

Comparing this behaviour to reading data from a LVM snapshot on 4.6+ mainline or the Ubuntu 4.15 HWE kernel:

M = IO back merged with request on queue

259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912
259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat]
259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat]
259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920
259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat]
259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat]
259,0 0 200 0.000534474 1897 UT N [cat] 1
259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat]
259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat]
259,0 0 203 0.002242539 0 C R 7358464 + 512 [0]

This shows us a 8 sector read is added to the request queue, and is then
subsequently [M]erged backward with other requests on the queue until the sum of all of those merged requests becomes 512 sectors. From there, the 512 sector read is placed onto the IO queue, where it is fetched by the device driver, and completes.

[Fix]

The problem is that the NVMe driver on 4.4 xenial kernel is not merging 8 sector requests.

Merging is controlled per device by this sysfs entry:
/sys/block/nvme1n1/queue/nomerges

On 4.4 xenial, reading from this yields 2, or (QUEUE_FLAG_NOMERGES).
On 4.6+ and 4.15 HWE kernel, reading from this yields 0, or allowing merge.

Setting this to 0 on the 4.4 kernel with:

# echo "0" > /sys/block/nvme1n1/queue/nomerges

and testing again, we find performance is restored and the problem is fixed.

Performing a btrace, we see 8 sector reads get backmerged into a 512 sector read which is done in one go.

The problem was fixed in 4.5 upstream with the below commit:

commit ef2d4615c59efb312e531a5e949970f37ca1c841
Author: Keith Busch <email address hidden>
Date: Thu Feb 11 13:05:40 2016 -0700
Subject: NVMe: Allow request merges

This commit removes the QUEUE_FLAG_NOMERGES flag from being set during driver init, allowing requests to be backmerged. This also has a direct effect of defaulting /sys/block/nvme1n1/queue/nomerges to 0.

Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4
kernels.

[Testcase]

You can replicate the problem with a system with a NVMe disk. I recommend using c5.large AWS EC2 instances with a secondary gpt2 EBS disk of 200gb or larger.

Steps (with NVMe disk being /dev/nvme1n1):
  1. sudo pvcreate /dev/nvme1n1
  2. sudo vgcreate secvol /dev/nvme1n1
  3. sudo lvcreate --name seclv -l 80%FREE secvol
  4. sudo mkfs.ext4 /dev/secvol/seclv
  5. sudo mount /dev/mapper/secvol-seclv /mnt
  6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done
  7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE'
  8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX)
  9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT

To replicate, simply read one of those 512mb files:
  10. time cat $NEWMOUNT/dummy1 1> /dev/null

On a stock xenial kernel, expect to see the following:

4.4.0-151-generic #178-Ubuntu

$ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null

real 0m42.693s
user 0m0.008s
sys 0m0.388s
$ cat /sys/block/nvme1n1/queue/nomerges
2

On a patched xenial kernel, performance is restored:

4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu

$ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null

real 0m1.773s
user 0m0.008s
sys 0m0.184s
$ cat /sys/block/nvme1n1/queue/nomerges
0

[Regression Potential]

Cherry picking "NVMe: Allow request merges" changes the default request policy for NVMe drives, which may give some cause for concern in both terms of stability and performance for other workloads.

Regarding stability, this flag was originally set when the NVMe driver was
bio based, before the driver had been converted to blk-mq and separated out from /block. You can read a mailing list thread about it here:

https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html

Along with the commit "MD: make bio mergeable" there is no reason to not allow requests to be mergeable for the new NVMe driver.

Regarding performance for other workloads, I reference the commit which
QUEUE_FLAG_NOMERGES or nomerges == 2 was introduced:
commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c
subject: block: Added in stricter no merge semantics for block I/O

nomerges Throughput %System Improvement (tput / %sys)
-------- ------------ ----------- -------------------------
0 12.45 MB/sec 0.669365609
1 12.50 MB/sec 0.641519199 0.40% / 2.71%
2 12.52 MB/sec 0.639849750 0.56% / 2.96%

It shows a 0.56% performance increase for no merging / 2, over allowing
merging / 0 for random IO workloads.

Comparing this with the 14x-25x performance degradation for reads where requests are not able to be merged, it is clear that changing the default to 0 will not impact any other workloads by any significant margin.

The commit is also present in Linux 4.5 mainline, can be cleanly cherry picked and is still present in the kernel to this day, and after review of the NVMe driver, I believe there will be no regressions.

If you are interested in testing, I have prepared two ppas with
ef2d4615c59efb312e531a5e949970f37ca1c841 patched:

linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic
linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test

tags: added: sts
description: updated
description: updated
Changed in linux (Ubuntu Xenial):
assignee: nobody → Matthew Ruffell (mruffell)
status: New → In Progress

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1833319

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
description: updated
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Matthew Ruffell (mruffell) wrote :

I enabled enabled -proposed and installed 4.4.0-155.182, and went through the
test case on a c5.large instance on aws. Note, I used the -generic kernel since
-aws doesn't seem to be ready yet.

The problem is solved and performance is the same as non-snapshot mounted disks.

We can see that merging has been enabled by looking at the flag:

$ cat /sys/block/nvme1n1/queue/nomerges
0

The problem is fixed. Changing tag to verified.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Matthew Ruffell (mruffell) wrote :

I enabled -proposed and installed 4.4.0-1088-aws #99-Ubuntu, and again went through the test case on a c5.large instance on aws.

The problem is solved and performance is restored, and performs the same as a non-snapshot mounted disk.

Again, we can see merging has been enabled with:

$ cat /sys/block/nvme1n1/queue/nomerges
0

The problem is fixed. Happy with verification status of done.

Launchpad Janitor (janitor) wrote :
Download full text (30.5 KiB)

This bug was fixed in the package linux - 4.4.0-157.185

---------------
linux (4.4.0-157.185) xenial; urgency=medium

  * linux: 4.4.0-157.185 -proposed tracker (LP: #1837476)

  * systemd 229-4ubuntu21.22 ADT test failure with linux 4.4.0-156.183 (storage)
    (LP: #1837235)
    - Revert "block/bio: Do not zero user pages"
    - Revert "block: Clear kernel memory before copying to user"
    - Revert "bio_copy_from_iter(): get rid of copying iov_iter"

linux (4.4.0-156.183) xenial; urgency=medium

  * linux: 4.4.0-156.183 -proposed tracker (LP: #1836880)

  * BCM43602 802.11ac Wireless regression - PCI ID 14e4:43ba (LP: #1836801)
    - brcmfmac: add eth_type_trans back for PCIe full dongle

linux (4.4.0-155.182) xenial; urgency=medium

  * linux: 4.4.0-155.182 -proposed tracker (LP: #1834918)

  * Geneve tunnels don't work when ipv6 is disabled (LP: #1794232)
    - geneve: correctly handle ipv6.disable module parameter

  * Kernel modules generated incorrectly when system is localized to a non-
    English language (LP: #1828084)
    - scripts: override locale from environment when running recordmcount.pl

  * Handle overflow in proc_get_long of sysctl (LP: #1833935)
    - sysctl: handle overflow in proc_get_long

  * Xenial update: 4.4.181 upstream stable release (LP: #1832661)
    - x86/speculation/mds: Revert CPU buffer clear on double fault exit
    - x86/speculation/mds: Improve CPU buffer clear documentation
    - ARM: exynos: Fix a leaked reference by adding missing of_node_put
    - crypto: vmx - fix copy-paste error in CTR mode
    - crypto: crct10dif-generic - fix use via crypto_shash_digest()
    - crypto: x86/crct10dif-pcl - fix use via crypto_shash_digest()
    - ALSA: usb-audio: Fix a memory leak bug
    - ALSA: hda/hdmi - Consider eld_valid when reporting jack event
    - ALSA: hda/realtek - EAPD turn on later
    - ASoC: max98090: Fix restore of DAPM Muxes
    - ASoC: RT5677-SPI: Disable 16Bit SPI Transfers
    - mm/mincore.c: make mincore() more conservative
    - ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
    - mfd: da9063: Fix OTP control register names to match datasheets for
      DA9063/63L
    - tty/vt: fix write/write race in ioctl(KDSKBSENT) handler
    - ext4: actually request zeroing of inode table after grow
    - ext4: fix ext4_show_options for file systems w/o journal
    - Btrfs: do not start a transaction at iterate_extent_inodes()
    - bcache: fix a race between cache register and cacheset unregister
    - bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
    - ipmi:ssif: compare block number correctly for multi-part return messages
    - crypto: gcm - Fix error return code in crypto_gcm_create_common()
    - crypto: gcm - fix incompatibility between "gcm" and "gcm_base"
    - crypto: chacha20poly1305 - set cra_name correctly
    - crypto: salsa20 - don't access already-freed walk.iv
    - crypto: arm/aes-neonbs - don't access already-freed walk.iv
    - writeback: synchronize sync(2) against cgroup writeback membership switches
    - fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going
      into workqueue when umount
    - ALSA: hda/realtek - Fix for Lenovo B...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers