Activity log for bug #1833319

Date Who What changed Old value New value Message
2019-06-18 23:30:15 Matthew Ruffell bug added bug
2019-06-18 23:30:36 Matthew Ruffell nominated for series Ubuntu Xenial
2019-06-18 23:30:36 Matthew Ruffell bug task added linux (Ubuntu Xenial)
2019-06-18 23:31:12 Matthew Ruffell tags sts
2019-06-18 23:37:42 Matthew Ruffell description BugLink: [Impact] When copying files from a mounted LVM snapshot which resides on NVMe storage devices, there is a massive performance degradation in the rate at which sectors are read from the disk. The kernel is not merging sector requests, and instead issues many small sector requests to the NVMe storage controller rather than one larger request. Experiments have shown a 14x-25x performance degradation in reads: copies that used to take seconds now take minutes, and copies which took thirty minutes now take many hours. [Fix] The following was found with btrace, running alongside cat (see Testing): Standard LVM copy: $ cat /mnt/dummy1 1> /dev/null LVM snapshot copy: $ cat /tmp/mount.backup_OXV/dummy2 1> /dev/null Tracing: # btrace /dev/nvme1n1 > trace.data Looking at the "control" case of copying from /mnt, which is the standard LVM volume, we see a trace like: 259,0 1 13 0.002545516 1579 A R 280576 + 512 <- (252,0) 278528 259,0 1 14 0.002545701 1579 Q R 280576 + 512 [cat] 259,0 1 15 0.002547020 1579 G R 280576 + 512 [cat] 259,0 1 16 0.002547631 1579 U N [cat] 1 259,0 1 17 0.002547775 1579 I RS 280576 + 512 [cat] 259,0 1 18 0.002551381 1579 D RS 280576 + 512 [cat] 259,0 1 19 0.004099666 0 C RS 280576 + 512 [0] A = IO remapped to different device Q = IO handled by request queue G = Get request U = Unplug request I = IO inserted onto request queue D = IO issued to driver C = IO completion Firstly, the request is remapped from a different device: from /mnt, which is dm-1, to the NVMe disk. A 512-sector read is placed on the IO request queue, inserted into the driver request queue, the driver is commanded to fetch the data, and then the request completes. Now, when reading from the LVM snapshot, we see: 259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824 259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat] 259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat] 259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832 259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat] 259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat] 259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat] 259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat] 259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat] ... 259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H] 259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H] 259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H] ... 259,0 1 183 0.001848512 0 C RS 837632 + 8 [0] What is happening here is that each request for an 8-sector read is placed onto the IO request queue, and is then inserted one at a time into the driver request queue and fetched by the driver.
Comparing this behaviour to reading data from an LVM snapshot on mainline 4.6+ or the Ubuntu 4.15 HWE kernel: M = IO back merged with request on queue 259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912 259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat] 259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat] 259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920 259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat] 259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat] 259,0 0 200 0.000534474 1897 UT N [cat] 1 259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat] 259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat] 259,0 0 203 0.002242539 0 C R 7358464 + 512 [0] This shows us an 8-sector read is added to the request queue, and is then [M]erged backward with other requests on the queue until the sum of all of those merged requests reaches 512 sectors. From there, the 512-sector read is placed onto the IO queue, where it is fetched by the device driver, and completes. The problem is that the 4.4 xenial kernel is not merging 8-sector requests. After digging in the git log between 4.4 and 4.6, this commit stood out: commit 9c573de3283af007ea11c17bde1e4568d9417328 Author: Shaohua Li <shli@fb.com> Date: Mon Apr 25 16:52:38 2016 -0700 Subject: MD: make bio mergeable You can read it here: https://github.com/torvalds/linux/commit/9c573de3283af007ea11c17bde1e4568d9417328 "blk_queue_split marks bio unmergeable, which makes sense for normal bio. But if dispatching the bio to underlayer disk, the blk_queue_split checks are invalid, hence it's possible the bio becomes mergeable." The snapshot is dm-3, and it has two underlayer disks, dm-1 and nvme1n1, which means we qualify for merging. Looking at the xenial 4.4 kernel tree, this commit is actually already applied, since it was backported to the mainline 4.4 kernel: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1141165.html So why is xenial affected? Looking at the bugzilla page for that commit: https://bugzilla.kernel.org/show_bug.cgi?id=117051 We see that merging is controlled by a sysfs entry, /sys/block/nvme1n1/queue/nomerges On 4.4 xenial, reading from this yields 2 (QUEUE_FLAG_NOMERGES). On the 4.6+ and 4.15 HWE kernels, reading from this yields 0, allowing merges. Setting this to 0 on the 4.4 kernel with: # echo "0" > /sys/block/nvme1n1/queue/nomerges and testing again, we find performance is restored and the problem is fixed. Looking at the trace with btrace, we see that it performs 8-sector reads, which get backmerged into a 512-sector read which is done in one go. Searching the kernel tree with cscope for QUEUE_FLAG_NOMERGES, we come across commit ef2d4615c59efb312e531a5e949970f37ca1c841 Author: Keith Busch <keith.busch@intel.com> Date: Thu Feb 11 13:05:40 2016 -0700 Subject: NVMe: Allow request merges This commit stops the QUEUE_FLAG_NOMERGES flag from being set during driver init, allowing requests to be backmerged. This also has the direct effect of defaulting /sys/block/nvme1n1/queue/nomerges to 0. Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4 kernels.
[Testcase] You can replicate the problem on a system with an NVMe disk. I recommend using c5.large AWS EC2 instances with a secondary gp2 EBS disk of 200GB or larger. Steps (with the NVMe disk being /dev/nvme1n1): 1. sudo pvcreate /dev/nvme1n1 2. sudo vgcreate secvol /dev/nvme1n1 3. sudo lvcreate --name seclv -l 80%FREE secvol 4. sudo mkfs.ext4 /dev/secvol/seclv 5. sudo mount /dev/mapper/secvol-seclv /mnt 6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done 7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE' 8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX) 9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT To replicate, simply read one of those 512MB files: 10. time cat $NEWMOUNT/dummy1 1> /dev/null On a stock xenial kernel, expect to see the following: 4.4.0-151-generic #178-Ubuntu $ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null real 0m42.693s user 0m0.008s sys 0m0.388s $ cat /sys/block/nvme1n1/queue/nomerges 2 On a patched xenial kernel, performance is restored: 4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu $ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null real 0m1.773s user 0m0.008s sys 0m0.184s $ cat /sys/block/nvme1n1/queue/nomerges 0 [Regression Potential] Cherry-picking "NVMe: Allow request merges" changes the default request policy for NVMe drives, which may give some cause for concern in terms of both stability and performance for other workloads. Regarding stability, this flag was originally set when the NVMe driver was bio based, before the driver had been converted to blk-mq and separated out from /block. You can read a mailing list thread about it here: https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html Along with the commit "MD: make bio mergeable", there is no reason not to allow requests to be mergeable for the new NVMe driver. Regarding performance for other workloads, I reference the commit in which QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced: commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c subject: block: Added in stricter no merge semantics for block I/O nomerges Throughput %System Improvement (tput / %sys) -------- ------------ ----------- ------------------------- 0 12.45 MB/sec 0.669365609 1 12.50 MB/sec 0.641519199 0.40% / 2.71% 2 12.52 MB/sec 0.639849750 0.56% / 2.96% It shows a 0.56% performance increase for no merging (2) over allowing merging (0) for random IO workloads. Comparing this with the 14x-25x performance degradation for reads where requests are not able to be merged, it is clear that changing the default to 0 will not impact any other workloads by any significant margin. The commit is also present in Linux 4.5 mainline, can be cleanly cherry-picked, and is still present in the kernel to this day; after review of the NVMe driver, I believe there will be no regressions. If you are interested in testing, I have prepared two PPAs with ef2d4615c59efb312e531a5e949970f37ca1c841 patched: linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test
BugLink: https://bugs.launchpad.net/bugs/1833319 [Impact] When copying files from a mounted LVM snapshot which resides on NVMe storage devices, there is a massive performance degradation in the rate at which sectors are read from the disk. The kernel is not merging sector requests, and instead issues many small sector requests to the NVMe storage controller rather than one larger request. Experiments have shown a 14x-25x performance degradation in reads: copies that used to take seconds now take minutes, and copies which took thirty minutes now take many hours. [Fix] The following was found with btrace, running alongside cat (see Testing): Standard LVM copy: $ cat /mnt/dummy1 1> /dev/null LVM snapshot copy: $ cat /tmp/mount.backup_OXV/dummy2 1> /dev/null Tracing: # btrace /dev/nvme1n1 > trace.data Looking at the "control" case of copying from /mnt, which is the standard LVM volume, we see a trace like: 259,0 1 13 0.002545516 1579 A R 280576 + 512 <- (252,0) 278528 259,0 1 14 0.002545701 1579 Q R 280576 + 512 [cat] 259,0 1 15 0.002547020 1579 G R 280576 + 512 [cat] 259,0 1 16 0.002547631 1579 U N [cat] 1 259,0 1 17 0.002547775 1579 I RS 280576 + 512 [cat] 259,0 1 18 0.002551381 1579 D RS 280576 + 512 [cat] 259,0 1 19 0.004099666 0 C RS 280576 + 512 [0] A = IO remapped to different device Q = IO handled by request queue G = Get request U = Unplug request I = IO inserted onto request queue D = IO issued to driver C = IO completion Firstly, the request is remapped from a different device: from /mnt, which is dm-1, to the NVMe disk. A 512-sector read is placed on the IO request queue, inserted into the driver request queue, the driver is commanded to fetch the data, and then the request completes. Now, when reading from the LVM snapshot, we see: 259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824 259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat] 259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat] 259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832 259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat] 259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat] 259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat] 259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat] 259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat] ... 259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H] 259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H] 259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H] ... 259,0 1 183 0.001848512 0 C RS 837632 + 8 [0] What is happening here is that each request for an 8-sector read is placed onto the IO request queue, and is then inserted one at a time into the driver request queue and fetched by the driver.
Comparing this behaviour to reading data from an LVM snapshot on mainline 4.6+ or the Ubuntu 4.15 HWE kernel: M = IO back merged with request on queue 259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912 259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat] 259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat] 259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920 259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat] 259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat] 259,0 0 200 0.000534474 1897 UT N [cat] 1 259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat] 259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat] 259,0 0 203 0.002242539 0 C R 7358464 + 512 [0] This shows us an 8-sector read is added to the request queue, and is then [M]erged backward with other requests on the queue until the sum of all of those merged requests reaches 512 sectors. From there, the 512-sector read is placed onto the IO queue, where it is fetched by the device driver, and completes. The problem is that the 4.4 xenial kernel is not merging 8-sector requests. I came across this bugzilla entry: https://bugzilla.kernel.org/show_bug.cgi?id=117051 There we see that merging is controlled by a sysfs entry, /sys/block/nvme1n1/queue/nomerges On 4.4 xenial, reading from this yields 2 (QUEUE_FLAG_NOMERGES). On the 4.6+ and 4.15 HWE kernels, reading from this yields 0, allowing merges. Setting this to 0 on the 4.4 kernel with: # echo "0" > /sys/block/nvme1n1/queue/nomerges and testing again, we find performance is restored and the problem is fixed. Looking at the trace with btrace, we see that it performs 8-sector reads, which get backmerged into a 512-sector read which is done in one go. Searching the kernel tree with cscope for QUEUE_FLAG_NOMERGES, we come across commit ef2d4615c59efb312e531a5e949970f37ca1c841 Author: Keith Busch <keith.busch@intel.com> Date: Thu Feb 11 13:05:40 2016 -0700 Subject: NVMe: Allow request merges This commit stops the QUEUE_FLAG_NOMERGES flag from being set during driver init, allowing requests to be backmerged. This also has the direct effect of defaulting /sys/block/nvme1n1/queue/nomerges to 0. Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4 kernels.
[Testcase] You can replicate the problem on a system with an NVMe disk. I recommend using c5.large AWS EC2 instances with a secondary gp2 EBS disk of 200GB or larger. Steps (with the NVMe disk being /dev/nvme1n1): 1. sudo pvcreate /dev/nvme1n1 2. sudo vgcreate secvol /dev/nvme1n1 3. sudo lvcreate --name seclv -l 80%FREE secvol 4. sudo mkfs.ext4 /dev/secvol/seclv 5. sudo mount /dev/mapper/secvol-seclv /mnt 6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done 7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE' 8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX) 9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT To replicate, simply read one of those 512MB files: 10. time cat $NEWMOUNT/dummy1 1> /dev/null On a stock xenial kernel, expect to see the following: 4.4.0-151-generic #178-Ubuntu $ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null real 0m42.693s user 0m0.008s sys 0m0.388s $ cat /sys/block/nvme1n1/queue/nomerges 2 On a patched xenial kernel, performance is restored: 4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu $ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null real 0m1.773s user 0m0.008s sys 0m0.184s $ cat /sys/block/nvme1n1/queue/nomerges 0 [Regression Potential] Cherry-picking "NVMe: Allow request merges" changes the default request policy for NVMe drives, which may give some cause for concern in terms of both stability and performance for other workloads. Regarding stability, this flag was originally set when the NVMe driver was bio based, before the driver had been converted to blk-mq and separated out from /block. You can read a mailing list thread about it here: https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html Along with the commit "MD: make bio mergeable", there is no reason not to allow requests to be mergeable for the new NVMe driver. Regarding performance for other workloads, I reference the commit in which QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced: commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c subject: block: Added in stricter no merge semantics for block I/O nomerges Throughput %System Improvement (tput / %sys) -------- ------------ ----------- ------------------------- 0 12.45 MB/sec 0.669365609 1 12.50 MB/sec 0.641519199 0.40% / 2.71% 2 12.52 MB/sec 0.639849750 0.56% / 2.96% It shows a 0.56% performance increase for no merging (2) over allowing merging (0) for random IO workloads. Comparing this with the 14x-25x performance degradation for reads where requests are not able to be merged, it is clear that changing the default to 0 will not impact any other workloads by any significant margin.
The commit is also present in Linux 4.5 mainline, can be cleanly cherry-picked, and is still present in the kernel to this day; after review of the NVMe driver, I believe there will be no regressions. If you are interested in testing, I have prepared two PPAs with ef2d4615c59efb312e531a5e949970f37ca1c841 patched: linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test
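The workaround described in the entry above comes down to a single sysfs write. Below is a minimal sketch of checking and clearing the per-queue merge policy, assuming the affected disk is nvme1n1 as in the report; the 0/1/2 meanings follow the kernel's queue sysfs documentation.

DEV=nvme1n1
SYSFS=/sys/block/$DEV/queue/nomerges

# 0 = allow merging (the post-fix default), 1 = only simple one-shot
# merges with the previous request, 2 = QUEUE_FLAG_NOMERGES, i.e. no
# merging attempted (the default on the unpatched 4.4 xenial kernel).
echo "current nomerges policy: $(cat $SYSFS)"

# Re-enable merging at runtime. This restores snapshot read throughput
# on an unpatched kernel, but does not persist across reboots, which is
# why the cherry-pick is the proper fix.
echo 0 | sudo tee $SYSFS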
2019-06-18 23:38:58 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1833319 [Impact] When copying files from a mounted LVM snapshot which resides on NVMe storage devices, there is a massive performance degradation in the rate at which sectors are read from the disk. The kernel is not merging sector requests, and instead issues many small sector requests to the NVMe storage controller rather than one larger request. Experiments have shown a 14x-25x performance degradation in reads: copies that used to take seconds now take minutes, and copies which took thirty minutes now take many hours. [Fix] The following was found with btrace, running alongside cat (see Testing): Standard LVM copy: $ cat /mnt/dummy1 1> /dev/null LVM snapshot copy: $ cat /tmp/mount.backup_OXV/dummy2 1> /dev/null Tracing: # btrace /dev/nvme1n1 > trace.data Looking at the "control" case of copying from /mnt, which is the standard LVM volume, we see a trace like: 259,0 1 13 0.002545516 1579 A R 280576 + 512 <- (252,0) 278528 259,0 1 14 0.002545701 1579 Q R 280576 + 512 [cat] 259,0 1 15 0.002547020 1579 G R 280576 + 512 [cat] 259,0 1 16 0.002547631 1579 U N [cat] 1 259,0 1 17 0.002547775 1579 I RS 280576 + 512 [cat] 259,0 1 18 0.002551381 1579 D RS 280576 + 512 [cat] 259,0 1 19 0.004099666 0 C RS 280576 + 512 [0] A = IO remapped to different device Q = IO handled by request queue G = Get request U = Unplug request I = IO inserted onto request queue D = IO issued to driver C = IO completion Firstly, the request is remapped from a different device: from /mnt, which is dm-1, to the NVMe disk. A 512-sector read is placed on the IO request queue, inserted into the driver request queue, the driver is commanded to fetch the data, and then the request completes. Now, when reading from the LVM snapshot, we see: 259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824 259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat] 259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat] 259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832 259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat] 259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat] 259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat] 259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat] 259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat] ... 259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H] 259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H] 259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H] ... 259,0 1 183 0.001848512 0 C RS 837632 + 8 [0] What is happening here is that each request for an 8-sector read is placed onto the IO request queue, and is then inserted one at a time into the driver request queue and fetched by the driver.
Comparing this behaviour to reading data from an LVM snapshot on mainline 4.6+ or the Ubuntu 4.15 HWE kernel: M = IO back merged with request on queue 259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912 259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat] 259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat] 259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920 259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat] 259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat] 259,0 0 200 0.000534474 1897 UT N [cat] 1 259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat] 259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat] 259,0 0 203 0.002242539 0 C R 7358464 + 512 [0] This shows us an 8-sector read is added to the request queue, and is then [M]erged backward with other requests on the queue until the sum of all of those merged requests reaches 512 sectors. From there, the 512-sector read is placed onto the IO queue, where it is fetched by the device driver, and completes. The problem is that the 4.4 xenial kernel is not merging 8-sector requests. I came across this bugzilla entry: https://bugzilla.kernel.org/show_bug.cgi?id=117051 There we see that merging is controlled by a sysfs entry, /sys/block/nvme1n1/queue/nomerges On 4.4 xenial, reading from this yields 2 (QUEUE_FLAG_NOMERGES). On the 4.6+ and 4.15 HWE kernels, reading from this yields 0, allowing merges. Setting this to 0 on the 4.4 kernel with: # echo "0" > /sys/block/nvme1n1/queue/nomerges and testing again, we find performance is restored and the problem is fixed. Looking at the trace with btrace, we see that it performs 8-sector reads, which get backmerged into a 512-sector read which is done in one go. Searching the kernel tree with cscope for QUEUE_FLAG_NOMERGES, we come across commit ef2d4615c59efb312e531a5e949970f37ca1c841 Author: Keith Busch <keith.busch@intel.com> Date: Thu Feb 11 13:05:40 2016 -0700 Subject: NVMe: Allow request merges This commit stops the QUEUE_FLAG_NOMERGES flag from being set during driver init, allowing requests to be backmerged. This also has the direct effect of defaulting /sys/block/nvme1n1/queue/nomerges to 0. Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4 kernels.
[Testcase] You can replicate the problem on a system with an NVMe disk. I recommend using c5.large AWS EC2 instances with a secondary gp2 EBS disk of 200GB or larger. Steps (with the NVMe disk being /dev/nvme1n1): 1. sudo pvcreate /dev/nvme1n1 2. sudo vgcreate secvol /dev/nvme1n1 3. sudo lvcreate --name seclv -l 80%FREE secvol 4. sudo mkfs.ext4 /dev/secvol/seclv 5. sudo mount /dev/mapper/secvol-seclv /mnt 6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done 7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE' 8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX) 9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT To replicate, simply read one of those 512MB files: 10. time cat $NEWMOUNT/dummy1 1> /dev/null On a stock xenial kernel, expect to see the following: 4.4.0-151-generic #178-Ubuntu $ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null real 0m42.693s user 0m0.008s sys 0m0.388s $ cat /sys/block/nvme1n1/queue/nomerges 2 On a patched xenial kernel, performance is restored: 4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu $ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null real 0m1.773s user 0m0.008s sys 0m0.184s $ cat /sys/block/nvme1n1/queue/nomerges 0 [Regression Potential] Cherry-picking "NVMe: Allow request merges" changes the default request policy for NVMe drives, which may give some cause for concern in terms of both stability and performance for other workloads. Regarding stability, this flag was originally set when the NVMe driver was bio based, before the driver had been converted to blk-mq and separated out from /block. You can read a mailing list thread about it here: https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html Along with the commit "MD: make bio mergeable", there is no reason not to allow requests to be mergeable for the new NVMe driver. Regarding performance for other workloads, I reference the commit in which QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced: commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c subject: block: Added in stricter no merge semantics for block I/O nomerges Throughput %System Improvement (tput / %sys) -------- ------------ ----------- ------------------------- 0 12.45 MB/sec 0.669365609 1 12.50 MB/sec 0.641519199 0.40% / 2.71% 2 12.52 MB/sec 0.639849750 0.56% / 2.96% It shows a 0.56% performance increase for no merging (2) over allowing merging (0) for random IO workloads. Comparing this with the 14x-25x performance degradation for reads where requests are not able to be merged, it is clear that changing the default to 0 will not impact any other workloads by any significant margin. The commit is also present in Linux 4.5 mainline, can be cleanly cherry-picked, and is still present in the kernel to this day; after review of the NVMe driver, I believe there will be no regressions. If you are interested in testing, I have prepared two PPAs with ef2d4615c59efb312e531a5e949970f37ca1c841 patched: linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test
BugLink: https://bugs.launchpad.net/bugs/1833319 [Impact] When copying files from a mounted LVM snapshot which resides on NVMe storage devices, there is a massive performance degradation in the rate at which sectors are read from the disk. The kernel is not merging sector requests, and instead issues many small sector requests to the NVMe storage controller rather than one larger request. Experiments have shown a 14x-25x performance degradation in reads: copies that used to take seconds now take minutes, and copies which took thirty minutes now take many hours. [Fix] The following was found with btrace, running alongside cat (see Testing): Standard LVM copy: $ cat /mnt/dummy1 1> /dev/null LVM snapshot copy: $ cat /tmp/mount.backup_OXV/dummy2 1> /dev/null Tracing: # btrace /dev/nvme1n1 > trace.data Looking at the "control" case of copying from /mnt, which is the standard LVM volume, we see a trace like: 259,0 1 13 0.002545516 1579 A R 280576 + 512 <- (252,0) 278528 259,0 1 14 0.002545701 1579 Q R 280576 + 512 [cat] 259,0 1 15 0.002547020 1579 G R 280576 + 512 [cat] 259,0 1 16 0.002547631 1579 U N [cat] 1 259,0 1 17 0.002547775 1579 I RS 280576 + 512 [cat] 259,0 1 18 0.002551381 1579 D RS 280576 + 512 [cat] 259,0 1 19 0.004099666 0 C RS 280576 + 512 [0] A = IO remapped to different device Q = IO handled by request queue G = Get request U = Unplug request I = IO inserted onto request queue D = IO issued to driver C = IO completion Firstly, the request is remapped from a different device: from /mnt, which is dm-1, to the NVMe disk. A 512-sector read is placed on the IO request queue, inserted into the driver request queue, the driver is commanded to fetch the data, and then the request completes. Now, when reading from the LVM snapshot, we see: 259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824 259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat] 259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat] 259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832 259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat] 259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat] 259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat] 259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat] 259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat] ... 259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H] 259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H] 259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H] ... 259,0 1 183 0.001848512 0 C RS 837632 + 8 [0] What is happening here is that each request for an 8-sector read is placed onto the IO request queue, and is then inserted one at a time into the driver request queue and fetched by the driver.
Comparing this behaviour to reading data from an LVM snapshot on mainline 4.6+ or the Ubuntu 4.15 HWE kernel: M = IO back merged with request on queue 259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912 259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat] 259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat] 259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920 259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat] 259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat] 259,0 0 200 0.000534474 1897 UT N [cat] 1 259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat] 259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat] 259,0 0 203 0.002242539 0 C R 7358464 + 512 [0] This shows us an 8-sector read is added to the request queue, and is then [M]erged backward with other requests on the queue until the sum of all of those merged requests reaches 512 sectors. From there, the 512-sector read is placed onto the IO queue, where it is fetched by the device driver, and completes. The problem is that the 4.4 xenial kernel is not merging 8-sector requests. I came across this bugzilla entry: https://bugzilla.kernel.org/show_bug.cgi?id=117051 There we see that merging is controlled by a sysfs entry, /sys/block/nvme1n1/queue/nomerges On 4.4 xenial, reading from this yields 2 (QUEUE_FLAG_NOMERGES). On the 4.6+ and 4.15 HWE kernels, reading from this yields 0, allowing merges. Setting this to 0 on the 4.4 kernel with: # echo "0" > /sys/block/nvme1n1/queue/nomerges and testing again, we find performance is restored and the problem is fixed. Looking at the trace with btrace, we see that it performs 8-sector reads, which get backmerged into a 512-sector read which is done in one go. Searching the kernel tree with cscope for QUEUE_FLAG_NOMERGES, we come across commit ef2d4615c59efb312e531a5e949970f37ca1c841 Author: Keith Busch <keith.busch@intel.com> Date: Thu Feb 11 13:05:40 2016 -0700 Subject: NVMe: Allow request merges This commit stops the QUEUE_FLAG_NOMERGES flag from being set during driver init, allowing requests to be backmerged. This also has the direct effect of defaulting /sys/block/nvme1n1/queue/nomerges to 0. Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4 kernels.
[Testcase] You can replicate the problem on a system with an NVMe disk. I recommend using c5.large AWS EC2 instances with a secondary gp2 EBS disk of 200GB or larger. Steps (with the NVMe disk being /dev/nvme1n1): 1. sudo pvcreate /dev/nvme1n1 2. sudo vgcreate secvol /dev/nvme1n1 3. sudo lvcreate --name seclv -l 80%FREE secvol 4. sudo mkfs.ext4 /dev/secvol/seclv 5. sudo mount /dev/mapper/secvol-seclv /mnt 6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done 7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE' 8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX) 9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT To replicate, simply read one of those 512MB files: 10. time cat $NEWMOUNT/dummy1 1> /dev/null On a stock xenial kernel, expect to see the following: 4.4.0-151-generic #178-Ubuntu $ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null real 0m42.693s user 0m0.008s sys 0m0.388s $ cat /sys/block/nvme1n1/queue/nomerges 2 On a patched xenial kernel, performance is restored: 4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu $ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null real 0m1.773s user 0m0.008s sys 0m0.184s $ cat /sys/block/nvme1n1/queue/nomerges 0 [Regression Potential] Cherry-picking "NVMe: Allow request merges" changes the default request policy for NVMe drives, which may give some cause for concern in terms of both stability and performance for other workloads. Regarding stability, this flag was originally set when the NVMe driver was bio based, before the driver had been converted to blk-mq and separated out from /block. You can read a mailing list thread about it here: https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html Along with the commit "MD: make bio mergeable", there is no reason not to allow requests to be mergeable for the new NVMe driver. Regarding performance for other workloads, I reference the commit in which QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced: commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c subject: block: Added in stricter no merge semantics for block I/O nomerges Throughput %System Improvement (tput / %sys) -------- ------------ ----------- ------------------------- 0 12.45 MB/sec 0.669365609 1 12.50 MB/sec 0.641519199 0.40% / 2.71% 2 12.52 MB/sec 0.639849750 0.56% / 2.96% It shows a 0.56% performance increase for no merging (2) over allowing merging (0) for random IO workloads. Comparing this with the 14x-25x performance degradation for reads where requests are not able to be merged, it is clear that changing the default to 0 will not impact any other workloads by any significant margin.
The commit is also present in Linux 4.5 mainline, can be cleanly cherry-picked, and is still present in the kernel to this day; after review of the NVMe driver, I believe there will be no regressions. If you are interested in testing, I have prepared two PPAs with ef2d4615c59efb312e531a5e949970f37ca1c841 patched: linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test
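For convenience, the ten testcase steps recorded above can be strung together into one script. This is a sketch under the same assumptions as the report: a scratch NVMe disk at /dev/nvme1n1 whose contents may be destroyed.

#!/bin/bash
# Sketch of the bug's reproduction steps; wipes /dev/nvme1n1.
set -e
sudo pvcreate /dev/nvme1n1
sudo vgcreate secvol /dev/nvme1n1
sudo lvcreate --name seclv -l 80%FREE secvol
sudo mkfs.ext4 /dev/secvol/seclv
sudo mount /dev/mapper/secvol-seclv /mnt
# Write twenty 512MB files to read back later.
for i in $(seq 1 20); do
    sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1
done
sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE'
NEWMOUNT=$(mktemp -t -d mount.backup_XXX)
sudo mount -v -o ro /dev/secvol/tmp_backup1 "$NEWMOUNT"
# Reading through the snapshot is the slow path on an affected kernel:
# expect tens of seconds here instead of a couple of seconds.
time cat "$NEWMOUNT/dummy1" > /dev/null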
2019-06-18 23:39:24 Matthew Ruffell linux (Ubuntu Xenial): assignee Matthew Ruffell (mruffell)
2019-06-18 23:39:39 Matthew Ruffell linux (Ubuntu Xenial): status New In Progress
2019-06-19 00:00:06 Ubuntu Kernel Bot linux (Ubuntu): status New Incomplete
2019-06-19 00:00:09 Ubuntu Kernel Bot tags sts sts xenial
2019-06-19 00:15:11 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1833319 [Impact] When copying files from a mounted LVM snapshot which resides on NVMe storage devices, there is a massive performance degradation in the rate at which sectors are read from the disk. The kernel is not merging sector requests, and instead issues many small sector requests to the NVMe storage controller rather than one larger request. Experiments have shown a 14x-25x performance degradation in reads: copies that used to take seconds now take minutes, and copies which took thirty minutes now take many hours. [Fix] The following was found with btrace, running alongside cat (see Testing): Standard LVM copy: $ cat /mnt/dummy1 1> /dev/null LVM snapshot copy: $ cat /tmp/mount.backup_OXV/dummy2 1> /dev/null Tracing: # btrace /dev/nvme1n1 > trace.data Looking at the "control" case of copying from /mnt, which is the standard LVM volume, we see a trace like: 259,0 1 13 0.002545516 1579 A R 280576 + 512 <- (252,0) 278528 259,0 1 14 0.002545701 1579 Q R 280576 + 512 [cat] 259,0 1 15 0.002547020 1579 G R 280576 + 512 [cat] 259,0 1 16 0.002547631 1579 U N [cat] 1 259,0 1 17 0.002547775 1579 I RS 280576 + 512 [cat] 259,0 1 18 0.002551381 1579 D RS 280576 + 512 [cat] 259,0 1 19 0.004099666 0 C RS 280576 + 512 [0] A = IO remapped to different device Q = IO handled by request queue G = Get request U = Unplug request I = IO inserted onto request queue D = IO issued to driver C = IO completion Firstly, the request is remapped from a different device: from /mnt, which is dm-1, to the NVMe disk. A 512-sector read is placed on the IO request queue, inserted into the driver request queue, the driver is commanded to fetch the data, and then the request completes. Now, when reading from the LVM snapshot, we see: 259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824 259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat] 259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat] 259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832 259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat] 259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat] 259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat] 259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat] 259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat] ... 259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H] 259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H] 259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H] ... 259,0 1 183 0.001848512 0 C RS 837632 + 8 [0] What is happening here is that each request for an 8-sector read is placed onto the IO request queue, and is then inserted one at a time into the driver request queue and fetched by the driver.
Comparing this behaviour to reading data from an LVM snapshot on mainline 4.6+ or the Ubuntu 4.15 HWE kernel: M = IO back merged with request on queue 259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912 259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat] 259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat] 259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920 259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat] 259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat] 259,0 0 200 0.000534474 1897 UT N [cat] 1 259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat] 259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat] 259,0 0 203 0.002242539 0 C R 7358464 + 512 [0] This shows us an 8-sector read is added to the request queue, and is then [M]erged backward with other requests on the queue until the sum of all of those merged requests reaches 512 sectors. From there, the 512-sector read is placed onto the IO queue, where it is fetched by the device driver, and completes. The problem is that the 4.4 xenial kernel is not merging 8-sector requests. I came across this bugzilla entry: https://bugzilla.kernel.org/show_bug.cgi?id=117051 There we see that merging is controlled by a sysfs entry, /sys/block/nvme1n1/queue/nomerges On 4.4 xenial, reading from this yields 2 (QUEUE_FLAG_NOMERGES). On the 4.6+ and 4.15 HWE kernels, reading from this yields 0, allowing merges. Setting this to 0 on the 4.4 kernel with: # echo "0" > /sys/block/nvme1n1/queue/nomerges and testing again, we find performance is restored and the problem is fixed. Looking at the trace with btrace, we see that it performs 8-sector reads, which get backmerged into a 512-sector read which is done in one go. Searching the kernel tree with cscope for QUEUE_FLAG_NOMERGES, we come across commit ef2d4615c59efb312e531a5e949970f37ca1c841 Author: Keith Busch <keith.busch@intel.com> Date: Thu Feb 11 13:05:40 2016 -0700 Subject: NVMe: Allow request merges This commit stops the QUEUE_FLAG_NOMERGES flag from being set during driver init, allowing requests to be backmerged. This also has the direct effect of defaulting /sys/block/nvme1n1/queue/nomerges to 0. Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4 kernels.
[Testcase] You can replicate the problem on a system with an NVMe disk. I recommend using c5.large AWS EC2 instances with a secondary gp2 EBS disk of 200GB or larger. Steps (with the NVMe disk being /dev/nvme1n1): 1. sudo pvcreate /dev/nvme1n1 2. sudo vgcreate secvol /dev/nvme1n1 3. sudo lvcreate --name seclv -l 80%FREE secvol 4. sudo mkfs.ext4 /dev/secvol/seclv 5. sudo mount /dev/mapper/secvol-seclv /mnt 6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done 7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE' 8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX) 9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT To replicate, simply read one of those 512MB files: 10. time cat $NEWMOUNT/dummy1 1> /dev/null On a stock xenial kernel, expect to see the following: 4.4.0-151-generic #178-Ubuntu $ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null real 0m42.693s user 0m0.008s sys 0m0.388s $ cat /sys/block/nvme1n1/queue/nomerges 2 On a patched xenial kernel, performance is restored: 4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu $ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null real 0m1.773s user 0m0.008s sys 0m0.184s $ cat /sys/block/nvme1n1/queue/nomerges 0 [Regression Potential] Cherry-picking "NVMe: Allow request merges" changes the default request policy for NVMe drives, which may give some cause for concern in terms of both stability and performance for other workloads. Regarding stability, this flag was originally set when the NVMe driver was bio based, before the driver had been converted to blk-mq and separated out from /block. You can read a mailing list thread about it here: https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html Along with the commit "MD: make bio mergeable", there is no reason not to allow requests to be mergeable for the new NVMe driver. Regarding performance for other workloads, I reference the commit in which QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced: commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c subject: block: Added in stricter no merge semantics for block I/O nomerges Throughput %System Improvement (tput / %sys) -------- ------------ ----------- ------------------------- 0 12.45 MB/sec 0.669365609 1 12.50 MB/sec 0.641519199 0.40% / 2.71% 2 12.52 MB/sec 0.639849750 0.56% / 2.96% It shows a 0.56% performance increase for no merging (2) over allowing merging (0) for random IO workloads. Comparing this with the 14x-25x performance degradation for reads where requests are not able to be merged, it is clear that changing the default to 0 will not impact any other workloads by any significant margin. The commit is also present in Linux 4.5 mainline, can be cleanly cherry-picked, and is still present in the kernel to this day; after review of the NVMe driver, I believe there will be no regressions. If you are interested in testing, I have prepared two PPAs with ef2d4615c59efb312e531a5e949970f37ca1c841 patched: linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test
BugLink: https://bugs.launchpad.net/bugs/1833319 [Impact] When copying files from a mounted LVM snapshot which resides on NVMe storage devices, there is a massive performance degradation in the rate at which sectors are read from the disk. The kernel is not merging sector requests, and instead issues many small sector requests to the NVMe storage controller rather than one larger request. Experiments have shown a 14x-25x performance degradation in reads: copies that used to take seconds now take minutes, and copies which took thirty minutes now take many hours. The following was found with btrace, running alongside cat (see Testing): A = IO remapped to different device Q = IO handled by request queue G = Get request U = Unplug request I = IO inserted onto request queue D = IO issued to driver C = IO completion When reading from the LVM snapshot, we see: 259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824 259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat] 259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat] 259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832 259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat] 259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat] 259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat] 259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat] 259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat] ... 259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H] 259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H] 259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H] ... 259,0 1 183 0.001848512 0 C RS 837632 + 8 [0] What is happening here is that each request for an 8-sector read is placed onto the IO request queue, and is then inserted one at a time into the driver request queue and fetched by the driver. Comparing this behaviour to reading data from an LVM snapshot on mainline 4.6+ or the Ubuntu 4.15 HWE kernel: M = IO back merged with request on queue 259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912 259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat] 259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat] 259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920 259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat] 259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat] 259,0 0 200 0.000534474 1897 UT N [cat] 1 259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat] 259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat] 259,0 0 203 0.002242539 0 C R 7358464 + 512 [0] This shows us an 8-sector read is added to the request queue, and is then [M]erged backward with other requests on the queue until the sum of all of those merged requests reaches 512 sectors. From there, the 512-sector read is placed onto the IO queue, where it is fetched by the device driver, and completes. [Fix] The problem is that the NVMe driver on the 4.4 xenial kernel is not merging 8-sector requests. Merging is controlled per device by this sysfs entry: /sys/block/nvme1n1/queue/nomerges On 4.4 xenial, reading from this yields 2 (QUEUE_FLAG_NOMERGES). On the 4.6+ and 4.15 HWE kernels, reading from this yields 0, allowing merges. Setting this to 0 on the 4.4 kernel with: # echo "0" > /sys/block/nvme1n1/queue/nomerges and testing again, we find performance is restored and the problem is fixed. Performing a btrace, we see 8-sector reads get backmerged into a 512-sector read which is done in one go. The problem was fixed in 4.5 upstream with the below commit: commit ef2d4615c59efb312e531a5e949970f37ca1c841 Author: Keith Busch <keith.busch@intel.com> Date: Thu Feb 11 13:05:40 2016 -0700 Subject: NVMe: Allow request merges This commit stops the QUEUE_FLAG_NOMERGES flag from being set during driver init, allowing requests to be backmerged. This also has the direct effect of defaulting /sys/block/nvme1n1/queue/nomerges to 0. Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4 kernels.
[Testcase] You can replicate the problem on a system with an NVMe disk. I recommend using c5.large AWS EC2 instances with a secondary gp2 EBS disk of 200GB or larger. Steps (with the NVMe disk being /dev/nvme1n1): 1. sudo pvcreate /dev/nvme1n1 2. sudo vgcreate secvol /dev/nvme1n1 3. sudo lvcreate --name seclv -l 80%FREE secvol 4. sudo mkfs.ext4 /dev/secvol/seclv 5. sudo mount /dev/mapper/secvol-seclv /mnt 6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done 7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE' 8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX) 9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT To replicate, simply read one of those 512MB files: 10. time cat $NEWMOUNT/dummy1 1> /dev/null On a stock xenial kernel, expect to see the following: 4.4.0-151-generic #178-Ubuntu $ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null real 0m42.693s user 0m0.008s sys 0m0.388s $ cat /sys/block/nvme1n1/queue/nomerges 2 On a patched xenial kernel, performance is restored: 4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu $ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null real 0m1.773s user 0m0.008s sys 0m0.184s $ cat /sys/block/nvme1n1/queue/nomerges 0 [Regression Potential] Cherry-picking "NVMe: Allow request merges" changes the default request policy for NVMe drives, which may give some cause for concern in terms of both stability and performance for other workloads. Regarding stability, this flag was originally set when the NVMe driver was bio based, before the driver had been converted to blk-mq and separated out from /block. You can read a mailing list thread about it here: https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html Along with the commit "MD: make bio mergeable", there is no reason not to allow requests to be mergeable for the new NVMe driver. Regarding performance for other workloads, I reference the commit in which QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced: commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c subject: block: Added in stricter no merge semantics for block I/O nomerges Throughput %System Improvement (tput / %sys) -------- ------------ ----------- ------------------------- 0 12.45 MB/sec 0.669365609 1 12.50 MB/sec 0.641519199 0.40% / 2.71% 2 12.52 MB/sec 0.639849750 0.56% / 2.96% It shows a 0.56% performance increase for no merging (2) over allowing merging (0) for random IO workloads. Comparing this with the 14x-25x performance degradation for reads where requests are not able to be merged, it is clear that changing the default to 0 will not impact any other workloads by any significant margin. The commit is also present in Linux 4.5 mainline, can be cleanly cherry-picked, and is still present in the kernel to this day; after review of the NVMe driver, I believe there will be no regressions. If you are interested in testing, I have prepared two PPAs with ef2d4615c59efb312e531a5e949970f37ca1c841 patched: linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test
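To check which behaviour a given kernel exhibits, the merge events in a short btrace capture can be tallied while the snapshot read runs. A sketch, assuming the event format shown in the traces above (the action code such as Q, M, I, D, C is the sixth whitespace-separated field) and that $NEWMOUNT is the snapshot mount from the testcase:

# Capture ~10 seconds of block trace while reading through the snapshot.
sudo btrace -w 10 /dev/nvme1n1 > trace.data &
TRACE_PID=$!
cat "$NEWMOUNT/dummy1" > /dev/null
wait $TRACE_PID

# Tally action codes. A fixed kernel shows many M (backmerge) events and
# few large D (dispatch) requests; the broken 4.4 kernel shows no M
# events and a flood of small I/D entries instead.
awk '$6 ~ /^[A-Z]+$/ { count[$6]++ } END { for (a in count) print a, count[a] }' trace.data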
2019-06-27 05:49:01 Khaled El Mously linux (Ubuntu Xenial): status In Progress Fix Committed
2019-07-03 11:05:29 Ubuntu Kernel Bot tags sts xenial sts verification-needed-xenial xenial
2019-07-04 04:25:50 Matthew Ruffell tags sts verification-needed-xenial xenial sts verification-done-xenial xenial
2019-07-24 20:29:02 Launchpad Janitor linux (Ubuntu Xenial): status Fix Committed Fix Released
2019-07-24 20:29:02 Launchpad Janitor cve linked 2018-12126
2019-07-24 20:29:02 Launchpad Janitor cve linked 2018-12127
2019-07-24 20:29:02 Launchpad Janitor cve linked 2018-12130
2019-07-24 20:29:02 Launchpad Janitor cve linked 2019-11091
2019-07-24 20:29:02 Launchpad Janitor cve linked 2019-11833
2019-07-24 20:29:02 Launchpad Janitor cve linked 2019-2054