QEMU-KVM / detect_zeroes causes KVM to start an unlimited number of threads on guest-side high I/O with large block sizes

Bug #1687653 reported by Florian Strankowski
This bug affects 2 people
Affects   Status      Importance   Assigned to   Milestone
QEMU      Invalid     Undecided    Unassigned
Ubuntu    Confirmed   Undecided    Unassigned

Bug Description

QEMU-KVM in combination with "detect_zeroes=on" allows a guest to DoS the host. This is possible if the host has "detect_zeroes" enabled for the guest's drive and the guest writes a large chunk of data with a huge block size onto that drive.

E.g.: dd if=/dev/zero of=/tmp/DoS bs=1G count=1 oflag=direct

All QEMU versions since the introduction of detect_zeroes are affected; earlier versions are not. This is absolutely critical, please fix it ASAP!
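
For context, detect_zeroes is enabled per drive on the host side. A minimal sketch of such a host invocation (the LVM device path and memory size are placeholders, not taken from the reporter's setup):

  # hypothetical host command line; /dev/vg0/guestlv is a placeholder LVM volume
  qemu-system-x86_64 -enable-kvm -m 2048 \
      -drive file=/dev/vg0/guestlv,if=virtio,format=raw,cache=none,aio=native,detect-zeroes=on

The guest then only needs to issue the dd command above against that drive.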

#####

Provided by Dominik Csapak:

source , bs , count , O_DIRECT, behaviour

urandom , bs 1M, count 1024, O_DIRECT: OK
file , bs 1M, count 1024, O_DIRECT: OK
/dev/zero , bs 1M, count 1024, O_DIRECT: OK
zero file , bs 1M, count 1024, O_DIRECT: OK
/dev/zero , bs 1G, count 1, O_DIRECT: NOT OK
zero file , bs 1G, count 1, O_DIRECT: NOT OK
zero file , bs 1G, count 1, no O_DIRECT: NOT OK
rand file , bs 1G, count 1, O_DIRECT: OK
rand file , bs 1G, count 1, no O_DIRECT: OK

discard on:

urandom , bs 1M, count 1024, O_DIRECT: OK
rand file , bs 1M, count 1024, O_DIRECT: OK
/dev/zero , bs 1M, count 1024, O_DIRECT: OK
zero file , bs 1M, count 1024, O_DIRECT: OK
/dev/zero , bs 1G, count 1, O_DIRECT: NOT OK
zero file , bs 1G, count 1, O_DIRECT: NOT OK
zero file , bs 1G, count 1, no O_DIRECT: NOT OK
rand file , bs 1G, count 1, O_DIRECT: OK
rand file , bs 1G, count 1, no O_DIRECT: OK

detect_zeroes off:

urandom , bs 1M, count 1024, O_DIRECT: OK
rand file , bs 1M, count 1024, O_DIRECT: OK
/dev/zero , bs 1M, count 1024, O_DIRECT: OK
zero file , bs 1M, count 1024, O_DIRECT: OK
/dev/zero , bs 1G, count 1, O_DIRECT: OK
zero file , bs 1G, count 1, O_DIRECT: OK
zero file , bs 1G, count 1, no O_DIRECT: OK
rand file , bs 1G, count 1, O_DIRECT: OK
rand file , bs 1G, count 1, no O_DIRECT: OK
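
The matrix above can be reproduced from inside the guest roughly as follows (a sketch; the target device /dev/sdb and the source file paths are assumptions, and the 1 GiB source files are prepared first):

  # prepare 1 GiB random and zero-filled source files (assumed paths)
  dd if=/dev/urandom of=/root/rand.src bs=1M count=1024
  dd if=/dev/zero    of=/root/zero.src bs=1M count=1024

  # small block size: behaves normally in the reported setup
  dd if=/root/zero.src of=/dev/sdb bs=1M count=1024 oflag=direct

  # large block size with an all-zero source: the "NOT OK" cases above
  dd if=/root/zero.src of=/dev/sdb bs=1G count=1 oflag=direct

  # large block size with random data: reported as OK
  dd if=/root/rand.src of=/dev/sdb bs=1G count=1 oflag=direct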

#####

Provided by Florian Strankowski:

bs - count - io-threads

512K - 2048 - 2
1M - 1024 - 2
2M - 512 - 4
4M - 256 - 6
8M - 128 - 10
16M - 64 - 18
32M - 32 - uncountable

Please refer to further information here:

https://bugzilla.proxmox.com/show_bug.cgi?id=1368
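
The io-thread counts above can be observed on the host roughly like this (a sketch; the pgrep pattern assumes a single qemu-system-x86_64 process):

  # print the QEMU thread count once per second while dd runs in the guest
  QEMU_PID=$(pgrep -f qemu-system-x86_64 | head -n1)
  while true; do
      ps -o nlwp= -p "$QEMU_PID"
      sleep 1
  done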

information type: Private Security → Public
information type: Public → Public Security
information type: Public Security → Private Security
information type: Private Security → Public Security
Changed in qemu:
status: New → Confirmed
Revision history for this message
Florian Strankowski (fstrankowski) wrote :

Sorry about the visibility settings, this bug tracker drives me nuts.

information type: Public Security → Private Security
information type: Private Security → Public Security
Revision history for this message
Florian Strankowski (fstrankowski) wrote :

Just to make this clear: this bug only affects LVM-backed storage (both LVM-thin and LVM-thick); file-based storage is not affected.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status: New → Confirmed
Revision history for this message
Stefan Hajnoczi (stefanha) wrote :

I'm unable to reproduce this issue. The host stays responsive and the dd command completes in a reasonable amount of time. QEMU does not exceed the 64-thread pool size.

Please post steps to reproduce the issue using a minimal command-line without libvirt.

Here is information on my attempt to reproduce the problem:

Guest: Kernel 4.10.8-200.fc25.x86_64
Host: 4.10.11-200.fc25.x86_64
QEMU: qemu.git/master (e619b14746e5d8c0e53061661fd0e1da01fd4d60)

The LV is 1 GB on top of LUKS on a Samsung MZNLN256HCHP SATA SSD drive.

mpstat -P ALL 5 output:
11:02:02 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:02:07 AM all 3.36 0.00 6.22 34.54 0.25 0.50 0.00 3.11 0.00 52.03
11:02:07 AM 0 2.82 0.00 5.63 32.39 0.80 1.21 0.00 3.22 0.00 53.92
11:02:07 AM 1 3.02 0.00 6.04 28.77 0.20 0.20 0.00 3.02 0.00 58.75
11:02:07 AM 2 3.56 0.00 7.71 44.27 0.20 0.40 0.00 2.37 0.00 41.50
11:02:07 AM 3 3.81 0.00 5.61 32.46 0.00 0.40 0.00 4.01 0.00 53.71

vmstat 5 output:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 0 0 0 1617404 6484 3541468 0 0 2145 84794 1976 8814 8 8 64 20 0
 0 0 0 1619492 6484 3538592 0 0 613 69340 1518 7430 6 7 70 17 0
 0 0 0 1618920 6484 3538680 0 0 280 75199 1421 6811 6 7 52 35 0

pidstat -v -p $PID_OF_QEMU 5 output:
11:01:08 AM UID PID threads fd-nr Command
11:02:03 AM 0 13043 67 37 qemu-system-x86
11:02:08 AM 0 13043 67 37 qemu-system-x86
11:02:13 AM 0 13043 67 37 qemu-system-x86

$ sudo x86_64-softmmu/qemu-system-x86_64 -enable-kvm -m 1024 -cpu host \
        -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 \
        -drive file=test.img,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on \
        -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100 \
        -drive file=/dev/path/to/testlv,if=none,id=drive-scsi1,format=raw,cache=none,aio=native,detect-zeroes=on \
        -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1,bootindex=101 \
        -nographic

guest# dd if=/dev/zero of=/dev/sdb bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 15.0681 s, 71.3 MB/s

Revision history for this message
Florian Strankowski (fstrankowski) wrote :

Please be so kind as to use a 6G LVM volume and run "dd if=/dev/zero of=/dev/sdb bs=3G count=2 oflag=direct". Keep an eye on your processor usage in relation to the number of threads created. It's harder to knock down an SSD-backed system than one with spinning disks.
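
For reference, such a test volume could be created on the host roughly as follows (a sketch; the volume group name vg0 and the LV name testlv are assumptions):

  # create a 6G logical volume and attach it as the second -drive
  # (detect-zeroes=on), as in the command line above
  lvcreate -L 6G -n testlv vg0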

Revision history for this message
Stefan Hajnoczi (stefanha) wrote :

After further investigation on IRC the following points were raised:

1. Non-vcpu threads in QEMU weren't being isolated. Libvirt can do this
   using the <cputune> domain XML element (see the sketch after this
   list). The guest can create a high load if some QEMU threads are
   unconstrained.

2. The wait% CPU stat was causing confusion. It's the idle time during
   which synchronous I/O is pending. High wait% does not mean that the
   system is under high CPU load. detect-zeroes=on can take a
   synchronous I/O path even when aio=native is used, and this results
   in wait% instead of idle%.
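
Regarding point 1, a minimal pinning sketch via libvirt (the domain name "mydomain" and the CPU lists are placeholders; this corresponds to the <cputune> vcpupin/emulatorpin elements in the domain XML):

  # pin vCPUs to dedicated host CPUs and confine the emulator/IO threads
  # to the remaining ones, so guest I/O load cannot starve the host
  virsh vcpupin     mydomain 0 2 --live
  virsh vcpupin     mydomain 1 3 --live
  virsh emulatorpin mydomain 0-1 --live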

I'm closing the bug.

Changed in qemu:
status: Confirmed → Invalid