Activity log for bug #1911848

Date Who What changed Old value New value Message
2021-01-15 05:07:22 Matthew Ruffell bug added bug
2021-01-15 05:07:35 Matthew Ruffell linux (Ubuntu): status New Fix Released
2021-01-15 05:07:41 Matthew Ruffell nominated for series Ubuntu Focal
2021-01-15 05:07:41 Matthew Ruffell bug task added linux (Ubuntu Focal)
2021-01-15 05:07:47 Matthew Ruffell linux (Ubuntu Focal): status New In Progress
2021-01-15 05:07:49 Matthew Ruffell linux (Ubuntu Focal): importance Undecided Medium
2021-01-15 05:07:52 Matthew Ruffell linux (Ubuntu Focal): assignee Matthew Ruffell (mruffell)
2021-01-15 05:08:24 Matthew Ruffell tags focal sts
2021-01-15 05:09:39 Matthew Ruffell description

Old value:

[Impact]
On CascadeLake based KVM hosts, Windows Server 2k16 and 2k19 guests will fail to start once they have enabled the hyper-v role for nested virtualisation.
The Windows Server guests will get stuck in the late stages of boot, before the graphical login screen appears, on Windows Server systems with the desktop environment installed.
If you look at performance metrics for the guest, the CPU will appear to be stuck at 100%, and it never changes from 100%. The Windows Server guest is unresponsive.
The KVM settings use Cascadelake-Server-noTSX virtual CPUs, with some very specific settings needed for nested virtualisation. See testcase section. If you use any other vcpu type, the problem does not reproduce.
Known workarounds are to install the 5.8 HWE kernel, in which case the server will come up as expected.

[Fix]
The following commit fixes the issue, and landed in mainline 5.8-rc1:

commit 8081ad06b68a728e676d3b08e9ab70ce4039747b
Author: Sean Christopherson <seanjc@google.com>
Date: Wed Apr 22 19:25:40 2020 -0700
Subject: KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit set
Link: https://github.com/torvalds/linux/commit/8081ad06b68a728e676d3b08e9ab70ce4039747b

It appears that pending requests to the hypervisor can be lost or delayed if an immediate exit was requested in vcpu_enter_guest(). As the commit message mentions, only the !injected case is affected, so we add a check at the cancel_injection label to see if we got there as a result of an immediate exit, and then re-issue a KVM_REQ_EVENT request if we are.
The Windows guest is waiting for an event to be processed, which never happens, and so gets stuck.
Even though the above commit has a Fixes: tag to a commit in 3.15-rc1, in my testing the 4.15 kernel with a Bionic-ussuri userspace does not reproduce the issue, so SRU to Bionic will not be needed.

[Testcase]
A cascadelake based Xeon server is required. Anything else and the bug will not reproduce.
I used a c5.metal server on AWS. It has the following processor:
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Install a KVM stack, and ubuntu-desktop. Set up xrdp and confirm you can reach the desktop.
Copy a Windows Server 2k19 image to the destination server, as well as a recent ISO image of virtio drivers.
Install virt-manager. Create a new virtual machine using the Windows 2k19 defaults. Use 8 vcpus, 16gb ram. Click customise button to change settings before install.
Change the hard disk to be SATA, attach a new cd rom driver for the virtio drivers. Change networking to virtio.
Change CPU to Cascadelake-Server-noTSX.
Edit the virsh xml, and ensure you set the following features for CPU:

<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>Cascadelake-Server-noTSX</model>
<topology sockets='8' cores='1' threads='1'/>
<feature policy='require' name='invpcid'/>
<feature policy='require' name='pcid'/>
<feature policy='require' name='vmx'/>
<feature policy='require' name='hypervisor'/>
<feature policy='disable' name='mpx'/>
<feature policy='require' name='pku'/>
<feature policy='require' name='arch-capabilities'/>
<feature policy='require' name='rdctl-no'/>
<feature policy='require' name='ibrs-all'/>
<feature policy='require' name='skip-l1dfl-vmentry'/>
<feature policy='require' name='mds-no'/>
</cpu>

Those settings are an absolute must.
Boot the VM, and install Windows 2k19 with the desktop environment. Once it is installed, open up computer management > device manager and install drivers from the virtio ISO for missing hardware, likely the network and balloon devices.
From there, go to server manager, and install the hyper-v role. Reboot the server. It will reboot a few times, and on the final time, it will lock up before it reaches the log in screen.
In virt-manager, go to the performance tab. The CPU will be stuck at 100%. The windows guest will be non responsive.
A patched kernel is available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf296306-test
If you install this kernel and boot the Windows 2k19 guest, it will come up normally when the hyper-v role is enabled, and you will be able to log in.

[Where problems could occur]
This is a change to a core part of the kvm subsystem, so there is potential for regression which could affect all users of KVM.
If a regression were to occur, there are no workarounds. Users would need to downgrade their kernel while a fix is developed.

New value:

BugLink: https://bugs.launchpad.net/bugs/1911848

[Impact]
On CascadeLake based KVM hosts, Windows Server 2k16 and 2k19 guests will fail to start once they have enabled the hyper-v role for nested virtualisation.
The Windows Server guests will get stuck in the late stages of boot, before the graphical login screen appears, on Windows Server systems with the desktop environment installed.
If you look at performance metrics for the guest, the CPU will appear to be stuck at 100%, and it never changes from 100%. The Windows Server guest is unresponsive.
The KVM settings use Cascadelake-Server-noTSX virtual CPUs, with some very specific settings needed for nested virtualisation. See testcase section. If you use any other vcpu type, the problem does not reproduce.
Known workarounds are to install the 5.8 HWE kernel, in which case the server will come up as expected.

[Fix]
The following commit fixes the issue, and landed in mainline 5.8-rc1:

commit 8081ad06b68a728e676d3b08e9ab70ce4039747b
Author: Sean Christopherson <seanjc@google.com>
Date: Wed Apr 22 19:25:40 2020 -0700
Subject: KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit set
Link: https://github.com/torvalds/linux/commit/8081ad06b68a728e676d3b08e9ab70ce4039747b

It appears that pending requests to the hypervisor can be lost or delayed if an immediate exit was requested in vcpu_enter_guest(). As the commit message mentions, only the !injected case is affected, so we add a check at the cancel_injection label to see if we got there as a result of an immediate exit, and then re-issue a KVM_REQ_EVENT request if we are.
The Windows guest is waiting for an event to be processed, which never happens, and so gets stuck.
Even though the above commit has a Fixes: tag to a commit in 3.15-rc1, in my testing the 4.15 kernel with a Bionic-ussuri userspace does not reproduce the issue, so SRU to Bionic will not be needed.

[Testcase]
A cascadelake based Xeon server is required. Anything else and the bug will not reproduce.
I used a c5.metal server on AWS. It has the following processor:
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Install a KVM stack, and ubuntu-desktop. Set up xrdp and confirm you can reach the desktop.
Copy a Windows Server 2k19 image to the destination server, as well as a recent ISO image of virtio drivers.
Install virt-manager. Create a new virtual machine using the Windows 2k19 defaults. Use 8 vcpus, 16gb ram. Click customise button to change settings before install.
Change the hard disk to be SATA, attach a new cd rom driver for the virtio drivers. Change networking to virtio.
Change CPU to Cascadelake-Server-noTSX.
Edit the virsh xml, and ensure you set the following features for CPU:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Cascadelake-Server-noTSX</model>
    <topology sockets='8' cores='1' threads='1'/>
    <feature policy='require' name='invpcid'/>
    <feature policy='require' name='pcid'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='require' name='pku'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='ibrs-all'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
  </cpu>

Those settings are an absolute must.
Boot the VM, and install Windows 2k19 with the desktop environment. Once it is installed, open up computer management > device manager and install drivers from the virtio ISO for missing hardware, likely the network and balloon devices.
From there, go to server manager, and install the hyper-v role. Reboot the server. It will reboot a few times, and on the final time, it will lock up before it reaches the log in screen.
In virt-manager, go to the performance tab. The CPU will be stuck at 100%. The windows guest will be non responsive.
A patched kernel is available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf296306-test
If you install this kernel and boot the Windows 2k19 guest, it will come up normally when the hyper-v role is enabled, and you will be able to log in.

[Where problems could occur]
This is a change to a core part of the kvm subsystem, so there is potential for regression which could affect all users of KVM.
If a regression were to occur, there are no workarounds. Users would need to downgrade their kernel while a fix is developed.
2021-01-15 05:22:13 Matthew Ruffell description

Old value:

BugLink: https://bugs.launchpad.net/bugs/1911848

[Impact]
On CascadeLake based KVM hosts, Windows Server 2k16 and 2k19 guests will fail to start once they have enabled the hyper-v role for nested virtualisation.
The Windows Server guests will get stuck in the late stages of boot, before the graphical login screen appears, on Windows Server systems with the desktop environment installed.
If you look at performance metrics for the guest, the CPU will appear to be stuck at 100%, and it never changes from 100%. The Windows Server guest is unresponsive.
The KVM settings use Cascadelake-Server-noTSX virtual CPUs, with some very specific settings needed for nested virtualisation. See testcase section. If you use any other vcpu type, the problem does not reproduce.
Known workarounds are to install the 5.8 HWE kernel, in which case the server will come up as expected.

[Fix]
The following commit fixes the issue, and landed in mainline 5.8-rc1:

commit 8081ad06b68a728e676d3b08e9ab70ce4039747b
Author: Sean Christopherson <seanjc@google.com>
Date: Wed Apr 22 19:25:40 2020 -0700
Subject: KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit set
Link: https://github.com/torvalds/linux/commit/8081ad06b68a728e676d3b08e9ab70ce4039747b

It appears that pending requests to the hypervisor can be lost or delayed if an immediate exit was requested in vcpu_enter_guest(). As the commit message mentions, only the !injected case is affected, so we add a check at the cancel_injection label to see if we got there as a result of an immediate exit, and then re-issue a KVM_REQ_EVENT request if we are.
The Windows guest is waiting for an event to be processed, which never happens, and so gets stuck.
Even though the above commit has a Fixes: tag to a commit in 3.15-rc1, in my testing the 4.15 kernel with a Bionic-ussuri userspace does not reproduce the issue, so SRU to Bionic will not be needed.

[Testcase]
A cascadelake based Xeon server is required. Anything else and the bug will not reproduce.
I used a c5.metal server on AWS. It has the following processor:
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Install a KVM stack, and ubuntu-desktop. Set up xrdp and confirm you can reach the desktop.
Copy a Windows Server 2k19 image to the destination server, as well as a recent ISO image of virtio drivers.
Install virt-manager. Create a new virtual machine using the Windows 2k19 defaults. Use 8 vcpus, 16gb ram. Click customise button to change settings before install.
Change the hard disk to be SATA, attach a new cd rom driver for the virtio drivers. Change networking to virtio.
Change CPU to Cascadelake-Server-noTSX.
Edit the virsh xml, and ensure you set the following features for CPU:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Cascadelake-Server-noTSX</model>
    <topology sockets='8' cores='1' threads='1'/>
    <feature policy='require' name='invpcid'/>
    <feature policy='require' name='pcid'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='require' name='pku'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='ibrs-all'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
  </cpu>

Those settings are an absolute must.
Boot the VM, and install Windows 2k19 with the desktop environment. Once it is installed, open up computer management > device manager and install drivers from the virtio ISO for missing hardware, likely the network and balloon devices.
From there, go to server manager, and install the hyper-v role. Reboot the server. It will reboot a few times, and on the final time, it will lock up before it reaches the log in screen.
In virt-manager, go to the performance tab. The CPU will be stuck at 100%. The windows guest will be non responsive.
A patched kernel is available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf296306-test
If you install this kernel and boot the Windows 2k19 guest, it will come up normally when the hyper-v role is enabled, and you will be able to log in.

[Where problems could occur]
This is a change to a core part of the kvm subsystem, so there is potential for regression which could affect all users of KVM.
If a regression were to occur, there are no workarounds. Users would need to downgrade their kernel while a fix is developed.

New value:

BugLink: https://bugs.launchpad.net/bugs/1911848

[Impact]
On CascadeLake based KVM hosts, Windows Server 2k16 and 2k19 guests will fail to start once they have enabled the hyper-v role for nested virtualisation.
The Windows Server guests will get stuck in the late stages of boot, before the graphical login screen appears, on Windows Server systems with the desktop environment installed.
If you look at performance metrics for the guest, the CPU will appear to be stuck at 100%, and it never changes from 100%. The Windows Server guest is unresponsive.
The KVM settings use Cascadelake-Server-noTSX virtual CPUs, with some very specific settings needed for nested virtualisation. See testcase section. If you use any other vcpu type, the problem does not reproduce.
Known workarounds are to install the 5.8 HWE kernel, in which case the server will come up as expected.

[Fix]
The following commit fixes the issue, and landed in mainline 5.8-rc1:

commit 8081ad06b68a728e676d3b08e9ab70ce4039747b
Author: Sean Christopherson <seanjc@google.com>
Date: Wed Apr 22 19:25:40 2020 -0700
Subject: KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit set
Link: https://github.com/torvalds/linux/commit/8081ad06b68a728e676d3b08e9ab70ce4039747b

It appears that pending requests to the hypervisor can be lost or delayed if an immediate exit was requested in vcpu_enter_guest(). As the commit message mentions, only the !injected case is affected, so we add a check at the cancel_injection label to see if we got there as a result of an immediate exit, and then re-issue a KVM_REQ_EVENT request if we are.
The Windows guest is waiting for an event to be processed, which never happens, and so gets stuck.
Even though the above commit has a Fixes: tag to a commit in 3.15-rc1, in my testing the 4.15 kernel with a Bionic-ussuri userspace does not reproduce the issue, so SRU to Bionic will not be needed.

[Testcase]
A cascadelake based Xeon server is required. Anything else and the bug will not reproduce.
I used a c5.metal server on AWS. It has the following processor:
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Install a KVM stack, and ubuntu-desktop. Set up xrdp and confirm you can reach the desktop.
Copy a Windows Server 2k19 image to the destination server, as well as a recent ISO image of virtio drivers.
Install virt-manager. Create a new virtual machine using the Windows 2k19 defaults. Use 8 vcpus, 16gb ram. Click customise button to change settings before install.
Change the hard disk to be SATA, attach a new cd rom drive for the virtio drivers. Change networking to virtio.
Change CPU to Cascadelake-Server-noTSX.
Edit the virsh xml, and ensure you set the following features for CPU:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Cascadelake-Server-noTSX</model>
    <topology sockets='8' cores='1' threads='1'/>
    <feature policy='require' name='invpcid'/>
    <feature policy='require' name='pcid'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='require' name='pku'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='ibrs-all'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
  </cpu>

Those settings are an absolute must.
Boot the VM, and install Windows 2k19 with the desktop environment. Once it is installed, open up computer management > device manager and install drivers from the virtio ISO for missing hardware, likely the network and balloon devices.
From there, go to server manager, and install the hyper-v role. Reboot the server. It will reboot a few times, and on the final time, it will lock up before it reaches the log in screen.
In virt-manager, go to the performance tab. The CPU will be stuck at 100%. The windows guest will be non responsive.
A patched kernel is available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf296306-test
If you install this kernel and boot the Windows 2k19 guest, it will come up normally when the hyper-v role is enabled, and you will be able to log in.

[Where problems could occur]
This is a change to a core part of the kvm subsystem, so there is potential for regression which could affect all users of KVM.
If a regression were to occur, there are no workarounds. Users would need to downgrade their kernel while a fix is developed.
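
Note on the [Testcase] section of the description above: because the bug only reproduces with that exact CPU definition, it may help to check a guest's XML against it before booting. The following is a minimal sketch of such a check and is not part of the bug report. The helper name missing_cpu_features is hypothetical, the model name and feature policies are copied from the <cpu> block in the description, and only Python's standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET

# Model and feature policies copied from the <cpu> block in the bug description.
REQUIRED_MODEL = "Cascadelake-Server-noTSX"
REQUIRED_FEATURES = {
    "invpcid": "require",
    "pcid": "require",
    "vmx": "require",
    "hypervisor": "require",
    "mpx": "disable",
    "pku": "require",
    "arch-capabilities": "require",
    "rdctl-no": "require",
    "ibrs-all": "require",
    "skip-l1dfl-vmentry": "require",
    "mds-no": "require",
}


def missing_cpu_features(domain_xml):
    """Hypothetical helper: list the ways a guest's <cpu> element differs
    from the definition given in the test case."""
    root = ET.fromstring(domain_xml)
    # Accept either a full <domain> document or a bare <cpu> fragment.
    cpu = root if root.tag == "cpu" else root.find("cpu")
    if cpu is None:
        return ["no <cpu> element"]

    problems = []
    model = cpu.find("model")
    if model is None or (model.text or "").strip() != REQUIRED_MODEL:
        problems.append("model is not " + REQUIRED_MODEL)

    # Map each feature name to the policy actually set in the XML.
    found = {f.get("name"): f.get("policy") for f in cpu.findall("feature")}
    for name, policy in REQUIRED_FEATURES.items():
        if found.get(name) != policy:
            problems.append("feature %s should have policy=%s" % (name, policy))
    return problems
```

One way to use it, assuming the guest is managed by libvirt, is to pass it the output of `virsh dumpxml` for the guest; an empty list means the model and all eleven feature policies match the test case.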
2021-01-19 14:57:32 Terry Rudd bug added subscriber Terry Rudd
2021-01-22 19:21:21 Kelsey Steele linux (Ubuntu Focal): status In Progress Fix Committed
2021-02-05 10:18:18 Ubuntu Kernel Bot tags focal sts focal sts verification-needed-focal
2021-02-09 23:58:29 Matthew Ruffell attachment added 5.4.0-65-generic https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1911848/+attachment/5462096/+files/5.4.0-65-generic.png
2021-02-09 23:59:27 Matthew Ruffell attachment added 5.4.0-66-generic https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1911848/+attachment/5462097/+files/5.4.0-66-generic.png
2021-02-10 00:03:50 Matthew Ruffell tags focal sts verification-needed-focal focal sts verification-done-focal
2021-02-23 16:16:31 Launchpad Janitor linux (Ubuntu Focal): status Fix Committed Fix Released
2021-02-23 16:16:31 Launchpad Janitor cve linked 2020-27777
2021-02-23 16:16:31 Launchpad Janitor cve linked 2020-29372