2020-09-08 19:37:34 |
Dexuan Cui |
bug |
|
|
added bug |
2020-10-23 20:33:00 |
Marcelo Cerri |
nominated for series |
|
Ubuntu Groovy |
|
2020-10-23 20:33:00 |
Marcelo Cerri |
bug task added |
|
linux-azure (Ubuntu Groovy) |
|
2020-10-23 20:33:00 |
Marcelo Cerri |
nominated for series |
|
Ubuntu Focal |
|
2020-10-23 20:33:00 |
Marcelo Cerri |
bug task added |
|
linux-azure (Ubuntu Focal) |
|
2020-10-23 20:33:12 |
Marcelo Cerri |
linux-azure (Ubuntu Focal): status |
New |
In Progress |
|
2020-10-23 20:33:16 |
Marcelo Cerri |
linux-azure (Ubuntu Groovy): status |
New |
Fix Committed |
|
2020-10-23 20:33:20 |
Marcelo Cerri |
linux-azure (Ubuntu Groovy): status |
Fix Committed |
In Progress |
|
2020-10-23 20:35:39 |
Marcelo Cerri |
description |
There are failed logs after resume from hibernation in NV6 (GPU passthrough size) VM in Azure:
[ 1432.153730] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
[ 1432.167910] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
This happens to the latest stable release of the linux-azure 5.4.0-1023.23 kernel and the latest mainline linux kernel.
How reproducible:
100%
Steps to Reproduce:
1. Start a Standard_NV6 VM in Azure and enable hibernation properly (please refer to https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1880032/comments/14 )
E.g. here I create a Generation-1 Ubuntu 20.04 Standard NV6_Promo (6 vcpus, 56 GiB memory) VM in East US 2.
2. Make sure the in-kernel open-source nouveau driver is loaded, or blacklist the nouveau driver and install the official Nvidia GPU driver (please follow https://docs.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup : "Install GRID drivers on NV or NVv3-series VMs" -- the most important step is to run "./NVIDIA-Linux-x86_64-grid.run".)
3. Run hibernation from serial console
# systemctl hibernate
4. After hibernation finishes, start VM and check dmesg
# dmesg|grep fail
Actual results:
[ 1432.153730] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
[ 1432.167910] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
And /proc/interrupts shows that the GPU interrupts are no longer happening.
Expected results:
No failed logs, and the GPU interrupt should still happen after hibernation.
BUG FIX:
I made a fix here: https://lkml.org/lkml/2020/9/4/1268.
Without the patch, we see the error "hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5" during hibernation when the VM has the Nvidia GPU driver loaded, and after hibernation the GPU driver can no longer receive any MSI/MSI-X interrupts when we check /proc/interrupts.
With the patch, we should no longer see the error, and the GPU driver should still receive interrupts after hibernation. |
[Impact]
Failure messages are logged after resuming from hibernation on an NV6 (GPU passthrough size) VM in Azure:
[ 1432.153730] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
[ 1432.167910] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
This happens to the latest stable release of the linux-azure 5.4.0-1023.23 kernel and the latest mainline linux kernel.
[Test Case]
How reproducible:
100%
Steps to Reproduce:
1. Start a Standard_NV6 VM in Azure and enable hibernation properly (please refer to https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1880032/comments/14 )
E.g. here I create a Generation-1 Ubuntu 20.04 Standard NV6_Promo (6 vcpus, 56 GiB memory) VM in East US 2.
2. Make sure the in-kernel open-source nouveau driver is loaded, or blacklist the nouveau driver and install the official Nvidia GPU driver (please follow https://docs.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup : "Install GRID drivers on NV or NVv3-series VMs" -- the most important step is to run "./NVIDIA-Linux-x86_64-grid.run".)
3. Run hibernation from serial console
# systemctl hibernate
4. After hibernation finishes, start VM and check dmesg
# dmesg|grep fail
Actual results:
[ 1432.153730] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
[ 1432.167910] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
And /proc/interrupts shows that the GPU interrupts are no longer happening.
Expected results:
No failed logs, and the GPU interrupt should still happen after hibernation.
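The two checks in the test case (the dmesg failure lines and the GPU interrupt counts in /proc/interrupts) could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and the driver name passed to the second function (for example "nvidia" or "nouveau") is an assumption about how the GPU interrupt lines are labelled on a given VM.

```shell
#!/bin/sh
# Hypothetical helpers for the test case above; both read from stdin so
# they can be fed live data or a saved log.

# Count kernel log lines that report the hv_irq_unmask() failure.
# grep -c exits non-zero when there is no match, so "|| true" keeps the
# function's exit status at 0 while still printing "0".
count_unmask_failures() {
    grep -c 'hv_irq_unmask() failed' || true
}

# Sum the per-CPU interrupt counts of every /proc/interrupts line that
# matches the pattern given as $1 (assumed to be the GPU driver name).
# The count columns are the purely numeric fields that follow the IRQ label.
sum_irq_counts() {
    awk -v pat="$1" '
        $0 ~ pat {
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                total += $i
        }
        END { print total + 0 }
    '
}
```

After resume, "dmesg | count_unmask_failures" should print 0 on a fixed kernel. Running "sum_irq_counts nvidia < /proc/interrupts" twice, a few seconds apart while the GPU is in use, should show the total still increasing if interrupts are being delivered.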
[Regression Potential]
The fix touches the pci-hyperv driver and can compromise the Hyper-V guest drivers. However, the change focuses on the execution path used for hibernation, which is still not officially supported.
[Other info]
BUG FIX:
I made a fix here: https://lkml.org/lkml/2020/9/4/1268.
Without the patch, we see the error "hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5" during hibernation when the VM has the Nvidia GPU driver loaded, and after hibernation the GPU driver can no longer receive any MSI/MSI-X interrupts when we check /proc/interrupts.
With the patch, we should no longer see the error, and the GPU driver should still receive interrupts after hibernation. |
|
2020-10-26 08:15:07 |
Stefan Bader |
linux-azure (Ubuntu Focal): importance |
Undecided |
Medium |
|
2020-10-26 08:15:13 |
Stefan Bader |
linux-azure (Ubuntu Groovy): importance |
Undecided |
Medium |
|
2020-10-26 08:15:34 |
Stefan Bader |
linux-azure (Ubuntu): status |
New |
Invalid |
|
2020-10-26 18:31:43 |
Ian May |
linux-azure (Ubuntu Focal): status |
In Progress |
Fix Committed |
|
2020-10-27 10:36:56 |
Kleber Sacilotto de Souza |
linux-azure (Ubuntu Groovy): status |
In Progress |
Fix Committed |
|
2020-11-30 15:46:41 |
Launchpad Janitor |
linux-azure (Ubuntu Focal): status |
Fix Committed |
Fix Released |
|
2020-11-30 15:46:41 |
Launchpad Janitor |
cve linked |
|
2020-12351 |
|
2020-11-30 15:46:41 |
Launchpad Janitor |
cve linked |
|
2020-12352 |
|
2020-11-30 15:46:41 |
Launchpad Janitor |
cve linked |
|
2020-14351 |
|
2020-11-30 15:46:41 |
Launchpad Janitor |
cve linked |
|
2020-24490 |
|
2020-11-30 15:46:41 |
Launchpad Janitor |
cve linked |
|
2020-8694 |
|
2020-12-04 15:37:27 |
Marcelo Cerri |
linux-azure (Ubuntu Groovy): status |
Fix Committed |
Fix Released |
|