[linux-azure][hibernation] GPU device no longer working after resume from hibernation in NV6 VM size

Bug #1894893 reported by Dexuan Cui
This bug affects 2 people
Affects               Status        Importance  Assigned to  Milestone
linux-azure (Ubuntu)  Invalid       Undecided   Unassigned
Focal                 Fix Released  Medium      Unassigned
Groovy                Fix Released  Medium      Unassigned

Bug Description

[Impact]

The following errors are logged after resuming from hibernation on an NV6 (GPU passthrough size) VM in Azure:
[ 1432.153730] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
[ 1432.167910] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5

This happens with the latest stable release of the linux-azure kernel (5.4.0-1023.23) and with the latest mainline Linux kernel.

[Test Case]

How reproducible:
100%

Steps to Reproduce:
1. Start a Standard_NV6 VM in Azure and enable hibernation properly (please refer to https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1880032/comments/14 )

E.g. here I create a Generation-1 Ubuntu 20.04 Standard NV6_Promo (6 vcpus, 56 GiB memory) VM in East US 2.

2. Make sure the in-kernel open-source nouveau driver is loaded, or blacklist the nouveau driver and install the official Nvidia GPU driver (please follow https://docs.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup : "Install GRID drivers on NV or NVv3-series VMs"; the most important step is to run "./NVIDIA-Linux-x86_64-grid.run").

3. Hibernate the VM from the serial console:
# systemctl hibernate

4. After hibernation finishes, start the VM and check dmesg:
# dmesg|grep fail

Actual results:
[ 1432.153730] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5
[ 1432.167910] hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5

/proc/interrupts also shows that the GPU interrupts are no longer being delivered.

Expected results:
No failure messages in the log, and the GPU interrupts should still arrive after hibernation.
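
The dmesg check in step 4 greps for "fail", which can also match unrelated messages. A minimal sketch of a narrower check follows. It is not part of the original report: the function name count_unmask_failures is hypothetical, and it assumes the message text is exactly as in the log lines quoted above.

```shell
# Count "hv_irq_unmask() failed" lines in kernel log text read from stdin
# and print the count. The helper name is hypothetical.
count_unmask_failures() {
    # -F matches the string literally; -c prints the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the status 0.
    grep -c -F 'hv_irq_unmask() failed' || true
}

# Example (run as root after resume):
#   dmesg | count_unmask_failures
```

A result of 0 after resume corresponds to the expected behaviour; any other value means the failure reported here is still present.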

[Regression Potential]

The fix touches the pci-hyperv driver and could therefore affect the Hyper-V guest drivers. However, the change is confined to the execution path used for hibernation, which is still not officially supported.

[Other info]

BUG FIX:
I made a fix here: https://lkml.org/lkml/2020/9/4/1268.

Without the patch, we see the error "hv_pci 47505500-0001-0000-3130-444531334632: hv_irq_unmask() failed: 0x5" during hibernation when the VM has the Nvidia GPU driver loaded, and after hibernation the GPU driver can no longer receive any MSI/MSI-X interrupts when we check /proc/interrupts.

With the patch, we should no longer see the error, and the GPU driver should still receive interrupts after hibernation.
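
One way to check whether the GPU driver still receives interrupts is to total its counters in /proc/interrupts before and after some GPU activity and compare the two values. The sketch below is only an illustration under stated assumptions: the helper name irq_total is hypothetical, and the name the driver registers in /proc/interrupts ("nvidia" in the usage example) is an assumption that should be checked on the VM.

```shell
# irq_total NAME: read /proc/interrupts-style text from stdin, sum the per-CPU
# counts of every line containing NAME, and print the total.
# The helper name is hypothetical.
irq_total() {
    awk -v name="$1" '
        NR == 1 { ncpu = NF; next }     # header row: one column per CPU
        index($0, name) {               # lines that mention the given name
            for (i = 2; i <= ncpu + 1; i++) total += $i
        }
        END { print total + 0 }         # print 0 when nothing matched
    '
}

# Example (the "nvidia" name is an assumption):
#   before=$(irq_total nvidia < /proc/interrupts)
#   ... generate some GPU activity ...
#   after=$(irq_total nvidia < /proc/interrupts)
#   [ "$after" -gt "$before" ] && echo "GPU interrupts are still arriving"
```

If the total does not increase after resume while the GPU is in use, that matches the failure described above.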

Dexuan Cui (decui) wrote :

The fix is in the PCI tree now:

"PCI: hv: Fix hibernation in case interrupts are not re-created" (
https://git.kernel.org/pub/scm/linux/kernel/git/lpieralisi/pci.git/commit/?h=pci/hv&id=915cff7f38c5e4d47f187f8049245afc2cb3e503 )

Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Focal):
status: New → In Progress
Changed in linux-azure (Ubuntu Groovy):
status: New → Fix Committed
status: Fix Committed → In Progress
description: updated
Marcelo Cerri (mhcerri) wrote :
Stefan Bader (smb)
Changed in linux-azure (Ubuntu Focal):
importance: Undecided → Medium
Changed in linux-azure (Ubuntu Groovy):
importance: Undecided → Medium
Changed in linux-azure (Ubuntu):
status: New → Invalid
Marcelo Cerri (mhcerri) wrote :

The fix has already been reviewed and acked on the mailing list.

Ian May (ian-may)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu Groovy):
status: In Progress → Fix Committed
Kleber Sacilotto de Souza (kleber-souza) wrote :

For Groovy, the proposed fix has already been applied to the generic groovy/linux kernel as part of "Groovy update: v5.8.17 upstream stable release" (bug 1902137). The patch applied to the linux-azure branch was therefore dropped during the rebase, so the BugLink to this bug report is missing. As a result, this bug will not be closed automatically when the package is released.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1032.33

---------------
linux-azure (5.4.0-1032.33) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1032.33 -proposed tracker (LP: #1903162)

  * Focal update: v5.4.66 upstream stable release (LP: #1896824)
    - [Config] azure: updateconfigs for VGACON_SOFT_SCROLLBACK

  * [linux-azure][hibernation] Mellanox CX4 NIC's TX/RX packets stop increasing
    after hibernation/resume (LP: #1894896)
    - hv_netvsc: Fix hibernation for mlx5 VF driver

  * [linux-azure][hibernation] GPU device no longer working after resume from
    hibernation in NV6 VM size (LP: #1894893)
    - PCI: hv: Fix hibernation in case interrupts are not re-created

  * linux-azure: build and include the tcm_loop module to the main kernel
    package (LP: #1791794)
    - [Config] linux-azure: CONFIG_LOOPBACK_TARGET=m (tcm_loop)

  * [linux-azure] Two Fixes For kdump Over Network (LP: #1883261)
    - PCI: hv: Fix the PCI HyperV probe failure path to release resource properly
    - PCI: hv: Retry PCI bus D0 entry on invalid device state

  [ Ubuntu: 5.4.0-55.61 ]

  * focal/linux: 5.4.0-55.61 -proposed tracker (LP: #1903175)
  * Update kernel packaging to support forward porting kernels (LP: #1902957)
    - [Debian] Update for leader included in BACKPORT_SUFFIX
  * Avoid double newline when running insertchanges (LP: #1903293)
    - [Packaging] insertchanges: avoid double newline
  * EFI: Fails when BootCurrent entry does not exist (LP: #1899993)
    - efivarfs: Replace invalid slashes with exclamation marks in dentries.
  * CVE-2020-14351
    - perf/core: Fix race in the perf_mmap_close() function
  * raid10: Block discard is very slow, causing severe delays for mkfs and
    fstrim operations (LP: #1896578)
    - md: add md_submit_discard_bio() for submitting discard bio
    - md/raid10: extend r10bio devs to raid disks
    - md/raid10: pull codes that wait for blocked dev into one function
    - md/raid10: improve raid10 discard request
    - md/raid10: improve discard request for far layout
    - dm raid: fix discard limits for raid1 and raid10
    - dm raid: remove unnecessary discard limits for raid10
  * Bionic: btrfs: kernel BUG at /build/linux-
    eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233! (LP: #1902254)
    - btrfs: drop unnecessary offset_in_page in extent buffer helpers
    - btrfs: extent_io: do extra check for extent buffer read write functions
    - btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
    - btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
    - btrfs: ctree: check key order before merging tree blocks
  * Ethernet no link lights after reboot (Intel i225-v 2.5G) (LP: #1902578)
    - igc: Add PHY power management control
  * Undetected Data corruption in MPI workloads that use VSX for reductions on
    POWER9 DD2.1 systems (LP: #1902694)
    - powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation
    - selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load
      workaround
  * [20.04 FEAT] Support/enhancement of NVMe IPL (LP: #1902179)
    - s390: nvme ipl
    - s390: nvme reipl
    - s390/ipl: support NVMe IPL kernel para...

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released
Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Groovy):
status: Fix Committed → Fix Released