[linux-azure] Enable Hibernation on The 18.04 and 20.04 5.4 Kernels

Bug #1880032 reported by Joseph Salisbury
This bug affects 2 people
Affects                 Status         Importance   Assigned to   Milestone
linux-azure (Ubuntu)    Fix Released   Undecided    Unassigned
Focal                   Fix Released   Undecided    Unassigned

Bug Description

Microsoft would like to request commits to enable VM hibernation in the Azure 5.4 kernels for 18.04 and 20.04.

Some of the commits needed to enable VM hibernation were included in mainline 5.4 and older. However, 24 commits that were added in 5.5 and later are also required in the 5.4 kernel. The requested commits are:

38dce4195f0d x86/hyperv: Properly suspend/resume reenlightenment notifications
2351f8d295ed PM: hibernate: Freeze kernel threads in software_resume()
421f090c819d x86/hyperv: Suspend/resume the VP assist page for hibernation
1a06d017fb3f Drivers: hv: vmbus: Fix Suspend-to-Idle for Generation-2 VM
3704a6a44579 PM: hibernate: Propagate the return value of hibernation_restore()
54e19d34011f hv_utils: Add the support of hibernation
ffd1d4a49336 hv_utils: Support host-initiated hibernation request
3e9c72056ed5 hv_utils: Support host-initiated restart request
9fc3c01a1fae6 Tools: hv: Reopen the devices if read() or write() returns errors
05bd330a7fd8 x86/hyperv: Suspend/resume the hypercall page for hibernation
382a46221757 video: hyperv_fb: Fix hibernation for the deferred IO feature
e2379b30324c Input: hyperv-keyboard: Add the support of hibernation
ac82fc8327088 PCI: hv: Add hibernation support
a8e37506e79a PCI: hv: Reorganize the code in preparation of hibernation
1349401ff1aa4 clocksource/drivers/hyper-v: Suspend/resume Hyper-V clocksource for hibernation
af13f9ed6f9a HID: hyperv: Add the support of hibernation
25bd2b2f1f053 hv_balloon: Add the support of hibernation
b96f86534fa31 x86/hyperv: Implement hv_is_hibernation_supported()
4df4cb9e99f83 x86/hyperv: Initialize clockevents earlier in CPU onlining
0efeea5fb1535 hv_netvsc: Add the support of hibernation
2194c2eb6717f hv_sock: Add the support of hibernation
1ecf302021040 video: hyperv_fb: Add the support of hibernation
56fb105859345 scsi: storvsc: Add the support of hibernation
f2c33ccacb2d4 PCI/PM: Always return devices to D0 when thawing
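
A backported commit usually gets a new hash in the distribution tree, so one way to track a request like this is to compare commit subjects rather than IDs. The following is a minimal sketch and not part of the original report: `missing_commits` is a hypothetical helper, and the list of applied subjects is assumed to come from something like `git log --format=%s` on the target branch.

```python
def missing_commits(requested, applied_subjects):
    """Return the requested (hash, subject) pairs whose subject is absent.

    requested: list of (upstream_hash, subject) tuples from the bug description.
    applied_subjects: iterable of commit subject lines from the target branch.
    """
    applied = {subject.strip() for subject in applied_subjects}
    return [(h, subj) for h, subj in requested if subj.strip() not in applied]


if __name__ == "__main__":
    requested = [
        ("54e19d34011f", "hv_utils: Add the support of hibernation"),
        ("ac82fc8327088", "PCI: hv: Add hibernation support"),
        ("f2c33ccacb2d4", "PCI/PM: Always return devices to D0 when thawing"),
    ]
    # Hypothetical branch contents: only the PCI/PM fix is already present.
    branch = ["PCI/PM: Always return devices to D0 when thawing"]
    for commit_hash, subject in missing_commits(requested, branch):
        print(commit_hash, subject)
```

Matching on exact subjects is a simplification; a backport whose subject was edited would be reported as missing and would need a manual check.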

Dexuan Cui (decui) wrote :

There is another important bug fix for hibernation:
net/mlx5: Fix crash upon suspend/resume (https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=8fc3e29be9248048f449793502c15af329f35c6e).

So far the fix is only present in the net.git tree, but I expect it will be in the mainline tree’s v5.8-rc1 (or even v5.7, if we’re lucky).

Please consider picking it up. Thanks!

Dexuan Cui (decui) wrote :

FYI: the patch "net/mlx5: Fix crash upon suspend/resume" is in v5.7 now (i.e. today's latest mainline): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.7&id=8fc3e29be9248048f449793502c15af329f35c6e

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-azure (Ubuntu):
status: New → Confirmed
Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Focal):
status: New → In Progress
Marcelo Cerri (mhcerri) wrote :

The following patches weren't necessary because they were already applied via upstream stable updates:

f2c33ccacb2d PCI/PM: Always return devices to D0 when thawing
Via https://bugs.launchpad.net/bugs/1858427

2351f8d295ed PM: hibernate: Freeze kernel threads in software_resume()
Via https://bugs.launchpad.net/bugs/1877592

1a06d017fb3f Drivers: hv: vmbus: Fix Suspend-to-Idle for Generation-2 VM
Via https://bugs.launchpad.net/bugs/1877592

Marcelo Cerri (mhcerri) wrote :
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1020.20

---------------
linux-azure (5.4.0-1020.20) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1020.20 -proposed tracker (LP: #1885048)

  * linux-azure: Update SGX version to version LD_1.33 (LP: #1881338)
    - SAUCE: linux-azure: Update SGX to version LD_1.33
    - SAUCE: ubuntu/sgx: Add module alias for ACPI device INT0E0C

  * [linux-azure] Enable Hibernation on The 18.04 and 20.04 5.4 Kernels
    (LP: #1880032)
    - x86/hyperv: Initialize clockevents earlier in CPU onlining
    - scsi: storvsc: Add the support of hibernation
    - video: hyperv_fb: Add the support of hibernation
    - hv_sock: Add the support of hibernation
    - hv_netvsc: Add the support of hibernation
    - x86/hyperv: Implement hv_is_hibernation_supported()
    - hv_balloon: Add the support of hibernation
    - HID: hyperv: Add the support of hibernation
    - PCI: hv: Reorganize the code in preparation of hibernation
    - PCI: hv: Add hibernation support
    - clocksource/drivers/hyper-v: Suspend/resume Hyper-V clocksource for
      hibernation
    - Input: hyperv-keyboard: Add the support of hibernation
    - video: hyperv_fb: Fix hibernation for the deferred IO feature
    - Tools: hv: Reopen the devices if read() or write() returns errors
    - hv_utils: Support host-initiated restart request
    - hv_utils: Support host-initiated hibernation request
    - hv_utils: Add the support of hibernation
    - x86/hyperv: Suspend/resume the hypercall page for hibernation
    - PM: hibernate: Propagate the return value of hibernation_restore()
    - x86/hyperv: Suspend/resume the VP assist page for hibernation
    - net/mlx5: Fix crash upon suspend/resume

  [ Ubuntu: 5.4.0-40.44 ]

  * linux-oem-5.6-tools-common and -tools-host should be dropped (LP: #1881120)
    - [Packaging] Add Conflicts/Replaces to remove linux-oem-5.6-tools-common and
      -tools-host
  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
  * Slow send speed with Intel I219-V on Ubuntu 18.04.1 (LP: #1802691)
    - e1000e: Disable TSO for buffer overrun workaround
  * CVE-2020-0543
    - UBUNTU/SAUCE: x86/speculation/srbds: do not try to turn mitigation off when
      not supported
  * Realtek 8723DE [10ec:d723] subsystem [10ec:d738] disconnects unsolicitedly
    when Bluetooth is paired: Reason: 23=IEEE8021X_FAILED (LP: #1878147)
    - SAUCE: Revert "UBUNTU: SAUCE: rtw88: Move driver IQK to set channel before
      association for 11N chip"
    - SAUCE: Revert "UBUNTU: SAUCE: rtw88: fix rate for a while after being
      connected"
    - SAUCE: Revert "UBUNTU: SAUCE: rtw88: No retry and report for auth and assoc"
    - SAUCE: Revert "UBUNTU: SAUCE: rtw88: 8723d: Add coex support"
    - rtw88: add a debugfs entry to dump coex's info
    - rtw88: add a debugfs entry to enable/disable coex mechanism
    - rtw88: 8723d: Add coex support
    - SAUCE: rtw88: coex: 8723d: set antanna control owner
    - SAUCE: rtw88: coex: 8723d: handle BT inquiry cases
    - SAUCE: rtw88: fix EAPOL 4-way failure by finish IQK earlier
  * CPU stress test fails with focal kernel (LP: #1867900)
    - [Config] Disable hisi_sec2 tempora...

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1022.22

---------------
linux-azure (5.4.0-1022.22) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1022.22 -proposed tracker (LP: #1887060)

  [ Ubuntu: 5.4.0-42.46 ]

  * focal/linux: 5.4.0-42.46 -proposed tracker (LP: #1887069)
  * linux 4.15.0-109-generic network DoS regression vs -108 (LP: #1886668)
    - SAUCE: Revert "netprio_cgroup: Fix unlimited memory leak of v2 cgroups"

linux-azure (5.4.0-1021.21) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1021.21 -proposed tracker (LP: #1885845)

  * module intel_sgx appears to be blacklisted by the kernel. (LP: #1862201)
    - Revert "UBUNTU: [Packaging] linux-azure: Prevent intel_sgx from being
      automatically loaded"
    - [Packaging] linux-azure: Divert conf files blacklisting intel_sgx

  * Add XDP support to hv_netvsc driver (LP: #1877654)
    - hv_netvsc: Add XDP support
    - hv_netvsc: Update document for XDP support
    - hv_netvsc: Fix XDP refcnt for synthetic and VF NICs

  * Request to include two NUMA related commits in Azure kernels (LP: #1880975)
    - PCI: hv: Decouple the func definition in hv_dr_state from VSP message
    - PCI: hv: Add support for protocol 1.3 and support PCI_BUS_RELATIONS2

  [ Ubuntu: 5.4.0-41.45 ]

  * focal/linux: 5.4.0-41.45 -proposed tracker (LP: #1885855)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * CVE-2019-19642
    - kernel/relay.c: handle alloc_percpu returning NULL in relay_open
  * CVE-2019-16089
    - SAUCE: nbd_genl_status: null check for nla_nest_start
  * CVE-2020-11935
    - aufs: do not call i_readcount_inc()
  * ip_defrag.sh in net from ubuntu_kernel_selftests failed with 5.0 / 5.3 / 5.4
    kernel (LP: #1826848)
    - selftests: net: ip_defrag: ignore EPERM
  * Update lockdown patches (LP: #1884159)
    - SAUCE: acpi: disallow loading configfs acpi tables when locked down
  * seccomp_bpf fails on powerpc (LP: #1885757)
    - SAUCE: selftests/seccomp: fix ptrace tests on powerpc
  * Introduce the new NVIDIA 418-server and 440-server series, and update the
    current NVIDIA drivers (LP: #1881137)
    - [packaging] add signed modules for the 418-server and the 440-server
      flavours

 -- Khalid Elmously <email address hidden> Fri, 10 Jul 2020 01:51:58 -0400

Changed in linux-azure (Ubuntu):
status: Confirmed → Fix Released
Dexuan Cui (decui) wrote :

Unfortunately, this commit breaks hibernation:
0a14dbaa0736 ("video: hyperv_fb: Fix hibernation for the deferred IO feature"):
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/focal/commit/?h=Ubuntu-azure-5.4.0-1022.22&id=0a14dbaa0736a6021c02e74d42cf3a7ca5438da6

We should include the patch only if the kernel also includes
a4ddb11d297e ("video: hyperv: hyperv_fb: Support deferred IO for Hyper-V frame buffer driver").

Now I'm seeing a hang/panic issue when hibernating the VM ("5.4.0-1022-azure #22-Ubuntu"):
[ 67.736061] ------------[ cut here ]------------
[ 67.736068] WARNING: CPU: 5 PID: 1358 at kernel/workqueue.c:3040 __flush_work+0x1b5/0x1d0
[ 67.736068] Modules linked in: xt_owner iptable_security xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper joydev hid_generic hyperv_fb cfbfillrect hid_hyperv intel_rapl_perf serio_raw hyperv_keyboard pata_acpi hv_netvsc hv_balloon hid cfbimgblt pci_hyperv cfbcopyarea hv_utils pci_hyperv_intf sch_fq_codel drm drm_panel_orientation_quirks i2c_core ip_tables x_tables autofs4
[ 67.736088] CPU: 5 PID: 1358 Comm: bash Not tainted 5.4.0-1022-azure #22-Ubuntu
[ 67.736089] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[ 67.736091] RIP: 0010:__flush_work+0x1b5/0x1d0
[ 67.736092] Code: f0 eb e3 4d 8b 7c 24 20 e9 f3 fe ff ff 8b 0b 48 8b 53 08 83 e1 08 48 0f ba 2b 03 80 c9 f0 e9 4f ff ff ff 0f 0b e9 68 ff ff ff <0f> 0b 45 31 f6 e9 5e ff ff ff e8 ec e0 fd ff 66 66 2e 0f 1f 84 00
[ 67.736095] RSP: 0018:ffffa7ce8a8ffb78 EFLAGS: 00010246
[ 67.736096] RAX: 0000000000000000 RBX: ffff8be3621f02a0 RCX: 0000000000000000
[ 67.736096] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8be3621f02a0
[ 67.736097] RBP: ffffa7ce8a8ffbf0 R08: 0000000000000000 R09: 00000000ff010101
[ 67.736098] R10: ffff8be363f7a320 R11: 0000000000000001 R12: ffff8be3621f02a0
[ 67.736098] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffffbc390fd1
[ 67.736099] FS: 00007f6df35fe740(0000) GS:ffff8be375d40000(0000) knlGS:0000000000000000
[ 67.736100] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 67.736100] CR2: 0000561eef2c1b50 CR3: 0000000e40a14004 CR4: 00000000001706e0
[ 67.736102] Call Trace:
[ 67.736108] __cancel_work_timer+0x107/0x180
[ 67.736119] cancel_delayed_work_sync+0x13/0x20
[ 67.736121] hvfb_suspend+0x48/0x80 [hyperv_fb]
[ 67.736122] vmbus_suspend+0x2a/0x40
[ 67.736125] dpm_run_callback+0x5b/0x150
[ 67.736127] __device_suspend_noirq+0x9e/0x2f0
[ 67.736128] dpm_suspend_noirq+0x101/0x2d0
[ 67.736130] dpm_suspend_end+0x53/0x80
[ 67.736132] hibernation_snapshot+0xd8/0x460
[ 67.736133] hibernate.cold+0x6d/0x1f6
[ 67.736135] state_store+0xde/0xe0
[ 67.736138] kobj_attr_store+0x12/0x20
[ 67.736141] sysfs_kf_write+0x3e/0x50
[ 67.736142] kernfs_fop_write+0xda/0x1b0
[ 67.736145] __vfs_write+0x1b/0x40
[ 67.736147] vfs_write+0xb9/0x1a0
[ 67.736149] ksys_write+0x67/0xe0
[ 67.736150] __x64_sys_...

Dexuan Cui (decui) wrote :

Unfortunately, this commit breaks hibernation:
0a14dbaa0736 ("video: hyperv_fb: Fix hibernation for the deferred IO feature"):
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/focal/commit/?h=Ubuntu-azure-5.4.0-1022.22&id=0a14dbaa0736a6021c02e74d42cf3a7ca5438da6

The kernel here doesn't include
a4ddb11d297e ("video: hyperv: hyperv_fb: Support deferred IO for Hyper-V frame buffer driver"), so it should not include
0a14dbaa0736 ("video: hyperv_fb: Fix hibernation for the deferred IO feature").

Marcelo Cerri (mhcerri) wrote :

Hi, Dexuan.

Do you agree with reverting this commit?

Marcelo Cerri (mhcerri) wrote :

Dexuan, can you give the steps to reproduce the issue? I couldn't reproduce it on a local Hyper-V guest.

Dexuan Cui (decui) wrote :

Hi Marcelo, yes, please revert
0a14dbaa0736 ("video: hyperv_fb: Fix hibernation for the deferred IO feature").
No other change is needed.

In the future, when a4ddb11d297e is included, 0a14dbaa0736 should also be included.
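
The rule stated here (carry the fix only when its prerequisite is also present) can be checked mechanically. The sketch below is illustrative only: `unmet_prerequisites` and the `PREREQUISITES` table are hypothetical names, and the applied subjects are assumed to have been collected from the kernel branch beforehand.

```python
# Fix subject -> prerequisite subject, taken from the two commits named above.
PREREQUISITES = {
    "video: hyperv_fb: Fix hibernation for the deferred IO feature":
        "video: hyperv: hyperv_fb: Support deferred IO for Hyper-V frame buffer driver",
}


def unmet_prerequisites(applied_subjects, prerequisites=PREREQUISITES):
    """Return (fix, prerequisite) pairs where the fix is applied without its prerequisite."""
    applied = set(applied_subjects)
    return [
        (fix, dep)
        for fix, dep in prerequisites.items()
        if fix in applied and dep not in applied
    ]


if __name__ == "__main__":
    # Mirrors the situation described for 5.4.0-1022-azure: fix present, feature absent.
    branch = ["video: hyperv_fb: Fix hibernation for the deferred IO feature"]
    for fix, dep in unmet_prerequisites(branch):
        print("applied without prerequisite:", fix)
        print("  needs:", dep)
```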

Dexuan Cui (decui) wrote :

To reproduce the issue, I created an Ubuntu 20.04 VM on Azure (the kernel version was "5.4.0-1022-azure #22-Ubuntu"), ran "echo disk > /sys/power/state" in the VM, and then checked the VM's Azure serial console. I found the warning in comment #8, and suspending couldn't finish normally (it looks like the VM got a fatal page fault error later). I suppose the issue can also be reproduced on a local Hyper-V host.
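
When the serial console log is saved to a file, the kernel warning can be picked out automatically. This is a rough sketch, assuming the log uses the same "WARNING: CPU: ... PID: ... at ..." layout as the trace quoted earlier in this report; `find_warnings` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   WARNING: CPU: 5 PID: 1358 at kernel/workqueue.c:3040 __flush_work+0x1b5/0x1d0
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at (?P<location>\S+) (?P<symbol>\S+)"
)


def find_warnings(log_lines):
    """Return one dict per kernel WARNING line found in the given log lines."""
    hits = []
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match:
            hits.append({
                "cpu": int(match.group("cpu")),
                "pid": int(match.group("pid")),
                "location": match.group("location"),
                "symbol": match.group("symbol"),
            })
    return hits


if __name__ == "__main__":
    sample = [
        "[ 67.736061] ------------[ cut here ]------------",
        "[ 67.736068] WARNING: CPU: 5 PID: 1358 at kernel/workqueue.c:3040 "
        "__flush_work+0x1b5/0x1d0",
    ]
    for hit in find_warnings(sample):
        print(hit["location"], hit["symbol"])
```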

Dexuan Cui (decui) wrote :

Detailed steps to repro the issue on Azure:
1. Create a VM with the image "Ubuntu Server 20.04 LTS - Gen1". Any VM size should be fine. Here I use "Standard E4-2ds_v4 (2 vcpus, 32 GiB memory)".

2. Add an extra disk of 64GB to the VM via Azure portal.

3. Log in to the VM via ssh and check the kernel version: here I get 5.4.0-1022-azure.

4. In the VM, the 64GB disk can be sdc. Let's create a swap partition in it, i.e. sdc1.

5. mkswap /dev/sdc1
    root@decui-tmp-2004:~# mkswap /dev/sdc1
    Setting up swapspace version 1, size = 64 GiB (68718424064 bytes)
    no label, UUID=544831e4-72ab-4d2c-81aa-6dac3a8e20ad

6. Add the swap partition info into /etc/fstab:
    UUID=544831e4-72ab-4d2c-81aa-6dac3a8e20ad none swap sw 0 0

7. Use "swapon -a; swapon -s" to confirm that the swap partition works.

8. Add the kernel parameter resume= into /etc/default/grub.d/50-cloudimg-settings.cfg:
     GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 earlyprintk=ttyS0 resume=UUID=544831e4-72ab-4d2c-81aa-6dac3a8e20ad ignore_loglevel no_console_suspend"

   Note: here I also add "ignore_loglevel no_console_suspend", which are *required* to see the error messages during hibernation.

9. Comment out the only line in /etc/default/grub.d/40-force-partuuid.cfg:
     ####GRUB_FORCE_PARTUUID=bf00dea3-136e-49cb-a640-0df7ce49d6db
   Note: this step is required, otherwise the generated grub.cfg doesn't contain the "initrd ..." line, which is required for resuming to work.

10. Run "update-grub2; reboot".
     Note: this 'reboot' might be a must, because we'll need to re-generate the initramfs when the running kernel has the resume= parameter.

11. Log in to the VM again and run "update-initramfs -u".

12. Run "echo disk > /sys/power/state". Note: it is best to run this command from the Azure serial console (set a password for root and use it to log in via the serial console) so we can easily watch what happens.

root@decui-tmp-2004:~# echo disk > /sys/power/state
[ 67.838749] PM: hibernation entry
[ 68.266627] Filesystems sync: 0.041 seconds
[ 68.271740] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 68.281528] OOM killer disabled.
[ 68.286475] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
[ 68.293459] PM: Marking nosave pages: [mem 0x0009f000-0x000fffff]
[ 68.300306] PM: Marking nosave pages: [mem 0x3fff0000-0xffffffff]
[ 68.308250] PM: Basic memory bitmaps created
[ 68.313082] PM: Preallocating image memory... done (allocated 298659 pages)
[ 69.303864] PM: Allocated 1194636 kbytes in 0.98 seconds (1219.01 MB/s)
[ 69.311605] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 69.322486] serial 00:04: disabled
[ 69.345193] ------------[ cut here ]------------
[ 69.345199] WARNING: CPU: 1 PID: 1495 at kernel/workqueue.c:3040 __flush_work+0x1b5/0x1d0
...
[ 70.047238] CPU1 is up
[ 70.054474] hv_utils: KVP IC version 4.0
[ 70.056763] hv_utils: Shutdown IC version 3.2
[ 70.061009] hv_balloon: Using Dynamic Memory protocol version 2.0

It looks like the kernel hangs here forever. Normally the VM is expected to save the state to disk and power off and later when we star...
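
Before running the final step, it can help to confirm that the earlier setup steps took effect. The following is a minimal sketch under stated assumptions: the three helper names are hypothetical, and on a live VM their inputs would be the contents of /proc/cmdline, /proc/swaps and /sys/power/state respectively.

```python
def check_cmdline(cmdline):
    """Return the parameters from the repro steps that are missing from a kernel command line."""
    params = cmdline.split()
    missing = []
    if not any(p.startswith("resume=") for p in params):
        missing.append("resume=")
    for flag in ("ignore_loglevel", "no_console_suspend"):
        if flag not in params:
            missing.append(flag)
    return missing


def swap_devices(proc_swaps_text):
    """Return device names from /proc/swaps-style text (the first line is a header)."""
    rows = proc_swaps_text.strip().splitlines()[1:]
    return [row.split()[0] for row in rows if row.split()]


def hibernation_offered(power_state_text):
    """Return True if 'disk' is among the states listed in /sys/power/state."""
    return "disk" in power_state_text.split()


if __name__ == "__main__":
    # Sample inputs modelled on the steps above; not read from a real system.
    cmdline = ("console=tty1 console=ttyS0 earlyprintk=ttyS0 "
               "resume=UUID=544831e4-72ab-4d2c-81aa-6dac3a8e20ad "
               "ignore_loglevel no_console_suspend")
    print("missing parameters:", check_cmdline(cmdline))
    print("hibernation offered:", hibernation_offered("freeze mem disk\n"))
```

These checks only cover configuration; they do not detect the hyperv_fb warning itself, which shows up only during the hibernation attempt.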

Marcelo Cerri (mhcerri) wrote :

Thanks for the detailed instructions, Dexuan. "ignore_loglevel no_console_suspend" was the missing piece for me.

I'm still running tests but so far the results are good with the reverted commit.

Marcelo Cerri (mhcerri) wrote :

Since this bug was already released, I am handling the panic fix at https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1891931.
