[linux-azure] Batch hibernate and resume IO requests

Bug #1904458 reported by Joseph Salisbury
This bug affects 1 person
Affects               Status        Importance  Assigned to    Milestone
linux-azure (Ubuntu)  Fix Released  Undecided   Unassigned
  Focal               Fix Released  Undecided   Marcelo Cerri
  Groovy              Fix Released  Undecided   Marcelo Cerri

Bug Description

[Impact]

Microsoft would like to request the following upstream commit in all releases supported on Azure. This commit reduces a significant delay in hibernation/resume:

55c4478a8f0e ("PM: hibernate: Batch hibernate and resume IO requests")

Details of this commit:
The hibernate and resume paths submit an individual IO request for each page of data, so the commit uses blk_plug to improve batching of these requests.

Testing this change across hibernate and resume cycles consistently shows the IO requests being merged, with more than an order of magnitude improvement in hibernate and resume speed.

One hibernate and resume cycle with 16GB of 32GB RAM in use takes around 21 minutes before the change and around 1 minute after it, on a system with limited storage IOPS.
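
To illustrate why merging matters on an IOPS-limited disk, here is a rough userspace sketch in Python. It is not kernel code and does not reproduce what blk_plug does internally; it only models the effect of coalescing adjacent page-sized requests. The function names, the 4 KiB page size, the `max_pages` merge limit, and the reading of "16GB" as 16 GiB of uncompressed image written one request per page are all assumptions made for the example.

```python
from typing import List, Tuple

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def merge_contiguous(pages: List[int], max_pages: int = 256) -> List[Tuple[int, int]]:
    """Coalesce queued page-sized requests into (start_page, page_count) requests.

    `pages` holds page indexes in submission order. A page is merged into the
    previous request only if it directly follows it and the request is still
    below `max_pages` (a hypothetical per-request size limit).
    """
    merged: List[Tuple[int, int]] = []
    for page in pages:
        if merged:
            start, count = merged[-1]
            if start + count == page and count < max_pages:
                merged[-1] = (start, count + 1)
                continue
        merged.append((page, 1))
    return merged


def implied_iops(image_bytes: int, seconds: float) -> float:
    """Requests per second if every page is its own request (no merging)."""
    return (image_bytes / PAGE_SIZE) / seconds


# Twelve page-sized requests collapse into two larger ones.
queue = list(range(0, 8)) + list(range(100, 104))
print(merge_contiguous(queue))

# Request rate implied by the "21 minutes" figure under the assumptions above.
print(round(implied_iops(16 * 1024**3, 21 * 60)))
```

Under these assumptions the unpatched case works out to roughly 3,300 requests per second, which is in the range one would expect from a disk with a low IOPS cap. Once requests are merged, the same data needs far fewer requests, which fits the reported drop to about a minute.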

[Test Case]

Follow the steps described here:

https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1880032/comments/14

Then compare the time needed to save the instance state.

[Where problems could occur]

Considering the size of the patch and the code it touches, the most likely issues are related to hibernation itself. Since hibernation isn't officially supported, any hibernation issues shouldn't even be considered regressions.

Marcelo Cerri (mhcerri) wrote :

Targeting only 5.4 and 5.8 since 4.15 will not support hibernation.

no longer affects: linux-azure-4.15 (Ubuntu)
Changed in linux-azure (Ubuntu Focal):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Groovy):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Focal):
status: New → In Progress
Changed in linux-azure (Ubuntu Groovy):
status: New → In Progress
Marcelo Cerri (mhcerri) wrote :

I tested the change on Azure with Standard L8s_v2 (8 vCPUs, 64 GiB memory) VMs, running the 5.4 linux-azure kernel on bionic and the 5.8 kernel on groovy.

During my tests the VMs continued to hibernate successfully with the patch, and the time needed to save the VM state went down from 25 seconds to around 3 seconds.

description: updated
Ian May (ian-may)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu Groovy):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1039.41

---------------
linux-azure (5.4.0-1039.41) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1039.41 -proposed tracker (LP: #1912002)

  * Use Azure host for time keeping in all images (LP: #1896784)
    - hv_utils: return error if host timesysnc update is stale
    - hv_utils: drain the timesync packets on onchannelcallback

  * [linux-azure] Batch hibernate and resume IO requests (LP: #1904458)
    - PM: hibernate: Batch hibernate and resume IO requests

linux-azure (5.4.0-1038.40) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1038.40 -proposed tracker (LP: #1911317)

  [ Ubuntu: 5.4.0-63.71 ]

  * focal/linux: 5.4.0-63.71 -proposed tracker (LP: #1911333)
  * overlay: permission regression in 5.4.0-51.56 due to patches related to
    CVE-2020-16120 (LP: #1900141)
    - ovl: do not fail because of O_NOATIME
  * Focal update: v5.4.79 upstream stable release (LP: #1907151)
    - net/mlx5: Use async EQ setup cleanup helpers for multiple EQs
    - net/mlx5: poll cmd EQ in case of command timeout
    - net/mlx5: Fix a race when moving command interface to events mode
    - net/mlx5: Add retry mechanism to the command entry index allocation
  * Kernel 5.4.0-56 Wi-Fi does not connect (LP: #1906770)
    - mt76: fix fix ampdu locking
  * [Ubuntu 21.04 FEAT] mpt3sas: Request to include the patch set which supports
    topology where zoning is enabled in expander (LP: #1899802)
    - scsi: mpt3sas: Define hba_port structure
    - scsi: mpt3sas: Allocate memory for hba_port objects
    - scsi: mpt3sas: Rearrange _scsih_mark_responding_sas_device()
    - scsi: mpt3sas: Update hba_port's sas_address & phy_mask
    - scsi: mpt3sas: Get device objects using sas_address & portID
    - scsi: mpt3sas: Rename transport_del_phy_from_an_existing_port()
    - scsi: mpt3sas: Get sas_device objects using device's rphy
    - scsi: mpt3sas: Update hba_port objects after host reset
    - scsi: mpt3sas: Set valid PhysicalPort in SMPPassThrough
    - scsi: mpt3sas: Handling HBA vSES device
    - scsi: mpt3sas: Add bypass_dirty_port_flag parameter
    - scsi: mpt3sas: Handle vSES vphy object during HBA reset
    - scsi: mpt3sas: Add module parameter multipath_on_hba
    - scsi: mpt3sas: Bump driver version to 35.101.00.00

  [ Ubuntu: 5.4.0-62.70 ]

  * focal/linux: 5.4.0-62.70 -proposed tracker (LP: #1911144)
  * CVE-2020-28374
    - SAUCE: target: fix XCOPY NAA identifier lookup
  * Packaging resync (LP: #1786013)
    - update dkms package versions

 -- Marcelo Henrique Cerri <email address hidden> Mon, 18 Jan 2021 09:44:59 -0300

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.8.0-1020.22

---------------
linux-azure (5.8.0-1020.22) groovy; urgency=medium

  * groovy/linux-azure: 5.8.0-1020.22 -proposed tracker (LP: #1912236)

  * [linux-azure] Batch hibernate and resume IO requests (LP: #1904458)
    - PM: hibernate: Batch hibernate and resume IO requests

  [ Ubuntu: 5.8.0-41.46 ]

  * groovy/linux: 5.8.0-41.46 -proposed tracker (LP: #1912219)
  * Groovy update: upstream stable patchset 2020-12-17 (LP: #1908555) // nvme
    drive fails after some time (LP: #1910866)
    - Revert "nvme-pci: remove last_sq_tail"
  * initramfs unpacking failed (LP: #1835660)
    - SAUCE: lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
  * overlay: permission regression in 5.4.0-51.56 due to patches related to
    CVE-2020-16120 (LP: #1900141)
    - ovl: do not fail because of O_NOATIME

 -- Kleber Sacilotto de Souza <email address hidden> Tue, 19 Jan 2021 10:09:49 +0100

Changed in linux-azure (Ubuntu Groovy):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.8.0-1022.24+21.04.2

---------------
linux-azure (5.8.0-1022.24+21.04.2) hirsute; urgency=medium

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Introduce the new NVIDIA 460-server series and update the 460 series
    (LP: #1913200)
    - [Config] dkms-versions -- drop NVIDIA 435 and 455
    - [Config] dkms-versions -- add the 460-server nvidia driver

  * switch to an autogenerated nvidia series based core via dkms-versions
    (LP: #1912803)
    - [Packaging] nvidia -- use dkms-versions to define versions built
    - [Packaging] update-version-dkms -- maintain flags fields
    - [Config] dkms-versions -- add transitional/skip information for nvidia
      packages

linux-azure (5.8.0-1022.24+21.04.1) hirsute; urgency=medium

  * hirsute/linux-azure: 5.8.0-1022.24+21.04.1 -proposed tracker (LP: #1914674)

  * Boot fails: failed to validate module [nls_iso8859_1] BTF: -22
    (LP: #1911359)
    - SAUCE: x86/entry: build thunk_$(BITS) only if CONFIG_PREEMPTION=y

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Miscellaneous Ubuntu changes
    - sync dkms nvidia-server 418 and 450 to -release

  [ Ubuntu: 5.8.0-1022.24 ]

  * groovy/linux-azure: 5.8.0-1022.24 -proposed tracker (LP: #1914675)
  * groovy/linux: 5.8.0-43.49 -proposed tracker (LP: #1914689)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * Exploitable vulnerabilities in AF_VSOCK implementation (LP: #1914668)
    - vsock: fix the race conditions in multi-transport support

  [ Ubuntu: 5.8.0-1020.22 ]

  * groovy/linux-azure: 5.8.0-1020.22 -proposed tracker (LP: #1912236)
  * [linux-azure] Batch hibernate and resume IO requests (LP: #1904458)
    - PM: hibernate: Batch hibernate and resume IO requests
  * groovy/linux: 5.8.0-41.46 -proposed tracker (LP: #1912219)
  * Groovy update: upstream stable patchset 2020-12-17 (LP: #1908555) // nvme
    drive fails after some time (LP: #1910866)
    - Revert "nvme-pci: remove last_sq_tail"
  * initramfs unpacking failed (LP: #1835660)
    - SAUCE: lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
  * overlay: permission regression in 5.4.0-51.56 due to patches related to
    CVE-2020-16120 (LP: #1900141)
    - ovl: do not fail because of O_NOATIME

  [ Ubuntu: 5.8.0-1019.21 ]

  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * groovy/linux: 5.8.0-38.43 -proposed tracker (LP: #1911143)
  * CVE-2020-28374
    - SAUCE: target: fix XCOPY NAA identifier lookup
  * Packaging resync (LP: #1786013)
    - update dkms package versions

 -- Seth Forshee <email address hidden> Thu, 11 Feb 2021 14:24:50 -0600

Changed in linux-azure (Ubuntu):
status: New → Fix Released