linux-aws: Xen / hibernation: xen-netfront panic + resume hangs

Bug #1881869 reported by Andrea Righi
This bug affects 1 person
Affects               Status        Importance  Assigned to   Milestone
linux-aws (Ubuntu)    Fix Released  High        Andrea Righi
  Eoan                Fix Released  High        Andrea Righi
  Focal               Fix Released  High        Andrea Righi

Bug Description

[Impact]

During our AWS testing, we were able to trigger hibernation failures on some Xen-based instance types.

One problem is a kernel panic in the resume callback of the xen-netfront driver. A workaround is to compile the driver as a module and reload it at resume. We were already doing this reload with the bionic kernel, which had this driver compiled as a module, but for some reason the eoan and focal kernels had it compiled in statically.
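A minimal sketch of how such a reload could be wired up, assuming the driver is built as a module (`xen_netfront`) and that hibernation goes through systemd, which runs executables in `/lib/systemd/system-sleep/` with a phase (`pre` or `post`) and an action (such as `hibernate`). The pre/post split is an assumption of this sketch: the module is unloaded before the image is written so its resume callback never runs, then loaded again after resume. It is not the hook Ubuntu actually ships.

```python
#!/usr/bin/env python3
"""Hypothetical systemd-sleep hook that reloads xen-netfront around hibernation.

systemd-sleep calls each hook as: HOOK <pre|post> <action>.
"""
import subprocess
import sys

MODULE = "xen_netfront"  # assumption: CONFIG_XEN_NETDEV_FRONTEND=m

# Only full hibernation is handled; other sleep actions are left alone
# in this sketch.
HIBERNATE_ACTIONS = {"hibernate"}


def plan(phase, action):
    """Return the modprobe commands to run for a phase/action pair."""
    if action not in HIBERNATE_ACTIONS:
        return []
    if phase == "pre":
        # Unload before the image is written, so the driver's own
        # resume callback is never invoked on the way back up.
        return [["modprobe", "-r", MODULE]]
    if phase == "post":
        # Load a fresh instance of the driver after resume.
        return [["modprobe", MODULE]]
    return []


def main(argv):
    if len(argv) != 3:
        print("usage: %s pre|post ACTION" % argv[0], file=sys.stderr)
        return 2
    for cmd in plan(argv[1], argv[2]):
        subprocess.run(cmd, check=True)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Keeping the decision logic in `plan()` separate from `subprocess.run()` makes it possible to check what the hook would do for each phase without touching the running kernel.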

Other issues showed up as hangs on resume; these seem to be prevented by the new Xen/hibernation patch set that Anchal posted to the LKML:
https://<email address hidden>/

This new patch set is still under review, but according to our tests it does seem to fix some of these hangs on resume.

In addition, we can further improve hibernation reliability and performance by applying the updated swapoff optimization patch (which has been merged upstream).
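The swap-related changes touch swap readahead. As background only, the upper bound on how many pages are read per swap-in attempt is governed by the `vm.page-cluster` sysctl, which the kernel documents as logarithmic (0 means 1 page, i.e. readahead disabled; the default 3 means 8 pages). The sketch below maps that value to a window size; it does not reproduce the swapoff patch itself, and reading the page size via `os.sysconf` is an assumption about the target system.

```python
"""Map vm.page-cluster to the upper bound of the swap readahead window."""
import os

PAGE_CLUSTER_PATH = "/proc/sys/vm/page-cluster"


def readahead_window(page_cluster, page_size=4096):
    """Return (pages, bytes) for a given page-cluster value.

    page-cluster is a log2 value: 0 -> 1 page, 1 -> 2 pages,
    3 -> 8 pages, and so on.
    """
    if page_cluster < 0:
        raise ValueError("page-cluster must be >= 0")
    pages = 1 << page_cluster
    return pages, pages * page_size


def current_window(path=PAGE_CLUSTER_PATH):
    """Read the live sysctl and use the system page size."""
    with open(path, encoding="ascii") as fh:
        value = int(fh.read().strip())
    return readahead_window(value, os.sysconf("SC_PAGE_SIZE"))
```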

[Test case]

Create a Xen instance in AWS and hibernate/resume it multiple times.
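The test case can be automated as a loop. In the sketch below, `run_cycles()` is a hypothetical, dependency-free harness that stops at the first failing cycle. `ec2_ops()` shows one possible way to drive it through the EC2 API with boto3 (`stop_instances(..., Hibernate=True)` plus the `instance_stopped`, `instance_running` and `instance_status_ok` waiters). The assumption is that a hang on resume shows up as a waiter timeout and a panic shows up as failing instance status checks.

```python
"""Hypothetical harness: hibernate/resume an instance N times."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CycleResult:
    iteration: int
    ok: bool
    error: str = ""


def run_cycles(n: int,
               hibernate: Callable[[], None],
               resume: Callable[[], None],
               is_healthy: Callable[[], bool],
               stop_on_failure: bool = True) -> List[CycleResult]:
    """Run n cycles; an exception or a failed health check marks a cycle bad."""
    results: List[CycleResult] = []
    for i in range(1, n + 1):
        try:
            hibernate()
            resume()
            ok = is_healthy()
            results.append(CycleResult(i, ok, "" if ok else "health check failed"))
        except Exception as exc:  # e.g. a waiter timing out on a hung resume
            results.append(CycleResult(i, False, repr(exc)))
        if stop_on_failure and not results[-1].ok:
            break
    return results


def ec2_ops(instance_id: str, region: str):
    """Build EC2-backed callables; boto3 is third-party, so import lazily."""
    import boto3
    from botocore.exceptions import WaiterError

    ec2 = boto3.client("ec2", region_name=region)
    ids = [instance_id]

    def hibernate() -> None:
        ec2.stop_instances(InstanceIds=ids, Hibernate=True)
        ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)

    def resume() -> None:
        ec2.start_instances(InstanceIds=ids)
        ec2.get_waiter("instance_running").wait(InstanceIds=ids)

    def is_healthy() -> bool:
        # Assumption: a panicked guest eventually fails instance status checks.
        try:
            ec2.get_waiter("instance_status_ok").wait(InstanceIds=ids)
            return True
        except WaiterError:
            return False

    return hibernate, resume, is_healthy
```

For example, `run_cycles(20, *ec2_ops("i-0123456789abcdef0", "us-east-1"))` runs 20 cycles against a placeholder instance ID and region.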

[Fix]

The following set of fixes can be used to improve hibernation performance and reliability:
 - new Xen/hibernation patch set from the LKML (see link above)
 - config change to compile xen-netfront as a module
 - new swapoff optimization patch

[Regression potential]

The xen-netfront config change and the new swapoff optimization patch are fairly safe: one is a config change that affects only the xen-netfront driver, and the other is a clean cherry-pick of an upstream commit.

The new Xen/hibernation update is fairly large and its patches are still under review; however, according to our tests it does seem to fix some of the hang issues (it definitely makes things better). Moreover, all of the changes affect only Xen and are restricted to the hibernation/resume code paths, so the overall regression potential is minimal.

[See also]

NOTE: the fix from LP: #1879711 (disable CONFIG_DMA_CMA) was also applied during our tests, and it is also required to make hibernation stable on Xen.
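Both configuration requirements (xen-netfront built as a module, and CONFIG_DMA_CMA disabled) can be checked against a kernel config file. This is a minimal sketch that assumes the standard `.config` syntax (`CONFIG_FOO=m`, `# CONFIG_FOO is not set`) and, by default, the Ubuntu location `/boot/config-$(uname -r)`. An option that is absent from the file is treated as not set, matching kernel semantics.

```python
"""Check a kernel config for the settings discussed in this report."""
import os
import sys

# Expected values: "m" = built as a module, None = not set.
REQUIRED = {
    "CONFIG_XEN_NETDEV_FRONTEND": "m",  # xen-netfront as a module
    "CONFIG_DMA_CMA": None,             # disabled per LP: #1879711
}


def parse_config(text):
    """Parse .config text into {option: value or None}."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            values[line[2:-len(" is not set")]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            values[name] = value
    return values


def check(values, required=REQUIRED):
    """Return a list of human-readable mismatches (empty if all is well)."""
    problems = []
    for name, want in required.items():
        got = values.get(name)  # absent means "not set"
        if got != want:
            problems.append(f"{name}: expected {want or 'not set'}, "
                            f"found {got or 'not set'}")
    return problems


def main(argv):
    path = argv[1] if len(argv) > 1 else f"/boot/config-{os.uname().release}"
    with open(path, encoding="utf-8") as fh:
        problems = check(parse_config(fh.read()))
    for problem in problems:
        print(problem)
    return 1 if problems else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```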

Andrea Righi (arighi)
Changed in linux-aws (Ubuntu Eoan):
importance: Undecided → High
Changed in linux-aws (Ubuntu Focal):
importance: Undecided → High
assignee: nobody → Andrea Righi (arighi)
Changed in linux-aws (Ubuntu Eoan):
assignee: nobody → Andrea Righi (arighi)
Changed in linux-aws (Ubuntu):
assignee: nobody → Andrea Righi (arighi)
importance: Undecided → High
Andrea Righi (arighi)
summary: - linux-aws: fix Xen / hibernation issues
+ linux-aws: Xen / hibernation: xen-netfront panic + resume hangs
Andrea Righi (arighi)
Changed in linux-aws (Ubuntu Eoan):
status: New → Confirmed
Changed in linux-aws (Ubuntu Focal):
status: New → Confirmed
Changed in linux-aws (Ubuntu):
status: New → Confirmed
Changed in linux-aws (Ubuntu Eoan):
status: Confirmed → Fix Committed
Changed in linux-aws (Ubuntu Focal):
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote:

This bug was fixed in the package linux-aws - 5.4.0-1018.18

---------------
linux-aws (5.4.0-1018.18) focal; urgency=medium

  * focal/linux-aws: 5.4.0-1016.16 -proposed tracker (LP: #1882686)

  * ASoC/amd: add audio driver for amd renoir (LP: #1881046)
    - [Config] aws: do not enable amd renoir ASoC audio

  * Focal update: v5.4.42 upstream stable release (LP: #1879759)
    - [Config] updateconfigs for CC_HAS_WARN_MAYBE_UNINITIALIZED

  * linux-aws: Xen / hibernation: xen-netfront panic + resume hangs
    (LP: #1881869)
    - Revert "UBUNTU SAUCE [aws]: xen: Only restore the ACPI SCI interrupt in
      xen_restore_pirqs."
    - Revert "UBUNTU SAUCE [aws]: xen: restore pirqs on resume from hibernation."
    - Revert "UBUNTU SAUCE [aws]: block: xen-blkfront: consider new dom0 features
      on restore"
    - Revert "UBUNTU SAUCE [aws]: ACPICA: Enable sleep button on ACPI legacy wake"
    - Revert "UBUNTU SAUCE [aws]: mm: swap: improve swap readahead heuristic"
    - Revert "UBUNTU SAUCE [aws] PM / hibernate: reduce memory pressure during
      image writing"
    - Revert "UBUNTU: SAUCE [aws] x86/xen: close event channels for PIRQs in
      system core suspend callback"
    - Revert "UBUNTU: SAUCE [aws] xen/events: add xen_shutdown_pirqs helper
      function"
    - Revert "UBUNTU: SAUCE [aws] x86/xen: save and restore steal clock"
    - Revert "UBUNTU: SAUCE [aws] xen-time-introduce-xen_-save-restore-
      _steal_clock"
    - Revert "UBUNTU: SAUCE [aws] xen-netfront: add callbacks for PM suspend and
      hibernation support"
    - Revert "UBUNTU: SAUCE [aws] x86/xen: add system core suspend and resume
      callbacks"
    - Revert "UBUNTU: SAUCE [aws] x86/xen: Introduce new function to map
      HYPERVISOR_shared_info on Resume"
    - Revert "UBUNTU: SAUCE: xen-blkfront: Fixed blkfront_restore to remove a call
      to negotiate_mq"
    - Revert "UBUNTU: SAUCE: xen-blkfront: add callbacks for PM suspend and
      hibernation"
    - Revert "UBUNTU: SAUCE: xenbus: add freeze/thaw/restore callbacks support"
    - Revert "UBUNTU: SAUCE: xen/manage: introduce helper function to know the on-
      going suspend mode"
    - Revert "UBUNTU: SAUCE: xen/manage: keep track of the on-going suspend mode"
    - xen/blkfront: fix ring info addressing
    - [Config] aws: compile xen-netfront as module
    - SAUCE: mm: swap: properly update readahead statistics in unuse_pte_range()
    - UBUNTU SAUCE [aws]: mm: swap: increase default swap readahead size
    - [Config] aws: compile xen-netfront as module (update the right config)

  * Restore request-based mode to xen-blkfront for AWS kernels (LP: #1801305)
    - SAUCE: xen/manage: keep track of the on-going suspend mode
    - SAUCE: xenbus: add freeze/thaw/restore callbacks support
    - SAUCE: x86/xen: Introduce new function to map HYPERVISOR_shared_info on
      Resume
    - SAUCE: x86/xen: add system core suspend and resume callbacks
    - SAUCE: genirq: Shutdown irq chips in suspend/resume during hibernation
    - SAUCE: xen-blkfront: add callbacks for PM suspend and hibernation
    - SAUCE: xen-netfront: add callbacks for PM suspend and hibernation
    - SAUCE: xen/time: introduce xen_{save,restore}...

Changed in linux-aws (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote:

This bug was fixed in the package linux-aws - 5.3.0-1030.32

---------------
linux-aws (5.3.0-1030.32) eoan; urgency=medium

  * eoan/linux-aws: 5.3.0-1030.32 -proposed tracker (LP: #1885768)

  * aws: disable CONFIG_DMA_CMA (LP: #1879711)
    - [Config] aws: disable CONFIG_DMA_CMA

linux-aws (5.3.0-1029.31) eoan; urgency=medium

  * Binder and ashmem drivers are missing from AWS kernel (LP: #1876165)
    - [Config] enable binder and ashmem as modules

  * linux-aws: Xen / hibernation: xen-netfront panic + resume hangs
    (LP: #1881869)
    - Revert "UBUNTU SAUCE [aws]: xen: Only restore the ACPI SCI interrupt in
      xen_restore_pirqs."
    - Revert "UBUNTU SAUCE [aws]: xen: restore pirqs on resume from hibernation."
    - Revert "UBUNTU SAUCE [aws]: ACPICA: Enable sleep button on ACPI legacy wake"
    - Revert "UBUNTU SAUCE [aws]: mm: swap: improve swap readahead heuristic"
    - Revert "UBUNTU SAUCE [aws] PM / hibernate: reduce memory pressure during
      image writing"
    - Revert "UBUNTU: SAUCE [aws] x86/xen: close event channels for PIRQs in
      system core suspend callback"
    - Revert "UBUNTU: SAUCE [aws] xen/events: add xen_shutdown_pirqs helper
      function"
    - Revert "UBUNTU: SAUCE [aws] x86/xen: save and restore steal clock"
    - Revert "UBUNTU: SAUCE [aws] xen-time-introduce-xen_-save-restore-
      _steal_clock"
    - Revert "UBUNTU: SAUCE [aws] xen-netfront: add callbacks for PM suspend and
      hibernation support"
    - Revert "UBUNTU: SAUCE [aws] x86/xen: add system core suspend and resume
      callbacks"
    - Revert "UBUNTU: SAUCE [aws] x86/xen: Introduce new function to map
      HYPERVISOR_shared_info on Resume"
    - Revert "UBUNTU: SAUCE: xenbus: add freeze/thaw/restore callbacks support"
    - Revert "UBUNTU: SAUCE: xen/manage: introduce helper function to know the on-
      going suspend mode"
    - Revert "UBUNTU: SAUCE: xen/manage: keep track of the on-going suspend mode"
    - xen/blkfront: fix ring info addressing
    - [Config] aws: compile xen-netfront as module
    - mm: swap: properly update readahead statistics in unuse_pte_range()
    - UBUNTU SAUCE [aws]: mm: swap: increase default swap readahead size

  * Restore request-based mode to xen-blkfront for AWS kernels (LP: #1801305)
    - SAUCE: xen/manage: keep track of the on-going suspend mode
    - SAUCE: xenbus: add freeze/thaw/restore callbacks support
    - SAUCE: x86/xen: Introduce new function to map HYPERVISOR_shared_info on
      Resume
    - SAUCE: x86/xen: add system core suspend and resume callbacks
    - SAUCE: genirq: Shutdown irq chips in suspend/resume during hibernation
    - SAUCE: xen-blkfront: add callbacks for PM suspend and hibernation
    - SAUCE: xen-netfront: add callbacks for PM suspend and hibernation
    - SAUCE: xen/time: introduce xen_{save,restore}_steal_clock
    - SAUCE: x86/xen: save and restore steal clock
    - SAUCE: xen: Introduce wrapper for save/restore sched clock offset
    - SAUCE: xen: Update sched clock offset to avoid system instability in
      hibernation

  * Eoan update: upstream stable patchset 2020-06-01 (LP: #1881657)
    - [Config] aws: updateconfigs for CC_HAS_WARN_MAYBE_UNINITIALIZED

  [...

Changed in linux-aws (Ubuntu Eoan):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote:

This bug was fixed in the package linux-aws - 5.4.0-1020.20

---------------
linux-aws (5.4.0-1020.20) focal; urgency=medium

  * focal/linux-aws: 5.4.0-1020.20 -proposed tracker (LP: #1887058)

  [ Ubuntu: 5.4.0-42.46 ]

  * focal/linux: 5.4.0-42.46 -proposed tracker (LP: #1887069)
  * linux 4.15.0-109-generic network DoS regression vs -108 (LP: #1886668)
    - SAUCE: Revert "netprio_cgroup: Fix unlimited memory leak of v2 cgroups"

linux-aws (5.4.0-1019.19) focal; urgency=medium

  * focal/linux-aws: 5.4.0-1019.19 -proposed tracker (LP: #1885843)

  [ Ubuntu: 5.4.0-41.45 ]

  * focal/linux: 5.4.0-41.45 -proposed tracker (LP: #1885855)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * CVE-2019-19642
    - kernel/relay.c: handle alloc_percpu returning NULL in relay_open
  * CVE-2019-16089
    - SAUCE: nbd_genl_status: null check for nla_nest_start
  * CVE-2020-11935
    - aufs: do not call i_readcount_inc()
  * ip_defrag.sh in net from ubuntu_kernel_selftests failed with 5.0 / 5.3 / 5.4
    kernel (LP: #1826848)
    - selftests: net: ip_defrag: ignore EPERM
  * Update lockdown patches (LP: #1884159)
    - SAUCE: acpi: disallow loading configfs acpi tables when locked down
  * seccomp_bpf fails on powerpc (LP: #1885757)
    - SAUCE: selftests/seccomp: fix ptrace tests on powerpc
  * Introduce the new NVIDIA 418-server and 440-server series, and update the
    current NVIDIA drivers (LP: #1881137)
    - [packaging] add signed modules for the 418-server and the 440-server
      flavours

 -- Khalid Elmously <email address hidden> Fri, 10 Jul 2020 01:33:58 -0400

Changed in linux-aws (Ubuntu):
status: Confirmed → Fix Released