Activity log for bug #1968062

Date Who What changed Old value New value Message
2022-04-06 16:07:43 Francis Ginther bug added bug
2022-04-06 16:07:43 Francis Ginther attachment added serial console log https://bugs.launchpad.net/bugs/1968062/+attachment/5577675/+files/aws-jammy-all-c3.8xlarge-9-1.txt
2022-04-06 16:10:01 Francis Ginther attachment added Last screenshot before hibernation https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1968062/+attachment/5577676/+files/pre-hibernation.04.jpg
2022-04-06 16:10:44 Francis Ginther attachment added First screenshot after resume initiated https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1968062/+attachment/5577677/+files/post-hibernate.01.jpg
2022-04-06 16:13:50 Francis Ginther attachment added Second screenshot after resume initiated https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1968062/+attachment/5577701/+files/post-hibernate.12.jpg
2022-04-06 16:17:15 Francis Ginther attachment added Third screenshot after resume initiated https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1968062/+attachment/5577703/+files/post-hibernate.16.jpg
2022-04-20 21:12:59 Matthew Ruffell bug added subscriber Matthew Ruffell
2022-08-13 07:51:41 Matthew Ruffell bug added subscriber gerald.yang
2022-08-13 07:51:55 Matthew Ruffell summary jammy/linux-aws hibernation timeout on xen instances Jammy / Kinetic: Enable Hibernation for Xen Based Instance Types
2022-08-13 07:52:00 Matthew Ruffell nominated for series Ubuntu Kinetic
2022-08-13 07:52:00 Matthew Ruffell bug task added linux-aws (Ubuntu Kinetic)
2022-08-13 07:52:00 Matthew Ruffell nominated for series Ubuntu Jammy
2022-08-13 07:52:00 Matthew Ruffell bug task added linux-aws (Ubuntu Jammy)
2022-08-13 07:52:08 Matthew Ruffell linux-aws (Ubuntu Jammy): status New In Progress
2022-08-13 07:52:11 Matthew Ruffell linux-aws (Ubuntu Kinetic): status New In Progress
2022-08-13 07:52:14 Matthew Ruffell linux-aws (Ubuntu Jammy): importance Undecided Critical
2022-08-13 07:52:17 Matthew Ruffell linux-aws (Ubuntu Kinetic): importance Undecided Critical
2022-08-13 07:52:28 Matthew Ruffell linux-aws (Ubuntu Jammy): assignee gerald.yang (gerald-yang-tw)
2022-08-13 07:52:31 Matthew Ruffell linux-aws (Ubuntu Kinetic): assignee Matthew Ruffell (mruffell)
2022-08-13 07:53:40 Matthew Ruffell description

Hibernation testing of jammy/linux-aws 5.15.0-1003-aws is failing on all Xen instance types (c3/c4/i3/m3/m4/r3/r4/t2). The failure happens while attempting to resume from the first attempt to hibernate. Tests on Nitro instance types (c5/m5/r5/t3) all pass.

After the resume, the system is inaccessible via ssh. The console screenshot does change, but the console log obtained from `aws ec2 get-console-output` does not.

[Impact]
Hibernation currently fails for all AWS Xen instance types (c3/c4/i3/m3/m4/r3/r4/t2) with all Jammy 5.15 and Kinetic 5.19 linux-aws kernels.

When attempting to hibernate, the system gets stuck in sync_inodes_one_sb() when processing the rootfs, fails to hibernate, and shuts down. When you start the instance, it starts fresh, and does not resume from the incomplete hibernation image. Networking is also broken, and you cannot ssh in.

Upon review of the jammy/linux-aws git log, it appears that the kernel is missing AWS hibernation enablement patches entirely. These need to be included to get hibernation working.

[Fix]
Hibernation currently works on the Amazon Linux 2 5.15 kernel:

https://github.com/amazonlinux/linux/tree/amazon-5.15.y/mainline

After careful review of the amazon-5.15.y/mainline branch, we have found the below set of patches authored by the Amazon AWS Hibernation team to be minimally sufficient to get hibernation working on both Jammy 5.15 and Kinetic 5.19.

x86: Disable KASLR when Xen is detected
xen: Restore xen-pirqs on resume from hibernation
xen-netfront: call netif_device_attach on resume
xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
xen: restore pirqs on resume from hibernation.
block: xen-blkfront: consider new dom0 features on restore
x86: tsc: avoid system instability in hibernation
xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
Revert "xen: dont fiddle with event channel masking in suspend/resume"
PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
x86/xen: close event channels for PIRQs in system core suspend callback
xen/events: add xen_shutdown_pirqs helper function
x86/xen: save and restore steal clock
xen/time: introduce xen_{save,restore}_steal_clock
xen-netfront: add callbacks for PM suspend and hibernation support
xen-blkfront: add callbacks for PM suspend and hibernation
x86/xen: add system core suspend and resume callbacks
x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
xenbus: add freeze/thaw/restore callbacks support
xen/manage: introduce helper function to know the on-going suspend mode
xen/manage: keep track of the on-going suspend mode

These patches will be carried as SAUCE patches, and their subjects marked with "UBUNTU: SAUCE [aws]". Their upstream is the Amazon Hibernation team, with the repo being the Amazon Linux 2 kernel repo.

[Testcase]
1. Log into Amazon EC2.
2. Select Launch Instance.
3. Under Instance Type, select any from (c3/c4/i3/m3/m4/r3/r4/t2). I suggest t2.medium.
4. Select the "Ubuntu 22.04 LTS HVM (SSD type)" AMI in the quicklaunch pane.
5. Select your SSH keypair.
6. In storage, select 20 GB. Go to the advanced tab, and set Encrypted: Yes.
7. Under Advanced Settings for the instance, set "Stop - Hibernate" to Enable.
8. Create the Instance. SSH in.
9. Wait 5 minutes for hibinit-agent to create /swap-hibinit swapfile and configure grub.
10. Start a screen session. Echo some text and then detach with Ctrl-a d.
11. Log out from instance.
12. In EC2, select "Instance State" > "Hibernate".
13. Wait 30 seconds to one minute. The state will go from "Stopping" to "Stopped".
14. Start the instance again.
15. SSH in.
16. Attempt to resume screen session with "screen -r".

If you are not able to ssh into the instance, hibernation has failed. If ssh works and the screen session is still running, hibernation was successful.

Alternatively, the CPC team can run their Hibernation testsuite over Jammy and Kinetic.

We have built test kernels for Jammy and Kinetic with the patches, and they are available in the PPA below:

If you try to hibernate and resume with the test kernels, hibernation is successful.

[Where problems could occur]
We are adding a significant amount of code to the Xen subsystem, spread across many commits. This code has not been mainlined, and is instead maintained out of tree by the Amazon AWS Hibernation team.

The changes target hibernation, block devices, and clock devices, specific to those used on AWS Xen instances. Most of these patches have been applied to Xenial, Bionic, Focal and other series for a long time, but some patches are new for 5.15 onward.

The changes will only target linux-aws to try to limit regression risk to AWS users, and any regressions will be limited to users of Xen-based instance types (c3/c4/i3/m3/m4/r3/r4/t2), covering both Xen 4.2 and Xen 4.11.

If a regression were to occur, the instance would likely fail to hibernate, and at worst, write an incomplete hibernation image to the swapfile. The kernel will see this on start, and instead of resuming from the hibernation image, will start fresh. It is unlikely to cause any filesystem corruption on the rootfs, but any in-progress computations at the time of hibernation could be lost. The current broken behaviour breaks networking, and users would have to power cycle the instance a few times before they can ssh in again.
2022-08-13 07:53:50 Matthew Ruffell tags jammy kinetic sts
2022-08-15 04:28:31 Matthew Ruffell description

[Impact]
Hibernation currently fails for all AWS Xen instance types (c3/c4/i3/m3/m4/r3/r4/t2) with all Jammy 5.15 and Kinetic 5.19 linux-aws kernels.

When attempting to hibernate, the system gets stuck in sync_inodes_one_sb() when processing the rootfs, fails to hibernate, and shuts down. When you start the instance, it starts fresh, and does not resume from the incomplete hibernation image. Networking is also broken, and you cannot ssh in.

Upon review of the jammy/linux-aws git log, it appears that the kernel is missing AWS hibernation enablement patches entirely. These need to be included to get hibernation working.

[Fix]
Hibernation currently works on the Amazon Linux 2 5.15 kernel:

https://github.com/amazonlinux/linux/tree/amazon-5.15.y/mainline

After careful review of the amazon-5.15.y/mainline branch, we have found the below set of patches authored by the Amazon AWS Hibernation team to be minimally sufficient to get hibernation working on both Jammy 5.15 and Kinetic 5.19.

x86: Disable KASLR when Xen is detected
xen: Restore xen-pirqs on resume from hibernation
xen-netfront: call netif_device_attach on resume
xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
xen: restore pirqs on resume from hibernation.
block: xen-blkfront: consider new dom0 features on restore
x86: tsc: avoid system instability in hibernation
xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
Revert "xen: dont fiddle with event channel masking in suspend/resume"
PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
x86/xen: close event channels for PIRQs in system core suspend callback
xen/events: add xen_shutdown_pirqs helper function
x86/xen: save and restore steal clock
xen/time: introduce xen_{save,restore}_steal_clock
xen-netfront: add callbacks for PM suspend and hibernation support
xen-blkfront: add callbacks for PM suspend and hibernation
x86/xen: add system core suspend and resume callbacks
x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
xenbus: add freeze/thaw/restore callbacks support
xen/manage: introduce helper function to know the on-going suspend mode
xen/manage: keep track of the on-going suspend mode

These patches will be carried as SAUCE patches, and their subjects marked with "UBUNTU: SAUCE [aws]". Their upstream is the Amazon Hibernation team, with the repo being the Amazon Linux 2 kernel repo.

[Testcase]
1. Log into Amazon EC2.
2. Select Launch Instance.
3. Under Instance Type, select any from (c3/c4/i3/m3/m4/r3/r4/t2). I suggest t2.medium.
4. Select the "Ubuntu 22.04 LTS HVM (SSD type)" AMI in the quicklaunch pane.
5. Select your SSH keypair.
6. In storage, select 20 GB. Go to the advanced tab, and set Encrypted: Yes.
7. Under Advanced Settings for the instance, set "Stop - Hibernate" to Enable.
8. Create the Instance. SSH in.
9. Wait 5 minutes for hibinit-agent to create /swap-hibinit swapfile and configure grub.
10. Start a screen session. Echo some text and then detach with Ctrl-a d.
11. Log out from instance.
12. In EC2, select "Instance State" > "Hibernate".
13. Wait 30 seconds to one minute. The state will go from "Stopping" to "Stopped".
14. Start the instance again.
15. SSH in.
16. Attempt to resume screen session with "screen -r".

If you are not able to ssh into the instance, hibernation has failed. If ssh works and the screen session is still running, hibernation was successful.

Alternatively, the CPC team can run their Hibernation testsuite over Jammy and Kinetic.

We have built test kernels for Jammy and Kinetic with the patches, and they are available in the PPA below:

If you try to hibernate and resume with the test kernels, hibernation is successful.

[Where problems could occur]
We are adding a significant amount of code to the Xen subsystem, spread across many commits. This code has not been mainlined, and is instead maintained out of tree by the Amazon AWS Hibernation team.

The changes target hibernation, block devices, and clock devices, specific to those used on AWS Xen instances. Most of these patches have been applied to Xenial, Bionic, Focal and other series for a long time, but some patches are new for 5.15 onward.

The changes will only target linux-aws to try to limit regression risk to AWS users, and any regressions will be limited to users of Xen-based instance types (c3/c4/i3/m3/m4/r3/r4/t2), covering both Xen 4.2 and Xen 4.11.

If a regression were to occur, the instance would likely fail to hibernate, and at worst, write an incomplete hibernation image to the swapfile. The kernel will see this on start, and instead of resuming from the hibernation image, will start fresh. It is unlikely to cause any filesystem corruption on the rootfs, but any in-progress computations at the time of hibernation could be lost. The current broken behaviour breaks networking, and users would have to power cycle the instance a few times before they can ssh in again.

[Impact]
Hibernation currently fails for all AWS Xen instance types (c3/c4/i3/m3/m4/r3/r4/t2) with all Jammy 5.15 and Kinetic 5.19 linux-aws kernels.

When attempting to hibernate, the system gets stuck in sync_inodes_one_sb() when processing the rootfs, fails to hibernate, and shuts down. When you start the instance, it starts fresh, and does not resume from the incomplete hibernation image. Networking is also broken, and you cannot ssh in.

Upon review of the jammy/linux-aws git log, it appears that the kernel is missing AWS hibernation enablement patches entirely. These need to be included to get hibernation working.

[Fix]
Hibernation currently works on the Amazon Linux 2 5.15 kernel:

https://github.com/amazonlinux/linux/tree/amazon-5.15.y/mainline

After careful review of the amazon-5.15.y/mainline branch, we have found the below set of patches authored by the Amazon AWS Hibernation team to be minimally sufficient to get hibernation working on both Jammy 5.15 and Kinetic 5.19.

xen: Restore xen-pirqs on resume from hibernation
xen-netfront: call netif_device_attach on resume
xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
xen: restore pirqs on resume from hibernation.
block: xen-blkfront: consider new dom0 features on restore
x86: tsc: avoid system instability in hibernation
xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
Revert "xen: dont fiddle with event channel masking in suspend/resume"
PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
x86/xen: close event channels for PIRQs in system core suspend callback
xen/events: add xen_shutdown_pirqs helper function
x86/xen: save and restore steal clock
xen/time: introduce xen_{save,restore}_steal_clock
xen-netfront: add callbacks for PM suspend and hibernation support
xen-blkfront: add callbacks for PM suspend and hibernation
x86/xen: add system core suspend and resume callbacks
x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
xenbus: add freeze/thaw/restore callbacks support
xen/manage: introduce helper function to know the on-going suspend mode
xen/manage: keep track of the on-going suspend mode

These patches will be carried as SAUCE patches, and their subjects marked with "UBUNTU: SAUCE [aws]". Their upstream is the Amazon Hibernation team, with the repo being the Amazon Linux 2 kernel repo.

[Testcase]
1. Log into Amazon EC2.
2. Select Launch Instance.
3. Under Instance Type, select any from (c3/c4/i3/m3/m4/r3/r4/t2). I suggest t2.medium.
4. Select the "Ubuntu 22.04 LTS HVM (SSD type)" AMI in the quicklaunch pane.
5. Select your SSH keypair.
6. In storage, select 20 GB. Go to the advanced tab, and set Encrypted: Yes.
7. Under Advanced Settings for the instance, set "Stop - Hibernate" to Enable.
8. Create the Instance. SSH in.
9. Wait 5 minutes for hibinit-agent to create /swap-hibinit swapfile and configure grub.
10. Start a screen session. Echo some text and then detach with Ctrl-a d.
11. Log out from instance.
12. In EC2, select "Instance State" > "Hibernate".
13. Wait 30 seconds to one minute. The state will go from "Stopping" to "Stopped".
14. Start the instance again.
15. SSH in.
16. Attempt to resume screen session with "screen -r".

If you are not able to ssh into the instance, hibernation has failed. If ssh works and the screen session is still running, hibernation was successful.

Alternatively, the CPC team can run their Hibernation testsuite over Jammy and Kinetic.

We have built test kernels for Jammy and Kinetic with the patches, and they are available in the PPA below:

https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/aws-hibernate-test

If you try to hibernate and resume with the test kernels, hibernation is successful.

[Where problems could occur]
We are adding a significant amount of code to the Xen subsystem, spread across many commits. This code has not been mainlined, and is instead maintained out of tree by the Amazon AWS Hibernation team.

The changes target hibernation, block devices, and clock devices, specific to those used on AWS Xen instances. Most of these patches have been applied to Xenial, Bionic, Focal and other series for a long time, but some patches are new for 5.15 onward.

The changes will only target linux-aws to try to limit regression risk to AWS users, and any regressions will be limited to users of Xen-based instance types (c3/c4/i3/m3/m4/r3/r4/t2), covering both Xen 4.2 and Xen 4.11.

If a regression were to occur, the instance would likely fail to hibernate, and at worst, write an incomplete hibernation image to the swapfile. The kernel will see this on start, and instead of resuming from the hibernation image, will start fresh. It is unlikely to cause any filesystem corruption on the rootfs, but any in-progress computations at the time of hibernation could be lost. The current broken behaviour breaks networking, and users would have to power cycle the instance a few times before they can ssh in again.
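
Steps 12 to 16 of the test case above could also be driven from a script instead of the EC2 console. The following is only a minimal sketch, not part of the bug report: the function names are hypothetical, it assumes the AWS CLI is installed and configured, and it only builds the command lines (`stop-instances --hibernate`, the `instance-stopped` and `instance-running` waiters, `start-instances`) so they can be inspected before anything is run. The "not resumed" verdict is an assumption for the case the description does not name explicitly (ssh works but the screen session is gone).

```python
from __future__ import annotations

# Xen-based instance families listed in the bug description.
XEN_FAMILIES = {"c3", "c4", "i3", "m3", "m4", "r3", "r4", "t2"}


def is_xen_family(instance_type: str) -> bool:
    """Return True if an instance type such as 't2.medium' is in a listed Xen family."""
    family = instance_type.split(".", 1)[0].lower()
    return family in XEN_FAMILIES


def hibernate_cycle_commands(instance_id: str) -> list[list[str]]:
    """Build the AWS CLI commands for one hibernate/resume cycle (steps 12-14)."""
    ids = ["--instance-ids", instance_id]
    return [
        # Step 12: "Instance State" > "Hibernate".
        ["aws", "ec2", "stop-instances", *ids, "--hibernate"],
        # Step 13: wait for "Stopping" to become "Stopped".
        ["aws", "ec2", "wait", "instance-stopped", *ids],
        # Step 14: start the instance again and wait until it is running.
        ["aws", "ec2", "start-instances", *ids],
        ["aws", "ec2", "wait", "instance-running", *ids],
    ]


def verdict(ssh_ok: bool, screen_session_alive: bool) -> str:
    """Apply the pass/fail rule from the test case (steps 15-16)."""
    if not ssh_ok:
        return "failed"
    if screen_session_alive:
        return "successful"
    # Assumption: ssh works but the session is gone, i.e. the instance booted fresh.
    return "not resumed"
```

To execute the cycle, each command list could be passed to `subprocess.run(cmd, check=True)` in order; the ssh and `screen -r` checks of steps 15 and 16 are left manual here.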
2022-08-17 18:25:53 Tim Gardner linux-aws (Ubuntu Jammy): status In Progress Fix Committed
2022-08-17 18:25:58 Tim Gardner linux-aws (Ubuntu Kinetic): status In Progress Fix Committed
2022-08-27 09:09:47 Matthew Ruffell tags jammy kinetic sts jammy kinetic sts verification-done-jammy
2022-08-31 08:38:44 Launchpad Janitor linux-aws (Ubuntu Jammy): status Fix Committed Fix Released
2022-08-31 08:38:44 Launchpad Janitor cve linked 2021-33061
2022-09-26 14:48:16 Launchpad Janitor linux-aws (Ubuntu Kinetic): status Fix Committed Fix Released