linux-aws-5.19 hibernation tasks sometimes fail to freeze

Bug #2006620 reported by Francis Ginther
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux-aws (Ubuntu)
New
Undecided
Unassigned

Bug Description

Hibernation on AWS instances with jammy/5.19.0-1019-aws sometimes fails due to the following failure to freeze:

Feb 1 01:09:05 ip-172-31-54-178 kernel: [ 443.247854] PM: hibernation: hibernation entry
Feb 1 01:09:05 ip-172-31-54-178 kernel: [ 443.347353] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Feb 1 01:09:05 ip-172-31-54-178 kernel: [ 443.347355] sched_clock: Marking unstable (442909362062, 1007864825)<-(443748056670, -400707172)
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 443.940489] Filesystems sync: 0.022 seconds
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 443.940492] Freezing user space processes ... (elapsed 0.001 seconds) done.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 443.941611] OOM killer disabled.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 443.943036] PM: hibernation: Marking nosave pages: [mem 0x00000000-0x00000fff]
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 443.943039] PM: hibernation: Marking nosave pages: [mem 0x0009f000-0x000fffff]
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 443.943041] PM: hibernation: Marking nosave pages: [mem 0xbffe8000-0xffffffff]
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 443.943950] PM: hibernation: Basic memory bitmaps created
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 443.943961] PM: hibernation: Preallocating image memory
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 630.782421] PM: hibernation: Allocated 9655951 pages for snapshot
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 630.782424] PM: hibernation: Allocated 38623804 kbytes in 186.83 seconds (206.73 MB/s)
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 630.782426] Freezing remaining freezable tasks ...
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.789826] Freezing of tasks failed after 20.007 seconds (1 tasks refusing to freeze, wq_busy=0):
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792830] task:kswapd0 state:D stack: 0 pid: 328 ppid: 2 flags:0x00004000
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792833] Call Trace:
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792835] <TASK>
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792837] __schedule+0x248/0x5d0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792842] schedule+0x58/0x100
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792844] io_schedule+0x46/0x80
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792846] blk_mq_get_tag+0x117/0x2e0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792852] ? destroy_sched_domains_rcu+0x40/0x40
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792857] __blk_mq_alloc_requests+0xc4/0x1e0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792859] blk_mq_get_new_requests+0xce/0x190
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792861] blk_mq_submit_bio+0x1e6/0x430
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792864] __submit_bio+0xf6/0x190
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792866] submit_bio_noacct_nocheck+0xc2/0x120
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792869] submit_bio_noacct+0x1c5/0x540
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792871] ? sio_write_complete+0x1f0/0x1f0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792875] submit_bio+0x47/0xf0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792877] __swap_writepage+0x157/0x570
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792879] swap_writepage+0x2f/0x80
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792880] pageout+0xe2/0x2f0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792883] shrink_page_list+0x60b/0xc80
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792885] shrink_inactive_list+0x1bc/0x4d0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792886] shrink_lruvec+0x2f5/0x450
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792888] shrink_node_memcgs+0x166/0x1d0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792890] shrink_node+0x156/0x5a0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792891] ? __schedule+0x250/0x5d0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792893] balance_pgdat+0x37b/0x880
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792894] ? zone_watermark_ok_safe+0x4f/0x100
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792899] ? balance_pgdat+0x880/0x880
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792900] kswapd+0x10c/0x1c0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792901] ? balance_pgdat+0x880/0x880
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792903] kthread+0xd1/0xf0
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792906] ? kthread_complete_and_exit+0x20/0x20
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792909] ret_from_fork+0x22/0x30
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792913] </TASK>
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792921]
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 650.792922] Restarting kernel threads ... done.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.516499] PM: hibernation: Basic memory bitmaps freed
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.516502] OOM killer enabled.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.516502] Restarting tasks ... done.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.516740] sched_clock: Marking stable (650508475881, 1007864825)->(651346297015, 170043691)
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.626777] PM: hibernation: hibernation exit
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.670368] systemd[1]: snapd.service: Watchdog timeout (limit 5min)!
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.672610] systemd[1]: snapd.service: Killing process 986 (snapd) with signal SIGABRT.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.719887] systemd[1]: snapd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.719895] systemd[1]: snapd.service: Failed with result 'watchdog'.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.720923] systemd[1]: snapd.service: Consumed 1.714s CPU time.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.796678] systemd[1]: Starting /usr/lib/ec2-hibinit-agent/hibinit-resume...
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.797487] systemd[1]: systemd-hibernate.service: Main process exited, code=exited, status=1/FAILURE
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.797650] systemd[1]: systemd-hibernate.service: Failed with result 'exit-code'.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.798075] systemd[1]: Failed to start Hibernate.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.800047] systemd[1]: Dependency failed for System Hibernation.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.800082] systemd[1]: hibernate.target: Job hibernate.target/start failed with result 'dependency'.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.800130] systemd[1]: systemd-hibernate.service: Consumed 36.142s CPU time.
Feb 1 01:12:33 ip-172-31-54-178 kernel: [ 651.806905] systemd[1]: Stopped target Sleep.

Hibernation testing was performed across 93 instance types with 10 runs each. Each run consists of two hibernation and resume cycles while running a memory allocator. This issue was seen in about 25 of those 930 runs. It was observed on c5.12xlarge, c5d.12xlarge, m5a.large, m5a.xlarge, m5a.2xlarge, m5ad.xlarge, m5ad.2xlarge, r5a.xlarge, r5a.2xlarge, t3a.medium, t3a.xlarge, and t3a.2xlarge.

Revision history for this message
Francis Ginther (fginther) wrote :

Here is the full syslog from which the portion in the bug description was extracted from.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.