[SRU] Hibernation events sometimes missed on repeated attempts

Bug #1864045 reported by Francis Ginther
This bug affects 1 person
Affects           Status       Importance   Assigned to   Milestone
acpid (Ubuntu)    Confirmed    Undecided    Unassigned
  Bionic          Incomplete   Undecided    Unassigned
  Eoan            Won't Fix    Undecided    Unassigned
linux (Ubuntu)    Incomplete   Undecided    Unassigned
  Bionic          Incomplete   Undecided    Unassigned
  Eoan            Won't Fix    Undecided    Unassigned

Bug Description

When testing hibernation / resume on AWS with 5.0 or 5.3 kernels on bionic (using acpid 1:2.0.28-1ubuntu1), we sometimes see failures on repeated attempts. The first hibernation attempt is always triggered, but the next attempt may not be. The result is that the agent never triggers the hibernation process and the instance is forced to shut down after a timeout period.

Two workarounds have been identified. The first is to restart acpid during the resume handler. The second is to use the latest upstream acpid (as of Feb 1, 2020). This second workaround indicates there may be a patch missing in the acpid in bionic (1:2.0.28-1ubuntu1) to work with the 5.0+ kernels.
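The first workaround could be sketched roughly as follows. This is only an illustration, not the handler used in the testing above: it assumes the resume handler is invoked systemd-sleep style with a phase ("pre" or "post") and an action ("suspend", "hibernate", ...), and that acpid runs as a systemd unit named "acpid". The function names are hypothetical.

```python
import subprocess


def should_restart(phase, action):
    # Assumption: the handler receives a phase ("pre"/"post") and an action
    # ("suspend", "hibernate", "hybrid-sleep"), as systemd-sleep hooks do.
    return phase == "post" and action in ("hibernate", "hybrid-sleep")


def on_sleep_event(phase, action, run=subprocess.run):
    """Restart acpid after resume; return True if a restart was requested."""
    if not should_restart(phase, action):
        return False
    # Assumption: the daemon is managed as the systemd unit "acpid".
    run(["systemctl", "restart", "acpid"], check=True)
    return True


def main(argv):
    # Hypothetical entry point: argv[1] is the phase, argv[2] the action.
    if len(argv) == 3:
        on_sleep_event(argv[1], argv[2])
    return 0
```

A hook script would end with a call such as main(sys.argv). The `run` parameter exists only so the restart command can be replaced in a dry run.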

To reproduce this problem:

1) Launch a c4, c5, m4, m5, r4, or r5 instance type with a 5.0 or 5.3 kernel on a bionic image with on-demand hibernation support enabled.
2) Hibernate and resume the instance, ensuring the system is fully resumed afterward and the swap file has been removed.
3) Hibernate and resume another time. The hibernate should be triggered immediately and the instance should become unresponsive as it saves state to disk.
4) Resume the instance; it should come back with the same processes running.
5) Repeat 3) - 4) as necessary.
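The steps above can be driven from a script. The following is a minimal sketch, assuming a boto3 EC2 client is passed in as `ec2`; the function name `hibernate_cycles` and the `wait_resumed` callback are hypothetical. It records how long each stop takes, on the assumption that a missed hibernation event shows up as an unusually long stop (the forced shutdown after the timeout). It does not check that the same processes are still running after resume.

```python
import time


def hibernate_cycles(ec2, instance_id, cycles=3,
                     wait_resumed=lambda: None, clock=time.monotonic):
    """Hibernate and resume an instance repeatedly.

    Returns the number of seconds each stop took, one entry per cycle.
    """
    stop_times = []
    for _ in range(cycles):
        started = clock()
        # Hibernate=True requests hibernation instead of a normal stop.
        ec2.stop_instances(InstanceIds=[instance_id], Hibernate=True)
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
        stop_times.append(clock() - started)

        ec2.start_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        # "running" does not mean the guest has fully resumed; the caller
        # supplies wait_resumed to cover step 2 (swap file removed, etc.).
        wait_resumed()
    return stop_times
```

With boto3 installed, the client would typically be created with boto3.client("ec2") and the instance ID supplied by the tester.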
---
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.9
Architecture: amd64
DistroRelease: Ubuntu 18.04
Ec2AMI: ami-0edf3b95e26a682df
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-west-2a
Ec2InstanceType: m4.large
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
Package: acpid 1:2.0.28-1ubuntu1
PackageArchitecture: amd64
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 5.0.0-1025.28-aws 5.0.21
Tags: bionic ec2-images
Uname: Linux 5.0.0-1025-aws x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy lxd netdev plugdev sudo video
_MarkForUpload: True

Revision history for this message
Francis Ginther (fginther) wrote : Dependencies.txt

apport information

tags: added: apport-collected bionic ec2-images
description: updated
Revision history for this message
Francis Ginther (fginther) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Balint Reczey (rbalint) wrote : Re: Hibernation events sometimes missed on repeated attempts

@fginther Are newer releases also affected?

By latest acpid do you mean the package in Focal or upstream's latest commit/release?

Changed in acpid (Ubuntu):
status: New → Incomplete
Changed in acpid (Ubuntu Bionic):
status: New → In Progress
Changed in acpid (Ubuntu Eoan):
status: New → Incomplete
Revision history for this message
Francis Ginther (fginther) wrote :

@rbalint,

In our testing, we used the latest acpid from Debian git, https://salsa.debian.org/debian/acpid:

last commit
commit 2c2edfc267a69258cff406bd663b305fbee35187 (HEAD -> master, tag: debian/2.0.32-1, origin/master, origin/HEAD)
Author: Josue Ortega <email address hidden>
Date: Sun Aug 18 18:42:41 2019 -0600

    Set release to unstable 2.0.32-1

tag: debian/2.0.32-1

This corresponds to this commit: https://salsa.debian.org/debian/acpid/-/commit/2c2edfc267a69258cff406bd663b305fbee35187

Balint Reczey (rbalint)
Changed in acpid (Ubuntu Bionic):
assignee: nobody → Balint Reczey (rbalint)
Revision history for this message
Balint Reczey (rbalint) wrote :

Marked Focal as fixed since it has the acpid version tested to be OK; working on finding the changes that need to be SRUed.

summary: - Hibernation events sometimes missed on repeated attempts
+ [SRU] Hibernation events sometimes missed on repeated attempts
Changed in acpid (Ubuntu):
status: Incomplete → Fix Released
Balint Reczey (rbalint)
Changed in acpid (Ubuntu):
status: Fix Released → Confirmed
Changed in acpid (Ubuntu Eoan):
status: Incomplete → Confirmed
Changed in acpid (Ubuntu Bionic):
status: In Progress → Incomplete
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1864045

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Bionic):
status: New → Incomplete
Changed in linux (Ubuntu Eoan):
status: New → Incomplete
Revision history for this message
Balint Reczey (rbalint) wrote :

I have tried the 'fixed' acpid version and also tried various versions with 5.0 and 5.3 kernels, but the issue does not seem to be fixed.

I've backported acpid 1:2.0.32-1ubuntu1 in ppa:rbalint/scratch2 which is practically the same as 2.0.32-1 compiled on Ubuntu, but the second hibernation attempt still fails.

Package versions used:
linux-aws-wip/5.3.0.1012.13
acpid/1:2.0.32-1ubuntu1~18.04.0~rbalint3 from ppa:rbalint/scratch2
ec2-hibinit-agent/1.0.0-0ubuntu8~18.04.0~rbalint3 from ppa:rbalint/scratch2

Instance type: c4.large

Vanilla Bionic, i.e. kernel 4.15.0-1060-aws and Bionic's acpid, hibernates twice without any issue.

Could you please add detailed reproduction steps that show how the new acpid was used to fix the issue?

For now it looks like a kernel regression and bisecting the Linux commit changing the behaviour would be highly useful.

Revision history for this message
Andrea Righi (arighi) wrote :

@rbalint if you can reproduce the problem easily, it would be interesting to monitor the received ACPI events via acpi_listen.

What I see during my tests is that acpi_listen is always showing the sleep events, meaning that the kernel receives them correctly at least, and then the failure happens in the delivery of these sleep events to the proper user-space daemon (acpid). So my guess is that something wrong is happening in the communication between kernel and user-space to deliver these events.

Just to make sure, when you say "the second hibernation attempt still fails" you mean that the system is still up & running (you can still ssh on it) and the sleep event is lost / not delivered properly, right?
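For logging across several hibernate/resume cycles, a small script can do roughly what acpi_listen does. This is a sketch under assumptions: it reads from acpid's UNIX socket at the default path /var/run/acpid.socket, and it assumes the hibernation request arrives as an event whose class starts with "button/sleep". The sample event string in the docstring is illustrative, not taken from this report.

```python
import socket

# Assumed default socket path of acpid.
ACPID_SOCKET = "/var/run/acpid.socket"


def sleep_events(lines):
    """Yield only sleep-button events.

    Illustrative input line: "button/sleep SBTN 00000080 00000000"
    """
    for line in lines:
        fields = line.split()
        if fields and fields[0].startswith("button/sleep"):
            yield line.strip()


def listen(path=ACPID_SOCKET):
    """Yield raw event lines from acpid's socket until it closes."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        with sock.makefile("r") as stream:
            for line in stream:
                yield line
```

Iterating over sleep_events(listen()) and printing each event with a timestamp would show whether a sleep event was emitted for every hibernation request.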

Revision history for this message
Balint Reczey (rbalint) wrote : Re: [Bug 1864045] Re: [SRU] Hibernation events sometimes missed on repeated attempts

> @rbalint if you can reproduce the problem easily, it would be
> interesting to monitor the received ACPI events via acpi_listen.

I believe the reproduction is very easy for everyone.

OTOH, I could not reproduce the newer acpid fixing the problem. I'd like
to have the reproduction steps for that experiment.

> What I see during my tests is that acpi_listen is always showing the
> sleep events, meaning that the kernel receives them correctly at least,
> and then the failure happens in the delivery of these sleep events to
> the proper user-space daemon (acpid). So my guess is that something
> wrong is happening in the communication between kernel and user-space to
> deliver these events.

Since only the kernel changed it may be a regression in the kernel or
a change in kernel's behaviour that is still valid but acpid somehow
breaks.
Please either provide the reproduction steps where a different acpid
fixes the issue or point at the change in the kernel to which acpid
should adapt.
I believe this can be found by bisecting the kernel, but I don't have
the setup to do it efficiently myself.

> Just to make sure, when you say "the second hibernation attempt still
> fails" you mean that the system is still up & running (you can still ssh
> on it) and the sleep event is lost / not delivered properly, right?

Yes, exactly. I can still log back in to the system after a few seconds.

Revision history for this message
Andrea Righi (arighi) wrote :

@rbalint unfortunately bisecting the kernel is not a trivial task... there are many changes between the stock 4.15 and the 5.0 kernels and the process is probably going to take a long time. I'll check if it's possible to identify only a subset of potential commits that might have caused this problem.

The steps that I used to verify that the problem was fixed (or at least I thought it was fixed) were pretty easy: I got acpid from https://salsa.debian.org/debian/acpid.git (version 2.0.32-1), recompiled it, moved it to /usr/sbin/acpid (replacing the stock 2.0.28) and then tested multiple hibernate/resume cycles via the AWS APIs.

With this "custom" acpid I wasn't able to trigger any failure. It's also worth mentioning that I was using 5.0.0-1019-aws. I'll repeat my tests again with the latest bionic aws 5.0 kernel and will check if I can reproduce the failures.

Revision history for this message
Brian Murray (brian-murray) wrote :

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release.

Changed in acpid (Ubuntu Eoan):
status: Confirmed → Won't Fix
Changed in linux (Ubuntu Eoan):
status: Incomplete → Won't Fix
Balint Reczey (rbalint)
Changed in acpid (Ubuntu Bionic):
assignee: Balint Reczey (rbalint) → nobody