Comment 100 for bug 971061

juanmanuel (rockerito99) wrote :

I made further tests, and can now RULE OUT completely Intel Rapid Start from being a factor. So please IGNORE my previous post.

But I found the cause of the problem (though not the solution... yet):

I found that the problem manifests when 16 or more things that would produce an ACPI event happen while the laptop is suspended. When that happens, the motherboard's Embedded Controller stops sending events.

If, while it is asleep, you unplug or plug the laptop 8 times, then the problem manifests.
Each time that you plug or unplug the laptop's PSU, a pair of events is produced, ADP1 (AC power event) and BAT1 (battery event): signals _Q51 or _Q52 (AC) and _Q53 and _Q54 (battery) of the DSDT are executed, which notify the devices ADP1, BAT1 and the CPUs. So that is two GPE events per plug or unplug, for a total of 16 if you plug/unplug 8 times.

Also, if the laptop is unplugged while suspended, an event is produced each time that the battery percentage changes (_Q53 and _Q54). So, if you leave it unplugged and suspended for 8 hours or more (assuming 2% battery drop per hour), the problem then manifests. If it was plugged, and it charges 16% or more of battery, the problem manifests.

LID close/open events count too.
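A rough sketch of the arithmetic above, in Python. The 16-event limit is the threshold observed in these tests, not a documented value, and counting each lid open/close as exactly one event is an assumption. The function names are made up for illustration.

```python
# Threshold observed in the tests above; not a documented EC property.
EVENT_LIMIT = 16


def events_during_sleep(plug_actions=0, battery_percent_change=0, lid_actions=0):
    """Estimate how many EC events pile up while the laptop is suspended."""
    # Each plug or unplug produced one AC event plus one battery event.
    # Each 1% change in battery charge produced one battery event.
    # Assumption: each lid open or close counts as one event.
    return 2 * plug_actions + battery_percent_change + lid_actions


def hours_until_limit(percent_per_hour, limit=EVENT_LIMIT):
    """Hours of unplugged sleep before the battery alone reaches the limit."""
    return limit / percent_per_hour


def would_trigger(events, limit=EVENT_LIMIT):
    """True if the estimated event count reaches the observed limit."""
    return events >= limit
```

With 8 plug/unplug actions this gives 16 events, and a 2% per hour drain reaches the limit after 8 hours, matching the numbers above.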

The only way to ensure that the problem will not manifest is to have the battery at 100% before suspending, leave it plugged in, and avoid opening/closing the lid too many times. In my tests, no matter how many days it stayed suspended, this never produced "the lid-not-detected-battery-not-detected-syndrome", which you can only get out of by turning the computer off, unplugging it, and pressing the reset button through the small hole in the back.

This is clearly a problem with the embedded controller trying to report AC/Battery/LID events while the computer is asleep, and some kind of internal buffer filling up (16 events).

REPRODUCING THE PROBLEM QUICKLY:
The easiest way to reproduce the problem is:
     1) Suspend the computer
     2) Unplug PSU, plug, unplug, plug, unplug, plug, unplug, plug (8 actions, each producing 2 GPE events).
     3) Resume. Then, you can see that plugging or unplugging the PSU isn't detected, and LID close is ignored.
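One possible way to confirm step 3 without relying on the desktop's battery icon is to snapshot the kernel's GPE counters before and after plugging the PSU. This is only a sketch: it assumes the counters exposed under /sys/firmware/acpi/interrupts (files named gpeXX whose first field is a count), and the function names are made up for illustration.

```python
import os


def read_gpe_counts(directory="/sys/firmware/acpi/interrupts"):
    """Return {file name: count} for every gpe* counter file in the directory."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        if not name.startswith("gpe"):
            continue
        try:
            with open(os.path.join(directory, name)) as f:
                fields = f.read().split()
        except OSError:
            continue  # unreadable entry, skip it
        # The first whitespace-separated field is expected to be the counter.
        if fields and fields[0].isdigit():
            counts[name] = int(fields[0])
    return counts


def changed_counters(before, after):
    """Return {file name: increase} for counters that differ between snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


# Typical use (after resume):
#   before = read_gpe_counts()
#   ... plug or unplug the PSU ...
#   print(changed_counters(before, read_gpe_counts()))
```

If the result is empty after plugging or unplugging, no GPE was counted, which is consistent with the problem state described here.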

I ruled out the following things, which DON'T produce an effect:

     1) Battery Life Extension enabled or disabled. Crossing 80% battery charge while asleep. No effect.
     2) Intel Rapid Start. Enabled or disabled, partition or no partition. intel_irst module and wakeup_time and wakeup_events parameters don't have an effect.
     3) The number of suspends without restarting. No effect.
     4) Wifi on or off. Downloading or not.
     5) Fan spinning with CPU temp above 70. No effect (it neither helps nor makes it worse).
     6) acpi_osi="!Windows 2012" acpi_osi="Windows 2009", etc. No effect.
     7) Changing /sys/module/acpi/parameters/ec_delay or ec_storm_threshold. No effect (though interesting to research further; I only tried 20ms and 2048 respectively, and the defaults are 500ms and 8 respectively).
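For anyone repeating test 7, it may help to record the values actually in effect before and after changing them. A minimal sketch, assuming the two parameter files named above are readable on the running kernel; the function name is made up for illustration.

```python
import os


def read_ec_params(directory="/sys/module/acpi/parameters",
                   names=("ec_delay", "ec_storm_threshold")):
    """Return {parameter: value as text}, or None where the file is missing."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(directory, name)) as f:
                values[name] = f.read().strip()
        except OSError:
            values[name] = None  # not present or not readable on this kernel
    return values


# Typical use:
#   print(read_ec_params())
```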

I still haven't found a workaround or solution. I studied the DSDT of my NP530U3C, looking for a way to disable or mask the Embedded Controller events (AC, battery, LID). I overwrote a method through /sys/kernel/debug/acpi/custom (the \_SB.LID0._LID method) to try different things, like toggling the ECON variable (Embedded Controller ON), but with no effect.

I also compared the embedded controller ram region while the laptop was in the problem state, and while it was problem-free, by inserting the "ec_sys" module and looking at the output of
              hexdump -Cv /sys/kernel/debug/ec/ec0/io
before and after. I could see there that the lid, AC and battery charge values are correct; it's just that no GPE event is generated when they change.
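Comparing two hexdumps by eye is error-prone, so the before/after comparison can also be done byte by byte. This is a sketch only: it assumes the same ec_sys file as above is readable (root is normally needed), and the function names are made up for illustration.

```python
def read_ec(path="/sys/kernel/debug/ec/ec0/io"):
    """Read the raw embedded controller RAM region exposed by ec_sys."""
    with open(path, "rb") as f:
        return f.read()


def diff_ec_dumps(before, after):
    """Return (offset, old byte, new byte) for every position that differs."""
    return [
        (offset, old, new)
        for offset, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


def format_diff(differences):
    """Render differences as lines like '0x84: 0x01 -> 0x00'."""
    return [
        "0x%02x: 0x%02x -> 0x%02x" % (offset, old, new)
        for offset, old, new in differences
    ]


# Typical use:
#   good = read_ec()        # while problem-free
#   ... reproduce the problem ...
#   bad = read_ec()
#   print("\n".join(format_diff(diff_ec_dumps(good, bad))))
```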

I currently have two hypotheses:
       1) It might be that the events should get masked out before going to sleep, so that the embedded controller ignores them during sleep, and this is not happening for some reason.
       2) It might be that if the embedded controller has too many events queued, then when resuming some kind of race condition occurs, such that a lock or a queue gets messed up in a way that only a reset-through-small-hole-in-the-back can fix. Maybe the Linux kernel should wait a bit after resuming before serving those events?

TL;DR: The embedded controller gets messed up after 16 GPE events (AC, or battery % drop, or LID) during sleep, and when it comes back it doesn't report events anymore. The DSDT signals _QXX are not received anymore.

The problem never manifests if I suspend with:
     1) 100% charge
     2) PSU Plugged in and left that way.
     3) Don't close/open the lid
This holds no matter how many hours or days it stays suspended, since the battery % doesn't change and no lid events are produced during sleep (note that a lid event takes about 10 seconds to be produced after closing/opening the lid, by design).

If I unplug it and the battery % changes by less than 16% during sleep, then the 'problem state' is not produced in the laptop.