As of kernel 4.3-rc1 system will not stay in S3 suspend

Bug #1504584 reported by Doug Smythies
This bug affects 9 people
Affects          Status         Importance   Assigned to   Milestone
Linux            Fix Released   Critical
linux (Ubuntu)   Triaged        Medium       Unassigned

Bug Description

Note: this bug report is just for local tracking of an upstream issue:
Reference: http://lists.freedesktop.org/archives/intel-gfx/2015-October/077592.html

also copied below:

This started somewhere between Kernel 4.2 and 4.3-rc1,
but I only noticed it a day ago.

The first S3 suspend after a fresh boot works fine.
Thereafter, suspends simply resume again immediately.

I get the following errors on my console:

[ 152.697247] i915 0000:00:02.0: GEM idle failed, resume might fail
[ 152.697258] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
[ 152.697262] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -11
[ 152.697264] PM: Device 0000:00:02.0 failed to suspend async: error -11
[ 152.697306] PM: Some devices failed to suspend, or early wake event detected
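
The -11 in these messages is a negated Linux errno value, and 11 is EAGAIN. As a rough aid for reading such logs, the sketch below pulls the failing device and error name out of the PM core's "failed to suspend" line. The function name `suspend_failures` and the small errno table are illustrative assumptions, not part of any kernel tool, and the table covers only a handful of common codes.

```python
import re

# Small, hand-picked subset of Linux errno values (errno-base.h).
# This table is an assumption made for illustration, not a full list.
LINUX_ERRNO = {
    1: "EPERM", 2: "ENOENT", 5: "EIO", 11: "EAGAIN",
    12: "ENOMEM", 16: "EBUSY", 19: "ENODEV",
}

# Matches the per-device failure line in the form quoted in this report.
FAIL_RE = re.compile(
    r"PM: Device (\S+) failed to suspend(?: async)?: error (-?\d+)"
)

def suspend_failures(log_text):
    """Return (device, errno_name) pairs for each failed device in the log."""
    results = []
    for match in FAIL_RE.finditer(log_text):
        device = match.group(1)
        code = abs(int(match.group(2)))
        results.append((device, LINUX_ERRNO.get(code, "errno %d" % code)))
    return results

sample = "[ 152.697264] PM: Device 0000:00:02.0 failed to suspend async: error -11"
print(suspend_failures(sample))
```

Feeding it the output of `dmesg` after a failed suspend should list each device that vetoed the transition.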

The issue is not limited to my normal way of doing suspend, using "pm-suspend".
It also happens using the "echo mem > /sys/power/state" method.

The kernel was bisected, and the result was double checked by clean compiles
of the first bad commit and the immediately preceding commit. Bisect results
copied below:

$ git bisect good
dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2 is the first bad commit
commit dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2
Author: John Harrison <email address hidden>
Date: Fri May 29 17:43:39 2015 +0100

    drm/i915: Add explicit request management to i915_gem_init_hw()

    Now that a single per ring loop is being done for all the different
    intialisation steps in i915_gem_init_hw(), it is possible to add proper request
    management as well. The last remaining issue is that the context enable call
    eventually ends up within *_render_state_init() and this does its own private
    _i915_add_request() call.

    This patch adds explicit request creation and submission to the top level loop
    and removes the add_request() from deep within the sub-functions.

    v2: Updated for removal of batch_obj from add_request call in previous patch.

    For: VIZ-5115
    Signed-off-by: John Harrison <email address hidden>
    Reviewed-by: Tomas Elf <email address hidden>
    Signed-off-by: Daniel Vetter <email address hidden>
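
For readers unfamiliar with how a bisection narrows a regression down to one commit, here is a toy sketch of the idea. It assumes a linear, oldest-to-newest list of commits in which the first entry is known good, the last is known bad, and the fault never disappears once introduced. The commit names and the `is_bad` predicate are hypothetical; real `git bisect` also copes with non-linear history.

```python
def first_bad_commit(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    Returns (first_bad, last_good).
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid   # fault is at mid or earlier
        else:
            lo = mid   # fault was introduced after mid
    return commits[hi], commits[lo]

# Hypothetical history: the regression enters at "c6".
history = ["c%d" % i for i in range(10)]
bad_from = history.index("c6")
result = first_bad_commit(history, lambda c: history.index(c) >= bad_from)
print(result)
```

The returned pair corresponds to the double check described above: build and test both the first bad commit and the commit immediately before it.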

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1504584

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Doug Smythies (dsmythies) wrote :

You don't need the logs. I already figured out the root issue, and already submitted it upstream.
apport-collect has never worked for me. Setting to "Triaged".

Changed in linux (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
In , Jairo-daniel-miramontes-caton (jairo-daniel-miramontes-caton) wrote :

> This started somewhere between Kernel 4.2 and 4.3-rc1,
> but I only noticed it a day ago.
>
> The first S3 suspend after a fresh boot works fine.
> Thereafter, suspends simply resume again immediately.
>
> I get the following errors on my console:
>
> [ 152.697247] i915 0000:00:02.0: GEM idle failed, resume might fail
> [ 152.697258] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
> [ 152.697262] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -11
> [ 152.697264] PM: Device 0000:00:02.0 failed to suspend async: error -11
> [ 152.697306] PM: Some devices failed to suspend, or early wake event detected
>
> The issue is not limited to my normal way of doing suspend, using "pm-suspend".
> It also happens using the "echo mem > /sys/power/state" method.
>
> The kernel was bisected, and the result was double checked by clean compiles
> of the first bad commit and the immediately preceding commit. Bisect results
> copied below:
>
> $ git bisect good
> dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2 is the first bad commit
> commit dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2
> Author: John Harrison <John.C.Harrison at Intel.com>
> Date: Fri May 29 17:43:39 2015 +0100
>
> drm/i915: Add explicit request management to i915_gem_init_hw()
>
> Now that a single per ring loop is being done for all the different
> intialisation steps in i915_gem_init_hw(), it is possible to add proper request
> management as well. The last remaining issue is that the context enable call
> eventually ends up within *_render_state_init() and this does its own private
> _i915_add_request() call.
>
> This patch adds explicit request creation and submission to the top level loop
> and removes the add_request() from deep within the sub-functions.
>
> v2: Updated for removal of batch_obj from add_request call in previous patch.
>
> For: VIZ-5115
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> Reviewed-by: Tomas Elf <tomas.elf at intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
>
> :040000 040000 789c630ff3f5f07238a5df1bde79187c6c1251d0 2da3f7e20e2642d8eebd9f72528923c2ac53a8cb M drivers

Revision history for this message
In , Jairo-daniel-miramontes-caton (jairo-daniel-miramontes-caton) wrote :

This bug was created for tracking purposes. It was reported to the intel-gfx list; refer to http://lists.freedesktop.org/archives/intel-gfx/2015-October/077592.html

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

As discussed please follow up on the m-l with a link to each regression tracking bug you create so that the links go both ways. Thanks.

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

Please use links that contain the Message-ID so that it's easier to find the messages in email. Please reference the original report.

Like this: http://mid.gmane.org/002301d1025d$d5765090$8062f1b0$@net

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

John, ideas?

Revision history for this message
In , Doug Smythies (dsmythies) wrote :

Additional information:

After the first resume from suspend, the processor is in a bizarre state, where it will not go below 2.4 GHz, even though every CPU is asking for a pstate of 16 (the minimum for my processor). This has been tested several times, on both the preceding (good) and first bad kernels using both methods of suspend.

My processor: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

Example (no load):

pstate being asked for:

# rdmsr --bitfield 15:8 -d -a 0x199
16
16
16
16
16
16
16
16

pstate that I am getting:

# rdmsr --bitfield 15:8 -d -a 0x198
24
24
24
24
24
24
24
24

CPU freqs:

# grep MHz /proc/cpuinfo
cpu MHz : 2400.054
cpu MHz : 2399.921
cpu MHz : 2399.921
cpu MHz : 2399.921
cpu MHz : 2399.789
cpu MHz : 2399.789
cpu MHz : 2399.921
cpu MHz : 2399.921
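
The numbers above are consistent with each other: MSR 0x199 (IA32_PERF_CTL) holds the requested ratio and MSR 0x198 (IA32_PERF_STATUS) the granted one, each in bits 15:8. A minimal sketch of the conversion follows, assuming the 100 MHz bus clock used by this processor generation; the function names are illustrative only.

```python
BUS_CLOCK_MHZ = 100  # assumption: 100 MHz base clock on this processor

def ratio_from_msr(raw_value):
    """Extract bits 15:8, the same field `rdmsr --bitfield 15:8` selects."""
    return (raw_value >> 8) & 0xFF

def ratio_to_mhz(ratio, bus_clock_mhz=BUS_CLOCK_MHZ):
    """Convert a pstate ratio to a core frequency in MHz."""
    return ratio * bus_clock_mhz

requested_mhz = ratio_to_mhz(16)  # value read from MSR 0x199
granted_mhz = ratio_to_mhz(24)    # value read from MSR 0x198
print(requested_mhz, granted_mhz)
```

Under that assumption a ratio of 16 is 1600 MHz and a ratio of 24 is 2400 MHz, which matches the roughly 2400 MHz reported by /proc/cpuinfo.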

Changed in linux:
importance: Unknown → High
status: Unknown → Confirmed
Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

Did you double check the bisect by running both dc4be6071a24 and dc4be6071a24^ ?

Revision history for this message
In , Doug Smythies (dsmythies) wrote :

(In reply to Jani Nikula from comment #6)
> Did you double check the bisect by running both dc4be6071a24 and
> dc4be6071a24^ ?

Yes, of course, and I said so in my initial e-mail.
Truth be known, this was my second bisection, as I must have made a mistake in my first attempt, because the double check failed.

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

(In reply to Doug Smythies from comment #7)
> (In reply to Jani Nikula from comment #6)
> > Did you double check the bisect by running both dc4be6071a24 and
> > dc4be6071a24^ ?
>
> Yes, of course, and I said so in my initial e-mail.
> Truth be known, this was my second bisection, as I must have made a mistake
> in my first attempt, because the double check failed.

I asked, because I suspected the bisect result might be wrong. And the symptoms in comment #5 seem odd.

Please try two things: First, run dc4be6071a24 and try suspend/resume several times, and see if it's 100% reproducible or not. Second, attach dmesg with drm.debug=14 module parameter set (for the failing case).

Revision history for this message
In , Doug Smythies (dsmythies) wrote :

Created attachment 118873
requested dmesg with drm.debug=14

Possibly relevant excerpt:

[ 399.518389] [drm] stuck on render ring
[ 399.518686] [drm] GPU HANG: ecode 6:0:0xfeffffff, reason: Ring hung, action: reset
[ 399.518686] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 399.518686] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 399.518687] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 399.518687] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 399.518687] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 399.518699] [drm:i915_reset_and_wakeup] resetting chip
[ 399.518724] i915 0000:00:02.0: GEM idle failed, resume might fail
[ 399.518737] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
[ 399.518739] dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -11
[ 399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11
[ 399.518804] PM: Some devices failed to suspend, or early wake event detected

an edited /sys/class/drm/card0/error will be attached in a moment.

Revision history for this message
In , Doug Smythies (dsmythies) wrote :

Created attachment 118874
an edited version of the file the previous attachment asked for

I edited just to remove many lines of 0's.

Revision history for this message
In , Doug Smythies (dsmythies) wrote :

(In reply to Jani Nikula from comment #8)
> First, run dc4be6071a24 and try suspend/resume
> several times, and see if it's 100% reproducible or not.

Yes, it happens every time. To say 100%, I would have to have a sample space of about 1000 attempts. I did not do that many.

Changed in linux:
status: Confirmed → Incomplete
Revision history for this message
In , Doug Smythies (dsmythies) wrote :

Additional information:

After a fresh boot with the bad kernel, turbostat shows: Pkg%pc6 = 97.84%; PkgWatt 4.01; CorWatt 0.28; GFXWatt 0.23.

Then after the first suspend resume, turbostat shows: Pkg%pc6 = 0.00%; PkgWatt 10.08; CorWatt 3.04; GFXWatt 3.51.

PkgTmp goes up by more than 10 degrees.

The system is idle in both cases: 0.03% busy.
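
To put the two turbostat readings side by side, the short sketch below computes the idle power increase per domain. The dictionary keys are simply the turbostat column names quoted above; nothing here reads from turbostat itself.

```python
# Idle readings in watts, copied from the two turbostat samples above.
before = {"PkgWatt": 4.01, "CorWatt": 0.28, "GFXWatt": 0.23}
after = {"PkgWatt": 10.08, "CorWatt": 3.04, "GFXWatt": 3.51}

# Increase in idle power per domain, rounded to two decimals.
delta = {name: round(after[name] - before[name], 2) for name in before}

# How many times higher the package power is after the first resume.
pkg_ratio = round(after["PkgWatt"] / before["PkgWatt"], 2)

for name in ("PkgWatt", "CorWatt", "GFXWatt"):
    print(name, "+%.2f W" % delta[name])
print("package power ratio:", pkg_ratio)
```

Package power at idle is about two and a half times higher after the first suspend/resume cycle, with both the core and graphics domains contributing.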

penalvch (penalvch)
tags: added: bisect-done kernel-bug-exists-upstream
Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux:
status: Incomplete → Confirmed
Revision history for this message
In , Crux-f (crux-f) wrote :

I'm having the same problem with 4.3, having skipped a few versions before that. However, I seem to have found an earlier bug that could be related; at least it sounds similar: https://bugs.freedesktop.org/show_bug.cgi?id=90253

Revision history for this message
In , Doug Smythies (dsmythies) wrote :

This issue persists through kernel 4.4-rc8.

Revision history for this message
In , Tigran Gabrielyan (tigrangab) wrote :

Bug is still valid in 4.4 release.

Revision history for this message
In , Doug Smythies (dsmythies) wrote :

Yes, the bug persists through kernel 4.4.

Having isolated the issue down to the exact causal commit, I do not know what else I can do to move this one along.

Revision history for this message
In , Tigran Gabrielyan (tigrangab) wrote :

Just an update: bug still exists in 4.5 rc5.

Revision history for this message
Christian Dysthe (christian-dysthe) wrote :

I'm seeing this bug on my ThinkPad X1 Carbon running 16.04 with kernel 4.4.0-8-generic #23-Ubuntu SMP.

Revision history for this message
Karma Dorje (taaroa) wrote :

$ uname -a
Linux d0rje 4.4.0-12-generic #28-Ubuntu SMP Wed Mar 9 00:33:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Mar 9 16:19:58 d0rje kernel: [46297.093984] ------------[ cut here ]------------
Mar 9 16:19:58 d0rje kernel: [46297.094018] WARNING: CPU: 0 PID: 0 at /build/linux-H9bHY_/linux-4.4.0/drivers/gpu/drm/i915/intel_display.c:11386 intel_check_page_flip+0xfb/0x110 [i915]()
Mar 9 16:19:58 d0rje kernel: [46297.094020] Kicking stuck page flip: queued at 2785183, now 2785187
Mar 9 16:19:58 d0rje kernel: [46297.094021] Modules linked in: bnep drbg ansi_cprng ctr ccm bluetooth uvcvideo arc4 videobuf2_vmalloc iwldvm intel_rapl videobuf2_memops snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic videobuf2_v4l2 videobuf2_core x86_pkg_temp_thermal v4l2_common intel_powerclamp snd_hda_intel snd_hda_codec videodev mac80211 snd_hda_core media coretemp snd_hwdep kvm_intel kvm snd_pcm snd_seq_midi irqbypass crct10dif_pclmul crc32_pclmul snd_seq_midi_event snd_rawmidi iwlwifi snd_seq snd_seq_device snd_timer asus_nb_wmi asus_wmi mxm_wmi sparse_keymap cfg80211 aesni_intel snd aes_x86_64 lrw soundcore mei_me gf128mul mei glue_helper shpchp ablk_helper lpc_ich cryptd joydev input_leds mac_hid wmi serio_raw ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack bbswitch(OE) ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack parport_pc ppdev iptable_filter ip_tables x_tables lp parport autofs4 hid_generic usbhid hid i915 psmouse i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci atl1c video fjes
Mar 9 16:19:58 d0rje kernel: [46297.094075] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 4.4.0-11-generic #26-Ubuntu
Mar 9 16:19:58 d0rje kernel: [46297.094076] Hardware name: ASUSTeK Computer Inc. K53SD/K53SD, BIOS K53SD.205 03/06/2012
Mar 9 16:19:58 d0rje kernel: [46297.094077] 0000000000000086 edf7387dbd188491 ffff880246c03d60 ffffffff813e1ec3
Mar 9 16:19:58 d0rje kernel: [46297.094079] ffff880246c03da8 ffffffffc029da50 ffff880246c03d98 ffffffff8107fe12
Mar 9 16:19:58 d0rje kernel: [46297.094081] ffff88024564a000 ffff880240fb5000 ffff88024564a1a8 0000000000000000
Mar 9 16:19:58 d0rje kernel: [46297.094083] Call Trace:
Mar 9 16:19:58 d0rje kernel: [46297.094084] <IRQ> [<ffffffff813e1ec3>] dump_stack+0x63/0x90
Mar 9 16:19:58 d0rje kernel: [46297.094091] [<ffffffff8107fe12>] warn_slowpath_common+0x82/0xc0
Mar 9 16:19:58 d0rje kernel: [46297.094093] [<ffffffff8107feac>] warn_slowpath_fmt+0x5c/0x80
Mar 9 16:19:58 d0rje kernel: [46297.094110] [<ffffffffc0227d5e>] ? __intel_pageflip_stall_check+0xfe/0x110 [i915]
Mar 9 16:19:58 d0rje kernel: [46297.094126] [<ffffffffc02408fb>] intel_check_page_flip+0xfb/0x110 [i915]
Mar 9 16:19:58 d0rje kernel: [46297.094138] [<ffffffffc01c634b>] ironlake_irq_handler+0x3db/0xc30 [i915]
Mar 9 16:19:58 d0rje kernel: [46297.094141] [<ffffffff813ead49>] ? timerqueue_add+0x59/0xb0
Mar 9 16:19:58 d0rje kernel: [46297.094143]...


Revision history for this message
In , Tigran Gabrielyan (tigrangab) wrote :

Bug still exists in 4.6 rc1. Is there anything I can provide to help with this issue?

Revision history for this message
In , Doug Smythies (dsmythies) wrote :

This issue no longer occurs on my computer.

A fresh install of Linux was done, and the system now uses systemd, whereas previously it did not. I am not certain systemd is the difference; it is just my best guess.

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

(In reply to tigrangab from comment #18)
> Bug still exists in 4.6 rc1. Is there anything I can provide to help with
> this issue?

Did you bisect this to the same commit reported in comment #0?

Changed in linux:
status: Confirmed → Incomplete
Changed in linux:
importance: High → Critical
Changed in linux:
status: Incomplete → Fix Released