e1000e on Alder Lake does not resume successfully

Bug #1983404 reported by Jason Gunthorpe
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

This is a Dell Precision 3460 workstation (Alder Lake CPU) with the latest BIOS and with Intel AMT turned off in the BIOS.

When the Power -> 'Automatic Suspend' setting is enabled, the e1000e wired Ethernet is always broken on wake-up. Unloading and reloading the e1000e driver (rmmod, then modprobe) eventually heals it.
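
For anyone who wants to script the workaround, here is a minimal Python sketch of the unload/reload step. It is not a fix, only the manual workaround wrapped in a function. The helper names (reload_commands, reload_driver) are made up for illustration, it uses `modprobe -r` instead of rmmod for the unload, and it must run as root. Since the reload only "eventually" heals the adapter, it may need to be run more than once.

```python
import subprocess


def reload_commands(module="e1000e"):
    """Return the argv lists that unload and then reload a kernel module."""
    return [
        ["modprobe", "-r", module],  # unload the module
        ["modprobe", module],        # load it again
    ]


def reload_driver(module="e1000e", run=subprocess.run):
    """Run the unload/load pair in order (requires root).

    With the default runner, a failing command raises
    subprocess.CalledProcessError because check=True is passed.
    """
    for argv in reload_commands(module):
        run(argv, check=True)


if __name__ == "__main__":
    reload_driver()
```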

On resume the kernel logs:

Aug 01 14:41:01 wakko kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
Aug 01 14:41:01 wakko kernel: ata8: SATA link down (SStatus 4 SControl 300)
Aug 01 14:41:01 wakko kernel: ata5: SATA link down (SStatus 4 SControl 300)
Aug 01 14:41:01 wakko kernel: ata7: SATA link down (SStatus 4 SControl 300)
Aug 01 14:41:01 wakko kernel: ata6: SATA link down (SStatus 4 SControl 300)
Aug 01 14:41:05 wakko kernel: e1000e 0000:00:1f.6 enp0s31f6: Error reading PHY register
Aug 01 14:41:05 wakko kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Aug 01 14:41:05 wakko kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp0s31f6: link becomes ready
Aug 01 14:41:05 wakko kernel: kauditd_printk_skb: 15 callbacks suppressed
Aug 01 14:41:08 wakko kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                                TDH <1>
                                TDT <5>
                                next_to_use <5>
                                next_to_clean <1>
                              buffer_info[next_to_clean]:
                                time_stamp <10017d3a0>
                                next_to_watch <1>
                                jiffies <10017d680>
                                next_to_watch.status <0>
                              MAC Status <40080283>
                              PHY Status <796d>
                              PHY 1000BASE-T Status <3800>
                              PHY Extended Status <3000>
                              PCI Status <10>

After that, the driver just loops, repeatedly resetting the adapter and failing again.
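
To read the hang dump above a bit more easily, here is a small Python sketch that parses the `name <value>` lines and derives two numbers from them: how many TX descriptors were still pending and how long the oldest one had been waiting. It assumes the values are printed in hexadecimal (time_stamp and jiffies clearly are), a TX ring of 256 entries, and HZ=250 for this kernel; the ring size and HZ are assumptions, not values taken from this report.

```python
import re

# Subset of the dump from the log above, copied verbatim.
HANG_DUMP = """\
TDH <1>
TDT <5>
next_to_use <5>
next_to_clean <1>
time_stamp <10017d3a0>
jiffies <10017d680>
"""


def parse_hang_dump(text):
    """Map each 'name <hex>' line to an integer, treating values as hex."""
    pattern = re.compile(r"^\s*(\S.*?)\s*<([0-9a-fA-F]+)>\s*$", re.M)
    return {name: int(value, 16) for name, value in pattern.findall(text)}


def pending_descriptors(fields, ring_size=256):
    """Descriptors queued (tail) but not yet fetched (head); ring_size is assumed."""
    return (fields["TDT"] - fields["TDH"]) % ring_size


def stalled_seconds(fields, hz=250):
    """Age of the oldest uncleaned buffer in seconds; hz=250 is an assumption."""
    return (fields["jiffies"] - fields["time_stamp"]) / hz


if __name__ == "__main__":
    f = parse_hang_dump(HANG_DUMP)
    print(pending_descriptors(f), "descriptors pending")
    print(stalled_seconds(f), "seconds since the oldest buffer was queued")
```

Under those assumptions the dump shows four descriptors stuck in the ring for 736 jiffies, a little under three seconds, which is consistent with the gap between "NIC Link is Up" at 14:41:05 and the hang message at 14:41:08.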

I see some upstream chatter on this, but the most likely fix, commit 1866aa0d0d64 ("e1000e: Fix possible HW unit hang after an s0ix exit"), is already in this Ubuntu kernel.

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-5.15.0-43-generic 5.15.0-43.46
ProcVersionSignature: Ubuntu 5.15.0-43.46-generic 5.15.39
Uname: Linux 5.15.0-43-generic x86_64
ApportVersion: 2.20.11-0ubuntu82.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: jgg 2250 F.... pulseaudio
 /dev/snd/pcmC0D3p: jgg 2250 F...m pulseaudio
CRDA: N/A
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Tue Aug 2 14:37:16 2022
InstallationDate: Installed on 2022-06-25 (38 days ago)
InstallationMedia: Ubuntu 22.04 LTS "Jammy Jellyfish" - Release amd64 (20220419)
MachineType: Dell Inc. Precision 3460
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-43-generic root=UUID=4734cc66-fe16-4d2f-ba78-727597040f9a ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-5.15.0-43-generic N/A
 linux-backports-modules-5.15.0-43-generic N/A
 linux-firmware 20220329.git681281e4-0ubuntu3.3
RfKill:

SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 06/21/2022
dmi.bios.release: 1.3
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.3.72
dmi.board.name: 08WXMX
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 3
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.3.72:bd06/21/2022:br1.3:svnDellInc.:pnPrecision3460:pvr:rvnDellInc.:rn08WXMX:rvrA01:cvnDellInc.:ct3:cvr:sku0AC7:
dmi.product.family: OptiPlex
dmi.product.name: Precision 3460
dmi.product.sku: 0AC7
dmi.sys.vendor: Dell Inc.

Jason Gunthorpe (jgunthorpe) wrote :

This may be the missing patch. I am willing to try a build with it, at least.

commit b49feacbeffc7635cc6692cbcc6a1eae2c17da6f
Author: Sasha Neftin <email address hidden>
Date: Sun May 8 10:09:05 2022 +0300

    e1000e: Enable GPT clock before sending message to CSME

    On corporate (CSME) ADL systems, the Ethernet Controller may stop working
    ("HW unit hang") after exiting from the s0ix state. The reason is that
    CSME misses the message sent by the host. Enabling the dynamic GPT clock
    solves this problem. This clock is cleared upon HW initialization.

    Fixes: 3e55d231716e ("e1000e: Add handshake with the CSME to support S0ix")
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214821
    Reviewed-by: Dima Ruinskiy <email address hidden>
    Signed-off-by: Sasha Neftin <email address hidden>
    Tested-by: Chia-Lin Kao (AceLan) <email address hidden>
    Tested-by: Naama Meir <email address hidden>
    Signed-off-by: Tony Nguyen <email address hidden>

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed