systemd-networkd: Lost carrier e1000

Bug #1832101 reported by Poloskey
This bug affects 3 people

Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Just did a fresh installation of Ubuntu 19.04 on our server and we're seeing a lot of carrier-lost errors.

There is no clear indication why this is happening.

Tried the following: disabled IPv6 (just to be sure) and set GSO, GRO, and TSO to off on the NIC.
Removed netplan, switched to systemd-networkd, and set ConfigureWithoutCarrier to true; the option IgnoreCarrierLoss is not accepted by systemd.

Description: Ubuntu 19.04
Release: 19.04

Kernel Log

Jun 8 12:11:17 mc2 kernel: [30304.199050] ------------[ cut here ]------------
Jun 8 12:11:17 mc2 kernel: [30304.199052] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
Jun 8 12:11:17 mc2 kernel: [30304.199062] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x221/0x230
Jun 8 12:11:17 mc2 kernel: [30304.199062] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_rapl_perf intel_wmi_thunderbolt wmi_bmof intel_pch_thermal mac_hid acpi_pad ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_multiport xt_limit xt_tcpudp xt_addrtype sch_fq_codel xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 e1000e nvme ahci i2c_i801 libahci nvme_core wmi video pinctrl_cannonlake pinctrl_intel
Jun 8 12:11:17 mc2 kernel: [30304.199080] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 5.0.0-16-generic #17-Ubuntu
Jun 8 12:11:17 mc2 kernel: [30304.199081] Hardware name: Gigabyte Technology Co., Ltd. B360 HD3P-LM/B360HD3PLM-CF, BIOS F4 HZ 04/30/2019
Jun 8 12:11:17 mc2 kernel: [30304.199082] RIP: 0010:dev_watchdog+0x221/0x230
Jun 8 12:11:17 mc2 kernel: [30304.199083] Code: 00 49 63 4e e0 eb 92 4c 89 ef c6 05 9a 92 f0 00 01 e8 13 38 fc ff 89 d9 4c 89 ee 48 c7 c7 68 5e da 95 48 89 c2 e8 71 f1 79 ff <0f> 0b eb c0 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
Jun 8 12:11:17 mc2 kernel: [30304.199083] RSP: 0018:ffff93adff203e68 EFLAGS: 00010286
Jun 8 12:11:17 mc2 kernel: [30304.199084] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
Jun 8 12:11:17 mc2 kernel: [30304.199084] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff93adff216440
Jun 8 12:11:17 mc2 kernel: [30304.199084] RBP: ffff93adff203e98 R08: 0000000000000001 R09: 0000000000000c61
Jun 8 12:11:17 mc2 kernel: [30304.199085] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000001
Jun 8 12:11:17 mc2 kernel: [30304.199085] R13: ffff93adec480000 R14: ffff93adec4804c0 R15: ffff93adeca74080
Jun 8 12:11:17 mc2 kernel: [30304.199086] FS: 0000000000000000(0000) GS:ffff93adff200000(0000) knlGS:0000000000000000
Jun 8 12:11:17 mc2 kernel: [30304.199086] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 8 12:11:17 mc2 kernel: [30304.199086] CR2: 00007f7783ea4300 CR3: 000000043920e003 CR4: 00000000003606e0
Jun 8 12:11:17 mc2 kernel: [30304.199087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 8 12:11:17 mc2 kernel: [30304.199087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 8 12:11:17 mc2 kernel: [30304.199087] Call Trace:
Jun 8 12:11:17 mc2 kernel: [30304.199088] <IRQ>
Jun 8 12:11:17 mc2 kernel: [30304.199090] ? pfifo_fast_enqueue+0x120/0x120
Jun 8 12:11:17 mc2 kernel: [30304.199091] call_timer_fn+0x30/0x130
Jun 8 12:11:17 mc2 kernel: [30304.199092] run_timer_softirq+0x3e4/0x420
Jun 8 12:11:17 mc2 kernel: [30304.199093] ? ktime_get+0x3c/0xa0
Jun 8 12:11:17 mc2 kernel: [30304.199095] ? lapic_next_deadline+0x26/0x30
Jun 8 12:11:17 mc2 kernel: [30304.199096] ? clockevents_program_event+0x93/0xf0
Jun 8 12:11:17 mc2 kernel: [30304.199097] __do_softirq+0xdc/0x2f3
Jun 8 12:11:17 mc2 kernel: [30304.199099] irq_exit+0xc0/0xd0
Jun 8 12:11:17 mc2 kernel: [30304.199099] smp_apic_timer_interrupt+0x79/0x140
Jun 8 12:11:17 mc2 kernel: [30304.199100] apic_timer_interrupt+0xf/0x20
Jun 8 12:11:17 mc2 kernel: [30304.199101] </IRQ>
Jun 8 12:11:17 mc2 kernel: [30304.199102] RIP: 0010:cpuidle_enter_state+0xbd/0x450
Jun 8 12:11:17 mc2 kernel: [30304.199103] Code: ff e8 87 36 87 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 ba 65 8d ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8d 02 00 00 49 63 cd 48 8b 75 d0 48 2b 75 c8 48 8d
Jun 8 12:11:17 mc2 kernel: [30304.199103] RSP: 0018:ffffbd934635be60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Jun 8 12:11:17 mc2 kernel: [30304.199104] RAX: ffff93adff222d80 RBX: ffffffff96153c80 RCX: 000000000000001f
Jun 8 12:11:17 mc2 kernel: [30304.199104] RDX: 00001b8fbf04fa77 RSI: 00000000238e38e3 RDI: 0000000000000000
Jun 8 12:11:17 mc2 kernel: [30304.199104] RBP: ffffbd934635bea0 R08: 0000000000000000 R09: 0000000000022640
Jun 8 12:11:17 mc2 kernel: [30304.199105] R10: 00006352c2dadb3e R11: ffff93adff221c04 R12: ffff93adff22d800
Jun 8 12:11:17 mc2 kernel: [30304.199105] R13: 0000000000000006 R14: ffffffff96153ed8 R15: ffffffff96153ec0
Jun 8 12:11:17 mc2 kernel: [30304.199106] cpuidle_enter+0x17/0x20
Jun 8 12:11:17 mc2 kernel: [30304.199108] call_cpuidle+0x23/0x40
Jun 8 12:11:17 mc2 kernel: [30304.199109] do_idle+0x23a/0x280
Jun 8 12:11:17 mc2 kernel: [30304.199110] cpu_startup_entry+0x1d/0x20
Jun 8 12:11:17 mc2 kernel: [30304.199111] start_secondary+0x1ab/0x200
Jun 8 12:11:17 mc2 kernel: [30304.199112] secondary_startup_64+0xa4/0xb0
Jun 8 12:11:17 mc2 kernel: [30304.199113] ---[ end trace c329b3b416c2fdf3 ]---
Jun 8 12:11:17 mc2 kernel: [30304.199121] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
Jun 8 12:11:21 mc2 kernel: [30308.410432] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
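The messages that matter in the trace above are the transmit-queue watchdog timeout, the unexpected adapter reset, and the link coming back up. The following is a minimal sketch for tallying them in a saved kernel log. The function name summarize_nic_resets is made up for illustration, and the patterns are simply copied from the log lines quoted in this report, so other driver versions may word them differently.

```shell
#!/bin/sh
# Tally e1000e trouble indicators in a kernel log read from stdin.
# Patterns are taken verbatim from the messages quoted in this report.
summarize_nic_resets() {
    awk '
        /NETDEV WATCHDOG: .* transmit queue [0-9]+ timed out/ { timeouts++ }
        /Reset adapter unexpectedly/                           { resets++ }
        /NIC Link is Up/                                       { linkups++ }
        END {
            # Uninitialised awk counters print as 0 with %d.
            printf "tx_timeouts=%d resets=%d link_ups=%d\n", timeouts, resets, linkups
        }
    '
}
```

It could be fed from the journal, for example with `journalctl -k | summarize_nic_resets`, or from a log file on stdin.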
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jun 9 03:15 seq
 crw-rw---- 1 root audio 116, 33 Jun 9 03:15 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.10-0ubuntu27
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 19.04
HibernationDevice:

IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Gigabyte Technology Co., Ltd. B360 HD3P-LM
Package: linux (not installed)
PciMultimedia:

ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.0.0-16-generic root=UUID=272b751b-985b-4a98-b5a3-635ea2ec16c5 ro netcfg/do_not_use_netplan=true nomodeset consoleblank=0
ProcVersionSignature: Ubuntu 5.0.0-16.17-generic 5.0.8
RelatedPackageVersions:
 linux-restricted-modules-5.0.0-16-generic N/A
 linux-backports-modules-5.0.0-16-generic N/A
 linux-firmware 1.178.1
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: disco
Uname: Linux 5.0.0-16-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 04/30/2019
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: F4 HZ
dmi.board.asset.tag: Default string
dmi.board.name: B360HD3PLM-CF
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: Default string
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrF4HZ:bd04/30/2019:svnGigabyteTechnologyCo.,Ltd.:pnB360HD3P-LM:pvrDefaultstring:rvnGigabyteTechnologyCo.,Ltd.:rnB360HD3PLM-CF:rvrDefaultstring:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: Default string
dmi.product.name: B360 HD3P-LM
dmi.product.sku: Default string
dmi.product.version: Default string
dmi.sys.vendor: Gigabyte Technology Co., Ltd.

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1832101/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Revision history for this message
Poloskey (craftplaza) wrote :

Resolved this error by disabling GSO, GRO, and TSO offload and RX/TX checksumming:

ethtool -K eno1 gso off gro off tso off rx off tx off
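To confirm that the command above actually took effect, the feature list printed by `ethtool -k` can be checked for anything still switched on. This is only a sketch: the function name offloads_still_on is hypothetical, and it assumes the long feature names that `ethtool -k` normally prints for the gso, gro, tso, rx, and tx shorthands.

```shell
#!/bin/sh
# Read `ethtool -k <nic>` output on stdin and print each of the five
# offloads from this report that is still reported as "on".
offloads_still_on() {
    awk -F': ' '
        $1 ~ /^(rx-checksumming|tx-checksumming|tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload)$/ && $2 ~ /^on/ {
            print $1
        }
    '
}
```

Example: `ethtool -k eno1 | offloads_still_on` prints nothing when all five are off.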

affects: ubuntu → linux (Ubuntu)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1832101

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: disco
Revision history for this message
Poloskey (craftplaza) wrote : apport information

Attachments: CRDA.txt, CurrentDmesg.txt, Lspci.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, WifiSyslog.txt

tags: added: apport-collected
description: updated

Revision history for this message
Poloskey (craftplaza) wrote :

The issue only occurs under high network load.

Disabling only the RX/TX checksum check did not resolve the issue.

There is also an older bug for the e1000 with almost the same issue,
so this seems to be a long-standing, known issue in earlier versions.

There is also a similar bug (Broadcom) where the solution is to disable highdma.
That option cannot be changed on the e1000, so I could not try that.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Does disabling TSO help?

Revision history for this message
Poloskey (craftplaza) wrote :

Yes and no.

I disabled TSO but still had some problems. After that I re-enabled TSO and disabled RX/TX checksumming.
That did not resolve the issue either, so as a last resort I disabled
GSO, GRO, and TSO offload and RX/TX checksumming:

ethtool -K <nic> gso off gro off tso off rx off tx off

Revision history for this message
Shivaram Lingamneni (slingamn) wrote :

I'm having this issue as well, with the 5.0.0-16-generic kernel on amd64 and an Intel I219-V NIC using the e1000e driver. I have a test case that can reproduce it somewhat reliably (involving a lot of concurrent back-and-forth chatter on my LAN).

I was able to reproduce it even after this command:

ethtool -K enp0s31f6 gso off gro off tso off rx off tx off

although anecdotally, disabling those features makes reproduction harder.

After the original oops with backtrace (similar to the one posted above), the problem presents simply as:

Jun 14 03:54:43 good-fortune systemd-networkd[7815]: enp0s31f6: Lost carrier
Jun 14 03:54:43 good-fortune systemd-networkd[7815]: enp0s31f6: DHCP lease lost

with no further backtraces.
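Because later occurrences show up only as systemd-networkd messages, counting those lines per interface gives a rough measure of how often the link drops. A small sketch, assuming the syslog-style line format quoted above; the function name lost_carrier_counts is invented for this example.

```shell
#!/bin/sh
# Count "Lost carrier" messages from systemd-networkd per interface.
# Expects syslog-style lines on stdin, as quoted in this report.
lost_carrier_counts() {
    awk '
        /systemd-networkd\[[0-9]+\]: .*: Lost carrier$/ {
            iface = $(NF - 2)        # field just before "Lost carrier"
            sub(/:$/, "", iface)     # strip the trailing colon
            count[iface]++
        }
        END { for (i in count) printf "%s %d\n", i, count[i] }
    '
}
```

A possible invocation is `journalctl -u systemd-networkd | lost_carrier_counts`. The order of the output lines is not defined when several interfaces are affected.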

Revision history for this message
Poloskey (craftplaza) wrote :

Can you check the NIC settings and see whether highdma is fixed or not? (ethtool -k <nic>)

If it is not fixed, try disabling it: ethtool -K <nic> highdma off

https://lauri.xn--vsandi-pxa.com/2016/02/fixing-broadcom-bcm5762-on-ubuntu.html

Another option is to disable netplan and switch to systemd-networkd so you can set the following
(this is the first thing I tried):

[Network]
IgnoreCarrierLoss=true
ConfigureWithoutCarrier=true
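For completeness, here is a sketch of how those two keys might be placed in a full .network file. Several things are assumptions rather than details from this thread: the function name write_network_file, the file name 10-<iface>.network, and the [Match] block. As far as I know systemd matches section names case-sensitively, so the sketch writes [Network]. IgnoreCarrierLoss= may also require a newer systemd than the one shipped with 19.04, which would be consistent with the report that it was not accepted.

```shell
#!/bin/sh
# Write a minimal systemd-networkd unit for one interface.
# $1: interface name
# $2: target directory (defaults to /etc/systemd/network)
write_network_file() {
    iface=$1
    dir=${2:-/etc/systemd/network}
    cat > "$dir/10-$iface.network" <<EOF
[Match]
Name=$iface

[Network]
# Addressing (DHCP= or Address=) is site-specific and left out here.
ConfigureWithoutCarrier=true
IgnoreCarrierLoss=true
EOF
}
```

Writing to a scratch directory first, for example `write_network_file eno1 /tmp`, allows the result to be inspected before it is copied into place and systemd-networkd is restarted.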

Revision history for this message
Poloskey (craftplaza) wrote :

I also installed networkd-dispatcher and added a file to the routable.d directory containing

#!/bin/sh
ethtool -K eno1 gso off gro off tso off rx off tx off

so I don't have to change the settings manually after a reboot.
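A slightly more defensive variant of that hook is sketched below. It relies on networkd-dispatcher exporting the IFACE environment variable to its scripts, and only touches the interface it is meant for. The TARGET_IFACE and ETHTOOL variables and the function name disable_offloads are additions for illustration, not part of the original script.

```shell
#!/bin/sh
# Sketch of a networkd-dispatcher hook for the routable.d directory.
TARGET_IFACE=${TARGET_IFACE:-eno1}   # interface named in this report
ETHTOOL=${ETHTOOL:-ethtool}          # overridable so the hook can be dry-run

disable_offloads() {
    # Same feature list as the one-line script above.
    "$ETHTOOL" -K "$1" gso off gro off tso off rx off tx off
}

# Only act when the event is for the interface we care about;
# events for any other interface are ignored.
if [ "${IFACE:-}" = "$TARGET_IFACE" ]; then
    disable_offloads "$IFACE"
fi
```

Running it by hand with `IFACE=eno1 ETHTOOL=echo sh <script>` prints the arguments that would be passed to ethtool instead of changing any settings.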

Revision history for this message
Poloskey (craftplaza) wrote :

Lost carrier twice today. It seems that Shivaram Lingamneni is right:
with the changed settings there are still lost carriers, but not as often.

Revision history for this message
Shivaram Lingamneni (slingamn) wrote :

ethtool reports "highdma: on [fixed]", and won't let me switch it off.
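Whether a feature is marked [fixed], and therefore cannot be toggled with `ethtool -K`, can be checked before trying to change it. A minimal sketch that parses `ethtool -k` output; the function name feature_is_fixed is hypothetical.

```shell
#!/bin/sh
# Exit 0 if the named feature is marked [fixed] in `ethtool -k` output
# read from stdin, 1 otherwise.
# $1: feature name as printed by ethtool, e.g. highdma
feature_is_fixed() {
    awk -v want="$1" -F': ' '
        {
            name = $1
            gsub(/^[ \t]+/, "", name)   # sub-features are indented
        }
        name == want && /\[fixed\]/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Example: `ethtool -k enp0s31f6 | feature_is_fixed highdma && echo "highdma cannot be changed"`.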

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :
Revision history for this message
Poloskey (craftplaza) wrote :

I'm testing with these settings: ethtool -K eno1 gso off gro off tso off rx off tx off

A few days ago my provider replaced all the hardware and attached the network to another switch port.
The network cable was also replaced.

I cannot test a kernel at the moment; the system is already live.

Maybe Shivaram Lingamneni can run some tests.

Revision history for this message
Shivaram Lingamneni (slingamn) wrote :

It's challenging for me to test mainline kernels because it means losing access to ZFS, but I can probably do this later this week.

Revision history for this message
John Doe (jdoefp) wrote :

I might have the same issue, I'm not sure. I have an Ubuntu 19.04 server with an onboard e1000e interface and a USB ax88179_178a interface, routing traffic between the two.

The e1000e interface drops a couple of times a day and I see "Detected Hardware Unit Hang" / "Reset adapter unexpectedly" messages in syslog. I don't know what triggers that, there's no significant traffic in my graphs.

I can trigger the e1000e to drop reliably, but the trigger doesn't make sense:

I can send as much traffic as I want from ax88179_178a through e1000e and vice versa. I can send as much traffic as I want to the server itself from the e1000e. If I send any moderate amount of traffic to the server from the ax88179_178a, then the e1000e drops. For example, apt-get will cause the e1000e to drop.

The server is fine, the ax88179_178a is fine, the apt-get or wget or whatever will complete and eventually my TCP session will usually recover after the e1000e resets. If the files are large enough, this may happen several times. It doesn't seem to matter if it's writing to disk or to /dev/null.

I have swapped out the switch, the cable, and the USB ethernet interface (both for another ax88179_178a and some other chipset I don't remember.) I have tried kernels from the mainline ppa. I have tried every combination of ethtool knob-twiddling. No luck.

The issue first appeared for me after upgrading from 18.04.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Please provide the output of `lspci -nn`.

Revision history for this message
John Doe (jdoefp) wrote :

Ah, sorry, it's also an I219-V

00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:5904] (rev 02)
00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 620 [8086:5916] (rev 02)
00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model [8086:1911]
00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:9d2f] (rev 21)
00:14.2 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Thermal subsystem [8086:9d31] (rev 21)
00:16.0 Communication controller [0780]: Intel Corporation Sunrise Point-LP CSME HECI #1 [8086:9d3a] (rev 21)
00:17.0 SATA controller [0106]: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] [8086:9d03] (rev 21)
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 [8086:9d10] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation Intel(R) 100 Series Chipset Family LPC Controller/eSPI Controller - 9D4E [8086:9d4e] (rev 21)
00:1f.2 Memory controller [0580]: Intel Corporation Sunrise Point-LP PMC [8086:9d21] (rev 21)
00:1f.4 SMBus [0c05]: Intel Corporation Sunrise Point-LP SMBus [8086:9d23] (rev 21)
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (4) I219-V [8086:15d8] (rev 21)

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

1) Please try disabling TSO. Multiple users reported positive results after TSO is disabled.

2) Please try mainline kernel:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.2-rc5/

3) Please try out-of-tree e1000e, which has some extra sauce not in upstream:
https://sourceforge.net/projects/e1000/files/e1000e%20stable/

Revision history for this message
John Doe (jdoefp) wrote :

I have already tried mainline kernels, and every combination of ethtool knob-twiddling. No luck.

Disabling TSO *may* make the 'random' e1000e resets happen less frequently (it's hard to say), but wget/apt-get *on the other interface* reliably trigger the e1000e to hang and reset regardless of settings.

I will try the out-of-tree driver this evening.

Revision history for this message
John Doe (jdoefp) wrote :

Sorry for the delay, I was busy.

The out-of-tree e1000e driver does not resolve the issue.

Running, for example, "wget http://ping.online.net/1000Mo.dat -O /dev/null" from the host, with all traffic on the non-e1000e interface, and no traffic on the e1000e interface, causes the e1000e interface to reset.
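A reproduction like this can be made measurable by reading the kernel's per-interface carrier_changes counter in sysfs before and after the download. The sketch below does that. The function names and the SYSFS_NET override are assumptions for illustration, and to my understanding the counter increments on each down or up transition, so one reset would typically show up as a change of 2.

```shell
#!/bin/sh
# Print the carrier_changes counter for an interface.
# SYSFS_NET can point at a fake tree for testing.
carrier_changes() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/carrier_changes"
}

# Run a command and print how many carrier changes the watched
# interface saw while it ran.
# $1: interface to watch; remaining arguments: command to run
count_carrier_changes_during() {
    iface=$1
    shift
    before=$(carrier_changes "$iface") || return 1
    "$@" 1>&2                      # keep the command output off stdout
    after=$(carrier_changes "$iface") || return 1
    echo $((after - before))
}
```

For the case described above this might be run as `count_carrier_changes_during <e1000e-nic> wget http://ping.online.net/1000Mo.dat -O /dev/null`, with the download going out over the other interface.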

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Does any previous kernel work without this issue?

Revision history for this message
John Doe (jdoefp) wrote :

For me personally, the issue only appeared after upgrading from 18.04 -> 19.04. Based on dpkg.log, the last working kernel I had was either 4.15.0-45.48 or 4.15.0-43.45.

I haven't tried downgrading the kernel yet: the machine is headless and manual interventions are a pain.
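The kernel history mentioned above can be pulled out of dpkg.log with a short filter. This is a sketch assuming the usual dpkg.log layout of date, time, action, package, old version, and new version. The function name kernel_installs is made up.

```shell
#!/bin/sh
# List kernel image installations from dpkg.log lines on stdin as
# "date package new-version".
kernel_installs() {
    awk '$3 == "install" && $4 ~ /^linux-image-/ { print $1, $4, $6 }'
}
```

Rotated logs can be included with something like `zcat -f /var/log/dpkg.log* | kernel_installs`.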

Revision history for this message
Shivaram Lingamneni (slingamn) wrote :

Is there any additional information we should collect in order to report this to LKML?

Revision history for this message
John Doe (jdoefp) wrote :

The issue I mentioned above appears to be caused by the ax88179_178a (and in turn triggering an issue in the e1000e driver) -- I can reproduce it with just two ax88179_178a. It might be a memory leak, I'm not sure.

When I modify the configuration to use two vlans on the e1000e instead of a second interface, I still get device resets, but that *may* be resolved by turning off options with ethtool. I'm waiting to see.

tl;dr computers are terrible, everything sucks.
