igb driver does not initialize network cards - no link intel I210 rev03

Bug #1370018 reported by wege on 2014-09-16
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   High         Tim Gardner
Trusty           High         Tim Gardner
Utopic           High         Tim Gardner

Bug Description

Hi,

Our new server, installed with Ubuntu 14.04.1 LTS and updated to the latest kernel (Ubuntu 3.13.0-35.62-generic 3.13.11.6), is equipped with two Intel Corporation I210 Gigabit Network Connection (rev 03) NICs. The correct network driver is loaded, but most of the time the link to the network switch is not established. Randomly, one, both, or neither of the NICs gets a link; a NIC without a link has dark lights. This happens on both cold and soft boots.

Attached you will find an apport-report from our system.

We compiled the latest Intel igb driver (igb-5.2.9.4), downloaded from the Intel website, and installed it manually on the system:
# cp igb.ko /lib/modules/3.13.0-35-generic/kernel/drivers/net/ethernet/intel/igb/igb.ko
# update-initramfs -c -k all
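
After swapping the module, it may be worth confirming which igb build is actually bound to the NIC. A minimal sketch, assuming `modinfo` and `ethtool` are installed and the interface is named eth0 as in this report; `driver_version` is a hypothetical helper (not part of any tool) that only extracts the version field from `ethtool -i` output:

```shell
#!/bin/sh
# Hypothetical helper: print the "version:" field from `ethtool -i` output read on stdin.
# Lines such as "firmware-version:" are ignored because they do not start with "version:".
driver_version() {
    sed -n 's/^version:[[:space:]]*//p'
}

# Usage on the affected machine (eth0 as in this report):
#   modinfo -F version igb             # version of the igb.ko found on disk
#   ethtool -i eth0 | driver_version   # version of the driver bound to eth0
```

If the two versions differ, the old module is probably still loaded, for example from an initramfs that was not rebuilt.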

!! With the latest Intel driver, the problem seems to be solved !!

If I can supply any further information, please let me know

Thanks in advance
---
ApportVersion: 2.14.1-0ubuntu3.4
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
IwConfig:
 eth0 no wireless extensions.

 eth1 no wireless extensions.

 lo no wireless extensions.
MachineType: FUJITSU PRIMERGY TX2540 M1
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=de_AT:de
 TERM=xterm
 PATH=(custom, no user)
 LANG=de_AT.UTF-8
 SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-35-generic root=UUID=0e2cf726-90e1-4868-8cc3-ae28d4d512d0 ro nomdmonddf nomdmonisw
ProcVersionSignature: Ubuntu 3.13.0-35.62-generic 3.13.11.6
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-35-generic N/A
 linux-backports-modules-3.13.0-35-generic N/A
 linux-firmware 1.127.5
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty trusty trusty
Uname: Linux 3.13.0-35-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 07/15/2014
dmi.bios.vendor: FUJITSU // American Megatrends Inc.
dmi.bios.version: V4.6.5.4 R1.12.0 for D3099-B1x
dmi.board.name: D3099-B1
dmi.board.vendor: FUJITSU
dmi.board.version: S26361-D3099-B12 WGS03 GS01
dmi.chassis.asset.tag: cpsvs3
dmi.chassis.type: 17
dmi.chassis.vendor: FUJITSU
dmi.chassis.version: TX2540M1F5
dmi.modalias: dmi:bvnFUJITSU//AmericanMegatrendsInc.:bvrV4.6.5.4R1.12.0forD3099-B1x:bd07/15/2014:svnFUJITSU:pnPRIMERGYTX2540M1:pvrGS01:rvnFUJITSU:rnD3099-B1:rvrS26361-D3099-B12WGS03GS01:cvnFUJITSU:ct17:cvrTX2540M1F5:
dmi.product.name: PRIMERGY TX2540 M1
dmi.product.version: GS01
dmi.sys.vendor: FUJITSU

wege (wg-mail) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1370018

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Tim Gardner (timg-tpi) on 2014-09-16
Changed in linux (Ubuntu Trusty):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Utopic):
assignee: nobody → Tim Gardner (timg-tpi)
status: Incomplete → In Progress
Tim Gardner (timg-tpi) wrote :
wege (wg-mail) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected trusty
description: updated
wege (wg-mail) wrote : BootDmesg.txt

apport information

wege (wg-mail) wrote : Lspci.txt

apport information

wege (wg-mail) wrote : Lsusb.txt

apport information

wege (wg-mail) wrote : UdevDb.txt

apport information

wege (wg-mail) wrote : UdevLog.txt

apport information

wege (wg-mail) wrote :

Hello Tim,

I tested both kernels (http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11.6-trusty/ and http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16.2-utopic/) with the same result.

Cold boot, kernel 3.16.2: no link on either NIC

Cold boot, kernel 3.13.11: no link on the second NIC

Attached is the result for 3.13.11.

wege (wg-mail) wrote :

Attached the result for 3.16.2

Tim Gardner (timg-tpi) wrote :

How about http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.17-rc5-utopic/ ? If it's not working there, then this seems like an upstream Intel bug.

Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
Changed in linux (Ubuntu Utopic):
importance: Undecided → Medium
tags: added: kernel-key
Changed in linux (Ubuntu Trusty):
importance: Medium → High
Changed in linux (Ubuntu Utopic):
importance: Medium → High
wege (wg-mail) wrote :

The last recommended kernel (http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.17-rc5-utopic/) seems to be OK. It survived cold boot, soft boot, soft boot, cold boot.

All NICs were always initialized.

So there seem to be two working configurations now:
- kernel 3.17-rc5-utopic
- standard Ubuntu kernel 3.13.0-35-generic with the latest Intel driver (igb-5.2.9.4) installed manually

By the way, this is not a kernel-related problem, but on startup the system always waits for the network and then waits another 60 seconds for it ... might be related.

wege (wg-mail) wrote :

I re-read my previous post: the attachment should have been called noerror_317rc5.txt ;)

I should also mention that if the link is not established at boot time (with the faulty kernels), it can be forced by unplugging and reinserting the network cable. Merely unloading the igb module and loading it again did not change the state of the links.
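
The link state described above can also be read from sysfs rather than from the LEDs, which makes it easier to log before and after a replug or module reload. A minimal sketch, assuming the interface names eth0 and eth1 from this report; `link_state` and its optional second argument (an alternative sysfs directory) are hypothetical and exist only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: report the link state of an interface from sysfs.
# $1 = interface name, $2 = sysfs net directory (defaults to /sys/class/net).
link_state() {
    dir="${2:-/sys/class/net}/$1"
    if [ ! -d "$dir" ]; then
        echo "$1: no such interface"
        return 1
    fi
    # carrier reads 1 with link and 0 without; the read fails while the interface is down.
    carrier=$(cat "$dir/carrier" 2>/dev/null) || carrier="unknown"
    case "$carrier" in
        1) echo "$1: link up" ;;
        0) echo "$1: no link" ;;
        *) echo "$1: carrier unknown (interface down?)" ;;
    esac
}

# Usage: for i in eth0 eth1; do link_state "$i"; done
```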

Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Utopic):
status: In Progress → Fix Committed
wege (wg-mail) wrote :

Did a little more testing with the new kernels sent

Network configuration:
eth0 static IP, ping to alias eth0:0
eth1 dynamic, ping to eth1

                  3.13.0-37      3.16.0-16
                  eth0   eth1    eth0   eth1
coldboot          y5     y6      y1     y
unplug/plug 1     y5     y6      n2     y
unplug/plug 2     y5     y7      -      -
softboot          y5     y6      y3     y
unplug/plug 1     n8     n8      n4     y
unplug/plug 2     n9     n9      n4     y

(y1) long time needed for first ping on eth0 + NIC errors (only RX, no TX + lost) ... see attachment: error_316016.txt
(n2),(n4) no light at NIC (eth0), ping is working !!, mii-tool reports no-link
(y3) all lights on, NIC errors (only RX, no TX + lost)
(y5) lost packets on eth0
(y6) no TX count on eth1 (only 2 packets)
(y7) no light on eth1, no TX count on eth1, ping is working ... see attachment: error_313037.txt
(n8) no light on eth0 (plugged eth1 first), no ping on eth0 and eth1 !!
(n9) no light on eth0 (plugged eth0 first), no ping on eth0 and eth1 !!

-------------------------------------------------------------------------------------------------------------------------

Tested again with the standard Ubuntu kernel 3.13.0-35-generic and the latest Intel driver (igb-5.2.9.4) installed manually, to check unplug/plug behaviour. This driver is also broken (even worse).

                  3.13.0-35
                  eth0   eth1
coldboot          y10    y
unplug/plug 1     n11    n11
unplug/plug 2     n12    n12
softboot          y13    y13
unplug/plug 1     n14    n14
unplug/plug 2     n15    n15

(y10) No TX count on eth0, nic errors (only RX no TX + lost)
(n11) no ping on eth0 and eth1, no light on eth1 (plugged eth0 first)
(n12) no ping on eth0 and eth1, no light on eth1 (plugged eth1 first)
(y13) lost packets on eth0, no TX count on eth1
(n14) no ping on eth0 and eth1, no light on eth0 (plugged eth0 first)
(n15) no ping on eth0 and eth1, no light on eth0 (plugged eth1 first)

I hope the output is posted with fixed spacing ...

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.16.0-16.22

---------------
linux (3.16.0-16.22) utopic; urgency=low

  [ Andy Whitcroft ]

  * Revert "SAUCE: x86/xen: Fix setup of 64bit kernel pagetables"
  * [Config] tools -- only build common tools when enabled
  * [Config] follow rename of DEB_BUILD_PROFILES

  [ Tim Gardner ]

  * [Debian] set do_*_tools after stage1 or bootstrap is determined
    - LP: #1370211
  * Release Tracking Bug
    - LP: #1370535

  [ Upstream Kernel Changes ]

  * x86/xen: don't copy bogus duplicate entries into kernel page tables
  * blk-merge: fix blk_recount_segments
    - LP: #1359146
  * igb: bring link up when PHY is powered up
    - LP: #1370018
  * igb: remove unnecessary break after goto
    - LP: #1370018
  * igb: remove unnecessary break after return
    - LP: #1370018
  * igb: Add message when malformed packets detected by hw
    - LP: #1370018
  * igb: bump igb version to 5.2.13
    - LP: #1370018
 -- Tim Gardner <email address hidden> Tue, 16 Sep 2014 10:19:04 -0600

Changed in linux (Ubuntu Utopic):
status: Fix Committed → Fix Released
wege (wg-mail) wrote :

Please send me a link where I can download 3.16.0-16.22 (amd64 + headers) so I can test it - thx

wege (wg-mail) wrote :

Thank you for the links, but it seems to me that this kernel is much the same as the one I tested before. The boot issue seems to be fixed, but the driver has several more issues, as described in post #21 (please "download the full text", because the posted version looks a little confusing).

Three main issues remain:
- massive plug/unplug problems
- lost packets
- wrong packet counter on the interface

I am wondering whether the problems are caused by the revision of this network card, or whether the driver/NIC is known to have this sort of problem.
This is a built-in dual NIC in a FUJITSU PRIMERGY TX2540 M1 server, which is approved for SLES and RHEL ...

wege (wg-mail) wrote :

After further investigation ;) and testing, the problem is a little clearer.

We installed Windows 2012 R2 on the same server to rule out a potential hardware problem. !! On Windows the same problem exists !!

Then we changed the switch, replacing the original Netgear 8-port unmanaged switch with a Netgear 8+2-port managed switch. With the managed switch there is no problem on Windows or on Linux (the Intel driver release notes mention a potential problem with unmanaged switches).

No plug/unplug problem, no lost packets, and no wrong packet counter anymore.

So all fine? ... almost

The wrong packet counter, some of the misbehaviour on plug/unplug, and the lost packets seem to come FROM HAVING THE SAME NETWORK ON BOTH INTERFACES !!

Problem configuration 1:
eth0 192.168.86.11/24
eth0:0 172.22.1.81/24

dhcp:
eth1 172.22.1.243/24

ping test to 172.22.1.81 and 172.22.1.243
------------------------------------------------------------------
Problem configuration 2:
eth0 172.22.1.81/24

dhcp:
eth1 172.22.1.243/24

ping test to 172.22.1.81 and 172.22.1.243
------------------------------------------------------------------
Working configuration:
eth0 192.168.86.11/24

dhcp:
eth1 172.22.1.243/24

ping test to 192.168.86.11 and 172.22.1.243

... so there seems to be a general bug left in network stack/logic of the kernel ...
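
The configurations above differ in whether both interfaces carry an address in 172.22.1.0/24. A minimal sketch for checking that condition on a pair of addresses, written in plain POSIX shell arithmetic; the helper names `ip_to_int`, `network_of` and `same_subnet` are hypothetical, and the script only detects the overlap. It does not explain or fix the behaviour reported here.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split a.b.c.d into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Hypothetical helper: print the network number of an address given as a.b.c.d/prefix.
network_of() {
    addr=${1%/*}
    prefix=${1#*/}
    mask=$(( (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF ))
    n=$(ip_to_int "$addr")
    echo $(( $n & $mask ))
}

# Succeeds (exit status 0) if both CIDR addresses lie in the same network.
same_subnet() {
    [ "$(network_of "$1")" = "$(network_of "$2")" ]
}

# Usage with the addresses from this report:
#   same_subnet 172.22.1.81/24 172.22.1.243/24 && echo "eth0 and eth1 share a network"
```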

wege (wg-mail) wrote :

Additional information:

An overnight ping test to both interfaces (each with its own IP and network) shows a slight packet loss.

root@cpsvs3:~ # ifconfig
eth0 Link encap:Ethernet HWaddr 90:1b:0e:31:32:72
          inet addr:172.22.1.81 Bcast:172.22.1.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:186373 errors:0 dropped:2082 overruns:0 frame:0
          TX packets:59162 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:18359285 (18.3 MB) TX bytes:7070501 (7.0 MB)
          Memory:dfa00000-dfa7ffff

eth1 Link encap:Ethernet HWaddr 90:1b:0e:31:30:1b
          inet addr:192.168.2.11 Bcast:192.168.2.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:183393 errors:0 dropped:2082 overruns:0 frame:0
          TX packets:56443 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:18154822 (18.1 MB) TX bytes:6639390 (6.6 MB)
          Memory:df900000-df97ffff
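
The drop counter shown by ifconfig can also be sampled directly from sysfs, which makes it easier to see whether it grows during a test. A minimal sketch, assuming the interface names from this report; `if_counter` and its optional third argument (an alternative sysfs directory) are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print a kernel interface counter from sysfs.
# $1 = interface, $2 = counter name, $3 = sysfs net directory (defaults to /sys/class/net).
if_counter() {
    cat "${3:-/sys/class/net}/$1/statistics/$2"
}

# Usage: compare the drop counter before and after a test run.
#   before=$(if_counter eth0 rx_dropped)
#   ... run the ping test ...
#   after=$(if_counter eth0 rx_dropped)
#   echo "eth0 dropped $((after - before)) packets during the test"
```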

wege (wg-mail) wrote :

ping itself shows no packet loss ...

64 bytes from 192.168.2.11: icmp_seq=51761 ttl=64 time=0.344 ms
64 bytes from 192.168.2.11: icmp_seq=51762 ttl=64 time=0.269 ms
64 bytes from 192.168.2.11: icmp_seq=51763 ttl=64 time=0.269 ms
64 bytes from 192.168.2.11: icmp_seq=51764 ttl=64 time=0.270 ms
^C
--- 192.168.2.11 ping statistics ---
51764 packets transmitted, 51764 received, 0% packet loss, time 51763026ms
rtt min/avg/max/mdev = 0.112/0.285/0.597/0.039 ms

64 bytes from 172.22.1.81: icmp_seq=51773 ttl=64 time=0.291 ms
64 bytes from 172.22.1.81: icmp_seq=51774 ttl=64 time=0.241 ms
64 bytes from 172.22.1.81: icmp_seq=51775 ttl=64 time=0.223 ms
64 bytes from 172.22.1.81: icmp_seq=51776 ttl=64 time=0.301 ms
^C
--- 172.22.1.81 ping statistics ---
51776 packets transmitted, 51776 received, 0% packet loss, time 51775100ms
rtt min/avg/max/mdev = 0.098/0.302/2.344/0.041 ms

Tim Gardner (timg-tpi) wrote :

Marking invalid since the issue does not appear to be related to the driver.

Changed in linux (Ubuntu Trusty):
status: In Progress → Invalid