Intel X520 NICs (ixgbe) stop working in 12.10, 13.04, 13.10

Bug #1245938 reported by Fernando Sclavo on 2013-10-29
This bug affects 3 people
Affects           Status  Importance  Assigned to  Milestone
linux (Ubuntu)            Medium      Unassigned
  Quantal                 Medium      Unassigned
  Raring                  Medium      Unassigned
  Saucy                   Medium      Unassigned
  Trusty                  Medium      Unassigned

Bug Description

We have a server (Dell R715) with two Intel X520 NICs. If we run Ubuntu 12.04 on it, the NICs work flawlessly (with either the stock kernel driver or Intel's compiled one), but if we upgrade to 12.10, 13.04 or 13.10, the NICs stop working: both the stock and the Intel drivers fail with this error:

[ 226.395766] Intel(R) 10 Gigabit PCI Express Network Driver - version 3.18.7
[ 226.395770] Copyright (c) 1999-2013 Intel Corporation.
[ 226.395980] ixgbe: probe of 0000:22:00.0 failed with error -5
[ 226.396092] ixgbe: probe of 0000:22:00.1 failed with error -5
[ 226.396203] ixgbe: probe of 0000:23:00.0 failed with error -5
[ 226.396311] ixgbe: probe of 0000:23:00.1 failed with error -5

I contacted Intel developers and they responded:

"Hey Fernando,
We (ixgbe) only return EIO (error 5) for a few reasons:
 1) When we fail to io map (ioremap)
 2) If the eeprom checksum is incorrect.
 3) If the MAC address from the checksum is invalid

Reasons 2 and 3 are related to the NIC's eeprom so if they worked with another system they should still be fine now. If you really wanted to verify you could try out the NIC's on a known good system again to see if the eeprom somehow got corrupted.
That pretty much leaves us with ioremap returning an error. I'm not at all sure why your Ubuntu release would not like the way we are calling ioremap, but it might give you a place to start looking in Ubuntu changes.
Thanks,
-Don"

If the server boots kernel 3.2.0-55 (selected from the GRUB menu), both NICs work fine.

Please let me know how I can help!

Regards

Fernando
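For readers decoding the log above: the kernel reports a failed probe as a negated errno value, so "error -5" is errno 5, EIO, which matches Intel's reply. A minimal Python sketch, assuming the log format shown in this bug; the helper name failed_probes is hypothetical and not part of any tool mentioned here:

```python
import errno
import re

# Probe-failure lines copied from the dmesg output in the bug description.
DMESG = """\
[ 226.395980] ixgbe: probe of 0000:22:00.0 failed with error -5
[ 226.396092] ixgbe: probe of 0000:22:00.1 failed with error -5
"""

PROBE_FAIL = re.compile(r"probe of (\S+) failed with error (-\d+)")


def failed_probes(text):
    """Map each PCI address whose probe failed to its symbolic errno name."""
    result = {}
    for m in PROBE_FAIL.finditer(text):
        code = -int(m.group(2))  # the kernel reports errors as negated errno values
        result[m.group(1)] = errno.errorcode.get(code, f"errno {code}")
    return result


print(failed_probes(DMESG))
# {'0000:22:00.0': 'EIO', '0000:22:00.1': 'EIO'}
```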

Fernando Sclavo (fsclavo) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: saucy
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key quantal raring
Joseph Salisbury (jsalisbury) wrote :

Hi Fernando,

We can perform a kernel bisect to identify the commit that introduced this regression. However, I'd first like to have you test the latest 3.11 stable and 3.12-rc7 mainline kernels. Can you download the following kernels and see if they also exhibit the bug:

3.11.6: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11.6-saucy/
3.12-rc7: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-rc7-saucy/

Thanks in advance!

tags: added: performing-bisect
Fernando Sclavo (fsclavo) wrote :

Hi Joseph

I tried the suggested kernels, but both hang at:
Loading Linux 3.11.6-031106-generic
Loading initial ramdisk <-

Am I forgetting something? I just downloaded the .deb packages and installed them with "sudo dpkg -i package_name.deb" without errors (only a warning about a missing bnx2 firmware).

Thanks

Joseph Salisbury (jsalisbury) wrote :

That should be all that is required. Just install the linux-image .deb package.

Could you give 3.11.5 a try to see if this is another new issue? 3.11.5 can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11.5-saucy/

Fernando Sclavo (fsclavo) wrote :

No luck!
3.11.5 also hangs at "Loading initial ramdisk..."

Joseph Salisbury (jsalisbury) wrote :

It looks like you were able to boot 3.11.0-12.19-generic, judging by the dmesg.log attached in comment #1. That kernel is based on upstream 3.11.3. Can you confirm that 3.11.3 boots:

3.11.3: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11.3-saucy/

Fernando Sclavo (fsclavo) wrote :

Unfortunately 3.11.3 doesn't boot either. Every kernel installed by a regular upgrade/update boots OK, but not the ones I install manually from packages.
I don't know how to debug why these kernels don't boot; if you give me some tips, I can give it a try.

Joseph Salisbury (jsalisbury) wrote :

Hmm, it's strange that they don't boot. Can you also test the latest Trusty kernel from:
https://launchpad.net/ubuntu/+source/linux/3.11.0-12.19/+build/5088396

There are some hints on how to get further debug info from a boot failure at:

https://wiki.ubuntu.com/DebuggingKernelBoot

As mentioned on the wiki, it would be great if you can attach a log file which may have captured any messages you see. If you are unable to capture a log file, a digital photo will work just as well. As a last resort you can even copy messages down by hand.

Fernando Sclavo (fsclavo) wrote :

Yes! Kernel 3.11.0-12 boots!

idsuser@suricata:~$ uname -a
Linux suricata 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

But, same bug with ixgbe:

[ 14.424573] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 3.13.10-k
[ 14.424575] ixgbe: Copyright (c) 1999-2013 Intel Corporation.
[ 14.424778] ixgbe: probe of 0000:22:00.0 failed with error -5
[ 14.424887] ixgbe: probe of 0000:22:00.1 failed with error -5
[ 14.424995] ixgbe: probe of 0000:23:00.0 failed with error -5
[ 14.425100] ixgbe: probe of 0000:23:00.1 failed with error -5

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. There must be one or more Saucy patches in Ubuntu that allow your system to boot and are missing from mainline.

Can you also give the 3.5 final kernel a shot, since you see this in 12.10. If we can't get that kernel to boot, we can just bisect with Ubuntu kernels instead of upstream kernels.

3.5 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-quantal/

Fernando Sclavo (fsclavo) wrote :

No, Joseph, 3.5.0-030500-generic doesn't boot. Like the others, it hangs at:

Loading Linux 3.5.0-030500-generic ...
Loading initial ramdisk ...

Joseph Salisbury (jsalisbury) wrote :

It's probably best to bisect with the Ubuntu kernels, since we are unable to get the upstream kernels to boot. Can you test these early Quantal kernels:

v3.4.0-1.2: https://launchpad.net/ubuntu/+source/linux/3.4.0-1.2/+build/3454906
v3.5.0-1.1: https://launchpad.net/ubuntu/+source/linux/3.5.0-1.1/+build/3588755

Fernando Sclavo (fsclavo) wrote :

Joseph, here are the results (both kernels boot OK):

Kernel 3.4.0-1.2: NICs are working.
Kernel 3.5.0-1.1: NICs don't work. Same error as before (ixgbe: probe of 0000:22:00.0 failed with error -5)

Joseph Salisbury (jsalisbury) wrote :

So it looks like we are getting closer. We just need to narrow down the versions a little more. Can you test the following kernel:

3.4.0-5.11: https://launchpad.net/ubuntu/+source/linux/3.4.0-5.11/+build/3550103

Fernando Sclavo (fsclavo) wrote :

Hi Joseph.
Kernel 3.4.0-5.11 also fails with "error -5"
We are a little bit closer

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update. Let's try in the middle of those versions:

v3.4.0-2.6: https://launchpad.net/ubuntu/+source/linux/3.4.0-2.6/+build/3495113

Fernando Sclavo (fsclavo) wrote :

Yes! 3.4.0-2.6 worked fine!

Joseph Salisbury (jsalisbury) wrote :

Great, we are getting closer. Can you now try v3.4.0-3.8:

https://launchpad.net/ubuntu/+source/linux/3.4.0-3.8/+build/3525013

Fernando Sclavo (fsclavo) wrote :

3.4.0-3.8 also works fine!

Joseph Salisbury (jsalisbury) wrote :

Can you next test 3.4.0-4.9:
https://launchpad.net/ubuntu/+source/linux/3.4.0-4.9/+build/3548008

If 3.4.0-4.9 is good, then test 3.4.0-4.10:
https://launchpad.net/ubuntu/+source/linux/3.4.0-4.10/+build/3548900

These last two tests should tell us the last good kernel and the first bad kernel. We can then use these two kernel versions to perform a bisect to identify the exact commit that introduced this bug.

Fernando Sclavo (fsclavo) wrote :

Joseph, 3.4.0-4.9 isn't good: the NICs (ixgbe) fail with this kernel with the same error code: -5

Joseph Salisbury (jsalisbury) wrote :

Thanks for the feedback. It looks like we now have the last good and first bad kernel versions. I'll start a bisect between v3.4.0-3.8 and v3.4.0-4.9 and post a test kernel shortly.

Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between v3.4.0-3.8 and v3.4.0-4.9. The kernel bisect will require testing of about 7-10 test kernels.

I built the first test kernel, up to the following commit:
44962d369d481b0d33bc9d98da33cbf803aff4ac

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance
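Each bisect step halves the remaining commit range, which is why N candidate commits need only about log2(N) test kernels. A minimal Python sketch of that search, assuming a linear commit list with a single good-to-bad transition; bisect_first_bad and the is_bad callback are hypothetical stand-ins for "build a test kernel at this commit, boot it, and check for the error -5 probe failure" (on a real tree, git bisect automates the same loop):

```python
import math


def bisect_first_bad(commits, is_bad):
    """Find the first bad commit in an ordered list.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    Returns (first_bad_commit, number_of_test_kernels).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1  # one test kernel built and booted per step
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tested


# Hypothetical range of 200 commits whose first bad commit is at index 137.
commits = [f"c{i}" for i in range(200)]
culprit, tested = bisect_first_bad(commits, lambda c: int(c[1:]) >= 137)
print(culprit, tested, math.ceil(math.log2(len(commits) - 1)))
# c137 7 8
```

The last number printed is the worst-case test count for this range; the actual count can be lower depending on where the first bad commit falls.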

Fernando Sclavo (fsclavo) wrote :

Sorry for the delay, Joseph, I was out of the office. I just tested 3.4.0-3.9 and the driver fails.
I'll wait for your next kernel to test

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
8b2f1712862739b3636e5448c89be193718bf59d

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Fernando Sclavo (fsclavo) wrote :

It failed too, with the same error.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
44b7400e6724f9a238f62bbeffd462a449d9a518

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Fernando Sclavo (fsclavo) wrote :

Joseph, kernel 3.4.0-3.9~lp1245938Commit44b7400 also has the bug.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
58d5086c98b45a062c6058b5a6398fbbb42603f1

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Fernando Sclavo (fsclavo) wrote :

Kernel 3.4.0-3.8~lp1245938Commit58d5086 also fails with the same error.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
49cc9a18182f7940a89f997b103e3b52e810ef06

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Fernando Sclavo (fsclavo) wrote :

Kernel 3.4.0-3-generic_3.4.0-3.8~lp1245938Commit49cc9a18 also fails Joseph.

thomas955 (thoehlig) wrote :

Hi,

I also have the problem with newer kernels. I'm on Debian wheezy / jessie.
Wheezy with kernel 3.2 works fine, but with newer kernels (3.9 or 3.10-3) there is always the "probe of ... failed with error -5" message.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
3ab9eb93bbb892fc154e35f13970744187402056

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Fernando Sclavo (fsclavo) wrote :

Joseph, kernel 3.4.0-3-generic_3.4.0-3.8~lp1245938Commit3ab9eb9 doesn't fail, it works OK!

thomas955 (thoehlig) wrote :

Hi Joseph,

uname -a
Linux production01 3.4.0-3-generic #8~lp1245938Commit3ab9eb9 SMP Thu Nov 21 18:07:11 UTC 2013 x86_64 GNU/Linux

cat /proc/version
Linux version 3.4.0-3-generic (root@gomeisa) (gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1) ) #8~lp1245938Commit3ab9eb9 SMP Thu Nov 21 18:07:11 UTC 2013

works for me too!

lspci
22:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
22:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)

dmesg | grep ixgbe
http://pastie.org/8506784

Joseph Salisbury (jsalisbury) wrote :

The bisect indicated the following commit as the first bad commit:
49cc9a18182f7940a89f997b103e3b52e810ef06

I built a test kernel with this commit reverted.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not?

thomas955 (thoehlig) wrote :

Hi Joseph,

The Kernel works for me.

What did you revert in detail?
Thank you in advance
Thomas

uname -a
Linux production01 3.5.0-44-generic #67~lp1245938Commit49cc9a18Reverted SMP Mon Nov 25 19:30:21 UTC x86_64 GNU/Linux

Fernando Sclavo (fsclavo) wrote :

Confirmed Joseph, kernel 3.5.0-44-generic_3.5.0-44.67~lp1245938Commit49cc9a18Reverted works fine!

Changed in linux (Ubuntu Saucy):
importance: Undecided → Medium
Changed in linux (Ubuntu Raring):
importance: Undecided → Medium
Changed in linux (Ubuntu Quantal):
importance: Undecided → Medium
Changed in linux (Ubuntu Saucy):
status: New → Confirmed
Changed in linux (Ubuntu Quantal):
status: New → Confirmed
Changed in linux (Ubuntu Raring):
status: New → Confirmed
Bjorn Helgaas (bjorn-helgaas) wrote :

Fernando, can you attach the dmesg log and "lspci -vv" output from the newest working kernel, so we can compare them with those from the non-working 3.11 kernel?
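One way to do that comparison, sketched here under the assumption that the BAR lines in lspci output ("Memory at ... [size=...]", "I/O ports at ... [size=...]") are what you want to compare, is to extract the BAR assignments from each kernel's `lspci -vv` output and diff them. parse_bars and bar_differences are hypothetical helpers, not part of lspci:

```python
import re

# Matches BAR lines in `lspci -v` / `lspci -vv` output, for example
#   "Memory at e3600000 (64-bit, non-prefetchable) [size=512K]"
#   "I/O ports at bcc0 [size=32]"
# Lines such as "Expansion ROM at ..." are deliberately not matched.
BAR_RE = re.compile(r"^\s*(Memory|I/O ports) at (\S+).*?\[size=(\w+)\]")


def parse_bars(lspci_text):
    """Return (kind, address, size) for every BAR line, in output order."""
    bars = []
    for line in lspci_text.splitlines():
        m = BAR_RE.match(line)
        if m:
            bars.append((m.group(1), m.group(2), m.group(3)))
    return bars


def bar_differences(good_text, bad_text):
    """BARs present only in the working kernel's output, and only in the failing one's."""
    good, bad = set(parse_bars(good_text)), set(parse_bars(bad_text))
    return sorted(good - bad), sorted(bad - good)


# BAR lines for 22:00.0 as posted in this bug from a working 3.13 test kernel.
GOOD = """\
\tMemory at e3600000 (64-bit, non-prefetchable) [size=512K]
\tI/O ports at bcc0 [size=32]
\tMemory at e37f8000 (64-bit, non-prefetchable) [size=16K]
"""

print(parse_bars(GOOD))
# [('Memory', 'e3600000', '512K'), ('I/O ports', 'bcc0', '32'), ('Memory', 'e37f8000', '16K')]
```

Run it against the full output of both kernels (for example saved with `lspci -vv > lspci-good.txt`); the file names are placeholders.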

thomas955 (thoehlig) wrote :

Is there a way to add multiple files on launchpad?

Thomas

Fernando Sclavo (fsclavo) wrote :
Joseph Salisbury (jsalisbury) wrote :

We received a patch from upstream:
http://www.spinics.net/lists/linux-pci/msg26805.html

I built a mainline kernel with this patch, which can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not?

thomas955 (thoehlig) wrote :

Hi Joseph
thanks, by the way, for your help so far

uname -a
Linux production01 3.13.0-031300rc1-generic #201311291222 SMP Fri Nov 29 17:25:01 UTC 2013 x86_64 GNU/Linux
WORKING :-D

Thomas

P.S.
A small problem with your kernel:

 dpkg -i linux-image-3.13.0-031300rc1-generic_3.13.0-031300rc1.201311291222_amd64.deb
(Reading database ... 149718 files and directories currently installed.)
Preparing to replace linux-image-3.13.0-031300rc1-generic 3.13.0-031300rc1.201311291222 (using linux-image-3.13.0-031300rc1-generic_3.13.0-031300rc1.201311291222_amd64.deb) ...
Done.
Unpacking replacement linux-image-3.13.0-031300rc1-generic ...
Examining /etc/kernel/postrm.d .
run-parts: executing /etc/kernel/postrm.d/initramfs-tools 3.13.0-031300rc1-generic /boot/vmlinuz-3.13.0-031300rc1-generic
run-parts: executing /etc/kernel/postrm.d/zz-update-grub 3.13.0-031300rc1-generic /boot/vmlinuz-3.13.0-031300rc1-generic
Setting up linux-image-3.13.0-031300rc1-generic (3.13.0-031300rc1.201311291222) ...
Running depmod.
update-initramfs: deferring update (hook will be called later)
Not updating initrd symbolic links since we are being updated/reinstalled
(3.13.0-031300rc1.201311291222 was configured last, according to dpkg)
Not updating image symbolic links since we are being updated/reinstalled
(3.13.0-031300rc1.201311291222 was configured last, according to dpkg)
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 3.13.0-031300rc1-generic /boot/vmlinuz-3.13.0-031300rc1-generic
run-parts: executing /etc/kernel/postinst.d/dkms 3.13.0-031300rc1-generic /boot/vmlinuz-3.13.0-031300rc1-generic
Error! Bad return status for module build on kernel: 3.13.0-031300rc1-generic (x86_64)
Consult /var/lib/dkms/openvswitch/1.9.3+git20131029/build/make.log for more information.
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 3.13.0-031300rc1-generic /boot/vmlinuz-3.13.0-031300rc1-generic
update-initramfs: Generating /boot/initrd.img-3.13.0-031300rc1-generic
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 3.13.0-031300rc1-generic /boot/vmlinuz-3.13.0-031300rc1-generic
Generating grub.cfg ...
Found linux image: /boot/vmlinuz-3.13.0-031300rc1-generic
Found initrd image: /boot/initrd.img-3.13.0-031300rc1-generic
Found linux image: /boot/vmlinuz-3.11-2-amd64
Found initrd image: /boot/initrd.img-3.11-2-amd64
Found linux image: /boot/vmlinuz-3.10-3-amd64
Found initrd image: /boot/initrd.img-3.10-3-amd64
Found linux image: /boot/vmlinuz-3.9-1-amd64
Found initrd image: /boot/initrd.img-3.9-1-amd64
Found linux image: /boot/vmlinuz-3.5.0-44-generic
Found initrd image: /boot/initrd.img-3.5.0-44-generic
Found linux image: /boot/vmlinuz-3.4.0-3-generic
Found initrd image: /boot/initrd.img-3.4.0-3-generic
Found linux image: /boot/vmlinuz-3.2.0-4-amd64
Found initrd image: /boot/initrd.img-3.2.0-4-amd64
done

I'll attach the lspci -vv and dmesg output in the next two posts.

Thank you much!

thomas955 (thoehlig) wrote :

Hi Joseph,
I meant
thank you :-)

I'm online at the moment; if there's an IRC channel, maybe we can meet there.
Thomas

thomas955 (thoehlig) wrote :

some more info if needed

Fernando Sclavo (fsclavo) wrote :

Joseph, unfortunately kernel 3.13 doesn't boot on our server; like the others before it, it hangs at "loading ramdisk".

thomas955 (thoehlig) wrote :

Hi Fernando,

I think we're the only people in the Debian/Ubuntu world with this problem :-)
I had the same problem the first time. I initially installed all the packages on my server with dpkg -i *.deb and ran into the same problem: my kernel wasn't able to boot (there was no option to start the 3.13 kernel).

Try removing the x86/64 header/image files again and installing only the kernel files that match your system.
Hope this helps.
Thomas

thomas955 (thoehlig) wrote :

Also check whether all the appropriate files are in
/boot/

For me, vmlinux-* was the missing one.
You may want to look at the error report after the postscript of my earlier post (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1245938/comments/52).

Joseph Salisbury (jsalisbury) wrote :

@Fernando, I built a Quantal test kernel with the patch from upstream, which can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not?

We should probably open another bug for the issue of newer kernels failing to boot.

Fernando Sclavo (fsclavo) wrote :

Joseph: kernel 3.5.0-44_3.5.0-44.67~lp1245938v2Patched works ok!

Thomas: apparently the only way to boot some kernels is to also install the "extra" package. For other kernels this isn't required.

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing, Fernando. I'll let the upstream developer know that his patch fixes the issue. We should look at your other issue that prevents newer kernels from booting, since this patch from upstream will land in mainline first and may not end up in the 3.5 kernel tree since it requires a bunch of other prereq commits.

Can you open a new bug for the boot issue?

Joseph Salisbury (jsalisbury) wrote :

The patch author will submit his patch and cc stable:

https://lkml.org/lkml/2013/12/3/812

Joseph Salisbury (jsalisbury) wrote :

I built another test kernel with a patch from upstream:

 http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not?

thomas955 (thoehlig) wrote :

Hi Joseph,

not working for me :-(
Linux production01 3.6.0-030600rc6-generic #201312111606 SMP Wed Dec 11 21:09:39 UTC 2013 x86_64 GNU/Linux

dmesg | grep ixgbe
[ 1.997512] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 3.9.15-k
[ 1.997515] ixgbe: Copyright (c) 1999-2012 Intel Corporation.
[ 1.997700] ixgbe: probe of 0000:22:00.0 failed with error -5
[ 1.997815] ixgbe: probe of 0000:22:00.1 failed with error -5

Joseph Salisbury (jsalisbury) wrote :

Looks like I built the wrong branch from
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git

I'll build it again, this time using branch: for-pci-3.14

Joseph Salisbury (jsalisbury) wrote :

I rebuilt the test kernel from upstream:

 http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you test that kernel and report back if it has the bug or not?

kybe (kyrre-begnum) wrote :

I can confirm the same behaviour on a Dell R815 with an Intel X520 network card. The cards work fine with the 3.2 kernel but not with the current 3.8 one in Ubuntu. @Joseph: I was not able to boot your 3.13 kernel from this source:

 http://kernel.ubuntu.com/~jsalisbury/lp1245938

Fernando Sclavo (fsclavo) wrote :

Same here: the headers package won't install and the kernel doesn't boot.

thomas955 (thoehlig) wrote :

Hi,

@Joseph: I'll test your kernel later today or tomorrow.

FYI
Workaround: if you need to boot a "new" kernel, you can try adding

pci=realloc=off

to your grub config.
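To make that workaround persistent on Ubuntu/Debian, the usual place is the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub. A small Python sketch, assuming that standard file layout; add_kernel_param is a hypothetical helper that only transforms text and does not touch the real file:

```python
import re


def add_kernel_param(grub_default, param="pci=realloc=off"):
    """Return the text of /etc/default/grub with `param` appended to
    GRUB_CMDLINE_LINUX_DEFAULT (unchanged if it is already there)."""
    line_re = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)

    def _append(m):
        args = m.group(2).split()
        if param not in args:
            args.append(param)
        return m.group(1) + " ".join(args) + m.group(3)

    return line_re.sub(_append, grub_default, count=1)


sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(sample))
# GRUB_DEFAULT=0
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc=off"
```

After writing the result back to /etc/default/grub as root, run `sudo update-grub` and reboot; `cat /proc/cmdline` should then include pci=realloc=off. Editing the line by hand works just as well.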

tags: added: bios-outdated-3.2.1
Joseph Salisbury (jsalisbury) wrote :

@kybe @Fernando,

Can you post what error messages you were seeing when trying to install my kernel?

@thomas955,
Were you able to install my test kernel?

thomas955 (thoehlig) wrote :

Hi Joseph,

A bit late, but happy new year.

I tested your kernel and it works (FYI: I disabled my GRUB command-line parameter pci=realloc=off).

Some info:

Linux production01 3.13.0-031300rc2-generic #201312121210 SMP Thu Dec 12 17:12:35 UTC 2013 x86_64 GNU/Linux

modinfo ixgbe
filename: /lib/modules/3.13.0-031300rc2-generic/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
version: 3.15.1-k

lspci -v
22:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
 Subsystem: Intel Corporation Ethernet Server Adapter X520-2
 Flags: bus master, fast devsel, latency 0, IRQ 60
 Memory at e3600000 (64-bit, non-prefetchable) [size=512K]
 I/O ports at bcc0 [size=32]
 Memory at e37f8000 (64-bit, non-prefetchable) [size=16K]
 Expansion ROM at e3700000 [disabled] [size=512K]
 Capabilities: [40] Power Management version 3
 Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
 Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
 Capabilities: [a0] Express Endpoint, MSI 00
 Capabilities: [e0] Vital Product Data
 Capabilities: [100] Advanced Error Reporting
 Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-9c-f7-70
 Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
 Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
 Kernel driver in use: ixgbe

Thomas

Fernando Sclavo (fsclavo) wrote :

Joseph, I have no problem installing those kernels; the issue is that none of them boots.

Joseph Salisbury (jsalisbury) wrote :

@Fernando, @kybe,

Did you test my kernel with pci=realloc=off, or did you leave it at the default of on?

thomas955 (thoehlig) wrote :

Hi,

any news?

Joseph Salisbury (jsalisbury) wrote :

I built one more test kernel, from the latest upstream git tree[0]. This kernel can be downloaded from:
 http://kernel.ubuntu.com/~jsalisbury/lp1245938

Can you also give this kernel a test, so we can provide feedback to upstream?

[0] git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git

Fernando Sclavo (fsclavo) wrote :

Joseph, I tried 3.14.0-031400rc1-generic and fails (same "error -5")

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Fernando. Did you get that kernel from the kernel-ppa, or from the link I posted in comment #76?

Fernando Sclavo (fsclavo) wrote :

The kernel I tried was downloaded from your link Joseph, not from ppa.

thomas955 (thoehlig) wrote :

Hey,

some time has passed...

I've tried the new 14.04 LTS.
The bug seems to be gone. All my hardware is working.

Can you give me some info on whether our work was of any help? Was it simply accident or luck that we had this issue?

I read your conversation with the person in charge of ixgbe development, but I'm not quite sure whether you really solved the problem yourselves.

Thanks, by the way, for the help all along!
For my part: Ubuntu 14.04 Server LTS, no problems. If you need further info, I'll help!

Hi there.

I had to switch to an earlier kernel and the systems went into production. From then on I didn't have the equipment to test further. I will try to test the 14.04 kernel as soon as a machine becomes available.

On 30 July 2014, at 01:38, thomas955 <email address hidden> wrote:

Fernando Sclavo (fsclavo) wrote :

Unfortunately, I upgraded the server to 14.04, but the bug is still there:

dmesg:

[ 35.012522] ixgbe 0000:22:00.1: Multiqueue Enabled: Rx Queue count = 32, Tx Queue count = 32
[ 35.012653] ixgbe 0000:22:00.1: PCI Express bandwidth of 32GT/s available
[ 35.012655] ixgbe 0000:22:00.1: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[ 35.012990] ixgbe 0000:22:00.1: MAC: 2, PHY: 15, SFP+: 6, PBA No: G18786-003
[ 35.012992] ixgbe 0000:22:00.1: 90:e2:ba:20:b8:05
[ 35.014727] ixgbe 0000:22:00.1: Intel(R) 10 Gigabit Network Connection
[ 35.014908] ixgbe: probe of 0000:23:00.0 failed with error -5
[ 35.015031] ixgbe: probe of 0000:23:00.1 failed with error -5

uname -a:
Linux suricata 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

thomas955 (thoehlig) wrote :

Hi,

In addition, I did a few "system" upgrades too (maybe that's why it's working for me).

My system: Dell PowerEdge R815
BIOS version 3.2.1
Firmware: 1.96 (Build 01)

uname -a
Linux hostname 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Now some ethtool info:
ethtool -i 10gbiface
driver: ixgbe
version: 3.15.1-k
firmware-version: 0x546c0001
bus-info: 0000:22:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

If you also have a Dell, try upgrading the firmware.
P.S.
You can email me; I have some "helpful" upgrade links :-D

Fernando Sclavo (fsclavo) wrote :

No luck. All server firmware was upgraded with SUU 14.07 (the newest), and Ubuntu was updated as well, but it's still not working. Some info:

dmesg:

[ 35.012522] ixgbe 0000:22:00.1: Multiqueue Enabled: Rx Queue count = 32, Tx Queue count = 32
[ 35.012653] ixgbe 0000:22:00.1: PCI Express bandwidth of 32GT/s available
[ 35.012655] ixgbe 0000:22:00.1: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
[ 35.012990] ixgbe 0000:22:00.1: MAC: 2, PHY: 15, SFP+: 6, PBA No: G18786-003
[ 35.012992] ixgbe 0000:22:00.1: 90:e2:ba:20:b8:05
[ 35.014727] ixgbe 0000:22:00.1: Intel(R) 10 Gigabit Network Connection
[ 35.014908] ixgbe: probe of 0000:23:00.0 failed with error -5
[ 35.015031] ixgbe: probe of 0000:23:00.1 failed with error -5

uname -a:
Linux suricata 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

This bug was nominated against a series that is no longer supported, ie saucy. The bug task representing the saucy nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Saucy):
status: Confirmed → Won't Fix
Joseph Salisbury (jsalisbury) wrote :

This bug was nominated against a series that is no longer supported, ie raring. The bug task representing the raring nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Raring):
status: Confirmed → Won't Fix
Joseph Salisbury (jsalisbury) wrote :

This bug was nominated against a series that is no longer supported, ie quantal. The bug task representing the quantal nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Quantal):
status: Confirmed → Won't Fix
thomas955 (thoehlig) wrote :

Hi

did you find a way to solve your problem?
I'm running fine with my configuration from post https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1245938/comments/83 .
I tried some newer kernels as well and have never run into this problem again so far :-D

If you need further help, we can try.
Greetings

Joseph Salisbury (jsalisbury) wrote :

This issue was discussed upstream[0], but a permanent fix has not been implemented yet. A similar bug was also opened: bug 1363313.

I'll ping upstream regarding this issue.

[0] https://lkml.org/lkml/2014/1/10/401

Joseph Salisbury (jsalisbury) wrote :

Can you also confirm if this bug still exists in the latest upstream kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.17-rc5-utopic/

tags: removed: performing-bisect
Joseph Salisbury (jsalisbury) wrote :

Upstream has requested some additional data to help resolve this issue[0]. They would also like us to open an upstream bug for additional tracking.

For regressions, it's helpful if you can attach dmesg logs from working and non-working kernels that are as close together as possible. Is it possible to collect the additional dmesg output?

[0] https://lkml.org/lkml/2014/9/25/432
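For the working vs. non-working dmesg comparison requested above, stripping the boot timestamps first keeps the diff focused on real differences. A minimal sketch using Python's standard difflib module; diff_dmesg is a hypothetical helper and the log lines below are illustrative excerpts:

```python
import difflib
import re

# Leading "[  226.395766] " style timestamps differ on every boot, so drop them.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(lines):
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_dmesg(good_lines, bad_lines):
    """Unified diff of two dmesg captures with timestamps removed."""
    return list(difflib.unified_diff(
        normalize(good_lines), normalize(bad_lines),
        fromfile="dmesg-good", tofile="dmesg-bad", lineterm=""))


# Illustrative excerpts; in practice pass e.g. open("dmesg-good.log").readlines().
good = ["[    2.000000] ixgbe 0000:22:00.0: Intel(R) 10 Gigabit Network Connection\n"]
bad = ["[    1.997700] ixgbe: probe of 0000:22:00.0 failed with error -5\n"]

for line in diff_dmesg(good, bad):
    print(line)
```

Messages that change order between boots will still show up in the diff; this only removes the timing noise.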

thomas955 (thoehlig) wrote :

Hi Joseph,

I'm very busy at work at the moment. I will try to do this at the weekend.
Should I post all the information right here or somewhere else?

Maybe I have a server that still has the problem. I'll check and send you the appropriate data from the working and the non-working server.

Joseph Salisbury (jsalisbury) wrote :

Thanks, Thomas. Posting here should be fine.


Hi Thomas.
Unfortunately it isn't working for me. I tried upgrading the firmware, the kernel and
even the NIC drivers, with no luck: it always fails with the same "error 5" message.
Right now, I'm running the server with the most recent 3.2 kernel.

Regards

2014-07-31 4:45 GMT-03:00 thomas955 <email address hidden>:

> Hi,
>
> Additionally, I did a few "system" upgrades too (maybe that's why it's working
> for me).
>
> My system : dell poweredge R815
> BIOS version 3.2.1
> Firmware: 1.96 (Build 01)
>
> uname -a
> Linux hostname 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC
> 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> Now some ethtool info:
> ethtool -i 10gbiface
> driver: ixgbe
> version: 3.15.1-k
> firmware-version: 0x546c0001
> bus-info: 0000:22:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
>
> If you also have a Dell, try upgrading the firmware.
> P.S.
> You can write me an email; I have some "helpful" upgrade links :-D
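
Since the suggestion above is to compare firmware between a working and a failing machine, here is a small sketch for doing that from saved `ethtool -i` output, assuming the plain "key: value" format quoted above (without the email "> " prefixes). The functions parse_ethtool_info and differing_fields are hypothetical helpers written for this thread.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i IFACE` output ("key: value" per line) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only: values such as bus-info contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def differing_fields(a, b, keys=("driver", "version", "firmware-version")):
    """Return {field: (value_a, value_b)} for the fields that differ between two NICs."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Feeding it the output quoted above gives firmware-version 0x546c0001 and bus-info 0000:22:00.0, which can then be checked against the same fields on the failing server.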

thomas955 (thoehlig) wrote :

Hi Fernando,

We'll solve this problem for you too, I'm sure.
Please, we need some more info (and your 3.2 kernel seems to be quite old).

So please post verbose dmesg output, lspci output (I don't know if this helps) and uname -a.
I'm sure we will fix this with Joseph's really nice help!
Please post this info and I'll try to look at it at the weekend!

Greets!
Thomas

hoorid (horridguy123123) wrote :

System:
PowerEdge R920
BIOS: 1.3.2
X540-AT2 (rev 01) Firmware: 16.0.24

Linux 3.13.0-45-generic #74-Ubuntu SMP Tue Jan 13 19:36:28 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

This shows the same problem: "failed with error -5".

I can confirm that booting with "pci=realloc=off" makes the errors go away and the card starts working.
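
To check whether a running system was actually booted with this workaround, the kernel command line in /proc/cmdline can be inspected. A minimal sketch, assuming the kernel's `pci=` parameter takes a comma-separated option list (e.g. `pci=nocrs,realloc=off`) in which `realloc`, `realloc=on` and `realloc=off` may appear. The function name pci_realloc_setting is hypothetical, and letting the last realloc option win is a simplifying assumption of this sketch.

```python
from typing import Optional


def pci_realloc_setting(cmdline: str) -> Optional[str]:
    """Return "on" or "off" if the command line sets pci=realloc, else None.

    None means no realloc option was given, so the kernel's built-in default applies.
    """
    setting = None
    for token in cmdline.split():
        if not token.startswith("pci="):
            continue  # e.g. "pcie_aspm=off" is a different parameter
        for opt in token[len("pci="):].split(","):
            if opt in ("realloc", "realloc=on"):
                setting = "on"
            elif opt == "realloc=off":
                setting = "off"
    return setting
```

Usage: open /proc/cmdline, read it, and pass the string to pci_realloc_setting; "off" confirms the workaround is active on that boot.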

hoorid (horridguy123123) wrote :

Alright, I solved it on a PowerEdge R920 without "pci=realloc=off" or any patches.

In the BIOS, under "Integrated Devices", set:

"SR-IOV Global Enable" = ENABLE

Bjorn Helgaas (bjorn-helgaas) wrote :

From the Linux kernel point of view, changing the BIOS option is not a good fix, and "pci=realloc=off" is just a workaround and not a real fix either. Linux should be able to work even without that, or at least give meaningful error messages.

The original problem appears to be that:

  - BIOS allocated space for the normal PCI BARs 0, 2 and 4, but not for the SR-IOV VF BARs,
  - Linux tried to allocate space for the SR-IOV BARs, but failed, and
  - Linux removed even the space allocated for the normal PCI BARs

Here's the initial state Linux found:

  pci 0000:22:00.0: [8086:10fb] type 00 class 0x020000
  pci 0000:22:00.0: reg 0x10: [mem 0xe9400000-0xe947ffff 64bit] # BAR 0
  pci 0000:22:00.0: reg 0x18: [io 0xdcc0-0xdcdf] # BAR 2
  pci 0000:22:00.0: reg 0x20: [mem 0xe95f8000-0xe95fbfff 64bit] # BAR 4
  pci 0000:22:00.0: reg 0x30: [mem 0xe9500000-0xe957ffff pref] # ROM BAR
  pci 0000:22:00.0: reg 0x184: [mem 0x00000000-0x00003fff 64bit] # SR-IOV BAR
  pci 0000:22:00.0: reg 0x190: [mem 0x00000000-0x00003fff 64bit] # SR-IOV BAR
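
A worked reading of these lines, as a sketch: a standard BAR n lives at config offset 0x10 + 4n, so 0x10, 0x18 and 0x20 are BARs 0, 2 and 4, 0x30 is the ROM BAR, and 0x184/0x190 are the SR-IOV BARs annotated above. Each range is inclusive, so size = end - start + 1, which reproduces the sizes in the failure messages below (0x80000 for BAR 0, 0x4000 for BAR 4). The helpers label and bar_sizes are hypothetical. The VF count at the end is an inference, assuming Linux sizes the SR-IOV resource as one VF's BAR size times the VF count; it is not a value logged in this report.

```python
import re

# e.g. "reg 0x10: [mem 0xe9400000-0xe947ffff 64bit]" or "reg 0x18: [io 0xdcc0-0xdcdf]"
REG = re.compile(r"reg (0x[0-9a-f]+): \[(mem|io)\s+(0x[0-9a-f]+)-(0x[0-9a-f]+)")


def label(offset):
    """Name a config-space offset the way the annotated log does."""
    if 0x10 <= offset <= 0x24:
        return "BAR %d" % ((offset - 0x10) // 4)  # standard BARs sit 4 bytes apart
    if offset == 0x30:
        return "ROM BAR"
    return "SR-IOV BAR"  # 0x184 and 0x190 in this log, per the annotations


def bar_sizes(log):
    """Return (label, space, size) for every 'reg' line; ranges are inclusive."""
    result = []
    for m in REG.finditer(log):
        start, end = int(m.group(3), 16), int(m.group(4), 16)
        result.append((label(int(m.group(1), 16)), m.group(2), end - start + 1))
    return result


# Inference: aggregate SR-IOV BAR size / per-VF BAR size = number of VFs.
PER_VF_SIZE = 0x4000        # reg 0x184: [mem 0x00000000-0x00003fff 64bit]
AGGREGATE_SIZE = 0x100000   # BAR 7: can't assign mem (size 0x100000)
IMPLIED_VFS = AGGREGATE_SIZE // PER_VF_SIZE  # 64
```

So each port asks for an extra 1 MiB per SR-IOV BAR (two of them) on top of its normal BARs, which is the space the BIOS had not reserved.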

After trying to move things around to provide space for the SR-IOV BARs:

  pci 0000:22:00.0: BAR 0: can't assign mem (size 0x80000) # BAR 0
  pci 0000:22:00.0: BAR 4: can't assign mem (size 0x4000) # BAR 4
  pci 0000:22:00.0: BAR 7: can't assign mem (size 0x100000) # SR-IOV BAR
  pci 0000:22:00.0: BAR 10: can't assign mem (size 0x100000) # SR-IOV BAR
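
The "BAR n" numbers in these messages are Linux resource indices rather than config-space BAR numbers. As far as I know, in kernels of this era with CONFIG_PCI_IOV enabled, indices 0-5 are the standard BARs, 6 is the ROM and 7-12 are SR-IOV VF BARs 0-5, which is why BAR 7 and BAR 10 are tagged as SR-IOV above. A minimal sketch that classifies the failures and lists the standard BARs that were lost (the third point of the summary above); failed_assignments and lost_standard_bars are hypothetical helper names.

```python
import re

FAILED = re.compile(r"BAR (\d+): can't assign (mem|io) \(size (0x[0-9a-f]+)\)")


def resource_kind(index):
    """Classify a Linux resource index (assumes CONFIG_PCI_IOV numbering)."""
    if 0 <= index <= 5:
        return "standard BAR"
    if index == 6:
        return "ROM"
    if 7 <= index <= 12:
        return "SR-IOV VF BAR %d" % (index - 7)
    return "other"


def failed_assignments(log):
    """Return (index, kind, size) for each "can't assign" message."""
    return [(int(m.group(1)), resource_kind(int(m.group(1))), int(m.group(3), 16))
            for m in FAILED.finditer(log)]


def lost_standard_bars(log):
    """Indices of normal BARs left unassigned: these are what break the ixgbe probe."""
    return [i for i, kind, _ in failed_assignments(log) if kind == "standard BAR"]
```

Run on the four lines above, lost_standard_bars reports BARs 0 and 4, i.e. the driver's register BARs, matching the ioremap failure Intel suspected.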

This may have been fixed by changes in the resource assignment code. I don't remember similar recent problem reports.
