karmic/lucid installation slow on "detecting network hardware" with bnx2x

Bug #415353 reported by Mattias Wadenstein on 2009-08-18
46
This bug affects 4 people
Affects Status Importance Assigned to Milestone
initramfs-tools (Ubuntu)
Undecided
Unassigned
Lucid
Medium
Stefan Bader
linux (Ubuntu)
Medium
Unassigned
Lucid
Medium
Unassigned

Bug Description

Binary package hint: linux-image-2.6.31-5-generic

When installing some of my new bl460c g6 blades with bnx2x 10GE interfaces, the step "detecting network hardware" takes a really long time. This is probably due to the installer scanning all interfaces several times, and these blades have 8 of them (each physical interface is split into 4 virtual one, each exposed as a nic by the kernel), together with a 50-second timeout in the driver.

50 seconds times 8 interfaces times 3-4 probing rounds makes for a total delay of 20-30 minutes, which is a bit painful.

The delay seems to be in "[bnx2x_wait_ramrod:6741(ethX)]timeout polling for state c0000 on IDX [1]". Attached is a sample screen of the installer messages.

I am booting a slightly modified (bnx2x dependencies added to the initrd) karmic netboot installer as of last week, will try the latest version as soon as #360966 gets fixed.

----------------------------------------------------------------------------

SRU Justification (iniramfs-tools):

[Impact]

The libcrc32c module depends in a non-detectable way on the crypto/crc32c module. If any driver (like the bnx2x) depends on libcrc32c there will be problems while loading the driver without the rootfs already attached. In the bnx2x case at least this delays the boot process by 3-4 minutes while the driver waits and retries the modprobe of libcrc32c.

[Fix]

This is a backport from Debian which is contained in newer releases. It adds a checking function which looks for libcrc32c being in the initramfs and if it is there manually adds crc32c.

[Testcase]

Before this change or if there is no driver required that pulls in libcrc32c
#> gunzip -c <initrd> | cpio -t | grep crc32
would show either nothing or libcrc32c alone. After the change it shows either none or both crc32 modules.

[Regression Potential]

Low. Even if the crc32c module would be added incorrectly, it would not be used. The package could fail to build (though test build was done in a current Lucid environment) or the generation of initramfs could fail to unrelated reasons.

Mattias Wadenstein (maswan) wrote :
Mattias Wadenstein (maswan) wrote :

Of interest could be that the current Squeeze installer kernel is significantly faster, by not having this 50 second timeout, but then that is a fairly ancient 2.6.26-2 with other issues.

Jeremy Foshee (jeremyfoshee) wrote :

Hi Mattias,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 415353

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Micheal Waltz (ecliptik) wrote :

I'm having the same exact issue with the latest Ubuntu 10.04 netboot environment.

Kernel is 2.6.32-16-generic downloaded from http://us.archive.ubuntu.com/ubuntu/dists/lucid/main/installer-amd64/current/, firmware is present in netboot environment but it keeps cycling through with the same messages in the attached screenshot and never pulls up an interface.

I will download the Alpha 3 ISO and test the upstream kernel as well and provide feedback.

Micheal Waltz (ecliptik) wrote :

Tested with 3/23/2010 Lucid netboot images and installation eventually does work, but it must cycle through all NICs (8 total) which takes approx 20-30 minutes. I'm attaching a screenshot of the messages displayed when the host finally runs through all the interfaces and brings up the "choose interface" menu in the debian-installer.

Micheal Waltz (ecliptik) wrote :

This is working marginally better with the 2.6.32-20 kernel, it cycles through eth0-eth4 relatively quickly with a 2 second delay between them. When it gets to eth5 though it pauses for 100 seconds between the remaining ones. I've attached a screenshot of the detection times.

Mattias Wadenstein (maswan) wrote :

Trying out today's maverick installer, 2.6.35-14-generic is much better, only about 5 seconds per interface. Still takes a bit over a minute to loop through all interfaces twice in the installer, but a minute and a half is much better than the 20-30 minutes of karmic and lucid.

Timo Aaltonen (tjaalton) wrote :

confirming on lucid, takes ~30s every boot.

summary: - karmic installation slow on "detecting network hardware" with bnx2x
+ karmic/lucid installation slow on "detecting network hardware" with
+ bnx2x
Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu development release http://cdimage.ubuntu.com/daily-live/current/ . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired
Timo Aaltonen (tjaalton) wrote :

Reopening, since it's still happening.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Expired → Confirmed
Changed in linux (Ubuntu Lucid):
importance: Undecided → Medium
status: New → Confirmed
Timo Aaltonen (tjaalton) wrote :

This is fixed in maverick, but still happening on the latest lucid image.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Lucid):
status: Confirmed → Triaged
Stefan Bader (smb) wrote :

I am a bit confused by the comments: The last comments about the long delays (20-30 min!) was made with 2.6.32-20, a pre-release kernel. There was a fix about lost MSI-X interrupts in 2.6.32-23.37 and two more in the current proposed kernel (2.6.32-26.46), but those seem to address a netpoll crash and a hang.
Timo's comment is that it takes 30s(!) on every boot, which would be the good case that Mattias mentions (5s per interface with 8 interfaces would be 40s which is still a while but likely related to sequential probing with timeouts. If I got the right define the wait_ramrod message tells that it waited for 5s for the state to be FP_HALTED).
So, anyway, it somehow looks to me like Timo is saying Lucid is at least as fixed as Maverick now. Am I misreading things?

Stefan Bader (smb) on 2010-11-10
Changed in linux (Ubuntu Lucid):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Mattias Wadenstein (maswan) wrote :

On current lucid (2.6.32-25.45) I still see:

[ 13.372492] type=1505 audit(1289993385.731:12): operation="profile_load" pid
=1048 name="/usr/sbin/tcpdump"
[ 38.500547] bnx2x: gave up waiting for init of module libcrc32c.
[ 38.500552] bnx2x: Unknown symbol crc32c
[ 68.455130] bnx2x: gave up waiting for init of module libcrc32c.
[ 68.455134] bnx2x: Unknown symbol crc32c
[ 98.409621] bnx2x: gave up waiting for init of module libcrc32c.
[ 98.409625] bnx2x: Unknown symbol crc32c
[ 128.364169] bnx2x: gave up waiting for init of module libcrc32c.
[ 128.364173] bnx2x: Unknown symbol crc32c
[ 158.320638] bnx2x: gave up waiting for init of module libcrc32c.
[ 158.320642] bnx2x: Unknown symbol crc32c
[ 188.273120] bnx2x: gave up waiting for init of module libcrc32c.
[ 188.273124] bnx2x: Unknown symbol crc32c
[ 218.227617] bnx2x: gave up waiting for init of module libcrc32c.
[ 218.227621] bnx2x: Unknown symbol crc32c
[ 218.238058] ipmi message handler version 39.2
[ 218.239593] hpilo 0000:01:04.2: PCI INT B -> GSI 22 (level, low) -> IRQ 22
[ 218.251950] Broadcom NetXtreme II 5771x 10Gigabit Ethernet Driver bnx2x 1.52.
1 (2009/08/12)
[ 218.252184] alloc irq_desc for 28 on node -1
[ 218.252187] alloc kstat_irqs on node -1
[ 218.252196] bnx2x 0000:02:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
[ 218.252205] bnx2x 0000:02:00.0: setting latency timer to 64

When booting, so it takes quite some time then to get networking up.

Stefan Bader (smb) wrote :

Hm, did I miss that before? This looks to me like the problem changed a bit and the main problem is now some module dependency... Will have a look there.

Stefan Bader (smb) wrote :

Matthias, that was from a netbook image, right? Roughly this sounds like libcrc32c is not in the bunch of modules the installer image pulls in.

Stefan Bader (smb) wrote :

Doh! I mean netboot not netbook.

Mattias Wadenstein (maswan) wrote :

It is from a real system originally installed from the netboot installer.

Also, it does eventually manage to load it, but it takes quite a while. I don't really know why that is.

I'll try to get a look at the current lucid netboot images too, but that's slightly more tricky since I've run out of uninstalled blades in my bladecenter now. But a couple of them aren't quite in production yet, so maybe I'll manage to boot one of those. :)

Stefan Bader (smb) wrote :

Ah, that at least changes some of the assumptions (and places to look for). Cause the error message clearly say the bnx2x module gets loaded, then the kernel tries to load libcrc32c as well (because then bnx2x module uses a function from that) but fails. Assuming you used the netboot installer, I looked in its initrd and saw that module. But maybe there is something wrong on the installed system. Could you try a:

gunzip -c /boot/initrd.img-$(uname -r)|cpio -t|egrep 'bnx2x|libcrc32'

I am wondering whether the initramfs contains only the bnx2x module but not the libcrc32c one. And whether the network driver tries to initialize before the real root fs is mounted. But if both modules are missing, that would be a false lead. At least the boot seems to serialize on that module being loadable. Just not sure yet why.

Mattias Wadenstein (maswan) wrote :

maswan@bobo:~$ gunzip -c /boot/initrd.img-$(uname -r)|cpio -t|egrep 'bnx2x|libcrc32'
50910 blocks
maswan@bobo:~$

Any other ideas?

The filesystems do get mounted before the wait, not just root but also:

[ 11.470113] EXT4-fs (dm-3): mounted filesystem with ordered data mode
[ 12.150636] EXT4-fs (dm-4): mounted filesystem with ordered data mode
[ 12.860956] EXT4-fs (dm-2): mounted filesystem with ordered data mode

Stefan Bader (smb) wrote :

Not really an idea, but could you get me a full dmesg when booting with "debug initcall_debug" instead of the "quiet" as kernel boot options?

Stefan Bader (smb) on 2010-12-14
Changed in linux (Ubuntu Lucid):
status: Triaged → Incomplete
Timo Aaltonen (tjaalton) wrote :

ok, here you go!

Stefan Bader (smb) wrote :

Hm, weird. I believe the interesting part of the above dmesg is this:

...
[ 14.824188] calling libcrc32c_mod_init+0x0/0x2c [libcrc32c] @ 804
[ 14.831660] calling init_ipmi_si+0x0/0x290 [ipmi_si] @ 1190
[ 14.831689] IPMI System Interface driver.
[ 14.831694] ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 0xca2, slave address 0x20, irq 0
...
[ 44.811059] bnx2x: gave up waiting for init of module libcrc32c.
[ 44.811064] bnx2x: Unknown symbol crc32c
[ 74.734389] bnx2x: gave up waiting for init of module libcrc32c.
[ 74.734393] bnx2x: Unknown symbol crc32c
[ 104.659566] bnx2x: gave up waiting for init of module libcrc32c.
[ 104.659571] bnx2x: Unknown symbol crc32c
[ 104.660725] calling crc32c_mod_init+0x0/0x12 [crc32c] @ 1295
[ 104.660881] initcall crc32c_mod_init+0x0/0x12 [crc32c] returned 0 after 145 usecs
[ 104.666435] calling ipsec_pfkey_init+0x0/0x85 [af_key] @ 1696
[ 104.666439] NET: Registered protocol family 15
[ 104.666454] initcall ipsec_pfkey_init+0x0/0x85 [af_key] returned 0 after 12 usecs
[ 104.667996] initcall libcrc32c_mod_init+0x0/0x2c [libcrc32c] returned 0 after 87957379 usecs
[ 104.672636] calling bnx2x_init+0x0/0x98 [bnx2x] @ 804
[ 104.672639] Broadcom NetXtreme II 5771x 10Gigabit Ethernet Driver bnx2x 1.52.1 (2009/08/12)
[ 104.672900] alloc irq_desc for 28 on node -1
...

This seems strange as ipmi seems to have successfully loaded libcrc32c *before* bnx2x fail three times with 30s timout between trials. And then finally it gets the library loaded.

Stefan Bader (smb) wrote :

Seems I misread the data. The first statement about lincrc32c actually says it is loaded and its init function is being called. But that does not complete within 30s, so it is retried. Discussing this with Andy, there seems to be a patch that fixed calling request_module from within a request for a module. I will try to backport this and provide a test kernel for it.

Stefan Bader (smb) wrote :

Alright, there are some 64bit kernels at http://people.canonical.com/~smb/lp415353/. I quick-tested the generic version and it did not seem to blow things up. Though I cannot verify whether there is any change with the bnx2x loading.

Stefan Bader (smb) wrote :
Changed in linux (Ubuntu Lucid):
status: Incomplete → Triaged
status: Triaged → Incomplete
Micheal Waltz (ecliptik) wrote :

Stefan, thank you for the test kernels and I'll try them out on some of our G6 blades here. How about netinstall kernels? We generally see this problem when imaging new hardware and it grows from a 20 minute install to a 2 hour one.

Stefan Bader (smb) wrote :

There are netboot images from the .1 release at http://cdimages.ubuntu.com/netboot/lucid/ and my last feedback was that this was already reducing the delay from hours to minutes. More or less the patch tested now will only affect the module load race which accounts to 1.5minutes delay. The much larger delay caused by something else, should be solved (again, that is the feedback as I seem to remember).

Stefan Bader (smb) wrote :

Err, sorry. Probably wrong link. I thought there was one. Need to find it.

Stefan Bader (smb) wrote :

Maybe the link _is_ correct. Though not all files seem to have been updated. Though at least the pxelinux.0 should be from the last point release.

Timo Aaltonen (tjaalton) wrote :

here's the dmesg output from the test kernel, which means that it doesn't timeout anymore, but there are some errors that probably need other changes.

Micheal Waltz (ecliptik) wrote :

Tested the netboot kernel and initrd.gz from http://archive.ubuntu.com/ubuntu/dists/lucid-proposed/main/installer-amd64/current/images/netboot/ubuntu-installer/amd64/ but it still does the same slow initialization. Attaching dmesg.

Stefan Bader (smb) wrote :

@mwaltz, the delays you see are those in the range of 90 to 100s once at boot. Which I am trying to resolve currently. The first test kernel will get rid of the delay but as it looks in dmesg I need to find and backport changes that prevent having the same module initialized multiple times.

Stefan Bader (smb) wrote :

So the main issue with the patch is actually not the referencing but plainly the locking. Problem is that there are a number of other patches upstream that modified the way one mutex is held. Basically rewriting a lot of code to avoid holding it over a long time. So the patch to fix the "gave up waiting" issue grabs the lock and releases it in one function (where I missed the grad in the backport). But even with that added, nothing would be gained as the mutex is held the whole time and not expected to be released while waiting.

I would really like to avoid pulling in even more patches from upstream. So v2 tries what seemed to have been one of the first approaches but Linus did not like. That is , to release and regrab the mutex before and after waiting. Maybe (but hopefully not) that could open a small window for another race and that is the reason it was frowned upon.

Trying the v2 kernel myself, I did not see something failing (but neither did I with the last one). It probably requires some hardware triggering a module load which in turn requests another module. (Kernels again at http://people.canonical.com/~smb/lp415353/)

Stephan Ruegamer (sadig) wrote :

@Stefan:

the problem here is, that somehow it's now fixed (AFAIK) in the netboot images, but the "real" fix should be in the initramfs-tools
See:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602254

And as I'm writing this, the updated initramfs-tools package went into debian unstable:

Source: initramfs-tools
Binary: initramfs-tools
Architecture: source all
Version: 0.98.8
Distribution: unstable
Urgency: high
Maintainer: Debian kernel team <email address hidden>
Changed-By: maximilian attems <email address hidden>
Description:
 initramfs-tools - tools for generating an initramfs
Closes: 608339 608538 608865
Changes:
 initramfs-tools (0.98.8) unstable; urgency=high
 .
   [ maximilian attems ]
   * [67c1d32] initramfs-tools: MODULES=dep fix for ubifs /
     (Closes: #608339)
   * [ca4d08a] mkinitramfs: Fix noexec /tmp detection for long device
     names. (Closes: #608865) Thanks to Ian Campbell <email address hidden>
 .
   [ Ben Hutchings ]
   * [78d9e04] initramfs-tools: Handle hidden dependency of libcrc32c on
     crc32c. (Closes: #608538)

Regards,

\sh

Stefan Bader (smb) wrote :

Must admit having not looked for a while and being a bit confused about were this stands for Lucid (10.04). The initramfs-tools version would be older there but maybe got fixed anyway. Are the long delays still present?

Michael Heldebrant (mheldebr) wrote :

This also affects my HP blade server on boot with the Lucid 10.04 2.6.32-38 kernel.

tags: added: kernel-da-key
Stefan Bader (smb) wrote :

@Michael, which type of delay do you see exactly? Is the server booted by using netboot images or from a local device? Is it possible to try adding libcrc32 to /etc/initramfs-tools/modules and regenerate the initramfs used? Thanks.

Michael Heldebrant (mheldebr) wrote :

I see a delay of about 3-4 minutes on boot as the blade tries the 8 ethernet interfaces and times out on the missing libcrc32 module on all of them.

The server is booted from a local disk. I can attempt to add libcrc32 to the module list next scheduled reboot.

Stefan Bader (smb) wrote :

Ok, thanks. Don't forget to run update-initramfs -u after adding the module and before reboot. Sorry if that was already clear. Just want to be sure in case it was not.

Michael Heldebrant (mheldebr) wrote :

I added libcrc32a to /etc/initramfs-tools/modules (module is libcrc32a and not libcrc32) and this time during boot I saw an unlogged to the boot or syslog message about a missing symbol in that module and then the same long pause behavior for trying to load the network driver on boot.

Michael Heldebrant (mheldebr) wrote :

Make that libcrc32c for the module name.

Stefan Bader (smb) wrote :

Seems at least this I was able to reproduce in a VM. Though libcrc32c does not show any dependencies, both crc32c and libcrc32c seem to be required in /etc/initramfs-tools/modules. With both in, there is no message and both modules are loaded after boot.

Michael Heldebrant (mheldebr) wrote :

I'll try adding the crc32c module in as well for the next reboot. It won't be until May that I can test this on the server during scheduled maintenance.

Michael Heldebrant (mheldebr) wrote :

I can confirm that adding crc32c as a manual entry in the /etc/initramfs-tools/modules and rebuilding the initramfs with update-initramfs -u does work to fix the problem on Ubuntu 10.04.

Stefan Bader (smb) wrote :

Thank you for confirming Michael.

Given the facts that the issue is not fatal (quite annoying I agree), the proper fixes that are upstream were too invasive, the backport that tries to address it is not too well tested and may introduce other problems, and (last but not least) it is possible to manually work around the issue, I would rather not touch the module loading code.

Not sure whether picking up the change from Debian for initramfs-tools would be acceptable. But usually the older a release the more critical an issue has to be to make changes. So I guess this may be a case where some sort of documentation of the problem and work-around would be preferred option.

Stefan Bader (smb) wrote :

For documentation this was the patch added to initramfs-tools in Debian.

Michael Heldebrant (mheldebr) wrote :

Documentation of the issue seems like a reasonable way forwards, assuming the crc32c module is also pre-installed into the installer initrd and left as a choice for the sysadmin to add it manually.

The first reboot post install would still be affected until adding the module into the the documentation steps could be performed though.

Is there some way to make udev or some other process that is aware of the presence of the affected hardware notify the admin to include the module?

On the other hand this patch really doesn't seem too intrusive as it will only install the module when it is actually needed, this would also create the initrd properly for the first reboot post install.

Stefan Bader (smb) wrote :

AFAICT the kernel-side problem (which caused the even longer delays) is already fixed in 10.04/Lucid. The remaining problem is the shorter (3-4m) delay caused by the crc32c module not being placed into initramfs.

Changed in initramfs-tools (Ubuntu):
status: New → Fix Released
Changed in initramfs-tools (Ubuntu Lucid):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Lucid):
assignee: Stefan Bader (stefan-bader-canonical) → nobody
status: Incomplete → Fix Released
Stefan Bader (smb) on 2012-06-01
description: updated
description: updated
Stefan Bader (smb) wrote :

Additionally adding the debdiff of the backported changes to initramfs-tools (not sure the bzr branch is cortect).

Stéphane Graber (stgraber) wrote :

Uploaded to lucid-proposed. Thanks

Hello Mattias, or anyone else affected,

Accepted initramfs-tools into lucid-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/initramfs-tools/0.92bubuntu78.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in initramfs-tools (Ubuntu Lucid):
status: In Progress → Fix Committed
tags: added: verification-needed

The fix for the this bug has been awaiting testing feeback in the -proposed repository for more than 90 days. Please test this fix and update the bug appropriately with the results. In the event that the fix for this bug is still not verified 15 days from now the package will be removed from the -proposed repository.

tags: added: removal-candidate
Brian Murray (brian-murray) wrote :

The version of initramfs-tools in lucid-proposed has been removed as this bug report was not verified in a timely fashion.

tags: removed: verification-needed
tags: removed: removal-candidate
Changed in initramfs-tools (Ubuntu Lucid):
status: Fix Committed → Triaged
Rolf Leggewie (r0lf) wrote :

lucid has seen the end of its life and is no longer receiving any updates. Marking the lucid task for this ticket as "Won't Fix".

Changed in initramfs-tools (Ubuntu Lucid):
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.