Mellanox ethernet driver causing page allocation failures

Bug #1158031 reported by Joshua Kugler
Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

I tried ubuntu-bug but got:

# ubuntu-bug linux
-bash: ubuntu-bug: command not found

We have recently begun seeing stack dumps like the following in our syslogs:

 kvm: page allocation failure. order:2, mode:0x4020
 Pid: 2135, comm: kvm Not tainted 2.6.32-37-server #81-Ubuntu
 Call Trace:
  <IRQ> [<ffffffff810fc9f7>] __alloc_pages_slowpath+0x4a7/0x590
  [<ffffffffa0178a20>] ? br_nf_pre_routing_finish+0x0/0x350 [bridge]
  [<ffffffff810fcc59>] __alloc_pages_nodemask+0x179/0x180
  [<ffffffff8112fe07>] alloc_pages_current+0x87/0xd0
  [<ffffffffa010d9ff>] mlx4_en_alloc_frag+0x19f/0x1f0 [mlx4_en]
  [<ffffffff8146dc0d>] ? dev_alloc_skb+0x1d/0x40
  [<ffffffffa010eb7e>] mlx4_en_complete_rx_desc+0x10e/0x1b0 [mlx4_en]
  [<ffffffffa010ee13>] mlx4_en_rx_skb+0x1f3/0x430 [mlx4_en]
  [<ffffffffa010f245>] mlx4_en_process_rx_cq+0x1f5/0x4a0 [mlx4_en]
  [<ffffffff813be69a>] ? ata_scsi_qc_complete+0x6a/0x2b0
  [<ffffffff813b6ddd>] ? __ata_qc_complete+0x8d/0x140
  [<ffffffff81081907>] ? insert_work+0x77/0xc0
  [<ffffffff813b896f>] ? ata_qc_complete+0x9f/0x230
  [<ffffffffa010f52f>] mlx4_en_poll_rx_cq+0x3f/0x80 [mlx4_en]
  [<ffffffffa01334a2>] ? mlx4_cq_completion+0x42/0x80 [mlx4_core]
  [<ffffffff814773ff>] net_rx_action+0x10f/0x250
  [<ffffffff8106f287>] __do_softirq+0xb7/0x1f0
  [<ffffffff810c6600>] ? handle_IRQ_event+0x60/0x170
  [<ffffffff810142ac>] call_softirq+0x1c/0x30
  [<ffffffff81015c75>] do_softirq+0x65/0xa0
  [<ffffffff8106f085>] irq_exit+0x85/0x90
  [<ffffffff81560955>] do_IRQ+0x75/0xf0
  [<ffffffff81013ad3>] ret_from_intr+0x0/0x11
  <EOI> [<ffffffffa029f9e5>] ? vcpu_enter_guest+0x1f5/0x4e0 [kvm]
  [<ffffffffa029f9b6>] ? vcpu_enter_guest+0x1c6/0x4e0 [kvm]
  [<ffffffffa029fd4c>] ? __vcpu_run+0x7c/0x350 [kvm]
  [<ffffffffa02a62bd>] ? kvm_arch_vcpu_ioctl_run+0x8d/0x1d0 [kvm]
  [<ffffffffa02952e3>] ? kvm_vcpu_ioctl+0x473/0x5c0 [kvm]
  [<ffffffff8155b71e>] ? _spin_lock+0xe/0x20
  [<ffffffff81097982>] ? futex_wake+0x112/0x130
  [<ffffffff81156ec2>] ? vfs_ioctl+0x22/0xa0
  [<ffffffff81157061>] ? do_vfs_ioctl+0x81/0x410
  [<ffffffff81099fab>] ? sys_futex+0x7b/0x170
  [<ffffffff81157471>] ? sys_ioctl+0x81/0xa0
  [<ffffffff81013172>] ? system_call_fastpath+0x16/0x1b

(there's more, I'll attach a full dump)
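
For readers unfamiliar with the first line of the trace: `order:2` means the kernel was asked for 2^2 = 4 physically contiguous pages in a single allocation. Below is a minimal sketch that pulls those fields out of the message. It assumes the usual 4 KiB page size on x86-64 (the log does not state it), and the function name is illustrative only.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual size on x86-64


def parse_alloc_failure(line):
    """Extract the fields of a 'page allocation failure' syslog line."""
    m = re.search(
        r"(\S+): page allocation failure\. order:(\d+), mode:(0x[0-9a-fA-F]+)",
        line,
    )
    if m is None:
        return None
    order = int(m.group(2))
    return {
        "comm": m.group(1),                    # process that was running
        "order": order,                        # log2 of the page count
        "pages": 1 << order,                   # contiguous pages requested
        "bytes": (1 << order) * PAGE_SIZE,     # size under the assumption above
        "mode": int(m.group(3), 16),           # raw GFP flag bits
    }


info = parse_alloc_failure(" kvm: page allocation failure. order:2, mode:0x4020")
print(info)
```

Under that assumption the driver was asking for a 16 KiB contiguous block, which is harder to satisfy than a single page once memory is fragmented.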

We have tried increasing vm.min_free_kbytes, but that has not resolved the problem.
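
One possible way to check whether fragmentation, rather than the total amount of free memory, is the issue is to count the free blocks of order 2 or higher in each zone. The sketch below assumes the standard `/proc/buddyinfo` layout (one column of free-block counts per order, starting at order 0). The helper name and the sample line are made up for illustration and are not taken from the affected machine.

```python
def free_blocks_at_or_above(buddyinfo_line, min_order):
    """Return (node, zone, count) of free blocks of at least min_order."""
    fields = buddyinfo_line.split()
    # Layout: "Node", "<n>,", "zone", "<name>", then one count per order 0, 1, 2, ...
    node = int(fields[1].rstrip(","))
    zone = fields[3]
    counts = [int(x) for x in fields[4:]]
    return node, zone, sum(counts[min_order:])


# Hypothetical sample line for illustration only.
sample = "Node 0, zone   Normal   4096   310    12     3     0     0     0     0     0     0     0"
result = free_blocks_at_or_above(sample, 2)
print(result)

# On a live system, something like this would report every zone:
# with open("/proc/buddyinfo") as f:
#     for line in f:
#         print(free_blocks_at_or_above(line, 2))
```

If the count for order 2 and above is near zero while lower orders are plentiful, the failures are more likely a fragmentation problem than a shortage of free memory.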

Searching around, I found this: http://code.metager.de/source/history/linux/stable/drivers/net/ethernet/mellanox/

Search for the second and third occurrence of 'page allocation.' The commits mentioned there seem to point directly at the bugs we've been hitting.

Both hashes:

4cce66cdd14aa5006a011505865d932adb49f600
117980c4c994b6fe58e873fe803c9bcdcb4337a3

are in the mainline repository (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/)

I would like to request that these bug fixes (and any other bug fixes present in the tree) be back-ported to Ubuntu 10.04. Or maybe just the entire current driver?

# cat /proc/version_signature
Ubuntu 2.6.32-37.81-server 2.6.32.49+drm33.21

# lsb_release -rd
Description: Ubuntu 10.04.4 LTS
Release: 10.04

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1158031

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: lucid
Joshua Kugler (jkugler) wrote :

# apport-collect 1158031
-bash: apport-collect: command not found

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joshua Kugler (jkugler) wrote :

I did install apport and tried to run apport-collect. It ran, then started the lynx browser, bounced me through several redirects, and then put me in an OpenID login loop (it kept going back to the screen "Launchpad has requested some personal information. Please choose what you would like to share:"). If there is a way to collect its log files manually, please let me know.

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.9 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc3-raring/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Joshua Kugler (jkugler) wrote :

We tried to use the new kernel, and it failed. It booted, but IPv6 was enabled, which appeared to make Jenkins listen only on the IPv6 addresses. Once I disabled IPv6, I realized eth3 (our internal Mellanox 10GbE card, ironically enough) was not visible to the system.

It then became apparent that something had activated some firewall rules that weren't active before (port 8080 wasn't accessible on the internal network). Attempting to flush the iptables rules resulted in a complete lockup. Bryon rebooted, I removed the upgraded kernel, and we reverted to the default 10.04 kernel. The firewall rules that were blocking access to port 8080 went away after the revert. I wonder if something changed the default INPUT policy to DROP?

I realized later that I did not try to modprobe the module for the Mellanox card, but that did not occur to me, since we had never had to do that manually before.

We really cannot take this system down again. Downtime on this system costs us several hundred dollars per hour.

Unless you have other ideas, we will have to find another way to troubleshoot this bug.

tags: added: kernel-unable-to-test-upstream
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired