Lost network in KVM VM / virtio_net page allocation failure

Bug #579276 reported by lhotari
This bug affects 25 people
Affects         Status        Importance  Assigned to     Milestone
linux (Ubuntu)  Fix Released  Medium      Andy Whitcroft
Lucid           Fix Released  Medium      Stefan Bader
Maverick        Fix Released  Medium      Stefan Bader

Bug Description

SRU Justification:

Impact: Under heavy network I/O load the virtio-net driver crashes, making the VM guest unusable.

Testcase: I left a current Lucid VM running two concurrent "scp -r" copies of more than 200 GB from a read-only NFS source to a physical remote host overnight. The VM quickly started emitting "page allocation failure" messages in the system log. The next morning I could still ping the VM but could not establish an SSH connection.

Fix: This patch from Bruce Rogers at Novell

 * [PATCH] KVM: add schedule check to napi_enable call
    - http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660

Binary package hint: 2.6.32-21-server

In a VM running Lucid I'm seeing bugs similar to those reported in these Red Hat bug reports:
https://bugzilla.redhat.com/show_bug.cgi?id=520119
https://bugzilla.redhat.com/show_bug.cgi?id=554078

The network connection drops in a KVM VM under high load.
Running "ifdown eth0 ; ifup eth0" restores the connection.
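
Until a fixed kernel is installed, the interface bounce above could be automated as a stopgap. A minimal sketch, assuming the guest has some peer that always answers ping; the function names and the 192.0.2.1 address in the usage note are hypothetical, and this only masks the symptom rather than fixing the driver:

```python
import subprocess

def peer_reachable(peer, run=subprocess.call):
    """Send one ICMP echo with a 2 s timeout; True if ping exits with 0."""
    return run(["ping", "-c", "1", "-W", "2", peer]) == 0

def bounce_interface(iface, run=subprocess.call):
    """Same effect as typing `ifdown <iface> ; ifup <iface>` by hand."""
    run(["ifdown", iface])
    run(["ifup", iface])

def check_once(peer, iface, run=subprocess.call):
    """Bounce iface if peer does not answer; return True if a bounce happened."""
    if peer_reachable(peer, run):
        return False
    bounce_interface(iface, run)
    return True
```

Called as root from cron, for example `check_once("192.0.2.1", "eth0")` once a minute, this would restore the link without manual intervention. The `run` parameter exists only so the logic can be exercised without touching the network.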

This is the dmesg error:
[714069.829649] swapper: page allocation failure. order:0, mode:0x20
[714069.829653] Pid: 0, comm: swapper Not tainted 2.6.32-21-server #32-Ubuntu
[714069.829655] Call Trace:
[714069.829657] <IRQ> [<ffffffff810f97de>] __alloc_pages_slowpath+0x56e/0x580
[714069.829674] [<ffffffff810f9961>] __alloc_pages_nodemask+0x171/0x180
[714069.829682] [<ffffffff8112c597>] alloc_pages_current+0x87/0xd0
[714069.829687] [<ffffffff813d6f72>] try_fill_recv+0x182/0x200
[714069.829690] [<ffffffff813d719d>] virtnet_poll+0x10d/0x160
[714069.829700] [<ffffffff810397a9>] ? default_spin_lock_flags+0x9/0x10
[714069.829708] [<ffffffff81470a7f>] net_rx_action+0x10f/0x250
[714069.829713] [<ffffffff8106e257>] __do_softirq+0xb7/0x1e0
[714069.829717] [<ffffffff810c4880>] ? handle_IRQ_event+0x60/0x170
[714069.829722] [<ffffffff810142ec>] call_softirq+0x1c/0x30
[714069.829725] [<ffffffff81015cb5>] do_softirq+0x65/0xa0
[714069.829727] [<ffffffff8106e0f5>] irq_exit+0x85/0x90
[714069.829733] [<ffffffff8155c615>] do_IRQ+0x75/0xf0
[714069.829736] [<ffffffff81013b13>] ret_from_intr+0x0/0x11
[714069.829737] <EOI> [<ffffffff81038acb>] ? native_safe_halt+0xb/0x10
[714069.829746] [<ffffffff8101b68d>] ? default_idle+0x3d/0x90
[714069.829753] [<ffffffff81011e63>] ? cpu_idle+0xb3/0x110
[714069.829757] [<ffffffff8153f47b>] ? rest_init+0x6b/0x80
[714069.829763] [<ffffffff8187adcc>] ? start_kernel+0x368/0x371
[714069.829766] [<ffffffff8187a33a>] ? x86_64_start_reservations+0x125/0x129
[714069.829768] [<ffffffff8187a438>] ? x86_64_start_kernel+0xfa/0x109
[714069.829770] Mem-Info:
[714069.829772] Node 0 DMA per-cpu:
[714069.829775] CPU 0: hi: 0, btch: 1 usd: 0
[714069.829776] Node 0 DMA32 per-cpu:
[714069.829778] CPU 0: hi: 186, btch: 31 usd: 196
[714069.829783] active_anon:109561 inactive_anon:110789 isolated_anon:0
[714069.829784] active_file:7041 inactive_file:13781 isolated_file:0
[714069.829785] unevictable:0 dirty:8681 writeback:0 unstable:0
[714069.829786] free:1367 slab_reclaimable:2798 slab_unreclaimable:1783
[714069.829787] mapped:2113 shmem:83 pagetables:1148 bounce:0
[714069.829789] Node 0 DMA free:4000kB min:60kB low:72kB high:88kB active_anon:5508kB inactive_anon:5668kB active_file:396kB inactive_file:292kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:28kB shmem:0kB slab_reclaimable:8kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:20kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[714069.829799] lowmem_reserve[]: 0 994 994 994
[714069.829802] Node 0 DMA32 free:1468kB min:4000kB low:5000kB high:6000kB active_anon:432736kB inactive_anon:437488kB active_file:27768kB inactive_file:54832kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:0kB dirty:34724kB writeback:0kB mapped:8424kB shmem:332kB slab_reclaimable:11184kB slab_unreclaimable:7124kB kernel_stack:1408kB pagetables:4572kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[714069.829811] lowmem_reserve[]: 0 0 0 0
[714069.829814] Node 0 DMA: 0*4kB 2*8kB 5*16kB 6*32kB 4*64kB 3*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 4000kB
[714069.829823] Node 0 DMA32: 69*4kB 1*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1468kB
[714069.829830] 22928 total pagecache pages
[714069.829832] 2031 pages in swap cache
[714069.829834] Swap cache stats: add 74010, delete 71979, find 90506/93667
[714069.829836] Free swap = 804372kB
[714069.829837] Total swap = 897016kB
[714069.840685] 262139 pages RAM
[714069.840689] 6216 pages reserved
[714069.840690] 22948 pages shared
[714069.840691] 234443 pages non-shared
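
A note on reading the dump above: the failing request is a single page (order:0), and mode:0x20 appears to correspond to GFP_ATOMIC in 2.6.32 (an assumption worth checking against include/linux/gfp.h), i.e. an allocation from interrupt context that cannot wait for reclaim. The DMA32 zone reports free:1468kB against min:4000kB, so it is already below its minimum watermark. A small sketch that redoes this arithmetic from two lines of the log; the helper names are illustrative only:

```python
import re

# Two lines copied from the dmesg dump (timestamps dropped, zone line shortened).
ZONE_LINE = "Node 0 DMA32 free:1468kB min:4000kB low:5000kB high:6000kB"
BUDDY_LINE = ("Node 0 DMA32: 69*4kB 1*8kB 0*16kB 1*32kB 0*64kB 1*128kB "
              "0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1468kB")

def watermarks(line):
    """Return {'free': kB, 'min': kB, ...} from a per-zone Mem-Info line."""
    return {key: int(value) for key, value in re.findall(r"(\w+):(\d+)kB", line)}

def buddy_total_kb(line):
    """Sum count*size over the buddy free lists (the part before '=')."""
    free_lists = line.split("=")[0]
    return sum(int(count) * int(size)
               for count, size in re.findall(r"(\d+)\*(\d+)kB", free_lists))

wm = watermarks(ZONE_LINE)
# Prints the recomputed free total, the reported free value, and whether
# the zone is below its minimum watermark.
print(buddy_total_kb(BUDDY_LINE), wm["free"], wm["free"] < wm["min"])
```

The recomputed total matches the reported free value, which confirms the dump is internally consistent and that the zone really was short of free pages when the receive refill ran.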

Version details:
linux-image-2.6.32-21-server
2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
Ubuntu 2.6.32-21.32-server 2.6.32.11+drm33.2
---
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
CurrentDmesg:
 [ 4.972735] JBD: barrier-based sync failed on vda1-8 - disabling barriers
 [ 14.180150] eth0: no IPv6 routers present
DistroRelease: Ubuntu 10.04
InstallationMedia: Ubuntu-Server 10.04 "Lucid Lynx" - Beta amd64 (20100406.1)
Lsusb: Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: Bochs Bochs
Package: linux (not installed)
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-22-server root=UUID=393ab013-153f-4c5a-ad67-1bd137363e60 ro quiet
ProcEnviron:
 PATH=(custom, no user)
 LANG=fi_FI.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.32-22.33-server 2.6.32.11+drm33.2
Regression: Yes
Reproducible: Yes
Tags: lucid kconfig regression-release needs-upstream-testing
Uname: Linux 2.6.32-22-server x86_64
UserGroups: adm admin cdrom dialout lpadmin lpadmin plugdev plugdev sambashare scanner
dmi.bios.date: 01/01/2007
dmi.bios.vendor: Bochs
dmi.bios.version: Bochs
dmi.chassis.type: 1
dmi.chassis.vendor: Bochs
dmi.modalias: dmi:bvnBochs:bvrBochs:bd01/01/2007:svnBochs:pnBochs:pvr:cvnBochs:ct1:cvr:
dmi.product.name: Bochs
dmi.sys.vendor: Bochs

lhotari (lartsa) wrote :
Jeremy Foshee (jeremyfoshee) wrote :

Hi lhotari,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/releases/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 579276

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
lhotari (lartsa) wrote : BootDmesg.txt

apport information

tags: added: apport-collected
description: updated
lhotari (lartsa) wrote : Lspci.txt

apport information

lhotari (lartsa) wrote : ProcCpuinfo.txt

apport information

lhotari (lartsa) wrote : ProcInterrupts.txt

apport information

lhotari (lartsa) wrote : ProcModules.txt

apport information

lhotari (lartsa) wrote : UdevDb.txt

apport information

lhotari (lartsa) wrote : UdevLog.txt

apport information

Sergey Svishchev (svs) wrote :

Quoting Debian bug report:

"> It seems as if Redhat encountered and fixed this bug back in January:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=554078

The original upstream bug fix was:

commit 3161e453e496eb5643faad30fff5a5ab183da0fe
Author: Rusty Russell <email address hidden>
Date: Wed Aug 26 12:22:32 2009 -0700

    virtio: net refill on out-of-memory

which was included in Linux 2.6.31.

However, another fix was needed on top of that:

commit 39d321577405e8e269fd238b278aaf2425fa788a
Author: Herbert Xu <email address hidden>
Date: Mon Jan 25 15:51:01 2010 -0800

    virtio_net: Make delayed refill more reliable"

The latter commit didn't make it into 2.6.32.

Changed in linux (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → Medium
tags: added: cherry-pick kernel-net
removed: needs-kernel-logs needs-upstream-testing
Robert C. Sheets (rcsheets) wrote :

So is installation of a recent upstream kernel thought to be a workaround for this?

Robert C. Sheets (rcsheets) wrote :

Regarding my last comment: I tried it, using the 2010-05-31-lucid mainline kernel build. The issue seemed to take longer to come about, but it still happened eventually.

RoyK (roysk) wrote :

I can confirm this on a Lucid VM running in KVM on a Lucid host. It mainly happens when the VM is copying data to or from an NFS share (guest as the NFS client, host as the NFS server). IMHO this should be prioritised higher than 'medium', since a few minutes of full network use (rsync) is enough to kill the server.

roy

RoyK (roysk) wrote :

Please note that this bug eventually kills my VM. It loses network, logs NFS timeouts and won't let anyone log in on the console or do anything useful. A reboot of the guest fixes this, but since the error occurs after such a short time, this is not even a workaround.

Reversing the client/server roles, with the guest as the server and the host as the client, made the host lose its network connection. Attaching with a serial console worked for a little while, until I ran sync, meaning to follow it with a reboot -f, but sync hung and I can't ctrl+z/bg the job. I tried to detach and reattach, but now I can't reach the box.

This should be given rather high priority. The operations performed (rsync over nfs) should be considered normal use, nothing fancy.

roy

tags: added: kernel-needs-review
William King (quentusrex) wrote :

I can confirm this same issue.

Sergey Svishchev (svs) wrote :

Another variant of this bug occurs even with both patches applied; see recent comments in Debian bug #592187. The reporter says:

"With 2.6.35-1~experimental.1 and virtio in the guest, 2.6.32-18 in the
host I cannot (yet) trigger the bug."

CvB (cvb-kruemel) wrote :

I think I'm seeing this or a similar bug on my Lucid VM (on a Lucid host), too (2.6.32-24-server kernel).

Is there a workaround? Does it help, e.g., to switch from virtio to some other network device?

Rupert Hair (rupert-hair) wrote :

Switching to the 'e1000' emulation seems to have worked for us, but it's far from a nice solution.

Rupert

Peter Lieven (plieven) wrote :

I can confirm this bug in Ubuntu Lucid LTS 10.04.1 64-bit Server.

The bug does not seem to exist in an older kernel from openSUSE 11.1, which I use under heavy network I/O load.
Version is: Linux 2.6.27.48-0.2-default

So the bug seems to have been introduced somewhere in between.

Joe Kislo (joe-k12s) wrote :

We use VMware ESXi, and we were crippled by this bug (we had to rebuild several systems back to Karmic because they were so unusable). Somewhere in the past month a kernel upgrade seems to have resolved this issue for us. Before that, we could reproduce the problem easily:

Remote System:
cat /dev/zero | nc -w 4 -l -p 5000

VMware Lucid system:
nc othersystem 5000 > /path/on/nfs

That used to reproduce the issue in a few seconds. Now it runs for 30+ minutes without any issues, and our "real world" use case appears fixed too.

Joe Kislo (joe-k12s) wrote :

No, I am completely wrong. It still happens all the time. There was a period when it seemed stable. Here is my kernel trace, FWIW, on Lucid linux-image-2.6.32-25-server 2.6.32-25.45:
[747393.713739] swapper: page allocation failure. order:0, mode:0x4020
[747393.713743] Pid: 0, comm: swapper Not tainted 2.6.32-25-server #45-Ubuntu
[747393.713745] Call Trace:
[747393.713746] <IRQ> [<ffffffff810f9a2e>] __alloc_pages_slowpath+0x56e/0x580
[747393.713756] [<ffffffff810f9bb1>] __alloc_pages_nodemask+0x171/0x180
[747393.713760] [<ffffffff8112cae7>] alloc_pages_current+0x87/0xd0
[747393.713763] [<ffffffff81132b27>] new_slab+0x2f7/0x310
[747393.713766] [<ffffffff811353d1>] __slab_alloc+0x201/0x2d0
[747393.713769] [<ffffffff81468de6>] ? __netdev_alloc_skb+0x36/0x60
[747393.713772] [<ffffffff811363af>] __kmalloc_node_track_caller+0xaf/0x160
[747393.713774] [<ffffffff81468de6>] ? __netdev_alloc_skb+0x36/0x60
[747393.713776] [<ffffffff81468aa0>] __alloc_skb+0x80/0x190
[747393.713778] [<ffffffff81468de6>] __netdev_alloc_skb+0x36/0x60
[747393.713788] [<ffffffffa00063c5>] e1000_alloc_rx_buffers+0x1c5/0x420 [e1000]
[747393.713792] [<ffffffffa0004bae>] e1000_clean_rx_irq+0x3fe/0x530 [e1000]
[747393.713795] [<ffffffff8106d2d8>] ? irq_exit+0x48/0x90
[747393.713799] [<ffffffffa00032c1>] e1000_clean+0x51/0x230 [e1000]
[747393.713802] [<ffffffff8147300f>] net_rx_action+0x10f/0x250
[747393.713806] [<ffffffff81019103>] ? native_sched_clock+0x13/0x60
[747393.713808] [<ffffffff8106d477>] __do_softirq+0xb7/0x1e0
[747393.713811] [<ffffffff81030b22>] ? ack_apic_level+0x82/0x1f0
[747393.713813] [<ffffffff810132ec>] call_softirq+0x1c/0x30
[747393.713815] [<ffffffff81014cb5>] do_softirq+0x65/0xa0
[747393.713817] [<ffffffff8106d315>] irq_exit+0x85/0x90
[747393.713820] [<ffffffff8155f2a5>] do_IRQ+0x75/0xf0
[747393.713822] [<ffffffff81012b13>] ret_from_intr+0x0/0x11
[747393.713823] <EOI> [<ffffffff81037adb>] ? native_safe_halt+0xb/0x10
[747393.713829] [<ffffffff8155ce86>] ? notifier_call_chain+0x16/0x80
[747393.713831] [<ffffffff8101a6ad>] ? default_idle+0x3d/0x90
[747393.713833] [<ffffffff8101a763>] ? c1e_idle+0x63/0x120
[747393.713836] [<ffffffff81010e63>] ? cpu_idle+0xb3/0x110
[747393.713839] [<ffffffff81541c8b>] ? rest_init+0x6b/0x80
[747393.713843] [<ffffffff8187edcc>] ? start_kernel+0x368/0x371
[747393.713845] [<ffffffff8187e33a>] ? x86_64_start_reservations+0x125/0x129
[747393.713847] [<ffffffff8187e438>] ? x86_64_start_kernel+0xfa/0x109
[747393.713848] Mem-Info:
[747393.713850] Node 0 DMA per-cpu:
[747393.713851] CPU 0: hi: 0, btch: 1 usd: 0
[747393.713853] CPU 1: hi: 0, btch: 1 usd: 0
[747393.713854] Node 0 DMA32 per-cpu:
[747393.713856] CPU 0: hi: 186, btch: 31 usd: 186
[747393.713857] CPU 1: hi: 186, btch: 31 usd: 65
[747393.713858] Node 0 Normal per-cpu:
[747393.713860] CPU 0: hi: 186, btch: 31 usd: 98
[747393.713861] CPU 1: hi: 186, btch: 31 usd: 156
[747393.713865] active_anon:466544 inactive_anon:140501 isolated_anon:0
[747393.713865] active_file:61663 inactive_file:135115 isolated_file:0
[747393.713866] unevictable:0 dirty:75800 writeback:35617 unstable:1118
[747393.713867] free...


CvB (cvb-kruemel) wrote :

Something must have changed, though. Earlier I saw this bug whenever I increased network load on the virtio device; with Linux server64 2.6.32-25-server #45-Ubuntu the situation has improved, i.e. the system has not crashed again so far, despite some load tests.

lhotari (lartsa) wrote :
lhotari (lartsa) wrote :

Changelog of Debian kernel 2.6.32-22 :
   * net/{tcp,udp,llc,sctp,tipc,x25}: Add limit for socket backlog
     (Closes: #592187)

lhotari (lartsa) wrote :
lhotari (lartsa) wrote :

I hope this fix gets included in 10.04.1 LTS as soon as possible. We haven't been able to upgrade our Ubuntu VMs (running on Linux KVM) to 10.04 because of this bug. I think the priority should be much higher. Could someone assign this bug to someone in the Ubuntu Server team?

Sergey Svishchev (svs) wrote :

To summarize:

* This is an "RX lockup" condition: the VM still runs and can send traffic (mostly ARP requests), but cannot receive replies.

* On Lucid, two backports to 2.6.32 are needed, "virtio_net: Make delayed refill more reliable" and "Add limit for socket backlog". I didn't research which mainline versions include them. Debian bugs #576838 and #592187 discuss each of these backports.

* On Karmic, another backport is needed [1], not to the kernel but to QEMU: ("Fix a race condition where qemu finds that there are not enough virtio ring buffers available and the guest make more buffers available before qemu can enable notifications")

[1] http://forum.proxmox.com/threads/3117-virtio-net-crashing-(stop-sending-traffic)?p=20247#post20247
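
The "RX lockup" signature described above (transmit counters still advancing while receive counters stand still) can be checked from inside the guest. A minimal sketch, assuming the standard /proc/net/dev layout of eight receive fields followed by eight transmit fields; the function names and the two-sample approach are illustrative assumptions, not part of any of the patches discussed here:

```python
def parse_counters(proc_net_dev_text, iface):
    """Return (rx_packets, tx_packets) for iface from /proc/net/dev contents."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # fields[0..7] are receive counters, fields[8..15] transmit counters;
            # index 1 is RX packets, index 9 is TX packets.
            return int(fields[1]), int(fields[9])
    raise KeyError(iface)

def looks_like_rx_stall(before, after):
    """True if TX packets advanced between two samples while RX packets did not."""
    rx_before, tx_before = before
    rx_after, tx_after = after
    return tx_after > tx_before and rx_after == rx_before
```

Take two samples a few seconds apart, e.g. `parse_counters(open("/proc/net/dev").read(), "eth0")`. An idle interface, where neither counter moves, is deliberately not flagged.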

Sergey Svishchev (svs) wrote :

SuSE added another two patches to their 2.6.32 tree. Quoting the first message:

"These are patches which we have found useful for our 2.6.32 based SLES 11 SP1 release.

The first patch ["Make delayed refill more reliable"] is already upstream, but should be included in stable.

The second patch ["If the add_buf operation fails, indicate failure to the caller"]
is a subset of another upstream patch. Again, stable material.

The third patch ["virtio_net: Add schedule check to napi_enable call"]
solves the last remaining issue we saw when testing kvm configurations with the SUSE
certification test suite. Under heavy load, we observed rx stalls (first two patches applied), and this
third patch was crafted to address the issue. Please apply to stable.
I assume this last problem also exists in more recent kernels than 2.6.32, but I haven't validated that."

http://article.gmane.org/gmane.comp.emulators.kvm.devel/53655
http://article.gmane.org/gmane.comp.emulators.kvm.devel/53653
http://article.gmane.org/gmane.comp.emulators.kvm.devel/53654

Peter Lieven (plieven) wrote :

Sergey, thank you very much for the summary. From what I can see, all patches have made it into the stable kernel since 2.6.34, except the third one, "virtio_net: Add schedule check to napi_enable call". Is that right? That might be the reason why I still see the issue with recent kernels (2.6.34+). Any clue why the third patch is not in the stable tree?

Sergey Svishchev (svs) wrote :

It's not in any kernel tree at all. It probably needs someone to bring it to LKML's attention.

Peter Lieven (plieven) wrote :

This is weird. I remember that when I last experimented with a vanilla 2.6.34 some time ago, it still crashed under heavy load. 2.6.34 should have all the patches mentioned applied, except for the not yet merged napi fix from SUSE.

I'm currently rebuilding my test setup (binary news spool server) and will see if it still crashes reliably. This usually happened within a few hours.

Meanwhile I'm building a new server kernel for Lucid from the Ubuntu kernel git.
It seems that
 - Make delayed refill more reliable (is already applied)

I have now manually applied:
 - If the add_buf operation fails, indicate failure to the caller (in vanilla kernel since 2.6.34.2 / 2.6.35)
 - virtio_net: Add schedule check to napi_enable call

I will test whether this kernel runs without crashing.

However, I was not able to apply "Add limit for socket backlog" (in vanilla kernel since 2.6.34) and the related patches for all protocols. The Ubuntu kernel code seems to differ too significantly. Is a backport for Lucid scheduled?

Sergey Svishchev (svs) wrote :

Right, Lucid kernel package 2.6.32-25.44 includes the "more reliable" fix (see bug 607824).

lhotari, do you use this kernel?

lhotari (lartsa) wrote :

Sergey, I'm still running 9.04 (2.6.28-19.24) in production because of this virtio_net stability problem. I'd like to upgrade the VMs to 10.04 LTS after this problem is resolved. I haven't done retesting lately.

Peter Lieven (plieven) wrote :

It took about 5 days this time to crash an unpatched Ubuntu LTS 10.04.1 64-bit server. I will now try my home-built kernel with the "virtio_net: Add schedule check to napi_enable call" patch included.
If this patch is the final solution, can someone help with:
 - getting this patch in the vanilla kernel
 - making it available in the official ubuntu kernel

Peter Lieven (plieven) wrote :

It seems that "virtio_net: Add schedule check to napi_enable call" is the final solution to the virtio_net crashes.
I have had a news server (constantly 300-500 Mbit/s throughput) running a modified kernel with this patch for almost one month now.

Who can help get this patch into the official Ubuntu Lucid kernel and into the vanilla kernel sources?

nutznboltz (nutznboltz-deactivatedaccount) wrote :

This was fixed, but not in Ubuntu yet.

See:

"udp: use limited socket backlog"
http://kerneltrap.org/mailarchive/linux-netdev/2010/3/3/6271096

Bug#576838: virtio network crashes again
starting with comment 184
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=592187#184

LP#661212 "crash after kswapd page allocation failure"
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/661212

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Oh, sorry, I hadn't seen this yet:
KVM: add schedule check to napi_enable call
http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660

If you want to get this into Lucid make a debdiff
https://wiki.ubuntu.com/PackagingGuide/Recipes/Debdiff

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Thinking about this a bit more, the "udp: use limited socket backlog" patch is still necessary, since the problem of "page allocation failures" from NIC drivers also affects real hardware with the e100, e1000 and e1000e drivers.

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Debdiff containing two patches:

 * [PATCH] KVM: add schedule check to napi_enable call
    - http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660
  * [PATCH 4/8] udp: use limited socket backlog
    - http://kerneltrap.org/mailarchive/linux-netdev/2010/3/3/6271096

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Oh, I see what you mean by "can't integrate backlog for every protocol".

I rebuilt the debdiff with only the virtio driver patch and deleted the old debdiff with two patches.

Debdiff containing patch:

 * [PATCH] KVM: add schedule check to napi_enable call
    - http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660

nutznboltz (nutznboltz-deactivatedaccount) wrote :
nutznboltz (nutznboltz-deactivatedaccount) wrote :

To get any further with this bug report read
https://wiki.ubuntu.com/StableReleaseUpdates
and try to do what it says.

nutznboltz (nutznboltz-deactivatedaccount) wrote :

I did the debdiff, updated the bug report description for the SRU, and subscribed https://launchpad.net/~ubuntu-sru

Now this needs sponsorship.
https://wiki.ubuntu.com/SponsorshipProcess

In the meantime I started this PPA with the patch
https://launchpad.net/~nutznboltz/+archive/lucid-virtio-napi

description: updated
nutznboltz (nutznboltz-deactivatedaccount) wrote :

I'm actively testing two concurrent "scp -r" copies of more than 200 GB from an NFS directory to a remote host, the workload that crashed the VM on the stock kernel.

$ uname -a
Linux dubnium 2.6.32-28-server #55ubuntu1~ppa3~lucid1-Ubuntu SMP Sun Feb 6 01:03:25 UTC 2011 x86_64 GNU/Linux

To test with the PPA run

sudo apt-get install python-software-properties
sudo apt-add-repository ppa:nutznboltz/lucid-virtio-napi
sudo apt-get update
sudo apt-get upgrade
sudo reboot

nutznboltz (nutznboltz-deactivatedaccount) wrote :

486.5 GB transmitted without locking up.

$ ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:16:36:1c:fe:1a
          inet addr:192.168.1.105 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::216:36ff:fe1c:fe1a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:81411197 errors:0 dropped:0 overruns:0 frame:0
          TX packets:333395491 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5621876935 (5.6 GB) TX bytes:486567007874 (486.5 GB)

Peter Lieven (plieven) wrote :

Just for the record: I have a binary newsfeed test server with the napi patch that has been running stably for 52 days. It really seems that this was the missing piece! More than 66 TB of data transferred.

root@ubuntu-newsfeed:~# uptime
 13:00:49 up 52 days, 14:47, 2 users, load average: 2.32, 2.61, 2.69

root@ubuntu-newsfeed:~# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 52:54:00:fe:01:2c
          inet addr:x Bcast:x Mask:255.255.255.128
          inet6 addr: x/64 Scope:Global
          inet6 addr: fe80::5054:ff:fefe:12c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:45763921072 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2912546626 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:66249766548775 (66.2 TB) TX bytes:210223722986 (210.2 GB)
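
For scale, the RX byte counter and the uptime above can be turned into an average receive rate. A quick calculation, assuming the interface counters have not been reset since boot (nothing in the output confirms or rules that out):

```python
from datetime import timedelta

# Values copied from the uptime and ifconfig output above.
RX_BYTES = 66249766548775
UPTIME = timedelta(days=52, hours=14, minutes=47)

def average_mbit_per_s(byte_count, elapsed):
    """Average rate in Mbit/s (10^6 bits per second) over the elapsed time."""
    return byte_count * 8 / elapsed.total_seconds() / 1e6

print(round(average_mbit_per_s(RX_BYTES, UPTIME), 1))
```

This comes out at roughly 117 Mbit/s averaged over the whole run; peak rates may of course have been higher.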

nutznboltz (nutznboltz-deactivatedaccount) wrote :

This patch was posted via E-mail on June 3, 2010
 * [PATCH] KVM: add schedule check to napi_enable call
    - http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660
and it never made it into the upstream kernel.

Shortly after that these two were added to the upstream kernel:

http://kerneltrap.org/mailarchive/linux-netdev/2010/7/3/6280482
commit 1788f49548860fa1c861ee3454d47b466c877e43
Author: Michael S. Tsirkin
Date: Fri Jul 2 16:32:55 2010 +0000

    virtio_net: do not reschedule rx refill forever

    We currently fill all of RX ring, then add_buf
    returns ENOSPC, which gets mis-detected as an out of
    memory condition and causes us to reschedule the work,
    and so on forever. Fix this by oom = err == -ENOMEM;

http://kerneltrap.org/mailarchive/git-commits-head/2010/7/8/42134
commit 58eba97d0774c69b1cf3e5a8ac74419409d1abbf
Author: Rusty Russell
Date: Fri Jul 2 16:34:01 2010 +0000

    virtio_net: fix oom handling on tx

    virtio net will never try to overflow the TX ring, so the only reason
    add_buf may fail is out of memory. Thus, we can not stop the
    device until some request completes - there's no guarantee anything
    at all is outstanding.

    Make the error message clearer as well: error here does not
    indicate queue full.

Did you test with either or both of them?

I was informed that an SRU would not be done unless the patch was in the upstream kernel.
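
The first of the two commits quoted above hinges on telling "ring full" (ENOSPC) apart from "out of memory" (ENOMEM). A toy model of that distinction, not the driver code: the ring size, the function names and the pre-fix behaviour modelled here (every add_buf failure treated as OOM) are assumptions drawn only from the wording of the commit message.

```python
import errno

RING_SIZE = 4  # hypothetical ring capacity, tiny on purpose

def add_buf(ring, memory_available):
    """Toy stand-in for the virtqueue add_buf operation."""
    if len(ring) >= RING_SIZE:
        return -errno.ENOSPC   # ring is full: nothing more to add
    if not memory_available:
        return -errno.ENOMEM   # genuine allocation failure
    ring.append("rx-buffer")
    return 0

def refill_needs_retry(ring, memory_available, fixed):
    """Fill the ring until add_buf fails; report whether to reschedule the refill."""
    err = 0
    while err == 0:
        err = add_buf(ring, memory_available)
    if fixed:
        return err == -errno.ENOMEM   # the check named in the commit message
    return True                       # pre-fix model: any failure counts as OOM

print(refill_needs_retry([], True, fixed=False))  # full ring still asks for a retry
print(refill_needs_retry([], True, fixed=True))   # full ring is treated as success
```

In the pre-fix model a completely filled ring still requests another refill, which is the "reschedule forever" loop the commit message describes; with the check in place only a real allocation failure does.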

Peter Lieven (plieven) wrote :

I tested with kernels that include both of these patches, but they still crashed.
I also think that both patches you mentioned have been backported to Ubuntu LTS.

How should we proceed? Contact the virtio developers and the developers from SUSE and ask why this patch never went upstream?

nutznboltz (nutznboltz-deactivatedaccount) wrote :

I already sent an E-mail to Bruce Rogers of Novell asking about why his patch didn't get into the upstream kernel, see:
https://lists.ubuntu.com/archives/kernel-team/2011-February/014414.html

You are welcome to try reaching out to anyone who might have the answer as to why this patch never made it into upstream.

nutznboltz (nutznboltz-deactivatedaccount) wrote :

I have reason to believe that the absence of this patch in upstream kernels is a critical oversight.

I used "apt-add-repository ppa:kernel-ppa/ppa" to put the "Natty" kernel on my Lucid test VM

$ uname -a
Linux dubnium 2.6.38-2-server #29~lucid1-Ubuntu SMP Mon Feb 7 15:09:10 UTC 2011 x86_64 GNU/Linux

The stress test crashed the VM's network driver after copying only 63 GB.

The test consists of running "scp -r /nfs_read_only/1 remote:/dir/1" concurrently with "scp -r /nfs_read_only/2 remote:/dir/2"

The NFS mount options on the client are:
ro,tcp,hard,intr,sloppy,addr=10.1.1.1

Peter Lieven (plieven) wrote :

Can you patch the natty kernel with the napi patch to be absolutely sure?

You might also use netcat to transfer files between your boxes or use
iperf. This might reduce the time to crash.

nutznboltz (nutznboltz-deactivatedaccount) wrote :

You can test that too.

First install the tools
apt-get install dpkg-dev python-software-properties
After the tools are installed run
apt-add-repository ppa:kernel-ppa/ppa
and then edit:
/etc/apt/sources.list.d/kernel-ppa-ppa-lucid.list
Copy this line
deb http://ppa.launchpad.net/kernel-ppa/ppa/ubuntu lucid main
and change the copy of the line to
deb-src http://ppa.launchpad.net/kernel-ppa/ppa/ubuntu lucid main
Run
apt-get update
Then run
apt-get source linux-image-2.6.38-2-server
That will pull down about 96 MB of kernel source package and unpack it via dpkg-source -x to 578 MB in total.

Then apply the patch and run
debuild -i -uc -us -b
to build the unsigned binary deb packages.

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Peter, does your news server use NFS?

Sergey Svishchev (svs) wrote :

I've seen this happen on servers that run java webapps; it seems that high java heap usage (especially when heap size is close to physical memory size) helps trigger one of aforementioned bugs. Unfortunately, I don't have a simple test case.

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Bruce Rogers of Novell replied to my E-mail saying that the patch should have been accepted upstream and it was an oversight.
https://lists.ubuntu.com/archives/kernel-team/2011-February/014428.html

nutznboltz (nutznboltz-deactivatedaccount) wrote :

The attached debdiff contains my modifications, including my updated version of the patch from Bruce Rogers of Novell. I had to modify the patch a bit to make it work with 2.6.38, which is what Natty is based on.

I used the Ubuntu Kernel Team Daily Build PPA (which isn't really updated daily) as the starting point
https://launchpad.net/~kernel-ppa/+archive/ppa

I put the results in the PPA
https://launchpad.net/~nutznboltz/+archive/natty-virtio-napi

nutznboltz (nutznboltz-deactivatedaccount) wrote :

The patched 2.6.38 kernel is running and has not crashed while copying data overnight.

$ uname -a
Linux dubnium 2.6.38-2-server #29~lucid3-Ubuntu SMP Tue Feb 8 21:49:57 UTC 2011 x86_64 GNU/Linux

$ date;ps -eO lstart | grep "scp -r" | egrep -v 'grep|ssh'
Wed Feb 9 05:31:07 EST 2011
 1035 Tue Feb 8 22:42:38 2011 R pts/2 00:08:21 scp -r /vol/ndnp/ndnp_staging/batches/kyu oxygen:/storage/scratch/virtio-net-test/2
 1041 Tue Feb 8 22:42:39 2011 R pts/1 00:08:25 scp -r /vol/ndnp/ndnp_staging/batches/dlc oxygen:/storage/scratch/virtio-net-test/1

$ du -hs /storage/scratch/virtio-net-test/
328G /storage/scratch/virtio-net-test/

Revision history for this message
Peter Lieven (plieven) wrote :

I have not tested with NFS, but my news server test was also reliably crashing without the NAPI patch.

I have seen Bruce's response. Will he take care of this patch going upstream?

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

If there is a problem fill out complaint form and place it in an envelope addressed to...
http://www.youtube.com/watch?v=gEyFH-a-XoQ#t=1m

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

At this point a representative from the Ubuntu kernel team thanked me for my work in driving this; however, no evidence exists that the patch has made it into the upstream kernel yet.
https://lists.ubuntu.com/archives/kernel-team/2011-February/014433.html

Revision history for this message
Stefan Bader (smb) wrote :

Having Rusty pick it up should bring it (usually first to linux-next) to Linus' tree. As soon as it hits there we can go on with adding it to 10.04 and 10.10. Sorry about the procedure being somewhat tedious, but this makes sure that relevant maintainers have looked at the change and it is being integrated for any newer release. As we see from the fact that this got forgotten for so long, it is just too easy to drop things without being anal to a certain degree.

I will try to monitor the tree myself, but feel free to nudge us with a reminder to the kernel team mailing list in case this slips by unnoticed (the only good thing about it being marked as stable material is that over time it would come back from the 2.6.32.y longterm tree, but it is understandably something that should better be in sooner rather than later).

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

I started following @Linux_Kernel
http://twitter.com/Linux_Kernel

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

$ git show-branch
[master] Merge branch 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6

$ git log drivers/net/virtio_net.c
commit 3e9d08ec0a68f6faf718d5a7e050fe5ca0ba004f
Author: Bruce Rogers <email address hidden>
Date: Thu Feb 10 11:03:31 2011 -0800

    virtio_net: Add schedule check to napi_enable call

    Under harsh testing conditions, including low memory, the guest would
    stop receiving packets. With this patch applied we no longer see any
    problems in the driver while performing these tests for extended periods
    of time.

    Make sure napi is scheduled subsequent to each napi_enable.

    Signed-off-by: Bruce Rogers <email address hidden>
    Signed-off-by: Olaf Kirch <email address hidden>
    Cc: <email address hidden>
    Signed-off-by: Rusty Russell <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

Stefan Bader (smb)
Changed in linux (Ubuntu Lucid):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Maverick):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
status: New → In Progress
importance: Undecided → Medium
Stefan Bader (smb)
Changed in linux (Ubuntu Lucid):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Maverick):
status: In Progress → Fix Committed
Revision history for this message
Andy Whitcroft (apw) wrote :

This is now officially in Linus' tree but not yet tagged. It will be in the next Natty upload.

Changed in linux (Ubuntu):
assignee: nobody → Andy Whitcroft (apw)
Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Tested OK on my VM:
$ uname -a
Linux dubnium 2.6.32-29-server #58pre201102150902-Ubuntu SMP Tue Feb 15 10:16:07 UTC 2011 x86_64 GNU/Linux

Revision history for this message
Andy Whitcroft (apw) wrote :

This is now Fix Committed for Natty as we have just rebased to v2.6.38-rc5 mainline which contains this fix.

Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Actually, it's not in proposed yet.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.38-4.31

---------------
linux (2.6.38-4.31) natty; urgency=low

  [ Andy Whitcroft ]

  * add in bugs closed by upstream patches pulled in by rebases
  * rebase to 795abaf1e4e188c4171e3cd3dbb11a9fcacaf505
  * [Config] enable CONFIG_VSX to allow use of vector instuctions
  * resync with maverick 98defa1c5773a3d7e4c524967eb01d5bae035816
  * rebase to mainline v2.6.38-rc5
  * SAUCE: ecryptfs: read on a directory should return EISDIR if not
    supported
    - LP: #719691

  [ Colin Ian King ]

  * SAUCE: Dell All-In-One: Remove need for Dell module alias

  [ Manoj Iyer ]

  * SAUCE: (drop after 2.6.38) add ricoh 0xe823 pci id.
    - LP: #717435

  [ Tim Gardner ]

  * [Config] CONFIG_CRYPTO_CRC32C_INTEL=y

  [ Upstream Kernel Changes ]

  * Quirk to fix suspend/resume on Lenovo Edge 11,13,14,15
    - LP: #702434
  * vfs: fix BUG_ON() in fs/namei.c:1461

  [ Vladislav P ]

  * SAUCE: Release BTM while sleeping to avoid deadlock.
    - LP: #713837

  [ Major Kernel Changes ]

  * rebase from v2.6.38-rc4 to v2.6.38-rc5
    - LP: #579276
    - LP: #715877
    - LP: #713769
  * resync with Maverick Ubuntu-2.6.35-27.47
 -- Andy Whitcroft <email address hidden> Fri, 11 Feb 2011 17:24:09 +0000

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

This patch is in 2.6.37.1 as of Thu, 10 Feb 2011 19:03:31 +0000

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

The 2.6.32-29.58 kernel update that was recently pushed out lacks the virtio-net napi patch.
I updated my PPA against the updated kernel.
https://launchpad.net/~nutznboltz/+archive/lucid-virtio-napi

The updated PPA is still compiling; ETA Mar 2, 2011 02:40:00 UTC

To test with my PPA run:
sudo apt-get install python-software-properties
sudo apt-add-repository ppa:nutznboltz/lucid-virtio-napi
sudo apt-get update
sudo apt-get upgrade
sudo reboot

Revision history for this message
Martin Pitt (pitti) wrote : Please test proposed package

Accepted linux into lucid-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Revision history for this message
Martin Pitt (pitti) wrote :

Accepted linux into maverick-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

I had my first Karmic KVM guest encounter this issue today. I'm going to add support for Karmic to my PPA.

Revision history for this message
Martin Pitt (pitti) wrote :

Accepted linux-ec2 into lucid-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Brad Figg (brad-figg)
tags: added: verification-needed-lucid verification-needed-maverick
Revision history for this message
Steve Conklin (sconklin) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-<release>' to 'verification-done-<release>'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Will the patch be added to 2.6.31?

Meanwhile
$ uname -a
Linux dubnium 2.6.32-30-server #59-Ubuntu SMP Tue Mar 1 22:46:09 UTC 2011 x86_64 GNU/Linux

xxxx 7765 7764 1 17:14 pts/3 00:01:33 scp -r /vol/ndnp/ndnp_staging/batches/kyu oxygen:/storage/scratch/virtio-net-test/2
xxxxx 7766 7765 22 17:14 pts/3 00:18:46 /usr/bin/ssh -x -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes oxygen scp -r -t /storage/scratch/virtio-net-test/2
xxx 7771 7770 2 17:14 pts/2 00:01:46 scp -r /vol/ndnp/ndnp_staging/batches/dlc oxygen:/storage/scratch/virtio-net-test/1
xxxxxx 7772 7771 26 17:14 pts/2 00:22:09 /usr/bin/ssh -x -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes oxygen scp -r -t /storage/scratch/virtio-net-test/1

55 GB of data copied via "two concurrent scp" test described in previous messages, still going.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

110 GB copied in three hours with no problems.

tags: added: verification-done-lucid
removed: verification-needed-lucid
Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

I don't use Maverick in my environment and I still use Karmic. I don't know who you are going to get to do the Maverick testing but it ain't me.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

16 people (17 including me) checked the "affects me too" button on this bug report. Will any of them test proposed on Maverick?

The instructions are right here:
https://wiki.ubuntu.com/Testing/EnableProposed

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

In bash

sudo -i
cat >> /etc/apt/preferences << EOF
Package: *
Pin: release a=maverick-security
Pin-Priority: 990

Package: *
Pin: release a=maverick-updates
Pin-Priority: 900

Package: *
Pin: release a=maverick-proposed
Pin-Priority: 400
EOF

echo "deb http://archive.ubuntu.com/ubuntu/ maverick-proposed restricted main multiverse universe" > /etc/apt/sources.list.d/proposed.list

apt-get update
apt-get install linux-image-2.6.35-28-server
# or apt-get install linux-image-2.6.35-28-generic if you are testing a Desktop not a server
reboot

Is that really so hard?
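
For testers on another release, the same stanzas can be generated rather than typed. This is a rough sketch only: pin_stanza is a hypothetical helper name, and the priorities simply reuse the ones from the Maverick example above.

```shell
# Hypothetical helper: print one apt preferences stanza for an archive
# (for example lucid-proposed) at the given pin priority, followed by
# a blank separator line.
pin_stanza() {
    printf 'Package: *\nPin: release a=%s\nPin-Priority: %s\n\n' "$1" "$2"
}

# Example (not run here), as root, for a Lucid guest:
# { pin_stanza lucid-security 990
#   pin_stanza lucid-updates 900
#   pin_stanza lucid-proposed 400; } >> /etc/apt/preferences
```

Keeping -proposed at a priority below the other pockets means only packages requested explicitly, such as the test kernel, are pulled from it.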

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Well, I rebuilt Dubnium as Maverick and ran the test:

$ w
 15:12:16 up 1:25, 3 users, load average: 1.94, 1.89, 1.90
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
nutz pts/3 140.147.245.89:S 13:51 1:20m 17:04 15:53 /usr/bin/ssh -x -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes -- oxyg
nutz pts/4 140.147.245.89:S 13:51 1:20m 14:47 13:43 /usr/bin/ssh -x -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes -- oxyg
nutz pts/5 140.147.245.89:S 14:48 0.00s 0.59s 0.00s w

$ uname -a
Linux dubnium 2.6.35-28-server #49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011 x86_64 GNU/Linux

$ df -hP /storage/scratch/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/storage-scratch 500G 115G 386G 23% /storage/scratch

tags: added: verification-done-maverick
removed: verification-needed-maverick
Revision history for this message
Divinsa Development (dev-divinsa) wrote :

Would love to see a fix for this as well: we are running more than ten 10.04 instances on EC2 and often see crashes and hangs with this bug.

Finding that increasing the MTU to 9000 shortens the time to failure.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

@Divinsa did you test proposed?
https://wiki.ubuntu.com/Testing/EnableProposed

Revision history for this message
AvaCam (cameron-pierce) wrote :

I've installed the proposed kernel onto both a 10.10 and a 10.04 VM. I've just set their MTUs to 9000. What would be a good way to stress test them?

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

@AvaCam use up RAM until you start getting "page allocation failure." messages in the system logs.

What is happening is that the network driver needs to have a free page of RAM. It also cannot wait around for a page to become free. It can, however, try again later. So if there are no free pages, the network driver aborts with a lengthy series of system log messages including "page allocation failure" and then expects to try again later. This works great for, say, e1000, but virtio-net sometimes never gets the retry correct and hangs instead.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

@AvaCam Note also that the file system buffer cache uses up RAM so activity that reads in many files while performing network I/O triggers this bug. That is why my test case of running two concurrent recursive "scp -r ..." jobs causes an unpatched virtio-net driver to lock up.
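
To tell whether a stress run has actually reached the memory-pressure condition, the log messages can be counted. A minimal sketch, assuming the kernel log is readable via dmesg; count_alloc_failures is a made-up name for illustration.

```shell
# Hypothetical helper: count lines on stdin that contain the
# "page allocation failure" message. Always exits 0, even when the
# count is zero.
count_alloc_failures() {
    grep -c 'page allocation failure' || true
}

# Example (not run here): check the guest's kernel log during the test.
# dmesg | count_alloc_failures
```

A non-zero count on a guest that still accepts SSH connections suggests the driver's retry is working; a non-zero count followed by a dead network matches the hang described above.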

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Lucid proposed kernel with virtio-net napi patch passed all of the QA Team's regression testing
https://wiki.ubuntu.com/QATeam/KernelSRU-lucid-2.6.32-30.59

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.32-30.59

---------------
linux (2.6.32-30.59) lucid-proposed; urgency=low

  [ Steve Conklin ]

  * Release Tracking Bug
    - LP: #727336

  [ Tim Gardner ]

  * [Config] CONFIG_IRQ_TIME_ACCOUNTING=n
    - LP: #723819

  [ Upstream Kernel Changes ]

  * virtio_net: Add schedule check to napi_enable call
    - LP: #579276
  * NFS: fix the return value of nfs_file_fsync()
    - LP: #585657
  * block: check for proper length of iov entries earlier in
    blk_rq_map_user_iov(), CVE-2010-4163
    - LP: #721504
    - CVE-2010-4163
  * filter: make sure filters dont read uninitialized memory
    - LP: #721282
    - CVE-2010-4158
  * tty: Make tiocgicount a handler, CVE-2010-4076, CVE-2010-4077
    - LP: #720189
    - CVE-2010-4077
  * staging: usbip: remove double giveback of URB
    - LP: #723819
  * USB: EHCI: ASPM quirk of ISOC on AMD SB800
    - LP: #723819
  * rt2x00: add device id for windy31 usb device
    - LP: #723819
  * ALSA: snd-usb-us122l: Fix missing NULL checks
    - LP: #723819
  * hwmon: (via686a) Initialize fan_div values
    - LP: #723819
  * USB: serial: handle Data Carrier Detect changes
    - LP: #723819
  * USB: CP210x Add two device IDs
    - LP: #723819
  * USB: CP210x Removed incorrect device ID
    - LP: #723819
  * USB: usb-storage: unusual_devs update for Cypress ATACB
    - LP: #723819
  * USB: usb-storage: unusual_devs update for TrekStor DataStation maxi g.u
    external hard drive enclosure
    - LP: #723819
  * USB: usb-storage: unusual_devs entry for CamSport Evo
    - LP: #723819
  * USB: usb-storage: unusual_devs entry for Coby MP3 player
    - LP: #723819
  * USB: serial: Updated support for ICOM devices
    - LP: #723819
  * USB: adding USB support for Cinterion's HC2x, EU3 and PH8 products
    - LP: #723819
  * USB: EHCI: ASPM quirk of ISOC on AMD Hudson
    - LP: #723819
  * USB: EHCI: fix DMA deallocation bug
    - LP: #723819
  * USB: g_printer: fix bug in module parameter definitions
    - LP: #723819
  * USB: io_edgeport: fix the reported firmware major and minor
    - LP: #723819
  * USB: ti_usb: fix module removal
    - LP: #723819
  * USB: Storage: Add unusual_devs entry for VTech Kidizoom
    - LP: #723819
  * USB: ftdi_sio: add ST Micro Connect Lite uart support
    - LP: #723819
  * USB: cdc-acm: Adding second ACM channel support for Nokia N8
    - LP: #723819
  * USB: ftdi_sio: Add VID=0x0647, PID=0x0100 for Acton Research
    spectrograph
    - LP: #723819
  * USB: prevent buggy hubs from crashing the USB stack
    - LP: #723819
  * staging: comedi: add support for newer jr3 1-channel pci board
    - LP: #723819
  * staging: comedi: ni_labpc: Use shared IRQ for PCMCIA card
    - LP: #723819
  * Staging: hv: fix sysfs symlink on hv block device
    - LP: #723819
  * staging: hv: Enable sending GARP packet after live migration
    - LP: #723819
  * hvc_iucv: allocate memory buffers for IUCV in zone DMA
    - LP: #723819
  * iwlagn: enable only rfkill interrupt when device is down
    - LP: #723819
  * ath9k: Fix bug in delimiter padding computation
    - LP: #723819
  * correct vdso version string
    - LP: #723819
  * fix medium error problems with so...


Changed in linux (Ubuntu Lucid):
status: Fix Committed → Fix Released
Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

What really bothers me is that I can't find anything in writing about why this patch did not make it into Karmic.

If someone would be so kind as to point out a link that explains why this patch did not make it into Karmic I would be grateful.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Never mind, I found it here:
https://wiki.ubuntu.com/KernelTeam/KernelUpdates
``For normal 18-month releases, we will only accept updates to the kernel for 3-4 months after release. At this point we consider the in-development release to be stable enough for testing, and the primary target for fixing bugs.''

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.35-28.49

---------------
linux (2.6.35-28.49) maverick-proposed; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #726796

  [ Colin Ian King ]

  * SAUCE: Dell All-In-One: Remove need for Dell module alias

  [ Manoj Iyer ]

  * SAUCE: add ricoh 0xe823 pci id.
    - LP: #717435

  [ Upstream Kernel Changes ]

  * virtio_net: Add schedule check to napi_enable call
    - LP: #579276
  * mmc: make sdhci work with ricoh mmc controller
    - LP: #717435
  * NFS: fix the return value of nfs_file_fsync()
    - LP: #585657
  * rt2x00: Pad beacon to multiple of 32 bits.
    - LP: #659143
  * rt2x00: Fix firmware loading regression on x86_64.
    - LP: #659143
  * rt2x00: Check for errors from skb_pad() calls
    - LP: #659143
  * block: check for proper length of iov entries earlier in
    blk_rq_map_user_iov(), CVE-2010-4163
    - LP: #721504
    - CVE-2010-4163
  * tty: Make tiocgicount a handler, CVE-2010-4076, CVE-2010-4077
    - LP: #720189
    - CVE-2010-4077
    - CVE-2010-4076
  * rds: Integer overflow in RDS cmsg handling, CVE-2010-4175
    - LP: #721455
    - CVE-2010-4175
 -- Brad Figg <email address hidden> Mon, 28 Feb 2011 13:02:53 -0800

Changed in linux (Ubuntu Maverick):
status: Fix Committed → Fix Released