Bug #579276 “Lost network in KVM VM / virtio_net page allocation...” : Bugs : linux package : Ubuntu

Revision history for this message

lhotari (lartsa) wrote on 2010-05-12:

#1

upstream bug report: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=576838

Jeremy Foshee (jeremyfoshee) wrote on 2010-05-17:

#2

Hi lhotari,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/releases/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 579276

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags:	added: needs-kernel-logs
tags:	added: needs-upstream-testing
tags:	added: kj-triage
Changed in linux (Ubuntu):
status:	New → Incomplete

lhotari (lartsa) wrote on 2010-05-19: BootDmesg.txt

#3

BootDmesg.txt (21.5 KiB, text/plain)

apport information

tags:	added: apport-collected
description:	updated

lhotari (lartsa) wrote on 2010-05-19: Lspci.txt

#4

Lspci.txt (4.4 KiB, text/plain)

apport information

lhotari (lartsa) wrote on 2010-05-19: ProcCpuinfo.txt

#5

ProcCpuinfo.txt (499 bytes, text/plain)

apport information

lhotari (lartsa) wrote on 2010-05-19: ProcInterrupts.txt

#6

ProcInterrupts.txt (1.2 KiB, text/plain)

apport information

lhotari (lartsa) wrote on 2010-05-19: ProcModules.txt

#7

ProcModules.txt (615 bytes, text/plain)

apport information

lhotari (lartsa) wrote on 2010-05-19: UdevDb.txt

#8

UdevDb.txt (53.9 KiB, text/plain)

apport information

lhotari (lartsa) wrote on 2010-05-19: UdevLog.txt

#9

UdevLog.txt (122.6 KiB, text/plain)

apport information

Sergey Svishchev (svs) wrote on 2010-06-02:

#10

Quoting Debian bug report:

"> It seems as if Redhat encountered and fixed this bug back in January:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=554078

The original upstream bug fix was:

commit 3161e453e496eb5643faad30fff5a5ab183da0fe
Author: Rusty Russell <email address hidden>
Date: Wed Aug 26 12:22:32 2009 -0700

virtio: net refill on out-of-memory

which was included in Linux 2.6.31.

However, another fix was needed on top of that:

commit 39d321577405e8e269fd238b278aaf2425fa788a
Author: Herbert Xu <email address hidden>
Date: Mon Jan 25 15:51:01 2010 -0800

virtio_net: Make delayed refill more reliable"

The latter commit didn't make it into 2.6.32.

Jeremy Foshee (jeremyfoshee) on 2010-06-02

Changed in linux (Ubuntu):
status:	Incomplete → Triaged
importance:	Undecided → Medium
tags:	added: cherry-pick kernel-net removed: needs-kernel-logs needs-upstream-testing

Robert C. Sheets (rcsheets) wrote on 2010-06-11:

#11

So is installation of a recent upstream kernel thought to be a workaround for this?

Robert C. Sheets (rcsheets) wrote on 2010-06-12:

#12

Regarding my last comment: I tried it, using the 2010-05-31-lucid mainline kernel build. The issue seemed to take longer to come about, but it still happened eventually.

RoyK (roysk) wrote on 2010-06-13:

#13

I can confirm this on a Lucid VM running in KVM with a Lucid host. This mainly happens if the VM is copying data to/from an NFS share (guest as the NFS client, host as the NFS server). IMHO this should be prioritised higher than 'medium' since it doesn't take more than just minutes on full network use (rsync) to kill the server.

roy

RoyK (roysk) wrote on 2010-06-13:

#14

Please note that this bug eventually kills my VM. It loses network, logs NFS timeouts and won't let anyone log in to the console, nor do anything useful. A reboot of the guest fixes this, but since the error occurs after such a short time, this is not even a workaround.

Reversing the client/server roles, with the guest as the server and the host as the client, made the host lose its network connection. Attaching with a serial console worked for a short time, until I ran sync, intending to follow it with a reboot -f, but sync hung and I can't ctrl+z/bg the job. I tried to detach and reattach, but now I can't reach the box.

This should be given rather high priority. The operations performed (rsync over nfs) should be considered normal use, nothing fancy.

roy

Jeremy Foshee (jeremyfoshee) on 2010-06-21

tags:	added: kernel-needs-review

William King (quentusrex) wrote on 2010-06-25:

#15

I can confirm this same issue.

Sergey Svishchev (svs) wrote on 2010-08-10:

#16

Another variant of this bug occurs even with both patches applied -- see recent comments in debbug 592187. Reporter says:

"With 2.6.35-1~experimental.1 and virtio in the guest, 2.6.32-18 in the
host I cannot (yet) trigger the bug."

CvB (cvb-kruemel) wrote on 2010-09-26:

#17

I think I'm seeing this or a similar bug on my lucid VM (on a lucid host), too. (2.6.32-24-server Kernel)

Is there a workaround? Does it help, e.g., to switch from virtio to some other network device?

Rupert Hair (rupert-hair) wrote on 2010-09-27:

#18

Switching to the 'e1000' emulation seems to have worked for us, but it's far from a nice solution.

Rupert
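
The workaround above amounts to changing the NIC model that the guest sees. A minimal sketch, assuming the guest is managed by libvirt and its domain XML declares the interface with `<model type='virtio'/>`; the helper name `swap_nic_model` is made up for illustration, and nothing in this thread says how the model was actually switched:

```shell
# Rewrite the NIC model in libvirt domain XML read from stdin.
# Assumption: the interface is declared as <model type='virtio'/>
# (single- or double-quoted attribute); other lines pass through unchanged.
swap_nic_model() {
  sed -e "s|<model type='virtio'/>|<model type='e1000'/>|" \
      -e 's|<model type="virtio"/>|<model type="e1000"/>|'
}
```

Possible usage, with GUEST as a placeholder domain name: `virsh dumpxml GUEST | swap_nic_model > guest-e1000.xml`, review the file, then `virsh define guest-e1000.xml` and restart the guest. The emulated e1000 is slower than virtio, which matches the "far from a nice solution" remark.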

Peter Lieven (plieven) wrote on 2010-10-03:

#19

I can confirm this bug in Ubuntu Lucid LTS 10.04.1 64-bit Server.

This bug does not seem to exist in an older kernel from openSUSE 11.1, which I use under heavy network I/O load.
Version is: Linux 2.6.27.48-0.2-default

So the bug seems to have been added somewhere in between.

Joe Kislo (joe-k12s) wrote on 2010-10-23:

#20

We use vmware ESXi, and we were crippled by this bug (we had to rebuild several systems back to karmic because they were so unusable). Somewhere in the past month a kernel upgrade seems to have resolved this issue for us. We could reproduce this problem easily:

Remote System:
cat /dev/zero | nc -w 4 -l -p 5000

Vmware lucid system:
nc othersystem 5000 > /path/on/nfs

That could reproduce the issue in a few seconds. Now that runs for 30+mins w/o any issues, and our "real world" scenario use case appears fixed too.

Joe Kislo (joe-k12s) wrote on 2010-11-01:

#21

No, I am completely wrong. It still happens all the time. There was a period when it seemed stable. Here is the page allocation failure from my kernel log, FWIW, on Lucid linux-image-2.6.32-25-server 2.6.32-25.45:
[747393.713739] swapper: page allocation failure. order:0, mode:0x4020
[747393.713743] Pid: 0, comm: swapper Not tainted 2.6.32-25-server #45-Ubuntu
[747393.713745] Call Trace:
[747393.713746]  <IRQ>  [<ffffffff810f9a2e>] __alloc_pages_slowpath+0x56e/0x580
[747393.713756]  [<ffffffff810f9bb1>] __alloc_pages_nodemask+0x171/0x180
[747393.713760]  [<ffffffff8112cae7>] alloc_pages_current+0x87/0xd0
[747393.713763]  [<ffffffff81132b27>] new_slab+0x2f7/0x310
[747393.713766]  [<ffffffff811353d1>] __slab_alloc+0x201/0x2d0
[747393.713769]  [<ffffffff81468de6>] ? __netdev_alloc_skb+0x36/0x60
[747393.713772]  [<ffffffff811363af>] __kmalloc_node_track_caller+0xaf/0x160
[747393.713774]  [<ffffffff81468de6>] ? __netdev_alloc_skb+0x36/0x60
[747393.713776]  [<ffffffff81468aa0>] __alloc_skb+0x80/0x190
[747393.713778]  [<ffffffff81468de6>] __netdev_alloc_skb+0x36/0x60
[747393.713788]  [<ffffffffa00063c5>] e1000_alloc_rx_buffers+0x1c5/0x420 [e1000]
[747393.713792]  [<ffffffffa0004bae>] e1000_clean_rx_irq+0x3fe/0x530 [e1000]
[747393.713795]  [<ffffffff8106d2d8>] ? irq_exit+0x48/0x90
[747393.713799]  [<ffffffffa00032c1>] e1000_clean+0x51/0x230 [e1000]
[747393.713802]  [<ffffffff8147300f>] net_rx_action+0x10f/0x250
[747393.713806]  [<ffffffff81019103>] ? native_sched_clock+0x13/0x60
[747393.713808]  [<ffffffff8106d477>] __do_softirq+0xb7/0x1e0
[747393.713811]  [<ffffffff81030b22>] ? ack_apic_level+0x82/0x1f0
[747393.713813]  [<ffffffff810132ec>] call_softirq+0x1c/0x30
[747393.713815]  [<ffffffff81014cb5>] do_softirq+0x65/0xa0
[747393.713817]  [<ffffffff8106d315>] irq_exit+0x85/0x90
[747393.713820]  [<ffffffff8155f2a5>] do_IRQ+0x75/0xf0
[747393.713822]  [<ffffffff81012b13>] ret_from_intr+0x0/0x11
[747393.713823]  <EOI>  [<ffffffff81037adb>] ? native_safe_halt+0xb/0x10
[747393.713829]  [<ffffffff8155ce86>] ? notifier_call_chain+0x16/0x80
[747393.713831]  [<ffffffff8101a6ad>] ? default_idle+0x3d/0x90
[747393.713833]  [<ffffffff8101a763>] ? c1e_idle+0x63/0x120
[747393.713836]  [<ffffffff81010e63>] ? cpu_idle+0xb3/0x110
[747393.713839]  [<ffffffff81541c8b>] ? rest_init+0x6b/0x80
[747393.713843]  [<ffffffff8187edcc>] ? start_kernel+0x368/0x371
[747393.713845]  [<ffffffff8187e33a>] ? x86_64_start_reservations+0x125/0x129
[747393.713847]  [<ffffffff8187e438>] ? x86_64_start_kernel+0xfa/0x109
[747393.713848] Mem-Info:
[747393.713850] Node 0 DMA per-cpu:
[747393.713851] CPU    0: hi:    0, btch:   1 usd:   0
[747393.713853] CPU    1: hi:    0, btch:   1 usd:   0
[747393.713854] Node 0 DMA32 per-cpu:
[747393.713856] CPU    0: hi:  186, btch:  31 usd: 186
[747393.713857] CPU    1: hi:  186, btch:  31 usd:  65
[747393.713858] Node 0 Normal per-cpu:
[747393.713860] CPU    0: hi:  186, btch:  31 usd:  98
[747393.713861] CPU    1: hi:  186, btch:  31 usd: 156
[747393.713865] active_anon:466544 inactive_anon:140501 isolated_anon:0
[747393.713865]  active_file:61663 inactive_file:135115 isolated_file:0
[747393.713866]  unevictable:0 dirty:75800 writeback:35617 unstable:1118
[747393.713867]  free:4512 slab_reclaimable:17297 slab_unreclaimable:12299
[747393.713868]  mapped:5623 shmem:228 pagetables:2808 bounce:0
[747393.713869] Node 0 DMA free:13688kB min:32kB low:40kB high:48kB active_anon:120kB inactive_anon:360kB active_file:72kB inactive_file:1492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15300kB mlocked:0kB dirty:28kB writeback:0kB mapped:8kB shmem:0kB slab_reclaimable:56kB slab_unreclaimable:60kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[747393.713876] lowmem_reserve[]: 0 3000 3422 3422
[747393.713879] Node 0 DMA32 free:4052kB min:6548kB low:8184kB high:9820kB active_anon:1739864kB inactive_anon:435020kB active_file:213976kB inactive_file:494808kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:294352kB writeback:121532kB mapped:16836kB shmem:604kB slab_reclaimable:61964kB slab_unreclaimable:37148kB kernel_stack:4000kB pagetables:7496kB unstable:4472kB bounce:0kB writeback_tmp:0kB pages_scanned:256 all_unreclaimable? no
[747393.713886] lowmem_reserve[]: 0 0 422 422
[747393.713888] Node 0 Normal free:308kB min:920kB low:1148kB high:1380kB active_anon:126192kB inactive_anon:126624kB active_file:32604kB inactive_file:44160kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:432280kB mlocked:0kB dirty:8820kB writeback:20936kB mapped:5648kB shmem:308kB slab_reclaimable:7168kB slab_unreclaimable:11988kB kernel_stack:1624kB pagetables:3736kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:64 all_unreclaimable? no
[747393.713895] lowmem_reserve[]: 0 0 0 0
[747393.713897] Node 0 DMA: 2*4kB 4*8kB 3*16kB 7*32kB 7*64kB 3*128kB 3*256kB 3*512kB 2*1024kB 2*2048kB 1*4096kB = 13688kB
[747393.713903] Node 0 DMA32: 833*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 4052kB
[747393.713909] Node 0 Normal: 21*4kB 16*8kB 2*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 308kB
[747393.713914] 200977 total pagecache pages
[747393.713916] 3939 pages in swap cache
[747393.713917] Swap cache stats: add 1523025, delete 1519086, find 539681/705157
[747393.713918] Free swap  = 4159756kB
[747393.713919] Total swap = 4194296kB
[747393.723663] 895984 pages RAM
[747393.723663] 32065 pages reserved
[747393.723663] 167230 pages shared
[747393.723663] 705863 pages non-shared
[747393.723663] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[747393.723663]   cache: kmalloc-4096, object size: 4096, buffer size: 4096, default order: 3, min order: 0
[747393.723663]   node 0: slabs: 593, objs: 1363, free: 0
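
One way to read the Mem-Info dump above: the allocation is made from interrupt context (the trace runs under `<IRQ>`), where the kernel cannot sleep and wait for reclaim, and two of the three zones report free memory below their `min` watermark. The sketch below pulls such zones out of a saved kernel log. The helper name `zones_below_min` is hypothetical and not part of any tool mentioned in this thread, and it assumes the 2.6.32-style `Node N ZONE free:...kB min:...kB` line format shown above.

```shell
# Read kernel log text on stdin and print "ZONE free_kB min_kB" for every
# zone whose free memory is below its min watermark.
zones_below_min() {
  awk '
    / free:[0-9]+kB min:[0-9]+kB/ {
      free = ""; min = ""; zone = ""
      for (i = 2; i <= NF; i++) {
        # The zone name is the field right before "free:".
        if ($i ~ /^free:[0-9]+kB$/) { free = $i; zone = $(i - 1) }
        if ($i ~ /^min:[0-9]+kB$/)  { min = $i }
      }
      gsub(/[^0-9]/, "", free)   # keep only the digits
      gsub(/[^0-9]/, "", min)
      if (free + 0 < min + 0) print zone, free, min
    }'
}
```

Fed the log above (for example `dmesg | zones_below_min`), it would list DMA32 and Normal but not DMA, which still has 13688kB free against a 32kB minimum.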

CvB (cvb-kruemel) wrote on 2010-11-05:

#22

Something must have changed, though. While earlier, I saw this bug whenever I increased network load on the virtio device, with Linux server64 2.6.32-25-server #45-Ubuntu the situation has improved, i.e. the system has not crashed again so far, despite some load tests.

lhotari (lartsa) wrote on 2010-11-05:

#23

Another upstream bug report: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=592187

lhotari (lartsa) wrote on 2010-11-05:

#24

Changelog of Debian kernel 2.6.32-22 :
* net/{tcp,udp,llc,sctp,tipc,x25}: Add limit for socket backlog
(Closes: #592187)

lhotari (lartsa) wrote on 2010-11-05:

#25

Original patch: http://git.kernel.org/linus/8eae939f1400326b06d0c9afe53d2a484a326871

lhotari (lartsa) wrote on 2010-11-05:

#26

I hope this fix gets included in 10.04.1 LTS as soon as possible. We haven't been able to upgrade our Ubuntu VMs (running on Linux KVM) to 10.04 because of this bug. I think the priority should be much higher. Could someone assign this bug to someone in the Ubuntu Server team?

Sergey Svishchev (svs) wrote on 2010-12-02:

#27

To summarize:

* This is a "RX lockup" condition -- VM still runs, can send traffic (ARP requests, mostly), but cannot receive replies.

* On Lucid, two backports to 2.6.32 are needed, "virtio_net: Make delayed refill more reliable" and "Add limit for socket backlog". I didn't research which mainline versions include them. Debian bugs #576838 and #592187 discuss each of these backports.

* On Karmic, another backport is needed [1]; not to kernel, but to QEMU: ("Fix a race condition where qemu finds that there are not enough virtio ring buffers available and the guest make more buffers available before qemu can enable notifications")

[1] http://forum.proxmox.com/threads/3117-virtio-net-crashing-(stop-sending-traffic)?p=20247#post20247
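
The "RX lockup" symptom in the summary above (the guest still transmits but receives nothing) can be approximated from inside the guest by sampling the interface packet counters twice. This is only a heuristic sketch: the function names are invented here, the sysfs path is the standard Linux one but the interface name is whatever your guest uses, and a genuinely idle link can look the same as a stalled one.

```shell
# Print "stalled" when TX advanced but RX did not between two samples,
# otherwise "ok". Arguments: rx_before rx_after tx_before tx_after.
rx_stalled() {
  if [ "$2" -eq "$1" ] && [ "$4" -gt "$3" ]; then
    echo stalled
  else
    echo ok
  fi
}

# Sample the packet counters of one interface twice, $2 seconds apart
# (default 10), and classify the result with rx_stalled.
# Assumption: counters are exposed under /sys/class/net/<iface>/statistics.
check_iface() {
  iface=$1
  interval=${2:-10}
  stats=/sys/class/net/$iface/statistics
  rx1=$(cat "$stats/rx_packets"); tx1=$(cat "$stats/tx_packets")
  sleep "$interval"
  rx2=$(cat "$stats/rx_packets"); tx2=$(cat "$stats/tx_packets")
  rx_stalled "$rx1" "$rx2" "$tx1" "$tx2"
}
```

For example, `check_iface eth0 30` run from cron or a serial console session would report `stalled` if only ARP requests are leaving the VM and no replies arrive during the 30 seconds.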

Sergey Svishchev (svs) wrote on 2010-12-02:

#28

SuSE added another two patches to their 2.6.32 tree. Quoting first message:

"These are patches which we have found useful for our 2.6.32 based SLES 11 SP1 release.

The first patch ["Make delayed refill more reliable"] is already upstream, but should be included in stable.

The second patch ["If the add_buf operation fails, indicate failure to the caller"]
is a subset of another upstream patch. Again, stable material.

The third patch ["virtio_net: Add schedule check to napi_enable call"]
solves the last remaining issue we saw when testing kvm configurations with the SUSE
certification test suite. Under heavy load, we observed rx stalls (first two patches applied), and this
third patch was crafted to address the issue. Please apply to stable.
I assume this last problem also exists in more recent kernels than 2.6.32, but I haven't validated that."

http://article.gmane.org/gmane.comp.emulators.kvm.devel/53655
http://article.gmane.org/gmane.comp.emulators.kvm.devel/53653
http://article.gmane.org/gmane.comp.emulators.kvm.devel/53654

Peter Lieven (plieven) wrote on 2010-12-02:

#29

Sergey, thank you very much for the summary. From what I can see all patches made it to the stable kernel since 2.6.34, but not the third patch "virtio_net: Add schedule check to napi_enable call". Am I right with that? That might be a reason why I still see the issue with recent kernels (2.6.34+). Any clue why the third patch is not in the stable tree?

Sergey Svishchev (svs) wrote on 2010-12-03:

#30

It's not in any kernel tree, either. Probably needs someone to bring it to lkml@'s attention.

Peter Lieven (plieven) wrote on 2010-12-03:

#31

This is weird. I remember that when I last experimented with a vanilla 2.6.34 some time ago, it still crashed under heavy load. 2.6.34 should have all the patches mentioned applied except for the still-unpublished napi fix from SUSE.

I'm currently rebuilding my test setup (binary news spool server) and will see if it still crashes reliably - this usually happened
within a few hours.

Meanwhile, I'm building a new server kernel for Lucid from the Ubuntu kernel git.
It seems that this patch is already applied:
- Make delayed refill more reliable

I now manually applied:
- If the add_buf operation fails, indicate failure to the caller (in vanilla kernel since 2.6.34.2 / 2.6.35)
- virtio_net: Add schedule check to napi_enable call

I will test whether this kernel runs without crashing.

However, I was not able to apply "Add limit for socket backlog" (in the vanilla kernel since 2.6.34) and the related patches
for all protocols. The Ubuntu kernel code seems to differ too significantly. Is
a backport for Lucid scheduled?

Sergey Svishchev (svs) wrote on 2010-12-03:

#32

Right, Lucid kernel package 2.6.32-25.44 includes the "more reliable" fix (see bug 607824).

lhotari, do you use this kernel?

lhotari (lartsa) wrote on 2010-12-04:

#33

Sergey, I'm still running 9.04 (2.6.28-19.24) in production because of this virtio_net stability problem. I'd like to upgrade the VMs to 10.04 LTS after this problem is resolved. I haven't done retesting lately.

Peter Lieven (plieven) wrote on 2010-12-16:

#34

It took about 5 days this time to crash an unpatched Ubuntu LTS 10.04.1 64-bit server. I will now try my home-built kernel with the "virtio_net: Add schedule check to napi_enable call" patch included.
If this patch is the final solution can someone help with:
- getting this patch in the vanilla kernel
- making it available in the official ubuntu kernel

Peter Lieven (plieven) wrote on 2011-01-07:

#35

It seems that "virtio_net: Add schedule check to napi_enable call" is the final solution to the virtio_net crashes.
I have a newsserver (constantly 300-500mbit throughput) running a modified kernel with this patch for almost one month
now.

Who can help getting this patch into ubuntu-lucid official kernel and in the kernel sources?

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-03:

#36

This was fixed, but not in Ubuntu yet.

See:

"udp: use limited socket backlog"
http://kerneltrap.org/mailarchive/linux-netdev/2010/3/3/6271096

Bug#576838: virtio network crashes again
starting with comment 184
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=592187#184

LP#661212 "crash after kswapd page allocation failure"
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/661212

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-03:

#37

Oh, sorry, I hadn't seen this yet:
KVM: add schedule check to napi_enable call
http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660

If you want to get this into Lucid make a debdiff
https://wiki.ubuntu.com/PackagingGuide/Recipes/Debdiff

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-03:

#38

Thinking about this a bit more, the "udp: use limited socket backlog" is still necessary since the "page allocation failures" from nic drivers problem affects real hardware with e100, e1000 and e1000e nic drivers.

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-03:

#39

Debdiff containing two patches:

* [PATCH] KVM: add schedule check to napi_enable call
- http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660
* [PATCH 4/8] udp: use limited socket backlog
- http://kerneltrap.org/mailarchive/linux-netdev/2010/3/3/6271096

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-04:

#40

[PATCH] KVM: add schedule check to napi_enable call (2.7 KiB, text/plain)

Oh, I see what you mean by "can't integrate backlog for every protocol".

I rebuilt the debdiff with only the virtio driver patch and deleted the old debdiff with two patches.

Debdiff containing patch:

* [PATCH] KVM: add schedule check to napi_enable call
- http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-04:

#41

How to install the debdiff:
https://wiki.ubuntu.com/UbuntuPackagingGuide/BuildFromDebdiff

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-04:

#42

To get any further with this bug report read
https://wiki.ubuntu.com/StableReleaseUpdates
and try to do what it says.

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-05:

#43

I did the debdiff, the update to the bug report description for the SRU and subscribed https://launchpad.net/~ubuntu-sru

Now this needs sponsorship.
https://wiki.ubuntu.com/SponsorshipProcess

In the meantime I started this PPA with the patch
https://launchpad.net/~nutznboltz/+archive/lucid-virtio-napi

description:	updated

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-06:

#44

I'm actively testing two concurrent "scp -r" of > 200 GB from NFS directory to remote host which crashed the VM on the stock kernel.

$ uname -a
Linux dubnium 2.6.32-28-server #55ubuntu1~ppa3~lucid1-Ubuntu SMP Sun Feb 6 01:03:25 UTC 2011 x86_64 GNU/Linux

To test with the PPA run

sudo apt-get install python-software-properties
sudo apt-add-repository ppa:nutznboltz/lucid-virtio-napi
sudo apt-get update
sudo apt-get upgrade
sudo reboot

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-06:

#45

486.5 GB transmitted without locking up.

$ ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:16:36:1c:fe:1a
          inet addr:192.168.1.105 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::216:36ff:fe1c:fe1a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:81411197 errors:0 dropped:0 overruns:0 frame:0
          TX packets:333395491 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5621876935 (5.6 GB) TX bytes:486567007874 (486.5 GB)

Peter Lieven (plieven) wrote on 2011-02-07:

#46

Just for the record: I have a binary newsfeed test server with the napi patch that has been running stable for 52 days. It really seems that this
was the missing piece! More than 66 TB of data transferred.

root@ubuntu-newsfeed:~# uptime
13:00:49 up 52 days, 14:47, 2 users, load average: 2.32, 2.61, 2.69

root@ubuntu-newsfeed:~# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 52:54:00:fe:01:2c
          inet addr:x Bcast:x Mask:255.255.255.128
          inet6 addr: x/64 Scope:Global
          inet6 addr: fe80::5054:ff:fefe:12c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:45763921072 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2912546626 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:66249766548775 (66.2 TB) TX bytes:210223722986 (210.2 GB)

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-07:

#47

This patch was posted via E-mail on June 3, 2010
* [PATCH] KVM: add schedule check to napi_enable call
- http://kerneltrap.org/mailarchive/linux-netdev/2010/6/4/6278660
and it never made it into the upstream kernel.

Shortly after that these two were added to the upstream kernel:

http://kerneltrap.org/mailarchive/linux-netdev/2010/7/3/6280482
commit 1788f49548860fa1c861ee3454d47b466c877e43
Author: Michael S. Tsirkin
Date: Fri Jul 2 16:32:55 2010 +0000

virtio_net: do not reschedule rx refill forever

    We currently fill all of RX ring, then add_buf
    returns ENOSPC, which gets mis-detected as an out of
    memory condition and causes us to reschedule the work,
    and so on forever. Fix this by oom = err == -ENOMEM;

http://kerneltrap.org/mailarchive/git-commits-head/2010/7/8/42134
commit 58eba97d0774c69b1cf3e5a8ac74419409d1abbf
Author: Rusty Russell
Date: Fri Jul 2 16:34:01 2010 +0000

virtio_net: fix oom handling on tx

    virtio net will never try to overflow the TX ring, so the only reason
    add_buf may fail is out of memory. Thus, we can not stop the
    device until some request completes - there's no guarantee anything
    at all is outstanding.

Make the error message clearer as well: error here does not
indicate queue full.

Did you test with either or both of them?

I was informed that an SRU would not be done unless the patch was in the upstream kernel.

Peter Lieven (plieven) wrote on 2011-02-07:

#48

I tested with kernels that include both of these patches, but they still crashed.
I also think that both patches you mentioned have been backported to Ubuntu LTS.

How should we proceed? Contact the virtio developers and the developers from SUSE
to ask why this patch never went upstream?

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-07:

#49

I already sent an E-mail to Bruce Rogers of Novell asking about why his patch didn't get into the upstream kernel, see:
https://lists.ubuntu.com/archives/kernel-team/2011-February/014414.html

You are welcome to try reaching out to anyone who might have the answer as to why this patch never made it into upstream.

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-07:

#50

I have reason to believe that the absence of this patch in upstream kernels is a critical oversight.

I used "apt-add-repository ppa:kernel-ppa/ppa" to put the "Natty" kernel on my Lucid test VM

$ uname -a
Linux dubnium 2.6.38-2-server #29~lucid1-Ubuntu SMP Mon Feb 7 15:09:10 UTC 2011 x86_64 GNU/Linux

The stress test crashed the VM's network driver after copying only 63 GB.

The test consists of running "scp -r /nfs_read_only/1 remote:/dir/1" concurrently with "scp -r /nfs_read_only/2 remote:/dir/2"

The NFS mount options on the client are:
ro,tcp,hard,intr,sloppy,addr=10.1.1.1
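
The stress test described above is two long copies running at the same time. A minimal sketch of a wrapper for that pattern, assuming a POSIX shell; `run_pair` is a name invented here, and the scp paths in the usage note are the ones quoted above:

```shell
# Run two shell command lines concurrently and wait for both.
# Returns 0 only if both commands exit successfully.
run_pair() {
  sh -c "$1" &
  pid1=$!
  sh -c "$2" &
  pid2=$!
  rc=0
  wait "$pid1" || rc=1
  wait "$pid2" || rc=1
  return "$rc"
}
```

Usage would look like `run_pair "scp -r /nfs_read_only/1 remote:/dir/1" "scp -r /nfs_read_only/2 remote:/dir/2"`. A non-zero result only says that a copy failed; whether the cause is the virtio_net stall still has to be checked in the guest's kernel log.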

Peter Lieven (plieven) wrote on 2011-02-08:

#51

Can you patch the natty kernel with the napi patch to be absolutely sure?

You might also use netcat to transfer files between your boxes or use
iperf. This might reduce the time to crash.

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-08:

#52

You can test that too.

First install the tools
apt-get install dpkg-dev python-software-properties
After the tools are installed run
apt-add-repository ppa:kernel-ppa/ppa
and then edit:
/etc/apt/sources.list.d/kernel-ppa-ppa-lucid.list
Copy this line
deb http://ppa.launchpad.net/kernel-ppa/ppa/ubuntu lucid main
and change the copy of the line to
deb-src http://ppa.launchpad.net/kernel-ppa/ppa/ubuntu lucid main
Run
apt-get update
Then run
apt-get source linux-image-2.6.38-2-server
that will pull down about 96 MB of kernel source package and unpack it via dpkg-source -x out to 578 MB total.

Then apply the patch and run
debuild -i -uc -us -b
to build the unsigned binary deb packages.
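
The manual "copy this line and change the copy" step above can be scripted. A small sketch, assuming the list file contains ordinary one-line `deb ...` entries as quoted above; `add_deb_src` is a hypothetical helper name:

```shell
# Print every input line unchanged; after each line that starts with
# "deb ", also print a copy with "deb" replaced by "deb-src".
# Existing "deb-src" lines and comments are passed through untouched.
add_deb_src() {
  sed -n -e 'p' -e 's/^deb /deb-src /p'
}
```

For instance, `add_deb_src < /etc/apt/sources.list.d/kernel-ppa-ppa-lucid.list > /tmp/kernel-ppa.list`, then review the result and copy it back over the original as root before running apt-get update. Running it twice on the same file would duplicate the deb-src entries, so apply it once.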

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-08:

#53

Peter, does your news server use NFS?

Sergey Svishchev (svs) wrote on 2011-02-08:

#54

I've seen this happen on servers that run java webapps; it seems that high java heap usage (especially when heap size is close to physical memory size) helps trigger one of aforementioned bugs. Unfortunately, I don't have a simple test case.

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-08:

#55

Bruce Rogers of Novell replied to my E-mail saying that the patch should have been accepted upstream and it was an oversight.
https://lists.ubuntu.com/archives/kernel-team/2011-February/014428.html

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-08:

#56

virtio-net napi patch for 2.6.38 (4.2 KiB, text/plain)

The attached debdiff contains my changes, which apply my updated version of the patch from Bruce Rogers of Novell. I had to modify the patch a bit to make it work with 2.6.38, which is what Natty is based on.

I used the Ubuntu Kernel Team Daily Build PPA (which isn't really updated daily) as the starting point
https://launchpad.net/~kernel-ppa/+archive/ppa

I put the results in the PPA
https://launchpad.net/~nutznboltz/+archive/natty-virtio-napi

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-09:

#57

The patched 2.6.38 kernel is running and has not crashed while copying data overnight.

$ uname -a
Linux dubnium 2.6.38-2-server #29~lucid3-Ubuntu SMP Tue Feb 8 21:49:57 UTC 2011 x86_64 GNU/Linux

$ date;ps -eO lstart | grep "scp -r" | egrep -v 'grep|ssh'
Wed Feb 9 05:31:07 EST 2011
1035 Tue Feb 8 22:42:38 2011 R pts/2 00:08:21 scp -r /vol/ndnp/ndnp_staging/batches/kyu oxygen:/storage/scratch/virtio-net-test/2
1041 Tue Feb 8 22:42:39 2011 R pts/1 00:08:25 scp -r /vol/ndnp/ndnp_staging/batches/dlc oxygen:/storage/scratch/virtio-net-test/1

$ du -hs /storage/scratch/virtio-net-test/
328G /storage/scratch/virtio-net-test/

Revision history for this message

Peter Lieven (plieven) wrote on 2011-02-09:

#58

I have not tested with NFS, but my newsserver test was also reliably crashing without the NAPI patch.

I have seen Bruce's response. Will he take care of this patch going upstream?

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-09:

#59

If there is a problem fill out complaint form and place it in an envelope addressed to...
http://www.youtube.com/watch?v=gEyFH-a-XoQ#t=1m

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-09:

#60

At this point a representative from the Ubuntu kernel team has thanked me for my work in driving this; however, there is no evidence yet that the patch has made it into the upstream kernel.
https://lists.ubuntu.com/archives/kernel-team/2011-February/014433.html

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-09:

#61

https://lists.linux-foundation.org/pipermail/virtualization/2011-February/016320.html

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-10:

#62

http://goo.gl/FQqS0
https://lists.linux-foundation.org/pipermail/virtualization/2011-February/016321.html
https://lists.linux-foundation.org/pipermail/virtualization/2011-February/016322.html
https://lists.linux-foundation.org/pipermail/virtualization/2011-February/016323.html

But still not in 2.6.38-rc4 yet.

Revision history for this message

Stefan Bader (smb) wrote on 2011-02-10:

#63

Having Rusty pick it up should bring it (usually first to linux-next) to Linus' tree. As soon as it hits there we can go on with adding it to 10.04 and 10.10. Sorry about the procedure being somewhat tedious, but this makes sure that relevant maintainers have looked at the change and it is being integrated for any newer release. As we see from the fact that this got forgotten so long, it is just too easy to drop things without being anal to a certain degree.

I will try to monitor the tree myself but feel free to nudge us with a reminder to the kernel team mailing list in case this slips by unnoticed (the only good thing about it being marked as stable material is that over time it would come back from the 2.6.32.y longterm tree, but it is understandably something that should better be in sooner than later).

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-10:

#64

I started following @Linux_Kernel
http://twitter.com/Linux_Kernel

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-10:

#65

http://www.spinics.net/lists/linux-virtualization/msg12364.html
http://www.spinics.net/lists/linux-virtualization/msg12365.html

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-10:

#66

$ git show-branch
[master] Merge branch 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6

$ git log drivers/net/virtio_net.c
commit 3e9d08ec0a68f6faf718d5a7e050fe5ca0ba004f
Author: Bruce Rogers <email address hidden>
Date: Thu Feb 10 11:03:31 2011 -0800

    virtio_net: Add schedule check to napi_enable call

    Under harsh testing conditions, including low memory, the guest would
    stop receiving packets. With this patch applied we no longer see any
    problems in the driver while performing these tests for extended periods
    of time.

    Make sure napi is scheduled subsequent to each napi_enable.

    Signed-off-by: Bruce Rogers <email address hidden>
    Signed-off-by: Olaf Kirch <email address hidden>
    Cc: <email address hidden>
    Signed-off-by: Rusty Russell <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

Stefan Bader (smb) on 2011-02-11

Changed in linux (Ubuntu Lucid):
assignee:	nobody → Stefan Bader (stefan-bader-canonical)
importance:	Undecided → Medium
status:	New → In Progress
Changed in linux (Ubuntu Maverick):
assignee:	nobody → Stefan Bader (stefan-bader-canonical)
status:	New → In Progress
importance:	Undecided → Medium

Stefan Bader (smb) on 2011-02-11

Changed in linux (Ubuntu Lucid):
status:	In Progress → Fix Committed
Changed in linux (Ubuntu Maverick):
status:	In Progress → Fix Committed

Revision history for this message

Andy Whitcroft (apw) wrote on 2011-02-11:

#67

This is now officially in linus' tree but not yet tagged. Will be in the next Natty upload.

Changed in linux (Ubuntu):
assignee:	nobody → Andy Whitcroft (apw)

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-15:

#68

Patch is now in pre-proposed.
https://launchpad.net/~kernel-ppa/+archive/pre-proposed

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-16:

#69

Tested Ok on my VM:
$ uname -a
Linux dubnium 2.6.32-29-server #58pre201102150902-Ubuntu SMP Tue Feb 15 10:16:07 UTC 2011 x86_64 GNU/Linux

Revision history for this message

Andy Whitcroft (apw) wrote on 2011-02-16:

#70

This is now Fix Committed for Natty as we have just rebased to v2.6.38-rc5 mainline which contains this fix.

Changed in linux (Ubuntu):
status:	Triaged → Fix Committed

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-16:

#71

In proposed now:
http://www.ubuntuupdates.org/packages/show/199704

Please test if you can:
https://wiki.ubuntu.com/Testing/EnableProposed

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-16:

#72

Actually, it's not in proposed yet.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-02-16:

#73

This bug was fixed in the package linux - 2.6.38-4.31

---------------
linux (2.6.38-4.31) natty; urgency=low

[ Andy Whitcroft ]

  * add in bugs closed by upstream patches pulled in by rebases
  * rebase to 795abaf1e4e188c4171e3cd3dbb11a9fcacaf505
  * [Config] enable CONFIG_VSX to allow use of vector instuctions
  * resync with maverick 98defa1c5773a3d7e4c524967eb01d5bae035816
  * rebase to mainline v2.6.38-rc5
  * SAUCE: ecryptfs: read on a directory should return EISDIR if not
    supported
    - LP: #719691

[ Colin Ian King ]

  * SAUCE: Dell All-In-One: Remove need for Dell module alias

[ Manoj Iyer ]

  * SAUCE: (drop after 2.6.38) add ricoh 0xe823 pci id.
    - LP: #717435

[ Tim Gardner ]

  * [Config] CONFIG_CRYPTO_CRC32C_INTEL=y

[ Upstream Kernel Changes ]

  * Quirk to fix suspend/resume on Lenovo Edge 11,13,14,15
    - LP: #702434
  * vfs: fix BUG_ON() in fs/namei.c:1461

[ Vladislav P ]

  * SAUCE: Release BTM while sleeping to avoid deadlock.
    - LP: #713837

[ Major Kernel Changes ]

  * rebase from v2.6.38-rc4 to v2.6.38-rc5
    - LP: #579276
    - LP: #715877
    - LP: #713769
  * resync with Maverick Ubuntu-2.6.35-27.47
-- Andy Whitcroft <email address hidden> Fri, 11 Feb 2011 17:24:09 +0000

Changed in linux (Ubuntu):
status:	Fix Committed → Fix Released

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-23:

#74

This patch is in 2.6.37.1 as of Thu, 10 Feb 2011 19:03:31 +0000

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-02-24:

#75

Yesterday a new ticket and branch stable-v2.6.32.29 was created with this patch

https://bugs.launchpad.net/ubuntu/lucid/+source/linux/+bug/723819

http://kernel.ubuntu.com/git?p=rtg/ubuntu-lucid.git;a=blob;f=drivers/net/virtio_net.c;h=fb09effbfb63f5e080a87bfc80a823f83c363810;hb=refs/heads/stable-v2.6.32.29

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-01:

#76

The 2.6.32-29.58 kernel update that recently was pushed out lacks the virtio-net napi patch.
I updated my PPA against the updated kernel.
https://launchpad.net/~nutznboltz/+archive/lucid-virtio-napi

The updated PPA is still compiling; ETA Mar 2, 2011 02:40:00 UTC

To test with my PPA run:
sudo apt-get install python-software-properties
sudo apt-add-repository ppa:nutznboltz/lucid-virtio-napi
sudo apt-get update
sudo apt-get upgrade
sudo reboot

Revision history for this message

Martin Pitt (pitti) wrote on 2011-03-02: Please test proposed package

#77

Accepted linux into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Revision history for this message

Martin Pitt (pitti) wrote on 2011-03-02:

#78

Accepted linux into maverick-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-03:

#79

I had my first Karmic KVM guest encounter this issue today. I'm going to add support for Karmic to my PPA.

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-03:

#80

Patch is now in Lucid
https://launchpad.net/ubuntu/lucid/+source/linux/2.6.32-30.59

Revision history for this message

Martin Pitt (pitti) wrote on 2011-03-03:

#81

Accepted linux-ec2 into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Brad Figg (brad-figg) on 2011-03-03

tags:

added: verification-needed-lucid verification-needed-maverick

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-03:

#82

debdiff of virtio-net napi patch for Karmic (2.5 KiB, text/plain)

Revision history for this message

Steve Conklin (sconklin) wrote on 2011-03-03:

#83

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-<release>' to 'verification-done-<release>'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-03:

#84

Will the patch be added to 2.6.31?

Meanwhile
$ uname -a
Linux dubnium 2.6.32-30-server #59-Ubuntu SMP Tue Mar 1 22:46:09 UTC 2011 x86_64 GNU/Linux

xxxx 7765 7764 1 17:14 pts/3 00:01:33 scp -r /vol/ndnp/ndnp_staging/batches/kyu oxygen:/storage/scratch/virtio-net-test/2
xxxxx 7766 7765 22 17:14 pts/3 00:18:46 /usr/bin/ssh -x -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes oxygen scp -r -t /storage/scratch/virtio-net-test/2
xxx 7771 7770 2 17:14 pts/2 00:01:46 scp -r /vol/ndnp/ndnp_staging/batches/dlc oxygen:/storage/scratch/virtio-net-test/1
xxxxxx 7772 7771 26 17:14 pts/2 00:22:09 /usr/bin/ssh -x -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes oxygen scp -r -t /storage/scratch/virtio-net-test/1

55 GB of data copied via "two concurrent scp" test described in previous messages, still going.

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-04:

#85

110 GB copied in three hours with no problems.

tags:

added: verification-done-lucid
removed: verification-needed-lucid

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-04:

#86

I don't use Maverick in my environment and I still use Karmic. I don't know who you are going to get to do the Maverick testing but it ain't me.

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-04:

#87

16 people (17 including me) checked the "affects me too" button on this bug report. Will any of them test proposed on Maverick?

The instructions are right here:
https://wiki.ubuntu.com/Testing/EnableProposed

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-04:

#88

In bash

sudo -i
cat >> /etc/apt/preferences << EOF
Package: *
Pin: release a=maverick-security
Pin-Priority: 990

Package: *
Pin: release a=maverick-updates
Pin-Priority: 900

Package: *
Pin: release a=maverick-proposed
Pin-Priority: 400
EOF

echo "deb http://archive.ubuntu.com/ubuntu/ maverick-proposed restricted main multiverse universe" > /etc/apt/sources.list.d/proposed.list

apt-get update
apt-get install linux-image-2.6.35-28-server
# or apt-get install linux-image-2.6.35-28-generic if you are testing a Desktop not a server
reboot

Is that really so hard?

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-04:

#89

Well, I rebuilt Dubnium as Maverick and ran the test:

$ w
15:12:16 up 1:25, 3 users, load average: 1.94, 1.89, 1.90
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
nutz pts/3 140.147.245.89:S 13:51 1:20m 17:04 15:53 /usr/bin/ssh -x -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes -- oxyg
nutz pts/4 140.147.245.89:S 13:51 1:20m 14:47 13:43 /usr/bin/ssh -x -oForwardAgent no -oPermitLocalCommand no -oClearAllForwardings yes -- oxyg
nutz pts/5 140.147.245.89:S 14:48 0.00s 0.59s 0.00s w

$ uname -a
Linux dubnium 2.6.35-28-server #49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011 x86_64 GNU/Linux

$ df -hP /storage/scratch/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/storage-scratch 500G 115G 386G 23% /storage/scratch

tags:

added: verification-done-maverick
removed: verification-needed-maverick

Revision history for this message

Divinsa Development (dev-divinsa) wrote on 2011-03-07:

#90

Would love to see a fix for this as well - running over 10 10.04 instances on ec2 and having crashes + hangs often with this bug.

Finding that MTU increase to 9000 speeds up time to failure.

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-08:

#91

@Divinsa did you test proposed?
https://wiki.ubuntu.com/Testing/EnableProposed

Revision history for this message

AvaCam (cameron-pierce) wrote on 2011-03-08:

#92

I've installed the proposed kernel on both a 10.10 and a 10.04 VM. I've just set their MTUs to 9000. What would be a good way to stress test them?

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-08:

#93

@AvaCam use enough RAM until you start getting "page allocation failure." messages in the system logs.

What is happening is that the network driver needs a free page of RAM. It cannot wait around for a page to become free, but it can try again later. So if there are no free pages, the network driver aborts with a lengthy series of system log messages including "page allocation failure" and then expects to try again later. This works great for, say, e1000, but virtio-net sometimes never gets the retry right and hangs instead.
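
To tell whether a test run has reached the low-memory condition described above, it helps to count the failure messages in the kernel log. A minimal sketch: count_alloc_failures is a hypothetical helper, and /var/log/kern.log is an assumption about where the kernel messages end up on the guest.

```shell
#!/bin/sh
# count_alloc_failures LOGFILE
# Print the number of lines in LOGFILE that mention a page allocation failure.
count_alloc_failures() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -c 'page allocation failure' "$1" || true
}

# Example (log path is an assumption, adjust for your system):
# count_alloc_failures /var/log/kern.log
```

A non-zero count with the network still responding suggests the driver recovered from the failed allocations; a non-zero count followed by a dead interface matches the hang described in this report.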

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-08:

#94

@AvaCam Note also that the file system buffer cache uses up RAM so activity that reads in many files while performing network I/O triggers this bug. That is why my test case of running two concurrent recursive "scp -r ..." jobs causes an unpatched virtio-net driver to lock up.
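
One way to watch the page cache eat into free RAM during such a test is to read the MemFree and Cached fields from /proc/meminfo. This is a sketch under that assumption; meminfo_kb and report_memory are hypothetical helper names, and the optional file argument exists only so the functions can be tried against a saved copy.

```shell
#!/bin/sh
# meminfo_kb FIELD [FILE]
# Print the value (in kB) of one field; FILE defaults to /proc/meminfo.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# report_memory [FILE]
# Print free memory and page-cache size on one line.
report_memory() {
    echo "MemFree=$(meminfo_kb MemFree "$1") kB Cached=$(meminfo_kb Cached "$1") kB"
}

# Example: sample every 5 seconds while the copy jobs run.
# while true; do report_memory; sleep 5; done
```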

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2011-03-09:

#95

Lucid proposed kernel with virtio-net napi patch passed all of the QA Team's regression testing
https://wiki.ubuntu.com/QATeam/KernelSRU-lucid-2.6.32-30.59

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-03-17:

#96

This bug was fixed in the package linux - 2.6.32-30.59

---------------
linux (2.6.32-30.59) lucid-proposed; urgency=low

[ Steve Conklin ]

  * Release Tracking Bug
    - LP: #727336

[ Tim Gardner ]

  * [Config] CONFIG_IRQ_TIME_ACCOUNTING=n
    - LP: #723819

[ Upstream Kernel Changes ]

  * virtio_net: Add schedule check to napi_enable call
    - LP: #579276
  * NFS: fix the return value of nfs_file_fsync()
    - LP: #585657
  * block: check for proper length of iov entries earlier in
    blk_rq_map_user_iov(), CVE-2010-4163
    - LP: #721504
    - CVE-2010-4163
  * filter: make sure filters dont read uninitialized memory
    - LP: #721282
    - CVE-2010-4158
  * tty: Make tiocgicount a handler, CVE-2010-4076, CVE-2010-4077
    - LP: #720189
    - CVE-2010-4077
  * staging: usbip: remove double giveback of URB
    - LP: #723819
  * USB: EHCI: ASPM quirk of ISOC on AMD SB800
    - LP: #723819
  * rt2x00: add device id for windy31 usb device
    - LP: #723819
  * ALSA: snd-usb-us122l: Fix missing NULL checks
    - LP: #723819
  * hwmon: (via686a) Initialize fan_div values
    - LP: #723819
  * USB: serial: handle Data Carrier Detect changes
    - LP: #723819
  * USB: CP210x Add two device IDs
    - LP: #723819
  * USB: CP210x Removed incorrect device ID
    - LP: #723819
  * USB: usb-storage: unusual_devs update for Cypress ATACB
    - LP: #723819
  * USB: usb-storage: unusual_devs update for TrekStor DataStation maxi g.u
    external hard drive enclosure
    - LP: #723819
  * USB: usb-storage: unusual_devs entry for CamSport Evo
    - LP: #723819
  * USB: usb-storage: unusual_devs entry for Coby MP3 player
    - LP: #723819
  * USB: serial: Updated support for ICOM devices
    - LP: #723819
  * USB: adding USB support for Cinterion's HC2x, EU3 and PH8 products
    - LP: #723819
  * USB: EHCI: ASPM quirk of ISOC on AMD Hudson
    - LP: #723819
  * USB: EHCI: fix DMA deallocation bug
    - LP: #723819
  * USB: g_printer: fix bug in module parameter definitions
    - LP: #723819
  * USB: io_edgeport: fix the reported firmware major and minor
    - LP: #723819
  * USB: ti_usb: fix module removal
    - LP: #723819
  * USB: Storage: Add unusual_devs entry for VTech Kidizoom
    - LP: #723819
  * USB: ftdi_sio: add ST Micro Connect Lite uart support
    - LP: #723819
  * USB: cdc-acm: Adding second ACM channel support for Nokia N8
    - LP: #723819
  * USB: ftdi_sio: Add VID=0x0647, PID=0x0100 for Acton Research
    spectrograph
    - LP: #723819
  * USB: prevent buggy hubs from crashing the USB stack
    - LP: #723819
  * staging: comedi: add support for newer jr3 1-channel pci board
    - LP: #723819
  * staging: comedi: ni_labpc: Use shared IRQ for PCMCIA card
    - LP: #723819
  * Staging: hv: fix sysfs symlink on hv block device
    - LP: #723819
  * staging: hv: Enable sending GARP packet after live migration
    - LP: #723819
  * hvc_iucv: allocate memory buffers for IUCV in zone DMA
    - LP: #723819
  * iwlagn: enable only rfkill interrupt when device is down
    - LP: #723819
  * ath9k: Fix bug in delimiter padding computation
    - LP: #723819
  * correct vdso version string
    - LP: #723819
  * fix medium error problems with so...

Ubuntu
linux package

Lost network in KVM VM / virtio_net page allocation failure

	Status	Importance	Assigned to
linux (Ubuntu)	Fix Released	Medium	Andy Whitcroft
Lucid	Fix Released	Medium	Stefan Bader
Maverick	Fix Released	Medium	Stefan Bader
