KVM/QEMU guest bridged network loss on kernels 3.8.0-27 and 3.8.0-29

Bug #1215051 reported by Ron
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

We have experienced problems on Raring with bridged KVM/QEMU guests losing their network connection after upgrading any host server to kernel 3.8.0-27 or 3.8.0-29. This has occurred on four separate host machines.

This tends to happen sooner and more often on guests that generate more traffic than others. The only way to recover networking is to shut down or force-stop the guest and restart it. We attempted a live migration of a guest in this condition, which resulted in a kernel panic on the source host server.

On the host, all diagnostics, including the output of ifconfig, "virsh domiflist", and "brctl show", look identical to those for working guests. On the guest, the interface appears up, there are no syslog errors, and the guest can ping its own address, but all outgoing communication fails.

This looks just like Ubuntu Precise bug #997978, which affected us last year as well.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1215051

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Ron (ron-neversleep) wrote :

Because this bug causes a silent loss of networking, there is no crash data and there are no dump files.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Does the bug go away if you boot back into 3.8.0-26?

tags: added: performing-bisect raring
Changed in linux (Ubuntu):
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and post back? We are looking for the first kernel version that exhibits this bug:

v3.8.13.2: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8.13.2-raring/
v3.8.13.3: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8.13.3-raring/
v3.8.13.4: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8.13.4-raring/
v3.8.13.5: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8.13.5-raring/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!
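The search for the first affected kernel can be sketched as a binary search over the ordered list of test kernels. This is only an illustration, not the Kernel Team's tooling: the function name first_bad_kernel and the is_bad callback are hypothetical, and the sketch assumes that once the bug appears in a version it remains in every later one.

```python
from typing import Callable, Optional, Sequence

# Test kernels named in the comment above, ordered oldest to newest.
CANDIDATES = ["v3.8.13.2", "v3.8.13.3", "v3.8.13.4", "v3.8.13.5"]


def first_bad_kernel(
    versions: Sequence[str],
    is_bad: Callable[[str], bool],
) -> Optional[str]:
    """Return the earliest version for which is_bad() is true, or None.

    Hypothetical helper. Assumes `versions` is ordered oldest to newest
    and that the bug, once introduced, is present in all later versions.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            # The bug is already present here, so the first bad
            # version is this one or an earlier one.
            hi = mid
        else:
            # This version is good, so the first bad one is later.
            lo = mid + 1
    return versions[lo] if lo < len(versions) else None
```

In practice is_bad would be a manual step: boot the host into the given kernel, run the guests under load, and record whether networking was lost. With only four candidates, testing them in order as requested above is just as practical; the halving approach pays off when the list of candidate kernels or commits is long.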

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Also, the 3.8.0-30.43 kernel is now available in the -proposed repository. Would it be possible for you to test this latest kernel and post back if it resolves this bug?

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.

Thank you in advance!

Ron (ron-neversleep) wrote :

I'm absolutely positive this problem does not exist in 3.8.0-26; we have six heavy-use KVM/QEMU host servers running it.

The bridged network loss definitely occurs on any of these servers once it is upgraded to 3.8.0-27 or 3.8.0-29, and it happens quickly, within about 1-2 hours.

I can give 3.8.0-30 a try tomorrow and will post back.

Ron (ron-neversleep) wrote :

We are now at 24 hours of uptime with only one guest. One caveat: Friday was a light-use day, and the office was very quiet.

By mid-Monday, I will run some network load tests and load up a dozen or so VMs. So far 3.8.0-30 seems good.

Ron (ron-neversleep) wrote :

I will be upgrading our development cluster of 3 hosts to 3.8.0-30 (proposed) this weekend to further validate this new kernel as a solution.

Ron (ron-neversleep) wrote :

Our entire KVM cluster, including our high network I/O guests, has now been running on 3.8.0-30 for 12 hours with no outage.

I will post success/failure once more this week, for final load verification.

Ron (ron-neversleep) wrote :

We're looking very solid on kernel 3.8.0-30, under heavy load. I count this as resolved.

Thanks again, all.

Changed in linux (Ubuntu):
status: Incomplete → Fix Released