Bridged Guests losing network connectivity under non-Ubuntu Xen after upgrade from 8.10 to 10.04

Bug #728519 reported by Jesse Newland
This bug affects 5 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Medium
Assigned to: Chuck Short

Bug Description

Opening at the request of Serge Hallyn.

We're seeing interfaces drop after relatively short periods of uptime (days) on bridged guests that have been upgraded from 8.10 to 10.04 on the same Xen host (CentOS 5.5).

Our guests each have two interfaces connected to the same bridge on the Xen host. In every case (we've experienced this a dozen-plus times), only *one* of these interfaces has dropped, usually during periods of high load. We're able to SSH into the guest on the unaffected interface, and cycling the affected interface with ifdown/ifup does not resolve the issue. A reboot is required to bring the interface back up.

Another interesting thing to note is that we have *not* experienced this behavior on any newly installed 10.04 guests - only on guests that were upgraded from 8.10 to 10.04. In more than one case, reinstalling 10.04 on a guest that was hitting this problem more than once a week has prevented it from recurring ever since (months). The problem not recurring doesn't necessarily mean the upgrade caused it, since we can't trigger the problem manually, but it's worth mentioning.

The linked gists below contain the output of the following commands:

    brctl show
    ifconfig -a
    iptables -vnL
    iptables -vnL -t nat

Output on affected guest: https://gist.github.com/26ce1eea5a1db0ae26a6#file_output_on_affected_guest
Output on affected host: https://gist.github.com/26ce1eea5a1db0ae26a6#file_output_on_host

We don't use qemu. Here's the xm list --long output for this guest:

https://gist.github.com/26ce1eea5a1db0ae26a6#file_xen_list_for_guest

Guest network info: https://gist.github.com/26ce1eea5a1db0ae26a6#file_guest_network_info

One additional thing we've noticed is that upgraded guests have the xen_netfront and xen_blkfront kernel modules loaded, while newly provisioned ones do not.

Please let me know if any additional info would be helpful! I'd be happy to help debug in any way.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for filing this bug.

Have you tried doing 'rmmod xen_netfront xen_blkfront' on an upgraded guest?

I'm not familiar enough with Xen to know which interfaces do what, but several of the interfaces on the host do have the same hwaddr. Do you know whether that can be related? In particular, peth1 and veth0.1 are both on xenbr1 and have the same hwaddr. And xenbr0 and xenbr1 have the same mac address presumably because they assumed those of peth0 and peth1, which are also the same.
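As a side note, duplicate hwaddrs like that can be enumerated mechanically. A minimal sketch (Linux-only, reads sysfs; not part of the original report):

```shell
# Group interfaces by MAC address to spot duplicates such as peth1/veth0.1.
# Reads /sys/class/net; skips entries without an address file (e.g. bonding_masters).
for dev in /sys/class/net/*; do
    [ -f "$dev/address" ] || continue
    printf '%s %s\n' "$(cat "$dev/address")" "${dev##*/}"
done | sort | awk '{grp[$1] = grp[$1] " " $2} END {for (m in grp) print m ":" grp[m]}'
```

A MAC followed by more than one interface name is shared; that's expected for a bridge and its enslaved physical port, but worth ruling out elsewhere.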

Changed in linux (Ubuntu):
importance: Undecided → Low
importance: Low → Medium
status: New → Incomplete
Revision history for this message
Jesse Newland (jnewland) wrote :

We just had a guest lose networking. 'rmmod xen_netfront' caused the other interface to go down, and I surmise removing the other module would fail because of the attached devices:

    # lsmod | grep xen
    xen_netfront 14919 0
    xen_blkfront 8991 2

It turns out this guest didn't have linux-image-2.6.35-23-virtual_2.6.35-23.41 installed, but rather the default kernel left over from the upgrade. Installing the -virtual kernel removes those modules. We'll see if it resolves the issue over time!
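As an aside, the use counts in lsmod output like the above can be checked mechanically before attempting rmmod; a small sketch (the sample data is copied from this comment, not live output):

```shell
# Flag modules whose use count (third lsmod column) is non-zero --
# rmmod on those will fail while devices are still attached.
lsmod_sample='xen_netfront 14919 0
xen_blkfront 8991 2'
printf '%s\n' "$lsmod_sample" | awk '$3 > 0 {print $1 " has " $3 " users; rmmod would fail"}'
# -> xen_blkfront has 2 users; rmmod would fail
```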

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks, Jesse. Could you list precisely which kernel was actually running?

Revision history for this message
Jesse Newland (jnewland) wrote :

2.6.32-0206322611-generic was being used on this server.

Unfortunately, we just had another virtual server that *is* running 2.6.35-23-virtual and does not have the xen_* modules lose network connectivity on one interface.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

So if I understand correctly, this does not in fact appear to be a bug in the kernel?

Sorry if I've asked this before, but is there definitely nothing in the guest's syslog right before the connection is dropped?

I'm afraid I don't know enough about Xen setups to be able to guess at what might be different when you install (vs. upgrade) a guest.

Revision history for this message
Jesse Newland (jnewland) wrote :

I'm not sure where the bug lies at this point, to be honest.

There is nothing in syslog before or after the drop.

There is nothing different on the host between upgraded and installed guests at all, and it appears this problem occurs on KVM as well, so Xen is not the issue.

The problem not yet occurring on a newly installed guest doesn't necessarily mean the upgrade caused it, since we can't trigger the problem manually. The 2.6.35-23-virtual kernel seems to drop network connections *much less often* than the others, but doesn't fix the problem completely. This is likely why the newly installed guests haven't hit it yet - they all get that kernel.

Is there any information that I can provide that would help narrow down this issue?
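One cheap way to capture more data, since syslog is silent, would be to poll the interface's operstate from sysfs and timestamp each reading. A sketch, with the interface name and iteration count as placeholders (a real deployment would loop indefinitely or run from cron):

```shell
# Timestamp an interface's sysfs operstate so a drop can be correlated
# with link state even when syslog shows nothing.
IFACE=lo    # placeholder -- substitute the guest interface, e.g. eth0
for i in 1 2 3; do
    state=$(cat "/sys/class/net/$IFACE/operstate" 2>/dev/null || echo missing)
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $IFACE $state"
    sleep 1
done
```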

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Temporarily assigning to Chuck to see if he has any ideas pertaining to Xen. Chuck, any insight you have would be greatly appreciated.

Changed in linux (Ubuntu):
assignee: nobody → Chuck Short (zulcss)
Revision history for this message
Alexander Olsson (noseglid) wrote :

I am having the exact same issue as described above, but using the following:

* Ubuntu 10.10 (upgraded from 10.04, upgraded from 9.10, upgraded from 9.04 which was a fresh install).
* KVM (QEMU PC emulator version 0.12.5 (qemu-kvm-0.12.5), Copyright (c) 2003-2008 Fabrice Bellard)
* Linux alexo 2.6.38 #2 SMP Mon Apr 11 13:57:58 CEST 2011 x86_64 GNU/Linux

I am using virt-manager to manage the virtual machines.

Network settings:

~ % ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1b:21:3a:ed:30 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::21b:21ff:fe3a:ed30/64 scope link
       valid_lft forever preferred_lft forever
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1c:c0:ab:d4:2c brd ff:ff:ff:ff:ff:ff
    inet 10.85.25.1/16 brd 10.85.255.255 scope global eth0
    inet 10.85.25.2/16 scope global secondary eth0
    inet 10.85.25.3/16 scope global secondary eth0
    inet 10.85.25.4/16 scope global secondary eth0
    inet6 fe80::21c:c0ff:feab:d42c/64 scope link
       valid_lft forever preferred_lft forever
23: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 500
    link/ether fe:54:00:e6:85:4d brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fee6:854d/64 scope link
       valid_lft forever preferred_lft forever
25: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:1b:21:3a:ed:30 brd ff:ff:ff:ff:ff:ff
    inet 10.85.25.5/16 brd 10.85.255.255 scope global virbr0
    inet6 fe80::21b:21ff:fe3a:ed30/64 scope link
       valid_lft forever preferred_lft forever
26: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 500
    link/ether fe:54:00:1c:37:4f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe1c:374f/64 scope link
       valid_lft forever preferred_lft forever

~ % brctl show
bridge name bridge id STP enabled interfaces
virbr0 8000.001b213aed30 yes eth1
                                                        vnet0
                                                        vnet1

Only one of the vnet devices (the tap interfaces to the virtual machines) works at a time. At the moment of writing it's vnet0, but which one works appears to be random.

I have tried all the setups I can think of:
* The setup above.
* Turning eth1 off completely, bridging eth0 instead, and using it from both host and guest.
* Using eth0 for the bridge and eth1 from the host.

Consequently, only one of my virtual machines has working networking.

Some googling has suggested this only appears on 64-bit systems. Is that true for everyone here?
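For anyone checking the 64-bit question, the architecture can be read directly (a trivial sketch; on Ubuntu, 'dpkg --print-architecture' would be an alternative):

```shell
# Report the kernel's machine architecture; x86_64 indicates a 64-bit
# kernel, i686/i586 a 32-bit one.
uname -m
```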

Any other information I can provide to help shed some light on this? I am completely stumped.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks @Alexander.

When vnet0 is the working device, can vnet1 at least talk to the host, or not even that?

It looks like you're using a custom kernel. Is there any chance of trying with a stock kernel?

Actually I'm pretty sure your issue would be different from Jesse's. Could you please file a separate bug and provide the info there?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Jesse,

are you still seeing this issue?

Revision history for this message
Jesse Newland (jnewland) wrote :

Yes.

Revision history for this message
Alexander Olsson (noseglid) wrote :

I actually custom-compiled the kernel in an attempt to mitigate the issue.
I used 2.6.35-27-generic when the problem first appeared.

Both virtual instances (tapped by vnet0 and vnet1) can always reach each other, as well as the host.

As per your request, I will file a separate bug report.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Could you show the contents of /etc/xen/scripts/vif-bridge?

Is it random whether the 10.x.x.x or 216.x.x.x device goes down?

Is it perhaps always the one with the higher or lower mac address?

Once this affects a guest, and after a reboot, does it ever re-occur on the same guest?

You showed install instructions, but not upgrade instructions. Do you just do a standard do-release-upgrade in the guest? Do you reboot afterward?

Revision history for this message
Jesse Newland (jnewland) wrote :

vif-bridge:

https://gist.github.com/26ce1eea5a1db0ae26a6#file_vif_bridge.sh

It seems to be random; no pattern as far as we can tell. We've had guests lose one interface, then the other after a reboot. I haven't noticed whether it has anything to do with MAC addresses. I'll check next time.

Yes, we have seen reoccurrences after a reboot.

Upgrade procedure is here: https://gist.github.com/26ce1eea5a1db0ae26a6#file_upgrade

Revision history for this message
Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Incomplete → Won't Fix