Bridged Guests losing network connectivity

Bug #633392 reported by Andrew Klettke
This bug affects 12 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  High        Unassigned

Bug Description

Binary package hint: qemu-kvm

Opening at the request of Serge Hallyn.

Running 10.04 with KVM and Libvirt, starting VMs with "virsh start <domain>"

I've tested with OpenBSD 4.7 and Debian Lenny 5.0.5 guests; both have this issue.

Networking over the bridge works fine at first for my guests, but after a while the bridge stops functioning: I can't reach the guest via SSH, nor can I ping or SSH out from it.

Also, when this happens, the ARP entry on the firewall I'm using to SSH to the guest changes from the Guest's MAC to the Host's MAC address.
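
The ARP symptom above can be checked mechanically on whatever machine holds the ARP entry. The following is a minimal sketch, not part of the original report: it assumes the guest MAC is the one in the domain XML further down and that "the Host's MAC" means the br0/eth1 hardware address from the ifconfig output. The function name `arp_owner` is made up for illustration.

```python
# MAC addresses copied from this report; treating them as the relevant
# pair is an assumption of this sketch.
GUEST_MAC = "52:54:00:53:2d:a1"        # <mac address=.../> in infra01.xml
HOST_BRIDGE_MAC = "00:18:8b:3e:8d:c0"  # HWaddr of br0 and eth1 on the host


def arp_owner(mac):
    """Say whose MAC an ARP entry for the guest's IP currently holds."""
    mac = mac.strip().lower()
    if mac == GUEST_MAC:
        return "guest"
    if mac == HOST_BRIDGE_MAC:
        return "host"
    return "unknown"
```

Feeding it the MAC shown by the firewall's ARP table before and after a hang would show whether the entry has flipped from "guest" to "host".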

On OpenBSD, if I run the /etc/netstart script to re-initialize the interface, the bridge starts functioning again, as if I restarted the VM, but quickly fails again (10-20 minutes). I have to reboot the Debian machine to get networking functioning again.

Here is my /etc/network/interfaces on the host:
# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet dhcp

auto br0
iface br0 inet static
 address 192.168.8.166
 netmask 255.255.252.0
 network 192.168.8.0
 broadcast 192.168.11.255
 gateway 192.168.8.1
 bridge_ports eth1
 bridge_fd 9
 bridge_hello 2
 bridge_maxage 12
 bridge_stp off
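
As a quick sanity check on the br0 stanza above, the address and netmask can be used to derive the network and broadcast values and compare them with what was typed by hand. This is only a sketch using Python's standard `ipaddress` module; the helper name `br0_summary` is hypothetical.

```python
import ipaddress

# Address and netmask copied from the br0 stanza above.
iface = ipaddress.ip_interface("192.168.8.166/255.255.252.0")
net = iface.network


def br0_summary():
    """Return (network, broadcast, prefix length) derived from address + netmask."""
    return str(net.network_address), str(net.broadcast_address), net.prefixlen


def gateway_is_on_link(gateway="192.168.8.1"):
    """True if the configured gateway falls inside the derived network."""
    return ipaddress.ip_address(gateway) in net


print(br0_summary(), gateway_is_on_link())
```

The derived values agree with the `network`, `broadcast` and `gateway` lines in the stanza, so the static addressing itself looks consistent.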

And my /etc/libvirt/qemu/networks/autostart/default.xml:
<network>
  <name>default</name>
  <bridge name="virbr%d" />
  <forward/>
  <ip address="192.168.122.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.122.2" end="192.168.122.254" />
    </dhcp>
  </ip>
</network>

# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT udp -- anywhere anywhere udp dpt:domain
ACCEPT tcp -- anywhere anywhere tcp dpt:domain
ACCEPT udp -- anywhere anywhere udp dpt:bootps
ACCEPT tcp -- anywhere anywhere tcp dpt:bootps

Chain FORWARD (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere 192.168.122.0/24 state RELATED,ESTABLISHED
ACCEPT all -- 192.168.122.0/24 anywhere
ACCEPT all -- anywhere anywhere
REJECT all -- anywhere anywhere reject-with icmp-port-unreachable
REJECT all -- anywhere anywhere reject-with icmp-port-unreachable

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

# netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
192.168.122.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr0
192.168.8.0 0.0.0.0 255.255.252.0 U 0 0 0 br0
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 eth0
0.0.0.0 192.168.8.1 0.0.0.0 UG 0 0 0 br0
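
The routing table above lists two default routes, one via eth0 and one via br0. Whether that plays any part in this bug is not established anywhere in this report; the sketch below only shows one way to pull the default routes out of `netstat -nr` text so the situation is easy to spot. The function name `default_routes` is made up for illustration.

```python
# Routing table text copied from the netstat -nr output above.
NETSTAT_OUTPUT = """\
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
192.168.122.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr0
192.168.8.0 0.0.0.0 255.255.252.0 U 0 0 0 br0
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 eth0
0.0.0.0 192.168.8.1 0.0.0.0 UG 0 0 0 br0
"""


def default_routes(text):
    """Return (gateway, interface) for every default route in the table."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        # A default route has destination 0.0.0.0 and the G (gateway) flag.
        if len(fields) == 8 and fields[0] == "0.0.0.0" and "G" in fields[3]:
            routes.append((fields[1], fields[7]))
    return routes


print(default_routes(NETSTAT_OUTPUT))
```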

# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.00188b3e8dc0 no eth1
       vnet0
virbr0 8000.000000000000 yes

IFCONFIG -- With Guest OS Running
# ifconfig -a
br0 Link encap:Ethernet HWaddr 00:18:8b:3e:8d:c0
          inet addr:192.168.8.166 Bcast:192.168.11.255 Mask:255.255.252.0
          inet6 addr: fe80::218:8bff:fe3e:8dc0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:236817 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2654 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:17692813 (17.6 MB) TX bytes:362935 (362.9 KB)

eth0 Link encap:Ethernet HWaddr 00:18:8b:3e:8d:be
          inet addr:192.168.1.151 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::218:8bff:fe3e:8dbe/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:76541 errors:0 dropped:0 overruns:0 frame:0
          TX packets:62301 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:10233726 (10.2 MB) TX bytes:21694280 (21.6 MB)
          Interrupt:16 Memory:f8000000-f8012800

eth1 Link encap:Ethernet HWaddr 00:18:8b:3e:8d:c0
          inet6 addr: fe80::218:8bff:fe3e:8dc0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:241821 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10075 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:22612242 (22.6 MB) TX bytes:1391493 (1.3 MB)
          Interrupt:16 Memory:f4000000-f4012800

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:4013 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4013 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4101296 (4.1 MB) TX bytes:4101296 (4.1 MB)

virbr0 Link encap:Ethernet HWaddr 62:47:ad:6a:ce:a5
          inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
          inet6 addr: fe80::6047:adff:fe6a:cea5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:468 (468.0 B)

vnet0 Link encap:Ethernet HWaddr fe:54:00:53:2d:a1
          inet6 addr: fe80::fc54:ff:fe53:2da1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:7445 errors:0 dropped:0 overruns:0 frame:0
          TX packets:237663 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:977842 (977.8 KB) TX bytes:20968868 (20.9 MB)

# cat /etc/libvirt/qemu/networks/default.xml
<network>
  <name>default</name>
  <bridge name="virbr%d" />
  <forward/>
  <ip address="192.168.122.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.122.2" end="192.168.122.254" />
    </dhcp>
  </ip>
</network>

# qemu --version
QEMU PC emulator version 0.12.3 (qemu-kvm-0.12.3), Copyright (c) 2003-2008 Fabrice Bellard

# libvirtd --version
libvirtd (libvirt) 0.7.5

# uname -a
Linux ubuntu-kvm 2.6.32-24-server #42-Ubuntu SMP Fri Aug 20 15:38:55 UTC 2010 x86_64 GNU/Linux

# cat /etc/libvirt/qemu/infra01.xml
<domain type='kvm'>
  <name>infra01</name>
  <uuid>46597427-81ce-e696-82f1-df7d38b0cbb9</uuid>
  <memory>524288</memory>
  <currentMemory>524288</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.12'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/virt//infra01.img'/>
      <target dev='hda' bus='ide'/>
    </disk>
    <disk type='block' device='cdrom'>
      <driver name='qemu'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
    <interface type='bridge'>
      <mac address='52:54:00:53:2d:a1'/>
      <source bridge='br0'/>
      <model type='e1000'/>
    </interface>
    <console type='pty'>
      <target port='0'/>
    </console>
    <console type='pty'>
      <target port='0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' keymap='en-us'/>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
    </video>
  </devices>
</domain>

Changed in qemu-kvm (Ubuntu):
importance: Undecided → High
assignee: nobody → Serge Hallyn (serge-hallyn)
Serge Hallyn (serge-hallyn) wrote :

Thanks very much for opening this bug.

Can you append the /etc/netstart script from openbsd here?

I notice that while you have the libvirt-defined virbr0, your VM is using
a hand-built br0 (which has stp off). Is there any particular reason for
this? It looks like you want the guests locked into eth1 only, is that
right?

1. If that is NOT what you necessarily wanted, then you might try editing
your VM xml to have:

    <interface type='network'>
      <mac address='52:54:00:53:2d:a1'/>
      <source network='default'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

instead of

    <interface type='bridge'>
      <mac address='52:54:00:53:2d:a1'/>
      <source bridge='br0'/>
      <model type='e1000'/>
    </interface>

and see whether that works.

2. Alternatively, you could see whether having libvirt create the equivalent of
br0 works better (or just turn on stp on br0). You could change eth1 to have a
static address, remove br0 from /etc/network/interfaces, and then create a file
/etc/libvirt/qemu/networks/private.xml (symlinked into
/etc/libvirt/qemu/networks/autostart as well) containing something like:

<network>
  <name>default</name>
  <bridge name="virbr%d" />
  <forward mode="route" dev="eth1"/>
  <ip address="192.168.123.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.123.2" end="192.168.123.254" />
    </dhcp>
  </ip>
</network>

3. Depending on the results of that, I may post a recipe for testing
attaching a veth tunnel to virbr0 and br0 and testing continuity through
that.

4. If none of these work, we can try backported kernels (from
https://launchpad.net/~kernel-ppa/+archive/pre-proposed) and/or
qemu-kvm (from https://launchpad.net/~serge-hallyn/+archive/lucid-kvm-test)
packages.

Changed in qemu-kvm (Ubuntu):
status: New → Incomplete
Andrew Klettke (aklettke) wrote :

Here is the netstart script:

#!/bin/sh -
#
# $OpenBSD: netstart,v 1.129 2010/01/12 07:43:41 henning Exp $

# Strip comments (and leading/trailing whitespace if IFS is set)
# from a file and spew to stdout
stripcom() {
        local _l
        [[ -f $1 ]] || return
        while read _l; do
                [[ -n ${_l%%#*} ]] && echo $_l
        done<$1
}

# Returns true if $1 contains only alphanumerics
isalphanumeric() {
        local _n
        _n=$1
        while [ ${#_n} != 0 ]; do
                case $_n in
                        [A-Za-z0-9]*) ;;
                        *) return 1;;
                esac
                _n=${_n#?}
        done
        return 0
}

# Start the $1 interface
ifstart() {
        if=$1
        # Interface names must be alphanumeric only. We check to avoid
        # configuring backup or temp files, and to catch the "*" case.
        if ! isalphanumeric "$if"; then
                return
        fi

        file=/etc/hostname.$if
        if ! [ -f $file ]; then
                echo "netstart: $file: No such file or directory"
                return
        fi
        # Not using stat(1), we can't rely on having /usr yet
        set -A stat -- `ls -nL $file`
        if [ "${stat[0]#???????} ${stat[2]} ${stat[3]}" != "--- 0 0" ]; then
                echo "WARNING: $file is insecure, fixing permissions"
                chmod -LR o-rwx $file
                chown -LR root.wheel $file
        fi
        ifconfig $if > /dev/null 2>&1
        if [ "$?" != "0" ]; then
                # Try to create interface if it does not exist
                ifconfig $if create > /dev/null 2>&1
                if [ "$?" != "0" ]; then
                        return
                fi
        fi

        # Now parse the hostname.* file
        while :; do
                if [ "$cmd2" ]; then
                        # We are carrying over from the 'read dt dtaddr'
                        # last time.
                        set -- $cmd2
                        af="$1" name="$2" mask="$3" bcaddr="$4" ext1="$5" cmd2=
                        # Make sure and get any remaining args in ext2,
                        # like the read below
                        i=1
                        while [ $i -lt 6 -a -n "$1" ]; do shift; let i=i+1; done
                        ext2="$@"
                else
                        # Read the next line or exit the while loop.
                        read af name mask bcaddr ext1 ext2 || break
                fi
                # $af can be "dhcp", "up", "rtsol", an address family,
                # commands, or a comment.
                case "$af" in
                "#"*|"") # skip comments and empty lines
                        continue
                        ;;
                "!"*) # parse commands
                        cmd="${af#*!} ${name} ${mask} ${bcaddr} ${ext1} ${ext2}"
                        ;;
                "dhcp")
                        [ "$name" = "NONE" ] && name=
                        [ "$mask" = "NONE" ] && mask=
                        [ "$bcaddr" = "NONE" ] && bcaddr=
                        cmd="ifconfig $if $name $mask $bcaddr $ext1 $ext2 down"
...

Andrew Klettke (aklettke) wrote :

Serge,

Yes, I want to be able to lock down guests to a specific NIC; this is needed for VLAN segregation, as well as physical diversity.

Turning on STP makes no difference; the bug occurs regardless.

Which kernel should I try to install and run first?

Serge Hallyn (serge-hallyn) wrote :

I'd try the pre-proposed lucid kernels, see:

https://launchpad.net/~kernel-ppa/+archive/pre-proposed

Andrew Klettke (aklettke) wrote :

Serge,

No luck, still having the same issues with the new kernel.

Serge Hallyn (serge-hallyn) wrote :

Hi,

I want to see if we can prove that this is purely a bridging problem. Could you follow
the recipe at https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/616064/comments/33
(but attaching veth0 to br0 instead of virbr0)?

If this also loses connectivity, then we can pin it down to either the bridging module in
the kernel, or something in your environment.

Andrew Klettke (aklettke) wrote :

Serge,

Something different happens with this. Now, I get the network instability that I saw before with SSH sessions exiting with "Broken Pipe", and the netcat session you showed in the link just hangs.

If I try to re-SSH into the box, however, it works again. It looks like the same thing that was happening, only the ARP entry doesn't change (i'm assuming it's because there's no guest) after it does.

Serge Hallyn (serge-hallyn) wrote : Re: [Bug 633392] Re: Bridged Guests losing network connectivity

Quoting Andrew Klettke (<email address hidden>):
> Serge,
>
> Something different happens with this. Now, I get the network
> instability that I saw before with SSH sessions exiting with "Broken
> Pipe", and the netcat session you showed in the link just hangs.

How quickly does it do that? I think that's cause for immediately
opening a bug against the kernel.

> If I try to re-SSH into the box, however, it works again. It looks like
> the same thing that was happening, only the ARP entry doesn't change
> (i'm assuming it's because there's no guest) after it does.

Yeah ARP changing might be a symptom of some complicating layer
trying to recover from the original bridge hang.

-serge

Andrew Klettke (aklettke) wrote :

Serge,

It takes anywhere from maybe 10-30 minutes during the tests I ran, but it always happens.

You think this is an issue with the kernel?

Serge Hallyn (serge-hallyn) wrote :

> You think this is an issue with the kernel?

Did you get a chance to run this test with libvirt shut down? If so,
then the kernel and hardware are the only thing left in the equation.
If not, then it's not inconceivable, but unlikely, that libvirt is
periodically doing something nefarious. Is there anything suspicious in
/var/log/syslog around the time that the connection hangs?

I will try to reproduce this using a lucid kvm guest with a similar
network configuration. But I'm marking this as affecting the kernel
in the meantime in case someone over there has seen this before
or seen it discussed.

affects: qemu-kvm (Ubuntu) → linux (Ubuntu)
Soren Hansen (soren) wrote :

A couple of things would be helpful:

a) Have you tried this with multiple VMs at the same time? If so, do they both stop working at the same time? If so, can they still reach each other?
b) Can you still reach the bridge itself when this problem happens? I.e. can you access 192.168.8.166 from the firewall box?
c) Can you provide the output of "ifconfig -a" and "brctl show" when the problem is occurring?

Andrew Klettke (aklettke) wrote :

Soren:

See this bug for more info from me on this: https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/584048

a) I'll need to find time to set up this test, I'm working with you guys on this while I also have internal issues here at work, so bear with me.
b) No, when I SSH to the VM from my firewall and the connection is reset, the VM is inaccessible and the ARP entry on my firewall changes to the MAC address of the physical interface on the host. I can't ping out from the VM, either; all network connectivity is lost. If I'm NOT running a VM and just bridging as Serge's example above had me do, I can immediately SSH back into the box no problem, but the connection interruption still occurs.
c) The output of these commands looks exactly the same before, after, and during the problem.

Leann Ogasawara (leannogasawara) wrote :

I see mention that testing has been done on the Lucid released kernel as well as the Lucid pre-proposed kernels. Just curious if any testing has been done on the Maverick kernel? There should be a Maverick LTS backports kernel provided in the kernel-ppa (https://edge.launchpad.net/~kernel-ppa/+archive/pre-proposed), it's the 2.6.35 based kernel.

Andrew Klettke (aklettke) wrote :

Leann,

I'm testing the kernel you suggested with the bridging setup described by Serge in comment #6

I'll let you know what the results are.

Andrew Klettke (aklettke) wrote :

Leann,

Still no luck; the exact same issue persists. When I SSH into the server and run a loop that echoes "test" to the screen, the SSH session eventually breaks:

Write failed: Broken pipe

I can SSH right back into the IP, but the session will always fail eventually. I can't keep an SSH session alive using the bridge. I am SSHing to other boxes in the same VLAN from this same firewall, but never have this issue with them.
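
The exact loop used for this test is not shown in the report. As a rough sketch of the same idea, the helper below prints a numbered, time-stamped "test" line at a fixed interval, so the last line visible before "Broken pipe" shows roughly how long the session survived. The name `heartbeat` and its parameters are hypothetical.

```python
import time


def heartbeat(count, interval=1.0, clock=time.time, sleep=time.sleep, out=print):
    """Emit `count` numbered lines, each tagged with seconds since start."""
    start = clock()
    for i in range(count):
        out(f"test {i} +{clock() - start:.0f}s")
        sleep(interval)
```

Run inside the SSH session with a large count, for example `heartbeat(100000)`, and read the elapsed time off the last line printed when the session drops.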

Leann Ogasawara (leannogasawara) wrote :

Hi Andrew,

Just curious if any interesting error messages (a kernel oops for ex) show up in dmesg output after the ssh session fails?

Andrew Klettke (aklettke) wrote :

Leann,

I have the following in dmesg, but I believe all came before the ssh sessions were established and failed:

[ 235.700268] lo: Disabled Privacy Extensions
[ 254.123755] ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 268.992077] device veth0 entered promiscuous mode
[ 268.992104] br0: new device veth0 does not support netpoll (disabling)
[ 293.102269] ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 293.102298] br0: port 2(veth0) entering learning state
[ 293.102300] br0: port 2(veth0) entering learning state
[ 302.120040] br0: port 2(veth0) entering forwarding state
[ 303.130042] veth0: no IPv6 routers present
[ 303.430064] veth1: no IPv6 routers present

Jan Vonde (jan.vonde) wrote :

Hi!

I currently have the same problem with an Ubuntu 10.04 LTS server. I have a minimal install of Lucid on the physical machine with KVM, and the network interface is a bridge. All machines were set up using virt-manager. From time to time the network connection to the machines gets lost: no ping, no ssh, nothing. Nothing can be seen in the log files. After some seconds, sometimes minutes, the machines are back as if nothing had happened.
I have another server running Ubuntu 9.10 with the same setup, and it has no problems. Any idea?

\Jan

Tais P. Hansen (taisph) wrote :

My setup (Ubuntu Maverick 10.10, kernel 2.6.35-22-server) is this:

- eth0, eth1 bonded as bond0, active-backup.
- vlan1000 on top of bond0 as bond0.1000.
- br1000 bridge with bond0.1000 as interface.
- vnet0 guest in br1000 bridge as well.

Looking at arp traffic outside and inside the guest:

br1000 (in the host):

17:12:44.438727 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.2.3.4 tell 1.2.3.6, length 28
17:12:44.438782 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Request who-has 1.2.3.4 tell 1.2.3.6, length 42
17:12:44.439324 52:54:00:6e:b0:df > 52:54:00:07:9f:19, ethertype ARP (0x0806), length 56: Reply 1.2.3.4 is-at 52:54:00:6e:b0:df, length 42
17:12:45.438718 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.2.3.4 tell 1.2.3.6, length 28
17:12:45.438778 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Request who-has 1.2.3.4 tell 1.2.3.6, length 42
17:12:45.439261 52:54:00:6e:b0:df > 52:54:00:07:9f:19, ethertype ARP (0x0806), length 56: Reply 1.2.3.4 is-at 52:54:00:6e:b0:df, length 42

eth0 (the guest):
17:12:44.442208 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.2.3.4 tell 1.2.3.6, length 28
17:12:44.442717 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Request who-has 1.2.3.4 tell 1.2.3.6, length 42
17:12:45.442204 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.2.3.4 tell 1.2.3.6, length 28
17:12:45.442721 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Request who-has 1.2.3.4 tell 1.2.3.6, length 42

So the arp reply never makes it through to the guest for some reason effectively cutting off the guest from the network.
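
The comparison above (requests visible on both sides, replies only on the bridge) can be automated for longer captures. This is a minimal sketch that assumes tcpdump text in the same format as the lines pasted above; the function name `unanswered_targets` and the shortened sample captures are illustrative only.

```python
import re

# Patterns for the tcpdump ARP lines quoted in this comment.
REQUEST = re.compile(r"Request who-has (\S+) tell (\S+),")
REPLY = re.compile(r"Reply (\S+) is-at ([0-9a-f:]{17}),")

# One request/reply pair as seen on br1000, and the same request as seen in the guest.
BRIDGE_CAPTURE = """\
17:12:44.438727 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.2.3.4 tell 1.2.3.6, length 28
17:12:44.439324 52:54:00:6e:b0:df > 52:54:00:07:9f:19, ethertype ARP (0x0806), length 56: Reply 1.2.3.4 is-at 52:54:00:6e:b0:df, length 42
"""
GUEST_CAPTURE = """\
17:12:44.442208 52:54:00:07:9f:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.2.3.4 tell 1.2.3.6, length 28
"""


def unanswered_targets(capture):
    """Return the set of IPs that were asked for but never answered in a capture."""
    requested, answered = set(), set()
    for line in capture.splitlines():
        match = REQUEST.search(line)
        if match:
            requested.add(match.group(1))
            continue
        match = REPLY.search(line)
        if match:
            answered.add(match.group(1))
    return requested - answered


print("bridge:", unanswered_targets(BRIDGE_CAPTURE))
print("guest: ", unanswered_targets(GUEST_CAPTURE))
```

An address that is answered on the bridge capture but unanswered in the guest capture matches the behaviour described here: the is-at reply reaches the bridge but not the guest.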

Tais P. Hansen (taisph) wrote :

I can now reproduce the behavior on demand.

I'll try to explain my setup:

A: Router.
B: KVM guest on a Maverick Host.
C: Other Host.

A, B and C are all in the same subnet. I boot B and ssh from outside to B.

While in B I ping C. C initially doesn't respond and the ARP cache shows C as failed. After a while C suddenly starts responding and the ARP cache reflects C as reachable. If I delete the entry from the ARP cache, the cycle starts over, with a random delay before C responds.

Now, dumping ARP traffic on the host bridge and on vnet0 (also on the host side) while running the scenario above shows who-has requests coming through nicely. But the is-at replies for C only appear on the host bridge for some time; they don't make it to vnet0. After anything from 10 to 60 seconds a couple of the is-at replies make it through and the pings start getting sent and replied to.

Here's the odd part. If there's no other traffic from B (i.e. ssh output) while B is sending who-has requests for C, the ssh connection lags severely. Getting a CTRL-C through takes quite some time. Usually the lag lasts until B finally gets the is-at reply or gives up trying. But if there's a lot of output from pings and a top running in another connection, the lag is nearly unnoticeable.

Andrew Clegg (andrew-clegg-signups) wrote :

Tais -- we had similar problems to what you describe when trying to share a bonded channel between a host and 1+ guests. Random delays, lost packets, nasty low-level things like MAC addresses appearing for the wrong machine in other hosts' ARP tables.

In the end, we only made it work properly by un-bonding the host adapters (eth0->eth3) and creating a bridge for each of them (br0->br3), and then assigning VMs to bridges explicitly in virt-manager. This seems fine so far, everything is smooth and fast.

Admittedly this is on CentOS 5.4 (KVM 83-164) but I would recommend trying this on any host OS if you're having trouble with a bonded channel.

Serge Hallyn (serge-hallyn) wrote :

@Tais,

If you could try to reproduce with the daily builds, then we can confirm whether this
is still present in upstream. They are at:
https://launchpad.net/~ubuntu-server-edgers/+archive/server-edgers-qemu-kvm
and
https://launchpad.net/~ubuntu-server-edgers/+archive/server-edgers-libvirt

Tais P. Hansen (taisph) wrote :

@Andrew I tend to avoid bridging physical interfaces. My current setup used to work (Ubuntu Lucid, Gentoo etc).

@Serge Tried upgrading the packages and now I can't start the guest:

error: Failed to start domain osddev1
error: internal error Process exited while reading console log output: libvir: Security Labeling error : internal error error calling aa_change_profile()

- similar to bug #605960.

Tais P. Hansen (taisph) wrote :

@Serge May have something to do with /usr/lib/libvirt/virt-aa-helper missing?

Nov 16 20:42:32 kvm8 libvirtd: 20:42:32.563: error : qemudReadLogOutput:2164 : internal error Process exited while reading console log output: libvir: Security Labeling error : internal error error calling aa_change_profile()#012
Nov 16 20:42:32 kvm8 libvirtd: 20:42:32.565: error : virRunWithHook:854 : internal error '/usr/lib/libvirt/virt-aa-helper -R -u libvirt-16f15832-18f7-fc60-2195-2e48ac44a9de' exited with non-zero status 1 and signal 0: libvir: error : cannot execute binary /usr/lib/libvirt/virt-aa-helper: No such file or directory#012
Nov 16 20:42:32 kvm8 libvirtd: 20:42:32.565: error : AppArmorRestoreSecurityAllLabel:557 : internal error could not remove profile for 'libvirt-16f15832-18f7-fc60-2195-2e48ac44a9de'

Serge Hallyn (serge-hallyn) wrote :

@Tais: Gah, I'll have to look into it, thanks. As for my request to try the dailies, I was
once again confusing bugs, and forgot that this bug is actually a kernel bug. (If you
suspect yours is actually due to libvirt/qemu, then that'll require a new bug, but I don't
think you do - you just managed to recreate it again using those, right?)

Actually, Tais, if you look at comment #6, I'd like to see what your results are with it.

Serge Hallyn (serge-hallyn) wrote :

Quoting Tais Plougmann Hansen (<email address hidden>):
> @Serge May have something to do with /usr/lib/libvirt/virt-aa-helper
> missing?

Thanks for noticing that. I don't know why I had to copy that over manually,
but I did. The updated .deb has it.

Tais P. Hansen (taisph) wrote :

@Serge: New deb works. Guests start nicely again. Doesn't fix the problem though.

With regards to the newer kernel, it also doesn't fix the problem for me.

I've used parts of the test you linked to in comment #6 and can confirm this is completely reproducible in the host without any guests running.

rICh morrow (rich-quicloud) wrote :

I believe we've hit the same issue as others above.

System:
Host - 10.10 ubuntu running the "2.6.35-22-server" kernel
Guest(s) - 9.4 ubuntu running "2.6.32-25-server" kernel

virsh version:
Compiled against library: libvir 0.8.3
Using library: libvir 0.8.3
Using API: QEMU 0.8.3
Running hypervisor: QEMU 0.12.5

NIC in HOST:
hosts have eth0 (public) bridged through br0, eth1 (private) bridged through br1
virbr0 seems to be allocated by KVM on the 192.168.122/24 network
vnet0,1,2...n come up per guest

NICs in GUEST:
plain jane eth0 (public), and eth1 (private).

Symptoms:
From a fresh install of all the above, everything was working flawlessly... hosts & guests could see both private & public, all were accessible from outside via SSH. IPtables was locked down, and everything was going smoothly.

~2 days into server builds, guests cannot be SSH'd into... can't be pinged... VNC'ing into console, we can't ping out. Network is all set up as it was just hours before when all was working, route / ifconfig / arp / and several other commands are run verifying so. Many heads are scratched. Many (including myself) began to cry.

Noticed some chatter implicating AppArmor in /var/log/messages:
******** SCREEN PASTE START ********
Nov 19 16:38:39 LB-01 kernel: [178667.785748] type=1400 audit(1290213519.522:23): apparmor="STATUS" operation="profile_remove" name="libvirt-cd4fbae6-58be-f3d9-4623-3968f91cf6cb" pid=15435 comm="apparmor_parser"
Nov 19 16:38:47 LB-01 kernel: [178675.423697] type=1400 audit(1290213527.182:24): apparmor="DENIED" operation="open" parent=1231 profile="/usr/lib/libvirt/virt-aa-helper" name="/var/lib/virt/images/baseline-vol" pid=15437 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=103
Nov 19 16:38:47 LB-01 kernel: [178675.484980] type=1400 audit(1290213527.242:25): apparmor="STATUS" operation="profile_load" name="libvirt-cd4fbae6-58be-f3d9-4623-3968f91cf6cb" pid=15438 comm="apparmor_parser"
Nov 19 16:38:47 LB-01 libvirtd: 16:38:47.981: warning : qemudParsePCIDeviceStrs:1422 : Unexpected exit status '1', qemu probably failed
Nov 19 16:38:47 LB-01 kernel: [178676.214864] device vnet0 entered promiscuous mode
Nov 19 16:38:47 LB-01 kernel: [178676.214886] br0: new device vnet0 does not support netpoll (disabling)
Nov 19 16:38:47 LB-01 kernel: [178676.216191] br0: port 2(vnet0) entering learning state
Nov 19 16:38:47 LB-01 kernel: [178676.216195] br0: port 2(vnet0) entering learning state
Nov 19 16:38:47 LB-01 kernel: [178676.218294] device vnet1 entered promiscuous mode
Nov 19 16:38:47 LB-01 kernel: [178676.218314] br1: new device vnet1 does not support netpoll (disabling)
******** SCREEN PASTE END ********

Crying ceased, Hope shined -- AppArmor disabled, removed, and thrown in a river out back with a concrete block tied to its ankle.

Rebooted host. Still nothing but VNC into guests. Crying ensued.

Here's what we're still seeing in /var/log/messages (seems highly odd -- notice warnings & last line about qemu fail):
******** SCREEN PASTE START ********
Nov 19 18:28:38 LB-01 kernel: [ 10.543416] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Nov 19 18:28:38 LB-01 kernel: [ 10.543557] br0: port 1(eth0) enter...


rICh morrow (rich-quicloud) wrote :

One other salient point to our bug report above... we had two hosts with the exact same setup described above. Guests on *both* hosts lost networking within a few hours of each other. Hosts also retained full network access to both public & other devices on private during whole guest outage (which continues).

rICh morrow (rich-quicloud) wrote :

One correction on our setup -- our guests are 10.04 LTS (Lucid), not 9.04. Hosts are, indeed 10.10 Maverick.

Jesse Newland (jnewland) wrote :

We're also seeing interfaces dropping after upgrading bridged 8.10 guests to 10.04 on the same Xen host (Centos 5.5). This is the most similar bug report I can find at the moment, so I'm describing our experiences here.

Our guests have two interfaces on the guest connected to the same bridge on the Xen host. In all situations (we've experienced this a dozen+ times), only *one* of these interfaces has dropped, and usually during periods of high load.

Another interesting thing to note is that we have *not* experienced this behavior on any newly installed 10.04 guests - only guests that were upgraded from 8.10 to 10.04. In more than one case, reinstalling 10.04 on a guest experiencing this problem more than once a week has prevented it from reoccurring ever since (months). The problem not reoccurring doesn't necessarily mean the upgrade is what caused it since we can't manually trigger the problem, but it's worth mentioning.

We have a dozen or so systems that have been affected by this, and several hundred that are in danger of losing network connectivity without a fix. Please let me know if I should file a new bug, or if any additional information would help debug! This happens weekly, and when we lose a single interface, we're still able to access the guest via the other interface, so that might be a nice avenue towards isolating the bug.

Serge Hallyn (serge-hallyn) wrote :

Quoting Jesse Newland (<email address hidden>):
> without a fix. Please let me know if I should file a new bug, or if any

Hi,

please do file a new bug. Please include the following information for
all hosts before upgrade, after upgrade, and after a reinstall:

Detailed steps followed when you update.
Detailed steps followed when you reinstall.

Results of:

brctl show
ifconfig -a
iptables -vnL
iptables -vnL -t nat

on both host and guest.

Contents of:

/etc/libvirt/qemu/networks/default.xml

on the host, and contents of:

/etc/network/interfaces
/etc/udev/rules.d/70-persistent-net.rules

on the guest.

thanks,
-serge

Serge Hallyn (serge-hallyn) wrote :

@Andrew,

Could you describe the network which sits (or sat) behind eth0 and eth1? Both basic topology (I assume eth1 was on a private network separate from eth0?) and hardware and any possible settings?

Andrew Klettke (aklettke) wrote :

Serge,

We've since moved on to Proxmox, but here's the description:

eth0 was the external interface, connected to an interface on an OpenBSD firewall (with a switch in between).

eth1 was the bridged interface, connected directly to a separate interface on the firewall with a crossover cable.

All worked until the bug would show up and drop connectivity.

Jesse Newland (jnewland) wrote :
Changed in linux (Ubuntu):
assignee: Serge Hallyn (serge-hallyn) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Nathaniel W. Turner (nturner) wrote :

I'm seeing very similar behavior to the OP on 14.04.

Changed in linux (Ubuntu):
status: Expired → Confirmed