kvm images losing connectivity w/bridged network

Bug #584048 reported by Billy Charlton on 2010-05-22
This bug affects 17 people
Affects            Status        Importance  Assigned to   Milestone
openSUSE           Invalid       Undecided   Unassigned    -
libvirt (Ubuntu)   -             High        Serge Hallyn  -
  (Nominated for Hardy by Frank Groeneveld)
  Lucid            -             High        Serge Hallyn  -
  Maverick         -             High        Serge Hallyn  -
qemu-kvm (Fedora)  Fix Released  High        -             -

Bug Description

Binary package hint: kvm

=====================================
SRU justification:
1. Impact: Networking can freeze for a while when VMs are started or stopped.
2. How addressed: a patch was cherry-picked from upstream. It assigns a sufficiently high MAC address to newly created network interfaces so that the bridge's own MAC address does not change; that change is what causes the hangs.
3. Patch: see the linked bzr tree
4. Regression potential: very unlikely, as this patch is upstream and only affects the MAC address chosen for the libvirt bridge.
5. To reproduce: Define a VM with a macaddr lower than the bridge's. Start the VM.
=====================================
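
The mechanism in point 2 can be illustrated with a small sketch. It assumes the standard Linux bridge behaviour of adopting the lowest MAC address among its ports when no address has been set explicitly. The function name `lowest_mac` and the sample addresses are made up for illustration and are not part of the patch.

```shell
#!/bin/sh
# Hypothetical helper: print the lowest of the MAC addresses given as arguments.
# Assumes lowercase, colon-separated, zero-padded MACs, so that a plain
# byte-wise sort matches numeric order.
lowest_mac() {
    printf '%s\n' "$@" | LC_ALL=C sort | head -n 1
}

# Illustrative addresses only.
eth0_mac="00:1e:c9:aa:bb:cc"      # physical NIC enslaved to the bridge
low_tap_mac="00:16:3e:12:34:56"   # tap device with a lower MAC than the NIC
high_tap_mac="fe:16:3e:12:34:56"  # tap device with a high MAC

# With only eth0 enslaved, the bridge would use eth0's address.
lowest_mac "$eth0_mac"
# Adding a tap with a lower MAC would change the bridge's address.
lowest_mac "$eth0_mac" "$low_tap_mac"
# Adding a tap with a high MAC would leave the bridge's address alone.
lowest_mac "$eth0_mac" "$high_tap_mac"
```

Under that assumption, only the second case changes the bridge address, which matches the reproduction step in point 5.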

Serious networking problems with KVM running a Windows XP x64 guest using bridged networking. This may be a KVM bug, or a bridged-networking or Ethernet driver bug; unfortunately I don't have the skills to tell the difference, but I will do what I can to help pinpoint the problem.

My Setup:

- Lucid 64-bit installed on a dual quad-core (Intel) Dell Precision R710, 32 GB RAM.

- One KVM instance running Windows XP x64 with 16 GB RAM, using bridged networking. Symptoms occur with both the standard "-net nic" KVM network driver (rtl8139?) and the virtio network drivers on Windows. Either way I am also using "-net tap" so I can get out to the real network.

- I'm starting KVM from the command line instead of using libvirt & virt-manager, because I need to specify sockets/cores/threads for my CPU. Otherwise Windows XP only uses 2 CPUs. I don't think this is relevant but who knows.

The Problem:

The KVM windows image starts up fine, and at first works properly. But after a very short duration, usually less than two minutes, the networking freezes up. Can't get to the internet, can't access local shares, nothing. Sometimes it magically comes back; sometimes it's gone forever.

An interesting, possibly related detail: If I shut down the Windows image, the host machine's networking freezes up for many seconds just as the image is exiting. After the exit is complete and KVM is completely shut down, I can ssh or VNC back into the host machine again without problems.

So, after restarting the KVM image I try running the "iperf" network performance tool in two modes.

1) From a separate PC on my LAN, I can run iperf between the Lucid *host* and that PC, and never have any problems. I can consistently get 800 Mbits/sec up and 200 MBits/sec down. (I've run it about 50 times)

2) From the separate PC to the KVM Windows image running on the Lucid host:
      - it sometimes runs successfully with 130Mbit/sec in both directions;
      - other times gets just 25-40Mbit/sec;
      - and sometimes it fails completely with the error message "write failed: Connection reset by peer. read on server close failed: Connection reset by peer".

What can I do to help pinpoint this bug? My hunch is the error is in the bridging or the ethernet driver, but I don't know how to test that since I only use bridging for KVM clients! Any ideas on what I can do? I really want to help diagnose this.

Thierry Carrez (ttx) wrote :

Could you give us the version of the kvm package used, and the kvm command line you use to start your VM ?

Changed in kvm (Ubuntu):
importance: Undecided → High
status: New → Incomplete
affects: kvm (Ubuntu) → qemu-kvm (Ubuntu)

angriukas (andrius-uskevicius) wrote :

I have also observed the network freeze.
I used virtio drivers for both network and disk (WinXP/Win7 x86 guests).
On every guest shutdown, and sometimes on guest start-up, the network connection (to the host and to the guests) is interrupted for between 2 and 10 seconds (sometimes more; it varies).

My case:
Lucid x64 with Eucalyptus installed on a quad-core Intel CPU (motherboard 'Asus P6T SE'); the network is bridged.
Whether I run kvm manually or through the Eucalyptus cloud, the same network freeze always occurs.

angriukas (andrius-uskevicius) wrote :

The network freeze probably occurs because the tun/tap interface created on guest start-up is added to the bridge.
On guest shutdown, the tun/tap interface is removed from the bridge. Adding the tap interface to the bridge, or removing it, causes the network freeze in my case. I experimented by replacing '-net tap' with '-net tap,ifname=tap.0,script=no,downscript=no' and no freeze occurred. But in that case the guest is not connected to the outside world.

I have no idea whether the freeze is related to the virtual network kernel drivers or to the physical network interface driver of the host machine (in my case the host interface is 'RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)').

I also tested the following with ping:
from LAN to IP of bridged interface - time out during freeze
from bridged machine to bridged interface - no freeze
from bridged machine to LAN - time out during freeze

I have tested on:
Lucid x64 + latest updates
installed eucalyptus cloud
kernel: 2.6.32-22-server
qemu-kvm: 0.12.3+noroms-0ubuntu9

I'm seeing this issue as well. I have a Hardy host system, with bridged network. I start a lucid 64 bit kvm with virtio and a tap device (which is in a bridge already) and after a while, the network seems down (from outside, no ssh access etc). However, when I run a ping from within the kvm instance, the network works again.

@Thierry, I'm running the latest kvm version from the hardy ppa (it's newer than the original hardy I believe):
Version: 1:84+dfsg-0ubuntu12.1~rc5ppa1

This is the line I use to start the VM:
screen -t development kvm -m 1024m -nographic -drive file=development.raw,if=virtio,boot=on -net nic,macaddr=xx:xx:xx:xx:xx:xx,model=virtio -net tap,ifname=tap1,script=no,downscript=no

Changed in qemu-kvm (Ubuntu):
status: Incomplete → Confirmed

Paolo Maero (fabrica64) wrote :

It is a bridge problem. I get the same freeze if I delete a tap interface from the bridge. I was trying to use a "static" tap interface from the virtual machine, and then I executed the command:
# brctl delif br0 tap0
and the network suddenly froze in exactly the same way, so I suppose that the same line in the qemu-ifdown script causes the freeze.
I logged in through another interface and re-added tap0 to the bridge. This apparently unfroze the network, but when I repeated the procedure I did not always get the same results...
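
One way to check whether the bridge's own MAC address changes across such an operation is to read it from sysfs before and after. A minimal sketch, assuming the bridge is `br0`; the helper name `bridge_mac` and the `SYSFS_NET` override (present only so the function can be exercised without a real bridge) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the MAC address of an interface as the kernel
# reports it in sysfs. SYSFS_NET is an illustrative override for testing.
bridge_mac() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/address"
}

# Usage sketch (run as root on the host; br0 and tap0 as in the comment above):
#   before=$(bridge_mac br0)
#   brctl delif br0 tap0
#   after=$(bridge_mac br0)
#   [ "$before" = "$after" ] || echo "br0 MAC changed: $before -> $after"
```

If the address differs before and after, that would point at the bridge MAC change rather than at KVM itself.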

Jürgen Sauer (juergen-sauer) wrote :

It is a huge problem here too, and it costs a lot of money. Such a nasty bug!
Network interfaces freezing in virtual servers is no fun.
This bug stalls a whole range of production servers!

Any further ideas?

Jürgen Sauer (juergen-sauer) wrote :

Here is a Log excerpt:
[ 1927.794336] br0: starting userspace STP failed, starting kernel STP
[ 8145.487525] type=1505 audit(1279203091.030:18): operation="profile_remove" pid=7165 name="libvirt-d9018107-db16-f1f1-e1b9-b8ec787282aa" namespace="root"
[ 8145.585758] br0: port 5(vnet3) entering disabled state
[ 8145.643386] device vnet3 left promiscuous mode
[ 8145.643391] br0: port 5(vnet3) entering disabled state
[ 8158.294350] br0: port 2(vnet0) entering disabled state
[ 8158.362301] device vnet0 left promiscuous mode
[ 8158.362309] br0: port 2(vnet0) entering disabled state
[ 8158.829476] type=1505 audit(1279203104.370:19): operation="profile_remove" pid=7171 name="libvirt-e49ac8b0-5417-4621-ae80-83272169f6e2" namespace="root"
[ 8160.494446] br0: port 4(vnet2) entering disabled state
[ 8160.602287] device vnet2 left promiscuous mode
[ 8160.602295] br0: port 4(vnet2) entering disabled state
[ 8161.004029] type=1505 audit(1279203106.550:20): operation="profile_remove" pid=7175 name="libvirt-8a494d5a-4d33-421c-a670-39dfebf7996e" namespace="root"
[ 8182.485745] br0: port 3(vnet1) entering disabled state
[ 8182.563350] device vnet1 left promiscuous mode
[ 8182.563355] br0: port 3(vnet1) entering disabled state
[ 8183.812965] type=1505 audit(1279203129.360:21): operation="profile_remove" pid=7184 name="libvirt-1c1fed8d-24f4-41ca-aa48-36214eacbbd8" namespace="root"

Ouch!

Jojo

Paolo Maero (fabrica64) wrote :

Yesterday I saw another strange behavior.

I shut down the VM once a day to make a backup, and I had adjusted my scripts for the 1-10 min network freeze, but yesterday the network did not unfreeze at all. I restarted the machine and the network stayed frozen! I had to physically power off the machine to unfreeze the network, as if some state kept in the network chip survived the reboot. My machine is an HP DL380 G6.

The network freeze is in fact not total. Not all IPs are blocked. It seems that IPs with no traffic at the time of the bridge operation are not blocked. So the freeze I describe above (the one that survived the reboot) was not total. Some IPs could be pinged and connected to and others could not. The situation was identical before and after the reboot. Only after a power-off did everything return to normal.

Hope it may help...

Jürgen Sauer (juergen-sauer) wrote :

This morning I had a freeze as well.
The virtualized Hardy guest froze. (The host is Lucid AMD64 on Xeons.)
The virtualized Lucid guest was running fine (on the same host).
The virtualized Dapper guest was running fine as well (same host).

The Hardy guest is a production server. Ouch! Ouch! Ouch!

"I hate mondays"

Jürgen Sauer (juergen-sauer) wrote :

Alert! Occurred again $#!³!!!
OUCH OUCH OUCH

Jürgen Sauer (juergen-sauer) wrote :

None of the workarounds work, neither disabling AppArmor nor doing it manually.

Critical DoS

Jürgen Sauer (juergen-sauer) wrote :

The problem also occurs when the bridged devices are not touched at all.
Nothing was happening on the host side, yet the problem still occurs at random times.

That is a very, very, very nasty DoS bug; it makes Lucid completely unusable as a virtual hosting platform.

CRITICAL

Dustin Kirkland  (kirkland) wrote :

Can anyone reproduce this on Maverick at this point? We'd need to reproduce and fix it there before getting a fix out to Lucid. And if it is fixed there, then we can try to zero into a fix for Lucid.

Changed in qemu-kvm (Ubuntu):
assignee: nobody → Serge Hallyn (serge-hallyn)
status: Confirmed → Incomplete

Dustin Kirkland  (kirkland) wrote :

Also, we don't really have access to Windows guests. Can anyone reproduce this bug with a Linux guest, or is it affecting Windows guests only?

Paolo Maero (fabrica64) wrote :

It is not a KVM bug; it is a bridge bug that affects adding interfaces to and removing them from the bridge. qemu-ifup and qemu-ifdown do exactly that and so trigger the bug, but it is not related to Windows or Linux VMs, nor to KVM...

I have this problem with a Linux guest (Lucid) on a Linux host (Hardy)
that uses the backports KVM.

Serge Hallyn (serge-hallyn) wrote :

Could someone experiencing this bug with a lucid or maverick host please
post your /etc/network/interfaces (or the equivalent info) and the contents
of your /etc/libvirt/qemu/networks/autostart/default.xml (or whichever network
your guests are attached to)?

Paolo Maero (fabrica64) wrote :

This is my /etc/network/interfaces

eth0 is connected to the local network, eth3 is reserved for the cluster (it's a redhat cluster), domain.it is not the real domain...

I have no autostart network in libvirtd; /etc/libvirt/qemu/networks/default.xml is the standard one as installed by the lucid libvirt-bin package:

<network>
  <name>default</name>
  <bridge name="virbr%d" />
  <forward/>
  <ip address="192.168.122.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.122.2" end="192.168.122.254" />
    </dhcp>
  </ip>
</network>

But I removed the symlink under /etc/libvirt/qemu/networks/autostart, as I only use the br0 bridge.

Serge Hallyn (serge-hallyn) wrote :

I can't reproduce this on my lucid testbox. Do you have networkmanager
or wicd installed? Leaving wicd installed, for instance, caused my
box to frequently add a duplicate default route, which stopped my
network.

Can you gather the output, both when network is up, and while it is
paused, of:

 netstat -nr
 ifconfig -a
 brctl show
 ps -ef

====

To show precisely how I tried to reproduce, I did:
 apt-get remove network-manager
 apt-get remove wicd
 cat > /etc/network/interfaces << EOF
  auto lo
  iface lo inet loopback

  auto eth0
  iface eth0 inet manual

  auto br0
  iface br0 inet dhcp
   bridge_ports eth0
   bridge_stp off
   bridge_fd 0
   bridge_maxwait 0
 EOF
 (rebooted)

Then I cloned two working lucid server images
 qemu-img create -f qcow2 -b s1.img s2.img
 qemu-img create -f qcow2 -b s1.img s3.img
 MACADDR1=`ifconfig br0 | head -2 | tail -1 | awk -F: '{ print $2 '}| awk '{ print $1 '}`
 MACADDR2=`ifconfig br0 | head -2 | tail -1 | awk -F: '{ print $2 '}| awk '{ print $1 '}`

I started the first with kvm:
 kvm -drive file=s2.img,if=scsi,index=0,boot=on -m 1G -smp 2 -net nic,macaddr=$MACADDR1,model=virtio -net tap,ifname=tap1,script=no,downscript=no

and hooked up its interface on the host:
 ifconfig tap1 0.0.0.0 up
 brctl addif br0 tap1
then ran dhclient on the guest.

Then I started the second guest the same way (with s3.img, tap2, and $MACADDR2).

The whole time I left a terminal open with 'ping google.com', and
never saw any hiccoughs.

Please let me know if any of this gives you an idea of what else I should
do to reproduce.

Andrew Klettke (aklettke) wrote :

I'm also having this problem. Running 10.04 with KVM and Libvirt, starting VMs with "virsh start <domain>"

I've tested with OpenBSD 4.7 and Debian Lenny 5.0.5, both of them are having this issue.

I notice that networking with the bridge works fine at first for my guests, but after a while, the bridge stops functioning, and I can't reach it via SSH, nor can I ping/ssh out.

Also, when this happens, the ARP entry on the firewall I'm using to SSH to the guest changes from the Guest's MAC to the Host's MAC address.

On OpenBSD, if I run the /etc/netstart script to re-initialize the interface, the bridge starts functioning again, as if I restarted the VM, but quickly fails again (10-20 minutes). I have to reboot the Debian machine to get networking functioning again.

Here is my /etc/network/interfaces on the host:
# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet dhcp

auto br0
iface br0 inet static
 address 192.168.8.166
 netmask 255.255.252.0
 network 192.168.8.0
 broadcast 192.168.11.255
 gateway 192.168.8.1
 bridge_ports eth1
 bridge_fd 9
 bridge_hello 2
 bridge_maxage 12
 bridge_stp off

And my /etc/libvirt/qemu/networks/autostart/default.xml:
<network>
  <name>default</name>
  <bridge name="virbr%d" />
  <forward/>
  <ip address="192.168.122.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.122.2" end="192.168.122.254" />
    </dhcp>
  </ip>
</network>

Andrew Klettke (aklettke) wrote :

General instability exists with this bug as well. For example, SSHing to the VM:

$ ssh root@192.168.8.166
root@192.168.8.166's password:
Linux deb64 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Jul 22 15:57:28 2010 from 192.168.8.1
deb64:~#

///////////////////////////////////////////////////////////////////////////////////////
After waiting a few minutes, and pressing "Enter"...
////////////////////////////////////////////////////////////////////////////////////

deb64:~# Write failed: Broken pipe

///////////////////////////////////////////////////////////////////////////////////////
Then, trying to re-SSH into the server...
////////////////////////////////////////////////////////////////////////////////////
$ ssh root@192.168.8.166
root@192.168.8.166's password:
Read from socket failed: Connection reset by peer
$ ssh root@192.168.8.166
ssh: connect to host 192.168.8.166 port 22: Connection refused

This is when the ARP entry changes on the firewall I'm SSHing from, and the bridge connectivity freezes, and I have to restart the VM.

Serge Hallyn (serge-hallyn) wrote :

Quoting Andrew Klettke (<email address hidden>):
> General instability exists with this bug as well. For example, SSHing to
> the VM:
>
> $ ssh root@192.168.8.166
> root@192.168.8.166's password:
> Linux deb64 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64
>
> The programs included with the Debian GNU/Linux system are free software;
> the exact distribution terms for each program are described in the
> individual files in /usr/share/doc/*/copyright.
>
> Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
> permitted by applicable law.
> Last login: Thu Jul 22 15:57:28 2010 from 192.168.8.1
> deb64:~#
>
> ///////////////////////////////////////////////////////////////////////////////////////
> After waiting a few minutes, and pressing "Enter"...
> ////////////////////////////////////////////////////////////////////////////////////
>
> deb64:~# Write failed: Broken pipe
>
> ///////////////////////////////////////////////////////////////////////////////////////
> Then, trying to re-SSH into the server...
> ////////////////////////////////////////////////////////////////////////////////////

So now is this going through the firewall?

> $ ssh root@192.168.8.166
> root@192.168.8.166's password:
> Read from socket failed: Connection reset by peer
> $ ssh root@192.168.8.166
> ssh: connect to host 192.168.8.166 port 22: Connection refused
>
>
> This is when the ARP entry changes on the firewall I'm SSHing from, and the
> bridge connectivity freezes, and I have to restart the VM.

What kind of firewall? Do other hosts inside the firewall have the
right ARP entry for the guest? What about the host itself?

Note that this sounds like a different problem from what others are
reporting. Others are saying that their bridge hangs when new devices
are added or removed.

Andrew Klettke (aklettke) wrote :

> So now is this going through the firewall?
This is from the firewall.

> What kind of firewall? Do other hosts inside the firewall have the
> right ARP entry for the guest? What about the host itself?
It's a simple i386 OpenBSD 4.7 box with PF enabled. All traffic to and from the Host and the VM is unfiltered. I'll have to check whether the ARP entry changes for the host, but other hosts show the ARP entry change as well.

> Note that this sounds like a different problem from what others are
> reporting. Others are saying that their bridge hangs when new devices
> are added or removed.
Note that the OP has said nothing about adding or removing new devices.

Serge Hallyn (serge-hallyn) wrote :

Quoting Andrew Klettke (<email address hidden>):
> > So now is this going through the firewall?
> This is from the firewall.
>
> > What kind of firewall? Do other hosts inside the firewall have the
> > right ARP entry for the guest? What about the host itself?
> It's a simple i386 OpenBSD 4.7 box with PF enabled. All traffic to and from the Host and the VM is unfiltered. I'll have to check whether the ARP entry changes for the host, but other hosts show the ARP entry change as well.
>
> > Note that this sounds like a different problem from what others are
> > reporting. Others are saying that their bridge hangs when new devices
> > are added or removed.
> Note that the OP has said nothing about adding or removing new devices.

Yes, I just want to make clear that there may be two separate issues
here.

If you don't mind, once you verify the host's /proc/net/arp entries
before and after it goes bad, I'm going to whip up a recipe for trying
to reproduce this without kvm, so we can reclassify it appropriately.

thanks,
-serge

Andrew Klettke (aklettke) wrote :

> If you don't mind, once you verify the host's /proc/net/arp entries
> before and after it goes bad, I'm going to whip up a recipe for trying
> to reproduce this without kvm, so we can reclassify it appropriately.

I don't see the ARP entry for the guest at all on the host machine before the networking fails, it's just not there; even after pinging the host machine from the guest, and vice-versa. It doesn't make a difference whether or not STP is enabled on the bridge.

BEFORE NETWORKING FAILS:
# cat /proc/net/arp
IP address HW type Flags HW address Mask Device
192.168.1.154 0x1 0x2 00:30:1b:be:63:4f * eth0
192.168.8.1 0x1 0x2 00:15:17:dc:2f:a2 * br0

AFTER NETWORKING FAILS:
# cat /proc/net/arp
IP address HW type Flags HW address Mask Device
192.168.1.154 0x1 0x2 00:30:1b:be:63:4f * eth0
192.168.8.1 0x1 0x2 00:15:17:dc:2f:a2 * br0
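
To make a before/after comparison like this easier to repeat, the MAC recorded for a given IP can be pulled out of the table with a small helper. This is a minimal sketch: the function name `arp_mac`, the optional file argument, and the `GUEST_IP` placeholder are made up for illustration, and it assumes the usual `/proc/net/arp` column order shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the hardware address recorded for an IP.
# $1 = IP address, $2 = ARP table file (defaults to /proc/net/arp).
arp_mac() {
    awk -v ip="$1" 'NR > 1 && $1 == ip { print $4 }' "${2:-/proc/net/arp}"
}

# Usage sketch: record the entry while the guest is reachable, then compare
# once it is not. GUEST_IP is a placeholder for the guest's address.
#   arp_mac "$GUEST_IP" > /tmp/arp.before
#   ... wait for the failure ...
#   arp_mac "$GUEST_IP" > /tmp/arp.after
#   diff /tmp/arp.before /tmp/arp.after
```

An empty result means the table has no entry for that IP, which is what the two listings above show for the guest.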

Sergey Svishchev (svs) wrote :

These may be related:

https://bugzilla.redhat.com/show_bug.cgi?id=571991 " libvirt should not use the MAC address assigned to tap devices/vnet interfaces by the TAP/TUN driver."

https://www.redhat.com/archives/libvir-list/2010-July/msg00450.html "[libvirt] [PATCH] Set a stable & high MAC addr for guest TAP devices on host"

For me, these are not related, because I'm not modifying the bridge
when the connectivity is lost. As long as there is steady network
traffic (for example a ping every minute), the connectivity is not
lost. As soon as I stop this, the connectivity will be lost within a
few hours.

Andrew Klettke (aklettke) wrote :

I am also not modifying the bridge at all.

Where my problem differs from Frank's, however, is my connectivity is lost even if there is steady network traffic to or from the guest.

Paolo Maero (fabrica64) wrote :

I verified that my problem is the same as Red Hat bug 571991: I am losing connectivity because the bridge changes its MAC address.

Serge Hallyn (serge-hallyn) wrote :

Quoting Paolo Maero (<email address hidden>):
> I verified that my problem is the same as Red Hat bug 571991: I am
> losing connectivity because the bridge changes its MAC address.

Thanks, Paolo, so then launchpad bug # 579892 is the one you want.
It looks like a fix was *just* sent upstream. Should be included
in 0.8.3, or, if we merge 0.8.2 into maverick instead we should be
able to cherrypick it.

Any updates on this? I still have this problem, running 2 LTS releases, Hardy as host and Lucid as guest, both simple server installs, with KVM, virtio, tap devices and a bridge...

Oh, btw, Serge: I'm not using anything like wicd or networkmanager, this is on a server.

Changed in qemu-kvm (Ubuntu):
status: Incomplete → Confirmed

Santi Manninen (kschzt) wrote :

I think I'm having the same debilitating problem. Especially after adding a fourth and fifth VM, packet loss would increase over time, until the host and guests became completely unreachable. I've added keep-alive pings to all guests and the host, which seems to be helping. Still, it hasn't completely got rid of the packet loss. It would be great to have an update!

Thierry Carrez (ttx) on 2010-08-26
Changed in qemu-kvm (Ubuntu):
status: Confirmed → Triaged

Serge Hallyn (serge-hallyn) wrote :

I'm still waiting for the build system to get around to building these, but I've
uploaded packages with the proposed fix to ppa:serge-hallyn/libvirt-fix-macaddr
for lucid and hardy. Santi, Frank, or Andrew, can any of you test with this
package and see if it solves the problem for you?

Hi Serge,

Thanks for your time!

The problem is not with the mac addresses (at least not for me). As I reported earlier:
> For me, these are not related, because I'm not modifying the bridge
> when the connectivity is lost. As long as there is steady network
> traffic (for example a ping every minute), the connectivity is not
> lost. As soon as I stop this, the connectivity will be lost within a
> few hours.

Also note that my situation is like this:
I have a host running Hardy with a bridge containing two eth devices and two tap devices. I have two kvm guests: one is Hardy as well and one is Lucid. Only the Lucid machine loses connectivity; the Hardy machine works fine. In KVM I have hardcoded the MAC addresses (because of an issue years ago where both machines could get the same MAC address).

Serge Hallyn (serge-hallyn) wrote :

Thanks, Frank. Would you mind opening a separate bug about your issue,
summarizing your network topology, how you start the VMs (the libvirt xml
or kvm command line or both), and at what point you lose connectivity?

Can Santi or Andrew (or Billy, the original poster) test the proposed package?

Andrew Klettke (aklettke) wrote :

I'll go ahead and test the package when I have a minute here; I've multiple other servers that need attention.

@Serge: I reported the bug, you can find it in: #626662

Frank

2010/8/27 Serge Hallyn <email address hidden>:
> Thanks, Frank.  Would you mind opening a separate bug about your issue,
> summarizing your network topology, how you start the VMs (the libvirt xml
> or kvm command line or both), and at what point you lose connectivity?
>
> Can Santi or Andrew (or Billy, the original poster) test the proposed
> package?
>

Thierry Carrez (ttx) on 2010-09-02
tags: added: server-mrs

Serge Hallyn (serge-hallyn) wrote :

The patch which I'm waiting on Santi or Andrew to test is in the
libvirt package in Maverick, so setting the status appropriately
(I hope) there to 'Fix Released'.

Changed in qemu-kvm (Ubuntu Maverick):
status: Triaged → Fix Released
status: Fix Released → Fix Committed

Serge Hallyn (serge-hallyn) wrote :

(ok, i give - can't seem to find a way to track this separately for Lucid and older
versus maverick)

Andrew Klettke (aklettke) wrote :

Serge,

I'm testing the new packages now, I'll let you know what I find.

Andrew Klettke (aklettke) wrote :

Serge,

No dice; the same problem occurs.

Curiously, running the /etc/netstart script in OpenBSD works to restore connectivity, which fails again after a bit.

Serge Hallyn (serge-hallyn) wrote :

I'm looking over the history of this particular bug, and there are at least
3 or 4 different bugs represented here. We should have been stricter
from the start about filing separate bugs.

@Billy Charlton

Can you tell us whether you are still having a problem? Yours appears
likely to be a bug in the bridge driver, so if you are, then I'd like
to have you test the pre-proposed kernel packages from
https://launchpad.net/~kernel-ppa/+archive/pre-proposed.

@Andrew

thanks for testing. I'd like to try a few things, but I need your bug
kept separate from the others. Can you please file a *new* bug (refuse
any suggestions at duplicates for now)? Please re-attach your
/etc/network/interfaces, the output of 'iptables -L', 'netstat -nr',
'brctl show', 'ifconfig -a' (with a VM up), 'qemu --version', 'libvirtd
--version' and 'uname -a', the contents of /etc/libvirt/qemu/networks/*, and
the xml of a sample VM that fails. If any VM does not lose
connectivity, then include that too, but I don't think any does. (I
have specific questions I had started to write out, but will reserve them
for the new bug.) I really am eager to look into this one, but need
the information segregated. Thanks!

@angriukas

Near as I can tell, the fix I've committed is actually for *your* bug
(which is still separate from Frank's, Billy's, Andrew's, and Santi's).
Ideally you would open a new bug, specify that you lose connectivity
on guest shutdown, and comment on whether the proposed fix works for
you.

(Note to self - I suspect Jurgen's is the same as Frank's)

The status of THIS bug is to be set only based on feedback from Billy.
Setting to incomplete for now.

Sergey Svishchev (svs) wrote :

FYI, there is yet another way (or two) to lose connectivity to KVM VM -- bug 579276

Serge Hallyn (serge-hallyn) wrote :

(Marking this invalid for Maverick until someone can confirm that they have the
same problem as Billy has on Maverick)

Changed in qemu-kvm (Ubuntu Lucid):
status: New → Incomplete
Changed in qemu-kvm (Ubuntu Maverick):
status: Fix Committed → Invalid

Serge Hallyn (serge-hallyn) wrote :

Yup, thanks for the link, Sergey. I don't know that this is a dup of that, but I
suspect that it is likewise a kernel bug and not a kvm one.

Andrew Klettke (aklettke) wrote :

Serge: I opened the new bug report as you requested, here:

https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/633392

Shrenik (shrenik-bhura) wrote :

Traced the problem to a malfunctioning server motherboard. Replacing the motherboard fixed several errors reported in the logs. Still monitoring closely.

Changed in opensuse:
status: New → Invalid

Tais P. Hansen (taisph) wrote :

I think I have this problem (or the problem at #633392) with a guest running on a Maverick kvm host. But only on Maverick. I'm not seeing it on any of the Lucid hosts.

The only peculiar difference I've found between the two hosts is that tcpdumping the bridge shows double externally broadcasted packets on Maverick, but not on Lucid.

Serge Hallyn (serge-hallyn) wrote :

Hi Tais,

could you take a look at bug #673705? I think that one is the same problem
you are having. I don't want to connect it to this bug because this one got too
mired down in similar-but-not-identical bugs.

Tais P. Hansen (taisph) wrote :

I've traced the double packets to be a vlan issue and as such is not related to this error.

That leaves me with a kvm guest experiencing periodical outages on a Maverick host with no apparent differences compared to a Lucid host.

Serge Hallyn (serge-hallyn) wrote :

Quoting Tais Plougmann Hansen (<email address hidden>):
> I've traced the double packets to be a vlan issue and as such is not
> related to this error.
>
> That leaves me with a kvm guest experiencing periodical outages on a
> Maverick host with no apparent differences compared to a Lucid host.

Do the outages simply happen at random, or when you shut down another
guest?

If the former, please file a new bug, and I'll ask for the relevant
information there. Thanks.

Tais P. Hansen (taisph) wrote :

I have not had the opportunity to test multiple guests on the Maverick host since I've had quite a few problems with it.

Right now it seems like the problem I'm seeing is related to a vlan bug very similar to the one reported in bug #658460. The outages appear random, but they seem to occur when there hasn't been any traffic for a while and the guest's MAC is cleared from ARP tables in the network. It may not be kvm related at all, although I've only observed the outage in a kvm guest so far.

Tais P. Hansen (taisph) wrote :

@Serge The outages appear random. I suspect the outage is related to bug #633392, while the double broadcast packets must be a vlan bug. I'm adding my findings to bug #633392.

angriukas (andrius-uskevicius) wrote :

@Serge Hallyn

Sorry for the long delay; I have just installed libvirt-fix-macaddr on our production system.
No more network interruptions have been observed on guest start/shutdown.
Thank you for the fix.

Serge Hallyn (serge-hallyn) wrote :

@angriukas

thanks, could you please open a new bug against libvirt in lucid, detailing your precise symptoms with a pointer back to this bug? Then I can use that bug to request an SRU to get the fix into lucid.

Trevor Sharpe (tsharpe) wrote :

I know this is an old bug, but I am experiencing the same issue. I am running Ubuntu 10.04.2 LTS 64-bit server and can't seem to find angriukas's fix for libvirt. Perhaps I am not looking hard enough; any help would be appreciated.

Serge Hallyn (serge-hallyn) wrote :

Hi Trevor,

thanks for commenting.

The package he was testing is at https://launchpad.net/~serge-hallyn/+archive/libvirt-fix-macaddr.

This bug seems to have slipped through the cracks! I'm going to reassign this bug to libvirt and proceed with the SRU.

description: updated
Changed in qemu-kvm (Ubuntu Lucid):
status: Incomplete → In Progress
importance: Undecided → High
Changed in qemu-kvm (Ubuntu Lucid):
assignee: nobody → Serge Hallyn (serge-hallyn)

Derek Simkowiak wrote :

I have the same 4-second bridge freeze when using LXC containers and lxc-start or lxc-stop. (I am using a custom script to create my LXC containers.) I saw the reference to this above:

https://www.redhat.com/archives/libvir-list/2010-July/msg00450.html

I changed my custom script from this:

MACADDR="00:16:3e:"`head /dev/urandom | md5sum | sed -r 's/^(.{6}).*$/\1/; s/([0-9a-f]{2})/\1:/g; s/:$//;'`

...to this:

MACADDR="fe:16:3e:"`head /dev/urandom | md5sum | sed -r 's/^(.{6}).*$/\1/; s/([0-9a-f]{2})/\1:/g; s/:$//;'`

...and now the freezing is gone.

Since this is my own custom script, I'm not sure what package it would belong to. Just be aware that this is a fundamental "bridge" issue and not limited to libvirt, KVM. It shows up with LXC and probably any other system where you add a new MAC.
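
For anyone adapting the same workaround, here is a minimal sketch of the idea. It assumes a POSIX shell with od, tr and sed available; the function name gen_high_mac is hypothetical and is not part of LXC or libvirt. The only point is that the first octet is fe, a locally administered unicast value that normally sorts above the address of a physical NIC.

```shell
#!/bin/sh
# Hypothetical helper: print a random MAC address whose first octet is "fe".
# The "16:3e" part is kept from the script above; only the leading 00 changes.
gen_high_mac() {
    # Read 3 random bytes and format them as colon-separated hex pairs.
    suffix=$(od -An -N3 -tx1 /dev/urandom | tr -s ' ' ':' | sed 's/^://; s/:$//')
    printf 'fe:16:3e:%s\n' "$suffix"
}

MACADDR=$(gen_high_mac)
echo "$MACADDR"
```

Whether fe is high enough depends on the other ports already attached to the bridge, so it is worth checking the bridge's own address before and after starting a container.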

Serge Hallyn (serge-hallyn) wrote :

Quoting Derek Simkowiak (<email address hidden>):
> I have the same 4-second bridge freeze when using LXC containers and
> lxc-start or lxc-stop. (I am using a custom script to create my LXC
> containers.) I saw the reference to this above:
>
> https://www.redhat.com/archives/libvir-list/2010-July/msg00450.html
>
> I changed my custom script from this:
>
> MACADDR="00:16:3e:"`head /dev/urandom | md5sum | sed -r
> 's/^(.{6}).*$/\1/; s/([0-9a-f]{2})/\1:/g; s/:$//;'`
>
> ...to this:
>
> MACADDR="fe:16:3e:"`head /dev/urandom | md5sum | sed -r
> 's/^(.{6}).*$/\1/; s/([0-9a-f]{2})/\1:/g; s/:$//;'`
>
> ...and now the freezing is gone.
>
> Since this is my own custom script, I'm not sure what package it would
> belong to. Just be aware that this is a fundamental "bridge" issue and
> not limited to libvirt, KVM. It shows up with LXC and probably any
> other system where you add a new MAC.

Thanks for your comment.

Please talk with upstream lxc about how it can use it. I'm certain
it will be helpful.

That it is a general bridge property is indeed known. The fix in
this bug is, like your script, simply working around that fact.
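
To make the bridge behaviour concrete: unless an address has been set on it explicitly, a Linux bridge takes the lowest MAC address among its attached ports, so attaching a port with a lower address changes the bridge's own address. Below is a rough sketch of that comparison, assuming colon-separated MACs with two hex digits per octet. The function lowest_mac and the example addresses are hypothetical and only illustrate the ordering; this is not the kernel's code.

```shell
#!/bin/sh
# Hypothetical helper: print the lowest of the MAC addresses given as arguments.
# With fixed-width hex octets, a byte-wise (C locale) sort matches numeric order.
lowest_mac() {
    printf '%s\n' "$@" | tr 'A-F' 'a-f' | LC_ALL=C sort | head -n 1
}

# Example addresses (made up): a physical NIC and two candidate guest interfaces.
NIC="00:1e:c9:aa:bb:cc"
LOW_TAP="00:16:3e:12:34:56"
HIGH_TAP="fe:16:3e:12:34:56"

# The low tap address wins, so the bridge would switch to it.
lowest_mac "$NIC" "$LOW_TAP"     # prints 00:16:3e:12:34:56
# With the fe prefix the NIC stays lowest, so the bridge address is unchanged.
lowest_mac "$NIC" "$HIGH_TAP"    # prints 00:1e:c9:aa:bb:cc
```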

affects: qemu-kvm (Ubuntu) → libvirt (Ubuntu)
Changed in libvirt (Ubuntu):
status: Invalid → Fix Released

Accepted libvirt into lucid-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in libvirt (Ubuntu Lucid):
status: In Progress → Fix Committed
tags: added: verification-needed

Gert van Dijk (gertvdijk) wrote :

I've installed the libvirt package from lucid-proposed and this issue hasn't recurred for me! Very happy user here. :)

Right after installation it did happen again, but I had forgotten to run 'service libvirt-bin restart'. After that I started and stopped VMs more than 20 times, all running smoothly.

Martin Pitt (pitti) on 2011-05-02
tags: added: verification-done
removed: verification-needed

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.7.5-5ubuntu27.11

---------------
libvirt (0.7.5-5ubuntu27.11) lucid-proposed; urgency=low

  * add debian/patches/fix-tap-interfaces-mac-addrs.patch to prevent
    network freezes due to badly chosen tap interface macaddrs.
    (LP: #584048)
 -- Serge Hallyn <email address hidden> Wed, 20 Apr 2011 08:14:10 -0500

Changed in libvirt (Ubuntu Lucid):
status: Fix Committed → Fix Released

spidernik84 (alexander-rilik) wrote :

Just a brief comment for anyone reading this discussion: this was happening with bond interfaces as well, not only bridges. Everything that's not physical seemed to be affected.

Cheers!

Changed in qemu-kvm (Fedora):
importance: Unknown → High
status: Unknown → Fix Released