KVM images lose connectivity with bridged network

Bug #997978 reported by Jonathan Tullett
This bug affects 61 people
Affects                    Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid       Undecided   Unassigned
qemu-kvm (Ubuntu)          Fix Released  High        Unassigned
qemu-kvm (Ubuntu) Precise  Fix Released  High        Unassigned

Bug Description

=========================================
SRU Justification:
1. Impact: networking breaks after a while in KVM guests using virtio networking
2. Development fix: The bug was fixed upstream and the fix picked up in a new
   merge.
3. Stable fix: 3 virtio patches are cherrypicked from upstream:
   a821ce5 virtio: order index/descriptor reads
   92045d8 virtio: add missing mb() on enable notification
   a281ebc virtio: add missing mb() on notification
4. Test case: Create a bridge enslaving the real NIC, and use that as the bridge
   for a kvm instance with virtio networking. See comment #44 for specific test
   case.
5. Regression potential: Should be low as several people have tested the fixed
   package under heavy load.
=========================================
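A minimal version of the test case in point 4 can be sketched as follows; the interface name, addresses, and traffic target are placeholders, and comment #44 has the authoritative steps:

```shell
# On the host: create a bridge enslaving the real NIC (eth0 and the
# 192.0.2.0/24 addresses here are placeholders).
brctl addbr br0
brctl addif br0 eth0
ip addr flush dev eth0
ip addr add 192.0.2.10/24 dev br0
ip link set br0 up
ip route add default via 192.0.2.1

# Give a guest a virtio NIC on that bridge (libvirt XML fragment):
#   <interface type='bridge'>
#     <source bridge='br0'/>
#     <model type='virtio'/>
#   </interface>

# Then drive sustained traffic through the guest and watch for the hang.
```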

System:
-----------
Dell R410 Dual processor 2.4Ghz w/16G RAM
Distributor ID: Ubuntu
Description: Ubuntu 12.04 LTS
Release: 12.04
Codename: precise

Setup:
---------
We're running 3 KVM guests, all Ubuntu 12.04 LTS using bridged networking.

From the host:
# cat /etc/network/interfaces
auto br0
iface br0 inet static
        address 212.XX.239.98
        netmask 255.255.255.240
        gateway 212.XX.239.97
        bridge_ports eth0
        bridge_fd 9
        bridge_hello 2
        bridge_maxage 12
        bridge_stp off

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr d4:ae:52:84:2d:5a
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:11278363 errors:0 dropped:3128 overruns:0 frame:0
          TX packets:14437384 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4115980743 (4.1 GB) TX bytes:5451961979 (5.4 GB)
          Interrupt:36 Memory:da000000-da012800

# ifconfig br0
br0 Link encap:Ethernet HWaddr d4:ae:52:84:2d:5a
          inet addr:212.XX.239.98 Bcast:212.XX.239.111 Mask:255.255.255.240
          inet6 addr: fe80::d6ae:52ff:fe84:2d5a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1720861 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1708622 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:210152198 (210.1 MB) TX bytes:300858508 (300.8 MB)

# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.d4ae52842d5a no eth0

I have no default network configured to autostart in libvirt as we're using bridged networking:
# virsh net-list --all
Name State Autostart
-----------------------------------------
default inactive no

# arp
Address HWtype HWaddress Flags Mask Iface
mailer03.xxxx.com ether 52:54:00:82:5f:0f C br0
mailer01.xxxx.com ether 52:54:00:d2:f7:31 C br0
mailer02.xxxx.com ether 52:54:00:d3:8f:91 C br0
dxi-gw2.xxxx.com ether 00:1a:30:2a:b1:c0 C br0

From one of the guests:
<domain type='kvm' id='4'>
  <name>mailer01</name>
  <uuid>d41d1355-84e8-ae23-e84e-227bc0231b97</uuid>
  <memory>2097152</memory>
  <currentMemory>2097152</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-1.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/dev/mapper/vg_main-mailer01--root'/>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/dev/mapper/vg_main-mailer01--swap'/>
      <target dev='hdb' bus='ide'/>
      <alias name='ide0-0-1'/>
      <address type='drive' controller='0' bus='0' unit='1'/>
    </disk>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:d2:f7:31'/>
      <source bridge='br0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/0'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/0'>
      <source path='/dev/pts/0'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-d41d1355-84e8-ae23-e84e-227bc0231b97</label>
    <imagelabel>libvirt-d41d1355-84e8-ae23-e84e-227bc0231b97</imagelabel>
  </seclabel>
</domain>

From within the guest:
# cat /etc/network/interfaces
# The primary network interface
auto eth0
iface eth0 inet static
        address 212.XX.239.100
        netmask 255.255.255.240
        network 212.XX.239.96
        broadcast 212.XX.239.111
        gateway 212.XX.239.97

# ifconfig
eth0 Link encap:Ethernet HWaddr 52:54:00:d2:f7:31
          inet addr:212.XX.239.100 Bcast:212.XX.239.111 Mask:255.255.255.240
          inet6 addr: fe80::5054:ff:fed2:f731/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:5631830 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6683416 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2027322829 (2.0 GB) TX bytes:2076698690 (2.0 GB)

A commandline which starts the KVM guest:
/usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -name mailer01 -uuid d41d1355-84e8-ae23-e84e-227bc0231b97 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/mailer01.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/dev/mapper/vg_main-mailer01--root,if=none,id=drive-ide0-0-0,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive file=/dev/mapper/vg_main-mailer01--swap,if=none,id=drive-ide0-0-1,format=raw -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -netdev tap,fd=18,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d2:f7:31,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

Problem:
------------
Periodically (at least once a day), one or more of the guests loses network connectivity. Ping responds with 'host unreachable', even from the dom0 host. Logging in via the serial console shows no problems: eth0 is up and can ping the local host, but there is no outside connectivity. Restarting the network (/etc/init.d/networking restart) does nothing; rebooting the machine brings it back to life.

I've verified there are no ARP games going on on the primary host (the ARP tables remain the same before, when it had connectivity, and after, when it doesn't).

This is a critical issue affecting production services on the latest LTS release of Ubuntu. It's similar to an issue which was 'resolved' in 10.04 but appears to have reared its ugly head again.

Changed in libvirt (Ubuntu):
importance: Undecided → High
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for reporting this bug. Does this also happen if only one of the VMs is up? Is there any pattern to the time of day or length of a VM's uptime before this happens? What does 'route -n' show before and after it happens?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

(setting status to incomplete while awaiting response)

Changed in libvirt (Ubuntu):
status: New → Incomplete
Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

This happens to one VM at a time (on Wednesday two failed, about 2 hours apart). There's no discernible pattern in terms of time of day or length of a VM's uptime.

The next time one fails (they've been stable today), I'll do a route -n and post the output. For the record, currently (with a working VM), route -n shows:

 $ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 212.XX.239.97 0.0.0.0 UG 100 0 0 eth0
212.XX.239.96 0.0.0.0 255.255.255.240 U 0 0 0 eth0

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 997978] Re: KVM images lose connectivity with bridged network

Thanks,

In order to check whether it is a qemu (perhaps virtio driver) bug or
a bug in the kernel or network utilities on the host, would you be
able to try setting up a container and checking its networking?
There are lighter-weight ways of testing this, but the simplest way
would be to:

sudo apt-get install lxc
# If having lxcbr0 bothers you, since you don't need it for this test, you
# can set LXC_AUTO=false in /etc/default/lxc and do
# "sudo stop lxc; sudo start lxc".
cat > lxc.conf << EOF
lxc.network.type=veth
lxc.network.link=br0
lxc.network.flags=up
EOF

sudo lxc-create -t ubuntu -f lxc.conf -n lxc1
sudo lxc-start -n lxc1 -d

Then log into the container's console with

sudo lxc-console -n lxc1

and, from there, periodically check the network status. If that also
loses connectivity periodically, then we know the bug is happening
below kvm.

Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

I tried building the container this morning using both your file above, and the following:
lxc.network.type=veth
lxc.network.link=br0
lxc.network.flags=up
lxc.network.ipv4=212.XX.239.103/28
lxc.network.name=eth0

# lxc-create -t ubuntu -f lxc.conf -n lxc1
debootstrap is /usr/sbin/debootstrap
Checking cache download in /var/cache/lxc/precise/rootfs-amd64 ...
Copy /var/cache/lxc/precise/rootfs-amd64 to /var/lib/lxc/lxc1/rootfs ...
Copying rootfs to /var/lib/lxc/lxc1/rootfs ...

##
# The default user is 'ubuntu' with password 'ubuntu'!
# Use the 'sudo' command to run tasks as root in the container.
##

'ubuntu' template installed
'lxc1' created

But it fails to start:

# lxc-start -n lxc1
lxc-start: failed to spawn 'lxc1'

/var/log/syslog shows:
May 12 09:47:12 dom0 kernel: [ 1107.903216] device vethHzjri2 entered promiscuous mode
May 12 09:47:12 dom0 kernel: [ 1107.905151] ADDRCONF(NETDEV_UP): vethHzjri2: link is not ready

ifconfig shows a load of virtual devices and brctl shows:

# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.0649d9be1876 no eth0
       veth4bkC47
       vethHzjri2
       vethNBwjzP
       vethZo4vwt
       vethhzluzM
       vethidQWcJ
       vethmtoeDY
       vethuPj7Qk
       vethuxztRp
       vnet1

(I've tried starting the container a few times).

I'm happy to debug this with you, but lxc isn't software I'm familiar with, unfortunately. Any ideas?

Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

It took a few days, but we've finally had a failure of VM instance 2. It died 2.5 hours ago. Logging into the dom0 host shows the ARP entry for that host as dead:

mailer02.xxxx.com (incomplete) br0

Logging into the machine itself via console:

root:~# ifconfig
eth0 Link encap:Ethernet HWaddr 52:54:00:d3:8f:91
          inet addr:212.XX.239.101 Bcast:212.XX.239.111 Mask:255.255.255.240
          inet6 addr: fe80::5054:ff:fed3:8f91/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:6893246 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8242152 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2290922658 (2.2 GB) TX bytes:3314395798 (3.3 GB)

root:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 212.XX.239.97 0.0.0.0 UG 100 0 0 eth0
212.XX.239.96 0.0.0.0 255.255.255.240 U 0 0 0 eth0

Restarting the network on the vm (/etc/init.d/networking restart) does nothing. Rebooting the VM brings it back to life.

I'm happy to try with the container again, but using the instructions provided (even with some additional research online), I can't get it running. Please advise.

Thanks.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Sorry, I didn't see your update on the 12th.

The configuration file as you showed it may also work, but it should be simpler
to use the one I posted.

Can you do
 sudo lxc-start -n lxc1 -l DEBUG -o debugout

and attach the file debugout here?

Do you have cgroups mounted? (what does 'grep cgroup /proc/self/mounts' show?)

Does the dhcp server for that network answer all requests, or only for
certain mac addresses? (Lack of dhcp response shouldn't prevent the
container from starting anyway, so shouldn't explain the problem)

Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

Hey,

There is no DHCP server for that network, which is why I set up the static IPs.

The cgroup-bin package wasn't installed, I installed that and then grep showed some output:

root:~# grep cgroup /proc/self/mounts
cgroups /sys/fs/cgroup tmpfs rw,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0

The container has now started.

I'll configure it so it's running the same software as the other VMs, and we'll see what happens over the coming days.

Thanks for your help so far.

Changed in libvirt (Ubuntu):
status: Incomplete → New
Changed in bridge-utils (Ubuntu):
status: New → Incomplete
Changed in libvirt (Ubuntu):
status: New → Incomplete
Changed in bridge-utils (Ubuntu):
importance: Undecided → High
Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

An update: had a different KVM VM die today (same symptoms/resolution as previously); the LXC instance remains working.

Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

Another update: multiple KVM VM failures over the past couple of days, the lxc-container is working without issue.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks Jonathan, sounds like the issue is definitely in qemu then.

Changed in qemu-kvm (Ubuntu):
status: New → Confirmed
Changed in bridge-utils (Ubuntu):
status: Incomplete → Invalid
Changed in qemu-kvm (Ubuntu):
importance: Undecided → High
Revision history for this message
Georg Leciejewski (vespaschorsch) wrote :

I have the same symptoms with Ubuntu 12. The network is unreachable 3-4 times a day, for about 1-2 minutes, but recovers by itself. It does not seem traffic/load related, as it also happens when there is nothing going on.

- No DHCP
- Bridged network with static IPs for guests
- At the moment only running one Ubuntu 12 VM
- Pretty bare host install with just KVM and libvirt
- No relevant output in any log files

Network

---------------------------------------
interfaces file on host:

auto lo
iface lo inet loopback

# device: eth0
auto eth0
iface eth0 inet static
  address 176.9.x.x
  broadcast 176.9.x.x
  netmask 255.255.255.192
  gateway 176.9.x.x

# default route to access subnet
up route add -net 176.9.x.x netmask 255.255.255.192 gw 176.9.x.x eth0

auto br0
iface br0 inet static
  address 176.9.x.x
  netmask 255.255.255.255
  gateway 176.9.x.x
  pointopoint 176.9.x.x
  bridge_ports eth0
  bridge_stp off
  bridge_fd 0
  bridge_maxwait 0
  up route add -host 176.9.x.x dev br0
  up route add -host 176.9.x.x dev br0

------------------------------------
interfaces in vm

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
  address 176.9.x.x
  netmask 255.255.255.192
  gateway 176.9.x.x
  pointopoint 176.9.x.x
  dns-nameservers 213.133.98.98 213.133.99.99

--------------------------------
virsh # version
Compiled against library: libvir 0.9.8
Using library: libvir 0.9.8
Using API: QEMU 0.9.8
Running hypervisor: QEMU 1.0.0

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

OK, looking back over the original description, the bug reporter has an eth0 MAC address of d4:ae:52:84:2d:5a. I'm quite certain that is at least part of the problem: the bridge will always take the lowest MAC address of any device on it, and the NICs on the VMs have lower MAC addresses. So any time a VM goes on- or offline, the bridge MAC address will change, causing network traffic to pause.

Georg, please show 'ifconfig -a' output while the VMs are running.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

If the mac address of eth0 is the cause of your problem, then a (non-ideal) workaround would be to use the stock NATed virbr0 for your VMs instead, as it won't be bridged with eth0.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Sorry, I had it backwards. The bridge takes the higher address.
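Whichever direction the selection goes, the MAC a bridge has actually adopted can be checked directly on the host; a quick sketch, using the br0 bridge from this report:

```shell
# MAC the bridge is currently using
cat /sys/class/net/br0/address

# MACs of the bridge's member ports; each entry under brif/ is a
# symlink into the port device's sysfs tree, so ../address is the
# port's own MAC. libvirt's vnetX taps get fe:54:00:... addresses.
for port in /sys/class/net/br0/brif/*; do
    printf '%s %s\n' "$(basename "$port")" "$(cat "$port/../address")"
done
```

Comparing these before and after a VM starts or stops shows whether the bridge MAC actually changed.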

Revision history for this message
bradleyd (bradleydsmith) wrote :

I have the same issue as stated above, but instead of rebooting the guests to bring them back, a quick ifdown/ifup does the trick. The network is restored after this.
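For reference, the workaround bradleyd describes, run from the stuck guest's serial console (assuming the guest's interface is eth0):

```shell
# Bounce the guest's interface instead of rebooting the whole VM:
ifdown eth0 && ifup eth0

# If ifupdown itself is wedged, the lower-level equivalent (a static
# IPv4 address survives an administrative down/up):
ip link set eth0 down && ip link set eth0 up
```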

Revision history for this message
Georg Leciejewski (vespaschorsch) wrote :

@Serge, thanks for your input. Here is the Dom0 ifconfig output, with 2 machines running.

I set up br0 myself because the machines are running at Hetzner, with special add-on IPs so the VMs can be reached from the outside (wiki.hetzner.de/index.php/KVM_mit_Nutzung_aller_IPs_-_the_easy_way). I am also investigating the problem with Hetzner support to ensure it's not one of their routers in front of the Dom0.

vnet0 and vnet1 are the bridges created by libvirt, but the VMs have static IP entries in /etc/network/interfaces and are bound to br0 via libvirt.

------------------------------

br0 Link encap:Ethernet HWaddr c8:60:00:e9:4a:2e
          inet addr:176.9.126.xx Bcast:0.0.0.0 Mask:255.255.255.255
          inet6 addr: fe80::ca60:ff:fee9:4a2e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:5852477 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4403013 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2310013511 (2.3 GB) TX bytes:2588138678 (2.5 GB)

eth0 Link encap:Ethernet HWaddr c8:60:00:e9:4a:2e
          inet addr:176.9.126.xx Bcast:176.9.126.95 Mask:255.255.255.192
          UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
          RX packets:6623242 errors:0 dropped:35190 overruns:0 frame:0
          TX packets:4668377 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2484375973 (2.4 GB) TX bytes:1715146205 (1.7 GB)
          Interrupt:17 Memory:fe500000-fe520000

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:59917 errors:0 dropped:0 overruns:0 frame:0
          TX packets:59917 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9106822 (9.1 MB) TX bytes:9106822 (9.1 MB)

vnet0 Link encap:Ethernet HWaddr fe:54:00:2e:3d:0e
          inet6 addr: fe80::fc54:ff:fe2e:3d0e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:2033457 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2971987 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:1023321831 (1.0 GB) TX bytes:1482597549 (1.4 GB)

vnet1 Link encap:Ethernet HWaddr fe:54:00:3f:2a:9c
          inet6 addr: fe80::fc54:ff:fe3f:2a9c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:124034 errors:0 dropped:0 overruns:0 frame:0
          TX packets:140178 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:27970351 (27.9 MB) TX bytes:56681695 (56.6 MB)
------------------------------------

Revision history for this message
Georg Leciejewski (vespaschorsch) wrote :

Got this wrong: vnet0 and vnet1 are not bridges; they are the network interfaces created by libvirt.

I am just reading these bug reports concerning the MAC address problems:

https://bugzilla.redhat.com/show_bug.cgi?id=571991
https://bugzilla.redhat.com/show_bug.cgi?id=583139

and so far it seems OK that br0 still has the MAC of eth0 and that the VMs' vnet interfaces both start with fe:

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Georg Leciejewski (<email address hidden>):
> @serge, thanks for your input. here Dom0 ifconfig, with 2 machines
> running.
>
> i setup br0 myself because the machines are running at hetzner, with
> special addon IP's so the vm's can be reached from the outside.
> (wiki.hetzner.de/index.php/KVM_mit_Nutzung_aller_IPs_-_the_easy_way) i
> am also evaluating the problem with hetzner support to ensure its not
> one of their routers in from of the Dom0.
>
> vnet0, vnet1 are the bridges created by libvirt, but the vm have static
> ip entries in /network/interfaces and are bound to br0 via libvirt.
>
> ------------------------------
>
> br0 Link encap:Ethernet HWaddr c8:60:00:e9:4a:2e
> inet addr:176.9.126.xx Bcast:0.0.0.0 Mask:255.255.255.255
> inet6 addr: fe80::ca60:ff:fee9:4a2e/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:5852477 errors:0 dropped:0 overruns:0 frame:0
> TX packets:4403013 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:2310013511 (2.3 GB) TX bytes:2588138678 (2.5 GB)
>
> eth0 Link encap:Ethernet HWaddr c8:60:00:e9:4a:2e
> inet addr:176.9.126.xx Bcast:176.9.126.95 Mask:255.255.255.192
> UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
> RX packets:6623242 errors:0 dropped:35190 overruns:0 frame:0
> TX packets:4668377 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:2484375973 (2.4 GB) TX bytes:1715146205 (1.7 GB)
> Interrupt:17 Memory:fe500000-fe520000

One thing I notice here is that you have an IP address on eth0, which I
assume is bridged with br0?

When I bridge eth0 to br0 using the following /etc/network/interfaces:

=========================
auto lo
iface lo inet loopback

auto br0
iface br0 inet dhcp
 bridge_ports eth0

# The primary network interface
auto eth0
iface eth0 inet manual
=========================

I get the following ifconfig -a output:

=========================
br0 Link encap:Ethernet HWaddr fa:16:3e:59:27:16
          inet addr:10.55.60.89 Bcast:10.55.60.255 Mask:255.255.255.0
          inet6 addr: fe80::f816:3eff:fe59:2716/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:223 errors:0 dropped:0 overruns:0 frame:0
          TX packets:178 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:21627 (21.6 KB) TX bytes:19555 (19.5 KB)

eth0 Link encap:Ethernet HWaddr fa:16:3e:59:27:16
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:259 errors:0 dropped:0 overruns:0 frame:0
          TX packets:178 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:26871 (26.8 KB) TX bytes:19487 (19.4 KB)
=========================

What does your /etc/network/interfaces look like?

Revision history for this message
Georg Leciejewski (vespaschorsch) wrote :

Here it is. I already posted it above with the IPs as xx:

================================
auto lo
iface lo inet loopback

# device: eth0
auto eth0
iface eth0 inet static
  address 176.9.126.79
  broadcast 176.9.126.95
  netmask 255.255.255.192
  gateway 176.9.126.65

# default route to access subnet
up route add -net 176.9.126.64 netmask 255.255.255.192 gw 176.9.126.65 eth0

auto br0
iface br0 inet static
  address 176.9.126.79
  netmask 255.255.255.255
  gateway 176.9.126.65
  pointopoint 176.9.126.65
  bridge_ports eth0
  bridge_stp off
  bridge_fd 0
  bridge_maxwait 0
  up route add -host 176.9.126.92 dev br0
  up route add -host 176.9.126.93 dev br0

======================================
and brctl show
======================================
bridge name bridge id STP enabled interfaces
br0 8000.c86000e94a2e no eth0
                              vnet0
virbr0 8000.000000000000 yes
======================================

I also ran mtr during the downtimes and it shows that packets are lost on dom0 -> .79.
I still hope it is some kind of misconfiguration, but as said before, the same network/bridge config ran without interruptions on Ubuntu 10.04 LTS.
Thanks for your patience.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Georg Leciejewski (<email address hidden>):
> here it is. already posted it above with ip's xx:
>
> ================================
> auto lo
> iface lo inet loopback
>
> # device: eth0
> auto eth0
> iface eth0 inet static

Hi,

I believe this is wrong. Could you change the eth0 bit to simply

auto eth0
iface eth0 inet manual

The fact that

> I also did an mtr during downtimes and it shows that packages are lost on dom0 -> .79

supports the idea that this change might solve your problem. (It also suggests
that yours is not the same as the original bug reporter's problem.)

Revision history for this message
Georg Leciejewski (vespaschorsch) wrote :

I tried that with no difference, but I am on another path: maybe it is related to ACPI.

Changed in libvirt (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Christian Parpart (trapni) wrote :

I am having the same issue, and gave an in-depth analysis in my report: bug 1016848

I am running Essex (2012.1) with networking in VlanManager mode, a dedicated nova-network gateway, KVM as the virt type, and Ubuntu 12.04 Precise, and I run into this incident as often as every 1-2 days, yesterday even twice within 3 hours.

One trigger may be high network I/O on the given KVM instance.

Until now, I worked around by `nova reboot $instance_name`.

bradleyd (bradleydsmith), what exactly did you mean by ifdown & ifup? The vnet%d interface or the bridge (e.g. br100)? Which interface did you re-up? In the meantime, I'd like to write a tiny daemon that runs on the hypervisor nodes, checks every N secs whether or not it can ping the KVM guest, and if not, re-ups its underlying network interface.

Serge Hallyn, I'd like to assist with whatever you need to get this beast fixed, since for me this is also a very major incident, and I just can't add more production services until I know the OpenStack stack is functioning well. So please tell me what I can provide you with. :-)

Regards,
Christian.
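The stopgap watchdog Christian describes could look roughly like this; the guest IP, tap interface, and interval are placeholders, and this only papers over the bug rather than fixing it:

```shell
#!/bin/sh
# Run on the hypervisor: ping a guest every INTERVAL seconds and, if it
# stops answering, bounce its tap interface (the ifdown/ifup workaround
# reported in earlier comments, automated).
GUEST_IP=192.0.2.100   # placeholder guest address
TAP_IF=vnet0           # placeholder tap device backing the guest NIC
INTERVAL=10

while sleep "$INTERVAL"; do
    if ! ping -c 1 -W 2 "$GUEST_IP" >/dev/null 2>&1; then
        logger "net-watchdog: $GUEST_IP unreachable, bouncing $TAP_IF"
        ip link set "$TAP_IF" down
        ip link set "$TAP_IF" up
    fi
done
```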

Revision history for this message
andrew bezella (abezella) wrote :

I believe that we're seeing the same problem with ganeti-managed KVM instances running on 12.04 utilizing a bridged network.

In an initial deployment of 8 guests (also 12.04) we had half of them drop off the network within a few hours. There is a weak correlation between high network load in the guests and the network dropping. In some but not all cases I was able, from the KVM instance's console, to ifdown/ifup its interface and bring it back online.

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Same kind of problem here:

Running Ubuntu Server 12.04 VMs on Ubuntu Server 12.04 hosts using KVM/libvirt over bridges.

Pinging my gateway from a random VM and watching packets with tcpdump on the KVM host:

ICMP is OK on my vnet -> OK on the bridge -> OK on my bond (active-backup) -> OK on my gateway (reply) -> OK on my bond -> OK on my bridge -> no packet received on my vnet!

brctl showmacs mybridge seems to be OK, showing my MAC addresses (bridge + VM).

I have to ifdown/ifup eth0 on the virtual guest to make it work again until the next time.

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Note I'm using the 3.2.0-23-virtual kernel for my VMs.

Revision history for this message
Christian Parpart (trapni) wrote :

So is it a bug in the VM's networking driver or in the hypervisor?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Georg,

have you found any more about the relation to acpi?

@Stephane,

in your case it would certainly seem to be a bug in either the guest kernel or the virtual NIC driver, as with the original bug submitter. Can you try switching to a different virtual NIC type, e.g.

   model type='ne2k_pci'

Also, if it is possible for you to run a test on a quantal host, which has a much newer qemu-kvm, that would be interesting. Can you tell me in numbers how heavy traffic needs to be before the VM drops out? Is it traffic to the VM which freezes that VM, or does any traffic to the host or any VM threaten to freeze any VM?

@Christian,

several people have piped in on this bug. I'm not certain about yours, but this bug in particular is in qemu itself (or perhaps, though unlikely, the kernel).
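To try the different NIC model Serge suggests above, the interface model in the guest XML can be switched and the guest restarted; a sketch (the domain name from this report and the e1000 choice are just examples):

```shell
virsh edit mailer01     # change <model type='virtio'/> to e.g. <model type='e1000'/>
virsh shutdown mailer01 # wait for the guest to power off cleanly
virsh start mailer01    # the new NIC model takes effect on this boot
```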

Revision history for this message
Stephane Neveu (stefneveu) wrote :

@Serge

I upgraded my kernels yesterday evening on both hypervisors and guests, so now I'm running 3.2.0-25-generic on the hosts and the same kernel version (but -virtual) for the VMs. Same problem this morning: some VMs dropped out. I'm actually using virtio everywhere. I'll try to test another model and keep you posted asap.
I also tried running a VM with the generic kernel (not the virtual one) and I'm facing the same issue.
Serge, as I'm building this new platform I can say that there is no traffic at all on my VMs except two guys testing some Java stuff. I'm almost thinking that the VMs drop out when there is not enough traffic on the NICs...

I'll keep you in touch.

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Serge,

ne2k_pci seems to drop the link when generating some traffic (only tested on 4 VMs).
e1000 seems to have the same problem as virtio, dropping connections without traffic...
What else may I try? Is it really a driver issue?

Revision history for this message
Stephane Neveu (stefneveu) wrote :

I've noticed I never had such a problem on one host running 3.2.0-24-generic... on this one, my VMs have only 2 vnets per VM, whereas on the others I have at least 4 vnets per VM.

Is there a tap generation limit somewhere? (I don't think so; I do not see such a thing in sysctl -a.)
I'll try downgrading my kernel on one buggy host while waiting for some ideas...

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Stephane,

can you show your host network configuration and a VM's XML? Are you bridging the VM NICs with the host's eth0 over your own br0, or are you using the default NATed virbr0? Does this happen even when only a single VM is up?

I will try to reproduce next week.

(For host network configuration, the results of :
   sudo ifconfig -a
   sudo brctl show
   netstat -nr
   virsh net-dumpxml
should suffice, and 'virsh dumpxml VM1' to show the xml configuration for a guest)

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Serge,

I'm using bond0 (active-backup with eth0/eth4), then tagging VLANs with bond0.XXXX and linking bond0.XXXX into a bridge... then I do the same with bond1 (eth1/eth5), and so on up to bond3.
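For readers unfamiliar with that layering, here is a minimal sketch of what it looks like in /etc/network/interfaces. The interface names, VLAN ID, and addresses below are hypothetical, not Stephane's actual configuration:

```
# Hypothetical sketch: active-backup bond, VLAN tag on top, bridge on top of that.
auto bond0
iface bond0 inet manual
    bond-slaves eth0 eth4
    bond-mode active-backup
    bond-miimon 100

# VLAN sub-interface on the bond (VLAN ID 1234 is made up)
auto bond0.1234
iface bond0.1234 inet manual
    vlan-raw-device bond0

# Bridge enslaving the tagged interface; guest vnet devices attach here
auto bridge1
iface bridge1 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    bridge_ports bond0.1234
    bridge_stp off
```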

Here is a dumpxml example:

<domain type='kvm' id='5'>
  <name>myguest1</name>
  <uuid>cc31a6e0-267c-4470-bcd7-8a92755a85cd</uuid>
  <memory>2097152</memory>
  <currentMemory>2097152</currentMemory>
  <vcpu>2</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.14'>hvm</type>
    <boot dev='hd'/>
    <bootmenu enable='no'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/vm/disques//myguest1.qcow2'/>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:a1:3d:dc'/>
      <source bridge='bridge1'/>
      <target dev='vnet16'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:3b:81:78'/>
      <source bridge='bridge2'/>
      <target dev='vnet17'/>
      <model type='virtio'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:3d:96:57'/>
      <source bridge='bridge3'/>
      <target dev='vnet18'/>
      <model type='virtio'/>
      <alias name='net2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:10:2e:f1'/>
      <source bridge='bridge4'/>
      <target dev='vnet19'/>
      <model type='virtio'/>
      <alias name='net3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/6'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/6'>
      <source path='/dev/pts/6'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5904' autoport='yes'/>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-cc31a6e0-267c-4470-bcd7-8a9...


Revision history for this message
Alex Dioso (adioso) wrote :

This bug affects me as well. Is it related to the one discussed at https://bugzilla.kernel.org/show_bug.cgi?id=42829 ?

I will try using the vhost_net driver in the host and vhost=on as a guest parameter to see if it bypasses the issue.

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Alex,

Thanks for the link, I'll also try to test with :

<driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='off'/>

Does it work better for you with vhost_net ?

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Ok, just in case it may help...

It does work after modprobing vhost_net and adding:
<driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='off'/>
to every NIC definition.
Tested on more than 100 VMs.

Revision history for this message
Christian Parpart (trapni) wrote :

Stephane, yeah, as a workaround (not to say: the better way), modprobing the "vhost_net" driver before actually starting the VMs works perfectly. No incidents for 4 days now (tested on 30+ VMs).

But what is your driver tag for? I did not need it; my libvirt-bin added the vhost=on parameter to qemu-kvm automatically - so what do I need that line for?

Regards,
Christian.

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Christian,

I'm not really sure whether just enabling vhost_net is enough... you may well be right: it has worked for you for 4 days...
Since I still don't understand where the bug is located, I preferred to do everything that might fix it quickly.

Reading the bugzilla (https://bugzilla.kernel.org/show_bug.cgi?id=42829), they were also talking about event_idx='off'.
It seems to be patched now (though I'm not certain):
http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commitdiff;h=4b727361f0bc7ee7378298941066d8aa15023ffb;hp=e1ac50f64691de9a095ac5d73cb8ac73d3d17dba

Regards,

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Christian,

You are right, no need to add :

<driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='off'/>

in the xml ...

modprobe vhost_net should be enough.
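To make the workaround survive a reboot, the module also needs to be listed in /etc/modules. A small sketch of doing that idempotently (the ensure_module helper is made up for illustration; the file path is parameterized so it can be tried on a scratch file):

```shell
#!/bin/sh
# Sketch: persist the vhost_net workaround. ensure_module appends a module
# name to a modules file only if it is not already listed (idempotent),
# which has the same effect as hand-editing /etc/modules.
ensure_module() {
    module="$1"; file="$2"
    grep -qx "$module" "$file" 2>/dev/null || echo "$module" >> "$file"
}

# On a real host (needs root):
#   modprobe vhost_net
#   ensure_module vhost_net /etc/modules
```

libvirt then detects vhost_net on guest start and adds vhost=on itself, as confirmed in the comments below.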

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks Stephane, per comment #38 I'm marking this bug as affecting the kernel.

Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → High
tags: added: kernel-kvm
Revision history for this message
Alex Dioso (adioso) wrote :

Just to confirm Stephane in #39, all we did was modprobe vhost_net (and add it to /etc/modules) then stop and start all our VMs. libvirtd detected vhost_net and added the vhost=on parameter automatically to our VMs. So far no crashes.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

(Perhaps a test kernel should be built in ppa with the patch from comment #38)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

The proposed commit, 4b727361f0bc7ee7378298941066d8aa15023ffb: "virtio_net: fix truesize underestimation" (http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commit;h=4b727361f0bc7ee7378298941066d8aa15023ffb) is in the precise kernel already. That's unfortunately a dead end.

If someone who can reproduce this bug could either
    1. test with a quantal image on the affected hardware, or
    2. test with qemu compiled from git HEAD
        a. git clone git://git.qemu.org/qemu.git
        b. cd qemu
        c. ./configure --target-list=x86_64-softmmu
        d. cd x86_64-softmmu
        e. ./qemu-system-x86_64 <arguments>

the result should be helpful.

Revision history for this message
Eugene Nelen (enelen) wrote :

I have hit the same bug.
When a VM using the virtio-net driver receives too much network traffic, the network interface stops working.
Steps to reproduce:

On VM (with virtio-net):

% nc -k -l 0.0.0.0 4242 > /dev/null

On another machine (baremetal or VM):
% cat /dev/zero | nc IP 4242

After some time, the VM's network will stop working. With 2 listeners on the VM, I managed to reproduce this issue in about 1 hour.
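While that load test runs, a small watchdog makes it easy to timestamp the moment connectivity dies. This is a sketch; watch_net and its knobs are made up for illustration, and on a real guest the probe command would be something like 'ping -c1 -W2 <gateway>':

```shell
#!/bin/sh
# Sketch: run a probe command periodically and report when it first fails.
# Arguments: probe command, sleep interval, max iterations (0 = run forever).
watch_net() {
    probe="$1"; interval="${2:-5}"; max="${3:-0}"
    i=0
    while :; do
        if ! sh -c "$probe" >/dev/null 2>&1; then
            echo "network probe failed at $(date '+%F %T')"
            return 1
        fi
        i=$((i + 1))
        if [ "$max" -gt 0 ] && [ "$i" -ge "$max" ]; then
            break
        fi
        sleep "$interval"
    done
    echo "probe ok for $i iteration(s)"
}

# Example (hypothetical gateway address):
#   watch_net 'ping -c1 -W2 192.0.2.1' 5
```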

When I restart networking, it works again.

I am using the following packages:
linux-image_3.2.0.23.25_amd64.deb
qemu_1.0+noroms-0ubuntu13_amd64.deb

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

On quantal, I can't get such a VM to run without it automatically loading vhost_net. The following kvm command

sudo kvm -hda x.img -netdev tap,ifname=tap0,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d8:33:46,bus=pci.0,addr=0x3 -vnc :1

causes vhost_net to get loaded. (The same command doesn't quite work in precise)

I'll leave the testcase (with nc) running for a few hours on quantal in any case.

Then I'll replace my quantal install with a precise one, and try again to reproduce there.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Eugene,

I ran that test case on precise for 3.5 hours, with a virtio network bridged with eth0 and without the vhost_net kernel module loaded, but the network never hung.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Switching eth0 to be a slave of bond0 inside br0 does not help me reproduce.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ifenslave (Ubuntu):
status: New → Confirmed
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi,

could anyone who has reproduced this without bonding please comment here?

Revision history for this message
Gary Cuozzo (ua5r) wrote :

I have seen this issue on 2 different servers which use bridging but not bonding.

One server was a customer system, and we were forced to back-date the OS to an earlier release. They were experiencing the issue up to once per day and quickly got impatient to have it resolved.

The other server is an internal system which runs multiple VMs. We have only seen the issue on one of the VMs, and only once every 2-3 weeks. The VM which experiences the issue is our LTSP server.

I have been testing a small cluster of 3 host machines which use both bonding and bridging. I have not seen this issue affect them, but the usage is quite light and the VMs come and go since it's a testing environment right now. Due to this bug, we have halted any plans to upgrade VM hosts to Precise until we can verify it's fixed.

We've seen the following when the issue has occurred:
* Absolutely nothing in any logs, dmesg, etc.
* Host machine cannot ping the guest
* arp shows guest as incomplete
* guest machine can ping its own IP, but nothing else (host, gw, etc)
* restarting networking subsystem is successful (no errors) but has no effect on the problem
* rebooting the guest fixes the problem until it happens again. The reboot does not actually kill the kvm session and get a new process ID, but somehow having the guest go through the init again fixes it (until it happens again some period later).
* This issue has occurred on one 12.04 guest and one 11.10 guest
* Both of the servers this occurred on are Dell 2950 series machines. I have not seen this issue on any of our HP ProLiant machines (mostly DL360s).
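When the hang occurs, collecting the host-side state in one pass makes later comparison easier. A sketch of such a collector (the command list is a guess at what is useful for these symptoms; tools that are not installed are skipped):

```shell
#!/bin/sh
# Sketch: snapshot host networking state into one log for post-mortem diffing.
LOG="${LOG:-kvm-net-snapshot.log}"

snapshot() {
    : > "$LOG"
    # Each entry gets a header even if the tool is missing, so snapshots
    # taken on different hosts stay aligned when diffed.
    for cmd in "date" "ip addr" "ip neigh" "brctl show" "arp -n" "netstat -nr"; do
        echo "=== $cmd ===" >> "$LOG"
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            $cmd >> "$LOG" 2>&1
        else
            echo "(not installed)" >> "$LOG"
        fi
    done
}

snapshot
echo "wrote $LOG"
```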

If there is some sort of test I can run to help debug, I'm happy to do that.

Thank you for trying to address this. This is a huge bug for us.

Thanks,
gary

Revision history for this message
kraig (kamador) wrote :

I'm noticing the same behavior on 12.04 hosts with 12.04 guests. No bonding. I'm currently running a test with two identical VMs, one started after modprobing vhost_net

Revision history for this message
Thomas Vachon (vachon) wrote :

I have bonding, I have seen this both with and without vhost_net module.

Revision history for this message
Thomas Vachon (vachon) wrote :

Causes nova network connectivity to freeze and be unresponsive

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

The following from comment #51:

rebooting the guest fixes the problem until it happens again. The reboot does not actually kill the kvm session and get a new process ID, but somehow having the guest go through the init again fixes it (until it happens again some period later).

makes me suspect the guest OS (though it still could be virtual hw in qemu).

@Kraig,
 do you have a script or recipe with which you can pretty reliably trigger this?

Revision history for this message
kraig (kamador) wrote :

Nothing concrete or distributable, unfortunately. I was able to reproduce it with an infinitely long bidirectional iperf, but right now I am testing our internal web services under production-level loads. I've yet to reproduce the problem after loading the vhost_net module, but I'll need more time before I feel confident that it works.

Revision history for this message
andrew bezella (abezella) wrote :

We had 12.04 VMs experiencing these hangs regularly (sometimes only staying online for hours). After enabling vhost_net there was a period of weeks without issue, but a few days ago one of them hung again with the same (or very similar) symptoms. An ifdown/ifup from within the guest brought it back on the network.

Revision history for this message
Sergio Rubio (rubiojr) wrote :

There's a quite interesting thread in the openstack mailing list. May not be related to this bug but I guess it's worth investigating in any case:

http://markmail.org/message/xrvipkn2pvln2qty

Revision history for this message
Gary Cuozzo (ua5r) wrote :

Regarding comment #55:
I don't believe this is a guest OS issue. In my comment #51 I also indicate that I have had non-precise vm's (specifically 11.10) experience the same issue.

I have only experienced this issue when using 12.04 precise as the host OS. On a customer server which I back-dated by reinstalling 11.10 as the host OS, the issue went away.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Gary Cuozzo (<email address hidden>):
> Regarding comment #55:
> I don't believe this is a guest OS issue. In my comment #51 I also indicate that I have had non-precise vm's (specifically 11.10) experience the same issue.
>
> I have only experienced this issue when using 12.04 precise as the host
> OS. On a customer server which I back-dated by reinstalling 11.10 as
> the host OS, the issue went away.

Ah, thanks for that info.

If I push an updated version of qemu-kvm to ppa:ubuntu-virt/ppa, would you
(would anyone) be able to test it to confirm whether it fixes the issue?
If so then we could be sure the problem is in qemu userspace.

Revision history for this message
kraig (kamador) wrote :

I can test that today.

Revision history for this message
Gary Cuozzo (ua5r) wrote :

Hi Serge,
I would be willing to install your ppa packages. The only issue on my end is that this problem only occurs every 3-4 weeks for me. So I don't think I would be able to give any sort of concrete feedback for whether it addresses the issue or not.

My customer's server was having the issue several times per week, but I had to take action pretty quickly to keep them happy and chose to backdate to 11.10. After 9 days, they have not had the issue, which lends more evidence that it is not a guest-based issue but host-based.

Revision history for this message
andrew bezella (abezella) wrote :

As a counter-example, our original deployment of eight 12.04 guests on 12.04 hosts was dropping off the network regularly. When the guests were reinstalled as 10.04 (on the same hosts, with the same workload) the problem did not recur.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks, Gary, I've pushed a test package, with version 0.9.13-0ubuntu7ppa1, to ppa:ubuntu-virt/backport. Once it builds, you can install it with

sudo add-apt-repository ppa:ubuntu-virt/backport
sudo apt-get update
sudo apt-get -y dist-upgrade

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Sorry. Obviously that needs to be qemu-kvm. I'll add that in a few minutes.

Revision history for this message
Thomas Vachon (vachon) wrote :

I will test too

Revision history for this message
Thomas Vachon (vachon) wrote :

Your ubuntu-virt PPA is actually coming up with no candidate package. Did this already get merged into mainline?

Revision history for this message
Soren Hansen (soren) wrote :

Anecdotal evidence[1] suggests that this is a problem with the driver in the guest. It would be interesting to learn when this problem appeared and if it's gone with Quantal guests.

[1]: http://lists.openstack.org/pipermail/openstack-operators/2012-August/001921.html

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Thomas,

I'm sorry, that should be 'ppa:ubuntu-virt/backports' (plural, not singular).

Revision history for this message
Metin Akat (akat-metin) wrote :

This bug also affects me.
I run several VMs on the same host, a mix of Ubuntu 11.10 and 12.04. The problem occurs regardless of guest OS.
I was experiencing the problem for months, though it occurred rarely and I didn't look for a solution. Recently it started happening every day (sometimes twice a day), because we started using the machines in question more heavily.

I can easily confirm that this happens on machines with high network traffic. It never occurs on ones with lower traffic, and it always occurs after a machine's network traffic increases for some reason (services on it start being used more frequently, etc.)

Now I'll install Serge's PPA and we'll see what happens. Please, correct me if I'm wrong, but as far as I understood, I only need to install that on the host, right?

Revision history for this message
Thomas Vachon (vachon) wrote :

I think the bug is in the guest code. Did you try 10.04 LTS?

Revision history for this message
Metin Akat (akat-metin) wrote :

No. I might try it later today (as a guest).

I installed the PPA on the host. From what I can see, the situation is even worse: I've had several failures in the first 2 hours since.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Metin Akat (<email address hidden>):
> No I might try it later today (as a guest)
>
> I installed the PPA to the host. From what I can see, the situation is
> even worse. Had several failures in the first 2 hours since.

Thanks, that's valuable information.

(Yes, the ppa was to be installed on the host, not the guests)

Revision history for this message
Metin Akat (akat-metin) wrote :

Today I see there is a new qemu-kvm package on the PPA. Installed. Let's see what happens.

Revision history for this message
Metin Akat (akat-metin) wrote :

So far no VMs have lost their network connectivity.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Metin,

still no incidents?

Revision history for this message
Metin Akat (akat-metin) wrote :

@Serge
Yes, still no incidents. Haven't updated the machine since.

Revision history for this message
Gary Cuozzo (ua5r) wrote :

I installed the new packages on Monday and have not had any issues. That said, I was only experiencing failures once every 3-4 weeks. So I don't think my data point will be valid for at least a few weeks.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Gary Cuozzo (<email address hidden>):
> I installed the new packages on Monday and have not had any issues.
> That said, I was only experiencing failures once every 3-4 weeks. So I
> don't think my data point will be valid for at least a few weeks.

Thanks, Gary, will wait before making assumptions.

Revision history for this message
Metin Akat (akat-metin) wrote :

I am absolutely sure that, at least for me, this is fixed. A whole working week and not a single incident. It used to be several incidents every day last week (before I updated to the PPA).

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for the feedback.

I've looked through the commit log and found no obvious single or group of commits which would solve this.

We could talk about backporting the quantal packages into precise - however the quantal packages do appear to also bring in their own regressions, for instance bug 1040033.

Let's see if the packages also fix it for Gary.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Soren has suggested that commit a281ebc11a6917fbc27e1a93bb5772cd14e241fc ('virtio: add missing mb() on notification') is a likely candidate for the fix.

Revision history for this message
Gustave Hellman (gustavehellman) wrote :

I am also seeing the problem on our Ganeti-managed KVM under Ubuntu 12.04. In our case, on the servers with higher network load, connectivity to the guest is lost, though it may take one to two days. The one server with low network traffic has never lost connectivity.

By the way, the loss-of-connectivity problem does not happen with 12.04 guests running on 11.10 hosts.

I have patched to the current levels but still see the problem. The kvm package installed is:
kvm 1:84+dfsg-0ubuntu16+1.0+noroms+0ubuntu14.1

Is there something else I should be looking for?

Thanks

Revision history for this message
Peng Yong (ppyy) wrote :

I confirm the issue.

If I upgrade to the PPA, can I go back to the official package later? How long until the PPA fix becomes an official package?

Revision history for this message
Metin Akat (akat-metin) wrote :

@Peng Yong. Yes, you can. There is a package called ppa-purge in the official repositories. Install it. It will allow you to revert to official packages at will.

Revision history for this message
Soren Hansen (soren) wrote :

As Serge says, we think we've narrowed in on the set of commits that will address this problem:

   a821ce5 virtio: order index/descriptor reads
   92045d8 virtio: add missing mb() on enable notification
   a281ebc virtio: add missing mb() on notification

I'd be happy to provide a SRU candidate in a PPA with just those patches applied if anyone is willing to test?

Serge, would it be ok for me to use the ubuntu-virt/backports ppa for this, or would you rather I create a new PPA?

Revision history for this message
kraig (kamador) wrote :

I can test that. I had to use the hardware to deal with a separate issue, but I loaded up the backported package on Friday and I have been running a test over the weekend. No problems yet.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Soren,

thanks very much. It's probably best to use a new PPA, as the backports PPA may be more generally useful, and reusing it would force unfortunate versioning games.

Just using ppa:ubuntu-virt/ppa should probably be fine too.

Revision history for this message
Soren Hansen (soren) wrote :

Packages are ready for testing in the all new ubuntu-virt/kvm-network-hang PPA.

It has a lower version number than the test package Serge posted earlier, so you'll need to disable that other PPA first, then enable this one and upgrade.

This should do the trick:

# You can skip these first two commands if you didn't test Serge's packages earlier
sudo apt-get install ppa-purge
sudo ppa-purge -p backports ubuntu-virt

sudo add-apt-repository ppa:ubuntu-virt/kvm-network-hang
sudo apt-get update
sudo apt-get install qemu-kvm

Of course, use at your own risk, etc.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Metin,

will you be able to test with the packages Soren proposed in comment #89, or is that hardware now taken?

Revision history for this message
Mark McLoughlin (markmc) wrote :

Ok, this bug seems to be about KVM networking bugs in Ubuntu. I don't see anything specific for upstream OpenStack developers to debug or fix. Marking as Invalid.

If there is some specific OpenStack upstream issue diagnosed, it's probably best to file a new bug with the details of the specific issue clearly isolated from everything else going on in this bug.

Thierry Carrez (ttx)
Changed in nova:
status: New → Invalid
Revision history for this message
Mercadolibre CloudBuilders (cloudbuilders-n) wrote :

Canonical is now aware of this issue and they are working on it

Revision history for this message
Eugene Nelen (enelen) wrote :

I updated qemu-kvm package from ppa:ubuntu-virt/kvm-network-hang.
I ran the following test to load the network on the VMs:
On VM 1:
% nc -k -l 0.0.0.0 4242 > /dev/null
On VM 2:
% cat /dev/zero | nc IP 4242
There have been no problems with these VMs since 12 Sep 2012.
So the qemu-kvm package from this PPA resolves the issue.

no longer affects: ifenslave (Ubuntu)
no longer affects: libvirt (Ubuntu)
no longer affects: bridge-utils (Ubuntu)
no longer affects: linux (Ubuntu)
Changed in qemu-kvm (Ubuntu):
status: Confirmed → Fix Released
Changed in qemu-kvm (Ubuntu Precise):
status: New → In Progress
importance: Undecided → High
description: updated
Revision history for this message
Soren Hansen (soren) wrote :

Lovely, thanks for the feedback. I've just uploaded this to precise-proposed.

Revision history for this message
Matt Hilt (mjhilt-x) wrote :

We have been testing the kvm-network-hang patch in our production setup since 9/17. We just saw our second failure. This is much improved from the multiple failures per day we were seeing, but I don't think it's fully fixed yet.

Revision history for this message
Soren Hansen (soren) wrote :

Matt, are the symptoms identical? You might be experiencing a different bug entirely.

Revision history for this message
Matt Hilt (mjhilt-x) wrote :

Soren,

We have a 12.04 based OpenStack cluster with 4 host nodes running about 30 VMs currently.
We performed the steps to add the kvm-network-hang repo and updated to the latest version on the host machines, then rebooted the instances. My understanding is that this should catch the update, since a new KVM command is run on reboot.

I caught the first failure ~12 hours after the upgrade. It had the usual symptoms: networking loss, but the VM was still up and an active VNC session was possible. I thought I might just have missed a reboot on one of the VMs, so I didn't report anything. The second failure happened yesterday, but someone else caught it and rebooted the VM. As best we can tell after the fact, it looks like the usual failure (no full hard drive, kernel panic, or anything else that got logged).

As I mentioned before, we used to see at least one failure per day, usually more. This patch has at least reduced the occurrence to a minimum. These non-deterministic bugs are hard to track down.

Revision history for this message
Gary Cuozzo (ua5r) wrote :

I don't believe just rebooting a guest causes a new KVM instance to load. As a test, I just rebooted a guest VM on a system here and the PID of the kvm process did not change. I think it's possible you are still running the old software.
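One way to check for this on Linux: /proc/<pid>/exe is a symlink to the running binary, and it gains a " (deleted)" suffix once the on-disk file has been replaced by an upgrade. A sketch (the helper name and the pgrep pattern are made up for illustration):

```shell
#!/bin/sh
# Sketch: classify a process by whether its executable was replaced on disk
# since it started (e.g. a long-running kvm process after a package upgrade).
exe_status() {
    pid="$1"
    exe=$(readlink "/proc/$pid/exe" 2>/dev/null) || { echo "unknown"; return; }
    case "$exe" in
        *' (deleted)') echo "stale"   ;;  # still running the old binary
        *)             echo "current" ;;
    esac
}

# On a real host you might iterate over kvm processes, e.g.:
#   for pid in $(pgrep -f qemu); do echo "$pid: $(exe_status "$pid")"; done
```

A "stale" result means the guest needs a cold stop/start (not just a guest-side reboot) to pick up the new qemu-kvm.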

Also, to update my data point: on my server which was experiencing issues, I rebooted the host just to make sure everything was fresh. It's been about a month and I have not experienced the failure again. I was typically going a few weeks between issues.

gary


Revision history for this message
Matt Hilt (mjhilt-x) wrote :

Running "sudo reboot" from inside the VM doesn't change the PID, but using the reboot command on the OpenStack dashboard does.
It seems some of our VMs used the former and some the latter, with a correlation between the soft reboot and the instance dying. So we'll hard-reboot the VMs, and apologize profusely for causing alarm.

Revision history for this message
Soren Hansen (soren) wrote :

Matt, no problem at all. Please be sure to report back if you encounter the issue again after the hard reboot. Thanks!

Revision history for this message
Joe T (joe-topjian-v) wrote :

Hello,

Prior to applying the qemu package in Soren's PPA, we were able to reproduce this problem within 45 minutes (on average). We're now up to 22 hours (and climbing) without an issue.

If anyone is curious, here is the test setup that we have been using with OpenStack:

---

nova boot --image cfefd40f-be71-4c93-b480-c9964689f5ce --key_name sandbox --flavor 2 dhcp-1

dhcp-1> sudo su
dhcp-1> apt-get install iperf

nova boot --image cfefd40f-be71-4c93-b480-c9964689f5ce --key_name sandbox --flavor 2 dhcp-2

dhcp-2> sudo su
dhcp-2> apt-get install iperf
dhcp-2> iperf -s

dhcp-1> iperf -c dhcp-2 -t 86400 -i 10

---

Thanks,
Joe

Revision history for this message
Adam Conrad (adconrad) wrote : Please test proposed package

Hello Jonathan, or anyone else affected,

Accepted qemu-kvm into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/qemu-kvm/1.0+noroms-0ubuntu14.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu-kvm (Ubuntu Precise):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in qemu-kvm (Ubuntu Precise):
status: Fix Committed → In Progress
Revision history for this message
Adam Conrad (adconrad) wrote :

Ignore the above automated message, the precise fix that was in the queue was superseded by a security update of the same version.

Revision history for this message
Joe T (joe-topjian-v) wrote :

Hi Adam,

Does this mean that the qemu fix for this ticket is not in -proposed yet? Or that the security update contains the fix?

Thanks,
Joe

Revision history for this message
BenKochie (ben-nerp) wrote :

As far as I can tell, the fix is still not released in -proposed.

Revision history for this message
Adam Conrad (adconrad) wrote :

The security update doesn't contain the fix, the original proposed update needs to be rebased against the security update. I may do that in a bit if Soren doesn't get there first.

Revision history for this message
Adam Conrad (adconrad) wrote :

Hello Jonathan, or anyone else affected,

Accepted qemu-kvm into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/qemu-kvm/1.0+noroms-0ubuntu14.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
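For anyone unsure what "enabling -proposed" involves, here is a minimal sketch (the component list below is an assumption; the wiki page above describes the supported procedure):

```
# /etc/apt/sources.list.d/precise-proposed.list (sketch)
deb http://archive.ubuntu.com/ubuntu precise-proposed restricted main multiverse universe
```

After `apt-get update`, requesting the pocket explicitly (`apt-get install qemu-kvm/precise-proposed`) pulls in just this package without upgrading everything else to -proposed.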

Changed in qemu-kvm (Ubuntu Precise):
status: In Progress → Fix Committed
Revision history for this message
Joe T (joe-topjian-v) wrote :

I have installed this package on two of my OpenStack compute nodes. I'll have an update in a day or so on whether this package still fixes the issue like the package from Soren's PPA did.

Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

Installed yesterday and rebooted the dom0 machine and thus all virtual machines. Will report back if there are any problems.

Revision history for this message
Joe T (joe-topjian-v) wrote :

We tested the -proposed packages yesterday and are confident that they resolve the issue. We used the test scenario described in comment #101.

Servers that have not had the updated package applied failed the test within an hour. Servers with the updated package did not fail the test.

Robert Dupont (rdupontd)
Changed in qemu-kvm (Ubuntu Precise):
status: Fix Committed → Fix Released
Changed in qemu-kvm (Ubuntu):
status: Fix Released → Fix Committed
status: Fix Committed → Fix Released
Changed in qemu-kvm (Ubuntu Precise):
status: Fix Released → Fix Committed
tags: added: verification-done
removed: verification-needed
Revision history for this message
Adam Conrad (adconrad) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu-kvm - 1.0+noroms-0ubuntu14.3

---------------
qemu-kvm (1.0+noroms-0ubuntu14.3) precise-proposed; urgency=low

  * Fix race condition in virtio code on multicore systems. (LP: #997978)
    - 9001-virtio-add-missing-mb-on-notification.patch
    - 9002-virtio-add-missing-mb-on-enable-notification.patch
    - 9003-virtio-order-index-descriptor-reads.patch
 -- Soren Hansen <email address hidden> Mon, 03 Sep 2012 10:15:54 +0200

Changed in qemu-kvm (Ubuntu Precise):
status: Fix Committed → Fix Released
Revision history for this message
David Geng (genggjh) wrote :

I got the same issue, but my host OS is RHEL 6.3 (2.6.32-220.el6.x86_64), the qemu-kvm version is 0.12.1.2, and my guest base image is Ubuntu 12.04 LTS.
My problem is:
After I enable libvirt_use_virtio_for_bridges = true in nova.conf, new instances cannot get an IP address and the gateway route is not added to the routing table.

The routing table looks like this:

--before enabling virtio:
~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.33.8     0.0.0.0         UG    100    0        0 eth0
172.17.32.0     0.0.0.0         255.255.252.0   U     0      0        0 eth0

--after enabling virtio:
~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.32.1     0.0.0.0         UG    100    0        0 eth0
172.17.32.0     0.0.0.0         255.255.252.0   U     0      0        0 eth0

Here is the dnsmasq process on my host server:
root 32034 32033 0 Oct12 ? 00:00:00 /usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file= --domain=novalocal --pid-file=/var/lib/nova/networks/nova-br100.pid --listen-address=172.17.33.8 --except-interface=lo --dhcp-range=172.17.33.3,static,120s --dhcp-lease-max=256 --dhcp-hostsfile=/var/lib/nova/networks/nova-br100.conf --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro

Soren,
Your solution is only for Ubuntu hosts, and the PPA should be installed on the host machine, right?
Is there any solution or workaround for RHEL?

Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

This bug is considered fixed for me. Not a single network glitch since installing the package from PPA. Many thanks to the development team!

Revision history for this message
kraig (kamador) wrote : Re: [Bug 997978] Re: KVM images lose connectivity with bridged network

Same here, thank you everyone!

--
Kraig Amador


Revision history for this message
BenKochie (ben-nerp) wrote :

So I'm sorry to report that after about 50 days of uptime on qemu-kvm 1.0+noroms-0ubuntu14.3, I had 3 VMs out of ~60 in my cluster drop off the network. It happened on different host machines, so it's not a single problem machine in the cluster.

Two of the nodes were restarted (full qemu shutdown/relaunch), so I didn't have a chance to debug them.

One of them I was able to console into and work on before I gave up and restarted it. The interesting thing I discovered was that the workarounds that had worked in the past did not work this time. Previously I was able to ifdown/ifup the virtual interface to restore networking. Migrating between nodes also did not restore networking.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

But otherwise it sounds like it looked the same: VM still up but its network down?

Had these VMs been running for 50 days, or was it that you hadn't seen the problem in 50 days? If the latter, could these have been on a newer kernel?

Revision history for this message
Scobo (mk-binary-artworks) wrote :

Will there be a fix for 10.04 Lucid, too?
I think this bug affects me as well: a Windows SBS 2008 guest with the virtio network driver on an Ubuntu 10.04 Lucid host loses its network connection from time to time. It's not really deterministic when the error occurs; usually it happens when a large amount of traffic is sent over the network. After disabling the network card in Windows (via the VNC console) and re-enabling it, everything works fine again, with no need to reboot the VM.
Any help is appreciated, since it's quite an annoying bug... :-)

Revision history for this message
BenKochie (ben-nerp) wrote :

I'm testing this some more, and it looks like it's still easy to reproduce on my system.

cc2ab6833adc73311a2407be2eb5f915 /usr/bin/qemu-system-x86_64

I can cause the guest to drop VM networking with an rsync+ssh from a nearby host.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@BenKochie,

could you please file a new bug with details?

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :

I've found this problem on my Saucy installation. I just opened a bug because I'm not sure if it's related.

https://bugs.launchpad.net/ubuntu/+source/core-network/+bug/1255516

My problem is actually worse, since the host machine cannot reach other machines on the network: not just virtual machines, but physical machines, routers, NAS devices, and so on.

The only way to recover from this situation is to ifdown/ifup the bridge. It then recovers until it happens again.

With the bridge removed there are no problems, but I was not able to test that for long, since I need the bridge up.

I'm investigating macvtap instead of a bridge...

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :

I have to clarify that I'm not sure this has anything to do with KVM, QEMU and the like.

I think it is more a problem in the Linux bridge driver, since it fails even if you don't have any VM running. Yes, it takes more time to fail, but that may just be because there isn't enough traffic to make it fail.

Revision history for this message
Krzysztof Janowicz (janowicz) wrote :

I think I may have the same problem running a 13.10 qemu-kvm host with 8 virtual machines. As far as I can see, only two VMs seem to have the problem. All of them use macvtap on em1 as the source device, with "bridge" as the source mode. After reading many posts here, I changed the device model from virtio to rtl8139 to see if I still lose the network connection.

Revision history for this message
Dewey McDonnell (dewey-w) wrote :

My web search reveals this problem has been around for years, and most posts conclude that it continues. After solving my tap0 "TX packets dropped" problem, I find that my VM network randomly freezes a few times each day. Many bloggers on this subject say this VM network freeze is difficult to reproduce. Not for me! I can cause a network freeze on my VM in a heartbeat. Like many others, all I need to do is start a data transfer over the bridge (an FTP transfer of a 2 KB file or larger). When the freeze happens, ifconfig tap0 shows overruns:1. This is QEMU 1.6.2 without KVM on CentOS 6.5, kernel 2.6.32-431.

Revision history for this message
Thomas Vachon (vachon) wrote :

After moving to the 3.5 kernel I haven't seen it, even at 3x the traffic that used to cause it.

Revision history for this message
Dewey McDonnell (dewey-w) wrote :

Thanks, Thomas, for your speedy reply. My CentOS kernel (uname -r) is now 3.5.0, but my tap0 lockup problem is exactly the same as with the older kernel. Anything else you would suggest?

Revision history for this message
Russell McOrmond (russell-flora) wrote :

We're running qemu-kvm 1.0+noroms-0ubuntu14.13 on a 12.04.4 LTS host with KVM-based 12.04.4 LTS virtual machines, and have observed this problem. We have been using regular software bridges on many machines, and have only noticed the problem on one of our newest servers.

Using virtio devices we get the lockups discussed here.

Using e1000 we get no lockups, but it is a much lower-performing interface, so we have performance issues. We have NFS and other traffic between the VMs and the host, so we need more than the GigE we have to external hosts.

I note above that this bug was considered fixed by qemu-kvm 1.0+noroms-0ubuntu14.3, but that appears not to be the case.

Revision history for this message
Fawad Khaliq (fawadkhaliq) wrote :

Very easily reproducible on my side.

Revision history for this message
Izhar ul Hassan (ezhaar) wrote :

Yes, I think it is safe to say that the bug is still around. The VM loses network connectivity under "enough" load. For example, I can reproduce this by running a Spark job that transfers a few gigabytes of data between worker VMs; within a minute, one of the VMs loses network connectivity. If I try to reboot the VM, it goes into an error state, and trying to delete it makes the qemu-kvm process defunct.

uname -r
3.8.0-29-generic

virsh --version
1.1.1

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@ezhaar - please open a new bug so we can collect new information. If you are on trusty, please file against qemu; otherwise file against qemu-kvm. Then mark it as also affecting libvirt and linux (the kernel). Then reproduce the bug, and immediately afterwards run 'apport-collect <bug-number>', which should collect the data for each of those packages. Please show the host network configuration, the libvirt network config if applicable, the XML dumps for the VMs, and where to get Spark.

Revision history for this message
Øyvind Jelstad (oyvind-2) wrote :

Looks like I have this problem on 12.04.1 LTS with kernel 3.2.0-67-generic #101-Ubuntu SMP Tue Jul 15 17:46:11 UTC 2014 x86_64 GNU/Linux on the host and Debian Wheezy on the guest (SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux).

The guest will occasionally lose its connection to other hosts on the LAN, and their entries in the guest's ARP table are gone. The problem seems to be that the host only sporadically forwards ARP replies back to the guest.

Dumping ARP requests (triggered by a ping from the guest) on the bridge's external (real) interface on the host catches both requests and replies, while the same dump on vnet0 misses the replies for minutes, until a reply suddenly comes through and re-establishes the connection.

I am using the virtio interface; it makes no difference if I change to e1000. The firewall policy is ACCEPT on all tables.

Revision history for this message
Martin Pajak (mpajak-r) wrote :

I faced probably the same problem installing Xen 4.3.2 with Gentoo kernel 3.12.21.

The domU's interface hangs after a short time under heavy network load (starting at ~10 MByte/s). From the outside it looks as if the instance had crashed, but deactivating and reactivating the interface, e.g. from "xl console <domU name>" with /etc/init.d/net.eth0 stop/start, restores normal operation.

After 3 days of testing/searching I found a workaround. Setting the following options with ethtool, I could successfully prevent my domU's interfaces from hanging:

ethtool --offload <network device> gso off tso off sg off gro off

This http://cloudnull.io/2012/07/xenserver-network-tuning/ led me to the solution.

I also posted my other experiences with this bridged network configuration in the Gentoo wiki: https://wiki.gentoo.org/wiki/Xen
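To make an offload workaround like the one above survive reboots, one option (a sketch assuming ifupdown and an interface named eth0; both are assumptions) is a post-up hook in /etc/network/interfaces:

```
# /etc/network/interfaces fragment (sketch): re-apply the offload
# settings every time the interface comes up.
iface eth0 inet manual
    post-up ethtool --offload eth0 gso off tso off sg off gro off
```

This re-runs the ethtool command automatically whenever the interface is brought up, so the setting is not lost on reboot or ifdown/ifup.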

Revision history for this message
Øyvind Jelstad (oyvind-2) wrote :

My problem with missing ARP replies was solved (worked around) by setting
bridge_ageing 0
for the bridge in /etc/network/interfaces, making it a hub that forwards all packets to all ports.
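The setting above can be sketched as a full bridge stanza (the bridge name br0 and member eth0 are assumptions); note that an ageing time of 0 makes the bridge flood every frame to every port, like a hub:

```
# /etc/network/interfaces fragment (sketch)
auto br0
iface br0 inet dhcp
    bridge_ports eth0
    bridge_stp off
    bridge_ageing 0
```

The runtime equivalent is `brctl setageing br0 0`, which takes effect immediately but does not persist across reboots.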

Revision history for this message
Gibbo (gibbo87) wrote :

I can report the same situation as comment #129.
Spark cluster installed with Ambari 1.7, HDP 2.2 running Zookeeper, Ganglia, HDFS and YARN.
Ubuntu 12.04

uname -r
3.2.0-67-virtual

Instances run fine until a Spark or Hadoop job is started with YARN. The job gets accepted, but then 2 slaves of the cluster are affected by the bug and lose connectivity. They are still accessible via the OpenStack web console but can't reach the network. Rebooting brings the VMs into a halt state.
Worked around with: ethtool -K eth0 tx off sg off tso off ufo off gso off gro off lro off

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Please file a new bug against the linux package, preferably (if possible) using the command 'ubuntu-bug linux'

Revision history for this message
Ramiro Varandas Jr (ramirovjnr) wrote :

If this problem is related to this one - https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/1325560?comments=all - then upgrading the kernel should fix it.

In my case I'm using Ubuntu 14.04 with kernel 3.13 and after upgrading to kernel 3.16, no more connectivity problems.

Revision history for this message
Ramiro Varandas Jr (ramirovjnr) wrote :

- Updating -

The problem came back today (the VM had been running fine for more than 72 hours) and connectivity with the gateway was lost again.

Looking at Martin's post #132, I was seeing some packets being dropped as incorrect. I applied the ethtool fix, changed the driver to e1000 as well, and am going to monitor it.

After applying the ethtool workaround, I see no more dropped packets with incorrect checksums.

Revision history for this message
Shades (initialhit) wrote :

Still seeing this on 14.04.4 LTS under enough load, or when resuming from a paused state.
