kvm virtio netdevs lose network connectivity under "enough" load

Bug #1325560 reported by Izhar ul Hassan on 2014-06-02
This bug affects 9 people
Affects            Status      Importance  Assigned to  Milestone
Linux              Incomplete  Undecided   Unassigned
libvirt            New         Undecided   Unassigned
linux (Ubuntu)                 High        Unassigned
qemu-kvm (Ubuntu)              High        Unassigned

Bug Description

Networking breaks after a while in KVM guests using virtio networking. We run data-intensive jobs on our virtual cluster (OpenStack Grizzly installed on Ubuntu 12.04 Server). The job runs fine on a single worker VM (no data transfer involved). As soon as I add more nodes, so that the workers need to exchange data, one of the worker VMs goes down. Ping responds with 'host unreachable'. Logging in via the serial console shows no problems: eth0 is up and I can ping the local host, but there is no outside connectivity. Restarting the network (/etc/init.d/networking restart) does nothing. Rebooting the machine brings it back to life.

14/06/01 18:30:06 INFO YarnClientClusterScheduler: YarnClientClusterScheduler.postStartHook done
14/06/01 18:30:06 INFO MemoryStore: ensureFreeSpace(190758) called with curMem=0, maxMem=308713881
14/06/01 18:30:06 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 186.3 KB, free 294.2 MB)
14/06/01 18:30:06 INFO FileInputFormat: Total input paths to process : 1
14/06/01 18:30:06 INFO NetworkTopology: Adding a new node: /default-rack/10.20.20.28:50010
14/06/01 18:30:06 INFO NetworkTopology: Adding a new node: /default-rack/10.20.20.23:50010
14/06/01 18:30:06 INFO SparkContext: Starting job: count at hello_spark.py:15
14/06/01 18:30:06 INFO DAGScheduler: Got job 0 (count at hello_spark.py:15) with 2 output partitions (allowLocal=false)
14/06/01 18:30:06 INFO DAGScheduler: Final stage: Stage 0 (count at hello_spark.py:15)
14/06/01 18:30:06 INFO DAGScheduler: Parents of final stage: List()
14/06/01 18:30:06 INFO DAGScheduler: Missing parents: List()
14/06/01 18:30:06 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at count at hello_spark.py:15), which has no missing parents
14/06/01 18:30:07 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[2] at count at hello_spark.py:15)
14/06/01 18:30:07 INFO YarnClientClusterScheduler: Adding task set 0.0 with 2 tasks
14/06/01 18:30:08 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://<email address hidden>:44417/user/Executor#-1352071582] with ID 1
14/06/01 18:30:08 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 1: host-10-20-20-28.novalocal (PROCESS_LOCAL)
14/06/01 18:30:08 INFO TaskSetManager: Serialized task 0.0:0 as 3123 bytes in 14 ms
14/06/01 18:30:09 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager host-10-20-20-28.novalocal:42960 with 588.8 MB RAM
14/06/01 18:30:16 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_1_0 in memory on host-10-20-20-28.novalocal:42960 (size: 308.2 MB, free: 280.7 MB)
14/06/01 18:30:17 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://<email address hidden>:58126/user/Executor#1079893974] with ID 2
14/06/01 18:30:17 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 2: host-10-20-20-23.novalocal (PROCESS_LOCAL)
14/06/01 18:30:17 INFO TaskSetManager: Serialized task 0.0:1 as 3123 bytes in 1 ms
14/06/01 18:30:17 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager host-10-20-20-23.novalocal:56776 with 588.8 MB RAM
14/06/01 18:31:20 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, host-10-20-20-28.novalocal, 42960, 0) with no recent heart beats: 55828ms exceeds 45000ms
14/06/01 18:42:23 INFO YarnClientSchedulerBackend: Executor 2 disconnected, so removing it
14/06/01 18:42:23 ERROR YarnClientClusterScheduler: Lost executor 2 on host-10-20-20-23.novalocal: remote Akka client disassociated

The same job finishes flawlessly on a single worker.
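
The WARN line in the log above carries the key numbers: the last heartbeat from executor 1 was 55828 ms old against a 45000 ms limit. For scanning longer driver logs for the same symptom, a minimal Python sketch could pull these out. This is not part of the original report; the function name `heartbeat_overruns` is hypothetical, and the regex assumes the exact message format shown above.

```python
import re

# Matches the Spark driver warning format seen in the log above.
# Assumption: other Spark versions may word this message differently.
HEARTBEAT_RE = re.compile(
    r"BlockManagerId\((\d+), ([^,]+), \d+, \d+\) "
    r"with no recent heart beats: (\d+)ms exceeds (\d+)ms"
)


def heartbeat_overruns(log_text):
    """Return (executor_id, host, overrun_ms) for each heartbeat warning."""
    results = []
    for match in HEARTBEAT_RE.finditer(log_text):
        executor_id, host, seen_ms, limit_ms = match.groups()
        # Overrun = how far past the limit the last heartbeat was.
        results.append((executor_id, host, int(seen_ms) - int(limit_ms)))
    return results


SAMPLE_LOG = (
    "14/06/01 18:31:20 WARN BlockManagerMasterActor: Removing BlockManager "
    "BlockManagerId(1, host-10-20-20-28.novalocal, 42960, 0) "
    "with no recent heart beats: 55828ms exceeds 45000ms"
)

print(heartbeat_overruns(SAMPLE_LOG))
```

A host that shows up here while its console still reports eth0 as up would match the failure described in this report.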

System Information:
==================

Description: Ubuntu 12.04.4 LTS
Release: 12.04

Linux 3.8.0-35-generic #52~precise1-Ubuntu SMP Thu Jan 30 17:24:40 UTC 2014 x86_64

libvirt-bin:
--------------
  Installed: 1.1.1-0ubuntu8~cloud2
  Candidate: 1.1.1-0ubuntu8.7~cloud1
  Version table:
     1.1.1-0ubuntu8.7~cloud1 0
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu/ precise-updates/havana/main amd64 Packages
 *** 1.1.1-0ubuntu8~cloud2 0
        100 /var/lib/dpkg/status
     0.9.8-2ubuntu17.19 0
        500 http://se.archive.ubuntu.com/ubuntu/ precise-updates/main amd64 Packages
     0.9.8-2ubuntu17.17 0
        500 http://security.ubuntu.com/ubuntu/ precise-security/main amd64 Packages
     0.9.8-2ubuntu17 0
        500 http://se.archive.ubuntu.com/ubuntu/ precise/main amd64 Packages

qemu-kvm:
---------------
  Installed: 1.5.0+dfsg-3ubuntu5~cloud0
  Candidate: 1.5.0+dfsg-3ubuntu5.4~cloud0
  Version table:
     1.5.0+dfsg-3ubuntu5.4~cloud0 0
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu/ precise-updates/havana/main amd64 Packages
 *** 1.5.0+dfsg-3ubuntu5~cloud0 0
        100 /var/lib/dpkg/status
     1.0+noroms-0ubuntu14.15 0
        500 http://se.archive.ubuntu.com/ubuntu/ precise-updates/main amd64 Packages
     1.0+noroms-0ubuntu14.14 0
        500 http://security.ubuntu.com/ubuntu/ precise-security/main amd64 Packages
     1.0+noroms-0ubuntu13 0
        500 http://se.archive.ubuntu.com/ubuntu/ precise/main amd64 Packages

XML DUMP for a VM
-----------------------------
<domain type='kvm' id='7'>
  <name>instance-000001b6</name>
  <uuid>731c2191-fa82-4a38-9f52-e48fb37e92c8</uuid>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>OpenStack Foundation</entry>
      <entry name='product'>OpenStack Nova</entry>
      <entry name='version'>2013.2.3</entry>
      <entry name='serial'>01d3d524-32eb-e011-8574-441ea15e3971</entry>
      <entry name='uuid'>731c2191-fa82-4a38-9f52-e48fb37e92c8</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-i440fx-1.5'>hvm</type>
    <boot dev='hd'/>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-model'>
    <model fallback='allow'/>
  </cpu>
  <clock offset='utc'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm-spice</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/nova/instances/731c2191-fa82-4a38-9f52-e48fb37e92c8/disk'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci0'/>
    </controller>
    <interface type='bridge'>
      <mac address='fa:16:3e:a7:de:97'/>
      <source bridge='qbr43f8d3a5-e4'/>
      <target dev='tap43f8d3a5-e4'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='file'>
      <source path='/var/lib/nova/instances/731c2191-fa82-4a38-9f52-e48fb37e92c8/console.log'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <serial type='pty'>
      <source path='/dev/pts/6'/>
      <target port='1'/>
      <alias name='serial1'/>
    </serial>
    <console type='file'>
      <source path='/var/lib/nova/instances/731c2191-fa82-4a38-9f52-e48fb37e92c8/console.log'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5904' autoport='yes' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-731c2191-fa82-4a38-9f52-e48fb37e92c8</label>
    <imagelabel>libvirt-731c2191-fa82-4a38-9f52-e48fb37e92c8</imagelabel>
  </seclabel>
</domain>
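
To check which NIC model each guest uses without reading the whole dump, something like the following Python sketch may help. It is an illustration rather than part of the original report: the function name `nic_models` is hypothetical, and the sample is a trimmed copy of the `<interface>` element from the dump above.

```python
import xml.etree.ElementTree as ET


def nic_models(domain_xml):
    """Return (mac_address, model_type) for each <interface> in a domain XML."""
    root = ET.fromstring(domain_xml)
    found = []
    for iface in root.findall("./devices/interface"):
        mac = iface.find("mac")
        model = iface.find("model")
        found.append((
            mac.get("address") if mac is not None else None,
            model.get("type") if model is not None else None,
        ))
    return found


# Trimmed from the domain XML above; only the interface element is kept.
SAMPLE_DOMAIN = """<domain type='kvm'>
  <devices>
    <interface type='bridge'>
      <mac address='fa:16:3e:a7:de:97'/>
      <source bridge='qbr43f8d3a5-e4'/>
      <target dev='tap43f8d3a5-e4'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>"""

print(nic_models(SAMPLE_DOMAIN))
```

The input could come from `virsh dumpxml <domain>` on the host; guests reporting `virtio` are the ones this bug concerns.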

I am reporting this for Spark, but it should apply to any application that involves fast data transfer between VMs. The bug has been reported in the CentOS forums as well.

http://bugs.centos.org/view.php?id=5526

and an older bug report on launchpad:
https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978?comments=all
---
ApportVersion: 2.0.1-0ubuntu17.6
Architecture: amd64
DistroRelease: Ubuntu 12.04
InstallationMedia: Ubuntu-Server 12.04.3 LTS "Precise Pangolin" - Release amd64 (20130820.2)
MarkForUpload: True
Package: qemu-kvm 1.5.0+dfsg-3ubuntu5~cloud0
PackageArchitecture: amd64
ProcVersionSignature: Ubuntu 3.8.0-29.42~precise1-generic 3.8.13.5
Tags: precise third-party-packages
Uname: Linux 3.8.0-29-generic x86_64
UnreportableReason: This is not an official Ubuntu package. Please remove any third party package and try again.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip libvirtd lpadmin plugdev sambashare sudo

Izhar ul Hassan (ezhaar) on 2014-06-02
no longer affects: qemu

apport information

tags: added: apport-collected precise third-party-packages
description: updated

Thanks for reporting this bug. When you say "reboot the machine and it comes
alive again", do you mean that a soft reboot from inside the vm guest (i.e. not
restarting qemu at all) brings the guest's network back up?

 status: incomplete
 importance high

Changed in qemu-kvm (Ubuntu):
importance: Undecided → High
status: New → Incomplete

@serge

Yes. That's correct.

$ virsh reboot instance-name

or a soft reboot from inside the VM (while logged in through the serial console) does the trick. The VM is back online and accessible. And we can reproduce the problem very easily.

Izhar ul Hassan (ezhaar) wrote :

I changed the VM interface from virtio to e1000, and with that I do not get this problem; the job finishes perfectly fine. e1000 may not be the best solution, but at least it doesn't break my setup.

Thanks - both of those seem to suggest there is a bug in the virtio
driver in the guest kernel. Are the guests in both cases on the
same release and same kernel?

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1325560

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux:
status: New → Incomplete

Yes, I've used the same virtual machines in both cases. Everything is the same except the driver. I switched from virtio to e1000:

 glance image-update --property hw_vif_model=e1000 <image-id>

Now I relaunch the virtual cluster and everything works perfectly fine. I've tested with 5 times the load and it doesn't crash anymore.

I have tested the same cluster configuration with virtio on CentOS 6.5 (kernel 2.6.32-431.17.1.el6.x86_64 #1 SMP Wed May 7 23:32:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux) and that crashes too.
I will test the CentOS version with e1000 and see if that works.
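
For setups that use plain libvirt rather than OpenStack, the same virtio-to-e1000 workaround might be applied by editing the domain XML directly. The Python sketch below only rewrites the `<model type=...>` attribute of each interface. It is not what the reporter ran: the function name `set_nic_model` and the sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def set_nic_model(domain_xml, new_model="e1000"):
    """Return the domain XML with every interface model type replaced."""
    root = ET.fromstring(domain_xml)
    # Only <interface> children are touched; disks and the balloon
    # device keep whatever bus or model they already had.
    for model in root.findall("./devices/interface/model"):
        model.set("type", new_model)
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal domain, modelled on the dump in the description.
SAMPLE_GUEST = """<domain type='kvm'>
  <devices>
    <interface type='bridge'>
      <mac address='fa:16:3e:a7:de:97'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>"""

print(set_nic_model(SAMPLE_GUEST))
```

The rewritten XML would still have to be loaded back into libvirt (for example with `virsh define`) and the guest restarted before the new NIC model takes effect.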

Izhar ul Hassan (ezhaar) wrote :

@Brad
I have already sent the apport-collect from the host machine.

I am not sure if I could apport-collect from the guest vm because on the guest I cannot access any network unless I restart the VM.

summary: - kvm vm loses network connectivity under "enough" load
+ kvm virtio netdevs lose network connectivity under "enough" load

Changed in linux (Ubuntu):
importance: Undecided → High

Matt Symonds (msymonds) wrote :

I am seeing a similar issue, but instead of the network breaking I get highly variable latency. It's as if the VM is pausing.

This only happens with virtio networking. Switching to e1000 fixes the issue.

Ubuntu 14.04 with 3.13.0-30-generic on both the host machines and the VMs, with bridge networking.

64 bytes from 10.3.0.2: icmp_seq=59 ttl=64 time=0.717 ms
64 bytes from 10.3.0.2: icmp_seq=60 ttl=64 time=0.706 ms
64 bytes from 10.3.0.2: icmp_seq=61 ttl=64 time=0.454 ms
64 bytes from 10.3.0.2: icmp_seq=62 ttl=64 time=0.635 ms
64 bytes from 10.3.0.2: icmp_seq=63 ttl=64 time=0.707 ms
64 bytes from 10.3.0.2: icmp_seq=64 ttl=64 time=2333 ms # Starts here
64 bytes from 10.3.0.2: icmp_seq=65 ttl=64 time=856 ms
64 bytes from 10.3.0.2: icmp_seq=66 ttl=64 time=350 ms
64 bytes from 10.3.0.2: icmp_seq=67 ttl=64 time=80.1 ms
64 bytes from 10.3.0.2: icmp_seq=68 ttl=64 time=12.5 ms
64 bytes from 10.3.0.2: icmp_seq=69 ttl=64 time=2.71 ms
64 bytes from 10.3.0.2: icmp_seq=70 ttl=64 time=1.71 ms
64 bytes from 10.3.0.2: icmp_seq=71 ttl=64 time=0.597 ms
64 bytes from 10.3.0.2: icmp_seq=72 ttl=64 time=0.729 ms
64 bytes from 10.3.0.2: icmp_seq=73 ttl=64 time=0.727 ms
64 bytes from 10.3.0.2: icmp_seq=74 ttl=64 time=0.642 ms
64 bytes from 10.3.0.2: icmp_seq=75 ttl=64 time=0.715 ms
64 bytes from 10.3.0.2: icmp_seq=76 ttl=64 time=0.715 ms
64 bytes from 10.3.0.2: icmp_seq=77 ttl=64 time=0.776 ms
64 bytes from 10.3.0.2: icmp_seq=78 ttl=64 time=0.742 ms
64 bytes from 10.3.0.2: icmp_seq=79 ttl=64 time=0.770 ms
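
To spot stalls like the one starting at icmp_seq=64 in a long ping capture, a small Python sketch along these lines could be used. It is an illustration only: the function name `latency_spikes` and the 10 ms threshold are assumptions, and the regex expects the Linux `ping` reply format shown above.

```python
import re

# Matches reply lines in the format shown above.
PING_RE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def latency_spikes(ping_output, threshold_ms=10.0):
    """Return (icmp_seq, rtt_ms) for replies slower than threshold_ms."""
    spikes = []
    for line in ping_output.splitlines():
        match = PING_RE.search(line)
        if match and float(match.group(2)) > threshold_ms:
            spikes.append((int(match.group(1)), float(match.group(2))))
    return spikes


# Excerpt from the ping output above.
SAMPLE_PING = """\
64 bytes from 10.3.0.2: icmp_seq=63 ttl=64 time=0.707 ms
64 bytes from 10.3.0.2: icmp_seq=64 ttl=64 time=2333 ms
64 bytes from 10.3.0.2: icmp_seq=65 ttl=64 time=856 ms
64 bytes from 10.3.0.2: icmp_seq=66 ttl=64 time=350 ms
64 bytes from 10.3.0.2: icmp_seq=67 ttl=64 time=80.1 ms
64 bytes from 10.3.0.2: icmp_seq=68 ttl=64 time=12.5 ms
64 bytes from 10.3.0.2: icmp_seq=69 ttl=64 time=2.71 ms
"""

for seq, rtt in latency_spikes(SAMPLE_PING):
    print(seq, rtt)
```

The decaying round-trip times after the first spike (2333, 856, 350, 80.1, 12.5 ms) are consistent with replies being queued during a stall and then drained, which fits the "VM is pausing" impression.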

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed

Alexander (shulikov-n) wrote :

Same trouble as in comment #10. In addition, there is periodic packet loss of 1-5%. I used two virtual machines; if I shut down the second machine, the first works more stably.

Ivan (i-dychenko) wrote :

I can also confirm this bug.
Periodic packet loss with the virtio driver.
Ubuntu Server 14.04 LTS as guest and host.

Matt Symonds (msymonds) wrote :

I've since upgraded to the 3.14.1 kernel that will be in Ubuntu 14.04.1, which has fixed the problem for me.

Alexander (shulikov-n) wrote :

#14: Upgraded host or virtual machine or both?

Alexander (shulikov-n) wrote :

#14: Vanilla kernel or from the Ubuntu git?

Alexander (shulikov-n) wrote :

#17: This kernel was built in April. Is a fresher version not available?

Alexander (shulikov-n) wrote :

This kernel (3.14.1) solves the trouble. Should we expect a fix in the official repositories?

Hi Matt, should these packages be upgraded in host or in kvm vm?

Thanks

Sorry, I see it now. It should be on the host.

Andreas Ntaflos (daff) wrote :

Installing 3.14.1 as per comment #17 fixed these connectivity issues for us as well, but it doesn't look like the 3.14.1 kernel made it anywhere near the 14.04.1 release. There is also no mention of this or any related bugs in https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes/ChangeSummary/14.04.1.

What else can we do except manually install 3.14.1, which is obviously not a proper solution?

Chris J Arges (arges) wrote :

I believe this is a duplicate of bug 1346917.
A test kernel has been built in comment #1 of that bug.
Please test with this; if it fixes your issue, mark this bug as a duplicate of 1346917.
This fix will most likely make it into linux 3.13.0-33.
Thanks,

I'm using Ubuntu 14.04.1 and was using kernel 3.13; after upgrading to 3.16, no more problems.

Philipp Hahn (pmhahn) wrote :

bug #1346917 only mentions the same issue after updating to the kernel of that specific bug; this issue (network problem) is completely different from that one (same page sharing on NUMA).

Hi, I'm experiencing similar issues in (x)ubuntu xenial. Is this bug still present in 16.04?

Thanks

ChristianEhrhardt (paelzer) wrote :

Hi Daniel,
given that the old issues were either fixed (in the linked kernel bug) or remain unclear (incomplete here), it is hard to say. Given the reports I've seen over the last years, I'd say this particular issue is no longer present. Actually quite the opposite: if anything, I've seen issues in e1000 that are easily resolved by switching to virtio.

So if you face an issue I'd ask you to open up a new bug (feel free to mention this old one in there), but it certainly needs new debugging.

Thanks. I'll try to debug and log as much as possible, and open a new bug report if needed.

Barry Stokes (ceisc) wrote :

Getting the same here on 17.04 host with 17.04 guest.

Tommi Aropalo (tommi-aropalo) wrote :

Hi, I have similar network connection issues on (x)ubuntu xenial. I'm not sure when this started, but coming back from holiday I noticed that connections to the VMs are lost from time to time. I do not have much load or anything. I have 3 VMs (two Ubuntu 14.04 and one Ubuntu 16.04). Ping loses anywhere from 30% to 70% of the packets. Pinging other hosts shows no packet loss. I have tried changing from virtio to e1000 without getting any better results. Pinging from a VM to the host is OK. Pinging another VM is not.

My machine is a quite old Asus Rampage III with an Intel network chip (product: 82567V-2 Gigabit Network Connection).

Tommi Aropalo (tommi-aropalo) wrote :

Hi, the problem seems to be solved (for me). The problem was that the host could not ping the guests, while the guests were able to ping the host and the world. I downgraded qemu-kvm to the previous version; that didn't help, same problem as before. I updated to the latest release; still the same problem. I removed and recreated the bridge interface, and now everything seems to be working again.
