networking stalls between domU using vif

Bug #246789 reported by chris lea
Affects: xen-3.2 (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I can't get networking to work properly between any two domU when using vif on hardy 8.04.1 using xen-3.2.

I set up two domU on private IPs in 192.168.1.0/24 using network-nat and vif-nat, and observed the following behavior.

*) Networking to and from the outside world worked just fine.

*) Transferring a large file (50M) from one of the domU to dom0 worked fine using several protocols (ssh, ftp, etc).

*) Transferring a large file (50M) between the two domU stalls out after a few megs have gone through, also seen using any protocol.

As a test, I then set up several domU using public IPs, using network-bridge and vif-bridge. I saw exactly the same behavior as with the private IPs. However, with the public IPs, if I stopped using vif, then networking worked just fine. That is, if in the .cfg files for the domU I set

vif = [ '' ]

and just set up the interfaces directly in the domU then I could transfer the large file (50M) between the domU without trouble.

I should also note that if I disable tcp checksumming on the interfaces in the domU by uncommenting the line that reads

# post-up ethtool -K eth0 tx off

then I can transfer the file around. Though this causes all sorts of other things to break.

SUMMARY:

When using vif, I cannot reliably transfer files of any material size between domU using any protocol I know how to test.

Transferring files between domU and dom0, or between domU and any outside server works just fine.

I expected to be able to scp / ftp / etc files between two domU when using vif to bridge.

What I actually see is that if I'm using vif, a few megabytes will transfer and then the transfer will stall, and I see this using many different protocols.

All of this was tested on HP proliant servers using the tg3 network driver.

elventear (elventear) wrote :

I am having the same problem with Xen. Have you been able to find any workarounds?

chris lea (chris-lea) wrote : Re: [Bug 246789] Re: networking stalls between domU using vif

Man, I wish. But thus far no luck.

On Fri, Aug 22, 2008 at 11:54 PM, elventear <email address hidden> wrote:
> I am having the same problem with Xen. Have you been able to find any
> workarounds?

elventear (elventear) wrote :

I haven't been able to solve this problem, but I have at least been able to mitigate it.

Use the 'rate' directive in the configuration files for all the domUs, to limit the network rate. I also used Shorewall's QoS to limit the download traffic to a value lower than our maximum for the public interface for one of our domUs that acts as a firewall/router.

Empirically, I found that 5 Mbps is the value I needed on my system to avoid problems. To find it, I pinged a resource that slows down under heavy traffic (in our case, any public server on the internet) while running a transfer. If the ping times get worse and worse as the transfer proceeds, lower the rate setting and test again. Repeat until the ping times stay stable instead of slowly (or even rapidly) increasing.
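
The "watch the ping times" step can be made a little less subjective. The following is only a sketch, not something used in this thread: the function name rtt_trend and the 50% threshold are assumptions chosen for illustration. It reads ordinary ping output on stdin and compares the average round-trip time of the first half of the samples with that of the second half.

```shell
# Hypothetical helper: classify ping round-trip times as "rising" or "stable".
# Reads `ping` output on stdin; only lines containing "time=<ms>" are used.
rtt_trend() {
    awk -F'time=' '/time=/ { split($2, a, " "); rtt[n++] = a[1] }
        END {
            if (n < 4) { print "too-few-samples"; exit }
            half = int(n / 2)
            for (i = 0; i < half; i++) first += rtt[i]
            for (i = n - half; i < n; i++) second += rtt[i]
            # Arbitrary threshold: later pings average more than 50% slower.
            if (second / half > 1.5 * first / half) print "rising"
            else print "stable"
        }'
}

# Example (run while a large transfer is in progress; the host is a placeholder):
#   ping -c 20 example.org | rtt_trend
```

If the result is "rising", lower the rate setting and repeat the test, as described above.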

This workaround is really unacceptable, since it severely limits the performance of the virtual servers, but in my case it is what I had to do in order to keep my production system working without hiccups.

This is what you get for trying the bleeding edge, even though a higher level of reliability was promised because it is an LTS release.
:-/

Caspar Clemens Mierau (leitmedium) wrote :

I can confirm this, as I am able to reproduce it with a current Ubuntu Hardy. Scp from domU1 to domU2 drops to a speed of roughly 10 KB/s.

Changed in xen-3.2:
status: New → Confirmed

elventear (elventear) wrote :

I have corresponded with one of the Xen devs and he suspects the problem is in the netfront module. He reviewed the netback module code and it seemed right to him.

I would hope that Ubuntu devotes some resources to this issue, since it is hitting hard some people who have deployed Ubuntu LTS 8.04 as a production system.

Caspar Clemens Mierau (leitmedium) wrote :

Hi elventear,

could you be more specific, please? What exactly did the Xen developer say? Which code exactly did he review, and did he say "yes, the code is the failure" or "yes, the code is right"? Sorry for the questions, but getting as much information as possible into this ticket will make it much easier to fix this issue.

Thank you for taking the time to make Ubuntu better.

elventear (elventear) wrote :

He reviewed the netback.c code in the latest stable Ubuntu 8.04 kernel
that I pulled for him. This is what he said:

> Hmmm... It has the patches in that I suspected might be missing. I'm at a
> bit of a loss then. I guess I'll diff against our current netback.c and see
> if there's anything obvious different. Of course the error could be on the
> netfront side. :-(

Probably it would be a good idea if the Ubuntu team initiated contact
on the xen-dev list.

Caspar Clemens Mierau (leitmedium) wrote :

I have found a workaround. Please do the following on both domU and dom0:

$ aptitude install ethtool
$ ethtool -K eth0 tx off

I found the hint on

http://lists.us.dell.com/pipermail/linux-poweredge/2008-January/034372.html

Let me know if that works for you.

elventear (elventear) wrote :

On Oct 8, 2008, at 3:59 AM, Caspar Clemens Mierau wrote:

> $ aptitude install ethtool
> $ ethtool -K eth0 tx off

That doesn't work for me.

chris lea (chris-lea) wrote :

Yes, I noted originally that this does allow you to move big files around. However, it causes all sorts of other things to break, since TCP checksumming is needed to make sure all the packets are correct. So the ethtool trick doesn't really leave you with a usable system.

elventear (elventear) wrote :

Kernel 2.6.24-21-xen seems to improve things a little bit. I can get a higher throughput than before without the network stalling, but definitely when I reach traffic around 6Mbps the network still dies.

Has anything changed in that revision that might have improved things?

Jürgen Hammelmann (j-hammelmann) wrote :

I have a similar (or the same) error with a Lenovo S10e and Ubuntu 8.10, which uses the tg3 driver for the Fast Ethernet device. Connecting to the netbook with ssh and starting Firefox or another big application in that shell produces the error "Disconnecting: Corrupted MAC on input." and I'm disconnected from the S10e. In the other direction, starting apps from the S10e on other machines through an ssh shell produces no error.

I have found a fix: after calling 'sudo ethtool -K eth0 tx off rx off', no more errors are produced!

chris lea (chris-lea) wrote :

Jürgen -

Yes, but as noted, doing the

ethtool -K eth0 tx off rx off

trick turns off tcp checksumming, which then causes all manner of other things to break horribly.

I find it pretty sad that this bug is still here with 8.10 out. This makes Xen basically unusable for a lot of situations. :(

Jürgen Hammelmann (j-hammelmann) wrote :

Hi all and Chris,

I don't use Xen; the tg3 driver or something else produces this error too.
Yesterday I found out that disabling AND re-enabling checksum offloading with
'sudo ethtool -K eth0 tx off rx off; sudo ethtool -K eth0 tx on rx on' fixes the error, too!

Another notebook (HP 530) with a different Ethernet driver and the same distribution (Ubuntu 8.10)
doesn't have these problems.

Ciao

elventear (elventear) wrote :

On Jan 27, 2009, at 7:18 PM, chris lea wrote:

> I find it pretty sad that this bug is still here with 8.10 out. This
> makes Xen basically unusable for a lot of situations. :(

It is such a shame that Ubuntu has decided not to support Xen at all,
especially since it is currently the best OSS VM out there. I would have
expected them to at least keep up the commitment to the 8.04 packages ...

Has anybody installed Xen 3.3 on Ubuntu 8.04, with the approved Xen
kernel for dom0? My guess is that this is the only solution ...

Pepe

Todd Deshane (deshantm) wrote :

http://bderzhavets.wordpress.com/2008/11/13/backport-intrepid-xen-33-hypervisor-at-ubuntu-hardy-dom0-2624-21-xen/

elventear (elventear) wrote :

On Jan 29, 2009, at 9:29 AM, Todd Deshane wrote:

> http://bderzhavets.wordpress.com/2008/11/13/backport-intrepid-xen-33-hypervisor-at-ubuntu-hardy-dom0-2624-21-xen/

I've seen that already. The upgrade was almost painless. xen-utils-3.2
needs to be removed before doing the upgrade, or there will be an issue
with a mismatched libxen3.

Yet, I was wondering more about the Xen blessed dom0 Kernel. That
needs to be manually compiled, and I was wondering if someone else
might have done that.

Pepe

Kevin Elliott (kevin-phunc) wrote :

This is still a severe problem for me as well. I'm surprised this has not yet been resolved anywhere. When deploying code to an Ubuntu VM running on a CentOS 5.3 Xen server, this is the error I consistently get:

 ** [blah.com :: out] remote: Compressing objects: 94% (756/804)
 ** [blah.com :: out] remote: Compressing objects: 95% (764/804)
 ** [blah.com :: out] remote: Compressing objects: 96% (772/804)
 ** [blah.com :: out] remote: Compressing objects: 97% (780/804)
 ** [blah.com :: out] remote: Compressing objects: 98% (788/804)
 ** [blah.com :: out] remote: Compressing objects: 99% (796/804)
 ** [blah.com :: out] remote: Compressing objects: 100% (804/804)
 ** [blah.com :: out] remote: Compressing objects: 100% (804/804), done.
 ** [blah.com :: out] Disconnecting: Corrupted MAC on input.
 ** [blah.com :: out] fatal: The remote end hung up unexpectedly
 ** [blah.com :: out] fatal: early EOF
 ** [blah.com :: out] fatal: index-pack failed

It exists under these conditions and versions:

Host OS
  CentOS 5.3 64-bit

Xen Kernel
  2.6.18-128.1.10.el5xen #1 SMP Thu May 7 11:07:18 EDT 2009 x86_64

Ethernet from Host OS's dmesg
  eth0: Tigon3 [partno(BCM95787) rev b002 PHY(5787)] (PCI Express) 10/100/1000Base-T Ethernet 00:19:99:30:f5:f5
  eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
  eth0: dma_rwctrl[76180000] dma_mask[64-bit]
  tg3: eth0: Link is up at 100 Mbps, full duplex.
  tg3: eth0: Flow control is off for TX and off for RX.

Guest OS
  Ubuntu 8.10

Guest Kernel (inherits from dom0)
  2.6.18-128.1.10.el5xen #1 SMP Thu May 7 11:07:18 EDT 2009 x86_64

Changing the rx/tx checksum settings inside the guest does nothing to change the situation.
Disabling rx/tx checksums on the host OS does remedy the problem (likely adding other problems to all VMs).

If I run "ethtool -K eth0 tx off rx off" on the host, the problem goes away. If I then start a git deployment to the VM and, in the middle of it, run "ethtool -K eth0 tx on rx on", the git operation is interrupted with the error above.

I am not yet convinced it's only on Ubuntu VMs that this occurs (but I have yet to attempt this on a non-Ubuntu VM).

Any further progress on this would be valuable to me.

-Kevin

agent 8131 (agent-8131) wrote :

This is probably a duplicate of: https://bugs.launchpad.net/ubuntu/+source/xen-3.3/+bug/154271

I always disable tx checksumming on Xen domUs to avoid this problem. I would be curious to hear from chris lea what exactly breaks when doing this. I have never noticed any problems with incorrect packets being sent out. My understanding is that with tx checksumming off, only the hardware assistance is disabled and tx checksumming is then handled in software (by the kernel, I assume).

If you want to see whether you are affected by bug 154271, enable tx checksumming on the domU, run tcpdump during a large network transfer, and look for packets larger than 1500 bytes being sent.
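
One way to narrow the tcpdump output to just the oversized packets is a capture filter on the IP total-length field. This is a sketch under assumptions: the interface name eth0, the 1500-byte MTU, and the 20-packet limit are illustrative defaults, and the RUN_CAPTURE switch is a hypothetical guard added here so the script only prints the command unless asked to run it.

```shell
# Assumed defaults; override via the environment if your setup differs.
IFACE="${IFACE:-eth0}"
MTU="${MTU:-1500}"

# Filter: IPv4 packets whose IP total-length field (bytes 2-3 of the
# IP header) is larger than the MTU.
FILTER="ip and ip[2:2] > ${MTU}"

if [ "${RUN_CAPTURE:-0}" = "1" ]; then
    # Needs root; stops after 20 matching packets.
    tcpdump -n -i "$IFACE" -c 20 "$FILTER"
else
    # Default: only print the command so it can be reviewed first.
    echo "tcpdump -n -i $IFACE -c 20 '$FILTER'"
fi
```

Run it on the sending domU while the large transfer is in progress; any packets printed are larger than the assumed MTU.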

chris lea (chris-lea) wrote :

Man, it's been a while since I was actually messing with this, but a LOT of things stopped working with the checksumming off, IIRC. For starters, while I could scp or ftp a file in and out, the md5 and sha1 sums would frequently not match when comparing the source and destination files. MySQL replication was essentially unusable between two domU servers. And I also recall that my Gearman nodes would sometimes just drop out as far as the Gearman tracker was concerned, even though they were still running.

I've since just moved on and have been using OpenVZ, which "just worked" with respect to using internal private IPs.
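
The checksum comparison mentioned above is easy to script when testing a workaround. A minimal sketch, assuming both copies are reachable as local paths (for example after copying the file back); the function name same_sha1 is made up for this example.

```shell
# Hypothetical helper: succeed (exit 0) only if both files have the same
# SHA-1 digest. Exits 2 if either file cannot be read.
same_sha1() {
    a=$(sha1sum < "$1") || return 2
    b=$(sha1sum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Example (paths are placeholders):
#   same_sha1 /tmp/source.bin /tmp/received.bin && echo "transfer intact"
```

For files on two different domU, running sha1sum on each side and comparing the digests by eye does the same job.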

Marius Muja (mariusmuja) wrote :

For me disabling the tcp segmentation offload made the transfer not stall any more:

ethtool -K eth0 tso off
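
Several offload settings have been tried in this thread (tx, rx, tso). To test them one at a time, a small wrapper can help. This is only a sketch: the function name set_offload and the APPLY switch are assumptions introduced here, and by default the function just prints the ethtool command instead of running it.

```shell
# Hypothetical helper: print (default) or run one `ethtool -K` offload change.
# Usage: set_offload <interface> <tx|rx|tso> <on|off>
set_offload() {
    iface=$1 feature=$2 state=$3
    case "$feature" in
        tx|rx|tso) ;;
        *) echo "unknown feature: $feature" >&2; return 2 ;;
    esac
    case "$state" in
        on|off) ;;
        *) echo "state must be on or off" >&2; return 2 ;;
    esac
    if [ "${APPLY:-0}" = "1" ]; then
        ethtool -K "$iface" "$feature" "$state"   # needs root
    else
        echo "ethtool -K $iface $feature $state"
    fi
}

# Try the least invasive change first: segmentation offload only.
set_offload eth0 tso off
```

The current settings can be inspected with 'ethtool -k eth0' (lowercase k) before and after each change.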
