Directional network performance issues with Neutron + OpenvSwitch

Bug #1252900 reported by Thiago Martins
This bug affects 16 people
Affects             Status        Importance   Assigned to        Milestone
neutron             Won't Fix     Undecided    Unassigned
openstack-manuals   Fix Released  Undecided    Darragh O'Reilly
openvswitch         New           Undecided    Unassigned
Ubuntu              Confirmed     Undecided    Unassigned

Bug Description

Hello!

Currently, the Havana L3 Router has a serious issue which makes it almost useless (sorry, I do not want to be rude; I am just trying to bring more attention to this problem).

When tenant network traffic passes through the L3 Router (a namespace at the Network Node), it becomes very, very slow and intermittent. The issue also affects traffic that hits a "Floating IP" going into the Tenant subnet.

The affected topology is: "Per-Tenant Router with Private Networks".

As a reference, I'm using the following Grizzly guide for my Havana deployment:

https://github.com/mseknibilel/OpenStack-Grizzly-Install-Guide/blob/OVS_MultiNode/OpenStack_Grizzly_Install_Guide.rst

Extra info:

http://docs.openstack.org/havana/install-guide/install/apt/content/section_networking-routers-with-private-networks.html

The symptoms are:

1- "Slow connection to Canonical or when browsing the web from within a tenant subnet"

aptitude update ; aptitude safe-upgrade

From within a Tenant instance, it will take about 1 hour to finish, on a link capable of finishing it in 2~3 minutes.

2- SSH connections using Floating IPs freeze 10 times per minute.

Connecting from the outside world to an Instance using its Floating IP address is a pain.

We're discussing this issue on the OpenStack mailing list; here is the related thread: http://lists.openstack.org/pipermail/openstack/2013-November/002705.html

Also, I made a video about it, watch it here: http://www.youtube.com/watch?v=jVjiphMuuzM

Tested versions:

* OpenStack Havana on top of Ubuntu 12.04.3 using Ubuntu Cloud Archive

* Tested with the following Open vSwitch versions (none of them work):

1.10.2 from UCA
1.11.0 compiled for Ubuntu 12.04.3 using "dpkg-buildpackage"
1.9.0 from Ubuntu package "openvswitch-datapath-lts-raring-dkms"

* Not tested (maybe it will work):

Havana with Ubuntu 12.04.1 + OVS 1.4.0 (does not support VXLAN).

* Tenant subnet tested types:

VXLAN
GRE
VLAN

It does not matter which subnet type you choose; it is always slow.

Apparently, if you upgrade your Grizzly environment from Ubuntu 12.04.1 + OVS 1.4.0 to Ubuntu 12.04.3 with OVS 1.9.0, it triggers this problem with Grizzly too. So, I think that this problem might be related to Open vSwitch itself. But I need more time to check this.

My private Havana-based cloud is open for you guys to debug; just ask for access! =)

My current plan is to test Havana with OVS 1.4.0, but I don't have much time this week to do this job.

I'm not sure whether the problem is with OVS or not; I'll try to test it this week.

Also, in my video, you guys can see how I "fixed" it, by starting a Squid proxy-cache server within the Tenant Namespace Router, proving that the problem appears ONLY when you try to establish a connection from a tenant subnet directly to the External network.

I mean, the connection between a tenant and its router is okay, and from its router to the Internet is also okay, but from a tenant to the Internet it is not. So, Squid was a perfect choice to verify this theory at the Namespace router... And voilà! "There, I fixed it"! =P
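
For anyone who wants to repeat this comparison, here is a rough sketch of what I mean (the router UUID and the test URL are just placeholders; adapt them to your setup):

---
# download directly from inside the Neutron router namespace (router -> Internet path only)
ip netns exec qrouter-<router-uuid> wget -O /dev/null http://archive.ubuntu.com/ubuntu/ls-lR.gz

# download the same file from inside a tenant Instance (tenant -> router -> Internet path)
wget -O /dev/null http://archive.ubuntu.com/ubuntu/ls-lR.gz
---

The first one runs at full speed here; the second one crawls.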

Please let me know what configuration files you guys will need to be able to reproduce this problem.

Best!
Thiago

Thiago Martins (martinx)
description: updated
Thiago Martins (martinx)
description: updated
Thiago Martins (martinx)
tags: added: l3 namespace neutron openstack openvswitch
Revision history for this message
Geraint Jones (geraint-t) wrote :

This also happens in Grizzly 2013.1.3 but not 2013.1.2

Revision history for this message
Thiago Martins (martinx) wrote :

Geraint,

Can you please tell me whether you are running Grizzly 2013.1.3 with OVS 1.9.0 or with OVS 1.4.0?

Tks!
Thiago

Revision history for this message
Geraint Jones (geraint-t) wrote :

OVS 1.11 here. Haven't tried 1.09 and the performance in 1.04 is so bad that it would be very hard to get any reliable numbers from it.

Revision history for this message
Thiago Martins (martinx) wrote : Re: [Bug 1252900] Re: Directional network performance issues with Neutron + OpenvSwitch

OVS 1.4.0 might be bad but, at least, it works. Right?!

I was using OVS 1.4.0 with Grizzly for 6 months without ANY issue (but with
"MTU = 1400" for instances when using GRE tunnels).

I'm starting to thing that this problem seems to be related to newer OVS
versions.

I'll be able to test Havana with OVS 1.4.0 in about 10 days.

Tks!
Thiago

Revision history for this message
Thiago Martins (martinx) wrote :

%s/starting to thing/starting to think/

=)

Thiago Martins (martinx)
description: updated
Revision history for this message
GMi (gmi) wrote :

I think your environment is misconfigured, as I don't experience these performance issues with Havana + Ubuntu 12.04 + OVS 1.10.2 + GRE.

I have two instances running on two compute nodes connected using GRE tunnels.
The instances belong to the same tenant so traffic between them uses the GRE tunnel between the two compute nodes.
Only the traffic destined for outside is sent to the qrouter running on the dedicated network node using the GRE tunnel established between the compute node and the network node.

The topology is "Per-Tenant Router with Private Networks" and basically, the setup looks like this:

Network node    Public IP x.x.x.x
                Data IP (GRE) 10.0.20.1

Compute node1   Data IP (GRE) 10.0.20.2
                Instance1 tenant IP 10.0.0.2

Compute node2   Data IP (GRE) 10.0.20.3
                Instance2 tenant IP 10.0.0.4

The compute nodes as well as the networking node connect to the switch at 1 Gbps.

I ran iperf between the two instances and got 450 Mbps:

[root@host-10-0-0-2 ~]# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 10.0.0.2 port 5001 connected with 10.0.0.4 port 53122
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-120.0 sec 6.29 GBytes 450 Mbits/sec

root@host-10-0-0-4 ~]# iperf -c 10.0.0.2 -i 10 -t 120 -w 128K
------------------------------------------------------------
Client connecting to 10.0.0.2, TCP port 5001
TCP window size: 216 KByte (WARNING: requested 128 KByte)
------------------------------------------------------------
[ 3] local 10.0.0.4 port 53122 connected with 10.0.0.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 540 MBytes 453 Mbits/sec
[ 3] 10.0-20.0 sec 537 MBytes 451 Mbits/sec
[ 3] 20.0-30.0 sec 525 MBytes 440 Mbits/sec
[ 3] 30.0-40.0 sec 525 MBytes 440 Mbits/sec
[ 3] 40.0-50.0 sec 541 MBytes 454 Mbits/sec
[ 3] 50.0-60.0 sec 539 MBytes 452 Mbits/sec
[ 3] 60.0-70.0 sec 541 MBytes 454 Mbits/sec
[ 3] 70.0-80.0 sec 540 MBytes 453 Mbits/sec
[ 3] 80.0-90.0 sec 540 MBytes 453 Mbits/sec
[ 3] 90.0-100.0 sec 535 MBytes 449 Mbits/sec
[ 3] 100.0-110.0 sec 539 MBytes 452 Mbits/sec
[ 3] 110.0-120.0 sec 542 MBytes 454 Mbits/sec
[ 3] 0.0-120.0 sec 6.29 GBytes 450 Mbits/sec

I also ran iperf between the two compute nodes using the same physical link (the GRE 10.0.20.X segment) and I got close to wire speed (941 Mbps):

root@compute1:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 10.0.20.2 port 5001 connected with 10.0.20.3 port 58015
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-120.0 sec 13.1 GBytes 941 Mbits/sec

root@compute2:~# iperf -c 10.0.20.2 -i 10 -t 120 -w 128K
------------------------------------------------------------
Client connecting to 10.0.20.2, TCP port 5001
TCP window size: 256 KByte (WARN...


Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status: New → Confirmed
Revision history for this message
Thiago Martins (martinx) wrote :

Hi GMI,

The communication between two instances on different hypervisors (or on the same hypervisor) is not related to this problem. "Intra-cloud communication" is working just fine.

Also, I don't think it is a misconfiguration; I have already checked it lots of times, with help from some experts on the mailing list and from Rackspace (yes, a Rackspace Network Engineer connected here to my Network Node + Instance and he said this issue is new for lots of people). It is not a configuration problem. Anyway, it would be great if this were just "my problem", but lots of people are popping up claiming they are facing the very same problem.

There is something wrong at the Network Node in the most recent OpenStack / Neutron versions.

Geraint just said in comment #1 that Grizzly 2013.1.3 is also affected, but 2013.1.2 is not...

I'll make more performance tests in a few hours.

Tks!
Thiago

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

The symptoms make me think it could be an MTU issue.
It would be interesting to compare the MTUs on the various devices on the datapath in a well-working setup and in a setup that has the issue.

Revision history for this message
Thiago Martins (martinx) wrote :

Hi Eugene,

I hardly think that this is an MTU problem. I already checked that weeks ago.

Back with Grizzly and OVS 1.4.0, we had to change the Instances' MTU to 1400, otherwise it caused problems that were visible with tcpdump at the Network Node, but that is a very different problem.

Honestly, I don't know for sure whether it is, in fact, a different MTU problem, but it does not look like it.

Tks!
Thiago

Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

Hi Martin, when you repeat the tests, can you run tcpdumps in the qrouter namespace on the qg- and qr- interfaces with -w so it saves the output to a file? Then attach the files to this bug or provide a download link so we can look at them in Wireshark.

Putting Squid in the router namespace takes the NATting and routing out of the path; it is the Linux kernel that does those. Can you provide 'uname -a'? Maybe you could try testing with a different kernel - maybe the original 12.04 one, or the one you used with Grizzly.
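
Something like this should do it (the router UUID and port IDs are placeholders; the real qr-/qg- names can be read from "ip netns exec qrouter-<router-uuid> ip addr"):

---
ip netns exec qrouter-<router-uuid> tcpdump -n -i qr-<port-id> -w qr.pcap
ip netns exec qrouter-<router-uuid> tcpdump -n -i qg-<port-id> -w qg.pcap
---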

Changed in neutron:
status: New → Confirmed
Revision history for this message
Thiago Martins (martinx) wrote :

Hi Darragh!

Here is the tcpdump data, from the following commands:

---
ip netns exec qrouter-41b95614-abbc-4c55-916c-7494adc37a0b tcpdump -v -n -i qr-b5725a2c-f1 -w qr-b5725a2c-f1.tcpdump
tcpdump: listening on qr-b5725a2c-f1, link-type EN10MB (Ethernet), capture size 65535 bytes
22916 packets captured
22916 packets received by filter
0 packets dropped by kernel

-

ip netns exec qrouter-41b95614-abbc-4c55-916c-7494adc37a0b tcpdump -v -n -i qg-a1a1a364-c0 -w qg-a1a1a364-c0.tcpdump
tcpdump: listening on qg-a1a1a364-c0, link-type EN10MB (Ethernet), capture size 65535 bytes
23120 packets captured
23120 packets received by filter
0 packets dropped by kernel
---

Within the Instance, I executed:

## tcpdump started at the Network Node:

ping -c 10 google.com

# tcpdump counter: "Got 24" qr- int, "Got 29" qg- int

aptitude update

# it took about 9 minutes to finish... it should normally take about 1 minute at most.

## tcpdump stopped at the Network Node, files attached here.

---

NOTE:

 During the "aptitude update", the download speed hit "0 B/s" a few times AND, every time "aptitude update" was waiting for the network to "wake up", the two tcpdump instances were stuck at:

* outage #1:

tcpdump of qr- interface stuck at:

"Got 15446"

tcpdump of qg- interface stuck at:

"Got 15489"

* outage #2:

tcpdump of qr- interface stuck again at:

"Got 16139"

tcpdump of qg- interface stuck again at:

"Got 16212"

* outage #3:

tcpdump of qr- interface stuck again at:

"Got 17527"

tcpdump of qg- interface stuck again at:

"Got 17661"

* # 4 - finished

tcpdump of qr- interface finished at:

"Got 22912"

tcpdump of qg- interface finished at:

"Got 23120"

# tcpdump stopped, files attached here.

---

Hope it helps!

Cheers!
Thiago

Revision history for this message
Thiago Martins (martinx) wrote :

One more piece of info that I forgot:

uname -a
Linux netnode-1 3.8.0-33-generic #48~precise1-Ubuntu SMP Thu Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Open vSwitch 1.10.2 from Ubuntu Cloud Archive.

Unfortunately, my previous working Grizzly setup was deleted... I remember I was using 12.04.1 (Linux 3.2) and Open vSwitch 1.4.0 (Grizzly 2013.1.2).

Tks,
Thiago

Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

Hi Thiago,

from the tcpdump I see some packets are not being received and have to be retransmitted. Also, some packets are much greater than 1500 bytes, but it seems only about 1500 are ACKed, which results in retransmissions too.

Can you provide the output of 'ethtool -k ethX', where ethX is the one in br-ex? And if any offload stuff is on, disable it and retest.

Can you tell us how your Neutron router is uplinked to the Internet? It seems 189.8.93.65 is the gateway_ip for the Neutron external subnet. What kind of device is this? Its MAC is 52:54:00:6a:5f:82, which seems to have a vendor prefix used by KVM.

Darragh.

Revision history for this message
Geraint Jones (geraint-t) wrote :

Disabling generic-receive-offload with ethtool --offload eth0 gro off has resolved the issue for me :)
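
For reference, the same thing with the short option form (eth0 here is just the NIC plugged into br-ex on my box; substitute your own interface):

---
ethtool -k eth0 | grep generic-receive-offload   # check the current setting
ethtool -K eth0 gro off                          # -K is the short form of --offload
---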

Revision history for this message
Thiago Martins (martinx) wrote :

Hi Darragh,

The ethernet interface in my "br-ex" bridge is "eth2"; the output of "ethtool -k eth2" is:

---
root@netnode-1:~# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off
---

And yes, my uplink router is an "Ubuntu KVM Virtual Machine"; I have a valid public IPv4 block (189.8.93.64/28) routed to my own data center. The OpenStack Network Node is connected to the uplink router (KVM VM) using a "GIGALan 3Com Manageable Switch".

Tks for the tips!

Best,
Thiago

Revision history for this message
Thiago Martins (martinx) wrote :

YAY!!! Finally!!!

"ethtool --offload eth2 gro off"

Fixed the problem!

Is this still a BUG?! Or a misconfiguration, or just a lack of documentation?

Thank you!
Thiago

Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

That's good. Neutron doesn't manage physical NICs - so this is not a bug. I'll add a warning to the doc.
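
For anyone applying the workaround: the ethtool setting does not survive a reboot. One way to make it persistent on Ubuntu (just a sketch, assuming eth2 is the interface plugged into br-ex) is a post-up line in /etc/network/interfaces:

---
# /etc/network/interfaces on the network node
auto eth2
iface eth2 inet manual
    up ip link set eth2 up
    post-up ethtool -K eth2 gro off
---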

Changed in openstack-manuals:
assignee: nobody → Darragh O'Reilly (darragh-oreilly)
Revision history for this message
GMi (gmi) wrote :

Sorry. I was not receiving updates on this bug.

As can be seen from the last tests I did in comment 6, the download and upload speeds were good between an instance and the Internet, not only between two instances running on separate compute nodes.

Also, the physical interface used by br-ex on my network node has GRO turned ON and this doesn't seem to affect the network performance, so I'm not sure that's the issue:

root@quantum-network:~# ovs-vsctl list-ports br-ex
eth0
qg-bb60f501-45
root@quantum-network:~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on

Changed in openstack-manuals:
status: New → In Progress
Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

Hi GMi, what issue are you referring to? The bug reporter has confirmed that disabling GRO resolves the issue that this bug report is for.

The attached tcpdump shows packets of size greater than 1514 bytes. These can hardly have come from the Internet site. The internal Neutron network has a smaller MTU on its interfaces, so the Neutron router is not able to forward these large packets properly. TCP on the endpoints struggles to get the job done in spite of this.

TCP is an end-to-end protocol. GRO is okay if the interface is for TCP endpoints - that is why the Squid experiment worked. But the interface in br-ex is for a Neutron router, which is not a TCP endpoint. So GRO on this interface interferes with and hinders end-to-end TCP comms. Even though I have not tried to recreate this myself, I am confident that this is the root cause.

I don't know why you don't see this problem. Maybe it is your kernel version or NIC/driver.
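
If anyone wants to check whether their setup is affected, a quick test (eth2 is just an example name for the NIC in br-ex) is to watch for frames larger than a standard 1514-byte Ethernet frame arriving on that NIC while an instance downloads something:

---
sudo tcpdump -n -i eth2 -c 20 'greater 1515'
---

A steady stream of matches during the download suggests GRO/LRO is merging segments into packets the Neutron router cannot forward.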

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-manuals (master)

Fix proposed to branch: master
Review: https://review.openstack.org/58606

Revision history for this message
GMi (gmi) wrote :

Hi Darragh,

I was referring to the fact that running bandwidth tests from an instance showed good results (downloaded ~300 MB in 70s, or obtained 90.15 Mbit/s during a speedtest) -> see the end of comment 6.

Below are some more details about my network node:

root@quantum-network:~# ovs-vsctl -V
ovs-vsctl (Open vSwitch) 1.10.2
Compiled Oct 8 2013 15:09:03

root@quantum-network:~# uname -a
Linux quantum-network.tor.lab 3.2.0-56-generic #86-Ubuntu SMP Wed Oct 23 09:20:45 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

root@quantum-network:~# ovs-vsctl list-ports br-ex
eth0
qg-bb60f501-45

root@quantum-network:~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on

root@quantum-network:~# ethtool -i eth0
driver: bnx2
version: 2.1.11
firmware-version: bc 4.4.1 UMP 1.1.9
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes

Maybe the issue was introduced in later kernels and the GRO fix is needed there, but I don't experience this issue in kernel 3.2.0-56.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-manuals (master)

Reviewed: https://review.openstack.org/58606
Committed: http://github.com/openstack/openstack-manuals/commit/3cc8efdf5466750334e912ae8efa4cc8c0354edb
Submitter: Jenkins
Branch: master

commit 3cc8efdf5466750334e912ae8efa4cc8c0354edb
Author: Darragh O'Reilly <email address hidden>
Date: Tue Nov 26 20:00:17 2013 +0000

    Add warning about GRO and Neutron routers

    Generic Receive Offload appears to be enabled by default on recent Ubuntu
    kernels. It can have a significant impact on download performance when
    enabled on a Neutron router interface. This patch warns users about that.

    Change-Id: I3d3a560b1db55aabd901f27ad5c7bd5777b300da
    Closes-bug: 1252900

Changed in openstack-manuals:
status: In Progress → Fix Released
Revision history for this message
shake.chen (shake-chen) wrote :

Thanks, it also solved my problem.

On CentOS 6.4, running RDO, with GRE.

Before, the VM download speed to the outside was only 20k.

After running "ethtool --offload eth0 gro off"

(br-ex is connected to eth0)

the download speed is 105M/s.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-manuals (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/58686

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-manuals (stable/havana)

Reviewed: https://review.openstack.org/58686
Committed: http://github.com/openstack/openstack-manuals/commit/5507a386f2b812a1aab4792e11d2a0660bc82568
Submitter: Jenkins
Branch: stable/havana

commit 5507a386f2b812a1aab4792e11d2a0660bc82568
Author: Darragh O'Reilly <email address hidden>
Date: Tue Nov 26 20:00:17 2013 +0000

    Add warning about GRO and Neutron routers

    Generic Receive Offload appears to be enabled by default on recent Ubuntu
    kernels. It can have a significant impact on download performance when
    enabled on a Neutron router interface. This patch warns users about that.

    Change-Id: I3d3a560b1db55aabd901f27ad5c7bd5777b300da
    Closes-bug: 1252900
    (cherry picked from commit 3cc8efdf5466750334e912ae8efa4cc8c0354edb)

tags: added: in-stable-havana
Revision history for this message
Thiago Martins (martinx) wrote :

Guys,

I'm still facing slow connectivity from behind the Neutron router.

My download is at 25.5kB/s, while it should be around ~350.0 kB/s.

I'll do more tests this week, maybe the problem is now at http://nova.clouds.archive.ubuntu.com/ubuntu/ ???

Tks!
Thiago

Revision history for this message
Darragh O'Reilly (darragh-oreilly) wrote :

Hi Thiago, can you post another tcpdump of the interface in br-ex? The first 100 bytes of each packet should be enough:
sudo tcpdump -n -s 100 -i eth2 -w capture.pcap

Revision history for this message
Thiago Martins (martinx) wrote :

Hi Darragh,

I'm sorry about this delay, too many things to do...

Well, I can confirm that after running "ethtool --offload eth2 gro off" (on the br-ex interface), the problem gets fixed BUT, and this is a huge BUT, the "Network Node" doesn't work anymore as a KVM Virtual Machine.

I mean, if the Network Node is a physical machine, then "ethtool --offload eth2 gro off" fixes this problem BUT, if the Network Node is a KVM Virtual Machine, then "ethtool --offload eth2 gro off" does NOT fix it.

My virtual Network Node is powered by KVM with VirtIO Network Devices.

The command "ethtool --offload eth2 gro off" runs fine when the Network Node is a KVM VM, but the problem persists there.

Any tips?!

I'm doing more tests now, I'll update here later...

Tks!
Thiago

Revision history for this message
Thiago Martins (martinx) wrote :

Hi!

I can confirm that, if your Network Node is a KVM Virtual Machine, on top of Ubuntu 12.04.3 + OVS 1.10.2 (from UCA), you'll NEED to run the following command:

"ethtool --offload eth2 gro off"

...at the hypervisor too!

So, I'm running this "ethtool" command twice now: first on the hypervisor's ethernet interface (eth2, with ovsbr2 on it), and second within the "Virtual Network Node" too.

Now my Network Node is working as a KVM Virtual Machine too! No more network outages.
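
In other words, something like this (eth2 is the interface that carries ovsbr2 / br-ex in my setup; adjust the names for yours):

---
# on the physical hypervisor that hosts the virtual Network Node
ethtool --offload eth2 gro off

# inside the virtual Network Node itself (the interface attached to br-ex)
ethtool --offload eth2 gro off
---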

Tks!
Thiago

Alan Pevec (apevec)
tags: removed: in-stable-havana
Revision history for this message
Tom Fifield (fifieldt) wrote :

Hi,

Is this bug still present in later releases (eg Juno)?

Revision history for this message
Eren (erent) wrote :

Hello,

I believe so. I hit the same issue and the details are below. I've tried it with an updated kernel (3.13.0-44-generic) on Ubuntu Server 14.04. Also, disabling GRO and TSO didn't make any difference.

ICMP packets are OK but there are a lot of TCP retransmissions.

http://lists.openstack.org/pipermail/openstack/2015-January/011207.html

Any help is appreciated.

Revision history for this message
Thiago Martins (martinx) wrote :

Hey guys!

From what I'm seeing, this problem might be solved if we deploy Neutron using a solution from Intel, called Data Plane Development Kit (DPDK).

I'll test it next month.

More info:

http://www.slideshare.net/AlexanderShalimov/ashalimov-neutron-dpdkv1

https://events.linuxfoundation.org/sites/events/files/slides/Openstack-v4_0.pdf

Best,
Thiago

Revision history for this message
Rian (rian-twizer) wrote :

Hi,
Can someone tell me if this is a bug or something that should work?
Should TSO be disabled on the network node's external interface?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 172 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
status: Confirmed → Incomplete
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-manuals 15.0.0

This issue was fixed in the openstack/openstack-manuals 15.0.0 release.

Changed in neutron:
status: Incomplete → Won't Fix