network/multi-nic fails on some systems

Bug #1329029 reported by Jeff Lane
This bug affects 1 person

Affects: Checkbox Provider - Base
Status: Fix Released
Importance: High
Assigned to: Rod Smith
Milestone: 0.10

Bug Description

From an email conversation:
Hey Guys,

Now that we've got things sorted with iperf running automatically in
the server certification suite, I've noticed a couple of things I
wanted to run by you after a few test runs on some servers here at
Dell. I'd like for you to take a look at this and let me know your
thoughts or suggestions before I file a bug.

1) The multi_nic tests don't seem to handle more than two interfaces very well.

Mark and I have run the latest certification bits from the -dev repo
and noticed that at least 3 machines so far, each with 4 NICs, had
identical troubles with the multi_nic tests. Here are the results:

PowerEdge R815
https://certification.canonical.com/hardware/201107-8332/submission/97649/test-results/fail/

PowerEdge R720
https://certification.canonical.com/hardware/201404-14939/submission/97651/test-results/fail/

PowerEdge R620
https://certification.canonical.com/hardware/201210-11906/submission/97647/test-results/fail/

In each case, one NIC seems to survive the tests while the others fall
all over themselves and can't reach the iperf server. Running the
iperf tests by hand on each failed NIC post-testing works OK.

Would something possibly be going wrong during the shutdown / bringup
sequence that's currently in use?

Systems with two or fewer NICs, like the M610, don't have this
problem:

https://certification.canonical.com/hardware/201212-12223/submission/97646/test-results/pass/

2) The "ip link set dev <iface>" down / up commands mess up the
default routing information.

I noticed this after a c-c-s run completed. I could no longer send
any results to the C3 site at the end of the test run and I could not
ping machines such as the lab's http proxy that's used to get outside
of Dell. I got a "Network Unreachable" error.

Here's what the routing table looks like before the multi-nic tests on
one machine:

#ip route show
default via 10.0.0.1 dev em1
10.0.0.0/24 dev em1 proto kernel scope link src 10.0.0.36
10.0.0.0/24 dev em2 proto kernel scope link src 10.0.0.48
10.0.0.0/24 dev em3 proto kernel scope link src 10.0.0.46

..and after:

#ip route show
10.0.0.0/24 dev em1 proto kernel scope link src 10.0.0.36
10.0.0.0/24 dev em2 proto kernel scope link src 10.0.0.48
10.0.0.0/24 dev em3 proto kernel scope link src 10.0.0.46

..note the default route is gone.
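
A minimal way to reproduce this outside the test, assuming em1 is the
interface carrying the default route as above; bringing the link down
flushes the routes through it, and bringing it back up does not re-add
the default route:

sudo ip link set dev em1 down
sudo ip link set dev em1 up
ip route show    # the "default via 10.0.0.1 dev em1" entry is no longer listed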

I can manually run "sudo ip route add default via 10.0.0.1 dev em1" on
the machine prior to submitting the results from the SUT, and it works.
Otherwise, I'm almost dead in the water. The missing route also
messes up situations where you rely on the internet connection to pull
down your KVM images for the virtualization check. Since hitting this
I've been pointing the tests to a local copy of the image that I put
on the hard drive of the SUT prior to testing.

Also, if you set up the "secure_id" parameter in
canonical-certification.conf to automatically send your submissions,
this part will also fail in my lab environment because the default
route is missing.

Mark and I were looking at the man page for iperf and noticed that you
can bind iperf to a particular host or interface using the "-B" flag,
like so:

iperf -c 10.0.0.1 -B 10.0.0.36 -n 1024M

Would it be possible to change the multi-nic tests to somehow glean
the IP address information for each interface and then run iperf for
each NIC using the "-B" flag? However, I'm not an iperf expert, so
there may be more issues with that approach than meets the eye.
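
For what it's worth, a rough sketch of what that could look like; the
interface names, the 10.0.0.1 server address, and the transfer size are
just placeholders borrowed from the examples above:

IPERF_SERVER=10.0.0.1
for IFACE in em1 em2 em3 em4; do
    # Grab the first IPv4 address assigned to this interface, if any.
    ADDR=$(ip -4 -o addr show dev "$IFACE" | awk '{print $4}' | cut -d/ -f1 | head -n1)
    [ -n "$ADDR" ] && iperf -c "$IPERF_SERVER" -B "$ADDR" -n 1024M
done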

Anyway, please let me know what you guys think and I'll be glad to
help out in any way I can. There's also a strong possibility that I
could have set something up wrong.

Thanks!


Revision history for this message
Jeff Lane  (bladernr) wrote :

This is a bug... it seems to happen after bringing interfaces up and
down multiple times... it's an Ubuntu bug, I think.

I would be all for binding, actually. There is an issue with this
though... Can you guarantee that traffic is going out and coming back
in the same port this way?

The kernel routes using the path of least resistance... so you may
send a packet out eth4, but when it comes back, it could well be eth0
that accepts the return packet.

Here's an example using 2 nics on a default 14.04 install:

first, this is about where we are:
ubuntu@supermicro:~$ netstat -ni
Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0     1500   0    2864      0      0      0     583      0      0      0 BMRU
eth1     1500   0      58      0      0      0    2035      0      0      0 BMRU
lo      65536   0       0      0      0      0       0      0      0      0 LRU

Note eth1 shows 2035 packets out and eth0 shows 2864 coming in.

Now a ping flood of 10000 packets:
sudo ping -I eth1 -f -c 10000 10.0.0.1
PING 10.0.0.1 (10.0.0.1) from 10.0.0.128 eth1: 56(84) bytes of data.

--- 10.0.0.1 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 298ms
rtt min/avg/max/mdev = 0.210/0.258/0.497/0.024 ms, ipg/ewma 0.295/0.260 ms

and recheck netstat
ubuntu@supermicro:~$ netstat -ni
Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0     1500   0   12897      0      0      0     600      0      0      0 BMRU
eth1     1500   0      60      0      0      0   12035      0      0      0 BMRU
lo      65536   0       0      0      0      0       0      0      0      0 LRU

Notice that the outgoing 10K packets were on eth1, but the incoming
were on eth0.

One way to fix this would be to restart networking between each NIC...
that seems to restore the default route properly.

There is SUPPOSED to be a kernel parameter to fix this routing
problem, but it's been forever and a day since I last had to set it,
so I've long since forgotten what it is. There is, though, something
you can set in /proc/sys that changes the kernel's routing behaviour.

Revision history for this message
Jeff Lane  (bladernr) wrote :

On Wed, Apr 2, 2014 at 4:32 PM, Jeffrey Lane <email address hidden> wrote:
> This is a bug... it seems to happen after bringing interfaces up and
> down multiple times... it's an ubuntu bug I think.

+1, I never liked this way of "ensuring" traffic goes out the desired interface, but it has seemed to be reliable so far (so far... heh).

> We have several scripts that fiddle with the network, I'm sure at least one of them tries to wait until the connection is
> reestablished to continue, so the next test isn't stuck with a broken connection. I think however this only works if
> NetworkManager is handling things (not the case on servers), and I also think the network test doesn't use the same
> mechanism, instead doing "its own thing". Also, *maybe* the network script tries to wait until the connection is up, but then
> again, I don't really remember.
>
> Perhaps we can come up with a network_state script to centralize this stopping/restarting of connections, to ensure everything
> is brought to the original state, and that works on both server and client. I'm not certain this is the right approach, it certainly
> bears some thinking.

I would be all for binding, actually.

> This would be ideal. My experience with iperf is that -B doesn't really work, but it may be due to what Jeff says about the kernel
> just doing its thing regardless of what iperf wants.

Revision history for this message
Jeff Lane  (bladernr) wrote :

On Wed, Apr 2, 2014 at 4:41 PM, Daniel Manrique
<email address hidden> wrote:

> Perhaps we can come up with a network_state script to centralize this
> stopping/restarting of connections, to ensure everything is brought to the
> original state, and that works on both server and client. I'm not certain
> this is the right approach, it certainly bears some thinking.

Or the nuke from orbit action:

Run "sudo restart networking" at the end of each test run, when the
script restores interfaces. This will trigger a recreation of the
routing table and restore the default route.

This is the easiest way of doing this that I've found.
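
Alternatively, a minimal sketch of just putting the default route back
by hand once the script restores the interfaces; the 10.0.0.1 gateway
and the em1 interface are taken from the example above and are only
placeholders:

ip route show | grep -q "^default " || \
    sudo ip route add default via 10.0.0.1 dev em1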

>> I would be all for binding, actually.
>
>
> This would be ideal. My experience with iperf is that -B doesn't really
> work, but it may be due to what Jeff says about the kernel just doing its
> thing regardless of what iperf wants.

Yeah, it doesn't work well; here's an example:

ubuntu@supermicro:~$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:30:48:65:5e:0c
          inet addr:10.0.0.123 Bcast:10.0.0.0 Mask:255.255.255.0
          [SNIP]

eth1 Link encap:Ethernet HWaddr 00:30:48:65:5e:0d
          inet addr:10.0.0.128 Bcast:10.0.0.0 Mask:255.255.255.0
         [SNIP]

So we can see that .128 is eth1 and .123 is eth0.

Now a netstat -ni, followed by an iperf run binding to .128 and a
second netstat -ni:

ubuntu@supermicro:~$ netstat -ni
Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0     1500   0   23200      0      0      0    4471      0      0      0 BMRU
eth1     1500   0     101      0      0      0   12035      0      0      0 BMRU
lo      65536   0       0      0      0      0       0      0      0      0 LRU
ubuntu@supermicro:~$ iperf -c 10.0.0.1 -B 10.0.0.128 -n 1024M
------------------------------------------------------------
Client connecting to 10.0.0.1, TCP port 5001
Binding to local address 10.0.0.128
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.0.128 port 5001 connected with 10.0.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 9.1 sec 1.00 GBytes 943 Mbits/sec
ubuntu@supermicro:~$ netstat -ni
Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0     1500   0   64329      0      0      0   746034      0      0      0 BMRU
eth1     1500   0     101      0      0      0    12035      0      0      0 BMRU
lo      65536   0       0      0      0      0        0      0      0      0 LRU

Notice that it appears that ALL the iperf traffic went out of eth0
instead of eth1, even though I bound it to eth1.

Whether THAT is a bug in iperf or an effect of the kernel's best-path
routing, I'm not sure.

Revision history for this message
Jeff Lane  (bladernr) wrote :

So I have half the equation solved.

The problem, outside of iperf, with having multiple NICs on the same
subnet is that the kernel routes things in funny ways.

So you get things like this:

ubuntu@critical-maas:~$ sudo arping -I eth0 10.0.0.123
ARPING 10.0.0.123 from 10.0.0.1 eth0
Unicast reply from 10.0.0.123 [00:30:48:65:5E:0C] 0.745ms
Unicast reply from 10.0.0.123 [00:30:48:65:5E:0C] 0.779ms
Unicast reply from 10.0.0.123 [00:30:48:65:5E:0C] 0.757ms

ubuntu@critical-maas:~$ sudo arping -I eth0 10.0.0.128
ARPING 10.0.0.128 from 10.0.0.1 eth0
Unicast reply from 10.0.0.128 [00:30:48:65:5E:0C] 0.887ms
Unicast reply from 10.0.0.128 [00:30:48:65:5E:0C] 0.901ms
Unicast reply from 10.0.0.128 [00:30:48:65:5E:0C] 0.849ms

As you can see, I have 2 Ethernet devices on my 1U, but when I arping
their addresses from another box, the MAC from eth0 replies... this
poisons the ARP table and can cause all sorts of fun when sending a
ton of packets.

So I did a LOT of playing around today and found the magical set of
/proc settings to fix this:

net.ipv4.conf.all.arp_announce=1
net.ipv4.conf.all.arp_ignore=2

These two SHOULD work alone on older kernels, hopefully including the
3.2 kernel in 12.04 and maybe the 3.x kernels in 12.04.4.

However that may not be enough... later kernels also changed the
behaviour of rp_filter so you have to set that too:

net.ipv4.conf.all.rp_filter=0
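
For reference, a sketch of applying these at runtime; sysctl -w only
lasts until reboot, so they would need to go into /etc/sysctl.d/ (or
/etc/sysctl.conf) to persist:

sudo sysctl -w net.ipv4.conf.all.arp_announce=1
sudo sysctl -w net.ipv4.conf.all.arp_ignore=2
sudo sysctl -w net.ipv4.conf.all.rp_filter=0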

After setting these three on Trusty, we NOW get the correct behaviour:
ubuntu@critical-maas:~$ sudo arping -I eth0 10.0.0.128
[sudo] password for ubuntu:
ARPING 10.0.0.128 from 10.0.0.1 eth0
Unicast reply from 10.0.0.128 [00:30:48:65:5E:0D] 0.937ms
Unicast reply from 10.0.0.128 [00:30:48:65:5E:0D] 0.888ms
Unicast reply from 10.0.0.128 [00:30:48:65:5E:0D] 0.844ms

Now the correct physical device is responding to the ARP requests... so to confirm this:
ubuntu@supermicro:~$ netstat -ni
Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0     1500   0   47106      0      0      0   1483445      0      0      0 BMRU
eth1     1500   0   42805      0      0      0      8179      0      0      0 BMRU
lo      65536   0       0      0      0      0         0      0      0      0 LRU
ubuntu@supermicro:~$ sudo ping -c 10000 -I eth1 -f 10.0.0.1
PING 10.0.0.1 (10.0.0.1) from 10.0.0.128 eth1: 56(84) bytes of data.

--- 10.0.0.1 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 2496ms
rtt min/avg/max/mdev = 0.166/0.216/0.367/0.010 ms, ipg/ewma 0.249/0.216 ms
ubuntu@supermicro:~$ netstat -ni
Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0     1500   0   57046      0      0      0   1503462      0      0      0 BMRU
eth1     1500   0   52811      0      0      0     18179      0      0      0 BMRU
lo      65536   0       0      0      0      0         0      0      0      0 LRU

Notice NOW that when I ping out eth1, all outgoing and incoming
packets are on eth1, no longer split between eth0 and eth1.

The next problem is iperf binding... I've tried a couple of times
with -B, but all outgoing packets STILL seem to be going out eth0
(which is why you see the TX-OK count for eth0 so high).

Revision history for this message
Jeff Lane  (bladernr) wrote :

Just a follow-up on this test issue...

I haven't had much time to play with it in a while, and to be honest,
the failures seem to be network-related (e.g. it fails on some
networks and works on others) and I really don't know why.

The routing table thing was mostly a red herring, as we don't need a
default route to go from point A to point B so long as both points are
on the same LAN segment.

I have a user story created to look at this test a little more
closely and see if we can find a better way of handling it that
restores functionality as it brings the interfaces back up. I still
think using ifconfig on servers is the way to go, but we also need to
make sure that NetworkManager functionality is preserved for client,
as that uses this test too, I believe.

Revision history for this message
Daniel Manrique (roadmr) wrote :

I'll set this as Confirmed, as we know it is happening (it's been tested extensively by Jeff), but not yet Triaged because we don't have a clear action plan.

FWIW, the network script needs extensive overhauling: as we've seen in recent fixes, the code is a bit hard to understand, the way it reads options and configuration is confusing, and now the way it handles multiple interfaces is suspected of causing trouble.

Changed in plainbox-provider-checkbox:
milestone: none → 0.7
status: New → Confirmed
importance: Undecided → High
Changed in plainbox-provider-checkbox:
milestone: 0.7 → 0.8
Changed in plainbox-provider-checkbox:
milestone: 0.8 → 0.10
Rod Smith (rodsmith)
Changed in plainbox-provider-checkbox:
assignee: nobody → Roderick Smith (rodsmith)
status: Confirmed → Fix Committed
Daniel Manrique (roadmr)
Changed in plainbox-provider-checkbox:
status: Fix Committed → Fix Released