network/multi-nic fails on some systems
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Checkbox Provider - Base | Fix Released | High | Rod Smith |
Bug Description
From an email conversation:
Hey Guys,
Now that we've got things sorted with iperf running automatically in
the server certification suite, I've noticed a couple of things I
wanted to run by you after a few test runs on some servers here at
Dell. I'd like for you to take a look at this and let me know your
thoughts or suggestions before I file a bug.
1) The multi_nic tests don't seem to handle more than two interfaces very well.
Mark and I have run the latest certification bits from the -dev repo
and noticed that at least 3 machines so far, each with 4 nics, had
identical troubles with the multi_nic tests. Here are the results:
PowerEdge R815
https:/
PowerEdge R720
https:/
PowerEdge R620
https:/
In each case, one NIC seems to survive the tests while the others fall
all over themselves and can't reach the iperf server. Running the
iperf tests by hand on each failed NIC post-testing works OK.
Would something possibly be going wrong during the shutdown / bringup
sequence that's currently in use?
Systems with just two or fewer nics in them, like the M610, don't have
this problem:
https:/
2) The "ip link set dev <iface>" down / up commands mess up the
default routing information.
I noticed this after a c-c-s run completed. I could no longer send
any results to the C3 site at the end of the test run and I could not
ping machines such as the lab's http proxy that's used to get outside
of Dell. I got a "Network Unreachable" error.
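For reference, the sequence in question looks like this (em1 used here
only as an example interface name):

sudo ip link set dev em1 down
sudo ip link set dev em1 up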
Here's what the routing table looks like before the multi-nic tests on
one machine:
#ip route show
default via 10.0.0.1 dev em1
10.0.0.0/24 dev em1 proto kernel scope link src 10.0.0.36
10.0.0.0/24 dev em2 proto kernel scope link src 10.0.0.48
10.0.0.0/24 dev em3 proto kernel scope link src 10.0.0.46
...and after:
#ip route show
10.0.0.0/24 dev em1 proto kernel scope link src 10.0.0.36
10.0.0.0/24 dev em2 proto kernel scope link src 10.0.0.48
10.0.0.0/24 dev em3 proto kernel scope link src 10.0.0.46
...note the default route is gone.
I can manually run "sudo ip route add default via 10.0.0.1 dev em1" on
the machine prior to submitting the results from the SUT, and it works.
Otherwise, I'm almost dead in the water. This missing route also
messes up situations where you rely on the internet connection to pull
down your kvm images for the virtualization check. Since hitting this,
I've been pointing the tests to a local copy of the image that I put
on the hard drive of the SUT prior to testing.
Also, if you set the "secure_id" parameter in
canonical-
the results submission will also fail in my lab environment because
the default route is missing.
Mark and I were looking at the man page for iperf and noticed that you
can bind iperf to a particular host or interface using the "-B" flag,
like so:
iperf -c 10.0.0.1 -B 10.0.0.36 -n 1024M
Would it be possible to change the multi-nic tests to somehow glean
the IP address information for each interface and then run iperf for
each NIC using the "-B" flag? However, I'm not an iperf expert, so
there may be more issues with that approach than meet the eye.
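A minimal sketch of that idea, assuming the iperf server sits at
10.0.0.1 as in the example above (the interface-discovery loop is
hypothetical, not the actual test code):

for iface in $(ls /sys/class/net); do
    [ "$iface" = "lo" ] && continue                            # skip loopback
    addr=$(ip -4 -o addr show dev "$iface" | awk '{print $4}' | cut -d/ -f1)
    [ -n "$addr" ] && iperf -c 10.0.0.1 -B "$addr" -n 1024M    # bind to this NIC's address
done

Note that "-B" only controls the source address of the outgoing
traffic; as discussed below, it doesn't by itself guarantee the return
traffic comes back in on the same interface.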
Anyway, please let me know what you guys think and I'll be glad to
help out in any way I can. There's also a strong possibility that I
could have set something up wrong.
Thanks!
Related branches
- Daniel Manrique (community): Approve
- Zygmunt Krynicki (community): Needs Fixing
Diff: 201 lines (+66/-25), 1 file modified: providers/plainbox-provider-checkbox/bin/network (+66/-25)
Changed in plainbox-provider-checkbox:
milestone: 0.7 → 0.8
Changed in plainbox-provider-checkbox:
milestone: 0.8 → 0.10
Changed in plainbox-provider-checkbox:
assignee: nobody → Roderick Smith (rodsmith)
status: Confirmed → Fix Committed
Changed in plainbox-provider-checkbox:
status: Fix Committed → Fix Released
This is a bug... it seems to happen after bringing interfaces up and
down multiple times... it's an Ubuntu bug, I think.
I would be all for binding, actually. There is an issue with this,
though... can you guarantee that traffic goes out and comes back in on
the same port this way?
The kernel routes along the path of least resistance... so you may
send a packet out eth4, but when it comes back, it could well be eth0
that accepts the return packet.
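You can ask the kernel which device it would use to reach the iperf
server for a given source address with "ip route get" (addresses here
are the em1/em2 ones from the routing table above; output varies per
system):

ip route get 10.0.0.1 from 10.0.0.48

If this reports "dev em1" even though 10.0.0.48 belongs to em2, the
chosen route clearly isn't tied to the address you bound.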
Here's an example using 2 nics on a default 14.04 install.
First, this is about where we are:

ubuntu@supermicro:~$ netstat -ni
Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0     1500 0      2864      0      0      0     583      0      0      0 BMRU
eth1     1500 0        58      0      0      0    2035      0      0      0 BMRU
lo      65536 0         0      0      0      0       0      0      0      0 LRU
Note eth1 shows 2035 packets out and eth0 shows 2864 coming in.
Now a ping flood of 10000 packets:

sudo ping -I eth1 -f -c 10000 10.0.0.1
PING 10.0.0.1 (10.0.0.1) from 10.0.0.128 eth1: 56(84) bytes of data.
--- 10.0.0.1 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 298ms
rtt min/avg/max/mdev = 0.210/0.258/0.497/0.024 ms, ipg/ewma 0.295/0.260 ms
And recheck netstat:

ubuntu@supermicro:~$ netstat -ni
Kernel Interface table
Iface     MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0     1500 0     12897      0      0      0     600      0      0      0 BMRU
eth1     1500 0        60      0      0      0   12035      0      0      0 BMRU
lo      65536 0         0      0      0      0       0      0      0      0 LRU
Notice that the outgoing 10K packets went out on eth1 (TX-OK 2035 →
12035), but the incoming replies arrived on eth0 (RX-OK 2864 → 12897).
One way to fix this would be to restart networking between each NIC's
test... that seems to restore the default route properly.
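Short of a full "service networking restart", a minimal sketch of the
same idea is to save and restore just the default route around each
down/up cycle (em1 from the routing tables above; this save/restore
logic is hypothetical, not the shipped test code):

default_route=$(ip route show default)   # e.g. "default via 10.0.0.1 dev em1"
sudo ip link set dev em1 down
sudo ip link set dev em1 up
# put the default route back if the down/up cycle dropped it
ip route show default | grep -q . || sudo ip route add $default_route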
There is SUPPOSED to be a kernel parameter to fix this routing
problem, but I can't remember what it is, and it's been forever and a
day since I last had to set it, so I've long since forgotten. But
there is something you can set in /proc/sys that changes the kernel's
routing behaviour.
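For what it's worth, the usual candidates for this symptom (multiple
NICs on one subnet answering ARP for each other, a.k.a. ARP flux) are
the arp_ignore/arp_announce settings under /proc/sys/net/ipv4/conf/...
whether these are the parameter meant above is an assumption:

sudo sysctl -w net.ipv4.conf.all.arp_ignore=1     # answer ARP only for addresses on the receiving interface
sudo sysctl -w net.ipv4.conf.all.arp_announce=2   # advertise the best-matching local address in ARP requests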