Dropped packets on EC2, "xen_netfront: xennet: skb rides the rocket: x slots"

Bug #1317811 reported by Stéphan Kochen
This bug affects 32 people
Affects                Status        Importance  Assigned to  Milestone
linux (Ubuntu)         Fix Released  Medium      Unassigned
linux (Ubuntu Trusty)  Fix Released  Medium      Unassigned
linux (Ubuntu Utopic)  Fix Released  Medium      Unassigned

Bug Description

Running Ubuntu 14.04 LTS on EC2, we see a lot of the following in the kernel log:

    xen_netfront: xennet: skb rides the rocket: 19 slots

Each of these messages corresponds to a dropped TX packet, and eventually causes our application's connections to break and time out.

The problem appears when network load increases. We have Node.js processes doing pubsub with a Redis server, and these are most visibly affected, showing frequent connection loss. The processes talk to each other using the private addresses EC2 allocates to the machines.

Notably, the default MTU on the network interface seems to have gone up from 1500 on 13.10 to 9000 on 14.04 LTS. Reducing the MTU back to 1500 seems to drastically reduce dropped packets. (Can't say for certain whether it completely eliminates the problem.)

The machines we run are started from ami-896c96fe.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-24-generic 3.13.0-24.46
ProcVersionSignature: User Name 3.13.0-24.46-generic 3.13.9
Uname: Linux 3.13.0-24-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 May 9 09:01 seq
 crw-rw---- 1 root audio 116, 33 May 9 09:01 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.14.1-0ubuntu3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory: 'iw'
Date: Fri May 9 09:11:18 2014
Ec2AMI: ami-896c96fe
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: eu-west-1c
Ec2InstanceType: c3.large
Ec2Kernel: aki-52a34525
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
PciMultimedia:

ProcFB:

ProcKernelCmdLine: root=LABEL=cloudimg-rootfs ro console=hvc0
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-24-generic N/A
 linux-backports-modules-3.13.0-24-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 May 9 09:54 seq
 crw-rw---- 1 root audio 116, 33 May 9 09:54 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
CurrentDmesg: [ 24.724129] init: plymouth-upstart-bridge main process ended, respawning
DistroRelease: Ubuntu 14.04
Ec2AMI: ami-896c96fe
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: eu-west-1c
Ec2InstanceType: c3.large
Ec2Kernel: aki-52a34525
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
Package: linux (not installed)
PciMultimedia:

ProcFB:

ProcKernelCmdLine: root=LABEL=cloudimg-rootfs ro console=hvc0
ProcVersionSignature: User Name 3.13.0-24.46-generic 3.13.9
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-24-generic N/A
 linux-backports-modules-3.13.0-24-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty ec2-images
Uname: Linux 3.13.0-24-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy netdev plugdev sudo video
_MarkForUpload: True

break-fix: - 97a6d1bb2b658ac85ed88205ccd1ab809899884d
break-fix: - 11d3d2a16cc1f05c6ece69a4392e99efb85666a6

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1317811

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Stéphan Kochen (stephank) wrote :

The machine is no longer running, but I can run apport-collect from a similar machine. The only difference is that we've since added a line to our startup script to reduce the MTU to 1500.

tags: added: apport-collected
description: updated
Revision history for this message
Stéphan Kochen (stephank) wrote : BootDmesg.txt

apport information

Revision history for this message
Stéphan Kochen (stephank) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Stéphan Kochen (stephank) wrote : ProcEnviron.txt

apport information

Revision history for this message
Stéphan Kochen (stephank) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Stéphan Kochen (stephank) wrote : ProcModules.txt

apport information

Revision history for this message
Stéphan Kochen (stephank) wrote : UdevDb.txt

apport information

Revision history for this message
Stéphan Kochen (stephank) wrote : UdevLog.txt

apport information

Revision history for this message
Stéphan Kochen (stephank) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
Revision history for this message
Stéphan Kochen (stephank) wrote :

For what it's worth, the MTU appears to differ per instance type. At least c3.large has an MTU of 9000 by default, while m1.small has a normal MTU of 1500.

Revision history for this message
Stefan Bader (smb) wrote :

It could be interesting to find out whether the issue does not occur on an m1.small (although that could still result from setup differences other than the MTU). I am not sure how AWS manages to make the instance come up with a different MTU either. In my experiments I had a normal bridge on the host set to 9000 and the guest still had 1500, though I do not know in detail how the network is set up in EC2 (could be openvswitch).
Generally the issue is that something seems to produce packets with a large data buffer. One slot in the xen-netfront driver is a 4K page, and the limit is 18 slots. Anything above that causes the observed message and the packet to be dropped. The host side has another limit of (usually) 20 slots, above which it assumes a malicious guest and disrupts the connection. But since the guest already drops anything above 18 slots, the host should never see that number.
Unfortunately I do not understand the network code that deeply, so I will have to ask upstream. As far as I understand, a socket buffer can consist of multiple fragments (a kind of scatter-gather list). There is a definition in the code that limits the number of fragments based on a maximum frame size of 64K. This results in 17 frags (for 4K pages that is 16, plus 1 to handle data not starting at a page boundary). The Xen driver counts the length of the memory area in all frags (if data in a frag starts at an offset, that is added; the code does this for every frag, so the question is whether in theory each frag may have an offset, because that might add up to more than one page). To the number of pages needed for the frags, the driver then adds the number of pages (can that be more than one?) needed for the header. If that is bigger than 18 (17 for frags + 1 for the header?), the "rides the rocket" error happens.
This leaves a few question marks for me: the memory associated with a frag can be a compound page, so I would think the length might be greater than 4K. I have no clue yet how compound pages come into play exactly. Is the 64K limit still enforced via the limit on the number of frags? Can each frag's data begin at some offset (and so end up with more than one page of overall overhead)? Apparently the header can start at some offset, too. So worst case (assuming the header length is less than 4K), if the offset is quite big, the header could end up requiring 2 pages. Then, if the frag data happens to use up its 17-page limit, we would just end up hitting the 19-page failure size.
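
To make that accounting concrete, here is a minimal standalone model of the rule described above (a sketch of my reading of it, not the xen-netfront source; the constants follow the description, the helper name is made up):

    /* Model: one slot per 4K page touched by each buffer. */
    #include <stdio.h>

    #define PAGE_SIZE     4096UL
    #define MAX_SKB_FRAGS (65536UL / PAGE_SIZE + 1)  /* 16 + 1 = 17 frags */
    #define SLOT_LIMIT    (MAX_SKB_FRAGS + 1)        /* 17 frags + 1 header = 18 */

    /* Pages (slots) spanned by a buffer of len bytes that starts at an
     * offset within its first 4K page. */
    static unsigned long pages_needed(unsigned long offset, unsigned long len)
    {
        return (offset % PAGE_SIZE + len + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    int main(void)
    {
        /* 6076 bytes starting 2828 bytes into a page span 8904 bytes from
         * the page boundary, i.e. 3 pages, although the data alone would
         * fit in 2. Offsets are what inflate the slot count. */
        printf("slot limit: %lu\n", SLOT_LIMIT);
        printf("6076 bytes @ offset 2828 -> %lu slots\n",
               pages_needed(2828, 6076));
        return 0;
    }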

Revision history for this message
Stefan Bader (smb) wrote :

Thinking about this, I could build a debug kernel to which I add code to print out the layout of the socket buffer when the size check fails. Stéphan, would you be able to run that on a setup that shows the failures?

Revision history for this message
Stéphan Kochen (stephank) wrote :

I can't comment on the driver implementation details, but I can give some further details about our experience.

The app in question was a second-screen app for the Dutch public broadcasting network for the Eurovision Song Contest. The app was live for the two semi-finals on Tuesday the 6th and Thursday the 8th, as well as the finals on Saturday the 10th. Load was lowest on the Thursday, when the Netherlands did not perform, and highest on Saturday during the finals. We ran c3.large instances for all shows.

We first noticed the issue during the first run on Tuesday.

Shortly before the second run on Thursday, we identified the high MTU setting as a possible cause, and changed it to 1500 on half of our machines in the redundant setup. There was a clear difference in connection stability between these machines.

For the third run on Saturday, we had all machines on the normal MTU of 1500, having adjusted our startup scripts to force the setting. We had zero connection issues that night and clean kernel logs, even though that night saw the highest network load of all three.

We have several m1.small instances running 24/7 as well, and these have clean kernel logs, but their network load is quite low. The MTU on these has always been untouched, and is a normal 1500, apparently by default.

In the instance type list, EC2 shows Compute Optimized instances as having Enhanced Networking. Even though we don't qualify for it, perhaps the networking setup is different for these instances. https://aws.amazon.com/ec2/instance-types/

About a custom kernel, we'd have to look into deploying it, or reproducing the issue on a smaller test setup. I'd prefer looking into the latter, because maybe we can reproduce it between just two instances with stress tools.

Revision history for this message
Stefan Bader (smb) wrote :

Thanks for the additional info. The relation to MTU size definitely sounds quite plausible. The check is on outgoing traffic from the guest, which I would expect to be affected by the MTU together with GSO support. And yes, preferably we find a reproducer that does not require a production system to suffer, and ideally one I could run on a local test system to understand the host side.
I will try to figure out more details from a stock c3.large if that is possible, and maybe whether something like iperf can trigger it there.

Revision history for this message
Stéphan Kochen (stephank) wrote :

So I have a smaller test case. Basically, install Redis (from apt) on one machine, and Node.js (binaries from nodejs.org) with the scripts below on the other. Run pub.js once and sub.js twice; this quickly triggers the error. The first argument to each script is the address of the Redis machine; I use the internal 10.0.0.0/8 address.

https://gist.github.com/stephank/764e3414d57bc3bcb6b3

I initially tried to do this using openbsd-inetd echo and several netcat processes, but that doesn't seem to trigger it. Maybe it's something specific about the way Redis distributes pubsub messages to its subscribers?

Revision history for this message
Stéphan Kochen (stephank) wrote :

If you'd like me to run this on EC2, I can give it a try. A custom kernel would simply be a replacement package?

Revision history for this message
Stefan Bader (smb) wrote :

Yes, the kernel would be a set of dpkg files to be installed via 'dpkg -i'. Of course I still have to code that up. If I can reproduce it with your instructions locally, even better (it would cut down turnaround times). Otherwise I can start up some EC2 instances, too. Good to have a simple way to trigger it, unlike some other issues that only happen under production conditions.

Revision history for this message
Stefan Bader (smb) wrote :

Good news, the reproducer works on my local system, too. Thanks. :)

Stefan Bader (smb)
Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (smb)
status: Confirmed → In Progress
Revision history for this message
Stefan Bader (smb) wrote :

So with the added debugging, running the reproducer with the outside bridge (and thus the vifs) and the PV guest's eth0 set to 9001 (as seen on EC2), I get the following (format is <length>@<offset>):

[ 698.108119] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 698.108134] header 1490@238 -> 1 slots
[ 698.108139] frag #0 1614@2164 -> + 1 pages
[ 698.108143] frag #1 3038@1296 -> + 2 pages
[ 698.108147] frag #2 6076@1852 -> + 2 pages
[ 698.108151] frag #3 6076@292 -> + 2 pages
[ 698.108156] frag #4 6076@2828 -> + 3 pages
[ 698.108160] frag #5 3038@1268 -> + 2 pages
[ 698.108164] frag #6 2272@1824 -> + 1 pages
[ 698.108168] frag #7 3804@0 -> + 1 pages
[ 698.108172] frag #8 6076@264 -> + 2 pages
[ 698.108177] frag #9 3946@2800 -> + 2 pages
[ 698.108180] frags adding 18 slots

So multiple frags can point to a compound page and also start at an offset. Each frag costs DIV_ROUND_UP(offset + length, PAGE_SIZE) slots: frag #4 above, 6076@2828, spans 8904 bytes from its page boundary and hence 3 pages, and the 18 frag slots plus 1 header slot give the 19 total. Which means either the assumption about the size required to handle N frags is wrong, or whatever creates that buffer...

Revision history for this message
Stefan Bader (smb) wrote :

Playing around with this, I actually found an even simpler way to trigger the issue:

PV guest #1: Install redis-server (and enable eth0 ip in config)
PV guest #2: Install redis-tools and run 'redis-benchmark -q -h <PV guest #1 IP> -d 1000'

The MTU size turns out to be irrelevant; this even happens with 1500 during the batch request tests. What does make a difference is preventing scatter-gather, as was reported in another bug about this (on any host that sees the "rides the rocket" message):

sudo ethtool -K eth0 sg off

I discussed the issue upstream, and the driver should handle this case without dropping the request. It might be a bit complicated, so I cannot give an ETA on the fix right now.

Revision history for this message
Stéphan Kochen (stephank) wrote :

Thanks for the continued help fixing this!

I couldn't reproduce it using redis-benchmark on EC2, but that's okay.

Scatter/gather IO is solely a performance flag in the driver? As in, it won't affect applications?

The only effect I noticed after disabling it is that it's apparently required for jumbo framing:

    vif vif-0 eth0: Reducing MTU because no SG offload

And it dropped to 1500. But I can live with that.

Also, do you have a link to the upstream discussion?

Revision history for this message
Stefan Bader (smb) wrote :

Oh, ok. It does work quite well on my local guests, which come up with a 1500 MTU. Maybe the EC2 guests would need a bigger data size value than 1000. But yeah, as long as I have some way to verify whatever comes up to fix this, it is ok.

Yes, the loss of jumbo frames was expected. As long as high throughput is not critical it is at least good enough as a work-around.

About the upstream discussion: http://www.spinics.net/lists/netdev/msg282340.html

Basically it looks like the problem was kind of known, but probably did not happen often enough. Or it is actually complicated to fix. It appears that other drivers will not have this issue, as long as their limit is on the actual transfer size and not on the number of pages required to accommodate the frags/scatter-gather list. Unfortunately Xen has a limit there that guests have to impose, because otherwise the host-side driver would shut down the connection completely.

Revision history for this message
Ran Rubisntein (ran-cld) wrote :

I am getting this error on Ubuntu 14.04 with latest kernel 3.13.0-30-generic running on c3.2xlarge instances on EC2 PV.

Changing MTU to 1500 didn't help.

Any other suggestions? We are getting 10-20 dropped packets a day (out of millions).

Revision history for this message
Stéphan Kochen (stephank) wrote :

As Stefan Bader mentions in #22, the current workaround is:

    sudo ethtool -K eth0 sg off

Revision history for this message
Stefan Bader (smb) wrote :

Right. Unfortunately a real fix, without the need to disable scatter-gather, is unlikely to happen soon. None of the approaches discussed so far has found everybody's agreement, as none of them would be perfect.

Revision history for this message
Carl Hörberg (carl-hoerberg) wrote :

HVM instances do not seem to have this issue, only PV/paravirtual instances.

Revision history for this message
Stefan Bader (smb) wrote :

HVM instances would have the same issue when using PV network drivers (which they usually do, for performance). However, one also needs to cause fragmented skbs which contain multiple compound-page fragments, and that depends on many factors which may not always be easy to meet.

By now, there actually seems to be a work-around that has been applied upstream in v3.17. Looks like we have to pick the following (or actually get it in via the stable process):

commit 97a6d1bb2b658ac85ed88205ccd1ab809899884d
Author: Zoltan Kiss <email address hidden>
Date: Mon Aug 11 18:32:23 2014 +0100

    xen-netfront: Fix handling packets on compound pages with skb_linearize
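
As a rough illustration of the idea behind that commit (a standalone sketch of my reading of it, not the actual patch): when the fragments of an skb would need too many slots, copy the data into one linear buffer and recount, since a contiguous buffer of the same length spans far fewer pages. Using the fragment layout from the debug output in an earlier comment:

    #include <stdio.h>

    #define PAGE_SIZE  4096UL
    #define SLOT_LIMIT 18UL  /* 17 frag slots + 1 header slot */

    static unsigned long pages_needed(unsigned long offset, unsigned long len)
    {
        return (offset % PAGE_SIZE + len + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    int main(void)
    {
        /* {offset, length} pairs from the "19 slots" debug dump above */
        static const unsigned long frags[][2] = {
            {2164, 1614}, {1296, 3038}, {1852, 6076}, {292, 6076},
            {2828, 6076}, {1268, 3038}, {1824, 2272}, {0, 3804},
            {264, 6076}, {2800, 3946},
        };
        unsigned long i, slots = 1, total = 1490;  /* header 1490@238: 1 slot */

        for (i = 0; i < sizeof(frags) / sizeof(frags[0]); i++) {
            slots += pages_needed(frags[i][0], frags[i][1]);
            total += frags[i][1];
        }
        printf("fragmented: %lu slots (limit %lu)\n", slots, SLOT_LIMIT);

        /* After linearizing, the same bytes sit in one contiguous buffer
         * (assumed page-aligned here for simplicity). */
        printf("linearized: %lu slots\n", pages_needed(0, total));
        return 0;
    }

With these numbers the fragmented layout needs 19 slots, one over the limit, while the linearized copy needs only 11, so the skb can be sent instead of dropped (unless the copy itself fails, e.g. under memory pressure).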

Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu Utopic):
importance: Undecided → Medium
status: New → Triaged
Stefan Bader (smb)
description: updated
tags: added: kernel-bug-break-fix
Andy Whitcroft (apw)
Changed in linux (Ubuntu Trusty):
status: Triaged → Confirmed
Changed in linux (Ubuntu Utopic):
status: Triaged → Confirmed
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Diego Rodriguez (habaner0) wrote :

I'm still seeing this issue in Ubuntu 14.04 on Ec2, despite using the latest kernel release:

 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Here are some of the logs I found:

kern.log:1634:Jan 15 00:22:59 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [10070.280631] xen_netfront: xennet: skb rides the rocket: 22 slots
kern.log:3523:Jan 15 20:01:23 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [80773.747470] xen_netfront: xennet: skb rides the rocket: 19 slots
kern.log:3524:Jan 15 20:01:23 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [80773.791014] xen_netfront: xennet: skb rides the rocket: 19 slots
kern.log:3525:Jan 15 20:02:14 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [80824.734485] xen_netfront: xennet: skb rides the rocket: 19 slots
kern.log:3526:Jan 15 20:02:22 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [80833.403077] xen_netfront: xennet: skb rides the rocket: 19 slots
kern.log:3871:Jan 15 23:39:20 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [93850.874250] xen_netfront: xennet: skb rides the rocket: 20 slots
kern.log:3872:Jan 15 23:39:20 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [93851.452640] xen_netfront: xennet: skb rides the rocket: 19 slots
kern.log:3873:Jan 15 23:39:20 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [93851.453131] xen_netfront: xennet: skb rides the rocket: 20 slots
kern.log:3874:Jan 15 23:39:21 staging-cool-load-balancer-20150114-10-36-6-94 kernel: [93851.695471] xen_netfront: xennet: skb rides the rocket: 19 slots

Revision history for this message
Stefan Bader (smb) wrote :

Not surprising, as we held back for Trusty and Utopic after being told that there was a regression. And as the task status shows, this only became fixed in current development (Vivid). But now that both parts are there, it is time to get it back into stable.

Stefan Bader (smb)
description: updated
Revision history for this message
Jon Schewe (jpschewe) wrote :

So will this be fixed in 14.04 at all? I just upgraded to kernel 3.13.0-44 and I'm seeing more of these messages than before. This is on a system that does NAT and DNS.

Revision history for this message
Brian Scholl (btscholl) wrote :

Just chiming in with Jon: I'm using 14.04.1 LTS on an EC2 hs1.8xlarge with kernel 3.16.0-29 and I can still reliably produce this error. I thought this was fixed in 3.14+ but no such luck. Under a particular load the server becomes unresponsive to network requests.

I've tried turning off tso and sg on eth0, but this did not resolve the issue. I'm not sure if there is another feature causing this in my configuration, but I'd be willing to test for it if someone could point me at documentation.

Also if there are any logs I can provide to help diagnose this issue please let me know, I'm really eager to see this bug resolved.

Revision history for this message
Stefan Bader (smb) wrote :

This will be fixed in Utopic and Trusty. It was only delayed because the upstream fix was found to cause another regression just about when it would have been picked up. I have now re-submitted it, together with the fix for the regression, to be picked up by our stable trees.

Revision history for this message
dragosr (dragosr) wrote :

Any updates on when the fix will come out?

Revision history for this message
Durzo (durzo) wrote :

also waiting on this

Revision history for this message
Stefan Bader (smb) wrote :

Unfortunately I cannot speed up the process. The fixes have been picked into our stable trees and get a chance to move over to the distro trees next week (which would get them into the next update). Meanwhile you can work around it by disabling scatter-gather (see comment #22).

Andy Whitcroft (apw)
Changed in linux (Ubuntu Utopic):
status: Confirmed → Fix Committed
Seth Forshee (sforshee)
Changed in linux (Ubuntu Trusty):
status: Confirmed → Fix Committed
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Revision history for this message
Stéphan Kochen (stephank) wrote :

I believe my test case is flawed, so I cannot verify with certainty whether the issue is fixed or not. This is the same test case as I used before, for which I posted code in a gist: https://gist.github.com/stephank/764e3414d57bc3bcb6b3

Here's what I tried:

 - I started two new c3.large machines from ami-69e76c1e (eu-west-1 HVM 64-bit trusty with instance store)

 - I downloaded io.js 1.2.0 on machine A, together with the pub.js and sub.js scripts from my gist.

 - I installed redis-server on machine B and reconfigured redis to bind to the internal IP (in 10.x.x.x)

 - The machines were initially running linux-virtual 3.13.0.45.52. I reproduced the issue in this setup by running sub.js twice, then pub.js once on machine A, connecting them to redis on machine B. The 'rides the rocket' message showed up in the logs, and the subs lost their connection.

 - I enabled trusty-proposed on both machines with a pin, and selectively upgraded linux-virtual on both. Then rebooted both. The kernel on both machines is now linux-virtual 3.13.0.46.53.

 - I ran the same test again, sub.js twice, pub.js once on machine A, connecting to machine B. There were no 'rides the rocket' messages, but the subs still lose their connections. I sporadically get 'net_ratelimit: x callbacks suppressed', but not on every test run.

 - I disabled scatter/gather on both machines, which also dropped their MTU to 1500, and ran the test again several times. There were no more 'net_ratelimit' messages, but the subs still lose their connections.

 - I installed redis-server on machine A the same way, listening on the internal IP, and ran the same test on machine A, but this time connecting to itself on the internal IP. The test now runs indefinitely. (But this probably doesn't touch the driver.)

So I'm not sure what to take away from this. I suppose I could continue by trying to fix my test case to run properly without scatter/gather, before enabling it again. Or find a way to trigger it using a different test, such as with redis-benchmark.

Stefan, is it sufficient verification if your own testing now shows it fixed?

Revision history for this message
Heikki Hannikainen (hessu) wrote :

Installed 3.13.0-46.75 on 6 VMs which exhibited this problem daily. I'll confirm tomorrow evening whether it's gone away.

Revision history for this message
Stefan Bader (smb) wrote :

I usually only run a redis-benchmark (with -d1000), which triggers the fragmentation on the server side. The benchmark itself never complained, though I could verify that with the old kernel the tx dropped count in ifconfig went up. And with the proposed kernel (I picked Utopic/3.16, since you and Heikki look at 3.13) there were no dropped packets.

I realize now that the way the message was changed, from a normal ratelimited one to a debug ratelimited one, is a bit useless, as this results in a lot of "callbacks suppressed" without showing a single line of what that might be. But basically there will be one of those messages whenever the fragments do not fit into the 19 pages of ring buffer. Just before the change this always meant the packet was dropped; now it depends on whether the skb could be serialized (which might happen under memory pressure).

I would say we wait for the feedback from Heikki and call it good if his VMs survive.

Revision history for this message
Stefan Bader (smb) wrote :

Doh! What I meant was that the serialization of the skb may fail under memory pressure.

Revision history for this message
Joni-Pekka Kurronen (joni-kurronen) wrote :

hi,

I have had two mysterious problems: bacula stops, giving connection loss due to too-big packets as the reason (tried to change the MTU, no success); the second has been aiccu, which from time to time does not recover after the connection comes up again and needs a service stop & start...

I just ran sudo ethtool -K eth0 sg off and bacula seems to work now... If all backups are done by tomorrow morning, this might have been the corrective action. The backup causes around 17-40 MB/s of continuous traffic on gigabit, plus internet usage, at the server/router/4G-dongle connection point.

Any ideas how to test? Could this explain my problem?

Revision history for this message
Heikki Hannikainen (hessu) wrote :

Since yesterday morning I've had 3.13.0-46.75 running on 6 VMs. Those VMs haven't had any "xen_netfront: xennet: skb rides the rocket" messages. Meanwhile, 8 other VMs with 3.13.0-44 did have these errors.

So, looks good to me. Not absolute proof, but looks good.

Revision history for this message
Joni-Pekka Kurronen (joni-kurronen) wrote :

I can confirm that ethtool -K eth0 sg off did correct the bacula backup problem:

- a bacula-sd to bacula-fd communication error that stops the backup process, saying "Error: bsock.c:427 Write error sending reset by peer"

- so far no IPv6 traffic jams with aiccu, but single missing packets should not stop aiccu?

This is Ubuntu 14.04 LTS with 3.13.0-39-generic, in a twin-server configuration (suricata, logstash, NFQUEUE, shoreline, keepalived, haproxy, mariadb-galera-cluster, aiccu IPv6, ...); one of the twin servers has a 4G (Huawei E398) dongle and a wifi (hostapd) dongle and does firewall, routing, ... It has 4 cores, so it is never busy...

So could these missing packets really stop a lengthy process like a backup, where 0.1T to 0.7T is transferred, in a way that other mechanisms cannot correct or hide?

Revision history for this message
Joni-Pekka Kurronen (joni-kurronen) wrote :

Crazy idea: could there be situations where the sender increases the MTU beyond the receiver side's?

Fundamental question: I tried to understand what ethtool -K eth0 sg off does at the protocol level. Can anyone explain? It looks like medicine at the moment.

Revision history for this message
Stefan Bader (smb) wrote :

I think that is at least enough data to claim this issue verified. @Jon, you could check whether you have the same issue by looking at the dmesg and ifconfig output on the sender side. If you see the "rides the rocket" messages together with an incrementing tx drop count, that is this bug. And it should go away, even without disabling sg, when using the proposed kernel on the sender.

And though it probably should not cause such bad effects, higher-layer applications can expect the TCP layer to correct dropped packets. That retry may well happen, but possibly ends up (by trying to avoid copying data in memory) with the same (or a similarly) fragmented send buffer. And in the end all retries are dropped, too.
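
If watching ifconfig by hand is tedious, the same counter can be read from sysfs. A small helper along these lines (the interface name and poll interval are illustrative; /sys/class/net/<dev>/statistics/tx_dropped is the standard Linux statistics node that ifconfig's "TX ... dropped" reflects):

    /* Poll the TX drop counter and report increments. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/class/net/eth0/statistics/tx_dropped";
        unsigned long long prev = 0, cur = 0;
        int first = 1;

        for (;;) {
            FILE *f = fopen(path, "r");
            if (!f || fscanf(f, "%llu", &cur) != 1) {
                perror(path);
                return 1;
            }
            fclose(f);
            if (first || cur != prev)  /* first pass prints the baseline */
                printf("tx_dropped: %llu (+%llu)\n", cur, cur - prev);
            first = 0;
            prev = cur;
            sleep(5);
        }
    }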

tags: added: verification-done-trusty
removed: verification-needed-trusty
tags: added: verification-done-utopic
Changed in linux (Ubuntu):
assignee: Stefan Bader (smb) → nobody
Revision history for this message
Stefan Bader (smb) wrote :

Turning off scatter-gather disables the use of fragments in send buffers. So for the xen-netfront driver there is no chance that those fragments end up requiring more than the 19 pages it can handle in one transaction.

Revision history for this message
Heikki Hannikainen (hessu) wrote :

On the other hand, turning off scatter-gather caused a rather constant 0.1% packet loss (transmit drops) in my setup (trusty on Xen): VPN gateways, with NAT and firewalling, and relatively high throughput at times. Enabling scatter-gather removes the constant small packet loss, but then I do get the "rides the rocket" events.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-46.75

---------------
linux (3.13.0-46.75) trusty; urgency=low

  [ Seth Forshee ]

  * Release Tracking Bug
    - LP: #1419963

  [ Andy Whitcroft ]

  * [Debian] arm64 -- build ubuntu drivers
    - LP: #1411284
  * hyper-v -- fix comment handing in /etc/network/interfaces
    - LP: #1413020

  [ Kamal Mostafa ]

  * [Packaging] force "dpkg-source -I -i" behavior

  [ Upstream Kernel Changes ]

  * Revert "[SCSI] mpt2sas: Remove phys on topology change."
    - LP: #1419838
  * Revert "[SCSI] mpt3sas: Remove phys on topology change"
    - LP: #1419838
  * Btrfs: fix transaction abortion when remounting btrfs from RW to RO
    - LP: #1411320
  * Btrfs: fix a crash of clone with inline extents's split
    - LP: #1413129
  * net/mlx4_en: Add VXLAN ndo calls to the PF net device ops too
    - LP: #1407760
  * KVM: x86: SYSENTER emulation is broken
    - LP: #1414651
    - CVE-2015-0239
  * powerpc/xmon: Fix another endiannes issue in RTAS call from xmon
    - LP: #1415919
  * ipv6: fix swapped ipv4/ipv6 mtu_reduced callbacks
    - LP: #1404558, #1419837
  * usb: gadget: at91_udc: move prepare clk into process context
    - LP: #1419837
  * KVM: x86: Fix far-jump to non-canonical check
    - LP: #1419837
  * x86/tls: Validate TLS entries to protect espfix
    - LP: #1419837
  * userns: Check euid no fsuid when establishing an unprivileged uid
    mapping
    - LP: #1419837
  * userns: Document what the invariant required for safe unprivileged
    mappings.
    - LP: #1419837
  * userns: Only allow the creator of the userns unprivileged mappings
    - LP: #1419837
  * x86_64, switch_to(): Load TLS descriptors before switching DS and ES
    - LP: #1419837
  * isofs: Fix infinite looping over CE entries
    - LP: #1419837
  * batman-adv: Calculate extra tail size based on queued fragments
    - LP: #1419837
  * KEYS: close race between key lookup and freeing
    - LP: #1419837
  * isofs: Fix unchecked printing of ER records
    - LP: #1419837
  * x86_64, vdso: Fix the vdso address randomization algorithm
    - LP: #1419837
  * groups: Consolidate the setgroups permission checks
    - LP: #1419837
  * userns: Don't allow setgroups until a gid mapping has been setablished
    - LP: #1419837
  * userns: Don't allow unprivileged creation of gid mappings
    - LP: #1419837
  * move d_rcu from overlapping d_child to overlapping d_alias
    - LP: #1419837
  * deal with deadlock in d_walk()
    - LP: #1419837
  * Linux 3.13.11-ckt14
    - LP: #1419837
  * gre: fix the inner mac header in nbma tunnel xmit path
    - LP: #1419838
  * netlink: Always copy on mmap TX.
    - LP: #1419838
  * netlink: Don't reorder loads/stores before marking mmap netlink frame
    as available
    - LP: #1419838
  * in6: fix conflict with glibc
    - LP: #1419838
  * tg3: tg3_disable_ints using uninitialized mailbox value to disable
    interrupts
    - LP: #1419838
  * batman-adv: Unify fragment size calculation
    - LP: #1419838
  * batman-adv: avoid NULL dereferences and fix if check
    - LP: #1419838
  * net: Fix stacked vlan offload features computation
    - LP: #1419838
  * net: Reset secmark when scrubbing packet
    - L...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Andy Whitcroft (apw)
Changed in linux (Ubuntu Utopic):
status: Fix Committed → Fix Released
tags: removed: kernel-bug-break-fix
Revision history for this message
Jyothikumar (jyothikumar) wrote :

xen_netfront: xennet: skb rides the rocket: 19 slots

When I ran dmesg on the Linux machine I got this output. Please suggest what I have to do.

I'm using kernel:

3.10.0-123.8.1.el7.x86_64

Revision history for this message
Jyothikumar (jyothikumar) wrote :

 ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.50.3.37 netmask 255.255.255.192 broadcast 10.50.3.63
        inet6 fe80::c36:5fff:fea1:893b prefixlen 64 scopeid 0x20<link>
        ether 0e:36:5f:a1:89:3b txqueuelen 1000 (Ethernet)
        RX packets 10653395098 bytes 10991518671473 (9.9 TiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 5091742032 bytes 8916018268199 (8.1 TiB)
        TX errors 0 dropped 190064 overruns 0 carrier 0 collisions 0

tx dropped 190064

Revision history for this message
Jyothikumar (jyothikumar) wrote :

 ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.50.3.37 netmask 255.255.255.192 broadcast 10.50.3.63
        inet6 fe80::c36:5fff:fea1:893b prefixlen 64 scopeid 0x20<link>
        ether 0e:36:5f:a1:89:3b txqueuelen 1000 (Ethernet)
        RX packets 10654784591 bytes 10992813714262 (9.9 TiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 5092521006 bytes 8917322665845 (8.1 TiB)
        TX errors 0 dropped 190075 overruns 0 carrier 0 collisions 0

One minute later: dropped 190075
