xennet driver reports "skb rides the rocket" under moderate load

Bug #1195474 reported by Geraint Jones
This bug affects 14 people
Affects                    Status     Importance  Assigned to  Milestone
linux-lts-raring (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

We are using Ubuntu Server in AWS running linux-image-3.8.0-19-generic. When an instance has moderate network load (around 30 Mbit/s), we start to see:

[31333817.179933] xennet: skb rides the rocket: 19 slots
[31334587.454365] xennet: skb rides the rocket: 21 slots
[31334772.157791] xennet: skb rides the rocket: 20 slots
[31335254.431489] xennet: skb rides the rocket: 19 slots
[31336785.643018] xennet: skb rides the rocket: 19 slots
[31337438.686311] xennet: skb rides the rocket: 21 slots

This then translates into packet loss (note the dropped TX packets):

eth0 Link encap:Ethernet HWaddr 0e:cd:f0:69:b1:29
          inet addr:10.0.7.254 Bcast:10.0.7.255 Mask:255.255.255.0
          inet6 addr: fe80::ccd:f0ff:fe69:b129/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1913740 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1619310 errors:0 dropped:6 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1196646593 (1.1 GB) TX bytes:234005040 (234.0 MB)
          Interrupt:48
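
To see whether the drops track the warnings, one could count the messages in the kernel log and compare against the interface's TX drop counter. The following is only a minimal sketch: `count_rocket` is a hypothetical helper name, and the sysfs path in the comment assumes the interface is called eth0.

```shell
#!/bin/sh
# Count "skb rides the rocket" warnings in kernel log text read from stdin.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_rocket() {
    grep -c 'skb rides the rocket' || true
}

# Usage on a live system (reading the kernel log may need root):
#   dmesg | count_rocket
#   cat /sys/class/net/eth0/statistics/tx_dropped
```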

Tags: kernel-bug
Brandon (lordnynex-6) wrote :

I am seeing this as well. I'm wondering if you've worked past the issue. I'm thinking of trying a different kernel more suitable for AWS?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-lts-raring (Ubuntu):
status: New → Confirmed
Matt Whitlock (whitslack) wrote :

I'm seeing this on 3.8.0-19-generic at Amazon EC2 as well. It's very reproducible (happens every time), and it causes a bizarre (theoretically impossible) state for my sockets. I have a client and a server on separate machines, both running Ubuntu 13.04 on EC2, and shortly into a stress test, all traffic ceases. I can watch the output from netstat on the two machines, and there is no change in the sockets' queue lengths. The client's socket has zero bytes in both its send queue and its receive queue. The server's socket has zero bytes in its receive queue and a few hundred thousand bytes in its send queue. Theoretically this state should be impossible unless the client has reduced its receive window size to zero. (I don't know any reason why it would be doing that, so I'm ruling that out.) Given that there is available space in the client's receive queue and pending bytes in the server's send queue, bytes should be moving from the server to the client, but this is not happening. The entire test hangs at this point until eventually the connection times out (after several minutes).

Matt Whitlock (whitslack) wrote :

Oh, and for what it's worth, the same test runs fine on Amazon Linux, which is running kernel 3.4.57-48.42.amzn1.x86_64. There are no "skb rides the rocket" messages when the test is run on Amazon Linux, only when it's run on Ubuntu.

Stefan Bader (smb) wrote :

The kernel version 3.8.0-19 sounds like the initially released kernel. There was an inquiry about some Xen-related network patches by Matt Wilson on the xen-devel mailing list this morning. I looked for the patches, and those have been applied and released with the 3.8.0-28.41 kernel (or higher).
Btw, the Raring release recently went out of support (the first of the non-LTS releases with reduced duration of maintenance). The kernel currently gets a bit longer support, but only as a hardware enablement option under Precise (12.04).

Sean Gifts (sgiftsm) wrote :

Having the same issue.

Log: 2014-03-25T15:39:21.750+00:00 kern/alert(1) kernel[]: [330160.244029] xen_netfront: xennet: skb rides the rocket: 19 slots

running kernel version: 3.11.0-18-generic

Are there any updates on a fix for this bug?

Stefan Bader (smb) wrote :

Not to my knowledge. Also depends on the exact issue. Reading through the report there might be two separate issues.
1. Packet loss when this message appears. This is the expected behaviour when trying to transmit
    a packet that would require more than 16+1 slots or 64kB.
    Compared to the Amazon kernel mentioned, all kernels later than 3.7 have the following change:
      * xen/netfront: handle compound page fragments on transmit
    If there is something wrong with that, I have not yet seen a patch for it. There might be a related bug
    (bug #1275879), but hitting that would cause a BUG stacktrace in the guest.

2. The invalid socket state mentioned in comment #3. I wonder whether that could hint that the netback driver
    running on the host side has closed the connection. There could be some messages to help in the host's
    dmesg, but we cannot get to those (only Amazon can). If that is the case, we should ensure these two
    patches are included in the kernel running in dom0 on the host:
      * xen-netback: don't disconnect frontend when seeing oversize packet
      * xen-netback: coalesce slots in TX path and fix regressions
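
The slot limit in point 1 can be illustrated with a little arithmetic. This is a sketch under two assumptions that are not stated in the report: pages are 4 KiB, and a buffer needs one slot for every page it touches. `slots_for` is a hypothetical helper, not a function from the driver.

```shell
#!/bin/sh
PAGE_SIZE=4096  # assumption: 4 KiB pages

# Number of page-sized slots touched by a buffer that starts at byte
# offset $1 and is $2 bytes long (round up after adding the in-page offset).
slots_for() {
    echo $(( ($1 % PAGE_SIZE + $2 + PAGE_SIZE - 1) / PAGE_SIZE ))
}

slots_for 0 65536     # 64 kB, page-aligned start: prints 16
slots_for 100 65536   # same length, unaligned start: prints 17
```

Under these assumptions, a packet made of several fragments that each start mid-page can need more slots than its byte length alone suggests, which would be consistent with the 19 to 21 slots reported in the log messages.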

Brian Moyles (bmoyles) wrote :

FWIW, we ran into the same messages and in researching, I came across this page
https://silenteh.com/sysadmin/2013/08/08/amazon-ec2-xennet-skb-rides-the-rocket.html
which suggests disabling TCP offload functionality using
ethtool -K eth0 rx off tx off sg off tso off ufo off gso off gro off lro off
to work around the problem. Disabling offload does in fact silence the messages (but also has the unfortunate side effect of dropping the MTU on boxes using jumbo frames).
Given that this happens to quiet things down, I figured it might help narrow down which patch or patches are responsible for a fix...

Stefan Bader (smb) wrote :

We finally have a way to reproduce this at will. It turns out that any kernel newer than 3.7 (which has the change to handle compound pages that I mentioned in comment #7) will suffer from this problem. The work-around that Brian found in the previous comment will indeed work, as it prevents the use of fragments (an "ethtool -K eth0 sg off" should be enough).
I will mark this bug as a duplicate of the newer one, since that already has more detailed information.
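
The narrower work-around could be wrapped in a small script like the one below. This is a sketch, not a tested fix: `disable_sg` and the `DRY_RUN` switch are hypothetical names introduced here, the interface name defaults to eth0 as in the comments above, and the setting made by ethtool does not persist across reboots.

```shell
#!/bin/sh
# Turn off scatter-gather on an interface, as suggested above.
# Needs root. Set DRY_RUN=1 to print the command instead of running it.
disable_sg() {
    iface=${1:-eth0}
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ethtool -K $iface sg off"
    else
        ethtool -K "$iface" sg off
    fi
}

# Check the result afterwards with:
#   ethtool -k eth0 | grep scatter-gather
```

Running it with DRY_RUN=1 first shows the exact command before anything on the instance is changed.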
