TCP stalled transfer with erroneous SACK information

Bug #1388786 reported by Jose Manuel Pasamar
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

Cisco PIX/FWSM firewalls rewrite (randomize) TCP sequence numbers but do not rewrite the matching numbers in the SACK TCP option.
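To make the mismatch concrete, here is a minimal Python model. It assumes a firewall that adds a fixed per-connection offset to the server's sequence numbers, subtracts it again from the client's cumulative ACK, but passes SACK blocks through unchanged. The offset, sequence numbers and segment size are made-up values and the helper names are hypothetical; real FWSM behaviour may differ in detail.

```python
MASK = 0xFFFFFFFF  # TCP sequence numbers are 32 bits wide and wrap around


def to_client(seq: int, delta: int) -> int:
    """Hypothetical firewall step: shift a server->client sequence number by an offset."""
    return (seq + delta) & MASK


def to_server(seq: int, delta: int) -> int:
    """Hypothetical firewall step: remove the offset from a number going back to the server."""
    return (seq - delta) & MASK


def forward_ack(ack: int, sack_blocks: list[tuple[int, int]], delta: int):
    """What the server receives: the ACK is translated, the SACK blocks are not."""
    return to_server(ack, delta), list(sack_blocks)


# Hypothetical values in the server's sequence space.
delta = 0x12345678   # firewall offset (305419896)
snd_una = 1_000_000  # first unacknowledged byte
mss = 1448           # segment size

# The client lost the segment at snd_una but received the next one, so it
# ACKs snd_una and SACKs [snd_una + mss, snd_una + 2*mss), in ITS sequence space.
client_ack = to_client(snd_una, delta)
client_sack = [(to_client(snd_una + mss, delta), to_client(snd_una + 2 * mss, delta))]

ack_seen, sack_seen = forward_ack(client_ack, client_sack, delta)
print(ack_seen)   # 1000000: correct in the server's space
print(sack_seen)  # [(306421344, 306422792)]: still shifted by delta
```

Translated correctly, the SACK block would have been (1001448, 1002896); instead the server sees numbers about 305 million bytes beyond anything it has sent.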

When this erroneous information reaches a Linux server, under some circumstances the TCP stack gets into a bad state with the CUBIC congestion control algorithm and the transfer stalls.

The problem can be reproduced on Ubuntu Server 14.04 when a Cisco FWSM rewrites sequence numbers (its default configuration) and a large file (30 MB, for example) is being transferred.

It can be worked around by disabling SACK:
sysctl -w net.ipv4.tcp_sack=0
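Each sysctl key corresponds to a file under /proc/sys, so the current value can be checked before and after applying the workaround (for example with `sysctl net.ipv4.tcp_sack`). A minimal Python sketch of reading it directly, assuming a Linux host; the helper names are hypothetical, and the simple dot-to-slash mapping does not handle keys whose components themselves contain dots (such as VLAN interface names). Note that `sysctl -w` changes the value only until the next reboot.

```python
from pathlib import Path


def sysctl_path(key: str) -> Path:
    """Hypothetical helper: map a dotted sysctl key to its /proc/sys file (dots become slashes)."""
    return Path("/proc/sys") / key.replace(".", "/")


def read_sysctl(key: str) -> str:
    """Return the current value as text (Linux only; reading needs no root)."""
    return sysctl_path(key).read_text().strip()


if __name__ == "__main__":
    for key in ("net.ipv4.tcp_sack", "net.ipv4.tcp_congestion_control"):
        path = sysctl_path(key)
        if path.exists():
            print(f"{key} = {read_sysctl(key)}")
        else:
            print(f"{key}: {path} not found (not Linux?)")
```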

We have also worked around it with this configuration:
sysctl -w net.ipv4.tcp_congestion_control=reno
sysctl -w net.ipv4.tcp_frto=1
sysctl -w net.ipv4.tcp_early_retrans=1
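Because `sysctl -w` settings are lost at reboot, one way to keep either workaround is a drop-in file under /etc/sysctl.d/ in the `key = value` format, loaded with `sysctl -p <file>`. The sketch below only renders the file contents; the helper, the dictionary names and the file name in the comment are hypothetical examples, and you would normally apply just one of the two workarounds.

```python
# Hypothetical helper: render a sysctl.d drop-in, e.g. /etc/sysctl.d/60-tcp-workaround.conf
SACK_OFF = {"net.ipv4.tcp_sack": "0"}

RENO_FRTO_EARLY_RETRANS = {
    "net.ipv4.tcp_congestion_control": "reno",
    "net.ipv4.tcp_frto": "1",
    "net.ipv4.tcp_early_retrans": "1",
}


def render_sysctl_conf(settings: dict[str, str], comment: str) -> str:
    """Return sysctl.conf-style text: one comment line, then one 'key = value' per setting."""
    lines = [f"# {comment}"]
    lines += [f"{key} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sysctl_conf(SACK_OFF, "Work around SACK problem behind Cisco FWSM (LP: #1388786)"), end="")
```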

We can also fix it by changing the firewall configuration.

Attached is a Wireshark capture. At frame 16613 you can see the client requesting segment 853521869 while the server (158.42.250.128) keeps resending an earlier segment for 87 seconds, until the transfer stops.
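To locate a stall like this in a large capture, one option is to export the server's data segments as (time, sequence number) pairs, for example with `tshark -r capture.pcap -Y 'ip.src==158.42.250.128 && tcp.len>0' -T fields -e frame.time_relative -e tcp.seq`, and look for a sequence number that is sent again and again. The Python sketch below assumes that input format; the file name, function name and sample data are hypothetical and not taken from the attached capture.

```python
def longest_repeat(segments):
    """Return the sequence number re-sent over the longest time span.

    segments: iterable of (time_seconds, seq) for server->client data segments.
    Returns (seq, times_sent, first_time, last_time), or None if nothing repeats.
    """
    first, last, count = {}, {}, {}
    for t, seq in segments:
        first.setdefault(seq, t)
        last[seq] = t
        count[seq] = count.get(seq, 0) + 1
    repeated = [seq for seq, n in count.items() if n > 1]
    if not repeated:
        return None
    worst = max(repeated, key=lambda s: last[s] - first[s])
    return worst, count[worst], first[worst], last[worst]


# Made-up example: the segment at 1548 is retransmitted with growing gaps.
sample = [(0.0, 100), (0.1, 1548), (0.2, 2996), (1.0, 1548), (3.0, 1548), (7.0, 1548)]
print(longest_repeat(sample))  # (1548, 4, 0.1, 7.0)
```

The growing gaps in the sample mimic retransmission timer backoff; in the reported capture the same segment was resent for 87 seconds.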

Thanks

Jose Manuel Pasamar (jpasamar) wrote :
William Grant (wgrant) wrote :

This is the bugtracker for the Launchpad.net software development collaboration website. You'll need to contact Cisco support for bugs in Cisco devices.

Changed in launchpad:
status: New → Invalid
Jose Manuel Pasamar (jpasamar) wrote :

Hello William

I know the problem is triggered by the Cisco firewall, but I'm sure the Linux kernel is not behaving correctly in that situation.
I think the TCP stack should discard the erroneous information instead of resending the same packet for such a long time.

Something is not working correctly in the TCP stack.

Thanks

William Grant (wgrant) wrote :

Ah, in that case you probably wanted to report a bug against the linux package in Ubuntu. I'll move it across.

affects: launchpad → linux (Ubuntu)
Changed in linux (Ubuntu):
status: Invalid → New
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1388786

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Uwe Schindler (uwe-thetaphi) wrote :

I can confirm this bug with the 3.13 kernel shipped with Ubuntu 14.04. The 3.2 kernel shipped with Ubuntu 12.04 did not have the problem; all downloads succeeded. The problem started after the upgrade to 14.04, and it went away again after downgrading to the latest 12.04 (Linux 3.2) kernel downloaded from the PPA.

Disabling TCP SACK also resolved the issue. I also tried newer PPA kernels; at least the one scheduled for the next Ubuntu update (3.14) does not solve the problem yet.

In our case the problem happens mostly with distant or slower DSL clients downloading large files from our web server. The download starts fine but suddenly slows down to 0 bytes/sec. It sometimes recovers, but then fails again after a short while.

The tcpdump capture showed that the client sends SACKs over and over, but the Ubuntu kernel does not understand them. In our case there is also Cisco hardware in between.

Uwe Schindler (uwe-thetaphi) wrote :

This link may also be related; it describes the same problem, but without Cisco hardware: https://askubuntu.com/questions/475700/application-stuck-in-tcp-retransmit/563984

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.19 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19-rc7-vivid/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Jose Manuel Pasamar (jpasamar) wrote :

We have tested kernel 3.19-rc7 and found some improvement.

Communication does not stop and the file can be finally downloaded, but it still takes a long time.

We transferred a 100 MB file from an Ubuntu server running kernel 3.19-rc7 through the Cisco firewall (with sequence number rewriting enabled), with the following results:
- With TCP SACK disabled (sysctl -w net.ipv4.tcp_sack=0): 62 seconds
- With TCP SACK enabled (default configuration): 462 seconds
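Converting these two timings into average throughput shows the size of the penalty. A small Python sketch of the arithmetic; the variable and function names are illustrative only:

```python
FILE_MB = 100  # file size used in the test above
timings_s = {"SACK disabled": 62, "SACK enabled": 462}


def throughput_mb_s(size_mb: float, seconds: float) -> float:
    """Average throughput over the whole transfer."""
    return size_mb / seconds


for label, secs in timings_s.items():
    print(f"{label}: {throughput_mb_s(FILE_MB, secs):.2f} MB/s")
# SACK disabled: 1.61 MB/s
# SACK enabled: 0.22 MB/s

slowdown = timings_s["SACK enabled"] / timings_s["SACK disabled"]
print(f"slowdown: {slowdown:.1f}x")  # slowdown: 7.5x
```

In other words, with SACK enabled the same transfer ran at roughly 1/7.5 of the speed on 3.19-rc7.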

It seems that, even though the communication is not completely stalled with this kernel version, the problem is not solved yet.

tags: added: kernel-bug-exists-upstream
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Changed in linux (Ubuntu):
status: Expired → In Progress
hkais (r-2) wrote :

I can reproduce this error too.
The environment is an all-Cisco network with VMware ESXi hosts and Ubuntu 14.04 guests.

Here too, downloads of up to about 2 MB work more or less fine, but anything that takes longer (or transfers more data) drops to a very low bandwidth, very often with no data transmitted at all.

It looks as if, when packets are received out of order, SACK is too aggressive and kills the high transmission speed.

Guru Evi (vanooste) wrote :

This can be fixed by turning off TCP sequence number randomization on the Cisco appliance. Please note this also affects your Mac, BSD and Windows machines. You can turn off SACK on your host if you don't care about performance.

This feature was enabled by Cisco to protect Windows 95 hosts from TCP sequence prediction attacks (yeah, don't fix the problem, just break the network). However, Cisco doesn't translate the SACK ranges of the connections whose sequence numbers it has modified, so your host gets back the 'wrong' range in the SACK response and simply ignores it because it doesn't match anything it sent.

https://supportforums.cisco.com/document/48551/single-tcp-flow-performance-firewall-services-module-fwsm
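The "simply ignores it" behaviour can be illustrated with a simplified validity test, assuming the sender only accepts SACK blocks that are non-empty and lie between its oldest unacknowledged byte (snd_una) and the next byte it will send (snd_nxt). Comparisons use 32-bit wraparound arithmetic in the style of the kernel's before()/after() helpers. Linux's actual check, tcp_is_sackblock_valid() in net/ipv4/tcp_input.c, also handles D-SACK and other cases, so treat this as a model rather than the kernel's logic; the function name and numbers below are hypothetical.

```python
MASK = 0xFFFFFFFF  # 32-bit sequence space


def before(a: int, b: int) -> bool:
    """True if a comes before b, treating (a - b) as a signed 32-bit value."""
    return ((a - b) & MASK) >= 0x80000000


def after(a: int, b: int) -> bool:
    """True if a comes after b."""
    return before(b, a)


def sack_block_plausible(start: int, end: int, snd_una: int, snd_nxt: int) -> bool:
    """Simplified model: accept only a non-empty block inside (snd_una, snd_nxt]."""
    return before(start, end) and after(start, snd_una) and not after(end, snd_nxt)


snd_una, snd_nxt = 1_000_000, 1_028_960  # hypothetical: 20 segments of 1448 bytes in flight
delta = 0x12345678                       # hypothetical firewall offset

good = (1_001_448, 1_002_896)               # SACK block in the server's own sequence space
stale = (good[0] + delta, good[1] + delta)  # same block, never translated back

print(sack_block_plausible(*good, snd_una, snd_nxt))   # True
print(sack_block_plausible(*stale, snd_una, snd_nxt))  # False: beyond snd_nxt, so ignored
```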

Changed in linux (Ubuntu):
status: In Progress → Invalid