random silent corruption of TCP data

Bug #568616 reported by Bogdan Butnaru
This bug affects 1 person
Affects          Status    Importance   Assigned to   Milestone
linux (Ubuntu)   Expired   Undecided    Unassigned

Bug Description

Hello! I’m having a very strange problem.

I’m the proud reporter of bug #554749, and I think I found something that might explain it. The short of that bug is that I’m using SSHFS to mount some shares from my server on my desktop; randomly (a few times each day) something goes wrong, and every program using that mount-point freezes. (I have to do a complex evil ritual to re-mount it without rebooting the computer.) While trying to debug it I discovered some occasional “Corrupted MAC on input” errors. I googled a bit for it, without much success; anyway, a post somewhere suggested I check for network corruption with netcat.

So, I cat’ed together two movie files, obtaining a 1.4 GB file filled with mostly random data. And I started shuttling it between the two computers, using netcat (via the default TCP). I did a dozen transfers, and exactly one of them was corrupted (the second, actually). Interestingly, the corruption was exactly 128 bytes long; the replaced data doesn’t have any obvious relationship to what was there originally.
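To pin down where two copies of a transferred file differ, and how long each damaged run is, a comparison along the following lines could be used. This is only a sketch, not the tool used in the report: the function name `diff_runs` and the file names in the usage note are made up for illustration.

```python
def diff_runs(path_a, path_b, chunk_size=1 << 20):
    """Yield (offset, length) for each run of consecutive differing bytes."""
    run_start = None  # offset where the current differing run began
    offset = 0        # file offset of the chunk being compared
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a == b:
                # Identical chunk: close any run that ended at its start.
                if run_start is not None:
                    yield run_start, offset - run_start
                    run_start = None
                offset += len(a)
                continue
            n = min(len(a), len(b))
            for i in range(n):
                if a[i] != b[i]:
                    if run_start is None:
                        run_start = offset + i
                elif run_start is not None:
                    yield run_start, offset + i - run_start
                    run_start = None
            if len(a) != len(b) and run_start is None:
                # One file is shorter: count the extra tail as differing.
                run_start = offset + n
            offset += max(len(a), len(b))
    if run_start is not None:
        yield run_start, offset - run_start
```

For example, `list(diff_runs("sent.bin", "received.bin"))` would return one `(offset, length)` pair per damaged region. With random replacement data a single corrupted block may show up as several shorter runs, because some replaced bytes can match the originals by chance.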

According to ifconfig,

bogdanb@mabelode:~/tests$ ifconfig eth0 |grep errors
          RX packets:9487952 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6132714 errors:0 dropped:0 overruns:0 carrier:2
bogdanb@tanelorn:~/tests$ ifconfig eth0|grep errors
          RX packets:149100044 errors:0 dropped:0 overruns:0 frame:0
          TX packets:135620981 errors:0 dropped:0 overruns:0 carrier:0

neither interface has counted any errors, so it is _really_ unlikely that this is line noise that happened to slip past the TCP checksum. The length of the corrupted run, exactly 128 bytes, is also suspicious.
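For context on that argument, TCP uses the 16-bit ones' complement Internet checksum described in RFC 1071. The sketch below implements only that algorithm. A real TCP checksum also covers a pseudo-header and the TCP header, which are left out here, and the function name is illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement checksum of data (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF
```

Because the result has only 16 bits, a block replaced with unrelated data would still match with a probability of roughly 1 in 65536. The checksum also cannot detect reordered 16-bit words: `internet_checksum(b"\x12\x34\x56\x78")` equals `internet_checksum(b"\x56\x78\x12\x34")`. Neither property says where the corruption in this report happens; damage introduced after the checksum has been verified would not be caught by it at all.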

I suspect a small bug in one of the routines that shuttle data between the NIC’s buffer and the application’s. I’ve no idea how to debug this further; please help!

A few more notes:
*) all this happens via Ethernet; the two computers are both linked to a switch with short cables. Anyway, given the above, it doesn’t look like line errors.
*) the server runs Karmic, the desktop runs Lucid.
*) I’ve had similar (but not identical) problems with SSHFS ever since I’ve had these two computers (around Feisty, I think); it’s likely that whatever is causing the corruption has been there since the beginning, and that only the way SSHFS handles occurrences of the bug has changed.
*) whatever it is, it’s very random. As the test showed, I got a single error after 2 GB, then no other error for the next 15 GB of transferred files. However, the SSHFS error (which I’m pretty sure is caused by this) sometimes happens after 15 minutes, sometimes I have no problems for a full day.
*) I tried reporting this with ubuntu-bug, but Launchpad timed out on me several times in a row. Please tell me whatever information you think I should add.
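Since the error is this rare, a test has to push a lot of data before it shows up. One way to repeat the netcat experiment unattended is to hash the stream on both ends instead of storing every copy. The sketch below is an assumption-laden stand-in for netcat, not what was run for this report: `serve_once` and `send_file` are hypothetical names, and the caller creates the listening socket.

```python
import hashlib
import socket

CHUNK = 64 * 1024  # read/receive size in bytes


def serve_once(listener: socket.socket) -> str:
    """Accept one connection and return the SHA-256 of everything received."""
    conn, _ = listener.accept()
    digest = hashlib.sha256()
    with conn:
        while True:
            block = conn.recv(CHUNK)
            if not block:  # peer closed the connection
                break
            digest.update(block)
    return digest.hexdigest()


def send_file(path: str, host: str, port: int) -> str:
    """Stream a file over TCP and return the SHA-256 of what was sent."""
    digest = hashlib.sha256()
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            digest.update(block)
            sock.sendall(block)
    return digest.hexdigest()
```

The receiving machine would call `serve_once` on a listening socket and the sending machine `send_file`. If the two digests differ for any transfer, the data changed somewhere between the sender's read and the receiver's `recv`.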

---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Architecture: amd64
AudioDevicesInUse:
 Cannot stat file /proc/19634/fd/3: Transport endpoint is not connected
                      USER PID ACCESS COMMAND
 /dev/snd/controlC1: bogdanb 1604 F.... pulseaudio
 /dev/snd/controlC0: bogdanb 1604 F.... pulseaudio
 /dev/snd/pcmC0D0p: bogdanb 1604 F...m pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf9ff8000 irq 22'
   Mixer name : 'Realtek ALC1200'
   Components : 'HDA:10ec0888,104382fe,00100101'
   Controls : 40
   Simple ctrls : 22
Card1.Amixer.info:
 Card hw:1 'Headset'/'Logitech Logitech Wireless Headset at usb-0000:00:1d.0-2, full speed'
   Mixer name : 'USB Mixer'
   Components : 'USB046d:0a12'
   Controls : 4
   Simple ctrls : 2
DistroRelease: Ubuntu 10.04
EcryptfsInUse: Yes
Frequency: Once a day.
HibernationDevice: RESUME=/dev/sdb2
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
MachineType: System manufacturer P5Q-PRO
NonfreeKernelModules: nvidia
Package: linux (not installed)
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-21-generic root=/dev/sda1 ro nomodeset
ProcEnviron:
 LANGUAGE=en_US:en
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.32-21.32-generic 2.6.32.11+drm33.2
Regression: Yes
RelatedPackageVersions: linux-firmware 1.34
Reproducible: No
RfKill:

Tags: lucid networking regression-potential needs-upstream-testing
Uname: Linux 2.6.32-21-generic x86_64
UserAsoundrc:
 # ALSA library configuration file

 # Include settings that are under the control of asoundconf(1).
 # (To disable these settings, comment out this line.)
 </home/bogdanb/.asoundrc.asoundconf>
UserGroups: adm admin audio cdrom dialout floppy fuse lpadmin netdev plugdev sambashare scanner staff video
WpaSupplicantLog:

dmi.bios.date: 11/04/2008
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1501
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: P5Q-PRO
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1501:bd11/04/2008:svnSystemmanufacturer:pnP5Q-PRO:pvrSystemVersion:rvnASUSTeKComputerINC.:rnP5Q-PRO:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: P5Q-PRO
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Bogdan Butnaru (bogdanb)
description: updated
Jeremy Foshee (jeremyfoshee) wrote :

Hi Bogdan,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/releases/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 568616

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Bogdan Butnaru (bogdanb) wrote : AlsaDevices.txt

apport information

tags: added: apport-collected
description: updated
Bogdan Butnaru (bogdanb) wrote : AplayDevices.txt

apport information

Bogdan Butnaru (bogdanb) wrote : ArecordDevices.txt

apport information

Bogdan Butnaru (bogdanb) wrote : BootDmesg.txt

apport information

Bogdan Butnaru (bogdanb) wrote : Card0.Amixer.values.txt

apport information

Bogdan Butnaru (bogdanb) wrote : Card0.Codecs.codec.0.txt

apport information

Bogdan Butnaru (bogdanb) wrote : Card1.Amixer.values.txt

apport information

Bogdan Butnaru (bogdanb) wrote : CurrentDmesg.txt

apport information

Bogdan Butnaru (bogdanb) wrote : Lspci.txt

apport information

Bogdan Butnaru (bogdanb) wrote : Lsusb.txt

apport information

Bogdan Butnaru (bogdanb) wrote : PciMultimedia.txt

apport information

Bogdan Butnaru (bogdanb) wrote : ProcCpuinfo.txt

apport information

Bogdan Butnaru (bogdanb) wrote : ProcInterrupts.txt

apport information

Bogdan Butnaru (bogdanb) wrote : ProcModules.txt

apport information

Bogdan Butnaru (bogdanb) wrote : UdevDb.txt

apport information

Bogdan Butnaru (bogdanb) wrote : UdevLog.txt

apport information

Bogdan Butnaru (bogdanb) wrote : UserAsoundrcAsoundconf.txt

apport information

Bogdan Butnaru (bogdanb) wrote : WifiSyslog.txt

apport information

Bogdan Butnaru (bogdanb) wrote :

Hi Jeremy,

I’ve done the apport-collect thing and it worked this time. It wasn’t done with an ISO, just with my “normal” Lucid installation.

(I’m not sure whether the ISO CD part was just boilerplate from the automated message; the bug is hard to reproduce, so testing from a CD would be difficult. Let me know if it’s important.)

I’m looking at trying a mainline kernel right now, will keep you posted.

Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu release http://www.ubuntu.com/getubuntu/download . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired
Bogdan Butnaru (bogdanb) wrote :

I reopened this; I’m not sure why it was marked incomplete. Perhaps I just forgot to change the status after adding all the info above.

Anyhow, I still get this as far as I can tell. It happens randomly; during normal usage it can happen* quite often. When I tried testing it myself with netcat and similar tools, I only got errors a few times per tens of gigabytes.

(*: that is, bugs that I think are triggered by this happen.)

As I said above, I don’t quite know how to deal with this.

Changed in linux (Ubuntu):
status: Expired → New
Brad Figg (brad-figg)
tags: added: acpi-table-checksum
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
kripton (kripton-r) wrote :
penalvch (penalvch) wrote :

Bogdan Butnaru, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command in the development release from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.

Please let us know your results. Thanks in advance.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: lucid
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired