random silent corruption of TCP data
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Expired | Undecided | Unassigned | |
Bug Description
Hello! I’m having a very strange problem.
I’m the proud reporter of bug #554749, and I think I found something that might explain it. The short of that bug is that I’m using SSHFS to mount some shares from my server on my desktop; randomly (a few times each day) something goes wrong, and every program using that mount-point freezes. (I have to do a complex evil ritual to re-mount it without rebooting the computer.) While trying to debug it I discovered some occasional “Corrupted MAC on input” errors. I googled a bit for it, without much success; anyway, a post somewhere suggested I check for network corruption with netcat.
So, I cat’ed together two movie files, obtaining a 1.4 GB file filled with mostly random data. And I started shuttling it between the two computers, using netcat (via the default TCP). I did a dozen transfers, and exactly one of them was corrupted (the second, actually). Interestingly, the corruption was exactly 128 bytes long; the replaced data doesn’t have any obvious relationship to what was there originally.
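For anyone who wants to repeat this kind of check, here is a minimal sketch of how the position and length of a corrupted region can be located once both copies of the file are on the same machine. The function name `diff_ranges` and its parameters are made up for illustration; this is not the tool used for the test above.

```python
def diff_ranges(path_a, path_b, chunk_size=1 << 20):
    """Return a list of (offset, length) runs where the two files differ.

    If one file is shorter, the missing tail counts as differing.
    """
    ranges = []
    run_start = None  # offset where the current differing run began
    offset = 0        # offset of the chunk currently being compared
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a == b:
                # Fast path: an identical chunk closes any open run.
                if run_start is not None:
                    ranges.append((run_start, offset - run_start))
                    run_start = None
            else:
                # Slow path: walk the chunk byte by byte.
                for i in range(max(len(a), len(b))):
                    same = i < len(a) and i < len(b) and a[i] == b[i]
                    if not same and run_start is None:
                        run_start = offset + i
                    elif same and run_start is not None:
                        ranges.append((run_start, offset + i - run_start))
                        run_start = None
            offset += max(len(a), len(b))
    if run_start is not None:
        ranges.append((run_start, offset - run_start))
    return ranges
```

Calling `diff_ranges("sent.bin", "received.bin")` (hypothetical file names) returns one tuple per differing run. If the replacement data is random, a few bytes inside the damaged region may happen to match the original, so a single corrupted block can show up as several short runs close together.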
According to ifconfig,
bogdanb@
RX packets:9487952 errors:0 dropped:0 overruns:0 frame:0
TX packets:6132714 errors:0 dropped:0 overruns:0 carrier:2
bogdanb@
RX packets:149100044 errors:0 dropped:0 overruns:0 frame:0
TX packets:135620981 errors:0 dropped:0 overruns:0 carrier:0
there haven’t been any transmission errors, so this being just something that randomly passed undetected through the TCP checksum is _really_ unlikely. There’s also the suspicious length of the error.
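To make the checksum argument concrete, here is a simplified sketch of the 16-bit ones' complement checksum described in RFC 1071, which TCP uses. It is an illustration only: the real TCP checksum also covers a pseudo-header and the TCP header, both left out here, and the function name `internet_checksum` is invented for this example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word.
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# A single changed byte changes the checksum...
original = b"\x12\x34\x56\x78"
corrupted = b"\x12\x35\x56\x78"
assert internet_checksum(original) != internet_checksum(corrupted)

# ...but the sum is order-independent, so swapped 16-bit words go unnoticed.
swapped = b"\x56\x78\x12\x34"
assert internet_checksum(original) == internet_checksum(swapped)
```

Because the checksum is only 16 bits wide, a random replacement of data in a segment would be expected to slip past it only about once in 65,536 damaged segments, and the Ethernet frame CRC would normally have to miss the damage as well. That fits the suspicion that the data was altered somewhere after (or before) the checksum was verified rather than on the wire.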
I’d expect a tiny bug in some of the routines that shuttle data between the NIC’s buffer and the application’s. I’ve no idea how to debug this further, please help!
A few more notes:
*) all this happens via Ethernet; the two computers are both linked to a switch with short cables. Anyway, given the above, it doesn’t look like line errors.
*) the server runs Karmic, the desktop runs Lucid.
*) I’ve had similar (but not identical) problems with SSHFS ever since I’ve had these two computers (around Feisty, I think); it’s likely that whatever is causing the corruption has been there since the beginning, but the way SSHFS handles occurrences of the bug has changed.
*) whatever it is, it’s very random. As the test showed, I got a single error after 2 GB, then no other error for the next 15 GB of transferred files. However, the SSHFS error (which I’m pretty sure is caused by this) sometimes happens after 15 minutes, sometimes I have no problems for a full day.
*) I tried reporting this with ubuntu-bug, but Launchpad timed out on me several times in a row. Please tell me whatever information you think I should add.
---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Architecture: amd64
AudioDevicesInUse:
Cannot stat file /proc/19634/fd/3: Transport endpoint is not connected
/dev/snd/
/dev/snd/
/dev/snd/pcmC0D0p: bogdanb 1604 F...m pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
Card hw:0 'Intel'/'HDA Intel at 0xf9ff8000 irq 22'
Mixer name : 'Realtek ALC1200'
Components : 'HDA:10ec0888,
Controls : 40
Simple ctrls : 22
Card1.Amixer.info:
Card hw:1 'Headset'/'Logitech Logitech Wireless Headset at usb-0000:00:1d.0-2, full speed'
Mixer name : 'USB Mixer'
Components : 'USB046d:0a12'
Controls : 4
Simple ctrls : 2
DistroRelease: Ubuntu 10.04
EcryptfsInUse: Yes
Frequency: Once a day.
HibernationDevice: RESUME=/dev/sdb2
IwConfig:
lo no wireless extensions.
eth0 no wireless extensions.
MachineType: System manufacturer P5Q-PRO
NonfreeKernelMo
Package: linux (not installed)
ProcCmdLine: BOOT_IMAGE=
ProcEnviron:
LANGUAGE=en_US:en
PATH=(custom, user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcVersionSign
Regression: Yes
RelatedPackageV
Reproducible: No
RfKill:
Tags: lucid networking regression-
Uname: Linux 2.6.32-21-generic x86_64
UserAsoundrc:
# ALSA library configuration file
# Include settings that are under the control of asoundconf(1).
# (To disable these settings, comment out this line.)
</home/
UserGroups: adm admin audio cdrom dialout floppy fuse lpadmin netdev plugdev sambashare scanner staff video
WpaSupplicantLog:
dmi.bios.date: 11/04/2008
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1501
dmi.board.
dmi.board.name: P5Q-PRO
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.
dmi.modalias: dmi:bvnAmerican
dmi.product.name: P5Q-PRO
dmi.product.
dmi.sys.vendor: System manufacturer
description: updated
tags: added: acpi-table-checksum
Changed in linux (Ubuntu):
status: New → Confirmed
Hi Bogdan,
Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/releases/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.
apport-collect -p linux 568616
Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.
Thanks in advance.
[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]