Ubuntu

Large file transfer gives error: Corrupted MAC on input

Reported by Mika Fischer on 2006-09-16
This bug affects 19 people
Affects: linux (Ubuntu); Importance: Medium; Assigned to: Unassigned; Nominated for Karmic by Dan Kegel
Affects: linux-source-2.6.17 (Ubuntu); Importance: Undecided; Assigned to: Unassigned; Nominated for Karmic by Dan Kegel

Bug Description

I have a box running dapper and one running edgy. If I transfer large files from the dapper box to my edgy box (i.e. ssh-server on dapper box, client on edgy-box), it gives me the following error message:

mika@lt-mf:~$ scp 192.168.2.101:/home/sina/Desktop/Zeug/*.MOV .
mika@192.168.2.101's password:
PICT0015.MOV 9% 1248KB 611.2KB/s 00:20 ETA
Disconnecting: Corrupted MAC on input.
lost connection

This can be reliably reproduced.

The problem does not occur if I transfer it by "pushing" (i.e. ssh-client on dapper-box and ssh-server on edgy box).
The problem also does not occur with any other ssh-server.

Let me know if I can provide more info.

Mika Fischer (zoop) wrote :

Well, I can't seem to reproduce this after the latest kernel update of dapper.

So you can consider this fixed.

Mika Fischer (zoop) wrote :

Correction: I can reproduce it, it just takes longer until the error occurs. So it's not fixed after all...

Brian Murray (brian-murray) wrote :

Thanks for your bug report. I was wondering if this is still an issue for you. If it is, approximately how large a file are you copying? Furthermore, do you notice anything in 'dmesg' when the connection is lost? Thanks in advance.

Changed in openssh:
assignee: nobody → brian-murray
status: Unconfirmed → Needs Info
Mika Fischer (zoop) wrote :

Unfortunately it's still an issue.

I first suspected the WLAN router but now the machines are directly connected by a switch and the problem still occurs...

I initially noticed the problem with a video file taken with a digicam. I don't know exactly how big it was, probably around 50 MB. But the problem occurs rather randomly. The longer the file the higher the probability that the error occurs. As it is, I cannot even transmit a 20 MB file this way.

Well. If you know a way to debug exactly where the MAC gets corrupted, I could try this. Other than that I don't know how this can be resolved.

I'll also see if I can get my hand on another network adapter and see if that changes anything.

In the meantime thanks for your time!

Brian Murray (brian-murray) wrote :

Perhaps using scp in verbose mode would be more informative. The switch for verbose is '-v', so could you try 'scp -v'? Also, is there anything in your kernel log around the time when these errors occur?

Mika Fischer (zoop) wrote :

Ah, sorry. I forgot to say that there's nothing in the kernel log, neither on the client nor on the server.

I've tried running the server and the client at LogLevel DEBUG3 and will attach the logs.

I also inserted some debug output which gives this additional information (as an example, the actual MACs differ of course each time) from the client.

Expected MAC: fe 89 0d f5 b8 a4 32 2e ff 50 c3 32 38 62 4c 84
Received MAC: 9d df fb 36 7b d3 7d 43 51 e9 92 9f 74 1e 20 c2

I can't really find a pattern here...
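
The lack of a visible pattern is what a MAC is designed to produce: changing even a single bit of the protected data yields an unrelated digest, so comparing the two values cannot show where the corruption happened. A minimal sketch of that effect, using plain `md5sum` as a stand-in for the keyed MAC that ssh uses (the 16-byte values above are consistent with hmac-md5); the sample strings and the `digest` helper are made up for illustration:

```shell
#!/bin/sh
# Illustration only: md5sum is an unkeyed hash, not the keyed MAC ssh computes.

# digest STRING: print only the hex MD5 digest of STRING.
digest() {
    printf '%s' "$1" | md5sum | cut -d' ' -f1
}

d1=$(digest 'payload-A')   # 'A' is 0x41
d2=$(digest 'payload-C')   # 'C' is 0x43, exactly one bit away from 'A'
echo "original:  $d1"
echo "corrupted: $d2"
```

The two printed digests share no obvious structure even though the inputs differ by one bit.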

So maybe something really fishy is going on hardware-wise. I'll try replacing the NIC and checking the RAM of that machine...

Brian Murray (brian-murray) wrote :

Could you please add information regarding the type of network adapter on both machines? ('lspci -vvn') Additionally if you could add 'sudo ethtool eth0' where eth0 is the network interface being used in the file transfer that would help. Thanks again.

Mika Fischer (zoop) wrote :

On the server the NIC has PCI id 00:0a.0.

On the client I actually have no idea which one of the PCI devices corresponds to the NIC because it's an onboard one...

Brian Murray (brian-murray) wrote :

Thanks for updating the bug report. As it turns out, the output of 'ethtool -k' would be more informative. Could you add that too? I apologize for the mistake.

Mika Fischer (zoop) wrote :

No problem :)

Kyle McMartin (kyle) wrote :

This is extremely strange. Could you attach the dmesg from both the client and server, and the output of ifconfig from both? (Feel free to edit out any private IPs or anything like that)

You say you saw this when the client was on wireless and also tried plugging the client into the switch (so it's probably not a driver problem on the client then?)

Cheers,
  Kyle

Mika Fischer (zoop) wrote :

Come to think of it, I can pretty much rule out the client in this. This is a completely different machine from the one I used when I first reported this bug...

It also can't be something in the network infrastructure because it was the same when I wasn't using the switch but the server was directly connected to the WLAN router...

So the problem has to lie on the server-side. My guess is still some hardware issue...

I'll attach the info you asked for.

Regards,
 Mika

Brian Murray (brian-murray) wrote :

To eliminate ssh from the equation I was wondering if you could test doing a file transfer with netcat. Here is an example of that:

At the server console:

$ nc -v -w 30 -p 5600 -l > filename.back

and on the client side:

$ nc -v -w 2 10.0.1.1 5600 < filename

The file named filename is being sent from the client to the server on port 5600 and the server is writing it to disk as filename.back. You could read more about using netcat in this little article:

http://www.oreillynet.com/pub/h/1058

Please let us know what you find out. Thanks in advance.

Mika Fischer (zoop) wrote :

Very good idea!

As it turns out you're right. The same thing happens with netcat. Also only when the broken computer acts as server. The other way round works fine...

05fab97be7fd5e7c9229187c24c89ea0 test.bin.orig
05fab97be7fd5e7c9229187c24c89ea0 test.bin.m2s
7dcb7bef6d1af049bd63fcf6d180685e test.bin.s2m
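
The checksum comparison above can be scripted so that repeated runs are easy to check. A minimal sketch, assuming `md5sum` is available; `same_md5` is a hypothetical helper name, not part of any tool mentioned in this report:

```shell
#!/bin/sh
# same_md5 FILE_A FILE_B
# Exit status 0 if both files have the same MD5 digest, 1 if they differ,
# 2 if either file cannot be read.
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}
```

With the file names used above, `same_md5 test.bin.orig test.bin.s2m || echo "copy is corrupted"` would flag the server-to-client copy.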

I guess I'll just get a new NIC and see if this helps...

Any idea what else could be the cause of this?

And thanks a lot for helping me debug this!

Mika Fischer (zoop) wrote :

I switched the NIC with another one of the same type and put it into another PCI slot. Didn't change anything...

Then I let memtest86+ run and it also didn't detect anything.

I then lowered the bus clock frequency from 133 to 100 MHz. Also no effect.

So I'm quite stuck here. The only thing I can say is that it's not openssh related...

If you have any ideas what else I could try, please let me know...

Changed in openssh:
assignee: brian-murray → nobody
status: Needs Info → Confirmed
Mika Fischer (zoop) wrote :

OK. I also tried only using one memory module at a time. Still no luck.

I then tried to rule out the obvious by connecting the computers using a crossover-cable. No change...

I then tried transmitting a file consisting only of zero bytes. This surprisingly worked.

I then discovered vbindiff and used that to check what exactly had changed in the corrupted file.

The corruptions occur at different places each time. But it's always complete words (I'm using 32 bit Ubuntu) that get corrupted. In total it's about 4-6 corruptions in a 50MB file.
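
The same observation can be made without an interactive viewer. A rough sketch, assuming POSIX `cmp` and `awk`; `corrupted_words` is a hypothetical helper that lists the start offset of every 4-byte word containing at least one differing byte:

```shell
#!/bin/sh
# corrupted_words GOOD BAD
# Print the 0-based start offset of each 4-byte word in which the two files
# differ. Each affected word is printed once.
corrupted_words() {
    # cmp -l prints one line per differing byte: 1-based offset, then the
    # two byte values in octal. Only the offset is used here.
    cmp -l "$1" "$2" 2>/dev/null |
        awk '{
            w = int(($1 - 1) / 4) * 4      # round down to the word start
            if (!(w in seen)) { seen[w] = 1; print w }
        }'
}
```

Counting the printed lines (for example with `wc -l`) gives the number of damaged words, which can be compared with the 4-6 corruptions per 50 MB reported above.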

Is there another kernel that I could try without messing the system up too much?

Or anything else I could try?

Mika Fischer (zoop) wrote :

The kernel used is actually 2.6.17-10-generic version 2.6.17.1-10.34 from edgy

Hi,

I have the same issue with a dapper server. In my case it does not only happen with scp to another machine; sometimes it happens just after logging in via ssh.

There are other Ubuntu dapper servers on the same switch without the problem, so I think it is a driver issue. The machine has two identical NICs and the issue occurs on both.

The NICs are Intel 82546EB Gigabit Ethernet in a Dell PowerEdge 650 and the driver is e1000.

Brian Murray (brian-murray) wrote :

Mika - I don't easily see what network driver you are using. Is it the e1000 also? Thanks in advance.

Mika Fischer (zoop) wrote :

I'm afraid it's a different card: 3Com PCI 3c905C Tornado

Mika Fischer (zoop) wrote :

Oh, and the driver is 3c59x.

stefanolodi (slodi) wrote :

One of the hosts I manage has been affected by the problem discussed in this thread for months now. It occurs both when transferring files with scp and during ssh sessions, at a seemingly random time after the start of the transfer or session. It's a very annoying problem: some days it is virtually impossible to stay connected for more than a few minutes.

OS:

Linux myhost 2.6.18-4-k7 #1 SMP Mon Mar 26 17:57:15 UTC 2007 i686 GNU/Linux

As to NIC hardware

> dmesg | grep 3C
0000:00:0b.0: 3Com PCI 3c905B Cyclone 100baseTx at f081c000.

> lsmod |grep 3c5
3c59x 40808 0
mii 5696 1 3c59x

If anyone could suggest a test I'd be happy to carry it out. (I have not tried netcat yet; asap I will).

I had the same problem with my LOM NIC, a Marvell 88E8001. I'm quite sure it was a hardware/driver issue, but in my case I solved it by disabling checksum offload on the NIC with the following command:

ethtool -K eth0 rx off tx off

I hope this could help

Mika Fischer (zoop) wrote :

Unfortunately my card does not support this :(

$ sudo ethtool -K eth0 tx off
Cannot set device tx csum settings: Operation not supported
$ sudo ethtool -K eth0 rx off
Cannot set device rx csum settings: Operation not supported

Brian Murray (brian-murray) wrote :

I am assigning this bug to the 'ubuntu-kernel-team' per their bug policy. For future reference you can learn more about their bug policy at https://wiki.ubuntu.com/KernelTeamBugPolicies .

Changed in linux-source-2.6.17:
assignee: nobody → ubuntu-kernel-team
Johan Christiansen (johandc) wrote :

I have this same problem on a Hardy laptop running:

02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5751M Gigabit Ethernet PCI Express (rev 11)
 Subsystem: IBM Unknown device 0577
 Flags: bus master, fast devsel, latency 0, IRQ 17
 Memory at a0100000 (64-bit, non-prefetchable) [size=64K]
 Expansion ROM at <ignored> [disabled]
 Capabilities: [48] Power Management version 2
 Capabilities: [50] Vital Product Data
 Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
 Capabilities: [d0] Express Endpoint IRQ 0

Using:

johan@johan-laptop:~$ sudo ethtool -i eth0
driver: tg3
version: 3.86
firmware-version: 5751m-v3.40a
bus-info: 0000:02:00.0

with:

johan@johan-laptop:~$ sudo ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

The problem only exists when I'm using my Broadcom interface; when I use the built-in wireless interface everything seems okay. I'll try turning off some of the offloading and report back if it works.
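
To see at a glance which offloads are active before switching them off, the `ethtool -k` output can be filtered. A small sketch, assuming the simple `name: on` / `name: off` format shown above (newer ethtool versions print extra fields such as `[fixed]`, which this filter does not handle); `enabled_offloads` is a hypothetical helper:

```shell
#!/bin/sh
# enabled_offloads
# Read `ethtool -k IFACE` output on stdin and print the name of every
# feature whose value is exactly "on".
enabled_offloads() {
    awk -F': ' '$2 == "on" { print $1 }'
}
```

It would be used as `sudo ethtool -k eth0 | enabled_offloads`.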

Johan Christiansen (johandc) wrote :

I can confirm that turning offload parameters to "off" solves the issue. What might be wrong here?

Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → Medium
status: New → Triaged
Sergio Zanchetta (primes2h) wrote :

The 18 month support period for Edgy Eft 6.10 has reached its end of life. As a result, we are closing the linux-source-2.6.17 Edgy Eft kernel task. However, please note that this report will remain open against the actively developed kernel. Thank you for your continued support and help as we debug this issue.

Changed in linux-source-2.6.17:
status: Confirmed → Invalid

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Johan Christiansen (johandc) wrote :

This bug is still present in Intrepid Ibex.

running:
sudo ethtool -K eth0 rx off tx off

fixes the problem.

I am running Ubuntu 8.04 here at the moment. And I have this "Corrupted MAC on input" issue here too when using ssh. I even had it when ssh'ing to the same machine. I did

> ssh -X otheruser@localhost

and the ssh connection over the loop back device also got disconnected due to the MAC issue. Hope this helps to pin down the source of the problem.

henrikkirk (henrik-busywait) wrote :

Running sudo ethtool -K eth0 rx off tx off only gives me an error:

henrik@qui-gon:~$ sudo ethtool -K eth0 rx off tx off
Cannot set device rx csum settings: Operation not supported

I'm not sure what this does exactly, so I'm sorry I can't give any more details.

Upgraded today:
henrik@qui-gon:~$ ssh -V
OpenSSH_5.1p1 Debian-3ubuntu1, OpenSSL 0.9.8g 19 Oct 2007
henrik@qui-gon:~$ uname -r
2.6.27-7-generic

It still gives the same problems as recorded above. When doing the transfer to localhost instead of a different machine, it works nice and smooth.

henrik@qui-gon:~$ scp -rv local_music/* obi:/home/henrik/torrent/files/
Executing: program /usr/bin/ssh host obi, user (unspecified), command scp -v -r -d -t /home/henrik/torrent/files/
OpenSSH_5.1p1 Debian-3ubuntu1, OpenSSL 0.9.8g 19 Oct 2007
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to obi [XX.XXX.XX.XXX] port 22.
debug1: Connection established.
debug1: identity file /home/henrik/.ssh/identity type -1
debug1: identity file /home/henrik/.ssh/id_rsa type -1
debug1: identity file /home/henrik/.ssh/id_dsa type 2
debug1: Checking blacklist file /usr/share/ssh/blacklist.DSA-1024
debug1: Checking blacklist file /etc/ssh/blacklist.DSA-1024
debug1: Remote protocol version 2.0, remote software version OpenSSH_4.3p2 Debian-9etch3
debug1: match: OpenSSH_4.3p2 Debian-9etch3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.1p1 Debian-3ubuntu1
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Host 'obi' is known and matches the RSA host key.
debug1: Found key in /home/henrik/.ssh/known_hosts:14
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Trying private key: /home/henrik/.ssh/identity
debug1: Trying private key: /home/henrik/.ssh/id_rsa
debug1: Offering public key: /home/henrik/.ssh/id_dsa
debug1: Server accepts key: pkalg ssh-dss blen 435
debug1: read PEM private key done: type DSA
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Requesting <email address hidden>
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env LANG = en_DK.utf8
debug1: Sending command: scp -v -r -d -t /home/henrik/torrent/files/
Sending file modes: C0644 188109766 Beatnik Beats 11.15.08.mp3
Sink: C0644 188109766 Beatnik Beats 11.15.08.mp3
Beatnik Beats 11.15.08.mp3 7% 13MB 851.8KB/s 03:19 ETA
Received disconnect from XX.XXX.XX.XXX: 2: Corrupted MAC on input.
lost connection

Hope this helps.

Best regards
/Henrik Kirk

Brian C (brianwc) wrote :

I get this problem when running rdiff-backup (which uses ssh) between two machines both running Debian Lenny, with kernel 2.6.26-1-amd64 #1 SMP Wed Nov 26 18:26:02 UTC 2008 x86_64 GNU/Linux. The server machine has a Macronix ethernet device using the tulip driver. I also can solve the problem by doing ethtool -K eth0 rx off tx off although I don't know what that does and whether I should be worried about turning that off.

Anyway, whatever the larger issue is, it occurs on both Ubuntu and Debian.

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Vladimír Lapáček (vil) wrote :

I get the same problem: "Corrupted MAC on input" when running on Intrepid on Lenovo Ideapad S10e connected via ethernet.

tuxo (beat-fasel) wrote :

I have the same problem on Jaunty Jackalope Beta 9.04 with the following network card:
Ethernet controller: Attansic Technology Corp. L1e Gigabit Ethernet Adapter (rev b0).

I stumbled upon this error while doing a large file transfer using scp.

mindfuck (mindfuck) wrote :

I can also confirm this bug when running Intrepid on a Lenovo Ideapad S10e and trying to move files to another computer via ssh over the ethernet interface. The command "sudo ethtool -K eth0 rx off tx off" fixes the problem.

Great finding. I can confirm that the command from gloawu fixes the problem.
My curiosity drives me to ask how did you find this out?

Is there possibly anything that we can do to get this fixed in the upstream?

Thanks.

On Tue, Apr 7, 2009 at 7:15 PM, gloawu <email address hidden> wrote:

> I can also confirm this bug when running Intrepid on an Lenovo Ideapad
> S10e and trying to move files to another computer via ssh over the
> ethernet interface. The command "sudo ethtool -K eth0 rx off tx off"
> fixes the problem.
>
> --
> Large file transfer gives error: Corrupted MAC on input
> https://bugs.launchpad.net/bugs/60764

mindfuck (mindfuck) wrote :

Unfortunately I didn't figure it out myself. The command was posted here by Johan Christiansen on 2008-09-08 . Maybe he can go into more detail on this.
You're welcome

Johan Christiansen (johandc) wrote :

Yes, this is indeed a very embarrassing driver problem, where the hardware TCP offloading in the driver seems to corrupt frames that carry SSH traffic. The bug was reported over 2½ years ago, and nobody seems to have what it takes to get it fixed, or the guts to increase the priority so it will reach the right people.

This affects many PCs; I therefore think that TCP offloading should be disabled by default until the bug gets fixed in the kernel.

Manoj Iyer (manjo) on 2009-04-21
tags: added: ct-rev
Dan Kegel (dank) wrote :

I'm seeing this in Jaunty on a Lenovo laptop. lspci says
02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5906M Fast Ethernet PCI Express (rev 02)
The symptoms are so severe that running google-chrome over ssh dies within a few seconds.
Happily, the workaround "sudo ethtool -K eth0 rx off tx off" seems to work.

I solved the problem by replacing the mini PCIe card in my Acer AspireOne with an Intel 3945ABG. Yesterday, for the first time and _without_ errors, I transferred a 1.7 GB file. I'm starting to think that this bug is being left open intentionally. I agree with Johan Christiansen: two and a half years is enough time to solve a problem that does not affect the Window$ environment... mmmhhh, strange ;-)
I tried to apply the latest workaround (TCP offloading), but without success. My old card was an Atheros 5007.

SunBlade (septimus-severus) wrote :

I am experiencing the same problem with one of my systems.
I am using a rather old laptop (Pentium MMX-233) with Debian Lenny as a download server.
Originally this laptop only had a 10 Mbit D-Link PCMCIA NIC. With this card there were no problems at all.
Since I replaced it with a (rather slow) Realtek 100 Mbit NIC, the problem arises sporadically, but seldom, and only during transfers of the downloaded data to my bigger machines at full speed. It occurs during SSH transfers as well as during NFS transfers. While downloading from the Internet at speeds of at most 200 kB/s, the problem never shows.

Usually I start the ssh client on the laptop. For NFS the laptop exports its inbound directories to the other machines. This bug can be reproduced by transferring data to any of my other machines (2 PCs with Debian Lenny/Squeeze and a few Suns).

When I tried to replace the Realtek NIC with a really fast 3Com 3CXFE575, the problem got worse.
I could still log in remotely, but transfers would fail.

I suspect a buffering problem, i.e. a ring-buffer overflow in the kernel code.

burianek (burianek) wrote :

I made this automatic workaround for my friend with a Lenovo S10. It's based on altering the eth interface with ethtool.

Place this script in
   /etc/network/if-up.d/broadfix
Make it executable
   sudo chmod +x /etc/network/if-up.d/broadfix
Restart

---
You may want to specify which eth interface you want to alter.
eth0 and eth1 are the defaults in the condition
   ...
   if [[ "$IFACE" == eth[01] ]]; then
   ...
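
The attached script itself is not reproduced in this thread. As a rough illustration only, a hook of this kind might look like the following. It assumes that ifupdown exports the interface name in `$IFACE` (as the condition above implies) and that `ethtool` is installed; `should_fix` is a hypothetical helper name, and the body is a guess based on the ethtool workaround posted earlier, not the author's actual script:

```shell
#!/bin/sh
# Hypothetical sketch of /etc/network/if-up.d/broadfix.

# should_fix IFACE: succeed only for the interfaces to alter
# (eth0 and eth1, matching the default condition quoted above).
should_fix() {
    case "$1" in
        eth0|eth1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Do nothing unless ifupdown called us with a matching interface.
if [ -n "${IFACE:-}" ] && should_fix "$IFACE"; then
    # Disable rx/tx checksum offload. Some drivers answer
    # "Operation not supported"; ignore that so the hook never fails.
    ethtool -K "$IFACE" rx off tx off || true
fi
```

A POSIX `case` is used here instead of `[[ ... ]]` so that the sketch also runs under `/bin/sh`.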

drew einhorn (drew-einhorn) wrote :

Hmm. I'm seeing this problem with scp of large files to a jaunty box.

[ 0.000000] Linux version 2.6.28-15-generic (buildd@palmer) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #49-Ubuntu SMP Tue Aug 18 18:40:08 UTC 2009 (Ubuntu 2.6.28-15.49-generic)

Unfortunately, turning off checksumming does not work.

drew@test:~$ sudo ethtool -K eth0 tx off
Cannot set device tx csum settings: Operation not supported
drew@test:~$ sudo ethtool -K eth0 rx off
Cannot set device rx csum settings: Operation not supported
drew@test:~$

Could it be that the kernel is hardcoded to use hardware checksumming,
and turning it off is not an option?

This is still an issue in Karmic (RC). On my Lenovo S10e I'm still getting errors when copying large files to my Jaunty fileserver via the machine's LAN interface using SSH(FS). Fortunately the workaround suggested by burianek on 2009-08-19 helps in Karmic too. (BTW, a restart is not required after creating the script in /etc/network/if-up.d/; simply re-plugging the network cable will do.)

Lloyd (lloyd-reijers) wrote :

I can confirm that this issue still exists in Karmic (9.10) [actual, not RC] on a Lenovo S10e.

Fortunately for me the fix suggested by Johan Christiansen on 2008-09-08 works on this hardware.

chckcc (t-steenkamp) on 2009-11-18
Changed in linux (Ubuntu):
status: Triaged → Confirmed
Sergio Zanchetta (primes2h) wrote :

Please don't change status if you don't know what you are doing.
https://wiki.ubuntu.com/Bugs/Status
Thank you.

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Tero Jänkä (graytron) wrote :

I can confirm this "Corrupted MAC on input" bug on i386 desktop version of Ubuntu 9.10 karmic. Silent file corruption also happens when downloading files using HTTP or FTP protocols. This bug is easily reproducible.

The computer on which this bug manifests itself is an Asus Eee PC 1000HE with 2 GB of RAM and an Atheros AR8121/AR8113/AR8114 PCI-E Ethernet Controller (1969:1026). I tried upgrading the 1000HE AMI BIOS from version 0607 to 1002, but that didn't help.

- $ uname -a
Linux eeepc 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:54:29 UTC 2009 i686 GNU/Linux

- $ sudo lspci -vvvn
03:00.0 0200: 1969:1026 (rev b0)
        Subsystem: 1043:8324
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 27
        Region 0: Memory at fbfc0000 (64-bit, non-prefetchable) [size=256K]
        Region 2: I/O ports at ec00 [size=128]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: 00000000fee0300c Data: 417a
        Capabilities: [58] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag- AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 unlimited, L1 unlimited
                        ClockPM- Suprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [6c] Vital Product Data <?>
        Capabilities: [100] Advanced Error Reporting <?>
        Capabilities: [180] Device Serial Number <EDITED OUT>
        Kernel driver in use: ATL1E
        Kernel modules: atl1e

- $ sudo ethtool -i eth0
driver: ATL1E
version: 1.0.0.7-NAPI
firmware-version: L1e
bus-info: 0000:03:00.0

- $ sudo ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes: 10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes: 10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Tran...

Peter P. (peter-p-launchpad) wrote :

I am seeing this problem using scp to copy large files from a Karmic (ubuntu 64 bit, Intel DG45ID board, Q9550 CPU) system to a Jaunty 64 bit system.

Tero Jänkä (graytron) wrote :

Peter P.: If you download Linux kernel source from www.kernel.org and then check the md5sum of the downloaded file, do you get different results on the computers? If not, suspend, resume, download and check md5sum again to see if that changes anything.

Tero Jänkä (graytron) wrote :

Ignore my comment #62.

The bug is more difficult to reproduce when downloading random files from the Internet, but it does occur every once in a while. This difficulty may have something to do with the speed or bandwidth of the download, i.e. the bug is less likely to show up at low download speeds.

@Peter P.

Could you try this instead:

On machine 1: $ nc -l -p 8080 -q 10 -v -v < ubuntu-9.10-server-amd64.iso
On machine 2: $ nc <machine 1 address> 8080 > ubuntu-9.10-server-amd64.iso

And then compare the md5sums. Try also reversing the test.

Peter P. (peter-p-launchpad) wrote :

The workaround of burianek seems to work for me. I had several checksum errors, even with Ubuntu updates, before applying the workaround. It seems to work fine now. I have not performed the nc test yet.

Tero Jänkä (graytron) wrote :

Unfortunately the workaround does not work on Asus Eee PC 1000HE with an Atheros AR8121/AR8113/AR8114 PCI-E Ethernet Controller (1969:1026) and atl1e driver.

Peter P. (peter-p-launchpad) wrote :

I found out that it was a RAM issue on my machine! I noticed that the workaround of burianek resolved the "Corrupted MAC" messages; however, md5 sums of large files often mismatched after transfer! I also noticed that Ubuntu CD/DVD media verification often turned up a lot of errors on this machine but not on other machines. I made a manual RAM configuration in the BIOS setup and all problems seem to be gone.

Peter - keep pushing lots of data to be sure. I had the same problem and 'solution' on an ASUS mobo, but I found that every once in a while the Corrupted MAC error comes back on atl1e , though much more infrequently. Switching to a different gigabit network card/driver results in zero errors, so I think main memory is OK. My basic test wound up being:

1) dd if=/dev/urandom of=bigfile bs=1G count=60
2) md5sum bigfile > bigfile.md5
3) cp bigfile* /mnt/usbdrive
4) nc the file over on gigabit, like comment #63
5) on the other computer verify the md5sum (prove the USB transfer) and: cmp bigfile /mnt/usbdrive/bigfile
6) and if there's a failure, use vbindiff to find them
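
Steps 1, 2 and 5 can be wrapped in two small helpers so the check is repeatable. A scaled-down sketch, assuming GNU coreutils (`head -c`, `md5sum -c`); `make_testfile` and `verify_testfile` are hypothetical names, and the transfer itself (USB copy or nc) is left to the caller:

```shell
#!/bin/sh
# make_testfile PATH SIZE_BYTES
# Write SIZE_BYTES of random data to PATH and record its checksum in
# PATH.md5 (the manifest stores only the base name, so it stays valid
# after both files are copied to another directory or machine).
make_testfile() {
    head -c "$2" /dev/urandom > "$1" || return 1
    ( cd "$(dirname "$1")" &&
      md5sum "$(basename "$1")" > "$(basename "$1").md5" )
}

# verify_testfile PATH
# Re-check PATH against PATH.md5; exit status 0 means the data is intact.
verify_testfile() {
    ( cd "$(dirname "$1")" &&
      md5sum -c "$(basename "$1").md5" ) >/dev/null 2>&1
}
```

For example, `make_testfile bigfile 1073741824` creates a 1 GiB file on the sender, and `verify_testfile bigfile` on the receiver reports whether the copy arrived intact, provided `bigfile.md5` was transferred along with it.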

I was contacted by someone at Atheros over the summer to verify the fault but haven't heard back since sending my report.

I also have the ASUS 1000HE with this fault and no memory BIOS to tweak (also no faults on -n wireless NIC, SATA, or USB busses at full throttle).

Peter P. (peter-p-launchpad) wrote :

Bill, thanks for your advice. I have been double-checking every large transfer for several days now. I haven't discovered any md5 mismatch since I manually changed the RAM settings to the values recommended for my modules. I guess that the voltage was not set correctly. Greetings, Peter

Today I switched back to my Atheros card and transferred an ISO image with scp several times _without_ any error. I'm using the kernel 2.6.32-020632-generic from the Ubuntu repository.
Here are some technical details. I'm really positively surprised, and I'm continuing to use this wifi card in order to check for any further issues.

09:00.0 Network controller: Atheros Communications Inc. AR928X Wireless Network Adapter (PCI-Express) (rev 01)
 Subsystem: Foxconn International, Inc. Device e01f
 Flags: bus master, fast devsel, latency 0, IRQ 19
 Memory at f0000000 (64-bit, non-prefetchable) [size=64K]
 Capabilities: <access denied>
 Kernel driver in use: ath9k
 Kernel modules: ath9k

root@angelo-laptop:~# scp -rv "angelo@server:~/incoming/iso.iso" .
Executing: program /usr/bin/ssh host server, user angelo, command scp -v -r -f ~/incoming/iso.iso
OpenSSH_5.1p1 Debian-6ubuntu2, OpenSSL 0.9.8g 19 Oct 2007
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to server [xxx.xxx.xxx.xxx] port 22.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /root/.ssh/identity type -1
debug1: identity file /root/.ssh/id_rsa type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.1p1 Debian-6ubuntu2
debug1: match: OpenSSH_5.1p1 Debian-6ubuntu2 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.1p1 Debian-6ubuntu2
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Host 'server' is known and matches the RSA host key.
debug1: Found key in /root/.ssh/known_hosts:1
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Trying private key: /root/.ssh/identity
debug1: Trying private key: /root/.ssh/id_rsa
debug1: Trying private key: /root/.ssh/id_dsa
debug1: Next authentication method: password
angelo@server's password:
debug1: Authentication succeeded (password).
debug1: channel 0: new [client-session]
debug1: Requesting <email address hidden>
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env LANG = it_IT.UTF-8
debug1: Sending command: scp -v -r -f ~/incoming/iso.iso
Sending file modes: C0664 719807740 Up - Pixar Disney 2009 iTA spledido.avi
Sink: C0664 719807740 iso.iso
iso.iso 100% 686MB 2.0MB/s 05:36
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_r...


I'm still getting the same old transfer errors in Lucid Alpha 3 on my Lenovo S10e (clean install, fully updated). The broadfix workaround still does it, though.

RussNelson (nelson-crynwr) wrote :

I'm getting this error ON MY VERIZON WIRELESS NOVATEL USB760 which of course uses ppp. Natch, the broadfix doesn't change anything. Neither does typing the command using only my left hand. My expectation of either of those fixing the problem was very low.

Have never seen this problem using 9.10 nbr on the machine on my right. Always happens with a fresh 10.04 beta (updated an hour ago) install on the machine to my left. Both machines are Lenovo S10e.

Will try installing 9.10. If it fails on one but not the other .... that points at dodgy hardware.

RussNelson (nelson-crynwr) wrote :

9.10 on the S10e on my left? Corrupted MAC on input.
9.10 on the S10e on my right? Works a champ.

We've got dodgy hardware, friends.

RussNelson (nelson-crynwr) wrote :

Argh, no, the 9.10 on my right isn't working either. I'd merely assumed that since I had never seen that problem before, and I was able to fetch those files in the first place, that the corrupted mac problem wasn't present. But when I tried to transfer the files away to a known-good machine, crasho-blammo.

Same problem on an Acer Aspire One.

SamTzu (sami-mattila) wrote :

It's possible that this could be caused by faulty hardware, but I doubt it. This seems to come up with every other Ubuntu/Debian upgrade or reinstall I do. I noticed this with a fresh install of Mint 8. After an upgrade it went away. Now after trying Ubuntu 10.04 it's back. On my part it's definitely a software issue, and like Angelo I'm also beginning to believe that it's intentional. Can someone please do an audit on the people who are supposed to oversee this code?

-Angelo Corsaro wrote on 2009-07-03
 "I'm thinking that this bug is left opened intentionally. I'm agree with Johan Christiansen, two years and a half is enough to solve this problem that is not affecting the Window$ environment.. mmmhhh strange"

drew einhorn (drew-einhorn) wrote :

Also affects Karmic Server on Lenovo S10e,

     sudo ethtool -K eth0 tx off

on server was sufficient to solve this problem.

If it is impossible to get it properly fixed in the kernel,
how about an error handler that does the equivalent
of the ethtool fix, and automatically submits a bug report
that includes the hardware, kernel, etc. versions that
will give the developers a real feel for how many folks
are being affected. I am certain many of the folks affected
never find their way to launchpad. And many never
succeed in resolving the issue.

Oops. I see why my suggestion is bogus. The error
is detected on the client, but needs to be fixed on the
server.

The client could do a better job of issuing a
comprehensible error message that points the
user to a possible solution.

This is certainly not the first time I have bumped into
this problem, and it probably won't be the last.

I got the same problem in ubuntu 10.04. The server runs OpenSuse 11.2.
I used to get Corrupted MAC on input very often. After running ethtool in the server as commented above, I'm getting the errors much less often (but they still appear).
Running it on the client gives me Operation not permitted.

Server:
Intel Corporation 82545GM Gigabit Ethernet Controller (rev 04)
Client:
Atheros Communications Atheros AR8121/AR8113/AR8114 PCI-E Ethernet Controller (rev b0)

I seem to be the only one getting it with the server tx off.

hoover (uwe-schuerkamp) wrote :

I can confirm this bug is still present in Maverick (10.10), 32 bit system, copying files over scp to an eeepc 1005H running Mint 8:

I am getting this error both with rx / tx set to "off" as suggested by others, or without running "ethtool" first.

scp testfile.dat eeepc:
eeepc password:
testfile.dat 57% 191MB 11.3MB/s 00:12 ETAReceived disconnect from xxx.xxx.xxx.xxx: 2: Packet corrupt

uname -a
Linux lotus 2.6.35-22-generic-pae #35-Ubuntu SMP Sat Oct 16 22:16:51 UTC 2010 i686 GNU/Linux

hoover (uwe-schuerkamp) wrote :

Sorry, forgot to include some info on the ethernet controller:

lspci | grep eth -i
00:19.0 Ethernet controller: Intel Corporation 82567LM-3 Gigabit Network Connection (rev 02)

Kit Scuzz (kitsczud) wrote :

I'm also suffering from this bug, and I'm willing to do as much as is humanly possible to fix it in a reasonable time frame.
---------------------------------------------------------------------------
So first and foremost, I am suffering from this bug on two separate platforms: I have an Ubuntu 10.04 laptop (32bit). Information relevant to the laptop:
kit@kacertop:~$ uname -a
Linux kacertop 2.6.32-28-generic #55-Ubuntu SMP Mon Jan 10 21:21:01 UTC 2011 i686 GNU/Linux
kit@kacertop:~$ lspci | grep -i eth
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8071 PCI-E Gigabit Ethernet Controller (rev 16)

And a file server running Debian squeeze (64bit). Information relevant to the file server:
kit@AlfredTCP:~$ uname -a
Linux AlfredTCP 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux
kit@AlfredTCP:~$ lspci | grep -i eth
03:07.0 Ethernet controller: D-Link System Inc DGE-530T Gigabit Ethernet Adapter (rev 11) (rev 11)
---------------------------------------------------------------------------
As you can see, both of these NICs use Marvell chipsets, and both are gigabit. I had the exact same issue with a Realtek gigabit chipset (whose model number I have forgotten).

I've tested this under a number of conditions and, like the others here, it seems to depend on network saturation (speeds > 2Mbit). The period between these failures is highly variable, and they seem to be related more to reception than to transmission. I've tried the proposed fix of turning off TCP checksumming, which did not solve the issue.

I have replaced all of the components in the chain (the router, switches, MoCA, cables, and ethernet card) and I still have the issue, so if it is a hardware issue then it is with the wiring in my house, and it's propagating through the MoCA.

So as I've been trying to understand the issue, I've whipped up a couple tests. Netcat will transfer a whole file, but the file will regularly contain corruption (and also different areas corrupted) when transferring large volumes at high speeds. I checked using md5sums on both ends of the transfer. Rsync and scp will fail with "corrupted mac on input" or "connection reset by peer" depending on which end of the transfer you're on (the computers I listed above always see the "corrupted mac on input"). I ended up creating the following program to try and hunt down the corruption. It causes both computers to transfer blocks of data with a crc32 at the end of each packet. When the machine detects a corrupt packet it prints the contents of the packet and dies. Anyone interested in taking a look can download it here: http://www.scuzzstuff.org/temp/check_network_interface.zip
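The check described above can be sketched roughly as follows. This is not the code from the zip file linked above, only an illustration of the general idea, assuming a 4-byte big-endian CRC32 trailer on each block; the names `frame` and `check` are hypothetical.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append the CRC32 of the payload as a 4-byte big-endian trailer."""
    return payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)


def check(block: bytes) -> bool:
    """Return True if the trailer matches the CRC32 of the payload."""
    if len(block) < 4:
        return False
    payload, trailer = block[:-4], block[-4:]
    expected = struct.unpack(">I", trailer)[0]
    return expected == (zlib.crc32(payload) & 0xFFFFFFFF)


# Simulate one damaged byte inside an otherwise regular block.
block = frame(b"\xde\xad\xbe\xef" * 64)
damaged = bytearray(block)
damaged[10] ^= 0xAA
print(check(block))           # True
print(check(bytes(damaged)))  # False
```

A receiver built this way would read one block at a time from the socket, call `check`, and dump the block as hex as soon as it returns False.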

Sloshing through a packet which should only contain either 0xDEADBEEF 0xABADBABE 0xCAFEF00D or 0xDEFEC8ED I received the following:
"6d6435736d6435736d6435736d6435736d6435736d6435736d643573
6d6435736d6435736d6435736d6435736d6435736d6435736d643573
6d643573aaaa30aaaa30aaaa30aaaa306d6435736d6435736d643573
6d6435736d6435736d6435736d6435736d6435736d6435736d643573
6d6435736d6435736d643573"
I'm uncertain why I have "0x6d643573" but the corruption is obvious at the 0xaaaa
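One thing that may help when reading a dump like this: the bytes 0x6d 0x64 0x35 0x73 are the ASCII characters "md5s", so the repeating word is printable text rather than one of the four pattern words. Below is a small sketch for splitting a hex dump into 4-byte words and listing those that differ from the most common one. The helper name `odd_words` is invented for this example, and the sample input is the third line of the dump above.

```python
from collections import Counter


def odd_words(hex_dump: str, word_bytes: int = 4):
    """Split a hex dump into fixed-size words and report the outliers.

    Returns the most common word and a list of (byte offset, hex) pairs
    for every word that differs from it.
    """
    data = bytes.fromhex(hex_dump)
    words = [data[i:i + word_bytes] for i in range(0, len(data), word_bytes)]
    common, _count = Counter(words).most_common(1)[0]
    outliers = [(i * word_bytes, w.hex())
                for i, w in enumerate(words) if w != common]
    return common, outliers


line = "6d643573aaaa30aaaa30aaaa30aaaa306d6435736d6435736d643573"
common, outliers = odd_words(line)
print(common)    # b'md5s'
print(outliers)  # [(4, 'aaaa30aa'), (8, 'aa30aaaa'), (12, '30aaaa30')]
```

In this line the damage covers exactly 12 bytes (four repeats of 0xaa 0xaa 0x30) starting at offset 4, and the regular pattern resumes on a word boundary afterwards.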

If there's anyone who can help me in trying to ...

Read more...

Peter P. (peter-p-launchpad) wrote :

Just an update on my earlier posts. It was definitely a RAM issue on my machine. First I thought I fixed it by changing voltage in BIOS. But after a while the problem showed up again -- seldom but very annoying. So I replaced RAM more than half a year ago. Since then everything is fine.

My advice: consider replacing the RAM and see if the problem persists.

Kit Scuzz (kitsczud) wrote :

Well I'm definitely willing to believe this is a hardware error but at this point it is wholly bewildering to me. I'll have to try new sticks of ram though I'm reluctant to believe that is the issue as the single stick of 4gig ram which is in the machine has the correct voltages and timings which it is rated for and it has made it through 16 consecutive passes of Memtest86+ v4.20.

I decided to try and really track down this issue to no avail. I have changed NICs twice now, I have removed my cheap NVIDIA video card (which put me back on the ATI 4250 built onto the motherboard). I have also disabled the use of the "hidden cores" on my AMD64 cpu. None of this has changed the corruption I'm seeing, which is frequently 0xAA sprinkled randomly throughout the packet. Once right at the beginning 0xAAxxAAxx where xx are random (normally either 0x55 or random) (this is always the first part).

I think my next step here is going to have to be trying this in two different scenarios:
1) I will take the server to a different house and see if I get the same corruption problem (this would rule out my router/wiring as being the issue)
2) Actually get new RAM for the machine (though I'm currently too broke to afford it, it will have to wait)

I guess I'm hoping that someone can provide a method of ruling out software/kernel/driver issues. As it stands I'm trying more and more elaborate hardware-based solutions but I haven't had any good method of ruling out non-functioning software.

Romulus (launchpad-keithtyler) wrote :

BTW, On an Asus EEE PC, "ethtool -K eth0 rx off" is rejected. However, "ethtool -K eth0 tx off" will in fact turn off *both* TX and RX checksumming (and vice versa).

romulus@meatwad:~$ sudo ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
romulus@meatwad:~$ sudo ethtool -K eth0 rx off
Cannot set device rx csum settings: Operation not supported
romulus@meatwad:~$ sudo ethtool -K eth0 tx off
romulus@meatwad:~$ sudo ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
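
To compare two `ethtool -k` listings like the ones above without eyeballing them, something along these lines could work. It is a sketch that assumes the plain `name: on` / `name: off` layout shown here; `parse_offloads` and `changed` are names invented for the example, and the sample text is shortened from the output above.

```python
def parse_offloads(text: str) -> dict:
    """Parse `ethtool -k` style output into a {setting: 'on'|'off'} dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        tokens = value.split()
        # Skip the "Offload parameters for eth0:" header and blank lines.
        if sep and tokens and tokens[0] in ("on", "off"):
            settings[name.strip()] = tokens[0]
    return settings


def changed(before: dict, after: dict) -> dict:
    """Return {setting: (old, new)} for settings whose value differs."""
    return {name: (before[name], after[name])
            for name in before
            if name in after and before[name] != after[name]}


BEFORE = """Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: off
"""
AFTER = """Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
"""
print(changed(parse_offloads(BEFORE), parse_offloads(AFTER)))
```

With the two listings from this comment, only rx-checksumming and tx-checksumming show up as changed, which matches the observation that "tx off" switched both.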

Kit Scuzz (kitsczud) wrote :

So I wrote up the following application to help me debug: https://github.com/kitscuzz/n_stress (please note that the CRC32 doesn't work quite how it's meant to, but it has caught corruption pretty consistently)

I have now confirmed that it is not any of my networking equipment, or specific to my machine (which I suppose should have been obvious from the existence of this thread).

I have now seen this problem happen on two completely different machines than the three used in the original test, and over a network connection which was in a different part of the state, so that's not the issue.

The other machine which appears to have the issue is also using a completely different ethernet controller (though also a gigabit), which would seem to rule out a specific driver issue.

I still have not replaced the RAM, but it made it through 72 consecutive passes in RAM test (almost three days) so I'm fairly certain that the ram is good.

I think this is explicitly a receive error, as a web server machine running Red Hat 4.1.2 (kernel version 1.6.18) can cause the error in the affected machines, but not others. I have to confirm this by hooking one of them up to a hub or switch which has a windows machine sniffing to see if they both get the corruption.

I've attached the lspci -vvv output from all three machines involved.

Any help would be immensely appreciated, even if it was just ideas on how to get through the ~6Gb packet dump in wireshark or tcpdump.

Kit Scuzz (kitsczud) wrote :

Here is the 64bit debian server output of lspci -vvv

Kit Scuzz (kitsczud) wrote :

And last but not least here is the 32bit ubuntu machine's lspci -vvv output

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix

I just got bitten once again by this bug on my Lenovo IdeaPad S10e with a freshly installed Oneiric. The old broadfix workaround is still working, from what I can tell so far. So please re-open this bug.

Hellmark (spamtrap-hellmark) wrote :

I'm getting the same problem. Never had the problem before, but now I get it constantly when I attempt to use SFTP, and it started after I upgraded to Oneiric the other day.

Whit Blauvelt (whit-launchpad) wrote :

I realize this is closed. Just adding a few notes from my own experience since the discussion here has been useful to me. In my case the error shows up in running "ssh -Y" to a second system, and then starting "dosemu" on the remote system (which will tunnel to my desktop). The remote system has an Intel e1000e PCI-E NIC. The local system has a Marvell PCI-E card.

"sudo ethtool -K eth1 rx off tx off" on the local (Marvell) box avoids the error, but leaves me wondering if I'm just tolerating corruption. Does OpenSSH have its own error correction protocol for safety here? The remote box had had some problems that seemed associated with a different NIC (an old LinkSys tulip), but that had been a second card on a different interface, seemingly fixed by swapping in an Intel card. The remote system has tested fine in memtest86 and other stress tests. But memtest86 doesn't hit 100% of RAM. Makes me wonder if there's RAM going bad in the range assigned to PCI cards that's simply outside of the range memtest86 can see - and that's certainly outside the range that can be tested while booted into Linux.

Whit Blauvelt (whit-launchpad) wrote :

Here's helpful background: https://blogs.oracle.com/janp/entry/ssh_messages_code_bad_packet

It doesn't look to me like turning off rx and tx should in any way lessen OpenSSH's resistance to corruption, so should be safe as far as this goes. The other side of this is that it looks like somehow having rx and/or tx on is producing corruption, from OpenSSH's POV. Whether that's driver, or card, or system RAM implicated, I've no clue.
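
On the earlier question about error correction: as far as the SSH transport protocol (RFC 4253) goes, there is detection but no correction. Each packet carries a MAC computed over the packet sequence number followed by the unencrypted packet, and a mismatch ends the connection, which is the "Corrupted MAC on input" message. The sketch below illustrates that check with hmac-md5, the MAC negotiated in the debug log earlier in this thread. The key and packet contents are placeholders, and `packet_mac` and `verify` are hypothetical names, not OpenSSH functions.

```python
import hashlib
import hmac
import struct


def packet_mac(key: bytes, seq: int, packet: bytes) -> bytes:
    """MAC over the 32-bit sequence number followed by the packet bytes."""
    return hmac.new(key, struct.pack(">I", seq) + packet, hashlib.md5).digest()


def verify(key: bytes, seq: int, packet: bytes, mac: bytes) -> bool:
    """Constant-time comparison of the received MAC with the expected one."""
    return hmac.compare_digest(packet_mac(key, seq, packet), mac)


key = b"placeholder-session-key"      # assumption: a real key comes from key exchange
packet = b"example packet payload"    # assumption: stands in for one SSH packet
mac = packet_mac(key, 7, packet)

flipped = bytearray(packet)
flipped[3] ^= 0x01                    # a single flipped bit in transit
print(verify(key, 7, packet, mac))          # True
print(verify(key, 7, bytes(flipped), mac))  # False
```

Because the check sits above TCP, it fires regardless of whether checksum offload is on or off, so turning offload off does not weaken it.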

Whit Blauvelt (whit-launchpad) wrote :

Here was my problem, it looks like: A firewall rule on the remote box (which is on my LAN) which rejected traffic to IP 224.0.0.1, which is part of the Local Network Control Block. That's evidently a channel required by the two host systems in connection with rx and tx delay requests.
