Comment 83 for bug 60764

Kit Scuzz (kitsczud) wrote :

So I wrote up the following application to help me debug: https://github.com/kitscuzz/n_stress (please note that the CRC32 doesn't work quite how it's meant to, but it has caught corruption pretty consistently)
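The basic idea behind that kind of check can be sketched roughly as follows. This is only an illustration in Python, not the actual n_stress code: the header layout and the names `HEADER`, `frame`, and `verify` are made up for the example. The sender prefixes each payload with its length and CRC32, and the receiver recomputes the CRC32 and compares.

```python
import os
import struct
import zlib

# Hypothetical header layout: payload length, then CRC32 of the payload,
# both as unsigned 32-bit integers in network byte order.
HEADER = struct.Struct("!II")


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC32."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(len(payload), crc) + payload


def verify(message: bytes) -> bool:
    """Return True if the payload still matches the CRC32 in its header."""
    length, expected = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        return False  # truncated message
    return (zlib.crc32(payload) & 0xFFFFFFFF) == expected


if __name__ == "__main__":
    good = frame(os.urandom(1024))
    # Flip one bit in the payload to imitate corruption on the receive side.
    bad = bytearray(good)
    bad[HEADER.size + 10] ^= 0x01
    print("intact message ok:   ", verify(good))
    print("corrupted message ok:", verify(bytes(bad)))
```

A check like this only says that a message arrived damaged; it does not say where along the path the damage happened, which is why the packet captures are still needed.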

I have now confirmed that the problem is not caused by any of my networking equipment and is not specific to my machine (which I suppose should have been obvious from the existence of this thread).

I have now seen this problem on two machines completely different from the three used in the original test, and over a network connection in a different part of the state, so my local network is not the issue.

The other machine which appears to have the issue is also using a completely different ethernet controller (though also a gigabit one), which would seem to rule out a specific driver issue.

I still have not replaced the RAM, but it made it through 72 consecutive passes of a RAM test (almost three days), so I'm fairly certain that the RAM is good.

I think this is explicitly a receive error, as a web server machine running Red Hat 4.1.2 (kernel version 2.6.18) can cause the error in the affected machines, but not in others. I still have to confirm this by hooking one of them up to a hub or switch with a Windows machine sniffing the traffic, to see if they both get the corruption.

I've attached the lspci -vvv output from all three machines involved.

Any help would be immensely appreciated, even if it is just ideas on how to get through the ~6 GB packet dump in Wireshark or tcpdump.