Comment 47 for bug 297058

Rob Jacobson (rob104) wrote :

Please forgive the intrusion, but after reading through this entire thread, I could not help but notice how little good information there seems to be to help users interpret exceptions correctly. I thought I could offer some guidance and a few suggestions that will hopefully improve your own diagnostic efforts, raise the quality of issue reports, and thereby possibly improve the general stability of the kernel and device code. I am not an expert, but I have read a lot of syslogs and have tried to help a number of users.

An exception is just a report of something that appears unusual. It could be nothing, or it could be a symptom of something wrong. An exception handler has kicked in; it will try to report as much as it can, and may also attempt a few actions to resolve the issue if that appears warranted. The report is a sequence of lines that starts with a line containing the word "exception", continues with lines giving additional information about the issue, and ends with a line containing "EH complete" (the Error Handler is finished). If error flags were reported to the handler, a verbose version of those flags is listed. If there were SATA link errors (SErr is non-zero), they are also expanded in the lines that follow. A great resource for these is http://ata.wiki.kernel.org/index.php/Libata_error_messages.
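To make that report structure concrete, here is a rough Python sketch that groups syslog lines into exception reports, from the "exception" line through "EH complete", and pulls out the SErr value. The sample lines are hand-written in the general shape of libata messages and are illustrative only, and the names group_exception_reports and serr_value are my own, not part of any tool. Real logs vary between kernel versions, so treat the patterns as assumptions to adapt.

```python
import re

# Hand-written sample in the general shape of libata messages (illustrative only).
SAMPLE_LOG = """\
kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: ata1.00: status: { DRDY }
kernel: ata1: hard resetting link
kernel: ata1: EH complete
kernel: usb 1-1: new high-speed USB device
kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6
kernel: ata2.00: status: { DRDY ERR }
kernel: ata2.00: error: { ICRC ABRT }
kernel: ata2: EH complete
"""

# Assumed prefix format: "ataN:" for a port or "ataN.NN:" for a device on it.
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?:")


def group_exception_reports(lines):
    """Collect each run of lines from 'exception' through 'EH complete', per port."""
    open_reports = {}  # port name -> lines collected so far
    finished = []
    for line in lines:
        match = PORT_RE.search(line)
        if not match:
            continue  # lines without an ataN prefix are skipped in this sketch
        port = match.group(1)
        if "exception" in line:
            # A new exception line starts (or restarts) the report for this port.
            open_reports[port] = [line]
        elif port in open_reports:
            open_reports[port].append(line)
            if "EH complete" in line:
                finished.append(open_reports.pop(port))
    return finished


def serr_value(report):
    """Return SErr from the first line of a report as an int, or None if absent."""
    match = re.search(r"SErr (0x[0-9a-fA-F]+)", report[0])
    return int(match.group(1), 16) if match else None


if __name__ == "__main__":
    for report in group_exception_reports(SAMPLE_LOG.splitlines()):
        kind = "link errors reported" if serr_value(report) else "no link errors"
        print(f"{len(report)} lines, {kind}")
```

Grouping by port is a deliberate simplification: continuation lines that lack the ataN prefix are dropped, so this is a starting point for reading a log, not a full parser.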

So any exception is analogous to hearing an unusual noise from your car. Something may be wrong, but you need more data, and possibly an experienced mechanic, to interpret whatever symptoms you have detected. Some of the messages are exactly what they sound like. For example, "link is slow to respond", "timeout" and "frozen" just mean that a response did not occur within the normal time frame. They aren't bugs, just symptoms, an indication that something may be wrong. Analogy: your car unexpectedly feels sluggish, not responding as quickly as usual.

Unfortunately, many of the reports above show no actual errors, only symptoms of 'sluggishness' or a loss of communication. Something may very well be wrong, but it is not obvious from these reports, and there are a *lot* of very different possible causes. It could be the device itself (bad media, buggy firmware, overheating, etc.); the cabling or connections (bad cable, bad or loose connectors, loose backplane, faulty power splitter, etc.); the controller chipset, or an overheated chipset; power issues in the device, or general power issues; a misconfigured device; incompatible hardware; a buggy 'driver' module; or even bad memory.

A last tip: the single most common issue (in my experience), and the easiest to fix, is a faulty cable. If you see the word ICRC and/or BadCRC within the error handler's exception report, then replacing the cable with a good-quality one will, I believe, fix over 80% of these exceptions (perhaps over 95%). I doubt there is a more common reason for wrongly RMA'ing drives than drive exceptions that were actually caused by bad cables. The next easiest fix is to check for loose connections in the data and power cables, in any splitters, and in any backplanes.
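As a quick way to apply the cable tip, here is a small Python sketch that scans the lines of one exception report for the ICRC and BadCRC flags. The function name crc_flags and the sample lines are my own, hand-written for illustration. A hit only suggests that the cable is worth swapping first; it is not proof.

```python
import re

# Match the two CRC-related flag names as whole words.
CRC_RE = re.compile(r"\b(ICRC|BadCRC)\b")


def crc_flags(report_lines):
    """Return the set of CRC-related flags found in an exception report."""
    found = set()
    for line in report_lines:
        found.update(CRC_RE.findall(line))
    return found


if __name__ == "__main__":
    # Hand-written lines in the general shape of libata output (illustrative only).
    sample_report = [
        "ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6",
        "ata2: SError: { UnrecovData BadCRC }",
        "ata2.00: error: { ICRC ABRT }",
        "ata2: EH complete",
    ]
    flags = crc_flags(sample_report)
    if flags:
        print("CRC flags seen, try a known-good cable first:", sorted(flags))
    else:
        print("No CRC flags seen; the cause is probably something else.")
```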

I think I saw only one ICRC above, so most of the problems reported here are probably more complex, but unfortunately the reports don't contain enough information to really help. A very occasional report of "frozen" is not uncommon. However, I have noticed that these kinds of reports have generally been diminishing with more recent kernels. Developers on both sides (firmware and kernel) are constantly tweaking the communication between devices and the kernel. Occasionally a tweak introduces new issues, which are then fixed in subsequent releases (firmware and/or kernel).