Comment 126 for bug 1063354

NW (ubuntu327) wrote :

I have an old tower which I use to test multiple operating systems. Each OS lives on a separate drive in a removable tray, so the drives can be swapped as needed. Once in a while the system would hang when the BIOS was set to auto-detect the drives at every boot, or I would see an occasional failure to mount the ATA boot device when Linux was started in verbose mode--and Windows would simply freeze randomly. The problem was traced to the power connector on a drive tray: I had to extract the pins from the connector with a special tool, cut off the wires, soak the pins in contact cleaner, and solder them back on, because the crimped connection and the corrosion made it unreliable.

http://en.wikipedia.org/wiki/Molex_connector#Disk_drive_connector_.28AMP_MATE-N-LOK_1-480424-0_Power_Connector.29

http://www.molex.com/molex/products/family?key=disk_drive_power_connector&channel=PRODUCTS&chanName=family&pageTitle=Introduction

I never had a problem with these connectors before, except for the ones in the Enermax trays (which seem to be made of the cheapest materials available). Before I repaired the power connector, I encountered the read-only bug in Ubuntu. When it occurred, ALL physical volumes attached to the machine became read-only, including other hard drives and all external USB storage devices. Even new USB devices attached later were not writable. The only thing I could write to was a network share. If this happens on all affected platforms, it might give developers some idea of what to look for in the source code. I also wonder if some power management feature could be involved:

GRUB_CMDLINE_LINUX="libata.dma=0 libata.noacpi=1"
http://ubuntuforums.org/showthread.php?t=1892483
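
For anyone who wants to try those options, here is a rough sketch of how they might be applied and then verified. It assumes a stock Ubuntu GRUB 2 setup (/etc/default/grub plus update-grub); the helper name has_kernel_param is made up for illustration and is not part of any package.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel parameter appears as a whole word
# in a command-line string. With no second argument it checks the running
# kernel's /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
    esac
    return 1
}

# To apply the options (assumes GRUB 2 as shipped with Ubuntu):
#   1. put the GRUB_CMDLINE_LINUX line into /etc/default/grub
#   2. sudo update-grub
#   3. reboot, then confirm the kernel actually received them:
#        has_kernel_param libata.dma=0 && echo "libata.dma=0 is active"
```

Checking /proc/cmdline after the reboot guards against the common case where the file was edited but update-grub was never run.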

I believe this bug can be triggered by other things too, such as a system BIOS bug or AHCI setting, a drive firmware bug, defective electrolytic capacitors on an old mainboard, bad solder joints just about anywhere, or a defective (or overloaded) power supply. But in the case of SSDs it could also be a latency issue:

Why Solid-State Drives Slow Down As You Fill Them Up (Ubuntu should warn about this)
 "When filling up an empty drive, they found high write performance very early in the process and a significant drop as the write operations continued to fill up the drive... If you have a solid-state drive, you should try to avoid using more than 75% of its capacity."
http://www.howtogeek.com/165542/why-solid-state-drives-slow-down-as-you-fill-them-up/

(for general reference on dual-boot systems):
12 Things You Must Do When Running a Solid State Drive in Windows 7
http://www.maketecheasier.com/12-things-you-must-do-when-running-a-solid-state-drive-in-windows-7/

I suspect that people who experience read-only issues today were experiencing silent write retries in previous kernel versions and simply did not notice because the retry was successful. The common thread seems to be that the drive was not ready to accept writes for some reason, and the kernel did not detect this condition. I tried to simulate this by cutting power to the drive momentarily. During this time, CPU usage was very high, but it returned to normal when power was restored, and the read-only bug was not triggered.
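
When reporting the read-only condition, it may help to record exactly which volumes the kernel currently considers read-only. A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, type, comma-separated options); the function name list_ro_mounts is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print "device mountpoint" for every entry whose mount
# options include a bare "ro". Reads /proc/mounts unless another file in the
# same format is given as the first argument.
list_ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $1, $2; break }
    }' "${1:-/proc/mounts}"
}

# Example: list_ro_mounts
```

Options such as errors=remount-ro are not matched, only the actual "ro" flag. Volumes that are read-only by design (optical media, squashfs images and the like) will also show up, so the output is only meaningful when compared against a known-good boot.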

On various other platforms I have seen S.M.A.R.T. drives which are NOT defective logging an "Interface CRC error" when a 'READ DMA EXT' command was issued, due to a cable or connector fault. When the drive was moved to another system, the errors stopped. So the drive is not necessarily failing just because you see the error count going up.

I think that a S.M.A.R.T. status monitor should be included with the base installation: the S.M.A.R.T. feature is not only useful for diagnosing faults within the drive, it sometimes permits you to infer something about the quality of the power and data connections over time. If you can consistently correlate some particular S.M.A.R.T. error code with the behavior that causes the volume to turn read-only, then you may have found a way to distinguish a cable fault from a kernel or firmware bug, and the OS could use it to generate more helpful error messages. So it might be good to report which (if any) of the drive's S.M.A.R.T. counters were incremented when you experience the read-only problem.
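
One way to see which counters moved is to save the attribute table before and after an incident and compare the raw values. The sketch below assumes the ten-column table printed by "smartctl -A" from smartmontools (ID#, ATTRIBUTE_NAME, ..., RAW_VALUE) and looks only at the first word of the raw value; the function name smart_changed and the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved "smartctl -A" attribute tables and
# print every attribute whose raw value (10th column) differs.
# Output format: ID NAME old -> new
smart_changed() {
    awk '
        # Attribute rows start with a numeric ID; headers and blanks do not.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            if (FILENAME == ARGV[1]) { before[$1] = $10; next }
            if (($1 in before) && before[$1] != $10)
                print $1, $2, before[$1], "->", $10
        }
    ' "$1" "$2"
}

# Example (file names are arbitrary):
#   sudo smartctl -A /dev/sda > before.txt
#   ... wait for the read-only condition to occur ...
#   sudo smartctl -A /dev/sda > after.txt
#   smart_changed before.txt after.txt
```

Attribute IDs and names are vendor-specific. On many drives the interface CRC counter appears as attribute 199 (UDMA_CRC_Error_Count), but that should be checked against the drive's own table rather than assumed.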

I am not too familiar with the specifications, but developers might also want to investigate the possibility of using the System Management bus or Power Management bus to assist in characterizing these failures if the platform collects any useful information. For those who solved the problem by disabling NCQ: there was an NCQ drive blacklist for the Linux kernel until (I believe) 2.6.24. This implies some incompatibility with particular models.

"there are drives with firmware bugs that deliberately lie about when data has been physically written."
http://serverfault.com/questions/460864/safety-of-write-cache-on-sata-drives-with-barriers
_____

"One little-known feature of NCQ is that the host can specify whether it wants to be notified of completion when the data hits the disk's platters or when it hits the disk's buffer (on-board cache)." (Does the kernel do this correctly?)

"NCQ can negatively interfere with the operating system's I/O scheduler, actually decreasing performance; this has been observed in practice on Linux with RAID-5. There is no mechanism in NCQ for the host to specify any sort of deadlines for an I/O, like how many times a request can be ignored in favor of others. In theory, a NCQ-ed request can be delayed by the drive an arbitrary amount of time while it is serving other (possibly new) requests under I/O pressure. Since the algorithms used inside drive firmware for NCQ dispatch ordering are generally not publicly known, this introduces another level of uncertainty for hardware/firmware performance. Tests at Google around 2008 have shown that NCQ can delay an I/O for up to 1-2 seconds."

http://en.wikipedia.org/wiki/Native_Command_Queuing
_____

Test if NCQ is enabled: dmesg | grep -i ncq
Write-protect & cache status: dmesg | grep sda
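
The dmesg output can scroll out of the kernel ring buffer on a long-running system, so the NCQ lines may no longer be there. As an alternative sketch, the current queue depth can be read from sysfs; this assumes a SATA drive handled by libata, where a depth of 1 means commands are not being queued. The function name ncq_status and its optional second argument (an alternate sysfs root) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report the queue depth of a block device such as "sda".
# Assumption: for a SATA drive under libata, depth > 1 means NCQ is in use.
ncq_status() {
    dev=$1
    sysroot=${2:-/sys}
    f=$sysroot/block/$dev/device/queue_depth
    [ -r "$f" ] || { echo "$dev: no queue_depth attribute"; return 1; }
    depth=$(cat "$f")
    if [ "$depth" -gt 1 ]; then
        echo "$dev: queue depth $depth (NCQ in use)"
    else
        echo "$dev: queue depth $depth (NCQ not in use)"
    fi
}

# Example: ncq_status sda
```

To turn queuing off for a test without rebooting, writing 1 to the same file should work on most systems: echo 1 | sudo tee /sys/block/sda/device/queue_depth. The setting does not survive a reboot.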
_____

Operational theory / Educational resources:

Modern disk write caches and how they get dealt with
http://utcc.utoronto.ca/~cks/space/blog/tech/ModernDiskWriteCaches

How to force a disk write cache flush operation on Linux
http://utcc.utoronto.ca/~cks/space/blog/linux/ForceDiskFlushes