sata hard drive connection fails with link is slow to respond, please be patient (ready=0) ; SRST failed (errno=-16) ; hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Expired | Undecided | Unassigned |
Bug Description
When using a new 1 TB SATA hard drive on my NV-based mainboard (dual core, 6 GB RAM, Ubuntu 9.04 amd64), I get two types of errors:
1) after some time I get device-lost errors in dmesg, and the device stops working until reboot
2) I always see an extra device named /dev/sdd (sdc is the new hard drive) that shows up as a 2 TB device
Let me go into some details:
1) The device gets lost; dmesg then shows errors like:
[ 5764.955034] sd 5:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 5764.955039] end_request: I/O error, dev sdc, sector 413808352
(...)
[ 5872.227484] Buffer I/O error on device sdc, logical block 52804512
[ 5872.227486] lost page write due to I/O error on sdc
These errors repeat all the time, and from then on the device is dead: all I/O reports errors, hdparm -i cannot see it, and so on.
Sometimes this is preceded by:
[ 4841.804585] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 4841.804602] ata6.00: cmd 35/00:00:
[ 4841.804604] res 40/00:ff:
[ 4841.804610] ata6.00: status: { DRDY }
[ 4841.804646] ata6: soft resetting link
[ 4847.016549] ata6: link is slow to respond, please be patient (ready=0)
[ 4851.852058] ata6: SRST failed (errno=-16)
[ 4851.852074] ata6: soft resetting link
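The numbers in this exception can be decoded by hand. A minimal Python sketch, assuming libata's usual layout in which the first byte after `cmd` is the ATA command opcode and the first byte after `res` is the Status register; the two lookup tables are hand-written and cover only values relevant to this report:

```python
import errno
import os

# Hand-written, partial table of ATA command opcodes seen in this report.
ATA_OPCODES = {
    0x35: "WRITE DMA EXT",
    0xEC: "IDENTIFY DEVICE",
}

# Bits of the ATA Status register (bit 7 is the most significant).
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 3: "DRQ", 0: "ERR"}


def decode_status(status: int) -> list:
    """Names of the Status register bits that are set, highest bit first."""
    return [name for bit, name in sorted(STATUS_BITS.items(), reverse=True)
            if status & (1 << bit)]


def decode_errno(value: int) -> str:
    """Turn a negative kernel return code such as -16 into 'NAME (message)'."""
    code = abs(value)
    return f"{errno.errorcode.get(code, '?')} ({os.strerror(code)})"


print(ATA_OPCODES[0x35])    # command byte from "cmd 35/..."
print(decode_status(0x40))  # status byte from "res 40/...": matches "{ DRDY }"
print(decode_errno(-16))    # from "SRST failed (errno=-16)"
```

Read this way, the failing command was a 48-bit DMA write, and the soft reset gave up with EBUSY, presumably because the device never reported ready within libata's wait (which is also when "link is slow to respond" is printed).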
This happens after 2-10 hours of intensive hard drive use (shred, dd if=/dev/zero of=device, and so on).
It has happened 3 times so far (at least once with the ata6.00 exception above).
It seemed to occur when the computer was overheating (the sensors, I think for the CPU and mainboard, showed 80 C).
Nothing worked to fix it:
- I put big fans next to the computer; the temperature now reaches only 60-65 C instead of 80 C
- I used a different SATA cable
- I plugged it into another on-board SATA controller (3rd instead of 2nd)
Also:
- badblocks reports no problems
- sata log and sata --test long don't show any problems
- the disk stays at 35 C at most
The disk reaches speeds of up to 150 MiB/s.
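To put the 2-10 hours into perspective: at the quoted peak of 150 MiB/s, one full pass over the drive (1953525168 sectors, the count hdparm reports below) takes a little under 1.8 hours. A quick Python check, assuming 512-byte sectors; since the average rate across a whole disk is lower than its peak, real passes take longer and these figures are a lower bound on pass time:

```python
# Figures taken from this report; 512-byte logical sectors are assumed.
SECTORS = 1953525168             # "sectors =" reported by hdparm for /dev/sdc
SECTOR_SIZE = 512
PEAK_RATE = 150 * 1024 * 1024    # 150 MiB/s in bytes per second

capacity = SECTORS * SECTOR_SIZE     # total bytes on the drive
seconds = capacity / PEAK_RATE       # one full pass at the peak rate
hours = seconds / 3600
print(f"{capacity} bytes, {seconds:.0f} s = {hours:.2f} h per full pass")
print(f"2 h = {2 / hours:.1f} passes, 10 h = {10 / hours:.1f} passes")
```

So the failures arrive after, at most, roughly one to six full passes' worth of sustained writing.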
So I guess it could be a hardware issue (faulty mainboard?), but perhaps it is a driver/kernel problem?
I found similar reports of "link is slow to respond, please be patient" when installing Ubuntu, but there it seemed to happen every time, not just sometimes; as far as I remember, that was also an NV-based mainboard chipset.
2) The second problem is the extra unknown device. Technical details follow below.
AFTER THE FAILURE (after the I/O errors, DID_BAD_TARGET, etc.):
# hdparm -I /dev/sdc
/dev/sdc:
HDIO_DRIVE_
# hdparm -i /dev/sdc
/dev/sdc:
HDIO_GET_IDENTITY failed: No message of desired type
# hdparm -a /dev/sdc
/dev/sdc:
readahead = 256 (on)
# hdparm /dev/sdc
/dev/sdc:
IO_support = 0 (default)
readonly = 0 (off)
readahead = 256 (on)
geometry = 56065/255/63, sectors = 1953525168, start = 0
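The `geometry = 56065/255/63` line looks wrong for a 1 TB disk, but it is probably cosmetic. Assuming the cylinder count is derived as sectors / (heads × sectors per track) and then stored in a 16-bit field (the legacy `struct hd_geometry` in `linux/hdreg.h` declares `cylinders` as `unsigned short`), the odd value is reproduced exactly:

```python
# Values from the hdparm output for /dev/sdc above.
SECTORS = 1953525168
HEADS = 255
SECTORS_PER_TRACK = 63

# Cylinder count implied by the sector count and the 255/63 translation.
cylinders = SECTORS // (HEADS * SECTORS_PER_TRACK)
# What is left if that count is stored in a 16-bit unsigned field.
wrapped = cylinders % 2**16
print(cylinders, wrapped)
```

The real count, 121601 cylinders, does not fit into 16 bits, so this line is most likely unrelated to the failure.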
And the extra ghost device (2 TB) that appears for an unknown reason:
root@lcwood:~# hdparm -i /dev/sdd
/dev/sdd:
HDIO_GET_IDENTITY failed: Invalid argument
root@lcwood:~# hdparm -I /dev/sdd
/dev/sdd:
HDIO_DRIVE_
root@lcwood:~# hdparm /dev/sdd
/dev/sdd:
readonly = 0 (off)
readahead = 256 (on)
geometry = 65535/255/63, sectors = 4294967296, start = 0
root@lcwood:~# fdisk /dev/sdd
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xc3abd5d3.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
The number of cylinders for this disk is set to 267349.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): p
Disk /dev/sdd: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xc3abd5d3
Device Boot Start End Blocks Id System
Command (m for help): q
Yes, I do NOT have any 2 TB devices here; this one is some kind of ghost O_o.
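The ghost device's size is a suspiciously round number. A short Python check, assuming 512-byte sectors, shows that 4294967296 sectors is exactly 2^32 and that both of fdisk's figures (2199023255552 bytes, 267349 cylinders) follow from that count alone:

```python
SECTOR_SIZE = 512                # bytes per sector, as fdisk assumes above
ghost_sectors = 4294967296       # "sectors =" from hdparm /dev/sdd
real_sectors = 1953525168        # "sectors =" from hdparm /dev/sdc

# The ghost size is exactly 2**32 sectors: one more than the largest 32-bit value.
assert ghost_sectors == 2**32

ghost_bytes = ghost_sectors * SECTOR_SIZE
bytes_per_cylinder = 255 * 63 * SECTOR_SIZE   # fdisk's "16065 * 512 = 8225280"
print(ghost_bytes)                         # compare with fdisk's byte count
print(ghost_bytes // bytes_per_cylinder)   # compare with fdisk's cylinder count
print(real_sectors * SECTOR_SIZE)          # the real 1 TB drive, for contrast
```

A capacity of exactly 2^32 sectors looks more like a placeholder or overflowed 32-bit value than a real disk; the arithmetic cannot tell which part of the stack produces it, but it supports the suspicion that /dev/sdd is not backed by real hardware.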
# uname -a
Linux lcwood 2.6.28-13-generic #45-Ubuntu SMP Tue Jun 30 22:12:12 UTC 2009 x86_64 GNU/Linux
Please tell me what other tests I could run.
More info:
I ran badblocks -w and so on after the errors; everything is fine.
There are tons of these I/O errors. Every time the device fails it stops working completely, and after a reboot it works fully again.
If the device does not die in between, I can write/read the entire hard drive up to the last sector without any problems.
Therefore it is NOT a media/surface problem; it must be a SATA link/driver/etc. problem.
More examples of reports from the logs:
Aug 5 15:05:18 lcwood kernel: [75189.807205] ata3: hard resetting link
Aug 5 15:05:28 lcwood kernel: [75199.816547] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 5 15:05:33 lcwood kernel: [75204.816528] ata3.00: qc timeout (cmd 0xec)
Aug 5 15:05:33 lcwood kernel: [75204.816547] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Aug 5 15:05:33 lcwood kernel: [75204.816555] ata3: hard resetting link
Aug 5 15:05:43 lcwood kernel: [75214.820531] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 5 15:05:53 lcwood kernel: [75224.820525] ata3.00: qc timeout (cmd 0xec)
Aug 5 15:05:53 lcwood kernel: [75224.820539] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Aug 5 15:05:53 lcwood kernel: [75224.820547] ata3: limiting SATA link speed to 1.5 Gbps
Aug 5 15:05:53 lcwood kernel: [75224.820551] ata3: hard resetting link
Aug 5 15:06:03 lcwood kernel: [75234.828041] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Aug 5 15:06:33 lcwood kernel: [75264.828042] ata3.00: qc timeout (cmd 0xec)
Aug 5 15:06:33 lcwood kernel: [75264.828062] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Aug 5 15:06:33 lcwood kernel: [75264.828074] ata3.00: disabled
Aug 5 15:06:33 lcwood kernel: [75264.828215] ata3: hard resetting link
Aug 5 15:06:43 lcwood kernel: [75274.872552] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Aug 5 15:06:43 lcwood kernel: [75274.872617] ata3: EH complete
Aug 5 15:06:43 lcwood kernel: [75274.872683] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Aug 5 15:06:43 lcwood kernel: [75274.872704] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872722] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872732] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872741] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872750] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872759] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872768] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872777] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872787] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.872796] lost page write due to I/O error on sdc
Aug 5 15:06:43 lcwood kernel: [75274.873139] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Aug 5 15:06:43 lcwood kernel: [75274.873511] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Aug 5 15:06:43 lcwood kernel: [75274.873828] sd 2:0:0:0: [sdc] Result:...
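In the ata3 log the SStatus/SControl values are hexadecimal dumps of the SATA status and control registers, and `err_mask=0x4` is a libata error bitmask. A minimal Python sketch for decoding them, assuming the register layout from the SATA specification (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and libata's `AC_ERR_*` values as recalled from `include/linux/libata.h`; the tables are hand-written and partial, so check them against your kernel's headers:

```python
import re

# SATA SStatus field values (partial tables, per the SATA spec).
DET = {0: "no device", 1: "device, no Phy comm", 3: "device + Phy comm", 4: "Phy offline"}
SPD = {0: "not negotiated", 1: "1.5 Gbps (Gen1)", 2: "3.0 Gbps (Gen2)", 3: "6.0 Gbps (Gen3)"}
IPM = {0: "not present", 1: "active", 2: "partial", 6: "slumber"}
# SControl reuses the SPD and IPM nibbles as limits / restrictions.
SPD_LIMIT = {0: "no limit", 1: "limit to 1.5 Gbps", 2: "limit to 3.0 Gbps"}
IPM_LIMIT = {0: "none", 1: "partial disabled", 2: "slumber disabled", 3: "partial+slumber disabled"}

# libata AC_ERR_* bits as recalled from include/linux/libata.h (partial; verify).
AC_ERR = {0x1: "DEV", 0x2: "HSM", 0x4: "TIMEOUT", 0x8: "MEDIA", 0x10: "ATA_BUS", 0x20: "HOST_BUS"}


def nibbles(value: int) -> tuple:
    """Split a register value into (IPM bits 11:8, SPD bits 7:4, DET bits 3:0)."""
    return (value >> 8) & 0xF, (value >> 4) & 0xF, value & 0xF


def decode_sstatus(text: str) -> str:
    ipm, spd, det = nibbles(int(text, 16))  # libata prints these registers in hex
    return f"DET: {DET.get(det, det)}; SPD: {SPD.get(spd, spd)}; IPM: {IPM.get(ipm, ipm)}"


def decode_scontrol(text: str) -> str:
    ipm, spd, det = nibbles(int(text, 16))
    return f"speed: {SPD_LIMIT.get(spd, spd)}; power states: {IPM_LIMIT.get(ipm, ipm)}; DET action: {det}"


def decode_err_mask(mask: int) -> list:
    return [name for bit, name in AC_ERR.items() if mask & bit]


for line in [
    "ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
]:
    m = re.search(r"SStatus ([0-9A-Fa-f]+) SControl ([0-9A-Fa-f]+)", line)
    print(decode_sstatus(m.group(1)))
    print("  " + decode_scontrol(m.group(2)))

print(decode_err_mask(0x4))  # from "failed to IDENTIFY (I/O error, err_mask=0x4)"
```

Decoded this way, the link itself keeps coming up (device present, PHY communication established), first at 3.0 Gbps and then at 1.5 Gbps once libata applies the speed limit (SControl 310), while every IDENTIFY DEVICE command (cmd 0xec) ends with err_mask 0x4, a timeout. That combination is consistent with the drive not answering commands over a physically established link, rather than with the link failing to come up at all.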