READ/Write FPDMA QUEUED failures

Bug #986321 reported by Rami Al-Rfou' on 2012-04-20
This bug affects 13 people
Affects         Status      Importance  Assigned to  Milestone
linux (Ubuntu)  Incomplete  High

Bug Description

Our setup:
1- 3 SATA PCI Express cards (Syba PCI Express SATA II 4-Port RAID Controller Card SY-PEX40008)
2- connected to 9 backplanes (CFI-B53PM 5 Port Backplane (SiI3726)).
It is an exact replica of the configuration described at http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/.

We tried Ubuntu Server 12.04, Debian Stable, BSD7 and BSD8 with ext4 and zfs, and all of them give us READ/WRITE FPDMA QUEUED errors.
---
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Apr 20 14:10 seq
 crw-rw---T 1 root audio 116, 33 Apr 20 14:10 timer
AplayDevices: aplay: device_list:252: no soundcards found...
ApportVersion: 2.0.1-0ubuntu2
Architecture: amd64
ArecordDevices: arecord: device_list:252: no soundcards found...
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 12.04
HibernationDevice: RESUME=UUID=16877cc2-98af-4f72-b1bc-7f1ee98f6fcd
InstallationMedia: Ubuntu-Server 12.04 LTS "Precise Pangolin" - Beta amd64 (20120413)
IwConfig:
 lo no wireless extensions.

 eth1 no wireless extensions.

 eth0 no wireless extensions.
MachineType: Supermicro X8SIL
NonfreeKernelModules: zfs zcommon znvpair zavl zunicode
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-23-generic root=UUID=de87ad5f-e76f-4180-a7ce-9b29c9443b9f ro
ProcVersionSignature: Ubuntu 3.2.0-23.36-generic 3.2.14
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory /home/localadmin not ours.
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-23-generic N/A
 linux-backports-modules-3.2.0-23-generic N/A
 linux-firmware 1.79
RfKill:

SourcePackage: linux
Tags: precise precise
Uname: Linux 3.2.0-23-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

dmi.bios.date: 05/27/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1.1
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: X8SIL
dmi.board.vendor: Supermicro
dmi.board.version: 0123456789
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 24
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1.1:bd05/27/2010:svnSupermicro:pnX8SIL:pvr0123456789:rvnSupermicro:rnX8SIL:rvr0123456789:cvnSupermicro:ct24:cvr0123456789:
dmi.product.name: X8SIL
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 986321

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Medium

apport information

tags: added: apport-collected precise
description: updated
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
description: updated
Joseph Salisbury (jsalisbury) wrote :

This bug seems similar to bug 984127

Joseph Salisbury (jsalisbury) wrote :

 SError: { Handshk }

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.4 kernel [1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Also, this error can sometimes be caused by faulty cabling, so you may want to check that as well.

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc3-precise/

Joseph Salisbury (jsalisbury) wrote :

ata9.03: status: { Busy }
ata9.03: error: { ABRT }

Rami Al-Rfou' (rmyeid) wrote :

localadmin@backblaze:~$ uname -a
Linux backblaze 3.4.0-030400rc3-generic #201204152235 SMP Mon Apr 16 02:36:13 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I tested the 3.4-rc3 kernel and saw no difference, AFAIK. I have attached the syslog, though.

tags: added: kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

Can you test some earlier kernels to see if this is a new issue or a regression?

Oneiric:
https://launchpad.net/ubuntu/+source/linux/3.0.0-19.33

Natty:
https://launchpad.net/ubuntu/+source/linux/2.6.38-14.58

Look for the "Builds" section and select your arch.

Rami Al-Rfou' (rmyeid) wrote :

Disabling "Native Command Queuing" (NCQ) did not work.
http://ubuntuforums.org/showpost.php?p=9684933&postcount=12
To try the above solution, do not forget to run update-grub after editing the kernel command line. This is the result after running the update and rebooting:

localadmin@backblaze:~/mnt$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.2.0-23-generic root=UUID=de87ad5f-e76f-4180-a7ce-9b29c9443b9f ro libata.force=noncq

Apparently, disabling NCQ reduces the write rate, so the rate of errors decreases but the errors do not go away.
Disabling NCQ forces the /sys/block/sda$i/device/queue_depth value to drop from 31 to 1.
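
To confirm that the noncq option actually took effect on every disk, the queue_depth values can be read back from sysfs. The following is only a minimal sketch: the function name read_queue_depths is hypothetical, and it assumes the disks show up as sd* entries under /sys/block.

```python
from pathlib import Path


def read_queue_depths(sys_block="/sys/block"):
    """Return {device_name: queue_depth} for every sd* disk under sys_block."""
    depths = {}
    for dev in sorted(Path(sys_block).glob("sd*")):
        depth_file = dev / "device" / "queue_depth"
        try:
            depths[dev.name] = int(depth_file.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing or unreadable: skip this device.
            continue
    return depths


if __name__ == "__main__":
    for name, depth in read_queue_depths().items():
        # With libata.force=noncq the depth is expected to be 1 instead of 31.
        print(f"{name}: queue_depth={depth}")
```

The sys_block argument exists only so the function can be pointed at a copy of the directory tree for testing.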

Rami Al-Rfou' (rmyeid) wrote :

I tried the older kernels, but they did not boot.

This bug report https://bugzilla.kernel.org/show_bug.cgi?id=32682#c29 discussed three solutions:
1- libata.force=noncq
2- pcie_aspm=off (This could help you to understand what this is http://smackerelofopinion.blogspot.com/2011/03/making-sense-of-pcie-aspm.html)
3- libata.force=1.5Gbps

So far, only the third option has worked for me.
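
When testing these options one at a time, it is easy to lose track of which ones the running kernel was actually booted with. The sketch below checks a kernel command line (for example the contents of /proc/cmdline) for the three workarounds. The function name active_workarounds is hypothetical, and the parsing assumes libata.force takes a comma-separated list of values, each optionally prefixed with a port ID and a colon.

```python
def active_workarounds(cmdline):
    """Report which of the three workarounds appear in a kernel command line."""
    params = cmdline.split()
    force_values = []
    for param in params:
        if param.startswith("libata.force="):
            for item in param.split("=", 1)[1].split(","):
                # Drop an optional "PORT[.DEVICE]:" prefix such as "9.03:".
                force_values.append(item.rsplit(":", 1)[-1])
    return {
        "libata.force=noncq": "noncq" in force_values,
        "pcie_aspm=off": "pcie_aspm=off" in params,
        "libata.force=1.5Gbps": "1.5Gbps" in force_values,
    }


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        for option, enabled in active_workarounds(f.read()).items():
            print(f"{option}: {'on' if enabled else 'off'}")
```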

Changed in linux:
importance: Unknown → High
status: Unknown → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → Triaged
importance: Medium → High
PieroCampa (piero-campa) wrote :

As suggested by the developers, I'm running the beta kernel linux-image-3.4.0-030400rc these days.
Again, I encountered a seemingly endless boot.
I cannot reproduce the situation on demand; it just happens from time to time, fairly often.

I attach the dmesg: as before, the delay is related to problems with sda5!

    [ 3.829577] EXT4-fs (sda5): INFO: recovery required on readonly filesystem
    [ 3.829584] EXT4-fs (sda5): write access will be enabled during recovery
    [ 24.397335] EXT4-fs (sda5): recovery complete
    [ 24.445944] EXT4-fs (sda5): mounted filesystem with ordered data mode. Opts: (null)
    [ 165.341174] udevd[391]: starting version 173
    [ 166.441159] Adding 6254588k swap on /dev/sda6. Priority:-1 extents:1 across:6254588k
    [ 168.407688] lp: driver loaded but no devices found
    [ 170.301904] EXT4-fs (sda5): re-mounted. Opts: errors=remount-ro

PieroCampa (piero-campa) wrote :

Here is the attachment.

Sergey (sku) wrote :

The same bug with Ubuntu 12.04: I installed Ubuntu on a WDC WD1200BEVS-75UST0 SATA drive in a Dell Inspiron 1501. After installing, updating from the repos and rebooting, I have a bunch of READ FPDMA QUEUED errors in dmesg.
A bad-block test with MHDD showed nearly 50 bad blocks. I ran a full sector-by-sector erase of the surface in MHDD, and after that all the bad blocks were gone. The overall drive state reported by SMART was Good.
Then I installed Win7 on the same disk and it worked like a charm for a day or two. But I wanted Ubuntu after all, so I tried again. I formatted the drive and checked it for bad blocks: there were none! Then I installed Ubuntu, and after two reboots I got a whole bunch of READ FPDMA QUEUED errors and a never-ending boot. The drive is being erased now.

Changed in linux (Ubuntu):
status: Triaged → Confirmed
Sergey (sku) wrote :

Sorry for changing the bug status from Triaged to Confirmed; that was unintentional.

Sergey (sku) wrote :

Typical bug:
ata 1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
ata 1.00: irq_stat 0x40000008
ata 1.00: failed command: READ FPDMA QUEUED
ata 1.00: cmd 60/08:00:88:2b:8d/00:00:00:00:00/40 tag 1 ncq 4096 in
ata 1.00: res 41/40:00:89:2b:8d/00:00:00:00:00/40 Emask 0x409 (media error) <F>
ata 1.00: status: { DRDY ERR }
ata 1.00: error: { UNC }
ata 1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata 1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata 1.00: configured for UDMA/133
ata 1.00: EH complete
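
The cmd and res lines in a log like the one above can be decoded by hand to find the affected sector. The sketch below assumes the field order that libata's error handler is understood to print (first byte, then feature:nsect:lbal:lbam:lbah, then the five high-order "hob" bytes, then device), and the usual ATA status and error register bit meanings. The names decode_taskfile and flag_names are hypothetical helpers, not part of any tool.

```python
import re

# Assumed layout: XX/f:n:lbal:lbam:lbah/hob_f:hob_n:hob_lbal:hob_lbam:hob_lbah/dev
TF_RE = re.compile(
    r"^([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})$"
)

# ATA status and error register bits (subset that libata names in its output).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_taskfile(tf):
    """Split a 'cmd' or 'res' taskfile string and compute its 48-bit LBA."""
    m = TF_RE.match(tf.strip().lower())
    if m is None:
        raise ValueError(f"unrecognised taskfile string: {tf!r}")
    first = int(m.group(1), 16)  # command (cmd line) or status (res line)
    low = [int(x, 16) for x in m.group(2).split(":")]
    hob = [int(x, 16) for x in m.group(3).split(":")]
    second, _nsect, lbal, lbam, lbah = low  # feature (cmd) or error (res)
    _hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = hob
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {"first": first, "second": second, "lba": lba}


def flag_names(value, table):
    """Return the names of all bits from table that are set in value."""
    return [name for bit, name in table.items() if value & bit]


if __name__ == "__main__":
    res = decode_taskfile("41/40:00:89:2b:8d/00:00:00:00:00/40")
    print(hex(res["lba"]), flag_names(res["first"], STATUS_BITS),
          flag_names(res["second"], ERROR_BITS))
```

Applied to the res line above, this gives LBA 0x8d2b89, one sector past the start of the request at 0x8d2b88, with the DRDY and ERR status bits and the UNC error bit set, which agrees with the status and error lines the kernel printed.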

Sergey (sku) wrote :

And it seems that Ubuntu is creating them during the boot process: this time I remapped the bad blocks without erasing and tried to boot into Ubuntu, and new bad blocks appeared.

Sergey (sku) wrote :

Note: They do not appear after stress testing with DOS programs or under a freshly installed Windows; only Ubuntu has this effect.

tags: added: kernel-da-key
PieroCampa (piero-campa) wrote :

Upgraded to Ubuntu 12.04 and after a while I got yet another overly long boot with READ/WRITE FPDMA QUEUED failures.
I attach my dmesg.

$ cat /etc/lsb-release
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=12.04
    DISTRIB_CODENAME=precise
    DISTRIB_DESCRIPTION="Ubuntu 12.04 LTS"
$ uname -r
    3.0.0-19-generic

PieroCampa (piero-campa) wrote :

Here it is.

Changed in linux:
status: Confirmed → Invalid
Sergey (sku) wrote :

Just for your information: I installed Linux Mint LMDE (3.2.0-2-amd64) on that same machine and it has been working for ten days without any errors. So the problem is not the cable or the hard drive; it is clearly in software.

Alain Kalker (miki4242) wrote :

People, please read this!

First, this seems to be the gathering point for people who are affected by the read/write failures, but please note:
- the OP states clearly that s/he is running quite an impressive *ng piece of kit, and I very much doubt that it is the same hardware as that of the other people who say they are affected. Perhaps the bug description should make clear that this is meant to be a 'tracking bug' for a generic problem.
- Please, please provide at least an `lspci -vv` with your confirmation, this could be important in determining whether this thing is generic or hardware-specific.
- Check if you are using any proprietary, binary-only drivers. It can be very difficult to debug interactions between proprietary and non-proprietary drivers. Try to boot and operate your system without any proprietary (video, network, WLAN) drivers whatsoever, and report back if the problem persists or goes away, even if this may be quite an inconvenience to you.

PieroCampa (piero-campa) wrote :

After a long while without boot problems, it happened again this morning.
Unfortunately the boot was close to endless, and logging in was also very slow.

Right now the system is ok.

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.1 LTS"
$ uname -r
    3.2.0-35-generic

Attaching dmesg and lspci output!
Thanks.

Krister (thekswenson) wrote :

This happens to me a few times a day while working:

Nov 14 10:00:25 praxis kernel: [18679.709773] ata1.00: failed command: WRITE FPDMA QUEUED
Nov 14 10:00:25 praxis kernel: [18679.709777] ata1.00: cmd 61/08:68:68:88:22/00:00:27:00:00/40 tag 13 ncq 4096 out
Nov 14 10:00:25 praxis kernel: [18679.709777] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 14 10:00:25 praxis kernel: [18679.709779] ata1.00: status: { DRDY }
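
To put a number on "a few times a day", the failed NCQ commands in a syslog or dmesg capture can be counted per port and direction. This is only a rough sketch: the function name count_ncq_failures is hypothetical, and the pattern is based solely on the "failed command" lines quoted in this report.

```python
import re
from collections import Counter

# Matches e.g. "ata1.00: failed command: WRITE FPDMA QUEUED"
FAILED_RE = re.compile(
    r"(ata\d+(?:\.\d+)?): failed command: (READ|WRITE) FPDMA QUEUED"
)


def count_ncq_failures(log_text):
    """Count failed READ/WRITE FPDMA QUEUED commands per (port, direction)."""
    counts = Counter()
    for match in FAILED_RE.finditer(log_text):
        counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_ncq.py < /var/log/syslog
    for (port, direction), n in sorted(count_ncq_failures(sys.stdin.read()).items()):
        print(f"{port} {direction}: {n}")
```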

Please let me know if there is anything I can do to help.

Rami Al-Rfou', this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/ubuntu-server/daily/current/ .

If it remains an issue, could you please just make a comment to this.

If reproducible, could you also please test the latest upstream kernel available (not the daily folder) following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.13-rc5

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

no longer affects: linux (Ubuntu)
affects: linux → linux (Ubuntu)
Changed in linux (Ubuntu):
status: Invalid → Incomplete
Ken Pratt (kenpratt) wrote :

btrfs raid10 setups cause the hardware reset to happen very frequently. I have an eSATA enclosure (SansDigital) that has two eSATA ports and contains 2 banks of 4 drives, 4 per eSATA port (port multiplier). I encounter no problems when using RAID1 with btrfs across the 8 drives. However, when I use btrfs balance start -dconvert=raid10 -mconvert=raid10 to convert the set of 8 drives from RAID1 to RAID10, it all falls apart with SATA hardware resets. I am guessing that the btrfs driver sits above all this and that the error is in the SATA code and not in the btrfs code that relies on it.
